Text Extraction

Text Extraction: Aspose.Note FOSS for Python

Aspose.Note FOSS for Python exposes the full text content of every OneNote page through the RichText node. Each RichText holds both a plain-text .Text string and a .TextRuns list of individually styled TextRun segments. This page documents the available text extraction patterns.

Extract All Plain Text

The quickest way to get all text from a document is GetChildNodes(RichText), which performs a recursive depth-first traversal of the entire DOM:

from aspose.note import Document, RichText

doc = Document("MyNotes.one")
for rt in doc.GetChildNodes(RichText):
    if rt.Text:
        print(rt.Text)

Collect the text into a list and join it:

from aspose.note import Document, RichText

doc = Document("MyNotes.one")
all_text = "\n".join(
    rt.Text for rt in doc.GetChildNodes(RichText) if rt.Text
)

Extract Text by Page

Organize the extracted text by page title:

from aspose.note import Document, Page, RichText

doc = Document("MyNotes.one")
for page in doc.GetChildNodes(Page):
    title = (
        page.Title.TitleText.Text
        if page.Title and page.Title.TitleText
        else "(untitled)"
    )
    print(f"\n=== {title} ===")
    for rt in page.GetChildNodes(RichText):
        if rt.Text:
            print(rt.Text)
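
When several pages share a title, or have none, it can help to group the text in a dictionary instead of printing it straight away. The following is a minimal sketch, not part of the library: `group_text_by_title` is a hypothetical helper, and the sample pairs are stand-in data so the snippet runs without a .one file. With a real document, the `(title, rt.Text)` pairs would come from the page loop above.

```python
from collections import defaultdict


def group_text_by_title(pairs):
    """Group text blocks under their page title.

    `pairs` is an iterable of (title, text) tuples. Empty or None text is
    skipped, and a missing title falls back to "(untitled)". Pages that
    share a title end up in the same list.
    """
    grouped = defaultdict(list)
    for title, text in pairs:
        if text:
            grouped[title or "(untitled)"].append(text)
    return dict(grouped)


# Stand-in data (hypothetical); replace with pairs built from the page loop.
sample = [
    ("Meeting", "Agenda"),
    ("Meeting", ""),
    (None, "Loose note"),
    ("Meeting", "Action items"),
]
print(group_text_by_title(sample))
```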

Inspect Formatting Runs

RichText.TextRuns is a list of TextRun objects. Each run covers a contiguous range of characters with a uniform TextStyle:

from aspose.note import Document, RichText

doc = Document("MyNotes.one")
for rt in doc.GetChildNodes(RichText):
    for run in rt.TextRuns:
        style = run.Style
        parts = []
        if style.IsBold:          parts.append("bold")
        if style.IsItalic:        parts.append("italic")
        if style.IsUnderline:     parts.append("underline")
        if style.IsStrikethrough: parts.append("strikethrough")
        if style.IsSuperscript:   parts.append("superscript")
        if style.IsSubscript:     parts.append("subscript")
        if style.FontName:        parts.append(f"font={style.FontName!r}")
        if style.FontSize:        parts.append(f"size={style.FontSize}pt")
        label = ", ".join(parts) if parts else "plain"
        print(f"[{label}] {run.Text!r}")
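
One possible use of the per-run style flags is to re-emit the text with lightweight markup. The sketch below is only an illustration: `run_to_markdown` and `make_run` are hypothetical helpers, and the runs are `SimpleNamespace` stand-ins that mimic the Text, Style.IsBold and Style.IsItalic attributes used above, so the snippet runs without the library. Real TextRun objects should work the same way if they expose those attributes as shown on this page.

```python
from types import SimpleNamespace


def make_run(text, bold=False, italic=False):
    """Build a stand-in object shaped like a TextRun (hypothetical helper)."""
    return SimpleNamespace(
        Text=text,
        Style=SimpleNamespace(IsBold=bold, IsItalic=italic),
    )


def run_to_markdown(run):
    """Wrap a run's text in Markdown emphasis markers based on its style."""
    text = run.Text or ""
    core = text.strip()
    if not core:
        return text  # whitespace-only runs pass through unchanged
    if run.Style.IsBold:
        core = f"**{core}**"
    if run.Style.IsItalic:
        core = f"*{core}*"
    # Keep leading and trailing whitespace outside the markers.
    lead = text[: len(text) - len(text.lstrip())]
    trail = text[len(text.rstrip()):]
    return lead + core + trail


runs = [make_run("Hello ", bold=True), make_run("world", italic=True), make_run("!")]
print("".join(run_to_markdown(r) for r in runs))
```

Whitespace is moved outside the markers because Markdown emphasis generally does not render when the marker is adjacent to a space.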

TextStyle Property Reference

Property	Type	Description
`IsBold`	`bool`	Bold text
`IsItalic`	`bool`	Italic text
`IsUnderline`	`bool`	Underlined text
`IsStrikethrough`	`bool`	Strikethrough text
`IsSuperscript`	`bool`	Superscript
`IsSubscript`	`bool`	Subscript
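`FontName`	`str | None`	Font name
`FontSize`	`float | None`	Font size in points
`FontColor`	`int | None`	Text color
`Highlight`	`int | None`	Highlight color
`Language`	`int | None`	Language identifier
`IsHyperlink`	`bool`	Whether this run is a hyperlink
`HyperlinkAddress`	`str | None`	Hyperlink target address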

Extract Hyperlinks

Hyperlinks are stored at the TextRun level. Check Style.IsHyperlink:

from aspose.note import Document, RichText

doc = Document("MyNotes.one")
for rt in doc.GetChildNodes(RichText):
    for run in rt.TextRuns:
        if run.Style.IsHyperlink and run.Style.HyperlinkAddress:
            print(f"  {run.Text!r:40s} -> {run.Style.HyperlinkAddress}")
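
The same address often appears in more than one run, so a link report may need de-duplication. The following is a rough sketch under that assumption: `unique_links` and `make_run` are hypothetical helpers, and the runs are stand-ins carrying only the Text, Style.IsHyperlink and Style.HyperlinkAddress attributes used above.

```python
from types import SimpleNamespace


def make_run(text, address=None):
    """Build a stand-in object shaped like a TextRun (hypothetical helper)."""
    style = SimpleNamespace(IsHyperlink=address is not None, HyperlinkAddress=address)
    return SimpleNamespace(Text=text, Style=style)


def unique_links(runs):
    """Return (address, first_link_text) pairs in order of first appearance."""
    seen = {}
    for run in runs:
        style = run.Style
        if style.IsHyperlink and style.HyperlinkAddress:
            # setdefault keeps the text of the first run that used this address.
            seen.setdefault(style.HyperlinkAddress, run.Text)
    return list(seen.items())


runs = [
    make_run("Docs", "https://example.com/docs"),
    make_run(" and "),
    make_run("docs again", "https://example.com/docs"),
    make_run("Home", "https://example.com/"),
]
for address, text in unique_links(runs):
    print(f"{text!r} -> {address}")
```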

Extract Bold and Highlighted Text

Filter runs by formatting properties to isolate specific content:

from aspose.note import Document, RichText

doc = Document("MyNotes.one")
print("=== Bold segments ===")
for rt in doc.GetChildNodes(RichText):
    for run in rt.TextRuns:
        if run.Style.IsBold and run.Text.strip():
            print(f"  {run.Text.strip()!r}")

print("\n=== Highlighted segments ===")
for rt in doc.GetChildNodes(RichText):
    for run in rt.TextRuns:
        if run.Style.Highlight is not None and run.Text.strip():
            color = f"#{run.Style.Highlight & 0xFFFFFF:06X}"
            print(f"  [{color}] {run.Text.strip()!r}")

Extract Text from Title Blocks

Page titles are RichText nodes inside the Title object. A top-level GetChildNodes(RichText) call on a page does not return them unless the Title subtree is included. Access them directly:

from aspose.note import Document, Page

doc = Document("MyNotes.one")
for page in doc.GetChildNodes(Page):
    if page.Title:
        if page.Title.TitleText:
            print("Title text:", page.Title.TitleText.Text)
        if page.Title.TitleDate:
            print("Title date:", page.Title.TitleDate.Text)
        if page.Title.TitleTime:
            print("Title time:", page.Title.TitleTime.Text)

Extract Text from Tables

Table cells contain RichText descendants. Use nested GetChildNodes calls:

from aspose.note import Document, Table, TableRow, TableCell, RichText

doc = Document("MyNotes.one")
for table in doc.GetChildNodes(Table):
    for row in table.GetChildNodes(TableRow):
        row_values = []
        for cell in row.GetChildNodes(TableCell):
            # Skip empty RichText nodes so join() never receives None.
            cell_text = " ".join(
                rt.Text for rt in cell.GetChildNodes(RichText) if rt.Text
            ).strip()
            row_values.append(cell_text)
        print(row_values)
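
Once each row is a list of strings, the standard `csv` module can serialize it with correct quoting. This is a small sketch rather than a library feature: `rows_to_csv` is a hypothetical helper, and the sample rows stand in for the `row_values` lists collected in the loop above.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize a list of row lists to CSV text, quoting cells where needed."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


# Stand-in rows (hypothetical); replace with the row_values lists from a table.
rows = [
    ["Name", "Notes"],
    ["Alice", "likes a, b"],
    ["Bob", 'said "hi"'],
]
print(rows_to_csv(rows), end="")
```

Using `csv.writer` instead of `",".join(...)` keeps cells that contain commas or quotes intact.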

In-Memory Text Operations

Replace Text

RichText.Replace(old_value, new_value) replaces text in memory across all runs:

from aspose.note import Document, RichText

doc = Document("MyNotes.one")
for rt in doc.GetChildNodes(RichText):
    rt.Replace("TODO", "DONE")
# Changes are in-memory only; saving back to .one is not supported.
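
To apply several substitutions in one pass, the Replace call can be driven from a mapping. The sketch below is illustrative only: `apply_replacements` is a hypothetical helper, and `FakeRichText` is a stand-in that imitates the Text attribute and Replace(old_value, new_value) method described above so the snippet runs without the library. It assumes that Replace updates the node's Text, which is how the changed-node count is derived.

```python
class FakeRichText:
    """Stand-in with the Text attribute and Replace method used on this page."""

    def __init__(self, text):
        self.Text = text

    def Replace(self, old_value, new_value):
        self.Text = self.Text.replace(old_value, new_value)


def apply_replacements(nodes, mapping):
    """Run every (old, new) pair on every node; return how many nodes changed."""
    changed = 0
    for node in nodes:
        before = node.Text
        for old_value, new_value in mapping.items():
            node.Replace(old_value, new_value)
        if node.Text != before:
            changed += 1
    return changed


nodes = [FakeRichText("TODO: call Bob"), FakeRichText("nothing here"), FakeRichText("FIXME TODO")]
count = apply_replacements(nodes, {"TODO": "DONE", "FIXME": "FIXED"})
print(count, [n.Text for n in nodes])
```

Pairs are applied in the mapping's insertion order, so overlapping patterns should be listed with the most specific one first.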

Append a Text Run

from aspose.note import Document, RichText

doc = Document("MyNotes.one")
for rt in doc.GetChildNodes(RichText):
    rt.Append(" [reviewed]")  # appends with default style
    break  # just the first node in this example

Save Extracted Text to a File

import sys
from aspose.note import Document, RichText

if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8", errors="replace")

doc = Document("MyNotes.one")
lines = [rt.Text for rt in doc.GetChildNodes(RichText) if rt.Text]

with open("extracted.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(lines))

print(f"Extracted {len(lines)} text blocks.")

Tips

GetChildNodes(RichText) on a Document searches the whole tree, including all pages, outlines, and outline elements. Call it on a specific Page to limit the scope.
Always check rt.Text (for example, if rt.Text:) before printing, because empty RichText nodes exist in some documents.
On Windows, reconfigure sys.stdout to UTF-8 to avoid a UnicodeEncodeError when printing characters outside the system code page.
TextRun has only Text and Style fields. There are no Start/End offset properties; to locate a run's text within the parent RichText.Text, search for run.Text in rt.Text manually.
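
The manual search from the last tip could be sketched as follows. `locate_runs` is a hypothetical helper, not a library function. It assumes the runs appear in rt.Text in the same order as in rt.TextRuns, and it searches from the end of the previous match so that repeated substrings resolve to successive positions. With real objects, the call would be `locate_runs(rt.Text, [run.Text for run in rt.TextRuns])`.

```python
def locate_runs(full_text, run_texts):
    """Return a (start, end) offset pair for each run text inside full_text.

    A run that is empty or cannot be found yields None and does not move
    the search cursor.
    """
    offsets = []
    cursor = 0
    for piece in run_texts:
        start = full_text.find(piece, cursor) if piece else -1
        if start < 0:
            offsets.append(None)
            continue
        end = start + len(piece)
        offsets.append((start, end))
        cursor = end
    return offsets


# Stand-in strings (hypothetical) with a repeated substring.
text = "ab ab cd"
pieces = ["ab", " ", "ab", " cd"]
print(locate_runs(text, pieces))
```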