Text Extraction


Aspose.Note FOSS for Python exposes the full text content of every OneNote page through RichText nodes. Each RichText holds both a plain-text .Text string and a .TextRuns list of individually styled TextRun segments. This page documents all available text extraction patterns.

Extract All Plain Text

The fastest way to get all the text in a document is GetChildNodes(RichText), which performs a recursive depth-first traversal of the entire DOM:

from aspose.note import Document, RichText

doc = Document("MyNotes.one")
for rt in doc.GetChildNodes(RichText):
    if rt.Text:
        print(rt.Text)

Collect into a list and join:

from aspose.note import Document, RichText

doc = Document("MyNotes.one")
all_text = "\n".join(
    rt.Text for rt in doc.GetChildNodes(RichText) if rt.Text
)

Extract Text by Page

Organize extracted text by page title:

from aspose.note import Document, Page, RichText

doc = Document("MyNotes.one")
for page in doc.GetChildNodes(Page):
    title = (
        page.Title.TitleText.Text
        if page.Title and page.Title.TitleText
        else "(untitled)"
    )
    print(f"\n=== {title} ===")
    for rt in page.GetChildNodes(RichText):
        if rt.Text:
            print(rt.Text)
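
The per-page loop above can also collect its results into a dictionary instead of printing them. The following is a minimal sketch, not part of the library: page_title and texts_by_page are hypothetical helper names, and the Stub classes are stand-ins that exist only so the example runs without a .one file. It assumes pages and text nodes behave as shown in the loop above (GetChildNodes, Title.TitleText.Text, Text).

```python
from types import SimpleNamespace


def page_title(page, default="(untitled)"):
    """Return the page title text, or a default when any part is missing."""
    title = getattr(page, "Title", None)
    title_text = getattr(title, "TitleText", None) if title else None
    text = getattr(title_text, "Text", None) if title_text else None
    return text or default


def texts_by_page(doc, page_type, text_type):
    """Map each page title to the non-empty text blocks found on that page."""
    result = {}
    for page in doc.GetChildNodes(page_type):
        blocks = [rt.Text for rt in page.GetChildNodes(text_type) if rt.Text]
        # setdefault keeps pages that share a title from overwriting each other.
        result.setdefault(page_title(page), []).extend(blocks)
    return result


# Hypothetical stand-ins for illustration only. With the real library, call:
#     texts_by_page(Document("MyNotes.one"), Page, RichText)
class StubText:
    def __init__(self, text):
        self.Text = text


class StubPage:
    def __init__(self, title, texts):
        self.Title = SimpleNamespace(TitleText=StubText(title)) if title else None
        self._texts = [StubText(t) for t in texts]

    def GetChildNodes(self, node_type):
        return self._texts


class StubDoc:
    def __init__(self, pages):
        self._pages = pages

    def GetChildNodes(self, node_type):
        return self._pages


demo = StubDoc([
    StubPage("Groceries", ["milk", "", "eggs"]),
    StubPage(None, ["loose note"]),
])
grouped = texts_by_page(demo, StubPage, StubText)
print(grouped)
```

Passing the node types as arguments keeps the helper independent of the import, which is a design choice for this sketch rather than something the library requires.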

Inspect Formatting Runs

RichText.TextRuns is a list of TextRun objects. Each run covers a contiguous range of characters with a uniform TextStyle:

from aspose.note import Document, RichText

doc = Document("MyNotes.one")
for rt in doc.GetChildNodes(RichText):
    for run in rt.TextRuns:
        style = run.Style
        parts = []
        if style.IsBold:          parts.append("bold")
        if style.IsItalic:        parts.append("italic")
        if style.IsUnderline:     parts.append("underline")
        if style.IsStrikethrough: parts.append("strikethrough")
        if style.IsSuperscript:   parts.append("superscript")
        if style.IsSubscript:     parts.append("subscript")
        if style.FontName:      parts.append(f"font={style.FontName!r}")
        if style.FontSize:      parts.append(f"size={style.FontSize}pt")
        label = ", ".join(parts) if parts else "plain"
        print(f"[{label}] {run.Text!r}")
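
If the same labelling is needed in several places, the chain of if statements can be folded into a small helper. This is only a sketch: style_label and FLAG_LABELS are hypothetical names, and the SimpleNamespace object stands in for a real TextStyle so the example runs on its own. It reads only the attributes used in the loop above.

```python
from types import SimpleNamespace

# Boolean TextStyle flags paired with the label to show for each.
FLAG_LABELS = [
    ("IsBold", "bold"),
    ("IsItalic", "italic"),
    ("IsUnderline", "underline"),
    ("IsStrikethrough", "strikethrough"),
    ("IsSuperscript", "superscript"),
    ("IsSubscript", "subscript"),
]


def style_label(style):
    """Build a short, comma-separated description of a style object."""
    # getattr with a default tolerates stand-ins that lack some attributes.
    parts = [label for attr, label in FLAG_LABELS if getattr(style, attr, False)]
    font_name = getattr(style, "FontName", None)
    font_size = getattr(style, "FontSize", None)
    if font_name:
        parts.append(f"font={font_name!r}")
    if font_size:
        parts.append(f"size={font_size}pt")
    return ", ".join(parts) if parts else "plain"


# Stand-in style for illustration; a real run would pass run.Style instead.
demo_style = SimpleNamespace(IsBold=True, IsItalic=False, FontName="Calibri", FontSize=11.0)
print(style_label(demo_style))
```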

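TextStyle Property Reference

Property	Type	Description
`IsBold`	`bool`	Bold text
`IsItalic`	`bool`	Italic text
`IsUnderline`	`bool`	Underlined text
`IsStrikethrough`	`bool`	Strikethrough text
`IsSuperscript`	`bool`	Superscript
`IsSubscript`	`bool`	Subscript
`FontName`	`str | None`	Font name
`FontSize`	`float | None`	Font size in points
`FontColor`	`int | None`	Font color as an integer
`Highlight`	`int | None`	Highlight color as an integer
`Language`	`int | None`	Language identifier
`IsHyperlink`	`bool`	Whether this run is a hyperlink
`HyperlinkAddress`	`str | None`	Hyperlink target address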

Extract Hyperlinks

Hyperlinks are stored at the TextRun level. Check Style.IsHyperlink:

from aspose.note import Document, RichText

doc = Document("MyNotes.one")
for rt in doc.GetChildNodes(RichText):
    for run in rt.TextRuns:
        if run.Style.IsHyperlink and run.Style.HyperlinkAddress:
            print(f"  {run.Text!r:40s} -> {run.Style.HyperlinkAddress}")
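
When a link is styled in several pieces, the same address can show up in more than one run. The sketch below collects each address once, in first-seen order. collect_links and stub_run are hypothetical names introduced for illustration; the stub objects only mimic the Text, Style.IsHyperlink, and Style.HyperlinkAddress attributes used above.

```python
from types import SimpleNamespace


def collect_links(runs):
    """Return (text, address) pairs for hyperlink runs, one per distinct address."""
    seen = set()
    links = []
    for run in runs:
        style = run.Style
        address = style.HyperlinkAddress if style.IsHyperlink else None
        if address and address not in seen:
            seen.add(address)
            links.append((run.Text.strip(), address))
    return links


def stub_run(text, address=None):
    """Hypothetical stand-in for a TextRun, for illustration only."""
    style = SimpleNamespace(IsHyperlink=address is not None, HyperlinkAddress=address)
    return SimpleNamespace(Text=text, Style=style)


demo_runs = [
    stub_run("docs", "https://example.com/docs"),
    stub_run(" plain text "),
    stub_run("docs again", "https://example.com/docs"),
    stub_run("home", "https://example.com/"),
]
links = collect_links(demo_runs)
print(links)
```

With a real document, the runs argument could be fed from the nested loop above, for example by chaining rt.TextRuns over every RichText node.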

Extract Bold and Highlighted Text

Filter runs by formatting property to isolate specific content:

from aspose.note import Document, RichText

doc = Document("MyNotes.one")
print("=== Bold segments ===")
for rt in doc.GetChildNodes(RichText):
    for run in rt.TextRuns:
        if run.Style.IsBold and run.Text.strip():
            print(f"  {run.Text.strip()!r}")

print("\n=== Highlighted segments ===")
for rt in doc.GetChildNodes(RichText):
    for run in rt.TextRuns:
        if run.Style.Highlight is not None and run.Text.strip():
            color = f"#{run.Style.Highlight & 0xFFFFFF:06X}"
            print(f"  [{color}] {run.Text.strip()!r}")
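
The highlight snippet above masks the integer with 0xFFFFFF before printing it as hex. The same conversion can be pulled into helpers that also handle None. This is a sketch under one assumption not confirmed on this page: that the low 24 bits are laid out as red, green, blue, with any higher bits (for example an alpha channel) safe to discard. color_to_hex and color_to_rgb are hypothetical names.

```python
def color_to_hex(value):
    """Format the low 24 bits of an integer color as #RRGGBB, or None if unset."""
    if value is None:
        return None
    return f"#{value & 0xFFFFFF:06X}"


def color_to_rgb(value):
    """Split the low 24 bits into an (r, g, b) tuple, assuming RRGGBB order."""
    if value is None:
        return None
    return ((value >> 16) & 0xFF, (value >> 8) & 0xFF, value & 0xFF)


print(color_to_hex(0xFFFFFF00))
print(color_to_rgb(0x00FF8000))
```

The same helpers would apply to FontColor if it uses the same integer layout, which is also an assumption here.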

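Extract Text from Title Blocks

Page titles are RichText nodes that live inside the Title object. A top-level GetChildNodes(RichText) call on a page does not return them unless you include the Title subtree. Access them directly: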

from aspose.note import Document, Page

doc = Document("MyNotes.one")
for page in doc.GetChildNodes(Page):
    if page.Title:
        if page.Title.TitleText:
            print("Title text:", page.Title.TitleText.Text)
        if page.Title.TitleDate:
            print("Title date:", page.Title.TitleDate.Text)
        if page.Title.TitleTime:
            print("Title time:", page.Title.TitleTime.Text)

Extract Text from Tables

Table cells contain RichText children. Use nested GetChildNodes calls:

from aspose.note import Document, Table, TableRow, TableCell, RichText

doc = Document("MyNotes.one")
for table in doc.GetChildNodes(Table):
    for row in table.GetChildNodes(TableRow):
        row_values = []
        for cell in row.GetChildNodes(TableCell):
            cell_text = " ".join(
                rt.Text for rt in cell.GetChildNodes(RichText) if rt.Text
            ).strip()
            row_values.append(cell_text)
        print(row_values)
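
Once each row is a list of cell strings, as row_values is above, the standard csv module can serialize it. The sketch below uses hard-coded demo rows in place of a real table so it runs on its own; rows_to_csv is a hypothetical helper name.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize a list of rows (each a list of cell strings) to CSV text."""
    buffer = io.StringIO()
    # lineterminator="\n" avoids the csv module's default "\r\n" line endings.
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


# Demo rows standing in for the row_values lists collected from a table.
demo_rows = [["Item", "Qty"], ["Milk, whole", "2"]]
csv_text = rows_to_csv(demo_rows)
print(csv_text)
```

The csv writer quotes cells that contain commas, quotes, or newlines, which cell text taken from notes can easily include.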

In-Memory Text Operations

Replace Text

RichText.Replace(old_value, new_value) replaces text in memory across all runs:

from aspose.note import Document, RichText

doc = Document("MyNotes.one")
for rt in doc.GetChildNodes(RichText):
    rt.Replace("TODO", "DONE")
# Changes are in-memory only; saving back to .one is not supported
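
Several substitutions can be applied in one pass by looping over a mapping. This is a sketch only: apply_replacements is a hypothetical helper, and StubRichText is a stand-in that models Replace as a plain substring replacement on Text, which is an assumption about behavior rather than a statement about the library.

```python
def apply_replacements(nodes, replacements):
    """Call node.Replace(old, new) for each pair on each node; return node count."""
    count = 0
    for node in nodes:
        # Pairs are applied in the mapping's insertion order, so a later pair
        # sees the result of earlier ones.
        for old_value, new_value in replacements.items():
            node.Replace(old_value, new_value)
        count += 1
    return count


class StubRichText:
    """Hypothetical stand-in for RichText, for illustration only."""

    def __init__(self, text):
        self.Text = text

    def Replace(self, old_value, new_value):
        self.Text = self.Text.replace(old_value, new_value)


demo_nodes = [StubRichText("TODO: fix FIXME"), StubRichText("nothing here")]
touched = apply_replacements(demo_nodes, {"TODO": "DONE", "FIXME": "FIXED"})
print(touched, [node.Text for node in demo_nodes])
```

With a real document, the nodes argument would be doc.GetChildNodes(RichText), and the changes would remain in memory only, as noted above.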

Append a Text Block

from aspose.note import Document, RichText

doc = Document("MyNotes.one")
for rt in doc.GetChildNodes(RichText):
    rt.Append(" [reviewed]")  # appends with default style
    break  # just the first node in this example

Save Extracted Text to a File

import sys
from aspose.note import Document, RichText

if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8", errors="replace")

doc = Document("MyNotes.one")
lines = [rt.Text for rt in doc.GetChildNodes(RichText) if rt.Text]

with open("extracted.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(lines))

print(f"Extracted {len(lines)} text blocks.")

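Tips

GetChildNodes(RichText) on a Document searches the entire tree, including all pages, outlines, and outline elements. Call it on a specific Page to limit the scope.
Always check rt.Text (for example, if rt.Text:) before using it; empty RichText nodes exist in some documents.
On Windows, reconfigure sys.stdout to UTF-8 to avoid a UnicodeEncodeError when printing characters outside the system code page.
TextRun has only Text and Style fields. There are no Start/End offset properties; to locate a run's text within the parent RichText.Text, search for run.Text inside rt.Text manually.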
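
The manual search described in the last tip can be sketched with str.find and a moving cursor, so that repeated text in later runs does not match an earlier position. locate_runs is a hypothetical helper, and the SimpleNamespace objects stand in for real TextRun objects. The sketch assumes the runs appear in the same order as their text does in rt.Text, which this page does not state explicitly.

```python
from types import SimpleNamespace


def locate_runs(full_text, runs):
    """Yield (start, end, run), searching for each run's text from a cursor."""
    cursor = 0
    for run in runs:
        start = full_text.find(run.Text, cursor)
        if start == -1:
            # Run text not found after the cursor; report it without offsets.
            yield None, None, run
            continue
        end = start + len(run.Text)
        yield start, end, run
        cursor = end  # the next search begins where this run ended


# Stand-in runs for illustration; real code would pass rt.Text and rt.TextRuns.
demo_text = "Hello bold world"
demo_runs = [SimpleNamespace(Text=t) for t in ("Hello ", "bold", " world")]
offsets = [(start, end) for start, end, _ in locate_runs(demo_text, demo_runs)]
print(offsets)
```

Each (start, end) pair is a half-open slice, so full_text[start:end] gives back the run's text.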