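Text Extraction - Aspose.Note FOSS for Python

Aspose.Note FOSS for Python exposes the full text content of every OneNote page through RichText nodes. Each RichText holds both a plain-text .Text string and a .TextRuns list of individually styled TextRun segments. This page documents all available text extraction patterns.

Extract All Plain Text

The quickest way to get all text from a document is GetChildNodes(RichText), which performs a recursive depth-first traversal of the entire DOM:

from aspose.note import Document, RichText

doc = Document("MyNotes.one")
# Walk every RichText node in the document and skip empty ones
for rt in doc.GetChildNodes(RichText):
    if rt.Text:
        print(rt.Text)

To collect the text into a list and join it into a single string:

from aspose.note import Document, RichText

doc = Document("MyNotes.one")
# One line per non-empty RichText node
all_text = "\n".join(
    rt.Text for rt in doc.GetChildNodes(RichText) if rt.Text
)

Extract Text by Page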
Organize the extracted text by page title:

from aspose.note import Document, Page, RichText

doc = Document("MyNotes.one")
for page in doc.GetChildNodes(Page):
    # Fall back to a placeholder when the page has no title text
    title = (
        page.Title.TitleText.Text
        if page.Title and page.Title.TitleText
        else "(untitled)"
    )
    print(f"\n=== {title} ===")
    # Restrict the search to RichText nodes under this page
    for rt in page.GetChildNodes(RichText):
        if rt.Text:
            print(rt.Text)

Inspect Formatting Runs
RichText.TextRuns is a list of TextRun objects. Each run covers a contiguous range of characters with a uniform TextStyle:

from aspose.note import Document, RichText

doc = Document("MyNotes.one")
for rt in doc.GetChildNodes(RichText):
    for run in rt.TextRuns:
        style = run.Style
        # Collect a label for every formatting attribute that is set
        parts = []
        if style.IsBold: parts.append("bold")
        if style.IsItalic: parts.append("italic")
        if style.IsUnderline: parts.append("underline")
        if style.IsStrikethrough: parts.append("strikethrough")
        if style.IsSuperscript: parts.append("superscript")
        if style.IsSubscript: parts.append("subscript")
        if style.FontName: parts.append(f"font={style.FontName!r}")
        if style.FontSize: parts.append(f"size={style.FontSize}pt")
        label = ", ".join(parts) if parts else "plain"
        print(f"[{label}] {run.Text!r}")

TextStyle Property Reference
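| Property | Type | Description |
|---|---|---|
| IsBold | bool | Bold text |
| IsItalic | bool | Italic text |
| IsUnderline | bool | Underlined text |
| IsStrikethrough | bool | Strikethrough text |
| IsSuperscript | bool | Superscript |
| IsSubscript | bool | Subscript |
| FontName | `str \| None` | Font name |
| FontSize | `float \| None` | Font size in points |
| FontColor | `int \| None` | Font color as an integer |
| Highlight | `int \| None` | Highlight color as an integer |
| Language | `int \| None` | Language identifier |
| IsHyperlink | bool | Whether this run is a hyperlink |
| HyperlinkAddress | `str \| None` | Target address of the hyperlink |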
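The integer color properties in the table can be turned into readable hex strings. The sketch below is a hypothetical helper, not part of the library: color_to_hex is a name introduced here, and it assumes the low 24 bits of the integer hold the RGB components, which is inferred from the 0xFFFFFF mask used in the highlighted-text example later on this page rather than from a documented layout.

```python
from typing import Optional


def color_to_hex(value: Optional[int]) -> Optional[str]:
    """Format an integer color as '#RRGGBB', or return None when unset."""
    if value is None:
        # FontColor and Highlight are None when no color is set
        return None
    # Assumption: the low 24 bits carry RGB; any higher bits are discarded
    return f"#{value & 0xFFFFFF:06X}"


if __name__ == "__main__":
    print(color_to_hex(0xFFFF00))
    print(color_to_hex(None))
```

With a loaded document, the helper could be applied as color_to_hex(run.Style.FontColor) or color_to_hex(run.Style.Highlight) for each run.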
Extract Hyperlinks

Hyperlinks are stored at the TextRun level. Check Style.IsHyperlink:

from aspose.note import Document, RichText

doc = Document("MyNotes.one")
for rt in doc.GetChildNodes(RichText):
    for run in rt.TextRuns:
        # Only report runs that are links and actually carry an address
        if run.Style.IsHyperlink and run.Style.HyperlinkAddress:
            print(f" {run.Text!r:40s} -> {run.Style.HyperlinkAddress}")

Extract Bold and Highlighted Text
Filter runs by their formatting properties to isolate specific content:

from aspose.note import Document, RichText

doc = Document("MyNotes.one")

print("=== Bold segments ===")
for rt in doc.GetChildNodes(RichText):
    for run in rt.TextRuns:
        # Skip runs that are bold but contain only whitespace
        if run.Style.IsBold and run.Text.strip():
            print(f" {run.Text.strip()!r}")

print("\n=== Highlighted segments ===")
for rt in doc.GetChildNodes(RichText):
    for run in rt.TextRuns:
        if run.Style.Highlight is not None and run.Text.strip():
            # Mask to the low 24 bits and format as a hex color
            color = f"#{run.Style.Highlight & 0xFFFFFF:06X}"
            print(f" [{color}] {run.Text.strip()!r}")

Extract Text from Title Blocks
Page titles are RichText nodes that live inside the Title object. A top-level GetChildNodes(RichText) call on a page does not return them unless the Title subtree is included. Access them directly:

from aspose.note import Document, Page

doc = Document("MyNotes.one")
for page in doc.GetChildNodes(Page):
    if page.Title:
        # Each part of the title block is optional, so check before reading
        if page.Title.TitleText:
            print("Title text:", page.Title.TitleText.Text)
        if page.Title.TitleDate:
            print("Title date:", page.Title.TitleDate.Text)
        if page.Title.TitleTime:
            print("Title time:", page.Title.TitleTime.Text)

Extract Text from Tables
Table cells contain RichText children. Use nested GetChildNodes calls:

from aspose.note import Document, Table, TableRow, TableCell, RichText

doc = Document("MyNotes.one")
for table in doc.GetChildNodes(Table):
    for row in table.GetChildNodes(TableRow):
        row_values = []
        for cell in row.GetChildNodes(TableCell):
            # Join all text blocks in the cell into one string
            cell_text = " ".join(
                rt.Text for rt in cell.GetChildNodes(RichText)
            ).strip()
            row_values.append(cell_text)
        print(row_values)

In-Memory Text Operations
Replace Text

RichText.Replace(old_value, new_value) replaces text in memory across all runs:

from aspose.note import Document, RichText

doc = Document("MyNotes.one")
for rt in doc.GetChildNodes(RichText):
    rt.Replace("TODO", "DONE")
# Changes are in-memory only; saving back to .one is not supported

Append a Text Block

from aspose.note import Document, RichText

doc = Document("MyNotes.one")
for rt in doc.GetChildNodes(RichText):
    rt.Append(" [reviewed]")  # appends with default style
    break  # just the first node in this example

Save Extracted Text to a File

import sys
from aspose.note import Document, RichText

# Make console output UTF-8 safe where the stream supports reconfiguring
if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8", errors="replace")

doc = Document("MyNotes.one")
lines = [rt.Text for rt in doc.GetChildNodes(RichText) if rt.Text]

# Write one text block per line, always as UTF-8
with open("extracted.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(lines))
print(f"Extracted {len(lines)} text blocks.")

Tips
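- GetChildNodes(RichText) on a Document searches the entire tree, including all pages, outlines, and outline elements. Call it on a specific Page to limit the scope.
- Always check rt.Text (for example with if rt.Text:), because empty RichText nodes exist in some documents.
- On Windows, reconfigure sys.stdout to UTF-8 to avoid a UnicodeEncodeError when printing characters outside the system code page.
- TextRun has only Text and Style fields. There are no Start/End offset properties; to locate a run's text within the parent RichText.Text, search for run.Text inside rt.Text manually.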
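The manual search described in the last tip can be sketched as follows. This is a minimal illustration, not library functionality: locate_runs is a hypothetical helper name, and it assumes the runs appear in rt.Text in the same order as in rt.TextRuns, which this page does not state explicitly. It works on plain strings, so it can be tried without a document.

```python
from typing import List, Tuple


def locate_runs(full_text: str, run_texts: List[str]) -> List[Tuple[int, int]]:
    """Return (start, end) offsets of each run's text inside full_text.

    A run that cannot be found is reported as (-1, -1).
    """
    offsets = []
    cursor = 0  # search resumes after the previous match
    for text in run_texts:
        start = full_text.find(text, cursor)
        if start == -1:
            offsets.append((-1, -1))
            continue
        end = start + len(text)
        offsets.append((start, end))
        # Assumption: runs are in document order, so move the cursor forward
        cursor = end
    return offsets


if __name__ == "__main__":
    print(locate_runs("Hello bold world", ["Hello ", "bold", " world"]))
```

For a real node, the call would look like locate_runs(rt.Text, [run.Text for run in rt.TextRuns]). Advancing the cursor after each match keeps repeated substrings from all resolving to the first occurrence.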