Text Extraction — Aspose.Note FOSS for Python

Aspose.Note FOSS for Python exposes the full text content of every OneNote page through the RichText node. Each RichText holds both a plain-text .Text string and a .Runs list of individually styled TextRun segments. This page walks through the text extraction patterns the library supports.


Extract All Plain Text

The most direct way to get all text from a document is GetChildNodes(RichText), which performs a recursive depth-first traversal across the entire DOM and returns every RichText node it finds:

from aspose.note import Document, RichText

doc = Document("MyNotes.one")
for rt in doc.GetChildNodes(RichText):
    if rt.Text:
        print(rt.Text)

To collect everything into a single string, join the non-empty text blocks:

from aspose.note import Document, RichText

doc = Document("MyNotes.one")
all_text = "\n".join(
    rt.Text for rt in doc.GetChildNodes(RichText) if rt.Text
)

Extract Text Per Page

Organize extracted text by page title:

from aspose.note import Document, Page, RichText

doc = Document("MyNotes.one")
for page in doc.GetChildNodes(Page):
    title = (
        page.Title.TitleText.Text
        if page.Title and page.Title.TitleText
        else "(untitled)"
    )
    print(f"\n=== {title} ===")
    for rt in page.GetChildNodes(RichText):
        if rt.Text:
            print(rt.Text)

Inspect Formatting Runs

RichText.Runs is a list of TextRun objects. Each run covers a contiguous range of characters with a uniform TextStyle:

from aspose.note import Document, RichText

doc = Document("MyNotes.one")
for rt in doc.GetChildNodes(RichText):
    for run in rt.Runs:
        style = run.Style
        parts = []
        if style.Bold:          parts.append("bold")
        if style.Italic:        parts.append("italic")
        if style.Underline:     parts.append("underline")
        if style.Strikethrough: parts.append("strikethrough")
        if style.Superscript:   parts.append("superscript")
        if style.Subscript:     parts.append("subscript")
        if style.FontName:      parts.append(f"font={style.FontName!r}")
        if style.FontSize:      parts.append(f"size={style.FontSize}pt")
        label = ", ".join(parts) if parts else "plain"
        print(f"[{label}] {run.Text!r}")

TextStyle Property Reference

Property          Type          Description
Bold              bool          Bold text
Italic            bool          Italic text
Underline         bool          Underlined text
Strikethrough     bool          Strikethrough text
Superscript       bool          Superscript
Subscript         bool          Subscript
FontName          str | None    Font family name
FontSize          float | None  Font size in points
FontColor         int | None    Font color as ARGB integer
HighlightColor    int | None    Background highlight color as ARGB integer
LanguageId        int | None    Language identifier (LCID)
IsHyperlink       bool          Whether this run is a hyperlink
HyperlinkAddress  str | None    URL when IsHyperlink is True
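
FontColor and HighlightColor are plain integers, so they usually need decoding before display. The sketch below assumes the conventional 0xAARRGGBB layout (alpha in the highest byte); this page only says "ARGB integer", so treat that layout as an assumption. The names split_argb and argb_to_hex are hypothetical helpers, not library functions.

```python
def split_argb(value):
    """Split an ARGB integer into (alpha, red, green, blue) bytes.

    Assumes the 0xAARRGGBB layout, which is an assumption of this sketch.
    """
    value &= 0xFFFFFFFF  # tolerate values stored as signed 32-bit integers
    return (
        (value >> 24) & 0xFF,
        (value >> 16) & 0xFF,
        (value >> 8) & 0xFF,
        value & 0xFF,
    )


def argb_to_hex(value):
    """Return '#RRGGBB' for an ARGB integer, or None when no color is set."""
    if value is None:
        return None
    _, red, green, blue = split_argb(value)
    return f"#{red:02X}{green:02X}{blue:02X}"


print(split_argb(0xFF336699))
print(argb_to_hex(0xFFFFFF00))
```

Both helpers drop or isolate the alpha byte explicitly, so a fully opaque and a semi-transparent yellow map to the same hex string.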

Extract Hyperlinks

Hyperlinks are stored at the TextRun level. Check Style.IsHyperlink:

from aspose.note import Document, RichText

doc = Document("MyNotes.one")
for rt in doc.GetChildNodes(RichText):
    for run in rt.Runs:
        if run.Style.IsHyperlink and run.Style.HyperlinkAddress:
            print(f"  {run.Text!r:40s} -> {run.Style.HyperlinkAddress}")
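
When the same URL appears several times, it can help to group link texts by address. This is a small sketch under stated assumptions: collect_links is a hypothetical helper, the example.com addresses are placeholders, and the SimpleNamespace objects stand in for TextRun objects so that the code runs without a document. It reads only Text, Style.IsHyperlink and Style.HyperlinkAddress.

```python
from types import SimpleNamespace


def collect_links(runs):
    """Map each hyperlink address to the list of run texts that point to it."""
    links = {}
    for run in runs:
        style = run.Style
        if style.IsHyperlink and style.HyperlinkAddress:
            links.setdefault(style.HyperlinkAddress, []).append(run.Text)
    return links


def _fake_run(text, address=None):
    # Stand-in for a TextRun, used only for this demonstration.
    style = SimpleNamespace(IsHyperlink=address is not None, HyperlinkAddress=address)
    return SimpleNamespace(Text=text, Style=style)


demo_runs = [
    _fake_run("See the "),
    _fake_run("docs", "https://example.com/docs"),
    _fake_run(" or the "),
    _fake_run("API reference", "https://example.com/api"),
    _fake_run(" ("),
    _fake_run("mirror", "https://example.com/docs"),
    _fake_run(")."),
]
links = collect_links(demo_runs)
for address, texts in links.items():
    print(address, texts)
```

For a real document, the runs could be supplied with a generator such as (run for rt in doc.GetChildNodes(RichText) for run in rt.Runs).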

Extract Bold and Highlighted Text

Filter runs by formatting properties to isolate specific content:

from aspose.note import Document, RichText

doc = Document("MyNotes.one")
print("=== Bold segments ===")
for rt in doc.GetChildNodes(RichText):
    for run in rt.Runs:
        if run.Style.Bold and run.Text.strip():
            print(f"  {run.Text.strip()!r}")

print("\n=== Highlighted segments ===")
for rt in doc.GetChildNodes(RichText):
    for run in rt.Runs:
        if run.Style.HighlightColor is not None and run.Text.strip():
            color = f"#{run.Style.HighlightColor & 0xFFFFFF:06X}"
            print(f"  [{color}] {run.Text.strip()!r}")

Extract Text from Title Blocks

Page titles are RichText nodes inside the Title object. They are not returned by a top-level GetChildNodes(RichText) on the page unless you include the Title subtree. Access them directly:

from aspose.note import Document, Page

doc = Document("MyNotes.one")
for page in doc.GetChildNodes(Page):
    if page.Title:
        if page.Title.TitleText:
            print("Title text:", page.Title.TitleText.Text)
        if page.Title.TitleDate:
            print("Title date:", page.Title.TitleDate.Text)
        if page.Title.TitleTime:
            print("Title time:", page.Title.TitleTime.Text)

Extract Text from Tables

Table cells contain RichText children. Use nested GetChildNodes calls:

from aspose.note import Document, Table, TableRow, TableCell, RichText

doc = Document("MyNotes.one")
for table in doc.GetChildNodes(Table):
    for row in table.GetChildNodes(TableRow):
        row_values = []
        for cell in row.GetChildNodes(TableCell):
            cell_text = " ".join(
                rt.Text for rt in cell.GetChildNodes(RichText) if rt.Text
            ).strip()
            row_values.append(cell_text)
        print(row_values)
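
Row lists like row_values map naturally onto CSV. The sketch below uses only the standard csv module; rows_to_csv is a hypothetical helper and the sample rows are invented for illustration.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize a list of row lists to CSV text, quoting fields as needed."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


# Invented sample data; in practice, collect each row_values list here.
rows = [
    ["Name", "Qty"],
    ["Widget, large", "2"],
]
csv_text = rows_to_csv(rows)
print(csv_text, end="")
```

Letting the csv module handle quoting avoids broken columns when a cell contains a comma or a quote character.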

In-Memory Text Operations

Replace text

RichText.Replace(old_value, new_value) substitutes text in-memory across all runs:

from aspose.note import Document, RichText

doc = Document("MyNotes.one")
for rt in doc.GetChildNodes(RichText):
    rt.Replace("TODO", "DONE")
# Changes are in-memory only; saving back to .one is not supported.
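
Before replacing, it can be useful to know how many matches exist. The following dry-run sketch counts occurrences in plain strings; count_occurrences is a hypothetical helper, and the sample list stands in for the .Text values of real RichText nodes.

```python
def count_occurrences(texts, needle):
    """Count non-overlapping occurrences of needle across text blocks."""
    if not needle:
        raise ValueError("needle must be a non-empty string")
    # Skip None and empty strings, which empty nodes may produce.
    return sum(text.count(needle) for text in texts if text)


sample_texts = ["TODO: write tests", None, "", "TODO TODO review", "done"]
total = count_occurrences(sample_texts, "TODO")
print(total)
```

With a document loaded, the texts argument could be (rt.Text for rt in doc.GetChildNodes(RichText)).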

Append a text run

from aspose.note import Document, RichText

doc = Document("MyNotes.one")
for rt in doc.GetChildNodes(RichText):
    rt.Append(" [reviewed]")  # appends with default style
    break  # just the first node in this example

Save Extracted Text to File

import sys
from aspose.note import Document, RichText

if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8", errors="replace")

doc = Document("MyNotes.one")
lines = [rt.Text for rt in doc.GetChildNodes(RichText) if rt.Text]

with open("extracted.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(lines))

print(f"Extracted {len(lines)} text blocks.")

Tips

  • GetChildNodes(RichText) on a Document searches the entire tree including all pages, outlines, and outline elements. Call it on a specific Page to limit scope.
  • Always check rt.Text (for example, with if rt.Text:) before printing, because some documents contain empty RichText nodes.
  • On Windows, reconfigure sys.stdout to UTF-8 to avoid UnicodeEncodeError when printing characters outside the system code page.
  • TextRun.Start and TextRun.End indicate character offsets within the parent RichText.Text string and can be used to map formatted runs to the full text string.
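
The offsets mentioned in the last tip can be sanity-checked before relying on them. This page does not state whether End is inclusive or exclusive, so the sketch below assumes Python slice semantics (Start inclusive, End exclusive) and reports the runs that do not fit. mismatched_runs is a hypothetical helper, and the SimpleNamespace objects stand in for real TextRun objects.

```python
from types import SimpleNamespace


def mismatched_runs(text, runs):
    """Return the runs whose Text differs from text[Start:End].

    Assumes Start is inclusive and End is exclusive. An empty result
    supports that assumption for the given data; a non-empty result
    suggests the offsets follow another convention.
    """
    return [run for run in runs if text[run.Start:run.End] != run.Text]


full_text = "Hello world"
demo_runs = [
    SimpleNamespace(Text="Hello ", Start=0, End=6),
    SimpleNamespace(Text="world", Start=6, End=11),
]
bad = mismatched_runs(full_text, demo_runs)
print(len(bad))
```

On a real node the call would presumably be mismatched_runs(rt.Text, rt.Runs); running it once on a sample document shows which convention applies.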

See Also