Text Extraction

Text Extraction: Aspose.Note FOSS for Python

Aspose.Note FOSS for Python exposes the full text content of every OneNote page through the RichText node. Each RichText holds both a plain-text .Text string and .TextRuns, a list of individually styled TextRun segments. This page documents every available text extraction pattern.

Extracting All Plain Text

The quickest way to get all text from a document is GetChildNodes(RichText), which performs a recursive depth-first traversal of the entire DOM:

from aspose.note import Document, RichText

doc = Document("MyNotes.one")
for rt in doc.GetChildNodes(RichText):
    if rt.Text:
        print(rt.Text)

To gather everything into a single string, collect the text and join it:

from aspose.note import Document, RichText

doc = Document("MyNotes.one")
all_text = "\n".join(
    rt.Text for rt in doc.GetChildNodes(RichText) if rt.Text
)

Extracting Text Page by Page

Group the extracted text by page title:

from aspose.note import Document, Page, RichText

doc = Document("MyNotes.one")
for page in doc.GetChildNodes(Page):
    title = (
        page.Title.TitleText.Text
        if page.Title and page.Title.TitleText
        else "(untitled)"
    )
    print(f"\n=== {title} ===")
    for rt in page.GetChildNodes(RichText):
        if rt.Text:
            print(rt.Text)

Inspecting Formatting Runs

RichText.TextRuns is a list of TextRun objects. Each run carries a single, uniform TextStyle:

from aspose.note import Document, RichText

doc = Document("MyNotes.one")
for rt in doc.GetChildNodes(RichText):
    for run in rt.TextRuns:
        style = run.Style
        parts = []
        if style.IsBold:          parts.append("bold")
        if style.IsItalic:        parts.append("italic")
        if style.IsUnderline:     parts.append("underline")
        if style.IsStrikethrough: parts.append("strikethrough")
        if style.IsSuperscript:   parts.append("superscript")
        if style.IsSubscript:     parts.append("subscript")
        if style.FontName:        parts.append(f"font={style.FontName!r}")
        if style.FontSize:        parts.append(f"size={style.FontSize}pt")
        label = ", ".join(parts) if parts else "plain"
        print(f"[{label}] {run.Text!r}")

TextStyle Property Reference

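Property	Type	Description
`IsBold`	`bool`	Bold text
`IsItalic`	`bool`	Italic text
`IsUnderline`	`bool`	Underlined text
`IsStrikethrough`	`bool`	Strikethrough text
`IsSuperscript`	`bool`	Superscript
`IsSubscript`	`bool`	Subscript
`FontName`	`str | None`	Font name
`FontSize`	`float | None`	Font size (points)
`FontColor`	`int | None`	Font color
`Highlight`	`int | None`	Highlight color
`Language`	`int | None`	Language identifier
`IsHyperlink`	`bool`	Whether this run is a hyperlink
`HyperlinkAddress`	`str | None`	Link target address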

Extracting Hyperlinks

Hyperlinks are stored at the TextRun level. Check Style.IsHyperlink:

from aspose.note import Document, RichText

doc = Document("MyNotes.one")
for rt in doc.GetChildNodes(RichText):
    for run in rt.TextRuns:
        if run.Style.IsHyperlink and run.Style.HyperlinkAddress:
            print(f"  {run.Text!r:40s} -> {run.Style.HyperlinkAddress}")
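
Because each run carries one uniform style, a single link whose text mixes formatting (for example, partly bold) may plausibly show up as several consecutive runs sharing the same address. The sketch below merges such neighbors. The helper name merge_link_runs is hypothetical and not part of the library; it works on plain (text, address) pairs, where address is None for non-link runs.

```python
from itertools import groupby


def merge_link_runs(pairs):
    """Merge consecutive (text, address) pairs that share one address.

    Hypothetical helper, not part of aspose.note. Pairs whose address
    is None or empty are treated as non-link runs and dropped.
    """
    merged = []
    for address, group in groupby(pairs, key=lambda pair: pair[1]):
        text = "".join(part for part, _ in group)
        if address:
            merged.append((text, address))
    return merged


sample = [
    ("See ", None),
    ("Asp", "https://example.com"),
    ("ose", "https://example.com"),
    (" docs", None),
]
links = merge_link_runs(sample)
```

To feed it from a RichText node, build one pair per run in rt.TextRuns: run.Text together with run.Style.HyperlinkAddress when run.Style.IsHyperlink is true, otherwise None. Note that two genuinely separate links that sit side by side and point to the same address would also be merged.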

Extracting Bold and Highlighted Text

Filter runs by formatting property to isolate specific content:

from aspose.note import Document, RichText

doc = Document("MyNotes.one")
print("=== Bold segments ===")
for rt in doc.GetChildNodes(RichText):
    for run in rt.TextRuns:
        if run.Style.IsBold and run.Text.strip():
            print(f"  {run.Text.strip()!r}")

print("\n=== Highlighted segments ===")
for rt in doc.GetChildNodes(RichText):
    for run in rt.TextRuns:
        if run.Style.Highlight is not None and run.Text.strip():
            color = f"#{run.Style.Highlight & 0xFFFFFF:06X}"
            print(f"  [{color}] {run.Text.strip()!r}")

Extracting Text from Title Blocks

Page titles are RichText nodes inside the Title object. They are not returned by a top-level GetChildNodes(RichText) call on the page unless the Title subtree is included. Access them directly:

from aspose.note import Document, Page

doc = Document("MyNotes.one")
for page in doc.GetChildNodes(Page):
    if page.Title:
        if page.Title.TitleText:
            print("Title text:", page.Title.TitleText.Text)
        if page.Title.TitleDate:
            print("Title date:", page.Title.TitleDate.Text)
        if page.Title.TitleTime:
            print("Title time:", page.Title.TitleTime.Text)

Extracting Text from Tables

Table cells contain RichText children. Use nested GetChildNodes calls:

from aspose.note import Document, Table, TableRow, TableCell, RichText

doc = Document("MyNotes.one")
for table in doc.GetChildNodes(Table):
    for row in table.GetChildNodes(TableRow):
        row_values = []
        for cell in row.GetChildNodes(TableCell):
            cell_text = " ".join(
                rt.Text for rt in cell.GetChildNodes(RichText) if rt.Text
            ).strip()
            row_values.append(cell_text)
        print(row_values)
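
Once each row is a list of strings, the rows can be serialized with Python's standard csv module, which handles quoting of commas and quotes inside cell text. This is a minimal sketch: rows_to_csv is a hypothetical helper name, and it assumes the rows were gathered as lists like row_values above.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize a list of rows (lists of strings) to CSV text.

    Hypothetical helper, not part of aspose.note.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


table_rows = [["Task", "Notes"], ["Review", "pages 1, 2"]]
csv_text = rows_to_csv(table_rows)
```

Cells that contain the delimiter are quoted automatically, so text extracted from notes does not need to be escaped by hand.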

In-Memory Text Operations

Replacing Text

RichText.Replace(old_value, new_value) replaces text in memory across all runs:

from aspose.note import Document, RichText

doc = Document("MyNotes.one")
for rt in doc.GetChildNodes(RichText):
    rt.Replace("TODO", "DONE")
# Changes are in-memory only; saving back to .one is not supported

Appending a Text Run

from aspose.note import Document, RichText

doc = Document("MyNotes.one")
for rt in doc.GetChildNodes(RichText):
    rt.Append(" [reviewed]")  # appends with default style
    break  # just the first node in this example

Saving Extracted Text to a File

import sys
from aspose.note import Document, RichText

if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8", errors="replace")

doc = Document("MyNotes.one")
lines = [rt.Text for rt in doc.GetChildNodes(RichText) if rt.Text]

with open("extracted.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(lines))

print(f"Extracted {len(lines)} text blocks.")

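Tips

GetChildNodes(RichText) called on a Document searches the entire tree, including all pages, outlines, and outline elements. Call it on a specific Page to restrict the scope.
Always check rt.Text (for example, if rt.Text:), because empty RichText nodes exist in some documents.
On Windows, reconfigure sys.stdout to UTF-8 to avoid UnicodeEncodeError when printing characters outside the system code page.
TextRun has only Text and Style fields. There are no Start/End offset properties; to locate a run's text within the parent RichText.Text, search for run.Text within rt.Text manually.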
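
The manual search mentioned in the last tip can be sketched as a left-to-right scan. This assumes the run texts appear in rt.Text in the same order as in rt.TextRuns, which the source does not state explicitly. The helper name locate_runs is hypothetical, and it works on plain strings so it can be tried without a document.

```python
def locate_runs(parent_text, run_texts):
    """Return a (start, end) span for each run text, or None if not found.

    Hypothetical helper, not part of aspose.note. The search for each
    run starts where the previous match ended, so repeated text is
    matched in order rather than always at its first occurrence.
    """
    spans = []
    cursor = 0
    for text in run_texts:
        start = parent_text.find(text, cursor)
        if start == -1:
            spans.append(None)  # leave the cursor where it was
            continue
        end = start + len(text)
        spans.append((start, end))
        cursor = end
    return spans


spans = locate_runs("Hello bold world", ["Hello ", "bold", " world"])
```

With a real node, pass rt.Text as the first argument and the list of run.Text values from rt.TextRuns as the second. A None entry in the result signals that the ordering assumption did not hold for that run.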