Text Extraction

Text Extraction: Aspose.Note FOSS for Python

Aspose.Note FOSS for Python exposes the full text content of every OneNote page through the RichText node. Each RichText holds both a plain-text .Text string and a .TextRuns list of individually styled TextRun segments. This page documents all available text extraction patterns.

Extract All Plain Text

The quickest way to get all text from a document is GetChildNodes(RichText), which performs a recursive depth-first traversal of the entire DOM:

from aspose.note import Document, RichText

doc = Document("MyNotes.one")
for rt in doc.GetChildNodes(RichText):
    if rt.Text:
        print(rt.Text)

Or collect the text and join it into a single string:

from aspose.note import Document, RichText

doc = Document("MyNotes.one")
all_text = "\n".join(
    rt.Text for rt in doc.GetChildNodes(RichText) if rt.Text
)

Extract Text Page by Page

Organize the extracted text by page title:

from aspose.note import Document, Page, RichText

doc = Document("MyNotes.one")
for page in doc.GetChildNodes(Page):
    title = (
        page.Title.TitleText.Text
        if page.Title and page.Title.TitleText
        else "(untitled)"
    )
    print(f"\n=== {title} ===")
    for rt in page.GetChildNodes(RichText):
        if rt.Text:
            print(rt.Text)
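
When the text needs further processing rather than printing, the same loop can feed a dictionary keyed by page title. The following is a minimal sketch, not part of the library: group_by_title is a hypothetical helper, and the sample pairs stand in for the (title, rt.Text) values the loop above would produce.

```python
from collections import defaultdict


def group_by_title(pairs):
    """Group text blocks by page title (hypothetical helper, not a library call).

    `pairs` is an iterable of (title, text) tuples. Empty text is skipped and
    a missing title falls back to "(untitled)".
    """
    grouped = defaultdict(list)
    for title, text in pairs:
        if text:
            grouped[title or "(untitled)"].append(text)
    return dict(grouped)


# Sample data standing in for values collected from real pages.
pairs = [("Inbox", "Buy milk"), ("Inbox", "Call Sam"), (None, "Loose note"), ("Empty", "")]
by_title = group_by_title(pairs)
```

Pages that share a title are merged under one key in this sketch; if that matters, key the dictionary on something unique, such as the page's position in the document.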

Inspect Formatting Runs

RichText.TextRuns is a list of TextRun objects. Each run covers a contiguous range of characters that shares a single TextStyle:

from aspose.note import Document, RichText

doc = Document("MyNotes.one")
for rt in doc.GetChildNodes(RichText):
    for run in rt.TextRuns:
        style = run.Style
        parts = []
        if style.IsBold:          parts.append("bold")
        if style.IsItalic:        parts.append("italic")
        if style.IsUnderline:     parts.append("underline")
        if style.IsStrikethrough: parts.append("strikethrough")
        if style.IsSuperscript:   parts.append("superscript")
        if style.IsSubscript:     parts.append("subscript")
        if style.FontName:        parts.append(f"font={style.FontName!r}")
        if style.FontSize:        parts.append(f"size={style.FontSize}pt")
        label = ", ".join(parts) if parts else "plain"
        print(f"[{label}] {run.Text!r}")
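
The same per-run flags can drive a simple conversion to Markdown. This is a rough sketch under stated assumptions: run_to_markdown and fake_run are hypothetical names, the fake runs are stand-ins that model only the Text, Style.IsBold, and Style.IsItalic fields described above, and Markdown special characters in the text are not escaped.

```python
from types import SimpleNamespace


def run_to_markdown(text, style):
    """Wrap text in Markdown emphasis markers based on IsBold / IsItalic."""
    core = text.strip()
    if not core:
        return text  # whitespace-only runs pass through unchanged
    # Keep surrounding whitespace outside the markers so emphasis still parses.
    lead = text[: len(text) - len(text.lstrip())]
    trail = text[len(text.rstrip()):]
    if style.IsBold:
        core = f"**{core}**"
    if style.IsItalic:
        core = f"*{core}*"
    return f"{lead}{core}{trail}"


def fake_run(text, bold=False, italic=False):
    """Build a stand-in for a TextRun (illustration only, not the real class)."""
    style = SimpleNamespace(IsBold=bold, IsItalic=italic)
    return SimpleNamespace(Text=text, Style=style)


runs = [fake_run("Read the "), fake_run("manual ", bold=True), fake_run("first", italic=True)]
markdown = "".join(run_to_markdown(r.Text, r.Style) for r in runs)
```

With a real document, the equivalent call would be "".join(run_to_markdown(run.Text, run.Style) for run in rt.TextRuns).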

TextStyle Property Reference

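Property	Type	Description
`IsBold`	`bool`	Bold text
`IsItalic`	`bool`	Italic text
`IsUnderline`	`bool`	Underlined text
`IsStrikethrough`	`bool`	Strikethrough text
`IsSuperscript`	`bool`	Superscript
`IsSubscript`	`bool`	Subscript
`FontName`	`str | None`	Font name, or `None` if not set
`FontSize`	`float | None`	Font size, or `None` if not set
`FontColor`	`int | None`	Font color, or `None` if not set
`Highlight`	`int | None`	Highlight color, or `None` if not set
`Language`	`int | None`	Language identifier, or `None` if not set
`IsHyperlink`	`bool`	Whether this run is a hyperlink
`HyperlinkAddress`	`str | None`	Link target address, or `None` if not set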

Extract Hyperlinks

Hyperlinks are stored at the TextRun level. Check Style.IsHyperlink:

from aspose.note import Document, RichText

doc = Document("MyNotes.one")
for rt in doc.GetChildNodes(RichText):
    for run in rt.TextRuns:
        if run.Style.IsHyperlink and run.Style.HyperlinkAddress:
            print(f"  {run.Text!r:40s} -> {run.Style.HyperlinkAddress}")
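
A document often links to the same address more than once. As a minimal sketch, the filter above can be extended to keep only the first occurrence of each address. unique_links and fake_link are hypothetical names, and the stand-in runs model only the Text, Style.IsHyperlink, and Style.HyperlinkAddress fields used above.

```python
from types import SimpleNamespace


def unique_links(runs):
    """Return (text, address) pairs, keeping the first run seen per address."""
    seen = set()
    links = []
    for run in runs:
        style = run.Style
        if not (style.IsHyperlink and style.HyperlinkAddress):
            continue  # not a link, or a link without a target
        if style.HyperlinkAddress in seen:
            continue  # address already reported
        seen.add(style.HyperlinkAddress)
        links.append((run.Text, style.HyperlinkAddress))
    return links


def fake_link(text, address=None):
    """Build a stand-in for a TextRun (illustration only, not the real class)."""
    style = SimpleNamespace(IsHyperlink=address is not None, HyperlinkAddress=address)
    return SimpleNamespace(Text=text, Style=style)


runs = [
    fake_link("docs", "https://example.com/docs"),
    fake_link(" and "),
    fake_link("the docs again", "https://example.com/docs"),
    fake_link("home", "https://example.com/"),
]
links = unique_links(runs)
```

To apply this across a real document, pass a flat sequence of runs, for example (run for rt in doc.GetChildNodes(RichText) for run in rt.TextRuns).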

Extract Bold and Highlighted Text

Filter runs by their formatting properties to isolate specific content:

from aspose.note import Document, RichText

doc = Document("MyNotes.one")
print("=== Bold segments ===")
for rt in doc.GetChildNodes(RichText):
    for run in rt.TextRuns:
        if run.Style.IsBold and run.Text.strip():
            print(f"  {run.Text.strip()!r}")

print("\n=== Highlighted segments ===")
for rt in doc.GetChildNodes(RichText):
    for run in rt.TextRuns:
        if run.Style.Highlight is not None and run.Text.strip():
            color = f"#{run.Style.Highlight & 0xFFFFFF:06X}"
            print(f"  [{color}] {run.Text.strip()!r}")

Extract Text from Title Blocks

Page titles are RichText nodes inside the Title object. They are not returned by a top-level GetChildNodes(RichText) call on the page unless you include the Title subtree. Access them directly:

from aspose.note import Document, Page

doc = Document("MyNotes.one")
for page in doc.GetChildNodes(Page):
    if page.Title:
        if page.Title.TitleText:
            print("Title text:", page.Title.TitleText.Text)
        if page.Title.TitleDate:
            print("Title date:", page.Title.TitleDate.Text)
        if page.Title.TitleTime:
            print("Title time:", page.Title.TitleTime.Text)

Extract Text from Tables

Table cells contain RichText children. Use nested GetChildNodes calls:

from aspose.note import Document, Table, TableRow, TableCell, RichText

doc = Document("MyNotes.one")
for table in doc.GetChildNodes(Table):
    for row in table.GetChildNodes(TableRow):
        row_values = []
        for cell in row.GetChildNodes(TableCell):
            cell_text = " ".join(
                rt.Text for rt in cell.GetChildNodes(RichText)
            ).strip()
            row_values.append(cell_text)
        print(row_values)
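
Instead of printing each row, the row_values lists can be collected and written out as CSV with the standard library. This is a small sketch: rows_to_csv is a hypothetical helper, and the sample rows stand in for data extracted by the loop above.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize a list of rows (each a list of cell strings) to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


# Sample data standing in for the row_values lists built from a real table.
sample_rows = [["Name", "Note"], ["Ann", "likes tea, not coffee"]]
csv_text = rows_to_csv(sample_rows)
```

The csv module quotes cells that contain commas, quotes, or newlines, which a plain ",".join(...) would not do.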

In-Memory Text Operations

Replace Text

RichText.Replace(old_value, new_value) replaces text in memory, across all runs:

from aspose.note import Document, RichText

doc = Document("MyNotes.one")
for rt in doc.GetChildNodes(RichText):
    rt.Replace("TODO", "DONE")
# Changes are in-memory only; saving back to .one is not supported

Append a Text Run

from aspose.note import Document, RichText

doc = Document("MyNotes.one")
for rt in doc.GetChildNodes(RichText):
    rt.Append(" [reviewed]")  # appends with default style
    break  # just the first node in this example

Save Extracted Text to a File

import sys
from aspose.note import Document, RichText

if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8", errors="replace")

doc = Document("MyNotes.one")
lines = [rt.Text for rt in doc.GetChildNodes(RichText) if rt.Text]

with open("extracted.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(lines))

print(f"Extracted {len(lines)} text blocks.")
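
If the text is written to one file per page instead of a single file, page titles need to be turned into valid file names first. The helper below is a hypothetical sketch using only the standard library; the character set it replaces and the 100-character limit are conservative assumptions aimed at Windows-style file systems, not anything the library prescribes.

```python
import re


def safe_filename(title, fallback="untitled"):
    """Turn a page title into a file name (hypothetical helper).

    Replaces characters that are invalid on Windows, trims surrounding spaces
    and dots, and falls back to a default name when nothing usable remains.
    """
    cleaned = re.sub(r'[<>:"/\\|?*\x00-\x1f]', "_", title).strip(" .")
    return (cleaned or fallback)[:100] + ".txt"


name = safe_filename("Q3: Plans/Budget")
```

Two pages with the same title would map to the same file name here, so a real script would also need to add a counter or similar suffix.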

Tips

GetChildNodes(RichText) on a Document searches the entire tree, including all pages, outlines, and outline elements. Call it on a specific Page to limit the scope.
Always check rt.Text (for example, if rt.Text:) before printing, because some documents contain empty RichText nodes.
On Windows, reconfigure sys.stdout to UTF-8 to avoid a UnicodeEncodeError when printing characters outside the system code page.
TextRun has only Text and Style fields. There are no Start/End offset properties; to locate a run's text within the parent RichText.Text, search for run.Text inside rt.Text manually.
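
The manual search described in the last tip can be sketched as follows. This assumes the runs appear in rt.Text in the same order as in TextRuns, which the source does not state explicitly; locate_runs and FakeRun are hypothetical names, and FakeRun models only the Text field.

```python
from dataclasses import dataclass


@dataclass
class FakeRun:
    """Hypothetical stand-in for TextRun; only the Text field is modelled."""
    Text: str


def locate_runs(full_text, runs):
    """Return (start, end, text) per run, searching left to right.

    A moving cursor keeps repeated substrings from all matching the first hit.
    Runs that cannot be found are reported with start and end set to None.
    """
    spans = []
    cursor = 0
    for run in runs:
        start = full_text.find(run.Text, cursor) if run.Text else -1
        if start == -1:
            spans.append((None, None, run.Text))
            continue
        end = start + len(run.Text)
        spans.append((start, end, run.Text))
        cursor = end
    return spans


runs = [FakeRun("to be "), FakeRun("or not "), FakeRun("to be")]
spans = locate_runs("to be or not to be", runs)
```

With a real node, the call would be locate_runs(rt.Text, rt.TextRuns). Searching from the end of the previous match matters because a plain rt.Text.find(run.Text) would report the first "to be" for both the first and the last run.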