Text Extraction

Aspose.PDF FOSS for .NET provides several mechanisms for extracting text from PDF pages. TextFragmentAbsorber is the primary tool — it scans page content and returns structured TextFragment objects with position, font, and style information.


Basic text extraction

TextAbsorber extracts all text from a page or document as a single string.

using var doc = Document.Open(pdfBytes);

var absorber = new TextAbsorber();
doc.Pages[1].Accept(absorber);

string pageText = absorber.Text;

Extracting text fragments

TextFragmentAbsorber provides structured results. Each TextFragment includes the extracted text, its position on the page, and font details.

using var doc = Document.Open(pdfBytes);

var absorber = new TextFragmentAbsorber();
doc.Pages[1].Accept(absorber);

foreach (var fragment in absorber.TextFragments)
{
    Console.WriteLine($"Text: {fragment.Text}");
    Console.WriteLine($"Position: ({fragment.Position.XIndent}, {fragment.Position.YIndent})");
    Console.WriteLine($"Font: {fragment.TextState.Font.FontName}");
}

Searching with regular expressions

Pass a regex pattern to TextFragmentAbsorber to find specific text.

using var doc = Document.Open(pdfBytes);

// Match digit groups of the form 123-45-6789
var absorber = new TextFragmentAbsorber(@"\d{3}-\d{2}-\d{4}");
doc.Pages.Accept(absorber);  // Search all pages

foreach (var fragment in absorber.TextFragments)
{
    Console.WriteLine($"Found: {fragment.Text}");
}

Text segments

Each TextFragment may contain multiple TextSegment objects when the text spans different font runs.

foreach (var fragment in absorber.TextFragments)
{
    foreach (var segment in fragment.Segments)
    {
        Console.WriteLine($"Segment: {segment.Text}, Font: {segment.TextState.Font.FontName}");
    }
}

Font management

FontRepository provides methods to find and load fonts.

var font = FontRepository.FindFont("Helvetica");

Building text paragraphs

TextParagraph allows constructing multi-line text blocks for insertion into a page.

var paragraph = new TextParagraph();
paragraph.AppendLine(new TextFragment("First line"));
paragraph.AppendLine(new TextFragment("Second line"));

Tips and Best Practices

  • Use TextFragmentAbsorber when you need position and font data; use TextAbsorber for plain-text extraction.
  • Accept the absorber on a single page for faster results, or on doc.Pages to search the entire document.
  • Regular expression search is case-sensitive by default; prefix the pattern with (?i) for case-insensitive matching.
  • Check TextState.Font.IsEmbedded to determine whether a font is embedded in the PDF.
  • For large documents, process pages one at a time to manage memory usage.
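
The (?i) tip above uses a standard .NET inline regex option. The sketch below shows its effect with System.Text.RegularExpressions only. It assumes the library passes the pattern to the .NET regex engine unchanged, which is not confirmed here. InlineOptionDemo and the sample pattern are hypothetical names used for illustration.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper: demonstrates the inline (?i) option with plain .NET Regex.
public static class InlineOptionDemo
{
    // Returns true when the pattern matches anywhere in the input.
    public static bool Matches(string input, string pattern) =>
        Regex.IsMatch(input, pattern);

    // Case-sensitive pattern: matches "invoice 42" but not "INVOICE 42".
    public const string Strict = @"invoice\s+\d+";

    // Same pattern with the (?i) prefix: letter case is ignored.
    public const string Relaxed = @"(?i)invoice\s+\d+";
}
```

With these definitions, InlineOptionDemo.Matches("INVOICE 42", InlineOptionDemo.Strict) is false, while the same call with InlineOptionDemo.Relaxed is true.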

Common Issues

| Issue | Cause | Fix |
|---|---|---|
| Extracted text is garbled | PDF uses a non-standard encoding or CID font mapping | Check font embedding; some scanned PDFs require OCR |
| No text fragments found | Page content is an image, not text | Use an OCR tool to convert image pages to text first |
| Regex returns no matches | Pattern does not account for whitespace inserted by PDF text layout | Normalize whitespace or use a looser pattern |
| TextState.Font is null | Font resource is missing from the PDF | Handle null checks when inspecting font properties |
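
For the "Regex returns no matches" row, the two suggested fixes might look like the following sketch. It works on strings that have already been extracted (for example absorber.Text) and uses only System.Text.RegularExpressions. WhitespaceTolerantSearch and its members are hypothetical names, not part of the library.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper for matching text whose spacing was altered by PDF layout.
public static class WhitespaceTolerantSearch
{
    // Fix 1: collapse every run of whitespace (spaces, tabs, line breaks)
    // into a single space before applying a strict pattern.
    public static string NormalizeWhitespace(string text) =>
        Regex.Replace(text, @"\s+", " ").Trim();

    // Fix 2: a looser form of \d{3}-\d{2}-\d{4} that tolerates
    // optional whitespace on either side of each hyphen.
    public static readonly Regex LoosePattern =
        new Regex(@"\d{3}\s*-\s*\d{2}\s*-\s*\d{4}");
}
```

Normalizing first keeps the search pattern simple. The looser pattern is useful when the original spacing has to be preserved in the matched text.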

FAQ

Can I extract text from a specific region of a page?

Yes. Set TextFragmentAbsorber.TextSearchOptions.Rectangle to limit the search to a specific area of the page.

Does text extraction preserve reading order?

The library returns text in content-stream order. For multi-column layouts, you may need to sort fragments by position.
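
One way to sort by position is sketched below for a two-column page. PositionedText is a hypothetical stand-in for the values read from a fragment (Text, Position.XIndent, Position.YIndent), and ReadingOrder, columnSplitX, and lineTolerance are illustrative names. The sketch assumes the usual PDF coordinate system with the origin at the bottom-left, so a larger Y value is higher on the page. It requires C# 9 or later for the record type.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a fragment's text and position.
public record PositionedText(string Text, double X, double Y);

public static class ReadingOrder
{
    // Orders items top-to-bottom, then left-to-right within each line.
    // Items within lineTolerance of the line's first (highest) item share that line.
    public static List<PositionedText> SortLines(
        IEnumerable<PositionedText> items, double lineTolerance = 2.0)
    {
        var result = new List<PositionedText>();
        var line = new List<PositionedText>();
        double lineY = 0;

        foreach (var item in items.OrderByDescending(i => i.Y))
        {
            // A drop larger than the tolerance starts a new line.
            if (line.Count > 0 && lineY - item.Y > lineTolerance)
            {
                result.AddRange(line.OrderBy(i => i.X));
                line.Clear();
            }
            if (line.Count == 0) lineY = item.Y;
            line.Add(item);
        }
        result.AddRange(line.OrderBy(i => i.X));
        return result;
    }

    // Reads the left column in full, then the right column.
    // columnSplitX is the X coordinate that separates the two columns.
    public static List<PositionedText> SortTwoColumns(
        IEnumerable<PositionedText> items, double columnSplitX, double lineTolerance = 2.0)
    {
        var all = items.ToList();
        var ordered = SortLines(all.Where(i => i.X < columnSplitX), lineTolerance);
        ordered.AddRange(SortLines(all.Where(i => i.X >= columnSplitX), lineTolerance));
        return ordered;
    }
}
```

To use it, project each fragment into a PositionedText and call ReadingOrder.SortTwoColumns with a split value chosen for the page layout. Pages with more columns or irregular layouts would need a different grouping rule.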

Can I extract text from all pages at once?

Yes. Call doc.Pages.Accept(absorber) to search every page in the document.


API Reference Summary

| Class / Method | Description |
|---|---|
| TextAbsorber | Extracts all text as a single string |
| TextFragmentAbsorber | Extracts structured text fragments with position and font data |
| TextFragment | A text run with position, text state, and segments |
| TextFragment.Text | The extracted text string |
| TextFragment.Position | Coordinates on the page |
| TextFragment.TextState | Font, size, and color information |
| TextFragment.Segments | Sub-runs within the fragment |
| TextParagraph | Builder for multi-line text blocks |
| TextState | Font name, size, color, and style properties |
| FontRepository | Static font lookup and loading |
| FontRepository.FindFont | Find a font by name |

See Also