Text Extraction

Aspose.PDF FOSS for .NET provides several mechanisms for extracting text from PDF pages. TextFragmentAbsorber is the primary tool: it scans page content and returns structured TextFragment objects with position, font, and style information.

Basic text extraction

TextAbsorber extracts all text from a page or document as a single string.

using var doc = Document.Open(pdfBytes);

var absorber = new TextAbsorber();
doc.Pages[1].Accept(absorber);

string pageText = absorber.Text;

Extracting text fragments

TextFragmentAbsorber provides structured results. Each TextFragment includes the extracted text, its position on the page, and font details.

using var doc = Document.Open(pdfBytes);

var absorber = new TextFragmentAbsorber();
doc.Pages[1].Accept(absorber);

foreach (var fragment in absorber.TextFragments)
{
    Console.WriteLine($"Text: {fragment.Text}");
    Console.WriteLine($"Position: ({fragment.Position.XIndent}, {fragment.Position.YIndent})");
    // Font can be null when the font resource is missing from the PDF
    Console.WriteLine($"Font: {fragment.TextState.Font?.FontName}");
}

Searching with regular expressions

Pass a regex pattern to TextFragmentAbsorber to find specific text.

var absorber = new TextFragmentAbsorber(@"\d{3}-\d{2}-\d{4}");
doc.Pages.Accept(absorber);  // Search all pages

foreach (var fragment in absorber.TextFragments)
{
    Console.WriteLine($"Found: {fragment.Text}");
}

Text segments

Each TextFragment may contain multiple TextSegment objects when the text spans different font runs.

foreach (var fragment in absorber.TextFragments)
{
    foreach (var segment in fragment.Segments)
    {
        Console.WriteLine($"Segment: {segment.Text}, Font: {segment.TextState.Font?.FontName}");
    }
}

Font management

FontRepository provides methods to find and load fonts.

var font = FontRepository.FindFont("Helvetica");

Building text paragraphs

TextParagraph allows constructing multi-line text blocks for insertion into a page.

var paragraph = new TextParagraph();
paragraph.AppendLine(new TextFragment("First line"));
paragraph.AppendLine(new TextFragment("Second line"));

Tips and Best Practices

Use TextFragmentAbsorber when you need position and font data; use TextAbsorber for plain-text extraction.
Accept the absorber on a single page for faster results, or on doc.Pages to search the entire document.
Regular expression search is case-sensitive by default; prefix the pattern with (?i) for case-insensitive matching.
Check TextState.Font.IsEmbedded to determine whether a font is embedded in the PDF.
For large documents, process pages one at a time to manage memory usage.
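
The effect of the (?i) prefix can be checked on a plain string before using the pattern with an absorber. The sketch below uses only System.Text.RegularExpressions; the sample string and the INV- pattern are made up for illustration, and it assumes the absorber passes the pattern to the .NET regex engine unchanged.

```csharp
using System;
using System.Text.RegularExpressions;

// Assumption: the absorber hands the pattern to .NET's Regex engine unchanged,
// so inline options such as (?i) behave the same way as in this standalone check.
string sample = "Invoice INV-2024 and invoice inv-2025";

// Case-sensitive by default: only the upper-case "INV-2024" matches.
int sensitive = Regex.Matches(sample, @"INV-\d{4}").Count;

// The (?i) inline option makes the whole pattern case-insensitive,
// so "inv-2025" matches as well.
int insensitive = Regex.Matches(sample, @"(?i)INV-\d{4}").Count;

Console.WriteLine($"case-sensitive: {sensitive}, case-insensitive: {insensitive}");
```

Testing the pattern this way keeps regex mistakes separate from PDF layout effects such as inserted whitespace.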

Common Issues

Issue	Cause	Fix
Extracted text is garbled	PDF uses a non-standard encoding or CID font mapping	Check font embedding; some scanned PDFs require OCR
No text fragments found	Page content is an image, not text	Use an OCR tool to convert image pages to text first
Regex returns no matches	Pattern does not account for whitespace inserted by PDF text layout	Normalize whitespace or use a looser pattern
`TextState.Font` is null	Font resource is missing from the PDF	Null-check `TextState.Font` before reading font properties
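
For the "Regex returns no matches" row, a looser pattern can be sketched on text that has already been extracted as a string (for example, the value of TextAbsorber.Text). The extracted string here is invented, and whether the same loose pattern matches across fragment boundaries when passed to TextFragmentAbsorber is not confirmed by this page.

```csharp
using System;
using System.Text.RegularExpressions;

// Made-up example of extracted text where layout inserted spaces and a line break.
string extracted = "ID: 123 -45-\n6789";

// The strict pattern from the search example fails on this text.
bool strict = Regex.IsMatch(extracted, @"\d{3}-\d{2}-\d{4}");

// Looser pattern: allow optional whitespace around each hyphen.
Match loose = Regex.Match(extracted, @"\d{3}\s*-\s*\d{2}\s*-\s*\d{4}");

// Strip the whitespace from the match to recover the canonical form.
string normalized = Regex.Replace(loose.Value, @"\s+", "");

Console.WriteLine($"strict: {strict}, loose: {loose.Success}, value: {normalized}");
```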

FAQ

Can I extract text from a specific region of a page?

Yes. Set TextFragmentAbsorber.TextSearchOptions.Rectangle to limit the search to a specific area of the page.

Does text extraction preserve reading order?

The library returns text in content-stream order. For multi-column layouts, you may need to sort fragments by position.
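
A minimal sketch of such a sort for a two-column page follows. It does not call the library: (Text, X, Y) tuples stand in for TextFragment.Text, Position.XIndent, and Position.YIndent, the coordinates are invented, and columnSplitX is a hypothetical boundary you would pick per layout. It also assumes the usual PDF convention that y grows upward from the bottom of the page.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Stand-in data: tuples play the role of TextFragment.Text, Position.XIndent
// and Position.YIndent. The values are made up for illustration.
var fragments = new List<(string Text, double X, double Y)>
{
    ("right-top", 320, 700),
    ("left-bottom", 50, 650),
    ("right-bottom", 320, 650),
    ("left-top", 50, 700),
};

// Hypothetical column boundary: fragments starting left of it belong to the left column.
const double columnSplitX = 300;

var ordered = fragments
    .OrderBy(f => f.X < columnSplitX ? 0 : 1) // left column first
    .ThenByDescending(f => f.Y)               // assumed: larger y is higher on the page
    .ThenBy(f => f.X)                         // left to right within a line
    .Select(f => f.Text)
    .ToList();

Console.WriteLine(string.Join(" | ", ordered));
```

A fixed split works only for regular two-column pages; irregular layouts would need column detection, which is outside the scope of this sketch.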

Can I extract text from all pages at once?

Yes. Call doc.Pages.Accept(absorber) to search every page in the document.

API Reference Summary

Class / Method	Description
`TextAbsorber`	Extracts all text as a single string
`TextFragmentAbsorber`	Extracts structured text fragments with position and font data
`TextFragment`	A text run with position, text state, and segments
`TextFragment.Text`	The extracted text string
`TextFragment.Position`	Coordinates on the page
`TextFragment.TextState`	Font, size, and color information
`TextFragment.Segments`	Sub-runs within the fragment
`TextParagraph`	Builder for multi-line text blocks
`TextState`	Font name, size, color, and style properties
`FontRepository`	Static font lookup and loading
`FontRepository.FindFont`	Find a font by name
