Text Extraction
Aspose.PDF FOSS for .NET provides several mechanisms for extracting text from
PDF pages. TextFragmentAbsorber is the primary tool: it scans page content
and returns structured TextFragment objects with position, font, and style
information.
Basic text extraction
TextAbsorber extracts all text from a page or document as a single string.
using var doc = Document.Open(pdfBytes);
var absorber = new TextAbsorber();
doc.Pages[1].Accept(absorber);
string pageText = absorber.Text;
Extracting text fragments
TextFragmentAbsorber provides structured results. Each TextFragment
includes the extracted text, its position on the page, and font details.
using var doc = Document.Open(pdfBytes);
var absorber = new TextFragmentAbsorber();
doc.Pages[1].Accept(absorber);
foreach (var fragment in absorber.TextFragments)
{
    Console.WriteLine($"Text: {fragment.Text}");
    Console.WriteLine($"Position: ({fragment.Position.XIndent}, {fragment.Position.YIndent})");
    Console.WriteLine($"Font: {fragment.TextState.Font.FontName}");
}
Searching with regular expressions
Pass a regex pattern to TextFragmentAbsorber to find specific text.
using var doc = Document.Open(pdfBytes);
// Matches digit groups in the form 123-45-6789
var absorber = new TextFragmentAbsorber(@"\d{3}-\d{2}-\d{4}");
doc.Pages.Accept(absorber); // Search all pages
foreach (var fragment in absorber.TextFragments)
{
    Console.WriteLine($"Found: {fragment.Text}");
}
Text segments
Each TextFragment may contain multiple TextSegment objects when the text
spans different font runs.
foreach (var fragment in absorber.TextFragments)
{
    foreach (var segment in fragment.Segments)
    {
        Console.WriteLine($"Segment: {segment.Text}, Font: {segment.TextState.Font.FontName}");
    }
}
Font management
FontRepository provides methods to find and load fonts.
var font = FontRepository.FindFont("Helvetica");
Building text paragraphs
TextParagraph allows constructing multi-line text blocks for insertion into
a page.
var paragraph = new TextParagraph();
paragraph.AppendLine(new TextFragment("First line"));
paragraph.AppendLine(new TextFragment("Second line"));
Tips and Best Practices
- Use TextFragmentAbsorber when you need position and font data; use TextAbsorber for plain-text extraction.
- Accept the absorber on a single page for faster results, or on doc.Pages to search the entire document.
- Regular expression search is case-sensitive by default; use the (?i) prefix for case-insensitive matching.
- Check TextState.Font.IsEmbedded to determine whether a font is embedded in the PDF.
- For large documents, process pages one at a time to manage memory usage.
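
The effect of the (?i) prefix mentioned in the tips can be checked without a PDF. The sketch below is a minimal illustration that uses the standard .NET Regex class directly on a made-up string; it assumes the pattern passed to TextFragmentAbsorber follows the same .NET regular expression syntax, which the tips imply but this page does not state outright.

```csharp
using System;
using System.Text.RegularExpressions;

// Made-up sample text standing in for content extracted from a page.
string pageText = "Invoice INV-1042 issued; see invoice inv-1043 for the credit note.";

// Default matching is case-sensitive: only the upper-case "INV-" prefix matches.
var caseSensitive = Regex.Matches(pageText, @"INV-\d+");

// The inline (?i) option makes the whole pattern case-insensitive.
var caseInsensitive = Regex.Matches(pageText, @"(?i)INV-\d+");

Console.WriteLine($"Case-sensitive matches: {caseSensitive.Count}");
Console.WriteLine($"Case-insensitive matches: {caseInsensitive.Count}");
```

Trying a pattern on plain strings like this first is a quick way to separate regex mistakes from PDF layout issues.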
Common Issues
| Issue | Cause | Fix |
|---|---|---|
| Extracted text is garbled | PDF uses a non-standard encoding or CID font mapping | Check font embedding; some scanned PDFs require OCR |
| No text fragments found | Page content is an image, not text | Use an OCR tool to convert image pages to text first |
| Regex returns no matches | Pattern does not account for whitespace inserted by PDF text layout | Normalize whitespace or use a looser pattern |
| TextState.Font is null | Font resource is missing from the PDF | Handle null checks when inspecting font properties |
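
The whitespace fix in the table can be approached in two ways: loosen the pattern, or normalize the text before matching. The following sketch shows both on a made-up string using only the standard .NET Regex class. The sample text is hypothetical, and the normalization step applies to text you have already extracted (for example from TextAbsorber.Text), not to the absorber's own search.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical extracted text: layout inserted a double space and a line break.
string extracted = "Total  amount\ndue: 120.50";

// A pattern with literal single spaces misses the text.
bool strict = Regex.IsMatch(extracted, @"Total amount due");

// \s+ tolerates any run of whitespace between the words.
bool loose = Regex.IsMatch(extracted, @"Total\s+amount\s+due");

// Alternatively, collapse whitespace runs first and keep the simple pattern.
string normalized = Regex.Replace(extracted, @"\s+", " ").Trim();
bool afterNormalize = Regex.IsMatch(normalized, @"Total amount due");

Console.WriteLine($"strict: {strict}, loose: {loose}, after normalize: {afterNormalize}");
```

The looser pattern is the option that can be passed to a regex-based search as is; normalization is simpler when you only need to test the plain text.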
FAQ
Can I extract text from a specific region of a page?
Yes. Set TextFragmentAbsorber.TextSearchOptions.Rectangle to limit the search
to a specific area of the page.
Does text extraction preserve reading order?
The library returns text in content-stream order. For multi-column layouts, you may need to sort fragments by position.
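
As a rough illustration of sorting by position, the sketch below orders fragments for a two-column page. It uses plain tuples as stand-ins for fragment data so it runs on its own; in real code the values would come from fragment.Text, fragment.Position.XIndent, and fragment.Position.YIndent. The column boundary at x = 300 is a hypothetical value, and the sort assumes positions use the usual PDF coordinate system, where y increases toward the top of the page.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Stand-in fragment data in content-stream order (hypothetical values).
var fragments = new List<(string Text, double X, double Y)>
{
    ("right column, line 1", 320, 700),
    ("left column, line 2", 72, 680),
    ("right column, line 2", 320, 680),
    ("left column, line 1", 72, 700),
};

// Assumed layout: two columns split at x = 300 (hypothetical threshold).
const double columnSplit = 300;

var ordered = fragments
    .OrderBy(f => f.X < columnSplit ? 0 : 1)  // left column before right column
    .ThenByDescending(f => f.Y)               // higher y first, i.e. top of page first
    .ThenBy(f => f.X)                         // left to right within a line
    .ToList();

foreach (var f in ordered)
    Console.WriteLine(f.Text);
```

A fixed column boundary only works when the layout is known in advance; pages with varying layouts would need the boundary detected from the fragment positions themselves.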
Can I extract text from all pages at once?
Yes. Call doc.Pages.Accept(absorber) to search every page in the document.
API Reference Summary
| Class / Method | Description |
|---|---|
| TextAbsorber | Extracts all text as a single string |
| TextFragmentAbsorber | Extracts structured text fragments with position and font data |
| TextFragment | A text run with position, text state, and segments |
| TextFragment.Text | The extracted text string |
| TextFragment.Position | Coordinates on the page |
| TextFragment.TextState | Font, size, and color information |
| TextFragment.Segments | Sub-runs within the fragment |
| TextParagraph | Builder for multi-line text blocks |
| TextState | Font name, size, color, and style properties |
| FontRepository | Static font lookup and loading |
| FontRepository.FindFont | Find a font by name |