Logical Structure
Tagged PDF Structure
PDF documents can include a logical structure tree (ISO 32000-1:2008, §14.7) that maps document content to semantic elements such as headings, paragraphs, and tables. This structure is used by screen readers and accessibility tools.
StructTreeRoot
StructTreeRoot is the root of the structure tree. It is accessible via the document
catalog and provides methods for navigating the parent tree:
    try (Document doc = new Document("tagged.pdf")) {
        COSDictionary catalog = doc.getCatalog();
        // Access structure tree via catalog MarkInfo and StructTreeRoot entries
    }

findElementByMcid() resolves a Marked Content ID (MCID) to the corresponding
structure element in the parent tree.
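
To make the lookup concrete, here is a minimal, library-independent sketch of an MCID resolution. In ISO 32000-1 the parent tree is a number tree keyed by each page's StructParents value, and each entry holds an array of structure elements indexed by MCID. The class ParentTreeSketch, the nested StructElem type, and the put() and find() methods are hypothetical names used only for illustration; they are not this library's API, and findElementByMcid() may work differently internally.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;

/** Illustrative model only; none of these names come from the library. */
public class ParentTreeSketch {

    /** Hypothetical stand-in for a structure element: only its type tag. */
    public static final class StructElem {
        public final String type;

        public StructElem(String type) {
            this.type = type;
        }
    }

    // Key: a page's StructParents index. Value: that page's elements, indexed by MCID.
    private final Map<Integer, List<StructElem>> parentTree = new TreeMap<>();

    /** Registers the elements of one page, ordered so that list index == MCID. */
    public void put(int structParents, List<StructElem> elementsByMcid) {
        parentTree.put(structParents, elementsByMcid);
    }

    /** Resolves (page key, MCID) to a structure element, or empty if unmapped. */
    public Optional<StructElem> find(int structParents, int mcid) {
        List<StructElem> elems = parentTree.get(structParents);
        if (elems == null || mcid < 0 || mcid >= elems.size()) {
            return Optional.empty();
        }
        return Optional.ofNullable(elems.get(mcid));
    }

    public static void main(String[] args) {
        ParentTreeSketch tree = new ParentTreeSketch();
        // Page with StructParents 0: MCID 0 is a heading, MCID 1 is a paragraph.
        tree.put(0, Arrays.asList(new StructElem("H1"), new StructElem("P")));

        System.out.println(tree.find(0, 1).map(e -> e.type).orElse("none")); // prints "P"
        System.out.println(tree.find(0, 9).map(e -> e.type).orElse("none")); // prints "none"
    }
}
```

Returning an Optional rather than null keeps the "no structure element for this MCID" case explicit, which matters for untagged or partially tagged pages.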
Structure Elements
Structure elements are represented as COSDictionary instances whose structure type
(the /S entry, e.g., H1, P, Table) identifies their semantic role. The parent tree maps
each MCID to the structure element that owns it, so content extraction tools can
associate marked-content sequences with their semantic roles.
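
As a rough illustration of how type-tagged dictionaries form a tree, the sketch below walks a hand-built structure and collects each element's type in document order. It uses plain java.util maps as stand-ins for COSDictionary, with "S" for the structure type and "K" for the kids, following the entry names in ISO 32000-1. The class StructWalkSketch and the method collectTypes() are hypothetical, and the sketch assumes "K" is always a list, whereas a real file may also store a single dictionary or integer there.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Illustrative only: plain maps stand in for structure element dictionaries. */
public class StructWalkSketch {

    /** Returns the "S" type tag of every element, depth-first in document order. */
    public static List<String> collectTypes(Map<String, Object> elem) {
        List<String> out = new ArrayList<>();
        walk(elem, out);
        return out;
    }

    private static void walk(Map<String, Object> elem, List<String> out) {
        Object type = elem.get("S");
        if (type instanceof String) {
            out.add((String) type);
        }
        Object kids = elem.get("K");
        if (kids instanceof List) {
            for (Object kid : (List<?>) kids) {
                if (kid instanceof Map) {
                    @SuppressWarnings("unchecked")
                    Map<String, Object> child = (Map<String, Object>) kid;
                    walk(child, out);
                }
                // Integer kids stand for MCIDs (marked content), not elements; skip them.
            }
        }
    }

    public static void main(String[] args) {
        Map<String, Object> heading = Map.of("S", "H1", "K", List.of(0));
        Map<String, Object> paragraph = Map.of("S", "P", "K", List.of(1));
        Map<String, Object> document = Map.of("S", "Document", "K", List.of(heading, paragraph));

        System.out.println(collectTypes(document)); // prints "[Document, H1, P]"
    }
}
```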