Logical Structure

Tagged PDF Structure

PDF documents can include a logical structure tree (ISO 32000-1:2008, §14.7) that maps document content to semantic elements such as headings, paragraphs, and tables. This structure is used by screen readers and accessibility tools.

StructTreeRoot

StructTreeRoot is the root of the structure tree. It is referenced from the document catalog's StructTreeRoot entry and provides methods for navigating the parent tree. The catalog itself is obtained from the document:

try (Document doc = new Document("tagged.pdf")) {
    COSDictionary catalog = doc.getCatalog();
    // The catalog's MarkInfo entry flags the document as tagged;
    // its StructTreeRoot entry references the structure tree.
}

findElementByMcid() resolves a marked-content identifier (MCID) to the structure element that owns it, using the parent tree.
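
The lookup can be pictured with plain Java collections. The sketch below does not use the library's API: ParentTreeSketch, StructElem, and findByMcid are hypothetical names chosen for illustration. It assumes the parent tree can be modelled as a map from a content stream's StructParents key to a list of elements indexed by MCID, which is the layout ISO 32000-1 describes for page content.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical model for illustration; these are not the library's classes.
class ParentTreeSketch {

    /** Minimal stand-in for a structure element: only its structure type. */
    static final class StructElem {
        final String type;

        StructElem(String type) {
            this.type = type;
        }
    }

    /**
     * Resolves an MCID within one content stream.
     * parentTree maps a StructParents key to the elements indexed by MCID.
     */
    static Optional<StructElem> findByMcid(Map<Integer, List<StructElem>> parentTree,
                                           int structParents, int mcid) {
        List<StructElem> elems = parentTree.get(structParents);
        if (elems == null || mcid < 0 || mcid >= elems.size()) {
            return Optional.empty(); // unknown content stream or MCID out of range
        }
        return Optional.ofNullable(elems.get(mcid));
    }

    public static void main(String[] args) {
        Map<Integer, List<StructElem>> parentTree = new HashMap<>();
        // Content stream 0 carries three marked-content sequences: MCID 0, 1, 2.
        parentTree.put(0, Arrays.asList(
                new StructElem("H1"), new StructElem("P"), new StructElem("P")));

        System.out.println(findByMcid(parentTree, 0, 1).map(e -> e.type).orElse("none")); // P
        System.out.println(findByMcid(parentTree, 0, 7).map(e -> e.type).orElse("none")); // none
    }
}
```

Returning Optional rather than null makes the "no owning element" case explicit; whether the library's findElementByMcid() behaves the same way is not stated here.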

Structure Elements

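Structure elements are represented as COSDictionary instances with a type tag (e.g., H1, P, Table). In the PDF format, the parent tree is a number tree keyed by each content stream's StructParents value; for page content, each entry is an array indexed by MCID whose items are the owning structure elements. This lets content extraction tools associate marked content with its semantic role.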
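
As a rough illustration of how type tags are read from nested structure elements, the sketch below walks a small element tree and lists the tags in document order. It is an assumption-laden model rather than the library's API: dictionaries are plain Map<String, Object> values, "S" holds the structure type and "K" holds the kids (a dictionary, an integer MCID, or a list of these), following the key names used in ISO 32000-1. StructElemWalkSketch, elem, and collectTypes are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model for illustration; COSDictionary is replaced by a plain Map.
class StructElemWalkSketch {

    /** Builds a stand-in structure element dictionary with S (type) and K (kids). */
    static Map<String, Object> elem(String type, Object kids) {
        Map<String, Object> dict = new LinkedHashMap<>();
        dict.put("S", type);
        dict.put("K", kids);
        return dict;
    }

    /** Depth-first walk that records each element's type tag in document order. */
    static List<String> collectTypes(Map<String, Object> root) {
        List<String> out = new ArrayList<>();
        walk(root, out);
        return out;
    }

    private static void walk(Object node, List<String> out) {
        if (node instanceof Map) {
            Map<?, ?> dict = (Map<?, ?>) node;
            Object type = dict.get("S");
            if (type != null) {
                out.add(type.toString());
            }
            walk(dict.get("K"), out);
        } else if (node instanceof List) {
            for (Object kid : (List<?>) node) {
                walk(kid, out);
            }
        }
        // Integer MCIDs and missing kids are leaves: nothing to record.
    }

    public static void main(String[] args) {
        Map<String, Object> row = elem("TR", Arrays.asList(elem("TD", 0), elem("TD", 1)));
        Map<String, Object> table = elem("Table", row);
        System.out.println(collectTypes(table)); // [Table, TR, TD, TD]
    }
}
```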

See Also