Working with DOCX Reader

The DocumentReader class reads Office Open XML (.docx) files and produces an abstracted data structure — the Light Document Model (LDM). It handles paragraphs, tables, shapes, images, and formatting, making the content available for manipulation or conversion.

Reading a DOCX File

The simplest approach is to use the Document constructor, which automatically selects DocumentReader for .docx files:

import aspose.words_foss as aw

doc = aw.Document("input.docx")
print(f"Sections: {len(doc.sections)}")
for para in doc.all_paragraphs:
    print(para.text)

DocumentReader Architecture

DocumentReader uses two mixins to split responsibilities:

Component	Role
`DocumentReader`	Main DOCX parser — reads XML parts from the OPC package
`LdmBuilderMixin`	Constructs the Light Document Model from parsed data
`ShapeParserMixin`	Parses drawing and shape elements embedded in the document

All three work together internally — you interact with the result through the Document API.
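
The division of labor in the table can be pictured with a toy example of mixin composition. This is only an illustrative sketch: the `Toy`-prefixed classes and every method name in it (`build_paragraph`, `parse_shape`, `read`) are hypothetical and are not the library's actual internals.

```python
class ToyLdmBuilderMixin:
    """Hypothetical stand-in: turns parsed values into model objects."""

    def build_paragraph(self, text):
        return {"type": "paragraph", "text": text}


class ToyShapeParserMixin:
    """Hypothetical stand-in: turns drawing data into shape objects."""

    def parse_shape(self, shape_type):
        return {"type": "shape", "shape_type": shape_type}


class ToyDocumentReader(ToyLdmBuilderMixin, ToyShapeParserMixin):
    """Main parser: walks the input and delegates to the two mixins."""

    def read(self, items):
        model = []
        for kind, value in items:
            if kind == "p":
                model.append(self.build_paragraph(value))
            elif kind == "shape":
                model.append(self.parse_shape(value))
        return model


reader = ToyDocumentReader()
model = reader.read([("p", "Hello"), ("shape", "rectangle")])
```

The reader class owns the traversal, while each mixin contributes one family of methods, so the parsing and model-building code can live in separate modules.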

Extracting Text

Use get_text() on a loaded document to extract plain text without saving to a file:

import aspose.words_foss as aw

doc = aw.Document("contract.docx")
text = doc.get_text()
print(text[:500])

Working with Paragraphs and Formatting

Access paragraph-level properties through the LDM:

import aspose.words_foss as aw

doc = aw.Document("styled.docx")
for para in doc.all_paragraphs:
    for run in para.runs:
        if run.font.bold:
            print(f"Bold: {run.text}")

Each Run carries font properties (bold, italic, size, color) that were parsed from the DOCX XML.

Handling Shapes and Images

The ShapeParserMixin extracts drawing elements during parsing. Shapes appear in the LDM as ShapeNode objects within paragraph content:

import aspose.words_foss as aw

doc = aw.Document("with-images.docx")
for para in doc.all_paragraphs:
    for node in para.content_sequence:
        if hasattr(node, 'shape_type'):
            print(f"Shape: {node.shape_type}")
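
For a quick overview of what a document contains, the same traversal can feed a tally instead of printing each shape. The helper below is a sketch: `count_shape_types` is a hypothetical name, and it relies only on the `all_paragraphs`, `content_sequence`, and `shape_type` attributes used in the snippet above. The stand-in objects exist so the sketch runs without a real file.

```python
from collections import Counter
from types import SimpleNamespace


def count_shape_types(doc):
    """Tally shape_type values across all paragraph content."""
    counts = Counter()
    for para in doc.all_paragraphs:
        for node in para.content_sequence:
            # Skip nodes that have no shape_type attribute (or have it set to None).
            shape_type = getattr(node, "shape_type", None)
            if shape_type is not None:
                counts[str(shape_type)] += 1
    return counts


# Stand-in objects that mimic the attributes used above.
fake_doc = SimpleNamespace(all_paragraphs=[
    SimpleNamespace(content_sequence=[
        SimpleNamespace(text="plain run"),
        SimpleNamespace(shape_type="rectangle"),
    ]),
    SimpleNamespace(content_sequence=[
        SimpleNamespace(shape_type="image"),
        SimpleNamespace(shape_type="rectangle"),
    ]),
])
shape_counts = count_shape_types(fake_doc)
```

With a real file, the call would be `count_shape_types(aw.Document("with-images.docx"))`.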

Tips and Best Practices

Use Document("file.docx") for standard workflows — it handles reader selection, package validation, and LDM construction automatically.
For stream-based input (web uploads), pass a file-like object to the Document constructor.
For files with tables, access them via doc.tables or doc.all_tables.
For batch processing, reuse the same Python process to avoid repeated import overhead.
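
For the stream-based tip, uploaded bytes can be wrapped in an in-memory stream before they reach the constructor. The sketch below uses only the standard library. `load_from_bytes` is a hypothetical helper, and the document factory is passed in as a parameter so the helper makes no assumption about the constructor beyond what the tip states (that it accepts a file-like object).

```python
import io
import zipfile


def load_from_bytes(data, document_factory):
    """Wrap raw bytes in a seekable stream and hand it to the factory."""
    stream = io.BytesIO(data)
    # Reject obviously wrong uploads early: a .docx is a ZIP-based package.
    if not zipfile.is_zipfile(stream):
        raise ValueError("upload is not a ZIP-based package")
    stream.seek(0)  # is_zipfile moves the read position; rewind before loading
    return document_factory(stream)
```

A typical call would look like `doc = load_from_bytes(upload_bytes, aw.Document)`, where `upload_bytes` is whatever your web framework provides for the uploaded file.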

Common Issues

Issue	Cause	Fix
File not loading (exception)	Corrupt or incomplete ZIP archive	Verify the file is a valid `.docx` (ZIP-based OPC package)
Missing images in output	Image parts not embedded in the package	Check that the DOCX contains embedded (not linked) images
Missing content in output	Required XML parts absent from package	The file may be truncated; re-export from the source application
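
The checks in the table can be run before loading a file, using only the Python standard library. The following is a diagnostic sketch, not part of the library: `diagnose_docx` is a hypothetical helper, and it assumes the conventional part names `word/document.xml` and `word/_rels/document.xml.rels`, which some producers may not use. Linked images are detected through relationships marked `TargetMode="External"`.

```python
import zipfile
import xml.etree.ElementTree as ET

REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
IMAGE_REL = (
    "http://schemas.openxmlformats.org/officeDocument/2006/relationships/image"
)
# Assumption: conventional part names written by Word.
REQUIRED_PARTS = ("[Content_Types].xml", "word/document.xml")
RELS_PART = "word/_rels/document.xml.rels"


def diagnose_docx(source):
    """Return a list of problems found in a .docx package.

    `source` may be a path or a seekable binary file-like object.
    """
    if not zipfile.is_zipfile(source):
        return ["not a ZIP archive (corrupt or incomplete file)"]
    if hasattr(source, "seek"):
        source.seek(0)  # rewind streams after the ZIP check
    problems = []
    try:
        with zipfile.ZipFile(source) as package:
            names = set(package.namelist())
            for part in REQUIRED_PARTS:
                if part not in names:
                    problems.append(f"missing part: {part}")
            if RELS_PART in names:
                root = ET.fromstring(package.read(RELS_PART))
                for rel in root.iter(f"{{{REL_NS}}}Relationship"):
                    is_image = rel.get("Type") == IMAGE_REL
                    if is_image and rel.get("TargetMode") == "External":
                        target = rel.get("Target")
                        problems.append(f"linked (not embedded) image: {target}")
    except zipfile.BadZipFile as exc:
        problems.append(f"damaged ZIP archive: {exc}")
    return problems
```

An empty list means none of the three conditions was detected; it does not guarantee that the document will load.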

FAQ

What is the Light Document Model (LDM)?

The LDM is the internal representation shared across all readers. It abstracts away format-specific details (OPC XML, OLE2 binary) into a common structure of sections, paragraphs, runs, tables, and shapes.

Can DocumentReader handle DOCM (macro-enabled) files?

Macro-enabled .docm files use the same OPC structure. The reader can parse the document content, but macros are not preserved or executed.
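
Because macros are dropped, it can be useful to know in advance whether a file carries any. The sketch below inspects the package with the standard library only. `has_vba_project` is a hypothetical helper, and it assumes the VBA project is stored at the conventional location `word/vbaProject.bin`.

```python
import zipfile

# Assumption: conventional part name Word uses for the VBA project.
VBA_PART = "word/vbaProject.bin"


def has_vba_project(source):
    """Return True if the package contains a VBA project part.

    `source` may be a path or a seekable binary file-like object.
    """
    with zipfile.ZipFile(source) as package:
        return VBA_PART in package.namelist()
```

A batch job could call `has_vba_project(path)` and log a warning for each file whose macros will not survive processing.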

How does the reader handle large documents?

The reader processes the XML parts sequentially. Memory usage scales with document size. For very large documents, consider processing sections individually.

API Reference Summary

Class / Method	Description
`DocumentReader`	Reads DOCX documents and produces abstracted data structures
`LdmBuilderMixin`	Mixin providing Light Document Model construction methods
`ShapeParserMixin`	Mixin providing drawing/shape parsing methods
`Document`	Represents a Word document (the LDM root)
`create_reader(path)`	Factory that returns the correct reader for a file extension
