Working with DOCX Reader
The DocumentReader class reads Office Open XML (.docx) files and produces an abstracted data structure — the Light Document Model (LDM). It handles paragraphs, tables, shapes, images, and formatting, making the content available for manipulation or conversion.
Reading a DOCX File
The simplest approach is using the Document constructor, which automatically selects DocumentReader for .docx files:
import aspose.words_foss as aw
doc = aw.Document("input.docx")
print(f"Sections: {len(doc.sections)}")
for para in doc.all_paragraphs:
    print(para.text)
DocumentReader Architecture
DocumentReader uses two mixins to split responsibilities:
| Component | Role |
|---|---|
| DocumentReader | Main DOCX parser that reads XML parts from the OPC package |
| LdmBuilderMixin | Constructs the Light Document Model from parsed data |
| ShapeParserMixin | Parses drawing and shape elements embedded in the document |
All three work together internally — you interact with the result through the Document API.
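The division of labor can be pictured with a toy example. The sketch below is not the library's source: every name prefixed with `Sketch` is hypothetical, and the input is a plain list of dicts rather than real XML. It only illustrates how one reader class can inherit shape parsing and model building from two separate mixins.

```python
from dataclasses import dataclass, field


@dataclass
class SketchParagraph:
    """Hypothetical stand-in for a paragraph node in the document model."""
    text: str
    shapes: list = field(default_factory=list)


class SketchShapeParserMixin:
    """Hypothetical mixin: turns raw shape descriptions into shape names."""

    def parse_shapes(self, raw_shapes):
        return [shape["type"] for shape in raw_shapes]


class SketchLdmBuilderMixin:
    """Hypothetical mixin: assembles model objects from parsed pieces."""

    def build_paragraph(self, text, shapes):
        return SketchParagraph(text=text, shapes=shapes)


class SketchReader(SketchLdmBuilderMixin, SketchShapeParserMixin):
    """Hypothetical reader: drives the parse and delegates to both mixins."""

    def read(self, raw_paragraphs):
        paragraphs = []
        for raw in raw_paragraphs:
            shapes = self.parse_shapes(raw.get("shapes", []))
            paragraphs.append(self.build_paragraph(raw["text"], shapes))
        return paragraphs


# Toy input standing in for already-parsed XML content.
raw_input = [
    {"text": "Intro"},
    {"text": "Figure", "shapes": [{"type": "image"}]},
]
sketch_result = SketchReader().read(raw_input)
print([(p.text, p.shapes) for p in sketch_result])
```

Keeping each responsibility in its own mixin means the reader class itself only has to coordinate the steps.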
Extracting Text
Use get_text() on a loaded document to extract plain text without saving to a file:
import aspose.words_foss as aw
doc = aw.Document("contract.docx")
text = doc.get_text()
print(text[:500])
Working with Paragraphs and Formatting
Access paragraph-level properties through the LDM:
import aspose.words_foss as aw
doc = aw.Document("styled.docx")
for para in doc.all_paragraphs:
    for run in para.runs:
        if run.font.bold:
            print(f"Bold: {run.text}")
Each Run carries font properties (bold, italic, size, color) that were parsed from the DOCX XML.
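To see where a property like bold comes from, it can help to look at the underlying WordprocessingML. The following is a minimal stdlib-only sketch, not the library's parser: the helper names `run_is_bold` and `bold_run_texts` and the sample XML are illustrative assumptions. It only inspects direct run formatting (`w:rPr/w:b`) and ignores anything inherited from styles.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}

# Hand-written sample paragraph with three runs (illustrative only).
SAMPLE_PARAGRAPH = f"""
<w:p xmlns:w="{W_NS}">
  <w:r><w:rPr><w:b/></w:rPr><w:t>Important:</w:t></w:r>
  <w:r><w:t xml:space="preserve"> read this first.</w:t></w:r>
  <w:r><w:rPr><w:b w:val="0"/></w:rPr><w:t>Not bold.</w:t></w:r>
</w:p>
"""


def run_is_bold(run):
    """Return True if the run has direct bold formatting (styles are ignored)."""
    bold = run.find("w:rPr/w:b", NS)
    if bold is None:
        return False
    # A bare <w:b/> means "on"; an explicit w:val can switch it off.
    value = bold.get(f"{{{W_NS}}}val", "true")
    return value not in ("0", "false", "off")


def bold_run_texts(paragraph_xml):
    """Collect the text of every directly bold run in one paragraph."""
    paragraph = ET.fromstring(paragraph_xml)
    texts = []
    for run in paragraph.findall("w:r", NS):
        if run_is_bold(run):
            texts.append("".join(t.text or "" for t in run.findall("w:t", NS)))
    return texts


print(bold_run_texts(SAMPLE_PARAGRAPH))
```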
Handling Shapes and Images
The ShapeParserMixin extracts drawing elements during parsing. Shapes appear in the LDM as ShapeNode objects within paragraph content:
import aspose.words_foss as aw
doc = aw.Document("with-images.docx")
for para in doc.all_paragraphs:
    for node in para.content_sequence:
        if hasattr(node, 'shape_type'):
            print(f"Shape: {node.shape_type}")
Tips and Best Practices
- Use Document("file.docx") for standard workflows; it handles reader selection, package validation, and LDM construction automatically.
- For stream-based input (web uploads), pass a file-like object to the Document constructor.
- For files with tables, access them via doc.tables or doc.all_tables.
- For batch processing, reuse the same Python process to avoid repeated import overhead.
Common Issues
| Issue | Cause | Fix |
|---|---|---|
| File not loading (exception) | Corrupt or incomplete ZIP archive | Verify the file is a valid .docx (ZIP-based OPC package) |
| Missing images in output | Image parts not embedded in the package | Check that the DOCX contains embedded (not linked) images |
| Missing content in output | Required XML parts absent from package | The file may be truncated; re-export from the source application |
FAQ
What is the Light Document Model (LDM)?
The LDM is the internal representation shared across all readers. It abstracts away format-specific details (OPC XML, OLE2 binary) into a common structure of sections, paragraphs, runs, tables, and shapes.
Can DocumentReader handle DOCM (macro-enabled) files?
Macro-enabled .docm files use the same OPC structure. The reader can parse the document content, but macros are not preserved or executed.
How does the reader handle large documents?
The reader processes the XML parts sequentially. Memory usage scales with document size. For very large documents, consider processing sections individually.
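When only the text of a very large file is needed, one alternative is to stream the main document part yourself instead of building a full model. The following sketch uses the standard library's `iterparse`, not the library's reader: `iter_paragraph_text` is a hypothetical helper, it assumes the main part lives at the conventional `word/document.xml`, and it ignores formatting, tables structure, and shapes.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
P_TAG = f"{{{W_NS}}}p"
T_TAG = f"{{{W_NS}}}t"


def iter_paragraph_text(source):
    """Yield the text of each <w:p> in word/document.xml, one at a time."""
    with zipfile.ZipFile(source) as package:
        with package.open("word/document.xml") as part:
            for _event, element in ET.iterparse(part, events=("end",)):
                if element.tag == P_TAG:
                    yield "".join(t.text or "" for t in element.iter(T_TAG))
                    # Drop the paragraph's children to limit memory growth.
                    element.clear()


# Build a tiny in-memory package to demonstrate the generator.
SAMPLE_DOCUMENT = f"""<w:document xmlns:w="{W_NS}"><w:body>
  <w:p><w:r><w:t>First</w:t></w:r></w:p>
  <w:p><w:r><w:t>Sec</w:t></w:r><w:r><w:t>ond</w:t></w:r></w:p>
</w:body></w:document>"""

sample_package = io.BytesIO()
with zipfile.ZipFile(sample_package, "w") as archive:
    archive.writestr("word/document.xml", SAMPLE_DOCUMENT)
sample_package.seek(0)

for text in iter_paragraph_text(sample_package):
    print(text)
```

Because the function is a generator, callers can stop early or process paragraphs in batches without holding the whole document in memory.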
API Reference Summary
| Class / Method | Description |
|---|---|
| DocumentReader | Reads DOCX documents and produces abstracted data structures |
| LdmBuilderMixin | Mixin providing Light Document Model construction methods |
| ShapeParserMixin | Mixin providing drawing/shape parsing methods |
| Document | Represents a Word document (the LDM root) |
| create_reader(path) | Factory that returns the correct reader for a file extension |