Working with DOCX Reader

The DocumentReader class reads Office Open XML (.docx) files and produces an abstracted data structure, the Light Document Model (LDM). It handles paragraphs, tables, shapes, images, and formatting, and makes the content available for manipulation or conversion.


Reading a DOCX File

The simplest approach is using the Document constructor, which automatically selects DocumentReader for .docx files:

import aspose.words_foss as aw

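# Document picks the matching reader from the file extension (.docx here)
doc = aw.Document("input.docx")
print(f"Sections: {len(doc.sections)}")
# Print the text of each paragraph
for para in doc.all_paragraphs:
    print(para.text)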

DocumentReader Architecture

DocumentReader uses two mixins to split responsibilities:

Component | Role
DocumentReader | Main DOCX parser; reads XML parts from the OPC package
LdmBuilderMixin | Constructs the Light Document Model from parsed data
ShapeParserMixin | Parses drawing and shape elements embedded in the document

All three work together internally — you interact with the result through the Document API.
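
The split of responsibilities described above can be illustrated with plain Python mixins. This is a minimal sketch of the pattern only: the class names ReaderSketch, LdmBuilderSketch, and ShapeParserSketch, their methods, and the dictionary-based model are all hypothetical and are not the library's internals.

```python
class LdmBuilderSketch:
    """Hypothetical mixin: turns parsed paragraph strings into a simple model."""

    def build_model(self, paragraphs):
        return {"paragraphs": list(paragraphs)}


class ShapeParserSketch:
    """Hypothetical mixin: turns raw shape names into shape descriptors."""

    def parse_shapes(self, raw_shapes):
        return [{"shape_type": name} for name in raw_shapes]


class ReaderSketch(LdmBuilderSketch, ShapeParserSketch):
    """Hypothetical reader: drives parsing and delegates to both mixins."""

    def read(self, parsed):
        model = self.build_model(parsed["paragraphs"])
        model["shapes"] = self.parse_shapes(parsed["shapes"])
        return model


# Stand-in for data that a real reader would extract from the package
result = ReaderSketch().read({"paragraphs": ["Hello"], "shapes": ["rectangle"]})
print(result)
```

Each mixin can be tested on its own, while the reader class stays focused on orchestrating the parse.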


Extracting Text

Use get_text() on a loaded document to extract plain text without saving to a file:

import aspose.words_foss as aw

doc = aw.Document("contract.docx")
# The text is returned as a string; nothing is written to disk
text = doc.get_text()
# Preview the first 500 characters
print(text[:500])

Working with Paragraphs and Formatting

Access run-level formatting through the paragraphs of the LDM:

import aspose.words_foss as aw

doc = aw.Document("styled.docx")
for para in doc.all_paragraphs:
    for run in para.runs:
        # run.font holds the formatting parsed for this run
        if run.font.bold:
            print(f"Bold: {run.text}")

Each Run carries font properties (bold, italic, size, color) that were parsed from the DOCX XML.
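
To make the "parsed from the DOCX XML" part concrete, the following sketch shows how direct bold formatting is encoded in WordprocessingML and how it could be detected with the standard library alone. It does not use the library's API; the helper names is_bold and bold_texts and the SAMPLE fragment are illustrative assumptions, and the check ignores bold inherited from styles.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Hand-written fragment in the shape of word/document.xml (illustrative only)
SAMPLE = f"""<w:document xmlns:w="{W}"><w:body><w:p>
<w:r><w:rPr><w:b/></w:rPr><w:t>Important</w:t></w:r>
<w:r><w:t> plain text</w:t></w:r>
<w:r><w:rPr><w:b w:val="0"/></w:rPr><w:t> not bold</w:t></w:r>
</w:p></w:body></w:document>"""


def is_bold(run):
    """True if the run has a direct <w:b> toggle that is not switched off."""
    b = run.find("w:rPr/w:b", NS)
    if b is None:
        return False
    # A missing w:val attribute means the toggle is on
    val = b.get(f"{{{W}}}val", "true")
    return val not in ("0", "false", "off")


def bold_texts(xml_text):
    """Return the text of every run with direct bold formatting."""
    root = ET.fromstring(xml_text)
    return [
        "".join(t.text or "" for t in run.findall("w:t", NS))
        for run in root.iter(f"{{{W}}}r")
        if is_bold(run)
    ]


print(bold_texts(SAMPLE))
```

A full reader also has to resolve formatting inherited from paragraph and character styles, which this sketch deliberately leaves out.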


Handling Shapes and Images

The ShapeParserMixin extracts drawing elements during parsing. Shapes appear in the LDM as ShapeNode objects within paragraph content:

import aspose.words_foss as aw

doc = aw.Document("with-images.docx")
for para in doc.all_paragraphs:
    for node in para.content_sequence:
        # Shape nodes are recognised here by their shape_type attribute
        if hasattr(node, 'shape_type'):
            print(f"Shape: {node.shape_type}")

Tips and Best Practices

  • Use Document("file.docx") for standard workflows — it handles reader selection, package validation, and LDM construction automatically.
  • For stream-based input (web uploads), pass a file-like object to the Document constructor.
  • For files with tables, access them via doc.tables or doc.all_tables.
  • For batch processing, reuse the same Python process to avoid repeated import overhead.

Common Issues

Issue | Cause | Fix
File not loading (exception) | Corrupt or incomplete ZIP archive | Verify the file is a valid .docx (ZIP-based OPC package)
Missing images in output | Image parts not embedded in the package | Check that the DOCX contains embedded (not linked) images
Missing content in output | Required XML parts absent from package | The file may be truncated; re-export from the source application
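
The checks in the table can be run before handing a file to the reader. The sketch below uses only the Python standard library; the function name inspect_docx, the report keys, and the choice of required parts are assumptions for illustration, based on the part names Word conventionally writes ([Content_Types].xml, word/document.xml, and images under word/media/).

```python
import io
import zipfile

# Conventional part names; a package may locate its main part elsewhere
REQUIRED_PARTS = ("[Content_Types].xml", "word/document.xml")


def inspect_docx(source):
    """Return a small report for a path or a binary file-like object."""
    report = {
        "is_zip": False,
        "missing_parts": list(REQUIRED_PARTS),
        "embedded_images": [],
    }
    if not zipfile.is_zipfile(source):
        return report
    report["is_zip"] = True
    with zipfile.ZipFile(source) as package:
        names = package.namelist()
    report["missing_parts"] = [p for p in REQUIRED_PARTS if p not in names]
    report["embedded_images"] = [n for n in names if n.startswith("word/media/")]
    return report


# Build a tiny stand-in package in memory instead of reading a real file
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as z:
    z.writestr("[Content_Types].xml", "<Types/>")
    z.writestr("word/document.xml", "<document/>")
    z.writestr("word/media/image1.png", b"\x89PNG")

print(inspect_docx(buffer))
print(inspect_docx(io.BytesIO(b"not a zip")))
```

Because the function accepts file-like objects as well as paths, the same check works for uploaded data held in memory.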

FAQ

What is the Light Document Model (LDM)?

The LDM is the internal representation shared across all readers. It abstracts away format-specific details (OPC XML, OLE2 binary) into a common structure of sections, paragraphs, runs, tables, and shapes.

Can DocumentReader handle DOCM (macro-enabled) files?

Macro-enabled .docm files use the same OPC structure. The reader can parse the document content, but macros are not preserved or executed.
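
If a workflow needs to know up front whether macros will be dropped, the package can be inspected before loading. This is a minimal standard-library sketch, assuming the macro project is stored in a part named vbaProject.bin (conventionally word/vbaProject.bin); the helper name has_vba_project is hypothetical.

```python
import io
import zipfile


def has_vba_project(source):
    """True if the package (path or file-like object) contains a VBA project part."""
    with zipfile.ZipFile(source) as package:
        return any(
            name.lower().endswith("vbaproject.bin") for name in package.namelist()
        )


def make_package(part_names):
    """Build an in-memory ZIP with empty parts, standing in for a real file."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as z:
        for name in part_names:
            z.writestr(name, b"")
    return buffer


docx_like = make_package(["word/document.xml"])
docm_like = make_package(["word/document.xml", "word/vbaProject.bin"])
print(has_vba_project(docx_like), has_vba_project(docm_like))
```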

How does the reader handle large documents?

The reader processes the XML parts sequentially. Memory usage scales with document size. For very large documents, consider processing sections individually.
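
When only the plain text of a very large file is needed, one alternative is to stream the main document part instead of building a full model. The sketch below is a standard-library technique, not the library's API: it assumes the main part is at word/document.xml, the generator name iter_paragraph_text is hypothetical, and text inside nested paragraphs (for example in text boxes) is yielded separately from the enclosing paragraph.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def iter_paragraph_text(source):
    """Yield the text of each paragraph as soon as its closing tag is parsed."""
    with zipfile.ZipFile(source) as package:
        with package.open("word/document.xml") as part:
            for _event, elem in ET.iterparse(part, events=("end",)):
                if elem.tag == W + "p":
                    yield "".join(t.text or "" for t in elem.iter(W + "t"))
                    # Drop the paragraph's children so they can be freed
                    elem.clear()


# Tiny in-memory package standing in for a large document
doc_xml = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    "<w:body>"
    "<w:p><w:r><w:t>First</w:t></w:r></w:p>"
    "<w:p><w:r><w:t>Second </w:t></w:r><w:r><w:t>paragraph</w:t></w:r></w:p>"
    "</w:body></w:document>"
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as z:
    z.writestr("word/document.xml", doc_xml)

for text in iter_paragraph_text(buffer):
    print(text)
```

This trades access to formatting, tables, and shapes for a smaller memory footprint, so it suits text extraction rather than conversion.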


API Reference Summary

Class / Method | Description
DocumentReader | Reads DOCX documents and produces abstracted data structures
LdmBuilderMixin | Mixin providing Light Document Model construction methods
ShapeParserMixin | Mixin providing drawing/shape parsing methods
Document | Represents a Word document (the LDM root)
create_reader(path) | Factory that returns the correct reader for a file extension