Building a Light Document Model with DocumentReader

DocumentReader is the entry point for reading DOCX files in Aspose.Words FOSS. It accepts input as a file path, an I/O stream, or raw bytes, then exposes the document as a light_document_model.Document (LDM) — a structured Python object tree that is easy to inspect and transform.

Prerequisites

Requirement	Detail
Python	3.9 or later
Package	`aspose-words-foss` (MIT-licensed)
Input	A `.docx` file, stream, or `bytes`

pip install aspose-words-foss

Loading from a File Path

DocumentReader.load_file(filepath) reads a DOCX file by path. Call to_light_document() afterwards to obtain the LDM representation.

from aspose.words_foss import DocumentReader

reader = DocumentReader()
reader.load_file("data/report.docx")
doc = reader.to_light_document()

print("Page count:", doc.page_count)
print("Sections:", len(doc.sections))

Loading from a Stream

DocumentReader.load_stream(stream) accepts any file-like object with a read() method. This is useful for data coming from HTTP responses, ZIP archives, or in-memory buffers.

from aspose.words_foss import DocumentReader

reader = DocumentReader()
with open("data/report.docx", "rb") as fh:
    reader.load_stream(fh)

doc = reader.to_light_document()
print("Paragraphs:", len(doc.all_paragraphs))
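
For the ZIP-archive case mentioned above, one possible approach is to copy the archive member into an in-memory buffer and hand that buffer to load_stream(). The sketch below uses only the standard library; the helper name open_zip_member and the archive and member names in the usage comment are hypothetical, not part of the library.

```python
import io
import zipfile


def open_zip_member(archive, member):
    """Return a seekable in-memory binary stream holding one archive member.

    `archive` may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(archive) as zf:
        # Read the whole member so the stream stays valid after the
        # archive handle is closed.
        return io.BytesIO(zf.read(member))


# Hypothetical usage with the reader from this guide:
#   stream = open_zip_member("bundle.zip", "reports/report.docx")
#   reader.load_stream(stream)
```

Copying into a BytesIO costs memory proportional to the member size, but it gives the reader a stream that is seekable and independent of the archive.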

Loading from Bytes

DocumentReader.load_bytes(data) accepts a bytes or bytearray object. Use this when the DOCX content is already in memory (e.g. fetched from a database or constructed programmatically).

from aspose.words_foss import DocumentReader

with open("data/report.docx", "rb") as fh:
    data = fh.read()

reader = DocumentReader()
reader.load_bytes(data)
doc = reader.to_light_document()
print("Text preview:", doc.text[:200])
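
When calling code receives input in more than one form, the three load methods can sit behind a single helper. This is a minimal sketch, assuming only the load_file(), load_stream(), and load_bytes() methods described above; load_any is a hypothetical name, and the reader is passed in so the helper does not depend on how it was constructed.

```python
import os


def load_any(reader, source):
    """Call the load method on `reader` that matches the type of `source`."""
    if isinstance(source, (bytes, bytearray)):
        reader.load_bytes(source)
    elif isinstance(source, (str, os.PathLike)):
        # Normalise path objects to a plain string before loading.
        reader.load_file(os.fspath(source))
    elif hasattr(source, "read"):
        # Any file-like object with a read() method counts as a stream.
        reader.load_stream(source)
    else:
        raise TypeError(f"Unsupported source type: {type(source).__name__}")
    return reader


# Hypothetical usage:
#   doc = load_any(DocumentReader(), "data/report.docx").to_light_document()
```

The bytes check comes first because it is the most specific; the stream check is last because it relies on duck typing.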

Navigating the LDM Structure

After calling to_light_document(), the returned Document object exposes the full document tree:

from aspose.words_foss import DocumentReader

reader = DocumentReader()
reader.load_file("data/report.docx")
doc = reader.to_light_document()

# Sections
for section in doc.sections:
    print("Section paragraphs:", len(section.paragraphs))

# All paragraphs (flat list across all sections)
for para in doc.all_paragraphs:
    if para.text.strip():
        print(" -", para.text[:80])

# Headings
for heading in doc.headings(max_level=2):
    print("H:", heading.text)
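
The same traversal can be gathered into one reusable summary. The sketch below relies only on the attributes used in the example above (sections, paragraphs, all_paragraphs, text, and headings(max_level=...)); the function name outline and the shape of the returned dictionary are illustrative choices, not library API.

```python
def outline(doc, max_level=2, preview=80):
    """Collect section sizes, heading text, and short paragraph previews."""
    return {
        # Number of paragraphs in each section, in document order.
        "section_sizes": [len(section.paragraphs) for section in doc.sections],
        # Heading text up to the requested level.
        "headings": [h.text for h in doc.headings(max_level=max_level)],
        # First `preview` characters of every non-blank paragraph.
        "previews": [
            p.text[:preview] for p in doc.all_paragraphs if p.text.strip()
        ],
    }


# Hypothetical usage:
#   summary = outline(reader.to_light_document())
```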

Reading Full Document Text

Document.text returns the complete plain-text content of the document, concatenating all paragraph text across all sections:

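from aspose.words_foss import DocumentReader

reader = DocumentReader()
reader.load_file("data/report.docx")
doc = reader.to_light_document()

# Print the first 500 characters of the plain-text content
print(doc.text[:500])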

Tips and Best Practices

Always call load_file(), load_stream(), or load_bytes() before to_light_document(). Converting an unloaded reader may raise an error (the Common Issues table lists AttributeError) or return an empty document.
Use load_bytes() when you control the document lifecycle in memory; it avoids opening file handles.
Use doc.all_paragraphs for a flat iteration over all paragraphs; use doc.sections when you need to work section-by-section.
DocumentReader inherits from LdmBuilderMixin, which supplies to_light_document() as a mixin method. Other reader classes in the library share the same interface.
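
The load-before-convert rule in the first tip can be enforced in calling code with a thin wrapper. This is a sketch only: GuardedReader is a hypothetical class, not part of the library, and it assumes nothing beyond the four reader methods named in this guide.

```python
class GuardedReader:
    """Wrap a reader and refuse conversion until a load method has run."""

    def __init__(self, reader):
        self._reader = reader
        self._loaded = False

    def load_file(self, filepath):
        self._reader.load_file(filepath)
        self._loaded = True

    def load_stream(self, stream):
        self._reader.load_stream(stream)
        self._loaded = True

    def load_bytes(self, data):
        self._reader.load_bytes(data)
        self._loaded = True

    def to_light_document(self):
        # Fail early with an explicit message instead of relying on
        # whatever the underlying reader does when nothing is loaded.
        if not self._loaded:
            raise RuntimeError(
                "Call load_file(), load_stream(), or load_bytes() first"
            )
        return self._reader.to_light_document()


# Hypothetical usage:
#   reader = GuardedReader(DocumentReader())
```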

Common Issues

Issue	Cause	Fix
Empty `doc.sections` after load	File is not a valid DOCX	Verify the file opens in Word; check it is not password-protected
`doc.all_paragraphs` is empty	Document has no body text (e.g. only headers/footers)	Check `doc.header_paragraphs` and `doc.footer_paragraphs` instead
`AttributeError` on `to_light_document()`	Load method not called before conversion	Call one of `load_file()`, `load_stream()`, or `load_bytes()` first
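
For the first issue in the table, a quick pre-flight check can rule out obviously invalid input before it reaches the reader. A DOCX file is a ZIP package, so the sketch below tests for a ZIP container holding the two parts a Word document conventionally includes. The helper name looks_like_docx is hypothetical, and the check is a heuristic rather than full validation. Password-protected documents are typically stored in a different container format and would normally fail it as well.

```python
import zipfile

# Part names conventionally present in a Word-produced DOCX package.
REQUIRED_PARTS = ("[Content_Types].xml", "word/document.xml")


def looks_like_docx(source):
    """Return True if `source` is a ZIP archive containing the core DOCX parts.

    `source` may be a path or a seekable binary file-like object.
    """
    if not zipfile.is_zipfile(source):
        return False
    with zipfile.ZipFile(source) as zf:
        names = set(zf.namelist())
    return all(part in names for part in REQUIRED_PARTS)


# Hypothetical usage:
#   if not looks_like_docx("data/report.docx"):
#       raise ValueError("Not a readable DOCX package")
```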

FAQ

What is the LDM?

The light document model (light_document_model.Document) is a Python-native object tree representing a Word document. It decouples document logic from the underlying DOCX package format (a ZIP archive of XML parts) and is designed for easy programmatic manipulation.

Can I modify the LDM and write it back to DOCX?

Yes. After modifying the doc object, pass it to LdmDocxWriter.write(doc, path) to produce an updated DOCX file. See the LdmDocxWriter guide for details.

What does `LdmBuilderMixin` add?

LdmBuilderMixin provides the to_light_document() method. DocumentReader inherits from it so you always access this method via the reader instance.

API Reference Summary

Class / Method	Description
`DocumentReader`	Reads DOCX input and builds the LDM representation
`DocumentReader.load_file(filepath)`	Loads a DOCX file by path
`DocumentReader.load_stream(stream)`	Loads from a file-like stream
`DocumentReader.load_bytes(data)`	Loads from a `bytes` or `bytearray` object
`DocumentReader.to_light_document()`	Returns the loaded document as an LDM `Document`
`LdmBuilderMixin`	Mixin that provides `to_light_document()`
`Document.sections`	List of sections in the document
`Document.all_paragraphs`	Flat list of all paragraphs
`Document.text`	Full plain-text content of the document
`Document.page_count`	Number of pages
`Document.headings(max_level)`	List of heading paragraphs up to `max_level`
