Building a Light Document Model with DocumentReader
DocumentReader is the entry point for reading DOCX files in Aspose.Words FOSS.
It accepts input as a file path, an I/O stream, or raw bytes, then exposes the
document as a light_document_model.Document (LDM) — a structured Python object
tree that is easy to inspect and transform.
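The three input kinds map onto the three load methods covered in the sections that follow. As a minimal sketch, a small dispatcher can pick the right one from the type of its argument. `load_any` is a hypothetical helper name, not part of the library, and the sketch assumes only the method names documented on this page.

```python
import os
from typing import Any


def load_any(reader: Any, source: Any) -> str:
    """Route `source` to the matching load method; return the method name used.

    `reader` is expected to be a DocumentReader-like object exposing
    load_file(), load_stream() and load_bytes() as described in this guide.
    `load_any` itself is a hypothetical helper, not a library function.
    """
    if isinstance(source, (bytes, bytearray)):
        reader.load_bytes(source)
        return "load_bytes"
    if isinstance(source, (str, os.PathLike)):
        # os.fspath() turns pathlib.Path objects into plain strings.
        reader.load_file(os.fspath(source))
        return "load_file"
    if hasattr(source, "read"):
        # Any file-like object with a read() method is treated as a stream.
        reader.load_stream(source)
        return "load_stream"
    raise TypeError(f"Unsupported source type: {type(source).__name__}")
```

With a real reader, the call would look like `load_any(DocumentReader(), "data/report.docx")`, followed by `to_light_document()` on the same reader.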
Prerequisites
| Requirement | Detail |
|---|---|
| Python | 3.9 or later |
| Package | aspose-words-foss (MIT-licensed) |
| Input | A .docx file, stream, or bytes |
pip install aspose-words-foss
Loading from a File Path
DocumentReader.load_file(filepath) reads a DOCX file by path. Call
to_light_document() afterwards to obtain the LDM representation.
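When a whole folder of documents needs the same treatment, it can help to collect the paths first. The sketch below uses only the standard library; `iter_docx_paths` is a hypothetical helper name. It skips the `~$` lock files that Word leaves next to a document that is open for editing, since those are not real DOCX files.

```python
from pathlib import Path
from typing import Iterator


def iter_docx_paths(root: str) -> Iterator[Path]:
    """Yield .docx files under `root` (recursively), skipping '~$' lock files.

    Note: whether the '*.docx' pattern matches '.DOCX' depends on the platform.
    """
    for path in sorted(Path(root).rglob("*.docx")):
        if path.is_file() and not path.name.startswith("~$"):
            yield path
```

Each yielded path can then be passed on as `reader.load_file(str(path))`. This guide does not say whether one reader instance can be reused across files, so creating a fresh `DocumentReader()` per file is the cautious choice.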
from aspose.words_foss import DocumentReader
reader = DocumentReader()
reader.load_file("data/report.docx")
doc = reader.to_light_document()
print("Page count:", doc.page_count)
print("Sections:", len(doc.sections))
Loading from a Stream
DocumentReader.load_stream(stream) accepts any file-like object with a read()
method. This is useful for data coming from HTTP responses, ZIP archives, or
in-memory buffers.
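For the ZIP-archive case, one possible approach is to copy the archive member into an `io.BytesIO` buffer and hand that buffer to the reader. This guide does not state whether `load_stream()` needs a seekable stream; buffering the member sidesteps the question. `docx_stream_from_zip` and the member name used below are hypothetical.

```python
import io
import zipfile


def docx_stream_from_zip(archive: bytes, member: str) -> io.BytesIO:
    """Return a seekable in-memory stream holding one member of a ZIP archive."""
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        # Copy the member into BytesIO so the stream remains usable
        # (and seekable) after the archive itself has been closed.
        return io.BytesIO(zf.read(member))
```

A call such as `reader.load_stream(docx_stream_from_zip(archive_bytes, "reports/q1.docx"))` would then feed the extracted document to the reader.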
from aspose.words_foss import DocumentReader
reader = DocumentReader()
with open("data/report.docx", "rb") as fh:
    reader.load_stream(fh)
doc = reader.to_light_document()
print("Paragraphs:", len(doc.all_paragraphs))
Loading from Bytes
DocumentReader.load_bytes(data) accepts a bytes or bytearray object. Use
this when the DOCX content is already in memory (e.g. fetched from a database
or constructed programmatically).
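As an illustration of the database case, the sketch below fetches a DOCX blob with the standard-library `sqlite3` module. The `documents(id, content)` table and the `fetch_docx_blob` helper are assumptions made for this example only.

```python
import sqlite3
from typing import Optional


def fetch_docx_blob(conn: sqlite3.Connection, doc_id: int) -> Optional[bytes]:
    """Return the stored DOCX bytes for `doc_id`, or None if nothing is stored.

    Assumes a hypothetical table:
        documents(id INTEGER PRIMARY KEY, content BLOB)
    """
    row = conn.execute(
        "SELECT content FROM documents WHERE id = ?", (doc_id,)
    ).fetchone()
    if row is None or row[0] is None:
        return None
    return bytes(row[0])
```

The returned value can be passed straight to `reader.load_bytes(...)`; checking for `None` first avoids handing the reader an empty input.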
from aspose.words_foss import DocumentReader
with open("data/report.docx", "rb") as fh:
    data = fh.read()
reader = DocumentReader()
reader.load_bytes(data)
doc = reader.to_light_document()
print("Text preview:", doc.text[:200])
Navigating the LDM Structure
After calling to_light_document(), the returned Document object exposes the
full document tree:
from aspose.words_foss import DocumentReader
reader = DocumentReader()
reader.load_file("data/report.docx")
doc = reader.to_light_document()
# Sections
for section in doc.sections:
    print("Section paragraphs:", len(section.paragraphs))
# All paragraphs (flat list across all sections)
for para in doc.all_paragraphs:
    if para.text.strip():
        print(" -", para.text[:80])
# Headings
for heading in doc.headings(max_level=2):
    print("H:", heading.text)
Reading Full Document Text
Document.text returns the complete plain-text content of the document,
concatenating all paragraph text across all sections:
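Because `Document.text` is an ordinary Python string, standard string tools apply to it. The helpers below are a hedged sketch: the names are hypothetical, and since this guide does not say how paragraphs are separated inside `text`, both helpers treat any run of whitespace as a single separator.

```python
def text_preview(text: str, limit: int = 80) -> str:
    """Collapse whitespace runs and truncate to at most `limit` characters."""
    flat = " ".join(text.split())
    if len(flat) <= limit:
        return flat
    # Reserve one character for the trailing ellipsis marker.
    return flat[: limit - 1].rstrip() + "\u2026"


def word_count(text: str) -> int:
    """Count whitespace-separated tokens."""
    return len(text.split())
```

For a loaded document, `text_preview(doc.text, 120)` gives a one-line summary suitable for logs, and `word_count(doc.text)` gives a rough size estimate.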
from aspose.words_foss import DocumentReader
reader = DocumentReader()
reader.load_file("data/report.docx")
doc = reader.to_light_document()
print(doc.text[:500])
Tips and Best Practices
- Always call load_file(), load_stream(), or load_bytes() before calling to_light_document(); calling to_light_document() on an unloaded reader may raise an error or return an empty document.
- Use load_bytes() when you control the document lifecycle in memory; it avoids opening file handles.
- Use doc.all_paragraphs for a flat iteration over all paragraphs; use doc.sections when you need to work section-by-section.
- DocumentReader implements LdmBuilderMixin, which means to_light_document() is available as a mixin method; the same interface is shared with other reader classes in the library.
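The tips above can be sketched as three small helpers. They rely only on the attributes and methods named in this guide and accept any object with that shape; the helper names themselves are hypothetical.

```python
from typing import Any, List


def load_and_convert(reader: Any, data: bytes) -> Any:
    """Load first, then convert, so the ordering rule cannot be skipped."""
    reader.load_bytes(data)
    return reader.to_light_document()


def paragraphs_per_section(doc: Any) -> List[int]:
    """Section-by-section view: the paragraph count of each section."""
    return [len(section.paragraphs) for section in doc.sections]


def non_empty_texts(doc: Any) -> List[str]:
    """Flat view: stripped text of every paragraph with visible content."""
    return [p.text.strip() for p in doc.all_paragraphs if p.text.strip()]
```

Wrapping the load and conversion calls in one function keeps the required order in a single place, which is easier to audit than repeating both calls at every call site.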
Common Issues
| Issue | Cause | Fix |
|---|---|---|
| Empty doc.sections after load | File is not a valid DOCX | Verify the file opens in Word; check it is not password-protected |
| doc.all_paragraphs is empty | Document has no body text (e.g. only headers/footers) | Check doc.header_paragraphs and doc.footer_paragraphs instead |
| AttributeError on to_light_document() | Load method not called before conversion | Call one of load_file(), load_stream(), or load_bytes() first |
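For the first issue in the table, a cheap structural pre-check can rule out obviously invalid input before it reaches the reader. A DOCX file is a ZIP container whose main part is conventionally stored at `word/document.xml`; password-protected Office documents are typically wrapped in a different, non-ZIP container, so they fail this check as well. `looks_like_docx` is a hypothetical helper; it is a heuristic and does not validate the XML inside.

```python
import io
import zipfile


def looks_like_docx(data: bytes) -> bool:
    """Heuristic check: a ZIP container that holds a word/document.xml part."""
    if not zipfile.is_zipfile(io.BytesIO(data)):
        return False
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            return "word/document.xml" in zf.namelist()
    except zipfile.BadZipFile:
        # The end-of-archive record was found but the archive is damaged.
        return False
```

Running this on the raw bytes before `load_bytes()` makes it possible to report "not a DOCX" explicitly instead of discovering an empty `doc.sections` later.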
FAQ
What is the LDM?
The light document model (light_document_model.Document) is an abstracted,
Python-native object tree representing a Word document. It separates document logic
from the DOCX binary format and is designed for easy programmatic manipulation.
Can I modify the LDM and write it back to DOCX?
Yes. After modifying the doc object, pass it to LdmDocxWriter.write(doc, path)
to produce an updated DOCX file. See the
LdmDocxWriter guide for details.
What does LdmBuilderMixin add?
LdmBuilderMixin provides the to_light_document() method. DocumentReader
inherits from it so you always access this method via the reader instance.
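The mixin pattern itself can be illustrated with a toy example. The classes below are hypothetical and are not the library's implementation; they only show how a method defined on a mixin becomes available on every class that inherits from it.

```python
from typing import Dict, List


class BuilderMixin:
    """Illustrative mixin: adds a conversion method to any class that
    provides a `_parts` attribute. Hypothetical, not the library's code."""

    _parts: List[str]

    def to_model(self) -> Dict[str, List[str]]:
        # The mixin only reads state that the host class is expected to set.
        return {"parts": list(self._parts)}


class TextReader(BuilderMixin):
    """Hypothetical reader that gains to_model() through inheritance."""

    def __init__(self) -> None:
        self._parts = []

    def load_text(self, text: str) -> None:
        self._parts = text.split()
```

In this toy, `TextReader().to_model()` is callable even though `TextReader` never defines it, which mirrors how `to_light_document()` is reached through a reader instance.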
API Reference Summary
| Class / Method | Description |
|---|---|
| DocumentReader | Reads DOCX input and builds the LDM representation |
| DocumentReader.load_file(filepath) | Loads a DOCX file by path |
| DocumentReader.load_stream(stream) | Loads from a file-like stream |
| DocumentReader.load_bytes(data) | Loads from a bytes object |
| DocumentReader.to_light_document() | Returns the loaded document as an LDM Document |
| LdmBuilderMixin | Mixin that provides to_light_document() |
| Document.sections | List of sections in the document |
| Document.all_paragraphs | Flat list of all paragraphs |
| Document.text | Full plain-text content of the document |
| Document.page_count | Number of pages |
| Document.headings(max_level) | List of heading paragraphs up to max_level |