Building a Light Document Model with DocumentReader

DocumentReader is the entry point for reading DOCX files in Aspose.Words FOSS. It accepts input as a file path, an I/O stream, or raw bytes, then exposes the document as a light_document_model.Document (LDM) — a structured Python object tree that is easy to inspect and transform.


Prerequisites

Requirement | Detail
Python | 3.9 or later
Package | aspose-words-foss (MIT-licensed)
Input | A .docx file, stream, or bytes

pip install aspose-words-foss

Loading from a File Path

DocumentReader.load_file(filepath) reads a DOCX file by path. Call to_light_document() afterwards to obtain the LDM representation.

from aspose.words_foss import DocumentReader

reader = DocumentReader()
reader.load_file("data/report.docx")
doc = reader.to_light_document()

print("Page count:", doc.page_count)
print("Sections:", len(doc.sections))

Loading from a Stream

DocumentReader.load_stream(stream) accepts any file-like object with a read() method. This is useful for data coming from HTTP responses, ZIP archives, or in-memory buffers.

from aspose.words_foss import DocumentReader

reader = DocumentReader()
with open("data/report.docx", "rb") as fh:
    reader.load_stream(fh)

doc = reader.to_light_document()
print("Paragraphs:", len(doc.all_paragraphs))
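As a rough sketch of the ZIP-archive case mentioned above, the helper below copies one archive member into an in-memory buffer and hands it to load_stream(). The function names docx_stream_from_zip and load_docx_from_zip are hypothetical, and copying into io.BytesIO is a cautious assumption made here so that the reader receives a seekable stream. The reader argument is expected to be a DocumentReader instance.

```python
import io
import zipfile


def docx_stream_from_zip(archive, member: str) -> io.BytesIO:
    """Return a seekable in-memory stream holding one member of a ZIP archive.

    `archive` may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(archive) as zf:
        return io.BytesIO(zf.read(member))


def load_docx_from_zip(reader, archive, member: str):
    """Load a DOCX stored inside a ZIP archive and return the LDM document.

    `reader` is assumed to expose load_stream() and to_light_document(),
    as DocumentReader does in this guide.
    """
    stream = docx_stream_from_zip(archive, member)
    reader.load_stream(stream)
    return reader.to_light_document()
```

With the package installed, a call might look like load_docx_from_zip(DocumentReader(), "bundle.zip", "reports/report.docx"), where both the archive and member names are placeholders.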

Loading from Bytes

DocumentReader.load_bytes(data) accepts a bytes or bytearray object. Use this when the DOCX content is already in memory (e.g. fetched from a database or constructed programmatically).

from aspose.words_foss import DocumentReader

with open("data/report.docx", "rb") as fh:
    data = fh.read()

reader = DocumentReader()
reader.load_bytes(data)
doc = reader.to_light_document()
print("Text preview:", doc.text[:200])
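The three loading methods can be wrapped in one small dispatcher, sketched below. The name load_light_document is hypothetical and not part of the library; the helper only relies on the load_file(), load_stream(), load_bytes(), and to_light_document() methods described in this guide. Converting path objects to strings with os.fspath() is a cautious assumption, since the guide only shows string paths.

```python
import os
from typing import Any, BinaryIO, Union

Source = Union[str, os.PathLike, bytes, bytearray, BinaryIO]


def load_light_document(reader: Any, source: Source) -> Any:
    """Pick the load_* method that matches `source`, then build the LDM.

    `reader` is expected to be a DocumentReader instance.
    """
    if isinstance(source, (bytes, bytearray)):
        reader.load_bytes(source)
    elif isinstance(source, (str, os.PathLike)):
        # Pass a plain string path, as in the examples above.
        reader.load_file(os.fspath(source))
    elif hasattr(source, "read"):
        reader.load_stream(source)
    else:
        raise TypeError(f"Unsupported source type: {type(source).__name__}")
    return reader.to_light_document()
```

Because the helper always loads before converting, it also avoids calling to_light_document() on an unloaded reader. A typical call would be load_light_document(DocumentReader(), "data/report.docx").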

Navigating the LDM Structure

After calling to_light_document(), the returned Document object exposes the full document tree:

from aspose.words_foss import DocumentReader

reader = DocumentReader()
reader.load_file("data/report.docx")
doc = reader.to_light_document()

# Sections
for section in doc.sections:
    print("Section paragraphs:", len(section.paragraphs))

# All paragraphs (flat list across all sections)
for para in doc.all_paragraphs:
    if para.text.strip():
        print(" -", para.text[:80])

# Headings
for heading in doc.headings(max_level=2):
    print("H:", heading.text)
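The traversal above can be folded into a single summary, sketched below. The function name summarize_structure is hypothetical; it uses only the attributes shown in this guide (doc.sections, section.paragraphs, doc.all_paragraphs, para.text, and doc.headings(max_level=...)) and assumes that text is a plain string.

```python
from typing import Any, Dict


def summarize_structure(doc: Any, max_level: int = 2, preview_chars: int = 80) -> Dict[str, Any]:
    """Collect basic structural facts about an LDM document into a dict."""
    # Keep only paragraphs that contain something other than whitespace.
    non_empty = [p.text.strip() for p in doc.all_paragraphs if p.text.strip()]
    return {
        "sections": len(doc.sections),
        "paragraphs_per_section": [len(s.paragraphs) for s in doc.sections],
        "non_empty_paragraphs": len(non_empty),
        "previews": [text[:preview_chars] for text in non_empty],
        "headings": [h.text for h in doc.headings(max_level=max_level)],
    }
```

Returning a plain dict keeps the result easy to print, log, or serialize as JSON.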

Reading Full Document Text

Document.text returns the complete plain-text content of the document, concatenating all paragraph text across all sections:

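from aspose.words_foss import DocumentReader

reader = DocumentReader()
reader.load_file("data/report.docx")
doc = reader.to_light_document()

# Print the first 500 characters of the document's plain text
print(doc.text[:500])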
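Since the examples treat Document.text as an ordinary string, standard string tools apply to it. The sketch below searches the text for a keyword and returns each hit with some surrounding context. The helper name find_in_text is hypothetical, and the code uses only the Python standard library.

```python
import re
from typing import List


def find_in_text(text: str, keyword: str, context: int = 30) -> List[str]:
    """Return every case-insensitive match of `keyword` with surrounding context."""
    snippets = []
    for match in re.finditer(re.escape(keyword), text, flags=re.IGNORECASE):
        start = max(match.start() - context, 0)
        end = min(match.end() + context, len(text))
        # Flatten line breaks so each snippet prints on one line.
        snippets.append(text[start:end].replace("\n", " "))
    return snippets
```

For a loaded document, find_in_text(doc.text, "revenue") would list each occurrence of the placeholder keyword "revenue" along with up to 30 characters on either side.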

Tips and Best Practices

  • Always call load_file(), load_stream(), or load_bytes() before to_light_document(). Converting an unloaded reader may raise an error (see the AttributeError entry under Common Issues) or return an empty document.
  • Use load_bytes() when you control the document lifecycle in memory; it avoids opening file handles.
  • Use doc.all_paragraphs for a flat iteration over all paragraphs; use doc.sections when you need to work section-by-section.
  • DocumentReader inherits from LdmBuilderMixin, which supplies to_light_document(). Other reader classes in the library share the same interface.

Common Issues

Issue | Cause | Fix
Empty doc.sections after load | File is not a valid DOCX | Verify the file opens in Word; check it is not password-protected
doc.all_paragraphs is empty | Document has no body text (e.g. only headers/footers) | Check doc.header_paragraphs and doc.footer_paragraphs instead
AttributeError on to_light_document() | Load method not called before conversion | Call one of load_file(), load_stream(), or load_bytes() first
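For the "not a valid DOCX" case, a cheap pre-check can run before the bytes reach the reader. The sketch below relies on two general facts about the format rather than on anything in this library: a DOCX file is a ZIP package, and its main part is conventionally stored at word/document.xml. Password-protected files are typically wrapped in a different container and fail the ZIP test. The function name looks_like_docx is hypothetical, and the check is a heuristic, not full validation.

```python
import io
import zipfile


def looks_like_docx(data: bytes) -> bool:
    """Heuristic: is `data` a ZIP package containing word/document.xml?"""
    buffer = io.BytesIO(data)
    if not zipfile.is_zipfile(buffer):
        return False
    with zipfile.ZipFile(buffer) as zf:
        return "word/document.xml" in zf.namelist()
```

Calling this before load_bytes() lets the caller report a clear error instead of discovering an empty doc.sections later.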

FAQ

What is the LDM?

The light document model (light_document_model.Document) is an abstracted, Python-native object tree representing a Word document. It separates document logic from the DOCX file format and is designed for easy programmatic manipulation.

Can I modify the LDM and write it back to DOCX?

Yes. After modifying the doc object, pass it to LdmDocxWriter.write(doc, path) to produce an updated DOCX file. See the LdmDocxWriter guide for details.

What does LdmBuilderMixin add?

LdmBuilderMixin provides the to_light_document() method. DocumentReader inherits from it so you always access this method via the reader instance.
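The mixin pattern itself can be illustrated with a toy example. None of the classes below exist in the library, and the code does not reflect how LdmBuilderMixin is implemented. It only shows, under those assumptions, how a single mixin can give several reader classes the same conversion method.

```python
class LightDocumentMixinSketch:
    """Hypothetical stand-in for a mixin such as LdmBuilderMixin."""

    def to_light_document(self):
        # The mixin assumes the host class stores loaded data in `_raw`.
        if getattr(self, "_raw", None) is None:
            raise RuntimeError("Call a load_* method before to_light_document()")
        return {"source": type(self).__name__, "size": len(self._raw)}


class BytesReaderSketch(LightDocumentMixinSketch):
    """Toy reader that loads raw bytes."""

    def __init__(self):
        self._raw = None

    def load_bytes(self, data):
        self._raw = bytes(data)


class TextReaderSketch(LightDocumentMixinSketch):
    """Toy reader that loads text; it inherits the same conversion method."""

    def __init__(self):
        self._raw = None

    def load_text(self, text):
        self._raw = text.encode("utf-8")
```

Both toy readers expose to_light_document() without defining it themselves, which mirrors the shared interface described above.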


API Reference Summary

Class / Method | Description
DocumentReader | Reads DOCX input and builds the LDM representation
DocumentReader.load_file(filepath) | Loads a DOCX file by path
DocumentReader.load_stream(stream) | Loads from a file-like stream
DocumentReader.load_bytes(data) | Loads from a bytes object
DocumentReader.to_light_document() | Returns the loaded document as an LDM Document
LdmBuilderMixin | Mixin that provides to_light_document()
Document.sections | List of sections in the document
Document.all_paragraphs | Flat list of all paragraphs
Document.text | Full plain-text content of the document
Document.page_count | Number of pages
Document.headings(max_level) | List of heading paragraphs up to max_level

See Also