Working with DOC Reader

Aspose.Words FOSS for Python includes a dedicated reader for legacy Word 97-2003 .doc files. The DocFileReader class parses the OLE2 binary format and produces a Light Document Model (LDM) that can be manipulated or converted to other formats.


Loading a DOC File

Pass the path of a .doc file to the Document constructor to load it from disk:

import aspose.words_foss as aw

doc = aw.Document("report.doc")
print(f"Paragraphs: {len(doc.all_paragraphs)}")

The Document constructor delegates to create_reader(), which selects DocFileReader automatically based on the .doc extension.


Using DocFileReaderCore Directly

For lower-level control, instantiate DocFileReaderCore and call its load methods:

from aspose.words_foss.doc_reader.doc_file_reader_core import DocFileReaderCore

reader = DocFileReaderCore()
reader.load_file("legacy.doc")

DocFileReaderCore provides three input methods:

Method                 Input type          Use case
load_file(filepath)    File path string    Read from disk
load_stream(stream)    File-like object    Read from upload or network stream
load_bytes(data)       bytes               Read from in-memory buffer
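
The three entry points map onto the kind of object you already have in hand. The sketch below shows one possible way to choose between them. Note that load_any and RecordingReader are hypothetical names introduced for illustration, not part of the library; the helper assumes only that the reader exposes the three methods listed above, and the stand-in reader lets the snippet run without a real .doc file.

```python
import io
import os


def load_any(reader, source):
    """Dispatch to the reader method that matches the type of `source`.

    Hypothetical helper: assumes `reader` exposes load_file(),
    load_stream() and load_bytes(). Returns the name of the method used.
    """
    if isinstance(source, (bytes, bytearray)):
        reader.load_bytes(bytes(source))
        return "load_bytes"
    if isinstance(source, (str, os.PathLike)):
        reader.load_file(os.fspath(source))
        return "load_file"
    if hasattr(source, "read"):
        reader.load_stream(source)
        return "load_stream"
    raise TypeError(f"unsupported source type: {type(source).__name__}")


class RecordingReader:
    """Stand-in reader that only records which method was called."""

    def __init__(self):
        self.calls = []

    def load_file(self, filepath):
        self.calls.append(("load_file", filepath))

    def load_stream(self, stream):
        self.calls.append(("load_stream", stream))

    def load_bytes(self, data):
        self.calls.append(("load_bytes", data))


reader = RecordingReader()
load_any(reader, "legacy.doc")
load_any(reader, io.BytesIO(b"raw bytes"))
load_any(reader, b"raw bytes")
print([name for name, _ in reader.calls])
```

In real code you would pass a DocFileReaderCore or DocFileReader instance in place of the stand-in.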

Building the Light Document Model

After loading, call to_light_document() on DocFileReader to produce an LDM Document:

from aspose.words_foss.doc_reader.doc_file_reader import DocFileReader

reader = DocFileReader()
reader.load_file("input.doc")         # parse the binary .doc file
ldm_doc = reader.to_light_document()  # build the LDM Document from the parsed data

The LDM is the internal representation shared across all readers (DocFileReader, DocumentReader, TextFileReader). Once you have an LDM document, you can save it to any supported output format.


The DocumentFormatReader Protocol

All readers implement the DocumentFormatReader protocol, which defines the common interface: load_file(), load_stream(), and load_bytes(). The create_reader() factory function returns the appropriate reader based on file extension:

from aspose.words_foss.reader_factory import create_reader

reader = create_reader("data.doc")  # returns DocFileReader
reader.load_file("data.doc")
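
To make the protocol-plus-factory pattern concrete, here is a minimal, self-contained sketch built on typing.Protocol. It is not the library's implementation: FormatReader, StubDocReader, READERS, and make_reader are hypothetical names, and the stub stores bytes instead of parsing them. It only illustrates how a shared interface and extension-based dispatch can fit together.

```python
from pathlib import Path
from typing import BinaryIO, Protocol, runtime_checkable


@runtime_checkable
class FormatReader(Protocol):
    """Illustrative protocol with the three load methods named above."""

    def load_file(self, filepath: str) -> None: ...
    def load_stream(self, stream: BinaryIO) -> None: ...
    def load_bytes(self, data: bytes) -> None: ...


class StubDocReader:
    """Hypothetical reader; it stores bytes instead of parsing them."""

    def __init__(self) -> None:
        self.data = b""

    def load_file(self, filepath: str) -> None:
        self.data = Path(filepath).read_bytes()

    def load_stream(self, stream: BinaryIO) -> None:
        self.data = stream.read()

    def load_bytes(self, data: bytes) -> None:
        self.data = data


# Hypothetical registry mapping lowercase extensions to reader classes.
READERS = {".doc": StubDocReader}


def make_reader(path: str) -> FormatReader:
    """Pick a reader class from the file extension (case-insensitive)."""
    suffix = Path(path).suffix.lower()
    try:
        return READERS[suffix]()
    except KeyError:
        raise ValueError(f"no reader registered for {suffix!r}") from None


reader = make_reader("DATA.DOC")
print(type(reader).__name__, isinstance(reader, FormatReader))
```

Because the protocol is structural, the stub satisfies it without inheriting from it; any class with the three methods would pass the isinstance check.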

Tips and Best Practices

  • Use the high-level Document("file.doc") constructor for most workflows — it handles reader selection automatically.
  • Use DocFileReaderCore only when you need direct access to the binary parsing layer.
  • For files received as bytes (e.g., from HTTP uploads), prefer load_bytes() to avoid writing temporary files.
  • The DOC reader matches the round-trip fidelity of the DOCX reader for text, paragraphs, and basic formatting.

Common Issues

Issue                                   Cause                                           Fix
load_file() raises FileNotFoundError    Incorrect file path                             Verify the path exists before calling
Garbled text after load                 File is not a valid OLE2 binary                 Confirm the file is genuinely .doc format, not a renamed .docx
Missing formatting in output            Some advanced DOC features not yet supported    Check limitations.md for unsupported features
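
One way to catch a renamed .docx before handing it to the reader is to inspect the file signature. The sketch below relies on two well-known facts: OLE2 compound files begin with the bytes D0 CF 11 E0 A1 B1 1A E1, and .docx files are ZIP archives that begin with PK\x03\x04. The function names sniff_word_format and sniff_file are hypothetical helpers, not library calls.

```python
OLE2_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")  # compound file header signature
ZIP_MAGIC = b"PK\x03\x04"  # ZIP local file header; .docx files are ZIP archives


def sniff_word_format(data: bytes) -> str:
    """Classify a buffer by its leading signature bytes."""
    if data.startswith(OLE2_MAGIC):
        return "ole2"
    if data.startswith(ZIP_MAGIC):
        return "zip"
    return "unknown"


def sniff_file(path: str) -> str:
    """Read only the first 8 bytes of `path` and classify them.

    open() raises FileNotFoundError for a missing path, which also
    surfaces the first issue in the table early.
    """
    with open(path, "rb") as handle:
        return sniff_word_format(handle.read(8))
```

A result of "ole2" is necessary but not sufficient: other legacy formats such as .xls and .ppt use the same container, so the check rules out a renamed .docx without proving that the file is a Word document.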

FAQ

What is the difference between DocFileReader and DocFileReaderCore?

DocFileReaderCore handles the low-level binary parsing of the OLE2 format. DocFileReader extends it by adding the to_light_document() method that produces a full LDM Document suitable for conversion.

Can I load a DOC file from a stream?

Yes. Pass a file-like object to DocFileReaderCore.load_stream(stream). This is useful for web applications where files are uploaded as streams.

Does the DOC reader handle password-protected files?

Password-protected .doc files are not currently supported. The reader will raise an exception if the file is encrypted.
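
Since the reader raises on encrypted input, callers that process untrusted files may want to turn that failure into a reportable result. The following is a rough sketch only: try_load and EncryptedStub are hypothetical, and because this page does not say which exception type is raised for encrypted files, the helper catches Exception broadly. Narrow the except clause once the actual type is known.

```python
def try_load(reader, filepath):
    """Attempt a load and report failure instead of propagating it.

    Returns (True, None) on success or (False, message) on failure.
    """
    try:
        reader.load_file(filepath)
    except FileNotFoundError:
        return False, f"file not found: {filepath}"
    except Exception as exc:  # assumption: encrypted input surfaces here
        return False, f"could not read {filepath}: {exc}"
    return True, None


class EncryptedStub:
    """Stand-in reader that always fails, simulating an encrypted file."""

    def load_file(self, filepath):
        raise ValueError("document is encrypted")


print(try_load(EncryptedStub(), "secret.doc"))
```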


API Reference Summary

Class / Method                           Description
DocFileReader                            Full DOC reader with LDM building capability
DocFileReader.to_light_document()        Converts parsed DOC data to an LDM Document
DocFileReaderCore                        Core reader for Word 97-2003 (.doc) binary files
DocFileReaderCore.load_file(filepath)    Load a DOC file from a path
DocFileReaderCore.load_stream(stream)    Load a DOC file from a stream
DocFileReaderCore.load_bytes(data)       Load a DOC file from a bytes buffer
DocumentFormatReader                     Protocol interface for all document readers
create_reader(path)                      Factory that returns the correct reader for a file extension