Working with DOC Reader
Aspose.Words FOSS for Python includes a dedicated reader for legacy Word 97-2003 .doc files. The DocFileReader class parses the OLE2 binary format and produces a Light Document Model (LDM) that can be manipulated or converted to other formats.
Loading a DOC File
Pass the path of a .doc file to the Document constructor to open it from disk; DocFileReader does the parsing behind the scenes:
import aspose.words_foss as aw
doc = aw.Document("report.doc")
print(f"Paragraphs: {len(doc.all_paragraphs)}")
The Document constructor delegates to create_reader(), which selects DocFileReader automatically based on the .doc extension.
Using DocFileReaderCore Directly
For lower-level control, instantiate DocFileReaderCore and call its load methods:
from aspose.words_foss.doc_reader.doc_file_reader_core import DocFileReaderCore
reader = DocFileReaderCore()
reader.load_file("legacy.doc")
DocFileReaderCore provides three input methods:
| Method | Input type | Use case |
|---|---|---|
| load_file(filepath) | File path string | Read from disk |
| load_stream(stream) | File-like object | Read from upload or network stream |
| load_bytes(data) | bytes | Read from in-memory buffer |
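Three entry points like these commonly converge on a single parsing routine. The sketch below illustrates that pattern with a hypothetical stand-in class, `SketchReader`, and a hypothetical helper, `load_all_ways`. It is not the library's implementation and it parses nothing: it only records the bytes it receives, so that the three paths can be compared.

```python
import io
import os
import tempfile
from typing import BinaryIO, List


class SketchReader:
    """Hypothetical stand-in that mirrors the three load methods by name only."""

    def __init__(self) -> None:
        self.raw = b""

    def load_bytes(self, data: bytes) -> None:
        # The single place where a real reader would begin parsing.
        self.raw = bytes(data)

    def load_stream(self, stream: BinaryIO) -> None:
        # Drain the file-like object, then reuse the bytes path.
        self.load_bytes(stream.read())

    def load_file(self, filepath: str) -> None:
        # Open in binary mode, because .doc is a binary format.
        with open(filepath, "rb") as handle:
            self.load_stream(handle)


def load_all_ways(payload: bytes) -> List[bytes]:
    """Feed one payload through each entry point and return what each stored."""
    by_bytes = SketchReader()
    by_bytes.load_bytes(payload)

    by_stream = SketchReader()
    by_stream.load_stream(io.BytesIO(payload))

    by_file = SketchReader()
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "sample.doc")
        with open(path, "wb") as handle:
            handle.write(payload)
        by_file.load_file(path)

    return [by_bytes.raw, by_stream.raw, by_file.raw]
```

Whichever method is called, the stand-in ends up holding the same bytes, which is why an in-memory buffer can go straight to `load_bytes()` without a temporary file.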
Building the Light Document Model
After loading, call to_light_document() on DocFileReader to produce an LDM Document:
from aspose.words_foss.doc_reader.doc_file_reader import DocFileReader
reader = DocFileReader()
reader.load_file("input.doc")
ldm_doc = reader.to_light_document()
The LDM is the internal representation shared across all readers (DocFileReader, DocumentReader, TextFileReader). Once you have an LDM document, you can save it to any supported output format.
The DocumentFormatReader Protocol
All readers implement the DocumentFormatReader protocol, which defines the common interface: load_file(), load_stream(), and load_bytes(). The create_reader() factory function returns the appropriate reader based on file extension:
from aspose.words_foss.reader_factory import create_reader
reader = create_reader("data.doc") # returns DocFileReader
reader.load_file("data.doc")
Tips and Best Practices
- Use the high-level Document("file.doc") constructor for most workflows; it handles reader selection automatically.
- Use DocFileReaderCore only when you need direct access to the binary parsing layer.
- For files received as bytes (e.g., from HTTP uploads), prefer load_bytes() to avoid writing temporary files.
- The DOC reader supports the same round-trip fidelity as the DOCX reader for text, paragraphs, and basic formatting.
Common Issues
| Issue | Cause | Fix |
|---|---|---|
| load_file() raises FileNotFoundError | Incorrect file path | Verify the path exists before calling |
| Garbled text after load | File is not a valid OLE2 binary | Confirm the file is genuinely .doc format, not a renamed .docx |
| Missing formatting in output | Some advanced DOC features not yet supported | Check limitations.md for unsupported features |
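One way to tell a genuine .doc from a renamed .docx before loading is to inspect the leading bytes. OLE2 compound files begin with the eight-byte signature `D0 CF 11 E0 A1 B1 1A E1`, whereas .docx packages are ZIP archives and normally begin with `PK\x03\x04`. The helper below is a minimal sketch that is independent of the library; the name `sniff_container` is hypothetical.

```python
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # compound file header signature
ZIP_MAGIC = b"PK\x03\x04"  # ZIP local file header, used by .docx packages


def sniff_container(data: bytes) -> str:
    """Classify a buffer by its leading bytes: 'ole2', 'zip', or 'unknown'."""
    if data.startswith(OLE2_MAGIC):
        return "ole2"
    if data.startswith(ZIP_MAGIC):
        return "zip"
    return "unknown"
```

A result of `"ole2"` shows only that the container is a compound file. Other legacy Office formats use the same container, so the check can rule out a renamed .docx but cannot confirm that the content is a Word document.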
FAQ
What is the difference between DocFileReader and DocFileReaderCore?
DocFileReaderCore handles the low-level binary parsing of the OLE2 format. DocFileReader extends it by adding the to_light_document() method that produces a full LDM Document suitable for conversion.
Can I load a DOC file from a stream?
Yes. Call DocFileReaderCore.load_stream(stream) with a file-like object. This is useful in web applications, where uploaded files arrive as streams.
Does the DOC reader handle password-protected files?
Password-protected .doc files are not currently supported. The reader will raise an exception if the file is encrypted.
API Reference Summary
| Class / Method | Description |
|---|---|
| DocFileReader | Full DOC reader with LDM building capability |
| DocFileReader.to_light_document() | Converts parsed DOC data to an LDM Document |
| DocFileReaderCore | Core reader for Word 97-2003 (.doc) binary files |
| DocFileReaderCore.load_file(filepath) | Load a DOC file from a path |
| DocFileReaderCore.load_stream(stream) | Load a DOC file from a stream |
| DocFileReaderCore.load_bytes(data) | Load a DOC file from a bytes buffer |
| DocumentFormatReader | Protocol interface for all document readers |
| create_reader(path) | Factory that returns the correct reader for a file extension |
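The protocol and factory entries in this summary can be pictured as structural typing plus dispatch on the file extension. The following is a sketch under stated assumptions, not the library's code: `ReaderProtocolSketch`, `DocSketch`, `TextSketch`, and `create_reader_sketch` are hypothetical names, and the extension table, including the `.txt` entry, is an illustrative guess.

```python
import os
from typing import BinaryIO, Dict, Protocol, Type, runtime_checkable


@runtime_checkable
class ReaderProtocolSketch(Protocol):
    """Hypothetical mirror of the three methods the protocol is said to define."""

    def load_file(self, filepath: str) -> None: ...
    def load_stream(self, stream: BinaryIO) -> None: ...
    def load_bytes(self, data: bytes) -> None: ...


class _RecordingReader:
    """Shared stand-in behaviour: remember which entry point was used."""

    def __init__(self) -> None:
        self.source = ""

    def load_file(self, filepath: str) -> None:
        self.source = "file"

    def load_stream(self, stream: BinaryIO) -> None:
        self.source = "stream"

    def load_bytes(self, data: bytes) -> None:
        self.source = "bytes"


class DocSketch(_RecordingReader):
    """Hypothetical stand-in for a .doc reader."""


class TextSketch(_RecordingReader):
    """Hypothetical stand-in for a plain-text reader."""


# Assumed mapping, for illustration only.
_READERS: Dict[str, Type[_RecordingReader]] = {".doc": DocSketch, ".txt": TextSketch}


def create_reader_sketch(path: str) -> ReaderProtocolSketch:
    """Pick a reader class from the (case-insensitive) file extension."""
    extension = os.path.splitext(path)[1].lower()
    if extension not in _READERS:
        raise ValueError(f"No reader registered for {extension!r}")
    return _READERS[extension]()
```

Because the protocol is structural, the stand-in readers satisfy it without inheriting from it. Any class that offers the three load methods can be returned by the factory.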