42

Extracting Metadata from Word Documents

You can inspect the document properties manually in Word, which exposes the document metadata through a convenient GUI. A manual GUI step, however, is not easy to integrate into an automated workflow.

If we export the document as RTF, we can inspect the resulting source file to extract the properties, but RTF syntax is hard to understand and parse.

A neat way to solve this is to open the document with Open Office and save it as an ODT file. An ODT file is a ZIP container, so you can unpack it and extract the metadata XML file (meta.xml) that it holds.
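The unpacking step can be scripted once the ODT file exists. The following is a minimal sketch in Python, not part of the original recipe. It assumes the ODT file follows the usual OpenDocument layout, with a meta.xml member whose properties sit under an office:meta element. The function name read_odt_metadata is hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the OpenDocument "office" elements used in meta.xml.
OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"


def read_odt_metadata(odt_path):
    """Return {local tag name: text} for the simple properties in meta.xml.

    Hypothetical helper for illustration. Elements that carry their data
    in attributes (such as the document statistics) are skipped, and a
    repeated tag keeps only its last value.
    """
    # An ODT file is a ZIP archive; meta.xml is read straight from it.
    with zipfile.ZipFile(odt_path) as container:
        xml_bytes = container.read("meta.xml")

    root = ET.fromstring(xml_bytes)
    meta = root.find(f"{{{OFFICE_NS}}}meta")

    properties = {}
    if meta is None:
        return properties

    for element in meta:
        # Strip the "{namespace}" prefix that ElementTree puts on tags.
        local_name = element.tag.rsplit("}", 1)[-1]
        if element.text and element.text.strip():
            properties[local_name] = element.text.strip()
    return properties
```

Calling read_odt_metadata("report.odt"), where report.odt is a placeholder file name, would return a dictionary with keys such as title or creator when those properties are set in the document. Flattening to local tag names keeps the sketch short; a workflow that needs to tell namespaces apart would keep the full qualified tags instead.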

Here is a systematic process:

1.  In Word, save the document as a normal DOC file.

2.  In Open Office, open the DOC file.

3.  Save ...
