46

Extracting Hyperlinks from Word Documents

If you have a Word document containing many hyperlinks that you want to test with a URL checking application, you need to extract them first.

Here is a fragment of UNIX shell script code that does the job.

    #!/bin/sh
    cd /working_directory
    strings Ch_001.doc | sed 's/ /\ /g' | grep "HYPERLINK" | cut -d\" -f2 > extracted_links.txt

This provides a list of URLs that you can insert into a database or run through an automated checker. Only hyperlinks will be extracted. URLs in the body of the text that are not activated hyperlinks will not be detected.
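As a minimal sketch of the same idea, the filtering stage can be wrapped in a small shell function that reads the output of strings on standard input. The name extract_links is hypothetical and does not appear in the original script. The sketch assumes a grep that supports the -o option (GNU and BSD grep both do), so that several hyperlinks on one line are each reported, and it adds sort -u to drop duplicate URLs, which the original pipeline does not do.

```shell
#!/bin/sh
# extract_links: hypothetical helper, not part of the original script.
# Reads text on stdin (for example, the output of `strings`) and prints
# the quoted target of every HYPERLINK field, one URL per line.
extract_links() {
    # -o prints each match on its own line (assumes GNU or BSD grep).
    grep -o 'HYPERLINK "[^"]*"' |
        # Field 2, split on double quotes, is the URL itself.
        cut -d'"' -f2 |
        # Sort and remove duplicate URLs (an addition to the original).
        sort -u
}
```

It could then be used in place of the grep and cut stages, for example as: strings Ch_001.doc | extract_links > extracted_links.txt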

There are some caveats related to this approach. We aren’t ...
