47 Extracting URLs from Raw Text

The HYPERLINK data in Microsoft Word documents is formatted as field codes. When they are not activated as HYPERLINK fields, they appear as simple body text. If we save the Word document as a plain text file, we can process it with a general-purpose URL extraction filter.

This script pulls the URLs from a raw text file, sorts them, and removes duplicates.

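#!/bin/sh
# Extract the unique URLs from a plain text file, one per line, sorted.
# Steps: turn CR, TAB and LF into spaces (one replacement character each,
# because POSIX leaves a short second tr set unspecified), split on spaces,
# keep words containing "://", strip a trailing period and any
# parentheses, then sort and remove duplicates.
SOURCE_FILE=$1
cat "${SOURCE_FILE}" |
tr "\r\t\n" "   "  |
sed 's/ /\
/g'          |
grep "://"     |
sed 's/\.$//'   |
tr -d "()"     |
sort         |
uniq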

After saving the input filename, the script pushes the file through a series of piped filters. Here ...
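To see what each stage of the filter chain contributes, it can help to run it on a few lines of known text. The sketch below is only an illustration: it wraps the same filters in a function named extract_urls (a hypothetical name, not from the book) that reads standard input instead of a file, and feeds it made-up sample text using example.com and example.org addresses.

```shell
#!/bin/sh
# Hypothetical helper: the same filter chain as the script above,
# reading standard input rather than a named file.
extract_urls() {
    tr "\r\t\n" "   " |     # CR, TAB and LF each become a space
    sed 's/ /\
/g' |                       # one word per line
    grep "://" |            # keep only words that look like URLs
    sed 's/\.$//' |         # drop a sentence-ending period
    tr -d "()" |            # drop surrounding parentheses
    sort |
    uniq
}

# Made-up sample text standing in for a Word document saved as plain text.
printf '%s\n' \
    'Docs live at http://example.com/b and are mirrored.' \
    'Mirror (ftp://example.org/a) and http://example.com/c.' \
    'Again: ftp://example.org/a' |
extract_urls
```

With this sample, the output should be three lines: ftp://example.org/a, http://example.com/b and http://example.com/c. The duplicate FTP address is collapsed by uniq, and the parentheses and trailing period are stripped. Note that the chain removes only a period and parentheses, so other adjacent punctuation, such as quotation marks or commas, would remain attached to a URL.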


Developing Quality Metadata by Cliff Wootton
