47
Extracting URLs from Raw Text
The HYPERLINK data in Microsoft Word documents is formatted as field codes. When they are not activated as HYPERLINK fields, they appear as simple body text. If we save the Word document as a plain text file, we can process it with a filter that acts as a general-purpose URL extractor.
This script pulls the URLs from a raw text file, sorts them into order, and removes duplicates.
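#!/bin/sh
SOURCE_FILE=$1
cat "${SOURCE_FILE}" |   # feed the file into the pipeline
tr "\r\t\n" "   " |      # turn carriage returns, tabs, and newlines into spaces
sed 's/ /\
/g' |
grep "://" |             # keep only the words that contain a URL scheme separator
sed 's/\.$//' |          # strip a trailing full stop
tr -d "()" |             # strip parentheses
sort |
uniq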
After preserving the input filename, the file is pushed through a series of piped filters. Here ...
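As a rough way to try the pipeline without saving a Word document first, the sketch below wraps the same filters in a shell function and feeds it a few lines of sample text. The function name `extract_urls` and the sample URLs are invented for illustration, and the function reads standard input instead of taking a filename, which is a small departure from the script above.

```shell
#!/bin/sh
# Hypothetical wrapper around the same filter chain; reads standard input.
extract_urls() {
    tr '\r\t\n' '   ' |    # turn carriage returns, tabs, and newlines into spaces
    sed 's/ /\
/g' |
    grep '://' |           # keep only the words that contain "://"
    sed 's/\.$//' |        # strip a trailing full stop
    tr -d '()' |           # strip parentheses
    sort |
    uniq
}

# Made-up sample text: one URL appears twice, one sits inside parentheses.
printf 'See http://example.com/b.\nAlso (ftp://example.org/a) and\thttp://example.com/b again.\r\n' |
extract_urls
# Prints:
# ftp://example.org/a
# http://example.com/b
```

The duplicate URL collapses to a single line because `uniq` only removes adjacent repeats, which is why `sort` has to come first.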