Final Thoughts

When storing information, consider what is being stored and how that information will be used later. If the data isn’t going to be used later, ask yourself why you need to save it at all.

Sometimes it is easier to parse text before the HTML tags are removed. This is especially true of data in tables, where the row and cell tags are what separate one row or column from the next.
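As a rough illustration of why the tags help, here is a minimal sketch that reads a table into rows and columns while the markup is still present. It uses Python's standard html.parser module rather than this book's libraries. The names TableParser, parse_table, and sample are hypothetical, and the sketch assumes a simple, well-formed table with explicit closing tags and no nesting.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of each table cell, grouped by row.

    Hypothetical helper: assumes explicit </td>, </th>, and </tr> tags
    and no nested tables.
    """

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Keep text only while inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def parse_table(html):
    """Return the table in `html` as a list of rows of cell text."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.rows


# Made-up sample data for illustration only.
sample = (
    "<table>"
    "<tr><th>Item</th><th>Price</th></tr>"
    "<tr><td>Widget</td><td>4.95</td></tr>"
    "</table>"
)
rows = parse_table(sample)
print(rows)
```

Once the tags are stripped, the same table collapses into a single run of text such as "Item Price Widget 4.95", and the boundaries between cells have to be guessed.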

Although unformatted pages are stripped of presentation, colors, and images, the remaining text is enough to represent the content of the original file. Without the HTML, the text is actually easier to characterize, manipulate, or search for keywords.
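The sketch below shows one way this might look in practice: strip a page down to its visible text, then count keyword occurrences in what remains. It again relies on Python's standard library, not on this book's libraries. TextExtractor, strip_html, count_keywords, and the sample page are hypothetical names and data chosen for illustration.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Gather visible text, skipping the contents of script and style."""

    def __init__(self):
        super().__init__()
        self._parts = []
        self._skip_depth = 0  # greater than 0 while inside script/style

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._parts.append(data)

    def text(self):
        # Join fragments with spaces, then collapse runs of whitespace.
        return re.sub(r"\s+", " ", " ".join(self._parts)).strip()


def strip_html(html):
    """Return the visible text of `html` with all tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def count_keywords(text, keywords):
    """Count whole-word, case-insensitive occurrences of each keyword."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {keyword: words.count(keyword.lower()) for keyword in keywords}


# Made-up sample page for illustration only.
page = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Webbot news</h1>"
    "<p>A <b>webbot</b> downloads pages.</p></body></html>"
)
text = strip_html(page)
print(text)
print(count_keywords(text, ["webbot", "spider"]))
```

Joining fragments with a space keeps words in adjacent elements from running together, at the cost of splitting a word that is only partly wrapped in a tag. Whether that trade-off is acceptable depends on the pages being processed.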

Before you continue, this is a good time to download LIB_mysql, LIB_http, and LIB_thumbnail from this book’s website. ...
