Chapter 4. PARSING TECHNIQUES

Parsing is the process of separating the information you want from the information you don't. In the case of webbots, parsing means detecting and extracting image names and addresses, key phrases, hyperlinks, and any other information of interest to your webbot. For example, if you are writing a spider that follows links on web pages, you will have to separate those links from the rest of the HTML. Similarly, if you write a webbot that downloads all the images from a web page, you will have to write parsing routines that identify every reference to an image file.
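To make the two examples above concrete, here is a minimal sketch in Python that pulls link targets and image references out of a page. It is an illustration only, not the approach this chapter develops: the class name LinkAndImageParser, the helper extract_links_and_images, and the sample page are all hypothetical, and the sketch relies on the standard library's html.parser module.

```python
from html.parser import HTMLParser


class LinkAndImageParser(HTMLParser):
    """Hypothetical helper: collects href values from <a> tags
    and src values from <img> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs. HTMLParser lowercases
        # tag and attribute names; a value is None if the attribute
        # has no value.
        attributes = dict(attrs)
        if tag == "a" and attributes.get("href"):
            self.links.append(attributes["href"])
        elif tag == "img" and attributes.get("src"):
            self.images.append(attributes["src"])


def extract_links_and_images(html):
    """Return (links, images) found in the given HTML string."""
    parser = LinkAndImageParser()
    parser.feed(html)
    parser.close()
    return parser.links, parser.images


# Sample page invented for this illustration.
sample_page = """
<html><body>
<p>Welcome. <a href="/about.html">About us</a></p>
<img src="logo.png" alt="Logo">
<a href="http://www.example.com/news">News</a>
</body></html>
"""

links, images = extract_links_and_images(sample_page)
print(links)
print(images)
```

Run against the sample page, the sketch collects two link targets (/about.html and http://www.example.com/news) and one image reference (logo.png). A spider would then resolve the relative addresses against the page's own URL before following them.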

Parsing Poorly Written HTML

One of the problems you'll encounter when parsing web pages is poorly written HTML. A large amount of HTML is machine generated and shows ...
