This appendix provides an overview of web crawling components, a brief description of the implementation details for the crawler provided with the book, and a few open-source crawlers written in Java.
Web crawlers are used to discover, download, and store content from the Web. As we've seen in chapter 2, a web crawler is just a part of a larger application such as a search engine.
A typical web crawler has the following components:
A repository module to keep track of all URLs known to the crawler.
A document download module that retrieves documents from the Web using a provided set of URLs.
A document parsing module that's responsible for extracting the raw content out of a variety of document ...