Web content mining

This type of mining focuses on extracting information from the content of web pages. Each page is usually collected and structured (using a parsing technique), cleaned of the unimportant parts of its text (using natural language processing), and then analyzed by an information retrieval system that matches the relevant documents to a given query. These three components are discussed in the following paragraphs.
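To make the three stages concrete, the following is a minimal sketch of the whole pipeline using only the Python standard library. It is not the method developed in the rest of the chapter: the sample pages, the tiny stop-word list, and the term-overlap scoring are all assumptions chosen to keep the example short, and the names `extract_text`, `clean`, and `rank` are hypothetical.

```python
import re
from html.parser import HTMLParser

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "is", "of", "and", "to", "in", "on", "for", "at"}

# Hypothetical sample pages, used only to exercise the pipeline.
PAGES = {
    "page1": "<html><body><h1>Space movie</h1><p>A review of the space movie.</p></body></html>",
    "page2": "<html><body><p>Cooking pasta at home.</p></body></html>",
    "page3": "<html><body><p>The best movie of the year.</p></body></html>",
}


class TextExtractor(HTMLParser):
    """Stage 1 (parsing): collect the text found between tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def extract_text(html):
    """Return the text content of an HTML string with the tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def clean(text):
    """Stage 2 (text cleaning): lowercase, tokenize, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def rank(pages, query):
    """Stage 3 (retrieval): order pages by the number of distinct query terms they contain."""
    query_terms = set(clean(query))
    scores = {}
    for name, html in pages.items():
        page_terms = set(clean(extract_text(html)))
        scores[name] = len(query_terms & page_terms)
    # Highest score first; pages that match no query term are left out.
    ranked = sorted(scores, key=lambda n: scores[n], reverse=True)
    return [n for n in ranked if scores[n] > 0]


if __name__ == "__main__":
    print(rank(PAGES, "the space movie"))
```

Counting shared terms is the simplest possible matching rule and ignores how often or how rarely a term occurs; it is used here only so that each stage stays visible in a few lines.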

Parsing

A web page is written in HTML, so the first operation is to extract the relevant pieces of information from it. An HTML parser builds a tree of tags from which the content can be extracted. Nowadays, there are many parsers available. As an example, we use the Scrapy library; see Chapter 7, Movie Recommendation ...
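The idea of a tree of tags can be illustrated without any third-party dependency. The sketch below uses Python's standard `html.parser` module as a stand-in for Scrapy, so it does not show Scrapy's own API. The `Node`, `TreeBuilder`, `find_all`, and `text_of` names and the sample page are hypothetical, and the builder assumes reasonably well-formed HTML.

```python
from html.parser import HTMLParser

# Hypothetical sample page used only for illustration.
SAMPLE = """
<html><head><title>Movie reviews</title></head>
<body>
  <h1>Latest reviews</h1>
  <p>A <b>great</b> film.</p>
  <p>Not worth watching.</p>
</body></html>
"""


class Node:
    """One element of the tag tree."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []  # child Node objects and text strings, in document order


class TreeBuilder(HTMLParser):
    """Builds a simple tree of tags while the parser walks the HTML."""

    # Void elements have no closing tag, so they must never stay open.
    VOID = {"br", "img", "hr", "meta", "link", "input"}

    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, parent=self.current)
        self.current.children.append(node)
        if tag not in self.VOID:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this name and close it.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.children.append(data.strip())


def parse(html):
    """Parse an HTML string and return the root of its tag tree."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def find_all(node, tag):
    """Return every descendant of node with the given tag, in document order."""
    found = []
    for child in node.children:
        if isinstance(child, Node):
            if child.tag == tag:
                found.append(child)
            found.extend(find_all(child, tag))
    return found


def text_of(node):
    """Join the text under node, including text inside nested tags."""
    parts = [c if isinstance(c, str) else text_of(c) for c in node.children]
    return " ".join(p for p in parts if p)


if __name__ == "__main__":
    root = parse(SAMPLE)
    for paragraph in find_all(root, "p"):
        print(text_of(paragraph))
```

Once the tree exists, extracting content reduces to selecting nodes by tag and reading their text. Dedicated libraries offer much richer selection (for example, by attribute or by path) and cope better with malformed markup.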
