Chapter 4. Web Mining Techniques

Web data mining techniques are used to explore the data available online and extract the relevant information from it. Searching the web is a complex process that requires several different algorithms, which are the main focus of this chapter. Given a search query, the relevant pages are found using the data available on each web page, which is usually divided into the page content and the page's hyperlinks to other pages. A search engine typically has multiple components:

  • A web crawler or spider for collecting web pages
  • A parser that extracts content and preprocesses web pages
  • An indexer to organize the web pages in data structures
  • An information retrieval system to score the most important documents ...
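
To make the roles of these components concrete, the following is a minimal sketch in Python, not the chapter's own implementation. It assumes a tiny in-memory "web" (the hypothetical `PAGES` dictionary) instead of real HTTP requests, and it scores documents with a simple TF-IDF sum, which is only one of many possible ranking schemes. All function and variable names are illustrative.

```python
import math
import re
from collections import Counter, defaultdict, deque
from html.parser import HTMLParser

# Hypothetical in-memory "web" (URL -> HTML); a real crawler would fetch pages over HTTP.
PAGES = {
    "a.html": '<p>Web mining extracts information</p><a href="b.html">next</a>',
    "b.html": '<p>Crawlers collect web pages</p><a href="c.html">next</a>',
    "c.html": '<p>Indexers organize pages for retrieval</p><a href="a.html">next</a>',
}


class PageParser(HTMLParser):
    """Parser: separates a page into its text content and its hyperlinks."""

    def __init__(self):
        super().__init__()
        self.text, self.links = [], []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)

    def handle_data(self, data):
        self.text.append(data)


def parse(html):
    """Return (lowercased word tokens, outgoing links) for one page."""
    parser = PageParser()
    parser.feed(html)
    parser.close()
    tokens = re.findall(r"[a-z0-9]+", " ".join(parser.text).lower())
    return tokens, parser.links


def crawl(seed, pages):
    """Crawler: breadth-first traversal of hyperlinks starting from a seed URL."""
    seen, queue, docs = {seed}, deque([seed]), {}
    while queue:
        url = queue.popleft()
        html = pages.get(url)
        if html is None:  # link to a page we cannot fetch
            continue
        tokens, links = parse(html)
        docs[url] = tokens
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return docs


def build_index(docs):
    """Indexer: inverted index mapping term -> {url: term frequency}."""
    index = defaultdict(dict)
    for url, tokens in docs.items():
        for term, count in Counter(tokens).items():
            index[term][url] = count
    return index


def search(query, index, n_docs):
    """Retrieval: rank pages by a simple TF-IDF sum over the query terms."""
    scores = defaultdict(float)
    for term in re.findall(r"[a-z0-9]+", query.lower()):
        postings = index.get(term, {})
        if not postings:
            continue
        idf = math.log(1 + n_docs / len(postings))
        for url, tf in postings.items():
            scores[url] += tf * idf
    # Highest score first; ties are broken alphabetically by URL.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


docs = crawl("a.html", PAGES)
index = build_index(docs)
results = search("web pages", index, len(docs))
```

In this toy corpus, `b.html` is the only page containing both query terms, so it ranks first for the query "web pages". A production system would add politeness rules for crawling, stemming and stop-word removal in the parser, and a ranking function that also uses the link structure.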
