CHAPTER 3

Collecting Data with Nutch and Solr

Many companies collect vast amounts of data from the web by using web crawlers such as Apache Nutch. Nutch is an open-source product from the Apache Software Foundation that has been available for more than ten years and has a large community of committed users. Solr, an open-source search platform built on Apache Lucene, can be used together with Nutch to index and search the data that Nutch collects. When you combine this functionality with Hadoop, you can store the resulting large volume of data directly in a distributed file system.

In this chapter, you will learn a number of methods to connect various releases of Nutch to Hadoop. ...
