Whole web crawling with Apache Nutch using a Hadoop/HBase cluster

Crawling a large number of web documents can be done efficiently by utilizing the parallel processing power of a MapReduce cluster.

Note

As of the Apache Nutch 2.2.1 release, the Nutch project has not officially migrated to Hadoop 2.x and still depends on Hadoop 1.x for whole web crawling. However, it is possible to execute the Nutch jobs on a Hadoop 2.x cluster by utilizing Hadoop's backward compatibility.
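The following is a minimal sketch of one crawl cycle submitted to the cluster as MapReduce jobs. It assumes you have built the job artifact with ant runtime and uploaded a seed URL list to a hypothetical HDFS path /user/foo/urls; the class names are from the Nutch 2.2.1 distribution, and the exact batch-selection flags vary slightly across 2.x releases:

    # Inject the seed URLs from HDFS into the web table
    hadoop jar apache-nutch-2.2.1.job org.apache.nutch.crawl.InjectorJob /user/foo/urls

    # Generate a batch of URLs due for fetching
    hadoop jar apache-nutch-2.2.1.job org.apache.nutch.crawl.GeneratorJob

    # Fetch, parse, and update the web table for all generated batches
    hadoop jar apache-nutch-2.2.1.job org.apache.nutch.fetcher.FetcherJob -all
    hadoop jar apache-nutch-2.2.1.job org.apache.nutch.parse.ParserJob -all
    hadoop jar apache-nutch-2.2.1.job org.apache.nutch.crawl.DbUpdaterJob -all

Repeating the generate-fetch-parse-update cycle deepens the crawl by one link hop per iteration.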

The Nutch HBaseStore integration further depends on HBase 0.90.6, which does not support Hadoop 2. Hence, this recipe works only with a Hadoop 1.x cluster. We look forward to a new Nutch release with full Hadoop 2.x support.
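If you are preparing the Nutch HBase integration yourself, two configuration entries select HBaseStore as the Gora backing store. A minimal sketch, assuming the standard Nutch 2.2.1 source layout (the gora-hbase dependency must also be enabled in ivy/ivy.xml before building):

    <!-- conf/nutch-site.xml -->
    <property>
      <name>storage.data.store.class</name>
      <value>org.apache.gora.hbase.store.HBaseStore</value>
    </property>

    # conf/gora.properties
    gora.datastore.default=org.apache.gora.hbase.store.HBaseStore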

Getting ready

We assume you already have your Hadoop 1.x and HBase ...
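Before starting the crawl, it is worth confirming that both services respond; a quick check, assuming the hadoop and hbase commands are on your PATH:

    # Print the Hadoop version in use (this recipe expects 1.x)
    hadoop version

    # Ask the HBase cluster for its status
    echo "status" | hbase shell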
