Chapter 7. Apache Nutch

In the previous chapter, we saw how to index documents into Solr using Apache Tika. In this chapter, we'll see how to use Apache Nutch to crawl web content and index it in Solr. This chapter covers the following topics:

  • Introducing Apache Nutch
  • Installing Apache Nutch
  • Configuring Solr with Nutch

Introducing Apache Nutch

Apache Nutch is an open source web crawler that can be used to crawl websites and retrieve data from them. It is extensible and scalable, and its plugin architecture gives us the freedom to adapt it to our needs. Apache Nutch is written in Java, just like Apache Solr, and the two tools combine well for building a search engine of our own.

Apache ...
