Chapter 6. Indexing Data Using Apache Tika

In previous chapters, we saw how to use the data import handler provided by Solr to index data from various data sources (JDBC and file data sources). In this chapter, we'll see how to index data from various file formats, such as MS Word, Excel, PDF, and many more. We'll cover the following topics:

Introducing Apache Tika
Configuring Apache Tika in Solr
Indexing PDF and Word documents

Introducing Apache Tika

Apache Tika is an open source library for document type detection and content extraction from various file formats. It relies on existing document parsers and document type detection techniques to detect and extract data. Using Tika, we can develop a universal type detector and content ...
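
To make the idea of document type detection concrete, here is a minimal sketch of one common technique: comparing the first bytes of a file against known signatures ("magic bytes"). It uses only the Java standard library. The class MagicSniffer, its method names, and the label returned for OLE2 files are hypothetical names chosen for this illustration; this is not Tika's API or its actual implementation, which combines several detection strategies.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; NOT part of Apache Tika.
final class MagicSniffer {

    // Leading byte signatures of a few well-known formats.
    private static final byte[] PDF = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    // ZIP container, which OOXML files such as .docx and .xlsx are built on.
    private static final byte[] ZIP = {0x50, 0x4B, 0x03, 0x04};
    // OLE2 compound file, used by legacy .doc and .xls files.
    private static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    private MagicSniffer() { }

    // Returns true if data begins with every byte of prefix.
    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    // Maps the leading bytes of a document to a coarse media type label.
    static String detect(byte[] head) {
        if (startsWith(head, PDF)) {
            return "application/pdf";
        }
        if (startsWith(head, ZIP)) {
            return "application/zip";
        }
        if (startsWith(head, OLE2)) {
            // Label chosen for this sketch (an assumption).
            return "application/x-ole-storage";
        }
        // Generic fallback for unrecognized content.
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        byte[] sample = "%PDF-1.7\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(sample));
    }
}
```

A signature check like this only identifies the container: a .docx and an .xlsx file both start with the ZIP signature, so telling them apart requires looking inside the archive. That is part of why a dedicated library is preferable to hand-written checks.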


Apache Solr for Indexing Data by Sachin Handiekar, Anshul Johri
