Indexing PDF and Word documents

We'll create a new schema to hold the metadata of our indexed files. Apache Tika extracts this metadata from each file that we pass to it. The schema.xml configuration that we'll use looks like the following:

<?xml version="1.0" encoding="UTF-8" ?>
<schema name="tika-example" version="1.5">
  <field name="title" type="text_general" indexed="true" stored="true" multiValued="true"/>
  <field name="author" type="text_general" indexed="true" stored="true"/>
  <field name="content" type="text_general" indexed="true" stored="true" multiValued="true"/>
  <dynamicField name="attr_*" type="text_general" indexed="true" stored="false" multiValued="true"/>
  <fieldType name="text_general" class="solr.TextField" ...
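
For Tika's output to reach these fields, Solr needs a request handler that runs the extraction and maps the results onto the schema. The fragment below is a minimal sketch of such a handler in solrconfig.xml, not necessarily the configuration the book uses. It assumes the Solr Cell (extraction) libraries are on the classpath; the handler path /update/extract and the mapping choices are assumptions based on the field names in the schema above.

```xml
<!-- Sketch only: assumed solrconfig.xml entry for Solr Cell (Tika extraction). -->
<requestHandler name="/update/extract"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <!-- Normalize Tika metadata names to lowercase with underscores. -->
    <str name="lowernames">true</str>
    <!-- Send the extracted body text to the "content" field. -->
    <str name="fmap.content">content</str>
    <!-- Prefix metadata with no matching schema field with "attr_",
         so the attr_* dynamic field picks it up. -->
    <str name="uprefix">attr_</str>
  </lst>
</requestHandler>
```

With a mapping like this, metadata such as title and author would land in the fields of the same name, and everything else Tika reports would be indexed under attr_* without being stored.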


Apache Solr for Indexing Data by Sachin Handiekar, Anshul Johri
