Indexing PDF and Word documents

We'll create a new schema that will hold the metadata information for our indexed files. Apache Tika will extract the metadata information from the file that we pass to it. The schema.xml configuration, which we'll use, looks like the following:

<?xml version="1.0" encoding="UTF-8" ?> <schema name="tika-example" version="1.5"> <field name="title" type="text_general" indexed="true" stored="true" multiValued="true"/> <field name="author" type="text_general" indexed="true" stored="true"/> <field name="content" type="text_general" indexed="true" stored="true" multiValued="true"/> <dynamicField name="attr_*" type="text_general" indexed="true" stored="false" multiValued="true"/> <fieldType name="text_general" class="solr.TextField" ...

Get Apache Solr for Indexing Data now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.