Instant Apache Solr for Indexing Data How-to by Alexandre Rafalovitch

Indexing binary content on the server (Intermediate)

If Solr could only index structured documents, it would leave the vast majority of possible content untouched. Fortunately, with the help of another Apache open source project, Apache Tika, Solr can also index binary content. Whether it is a PDF document, an MS Word or OpenOffice document, an image, or even a song, it can be indexed into Solr.

Of course, it makes no sense to load the raw binary content into Solr as is. Instead, Tika parses the binary format, extracts the available metadata and, in some cases, the textual content, and makes them available to Solr. In the case of pseudo-binary documents, such as the latest MS Word or OpenOffice formats, a considerable amount of information is available. For images ...
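
As a rough illustration of the flow described above, the following sketch sends a binary file to Solr so that Tika can parse it on the server. It assumes that Solr's extracting request handler is registered at `/update/extract` in `solrconfig.xml`, which this excerpt does not confirm. The core URL, the document ID, and the helper names `build_extract_url` and `index_binary_file` are hypothetical, chosen only for this example.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_extract_url(core_url, doc_id, commit=True):
    """Build the request URL for the extracting handler.

    Assumes the handler is mapped to /update/extract (an assumption
    about the server configuration, not something stated in the text).
    literal.id supplies the unique key, since a binary file has none.
    """
    params = {"literal.id": doc_id, "commit": "true" if commit else "false"}
    return core_url.rstrip("/") + "/update/extract?" + urlencode(params)


def index_binary_file(core_url, doc_id, path,
                      content_type="application/octet-stream"):
    """POST the raw bytes of a file; Tika does the parsing on the server."""
    with open(path, "rb") as handle:
        body = handle.read()
    request = Request(
        build_extract_url(core_url, doc_id),
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urlopen(request) as response:
        return response.status  # HTTP status code reported by Solr
```

A call such as `index_binary_file("http://localhost:8983/solr/collection1", "doc1", "report.pdf", "application/pdf")` would only succeed against a running Solr instance with a matching core and schema, so treat the URL and file name as placeholders.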
