12.6. Creating an Index of XML Documents

Problem

You need to search a collection of XML documents quickly. To do this, you need an index of terms that keeps track of the context in which each term appears.

Solution

Use Jakarta Lucene and Jakarta Digester to create an index of Lucene Document objects, one for each unit at the lowest level of granularity you wish to search. For example, if you want to find the speeches in a Shakespeare play that contain specific terms, create a Lucene Document object for each speech. For the purposes of this recipe, assume that you are indexing Shakespeare plays stored in the following XML format:

<?xml version="1.0"?>


<PLAY>
  <TITLE>All's Well That Ends Well</TITLE>

  <ACT>
    <TITLE>ACT I</TITLE>

    <SCENE>
      <TITLE>SCENE I.  Rousillon. The COUNT's palace.</TITLE>

      <SPEECH>
        <SPEAKER>COUNTESS</SPEAKER>
        <LINE>In delivering my son from me, I bury a second husband.</LINE>
      </SPEECH>

      <SPEECH>
        <SPEAKER>BERTRAM</SPEAKER>
        <LINE>And I in going, madam, weep o'er my father's death</LINE>
        <LINE>anew: but I must attend his majesty's command, to</LINE>
        <LINE>whom I am now in ward, evermore in subjection.</LINE>
      </SPEECH>
    </SCENE>
  </ACT>
</PLAY>
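
To make the "one Document per speech" granularity concrete, the following is a minimal sketch that walks this XML format and produces one record per SPEECH element, carrying the play, act, and scene titles along as context. It uses only the JDK's DOM parser, not Digester or Lucene, and the SpeechExtractor and Speech names are hypothetical and do not come from the recipe. Treat it as an illustration of the granularity choice under those assumptions, not as the recipe's implementation:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper (not part of the recipe): turns one play into per-speech records.
class SpeechExtractor {

    /** One searchable unit: a speech plus the context in which it appears. */
    static final class Speech {
        final String play, act, scene, speaker, text;

        Speech(String play, String act, String scene, String speaker, String text) {
            this.play = play;
            this.act = act;
            this.scene = scene;
            this.speaker = speaker;
            this.text = text;
        }
    }

    /** Parses a PLAY document and returns one Speech per SPEECH element. */
    static List<Speech> extract(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Element play = builder.parse(new InputSource(new StringReader(xml)))
                                  .getDocumentElement();
            String playTitle = childText(play, "TITLE");
            List<Speech> speeches = new ArrayList<>();
            for (Element act : children(play, "ACT")) {
                String actTitle = childText(act, "TITLE");
                for (Element scene : children(act, "SCENE")) {
                    String sceneTitle = childText(scene, "TITLE");
                    for (Element speech : children(scene, "SPEECH")) {
                        // Join all LINE elements of the speech into one text field.
                        StringBuilder text = new StringBuilder();
                        for (Element line : children(speech, "LINE")) {
                            if (text.length() > 0) {
                                text.append(' ');
                            }
                            text.append(line.getTextContent().trim());
                        }
                        speeches.add(new Speech(playTitle, actTitle, sceneTitle,
                            childText(speech, "SPEAKER"), text.toString()));
                    }
                }
            }
            return speeches;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse play XML", e);
        }
    }

    /** Returns the direct child elements of parent with the given tag name. */
    static List<Element> children(Element parent, String name) {
        List<Element> result = new ArrayList<>();
        NodeList nodes = parent.getChildNodes();
        for (int i = 0; i < nodes.getLength(); i++) {
            Node node = nodes.item(i);
            if (node instanceof Element && node.getNodeName().equals(name)) {
                result.add((Element) node);
            }
        }
        return result;
    }

    /** Returns the trimmed text of the first matching child, or "" if none exists. */
    static String childText(Element parent, String name) {
        List<Element> matches = children(parent, name);
        return matches.isEmpty() ? "" : matches.get(0).getTextContent().trim();
    }
}
```

Calling SpeechExtractor.extract on the sample document above would yield two records, one for the COUNTESS speech and one for the BERTRAM speech. In the recipe itself, each such unit becomes a Lucene Document rather than a plain Java object.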

The following class creates a Lucene index of Shakespeare speeches. It reads the XML file for each play in the ./data/Shakespeare directory and calls the PlayIndexer to create a Lucene Document object for every speech. These Document objects are then written to a Lucene index using an IndexWriter:

import java.io.File; ...
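
The listing above is elided in this excerpt, and the sketch below does not attempt to reconstruct it. Instead, it illustrates the underlying idea of an index of terms that remembers the context of each occurrence: a toy inverted index that maps each term to the speeches containing it. It uses only the JDK. ToySpeechIndex, its methods, and the simple tokenizing rule are assumptions made for illustration and are not Lucene APIs; a real Lucene index adds analysis, scoring, and on-disk storage:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for a Lucene index, for illustration only.
class ToySpeechIndex {
    // Context label for each indexed speech, addressed by its numeric id.
    private final List<String> labels = new ArrayList<>();
    // Term -> ids of the speeches that contain the term.
    private final Map<String, List<Integer>> postings = new HashMap<>();

    /** Indexes one speech; label carries its context (for example act, scene, speaker). */
    void add(String label, String text) {
        int id = labels.size();
        labels.add(label);
        // Assumed tokenizer: lowercase, split on anything but letters and apostrophes.
        for (String term : text.toLowerCase(Locale.ROOT).split("[^a-z']+")) {
            if (term.isEmpty()) {
                continue;
            }
            List<Integer> ids = postings.computeIfAbsent(term, k -> new ArrayList<>());
            // Record each speech at most once per term.
            if (ids.isEmpty() || ids.get(ids.size() - 1) != id) {
                ids.add(id);
            }
        }
    }

    /** Returns the context labels of all speeches containing the term. */
    List<String> search(String term) {
        List<Integer> ids = postings.getOrDefault(
            term.toLowerCase(Locale.ROOT), Collections.<Integer>emptyList());
        List<String> hits = new ArrayList<>();
        for (int id : ids) {
            hits.add(labels.get(id));
        }
        return hits;
    }
}
```

Because every entry is a speech rather than a whole play, a hit identifies the speaker and scene directly, which is the reason for indexing at the lowest level of granularity you want to search.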
