12.8. Finding the Frequency of Terms in an Index
Problem
You need to find the most frequently used terms in a Lucene index.
Solution
Use Jakarta Lucene to index your documents, and obtain a TermEnum from an IndexReader. The frequency of a term is defined as the number of documents in which that term appears, and a TermEnum object contains the frequency of every term in a set of documents. Example 12-3 iterates over the terms contained in a TermEnum, returning every term in the speech field that appears in at least 1,100 speeches.
Example 12-3. TermFreq finding the most frequent terms in an index
package com.discursive.jccook.xml.bardsearch;

import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

import org.apache.commons.lang.builder.CompareToBuilder;
import org.apache.log4j.Logger;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermEnum;

import com.discursive.jccook.util.LogInit;

public class TermFreq {

    private static Logger logger = Logger.getLogger(TermFreq.class);

    static { LogInit.init( ); }

    public static void main(String[] pArgs) throws Exception {
        logger.info("Threshold is 1100" );
        Integer threshold = new Integer( 1100 );

        // Open the index directory and enumerate every term it contains
        IndexReader reader = IndexReader.open( "index" );
        // Named termEnum because "enum" is a reserved word in Java 5 and later
        TermEnum termEnum = reader.terms( );

        List termList = new ArrayList( );
        while( termEnum.next( ) ) {
            // Keep only terms from the "speech" field that meet the threshold
            if( termEnum.docFreq( ) >= threshold.intValue( ) &&
                termEnum.term( ).field( ).equals( "speech" ) ) {
                Freq freq = new Freq( termEnum.term( ).text( ), termEnum.docFreq( ) ...
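To make the definition of term frequency used in this recipe concrete, the following plain-Java sketch counts, for each term, the number of documents in which it appears, and then keeps the terms that meet a threshold. It does not use Lucene; the class name DocFreqSketch, its two methods, the simple tokenizer, and the three sample strings are all hypothetical and exist only for illustration. A term repeated within one document is still counted once for that document.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration class; it does not use Lucene.
final class DocFreqSketch {

    // For each term, count the number of documents that contain it at least once.
    static Map<String, Integer> docFreq(List<String> documents) {
        Map<String, Integer> counts = new HashMap<>();
        for (String doc : documents) {
            // Assumption: a naive tokenizer that lowercases and splits on non-word characters.
            Set<String> seen = new HashSet<>(Arrays.asList(doc.toLowerCase().split("\\W+")));
            seen.remove("");
            // The set guarantees each term is counted once per document.
            for (String term : seen) {
                counts.merge(term, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Return terms whose document frequency is >= threshold,
    // most frequent first, with ties broken alphabetically.
    static List<String> frequentTerms(Map<String, Integer> counts, int threshold) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() >= threshold) {
                result.add(entry.getKey());
            }
        }
        result.sort((a, b) -> {
            int byFreq = Integer.compare(counts.get(b), counts.get(a));
            return byFreq != 0 ? byFreq : a.compareTo(b);
        });
        return result;
    }

    public static void main(String[] args) {
        List<String> speeches = Arrays.asList(
            "To be, or not to be",
            "To thine own self be true",
            "Not a mouse stirring");
        Map<String, Integer> counts = docFreq(speeches);
        // "be" occurs twice in the first speech but contributes 1 to its count there.
        System.out.println(frequentTerms(counts, 2)); // prints [be, not, to]
    }
}
```

In a real index, this counting is done by Lucene at indexing time, so reading the per-term document count from the index should be far cheaper than rescanning the documents as this sketch does.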