A Quick Example: Indexing the Filesystem

With the explosion of the Internet, a huge amount of information has become available to us. But it doesn't matter how much information is available if we can't find what we are looking for. Luckily, companies like Google and Yahoo! have come to the rescue by helping us find the information we need with their search engines.

More recently, the same thing has been happening on our personal computers. More and more of our personal lives are being stored on hard drives: everything from work documents and email to multimedia files and family photos. Carefully categorizing all this data and scanning through large hierarchies of folders just doesn't cut it anymore. We need a fast way to access the data we need. Presently, some of the tools commonly used for this task, such as the built-in search in Windows, leave a lot to be desired. Spotlight on OS X is much closer to what we need.

By the end of this book, you'll have built a search application that will make searching your hard drive as easy as searching the Web. In this section, we start with plain old text files. Let's begin by writing a command-line indexing program that takes two arguments: the name of the directory we want to index, and the name of the directory in which the index will be stored. Take a look at Example 1-2.

Example 1-2. index.rb

  0 #!/usr/bin/env ruby
  1 require 'rubygems'
  2 require 'ferret'
  3 require 'fileutils'
  4 include Ferret
  5 include Ferret::Index
  6 
  7 def usage(message = nil)
  8   puts message if message
  9   puts "ruby #{File.basename(__FILE__)} <data dir> <index dir>"
 10   exit(1)
 11 end
 12 
 13 usage() if ARGV.size != 2
 14 usage("Directory '#{ARGV[0]}' doesn't exist.") unless File.directory?(ARGV[0])
 15 $data_dir, $index_dir = ARGV
 16 begin
 17   FileUtils.mkdir_p($index_dir)
 18 rescue
 19   usage("Can't create index directory '#$index_dir'.")
 20 end
 21 
 22 index = Index.new(:path => $index_dir,          
 23                   :create => true)
 24 
 25 Dir["#$data_dir/**/*.txt"].each do |file_name|  
 26   index << {:file_name => file_name, :content => File.read(file_name)} 
 27 end
 28 index.optimize()                                
 29 index.close()

Most of this code handles command-line arguments and can be safely skimmed over. The interesting part begins on line 22, where we create the index. The :path parameter specifies where the index is stored. Setting the :create parameter to true tells Ferret to create a new index in that directory. Any index already residing there will be overwritten, so be careful when setting :create to true.
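
Because :create => true overwrites whatever index is already in the directory, it can be worth checking the directory before building the index. The following is only a sketch in plain Ruby: safe_to_create? is a hypothetical helper, not part of Ferret, and it assumes that "missing or empty" is a good enough test for "nothing to lose".

```ruby
require 'tmpdir'

# Hypothetical helper (not part of Ferret): returns true when index_dir is
# missing or empty, so nothing would be lost by passing :create => true.
def safe_to_create?(index_dir)
  return true unless File.directory?(index_dir)
  (Dir.entries(index_dir) - %w[. ..]).empty?
end

Dir.mktmpdir do |dir|
  puts safe_to_create?(dir)                         # empty directory
  File.write(File.join(dir, 'existing_file'), '')   # illustrative placeholder file
  puts safe_to_create?(dir)                         # directory now has content
end
```

In index.rb, a check like this could sit just before line 22 and call usage() with a warning when it returns false.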

Once the index is created, we need to add documents to it. Line 25 scans the directory tree for all text files (those with a .txt extension). Line 26 is where most of the action happens. We saw earlier that we can add simple Strings to an index; this time we use a Hash because we want each document to have two fields: a :file_name field and a :content field. Later, we'll learn about the Document class, which lets us assign weightings (or boosts, as they are known in Ferret) to documents and fields.
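
To see exactly what lines 25 and 26 hand to the index, the same glob and Hash construction can be run without Ferret. This is a sketch only: collect_documents is a hypothetical helper introduced for illustration, and the files are created in a temporary directory.

```ruby
require 'tmpdir'
require 'fileutils'

# Hypothetical helper (not part of Ferret): builds the same Hash-per-file
# documents that index.rb adds to the index on line 26.
def collect_documents(data_dir)
  Dir[File.join(data_dir, '**', '*.txt')].sort.map do |file_name|
    {:file_name => file_name, :content => File.read(file_name)}
  end
end

Dir.mktmpdir do |dir|
  FileUtils.mkdir_p(File.join(dir, 'notes', 'old'))
  File.write(File.join(dir, 'readme.txt'), 'top level')
  File.write(File.join(dir, 'notes', 'old', 'todo.txt'), 'nested')
  File.write(File.join(dir, 'notes', 'image.png'), 'ignored')  # not a .txt file

  collect_documents(dir).each do |doc|
    puts "#{doc[:file_name]}: #{doc[:content]}"
  end
end
```

The `**` in the pattern matches any depth of subdirectory, including none, so both the top-level file and the nested one are picked up while the .png file is skipped.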

The Index#optimize method is called on line 28. This method optimizes the index for searching, and it is a good idea to call it whenever you index documents in a batch.^[1] On the following line, we close the index. Index#close makes sure that any data held in RAM is flushed to the index. It then commits the index and releases any locks that the Index object might be holding on the index.

Locking Problems

It is vitally important to close your index when you have finished with it, or you could run into locking errors later on. This is one of the most common causes of errors in Ferret. We'll cover locking issues in the "Index Locking and Concurrency Issues" section in Chapter 3.

To make things a little easier, you can pass a block to the Index.new method, as you would to File.open, so that the index is automatically closed when you have finished with it:

Ferret::Index::Index.new(:path => 'path/to/index') do |index|
  documents.each {|doc| index << doc}
  index.search_each(query) {|id, score| puts "#{score} #{index[id].load}"}
end
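
The block form follows a common Ruby pattern: the constructor yields the new object and closes it once the block finishes. The sketch below shows that general pattern with a hypothetical ToyIndex class. It is not Ferret's Index and makes no claim about how Ferret implements this internally; in particular, closing from an ensure clause is an assumption of this sketch.

```ruby
# ToyIndex is a hypothetical stand-in used to illustrate the pattern;
# it is not Ferret's Index class.
class ToyIndex
  attr_reader :docs

  def initialize
    @docs = []
    @closed = false
    return unless block_given?
    begin
      yield self
    ensure
      close   # runs even if the block raises
    end
  end

  # Add a document, refusing once the index has been closed.
  def <<(doc)
    raise IOError, 'index is closed' if @closed
    @docs << doc
    self
  end

  def close
    @closed = true
  end

  def closed?
    @closed
  end
end

index = ToyIndex.new do |i|
  i << {:file_name => 'a.txt', :content => 'hello'}
end
puts index.closed?   # prints "true"
```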

Creating an index is now simply a matter of running the indexer from the command line:

dave$ ruby index.rb text_files/ index_dir/

Now that we have an index, we need to be able to search it. That is why we built it, after all. The search code is as simple as the indexing code; take a look at Example 1-3.

Example 1-3. search.rb

  0 #!/usr/bin/env ruby
  1 require 'rubygems'
  2 require 'ferret'
  3 require 'fileutils'
  4 include Ferret
  5 include Ferret::Index
  6 
  7 def usage(message = nil)
  8   puts message if message
  9   puts "ruby #{File.basename(__FILE__)} <index dir> <search phrase>"
 10   exit(1)
 11 end
 12 
 13 usage() if ARGV.size != 2
 14 usage("Index '#{ARGV[0]}' doesn't exist.") unless File.directory?(ARGV[0])
 15 $index_dir, $search_phrase = ARGV
 16 
 17 index = Index.new(:path => $index_dir) 
 18 
 19 results = []
 20 total_hits = index.search_each($search_phrase) do |doc_id, score| 
 21   results << "  #{score} - #{index[doc_id][:file_name]}" 
 22 end
 23 
 24 puts "#{total_hits} documents matched your query:\n" + results.join("\n")
 25 
 26 index.close()

Document IDs in Ferret

In Ferret, documents are referenced by document IDs. You can think of a Ferret index as an array of documents indexed from 0. As each new document is added to an index, it is assigned the next available document ID. However, it is important to note that the document ID does not remain consistent for the life of a document in the index. As documents are updated and deleted from the index and index segments are merged and optimized, document IDs will change.
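
The "array of documents" picture can be made concrete with a toy model. In the sketch below, a plain Ruby Array stands in for the index and an entry's position plays the role of its document ID. This is an illustration only, not how Ferret stores documents; doc_id_for is a hypothetical helper.

```ruby
# Toy model only: the Array position of each entry plays the role of the
# document ID. doc_id_for is a hypothetical helper, not a Ferret method.
def doc_id_for(docs, file_name)
  docs.index { |doc| doc && doc[:file_name] == file_name }
end

docs = [
  {:file_name => 'a.txt'},
  {:file_name => 'b.txt'},
  {:file_name => 'c.txt'}
]
before = doc_id_for(docs, 'c.txt')

docs[1] = nil          # "delete" b.txt, leaving a gap
docs = docs.compact    # "optimize": the gap is squeezed out

after = doc_id_for(docs, 'c.txt')
puts "c.txt moved from ID #{before} to ID #{after}"
# prints "c.txt moved from ID 2 to ID 1"
```

For the same reason, it is usually safer to find a document again through a field you control, such as :file_name, than through a document ID remembered from an earlier search.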

On line 21 we simply format each hit as a string and append it to the results array. The document ID is used to fetch the document from the index, and the document itself acts like a Hash object, so index[doc_id][:file_name] returns the stored file name. If you would like to build an index of a large number of text files, check out Project Gutenberg (http://www.gutenberg.org/). Go ahead and try out the search script:

dave$ ruby search.rb index_dir/ "Moby Dick"

^[1] When doing incremental indexing, as you might do in a Rails application, it is better not to call the optimize method. You'll learn more about this in the "Optimizing the Index" section in Chapter 3.

Ferret by David Balmain