A Quick Example: Indexing the Filesystem

With the explosion of the Internet, a huge amount of information has become available to us. But it doesn’t matter how much information is available if we can’t find what we are looking for. Luckily, companies like Google and Yahoo! have come to the rescue by helping us find the information we need with their search engines.

More recently, the same thing has been happening on our personal computers. More and more of our personal lives are being stored on hard drives—everything from work documents and email to multimedia files and family photos. Carefully categorizing all this data and scanning through large hierarchies of folders just doesn’t cut it anymore. We need a fast way to access the data we need. Presently, some of the tools commonly used for this task, such as the built-in search in Windows, leave a lot to be desired. Spotlight on OS X is much closer to what we need.

By the end of this book, you’ll have built a search application that will make searching your hard drive as easy as searching the Web. In this section, we start with plain old text files. Let’s begin by writing a command-line indexing program that takes two arguments: the name of the directory we want to index, and the name of the directory in which the index will be stored. Take a look at Example 1-2.

Example 1-2. index.rb

  0 #!/usr/bin/env ruby
  1 require 'rubygems'
  2 require 'ferret'
  3 require 'fileutils'
  4 include Ferret
  5 include Ferret::Index
  6 
  7 def usage(message = nil)
  8   puts message if message
  9   puts "ruby #{File.basename(__FILE__)} <data dir> <index dir>"
 10   exit(1)
 11 end
 12 
 13 usage() if ARGV.size != 2
 14 usage("Directory '#{ARGV[0]}' doesn't exist.") unless File.directory?(ARGV[0])
 15 $data_dir, $index_dir = ARGV
 16 begin
 17   FileUtils.mkdir_p($index_dir)
 18 rescue
 19   usage("Can't create index directory '#$index_dir'.")
 20 end
 21 
 22 index = Index.new(:path => $index_dir,          
 23                   :create => true)
 24 
 25 Dir["#$data_dir/**/*.txt"].each do |file_name|  
 26   index << {:file_name => file_name, :content => File.read(file_name)} 
 27 end
 28 index.optimize()                                
 29 index.close()

Most of this code handles command-line arguments and can be safely skimmed. The interesting part begins on line 22, where we create the index. The :path parameter specifies where the index will be stored. Setting the :create parameter to true tells Ferret to create a new index in that directory. Any index already residing there will be overwritten, so be careful when setting :create to true.
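
Because :create => true overwrites whatever index is already in the directory, it can be worth checking the directory before opening the index. The following is a minimal sketch of such a check using only Ruby's standard library. The helper name safe_to_create? and the file name leftover.dat are hypothetical and are not part of Ferret.

```ruby
require 'tmpdir'
require 'fileutils'

# Hypothetical helper (not part of Ferret): returns true when it is safe to
# build a fresh index in +dir+, i.e. the directory is missing or empty.
def safe_to_create?(dir)
  !Dir.exist?(dir) || Dir.children(dir).empty?
end

Dir.mktmpdir do |tmp|
  new_dir  = File.join(tmp, 'new_index')   # never created
  used_dir = File.join(tmp, 'old_index')   # holds a stand-in file
  FileUtils.mkdir_p(used_dir)
  File.write(File.join(used_dir, 'leftover.dat'), 'stand-in for existing data')

  puts safe_to_create?(new_dir)    # directory does not exist yet
  puts safe_to_create?(used_dir)   # directory already contains a file
end
```

In index.rb, a check like this could sit just before line 22 and call usage() when the directory is not empty. Whether to refuse or simply warn is a design choice left to the reader.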

Once the index is created, we need to add documents to it. Line 25 scans the directory tree for all files ending in .txt. Line 26 is where most of the action happens. Although we can add simple Strings to an index, here we use a Hash because we want each document to have two fields: a :file_name field and a :content field. Later, we’ll learn about the Document class, which lets us assign weightings (or boosts, as they are known in Ferret) to documents and fields.
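
To see what lines 25 and 26 hand to the index, the sketch below builds the same two-field Hashes but collects them in a plain Array instead of adding them to Ferret. The helper name collect_documents and the sample files are made up for illustration, and the sort call is added only to make the output order predictable.

```ruby
require 'tmpdir'
require 'fileutils'

# Hypothetical helper: builds one Hash per .txt file under +data_dir+,
# using the same two fields as index.rb.
def collect_documents(data_dir)
  Dir["#{data_dir}/**/*.txt"].sort.map do |file_name|
    {:file_name => file_name, :content => File.read(file_name)}
  end
end

Dir.mktmpdir do |data_dir|
  # A tiny made-up directory tree: two text files and one file to be skipped.
  FileUtils.mkdir_p(File.join(data_dir, 'notes'))
  File.write(File.join(data_dir, 'a.txt'), 'top-level file')
  File.write(File.join(data_dir, 'notes', 'b.txt'), 'nested file')
  File.write(File.join(data_dir, 'notes', 'c.log'), 'skipped: not a .txt file')

  collect_documents(data_dir).each do |doc|
    puts "#{doc[:file_name]}: #{doc[:content]}"
  end
end
```

The `**/` part of the pattern matches zero or more directory levels, so files directly inside the data directory are picked up along with nested ones.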

The Index#optimize method is called on line 28. This method optimizes the index for searching, and it is a good idea to call it after any batch indexing run.[1] On the following line, we close the index. Index#close makes sure that any data held in RAM is flushed to the index. It then commits the index and releases any locks that the Index object might be holding on it.

Creating an index is now simply a matter of running the indexer from the command line:

dave$ ruby index.rb text_files/ index_dir/

Now that we have an index, we need to be able to search it. That is why we built it, after all. The search code is as simple as the indexing code; take a look at Example 1-3.

Example 1-3. search.rb

  0 #!/usr/bin/env ruby
  1 require 'rubygems'
  2 require 'ferret'
  3 require 'fileutils'
  4 include Ferret
  5 include Ferret::Index
  6 
  7 def usage(message = nil)
  8   puts message if message
  9   puts "ruby #{File.basename(__FILE__)} <index dir> <search phrase>"
 10   exit(1)
 11 end
 12 
 13 usage() if ARGV.size != 2
 14 usage("Index '#{ARGV[0]}' doesn't exist.") unless File.directory?(ARGV[0])
 15 $index_dir, $search_phrase = ARGV
 16 
 17 index = Index.new(:path => $index_dir) 
 18 
 19 results = []
 20 total_hits = index.search_each($search_phrase) do |doc_id, score| 
 21   results << "  #{score} - #{index[doc_id][:file_name]}" 
 22 end
 23 
 24 puts "#{total_hits} matched your query:\n" + results.join("\n")
 25 
 26 index.close()

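Line 17 opens the index we built earlier; this time we don't set :create, so the existing index is left intact. Line 20 runs the query with search_each, which passes each matching document's ID and score to the block and returns the total number of hits. On line 21 we format each result as a string. Passing the document ID to the index returns the document, which acts like a Hash object, so we can read its :file_name field. If you would like to build an index of a large number of text files, check out Project Gutenberg (http://www.gutenberg.org/). Go ahead and try out the search script: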
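
The result formatting on lines 19 through 24 can be tried out without an index at all. In this sketch, a plain Array of Hashes stands in for the Ferret index, and a list of [doc_id, score] pairs stands in for what the search_each block receives. The helper name format_hits, the file names, and the scores are all invented for illustration, and the count is simply the number of pairs in the stand-in list.

```ruby
# Stand-in for a Ferret index: an Array of Hashes, looked up by position.
documents = [
  {:file_name => 'books/moby_dick.txt'},
  {:file_name => 'books/walden.txt'}
]

# Invented [doc_id, score] pairs, in the shape the search block receives.
hits = [[0, 0.75], [1, 0.25]]

# Hypothetical helper mirroring the string building in search.rb.
def format_hits(documents, hits)
  results = hits.map do |doc_id, score|
    "  #{score} - #{documents[doc_id][:file_name]}"
  end
  "#{hits.size} matched your query:\n" + results.join("\n")
end

puts format_hits(documents, hits)
```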

dave$ ruby search.rb index_dir/ "Moby Dick"


[1] When doing incremental indexing, as you might do in a Rails application, it is better not to call the optimize method. You’ll learn more about this in the “Optimizing the Index” section in Chapter 3.
