Other Indexing Improvements

Now that we can index several types of documents, it would be nice to have more control over the indexing process. We should be able to specify multiple directories to add, along with file path patterns. Files should be added only when they need to be, that is, when they haven’t been added yet or have been modified since they were added. We’d also like a way to update the index so that modified files are reindexed and deleted files are removed from the index.
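
As a rough illustration of the first requirement, the sketch below gathers files from several directories and keeps only those whose names match a set of glob patterns. The helper name candidate_files and its arguments are hypothetical and not part of the book's code; it relies only on Ruby's standard Find and File.fnmatch?.

```ruby
require "find"

# Hypothetical helper (not from the original listing): walk each directory
# recursively and keep regular files whose basename matches at least one
# of the given glob patterns.
def candidate_files(dirs, patterns = ["*"])
  dirs.flat_map do |dir|
    Find.find(dir).select do |path|
      File.file?(path) &&
        patterns.any? { |pattern| File.fnmatch?(pattern, File.basename(path)) }
    end
  end.uniq.sort
end
```

Calling candidate_files(["docs", "notes"], ["*.txt", "*.pdf"]) would return a sorted list of matching paths under both directories. Matching on the basename is a design choice made for this sketch; matching against the full path would work equally well.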

To implement these requirements, we use Ruby’s DBM class to record the time each file was added to the index. A DBM is essentially a Hash persisted to disk, and we store ours in the /path/to/index/added_at file. Because the filename added_at doesn’t begin with an underscore, it won’t conflict with any of the index files. It makes sense to keep it alongside the rest of the index files, since it is effectively just another index file. Here is the code used to add files to the index:

if not options.add.empty?
  # Bring IndexWriter, FieldInfos, etc. into scope
  include Ferret::Index
  # Load every reader found in the readers/ directory next to this script
  readers_dir = File.join(File.dirname(__FILE__), "readers", "*.rb")
  Dir[readers_dir].each {|fn| require fn}
  # Default for fields: untokenized, no norms, no term vectors
  field_infos = FieldInfos.new(:index => :untokenized_omit_norms,
                               :term_vector => :no)
  # The :content field is indexed (tokenized) but not stored
  field_infos.add_field(:content, :store => :no, :index => :yes)
  FerretFind::Reader.load_readers(field_infos)
  writer = IndexWriter.new(:path => options ...
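
The listing above is cut off, so the following is only a minimal sketch of the added_at bookkeeping described earlier, not the book's implementation. The method names are hypothetical. A plain Hash stands in for the DBM so the example runs without the dbm library; a DBM opened on the added_at file could be passed in the same way, since both respond to [], []=, and keys. DBM stores only Strings, which is why the timestamp is saved as a String.

```ruby
# Sketch only: `added_at` is any Hash-like store mapping a file path to the
# time it was added, kept as a String of epoch seconds (DBM values must be
# Strings). A plain Hash is assumed here as a stand-in for the DBM.

# True when the file has never been added, or was modified after it was added.
def needs_indexing?(added_at, path)
  stamp = added_at[path]
  stamp.nil? || File.mtime(path).to_i > stamp.to_i
end

# Record the time the file was added to the index.
def mark_added(added_at, path, now = Time.now)
  added_at[path] = now.to_i.to_s
end

# Paths recorded in the store whose files no longer exist on disk; these
# are the documents an update pass would delete from the index.
def deleted_paths(added_at)
  added_at.keys.reject { |path| File.exist?(path) }
end
```

An update pass could then reindex every path for which needs_indexing? returns true and remove every path returned by deleted_paths. Because the sketch compares whole seconds, a file modified within the same second it was added would not be detected as changed.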
