Preface

I first became fascinated by information retrieval during my time as a student at the University of Sydney. Back in those days, finding what you wanted on the Web was an art form. I kept a list of six or seven search engines that I used every time I had a query, and even then I had to page through their results before I found what I was looking for. As the Web exploded, it seemed to be getting harder to find anything. Then, during my last year at university, I found that more and more I was turning to a single search engine for all my queries, and most of the time I didn’t even need to click past the first page of results. A newcomer at the time, Google went on to demonstrate, through its success, the importance of accurate information retrieval on the Web.

Around that time I was writing a statistical natural language parser for my honors thesis, so I was no stranger to large-scale text processing. After a few years working as an IT consultant, I was really keen to get back to work on something algorithmically interesting. In 2004, I left the corporate life to go and train in Judo at the Kodokan in Japan, and I found myself with a little spare time on my hands. I also had a new-found interest in the Ruby programming language, which was perfect for most of the little side projects I was working on except for one thing: it was missing a good information retrieval library. This was the perfect project for me to learn Ruby and to get my hands dirty doing some interesting programming. So I began my first port of the Apache Lucene information retrieval library into Ruby.

As it turns out, Ruby is not the best programming language in which to write this kind of processor-intensive text-processing code. That port, Ferret, has since evolved into one of the fastest search libraries available. It has been through three rewrites and is now written completely in C with Ruby bindings, and the algorithms and file formats have changed significantly from those used in Lucene to improve performance. It is now used in projects all over the world, online and offline, from large-scale news web sites to legal archive search engines. From what started out as a side project to learn a new programming language, Ferret has come a long way.

—David Balmain

February 2008

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic

Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width

Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, datatypes, environment variables, statements, and keywords.

Constant width bold

Shows commands or other text that should be typed literally by the user.

Constant width italic

Shows text that should be replaced with user-supplied values or by values determined by context.

Tip

This icon signifies a tip, suggestion, or general note.

Caution

This icon indicates a warning or caution.
