Lucene is an open source, Java-based full-text search engine library that can be integrated with nearly any software application requiring search capabilities. Thanks to its scalable, high-performance indexing, Lucene has been used in Internet search engines as well as for local and single-site search.

Lucene is a 100% pure-Java library, yet ports and bindings exist for other languages and platforms such as C, C++, .NET and Python. Salient features such as an indexing rate of over 95GB/hour on modern hardware, very small RAM requirements, and an index roughly 20-30% the size of the indexed text make Lucene a good choice for web search engine applications.

Let’s look at how a web page can be indexed and searched using the Lucene library. Before starting, keep in mind that Lucene is not limited to web pages; it can index and search anything that can be represented as text. The example comes from the Searching with Lucene section of Algorithms of the Intelligent Web by Haralambos Marmanis and Dmitry Babenko. The class used in the example is called FetchAndProcessCrawler, and it can quickly read stored web pages as well as retrieve data directly from the Internet.
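
To make "index, then search" more concrete, here is a minimal sketch of the inverted-index idea that a search library like Lucene is built around. It is not Lucene's API and it is not the book's code: the class name TinyInvertedIndex and its methods are hypothetical, and the sketch uses only the Java standard library. It maps each term to the set of documents that contain it, then answers a query by intersecting those sets.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Conceptual sketch only: NOT Lucene's API. All names here are hypothetical.
class TinyInvertedIndex {
    // term -> sorted set of ids of the documents that contain the term
    private final Map<String, Set<String>> postings = new HashMap<>();

    // Lowercase the text and split it on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    // Index one document: record its id under every term it contains.
    void add(String docId, String text) {
        for (String term : tokenize(text)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    // Return the ids of documents that contain every query term (AND semantics).
    Set<String> search(String query) {
        Set<String> result = null;
        for (String term : tokenize(query)) {
            Set<String> docs = postings.getOrDefault(term, new TreeSet<>());
            if (result == null) {
                result = new TreeSet<>(docs);   // copy so the index is not modified
            } else {
                result.retainAll(docs);         // intersect with the next term's documents
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.add("page1.html", "Lucene is a full text search library");
        index.add("page2.html", "A crawler fetches web pages for indexing");
        index.add("page3.html", "Full text indexing with Lucene");

        System.out.println(index.search("lucene full text")); // [page1.html, page3.html]
        System.out.println(index.search("crawler"));          // [page2.html]
    }
}
```

Lucene itself layers much more on top of this idea, including configurable text analysis, relevance scoring and on-disk index files, so treat the sketch as an illustration of the concept rather than a substitute for the library.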

In the book’s example, crawling takes only a few seconds, and when it finishes it has created a new directory under the base directory, for instance “C:/iWeb2/data/ch02/crawl-1200697910111”. At that point the desired web data has been indexed and is ready to be searched for any terms.

If you want to get a good overview of using Lucene, be sure to read this opening section of Lucene in Action, Second Edition by Michael McCandless, Erik Hatcher, and Otis Gospodnetic. The latest release of Lucene can be downloaded from the official website.

Lucene now powers search in diverse companies including Akamai, Netflix, LinkedIn, Technorati, HotJobs, Epiphany, FedEx, Mayo Clinic, MIT, New Scientist Magazine, and many others. Some things remain the same, though. Lucene still delivers high-performance search features in a disarmingly easy-to-use API. Due to its vibrant and diverse open-source community of developers and users, Lucene is relentlessly improving, with evolutions to APIs, significant new features such as payloads, and a huge increase (as much as 8x) in indexing speed with Lucene 2.3. And with clear writing, reusable examples, and unmatched advice on best practices, Lucene in Action, Second Edition is still the definitive guide to developing with Lucene.

Algorithms of the Intelligent Web is an example-driven blueprint for creating applications that collect, analyze, and act on the massive quantities of data users leave in their wake as they use the web. You’ll learn how to build Amazon- and Netflix-style recommendation engines, and how the same techniques apply to people matches on social-networking sites. See how click-trace analysis can result in smarter ad rotations. With a plethora of examples and extensive detail, this book shows you how to build Web 2.0 applications that are as smart as your users.

About the author

Jawad Masood is the lead developer at gKrypt Data Security Solutions, a startup focused on providing cost-effective data security solutions using a mix of multi-core and many-core processors to deliver accelerated bulk data encryption and compression performance. He can be reached at jawad@tunacode.com.

Tags: crawling, indexing, java, Lucene, parsing, search, Web App
