Preface

While studying information retrieval and search engines at the University of Southern California in the summer of 2005, I became interested in the Apache Nutch project. My professor, Dr. Ellis Horowitz, had recently discovered Nutch and thought it a good platform for the students in the course to get real-world experience during the final project phase of his “CS599: Seminar on Search Engines” course.

After poking around Nutch and digging into its innards, I decided on a final project. It was a Really Simple Syndication (RSS) plugin described in detail in NUTCH-30.[1] The plugin read an RSS file, extracted its outgoing web links and text, and fed that information back into the Nutch crawler for later indexing and retrieval.

1https://issues.apache.org/jira/browse/NUTCH-30 ...

Get Tika in Action now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.