Chapter 8. Building a text analysis toolkit

This chapter covers
  • A brief introduction to Lucene
  • Understanding tokenizers, TokenStream, and analyzers
  • Building an analyzer to detect phrases and inject synonyms
  • Use cases for leveraging the infrastructure

It’s now common for applications to leverage user-generated content (UGC). Users generate content in many ways: writing blog entries, sending messages to others, answering or posing questions on message boards, keeping journals, or creating lists of related items. In chapter 3, we looked at the use of tagging to represent metadata associated with content. We mentioned that tags can also be detected by an automated algorithm.

In this chapter, we build a toolkit to analyze ...
