Summary

In this chapter, we explored advanced patterns for using Pig to analyze unstructured text data.

We started with the context and motivation behind clustering text data; we then briefly examined several techniques, followed by a use case elaborated through Pig code. Similarly, we saw the relevance of topic models to understanding the latent context of textual documents, using example text on Big Data and medicine. We also explored how Pig integrates with Python's NLTK library to perform natural language processing, decomposing a text corpus into sentences and recognizing named entities; these entities are eventually used in indexing and information ...
