Chapter 10. Mass Text Data Processing

In this chapter, we will cover the following topics:

  • Data preprocessing (extract, clean, and format conversion) using Hadoop streaming and Python
  • De-duplicating data using Hadoop streaming
  • Loading large datasets to an Apache HBase data store – importtsv and bulkload
  • Creating TF and TF-IDF vectors for the text data
  • Clustering text data using Apache Mahout
  • Topic discovery using Latent Dirichlet Allocation (LDA)
  • Document classification using Mahout Naive Bayes Classifier

Introduction

Hadoop MapReduce, together with its supporting set of projects, is a good framework for processing large text datasets and for performing extract-transform-load (ETL) type operations.

In this chapter, we'll be exploring how to use Hadoop ...
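As a rough illustration of the first topic listed above, the following is a minimal sketch of a map-only cleaning step written for Hadoop streaming in Python. It is not the book's recipe: the function names (clean_record, map_lines), the tag-stripping regular expression, and the choice to emit one cleaned line per record are all assumptions made for this example. It relies only on the fact that a streaming mapper reads records from standard input and writes results to standard output.

```python
import re
import sys

# Hypothetical cleaning rules chosen for this sketch: drop anything that
# looks like a markup tag, then collapse runs of whitespace.
TAG_RE = re.compile(r"<[^>]+>")
SPACE_RE = re.compile(r"\s+")


def clean_record(line):
    """Strip tag-like markup and collapse whitespace in one input line."""
    text = TAG_RE.sub(" ", line)
    return SPACE_RE.sub(" ", text).strip()


def map_lines(lines):
    """Yield one cleaned line per non-empty input record."""
    for line in lines:
        cleaned = clean_record(line)
        if cleaned:  # skip records that are empty after cleaning
            yield cleaned


if __name__ == "__main__" and not sys.stdin.isatty():
    # Hadoop streaming pipes each input split to the mapper on stdin
    # and collects whatever the mapper prints to stdout.
    for output in map_lines(sys.stdin):
        print(output)
```

Because the script only uses standard input and output, it can be tried locally by piping a sample file into it before it is submitted as the mapper of a streaming job.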
