Chapter 10. Mass Text Data Processing

In this chapter, we will cover the following topics:

Data preprocessing (extraction, cleaning, and format conversion) using Hadoop streaming and Python
De-duplicating data using Hadoop streaming
Loading large datasets to an Apache HBase data store – importtsv and bulkload
Creating TF and TF-IDF vectors for the text data
Clustering text data using Apache Mahout
Topic discovery using Latent Dirichlet Allocation (LDA)
Document classification using Mahout Naive Bayes Classifier

Introduction

Hadoop MapReduce, together with its supporting set of projects, is a good framework for processing large text datasets and for performing extract-transform-load (ETL) operations.

In this chapter, we'll be exploring how to use Hadoop ...

Hadoop MapReduce v2 Cookbook - Second Edition by Thilina Gunarathne
