Chapter 10: Machine Learning as a Batch Process

This chapter investigates batch processing as a way to mine and learn from large, stored data sets, as opposed to streaming data. After considering the size of the data and what you hope to learn from it, you look at various tools to extract, transform, and process the data for useful results.

This chapter covers using Hadoop, Sqoop, and Pig for large-scale batch processing; these tools enable large data sets to be processed with relative ease. The chapter also discusses more traditional methods of creating programs to run batch processes on data.

Is It Big Data?

Although this book is about machine learning, I can't ignore the term “Big Data,” which is increasingly a topic in business today. The phrase is touted as a savior because it promises to let companies see new things in their existing data. The term is broad, but it ultimately reduces to one concept: a data set that has become so large that it is difficult to process with traditional tools.

Depending on whom you ask, you might hear, “It's not Big Data if it's not working on petabytes of data,” or “When it becomes too big for a traditional database, then it's Big Data.” Both statements are valid. Personally, I like the term “data,” regardless of whether the amount is big or small.

As time marches on, the answer to the “What is Big Data?” question will constantly change. The tools will also adapt, improve, and provide different insight. The key question ...
