Chapter 6. Processing

Getting the concise, valuable information you want from a sea of data can be challenging, but there’s been a lot of progress on systems that help you turn your datasets into something that makes sense. Because the barriers vary so widely, the tools range from rapid statistical analysis systems to enlisting human helpers.

The R project is both a specialized language and a toolkit of modules aimed at anyone working with statistics. It covers everything from loading your data to running sophisticated analyses on it and then either exporting or visualizing the results. The interactive shell makes it easy to experiment with your data, since you can try out a lot of different approaches very quickly. The biggest downside from a data processing perspective is that it’s designed to work with datasets that fit within a single machine’s memory. It is possible to use it within Hadoop as another streaming language, but a lot of the most powerful features require access to the complete dataset to be effective. R makes a great prototyping platform for designing solutions that need to run on massive amounts of data, though, or for making sense of the smaller-scale results of your processing.
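
To make the memory limitation concrete, here is a minimal Python sketch (not R, and not tied to any real Hadoop job) of why some statistics split cleanly across chunks of data while others need the complete dataset. The chunk values and the helper names partial_stats and combine_mean are invented for illustration. A mean can be rebuilt from small per-chunk summaries, which suits a streaming job, but a median generally cannot.

```python
import statistics

def partial_stats(chunk):
    """Summarize one chunk as (count, total); small enough to pass between workers."""
    return len(chunk), sum(chunk)

def combine_mean(partials):
    """Rebuild the overall mean from the per-chunk (count, total) summaries."""
    count = sum(n for n, _ in partials)
    total = sum(t for _, t in partials)
    return total / count

# Hypothetical data, split the way a streaming job might see it.
chunks = [[1, 2, 3], [4, 5], [100]]
all_values = [v for chunk in chunks for v in chunk]

# The mean survives being computed piece by piece.
streamed_mean = combine_mean([partial_stats(c) for c in chunks])

# The median does not: combining per-chunk medians gives a different answer
# from the median of the complete dataset.
median_of_medians = statistics.median(statistics.median(c) for c in chunks)
true_median = statistics.median(all_values)

print(streamed_mean, statistics.mean(all_values))
print(median_of_medians, true_median)
```

Analyses that behave like the median here are the ones that are hard to run as a streaming step, which is one reason to prototype on a sample that fits in memory first.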

It’s been several years since Yahoo! released the Pipes environment, but it’s still an unsurpassed tool for building simple data pipelines. It has a graphical interface where you drag and drop components, linking them together into flows of processing operations. ...
