Chapter 13: Big Data

There is a lot of overlap between the terms “data science” and “big data.” In practice the two are closely related, but they mean different things. Big Data refers to several trends in data storage and processing that have posed new challenges, provided new opportunities, and demanded new solutions. Often, these Big Data problems required a level of software engineering expertise that traditional statisticians and data analysts did not have. They also raised a lot of difficult, ill-posed questions, such as how best to segment users based on raw click-stream data. This demand is what turned “data scientist” into a new, distinct job title. But modern data scientists tackle problems of any scale and use Big Data technologies only when they're the right tool for the job.

Big Data is also an area where low-level software engineering concerns become especially important for data scientists. Data scientists always need to think hard about the logic of their code, but in most work performance is a strictly secondary concern. In Big Data, though, it's easy to accidentally add several hours to your code's runtime, or even to have the code fail several hours in because of a memory error, if you do not keep an eye on what's going on inside the computer.
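To make the memory concern concrete, here is a minimal sketch in Python of counting clicks per user from click-stream records. The record format (`user_id,url`), the function names, and the sample data are all assumptions made for illustration; they do not come from any particular system. The point is only that the records are consumed one at a time through a generator, so memory use grows with the number of distinct users rather than with the number of records.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple


def read_events(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (user_id, url) pairs one at a time instead of building a list."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        user_id, url = line.split(",", 1)
        yield user_id, url


def clicks_per_user(lines: Iterable[str]) -> Counter:
    """Count clicks per user, holding only the running counts in memory."""
    counts: Counter = Counter()
    for user_id, _url in read_events(lines):
        counts[user_id] += 1
    return counts


# Hypothetical sample records standing in for a large click-stream file.
sample = ["u1,/home", "u2,/home", "u1,/cart", "", "u1,/checkout"]
result = clicks_per_user(sample)
print(result)
```

Because a Python file object iterates over its lines lazily, the same function could be given an open file in place of the sample list. Calling `list(...)` on the records first would produce the same counts but would hold every record in memory at once, which is the kind of choice that goes unnoticed on small data and fails on large data.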

This chapter will start with an overview of two pieces of Big Data software that are particularly important: the Hadoop file system, which stores data on clusters, and the Spark cluster computing framework, ...
