Preface

The term big data has come into vogue to describe an exciting new set of tools and techniques for modern, data-powered applications that are changing the way the world computes. Much to the statistician’s chagrin, this ubiquitous term seems to be liberally applied to include the application of well-known statistical techniques to large datasets for predictive purposes. Although big data is now officially a buzzword, the fact is that modern, distributed computation techniques are enabling analyses of datasets far larger than those typically examined in the past, with stunning results.

Distributed computing alone, however, does not directly lead to data science. Through the combination of rapidly increasing datasets generated from the Internet and the observation that these datasets are able to power predictive models (“more data is better than better algorithms”1), data products have become a new economic paradigm. Stunning successes of data modeling across large heterogeneous datasets—for example, Nate Silver’s seemingly magical ability to predict the 2008 election using big data techniques—have led to a general acknowledgment of the value of data science and have brought a wide variety of practitioners to the field.

Hadoop has evolved from a cluster-computing abstraction to an operating system for big data by providing a framework for distributed data storage and parallel computation. Spark has built upon those ideas and made cluster computing more accessible to data scientists. However, data scientists and analysts new to distributed computing may feel that these tools are programmer oriented rather than analytically oriented. This is because a fundamental shift needs to occur in thinking about how we manage and compute upon data in a parallel fashion instead of a sequential one.

This book is intended to prepare data scientists for that shift in thinking by providing an overview of cluster computing and analytics in a readable, straightforward fashion. We will introduce most of the concepts, tools, and techniques involved with distributed computing for data analysis and provide a path for deeper dives into specific topic areas.

What to Expect from This Book

This book is not an exhaustive compendium on Hadoop (see Tom White’s excellent Hadoop: The Definitive Guide for that) or an introduction to Spark (we instead point you to Holden Karau et al.’s Learning Spark), and is certainly not meant to teach the operational aspects of distributed computing. Instead, we offer a survey of the Hadoop ecosystem and distributed computation intended to arm data scientists, statisticians, programmers, and folks who are interested in Hadoop (but whose current knowledge of it is just enough to make them dangerous). We hope that you will use this book as a guide as you dip your toes into the world of Hadoop and find the tools and techniques that interest you the most, be it Spark, Hive, machine learning, ETL (extract, transform, and load) operations, relational databases, or one of the many other topics related to cluster computing.

Who This Book Is For

Data science is often erroneously conflated with big data, and while many machine learning model families do require large datasets in order to be widely generalizable, even small datasets can pack a pattern recognition punch. For that reason, most of the data science software literature focuses on corpora or datasets that are easily analyzable on a single machine (especially machines with many gigabytes of memory). Although big data and data science are well suited to work in concert with each other, the computing literature has treated them separately until now.

This book intends to fill that gap by writing for an audience of data scientists. It will introduce you to the world of clustered computing and analytics with Hadoop, from a data science perspective. The focus will not be on deployment, operations, or software development, but rather on common analyses, data warehousing techniques, and higher-order data workflows.

So who are data scientists? We expect that a data scientist is a software developer with strong statistical skills or a statistician with strong software development skills. Typically, our data teams are composed of three types of data scientists: data engineers, data analysts, and domain experts.

Data engineers are programmers or computer scientists who can build or utilize advanced computing systems. They typically program in Python, Java, or Scala and are familiar with Linux, servers, networking, databases, and application deployment. For those data engineers reading this book, we expect that you’re accustomed to the difficulties of programming multi-process code as well as the challenges of data wrangling and numeric computation. We hope that after reading this book you’ll have a better understanding of deploying your programs across a cluster and handling much larger datasets than can be processed by a single computer in a reasonable amount of time.

Data analysts focus primarily on the statistical modeling and exploration of data. They typically use R, Python, or Julia in their day-to-day work, and should be familiar with data mining and machine learning techniques, including regressions, clustering, and classification problems. Data analysts have probably dealt with larger datasets through sampling. We hope that in this book we can show statistical techniques that take advantage of much larger populations of data than were accessible before—allowing the construction of models that have depth as well as breadth in their predictive ability.

Finally, domain experts are those influential, business-oriented members of a team who deeply understand the types of data and problems that are encountered. They understand the specific challenges of their data and are looking for better ways to make the data productive to solve new challenges. We hope that our book will give them an idea about how to make business decisions that add flexibility to current data workflows, as well as an understanding of how general computation frameworks might be applied to specific domain challenges.

How to Read This Book

Hadoop is now over 10 years old, a very long time in technology terms. Moore’s law has not yet slowed down, and whereas 10 years ago the use of an economical cluster of machines was far simpler in data center terms than programming for supercomputers, those same economical servers are now approximately 32 times more powerful, and the cost of in-memory computing has gone way down. Hadoop has become an operating system for big data, allowing a variety of computational frameworks from graph processing to SQL-like querying to streaming. This presents a significant challenge to those who are interested in learning about Hadoop—where to start?

We set a very low page limit on this book for a reason: to cover a lot of ground as briefly as possible. We hope that you will read this book in two ways: either as a short, cover-to-cover read that will serve as a broad introduction to Hadoop and distributed data analytics, or by selecting chapters of interest as a preliminary step to doing a deep dive. The purpose of this book is to be accessible. We chose simple examples to expose ideas in code, not necessarily for the reader to implement and run themselves. This book should be a guidebook to the world of Hadoop and Spark, particularly for analytics.

Overview of Chapters

This book is intended to be a guided walkthrough of the Hadoop ecosystem, and as such we’ve laid out the book in two broad parts. Part I (Chapters 1–5) introduces distributed computing at a very high level, discussing how to run computations on a cluster. Part II (Chapters 6–10) focuses more specifically on tools and techniques that should be recognizable to data scientists, and intends to provide a motivation for a variety of analytics and large-scale data management. (Chapter 5 serves as a transition from the broad discussion of distributed computing to more specific tools and an implementation of the big data science pipeline.) The chapter breakdown is as follows:

Chapter 1, The Age of the Data Product

We begin the book with an introduction to the types of applications that big data and data science produce together: data products. This chapter discusses the workflow behind creating data products and specifies how the sequential model of data analysis fits into the distributed computing realm.

Chapter 2, An Operating System for Big Data

Here we provide an overview of the core concepts behind Hadoop and what makes cluster computing both beneficial and difficult. The Hadoop architecture is discussed in detail with a focus on both YARN and HDFS. Finally, this chapter discusses interacting with the distributed storage system in preparation for performing analytics on large datasets.

Chapter 3, A Framework for Python and Hadoop Streaming

This chapter covers the fundamental programming abstraction for distributed computing: MapReduce. However, the native MapReduce API is written in Java, a programming language that is not popular with data scientists. Therefore, this chapter focuses on how to write MapReduce jobs in Python with Hadoop Streaming.
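To make the Streaming model concrete, here is a minimal word-count sketch of our own (it is not one of the book’s chapter examples): a Streaming mapper and reducer are just ordinary programs that read lines from stdin and write tab-delimited key/value pairs to stdout, with Hadoop sorting the mapper’s output by key between the two phases.

```python
#!/usr/bin/env python
"""A minimal word count in the Hadoop Streaming style (illustrative sketch)."""
import sys
from itertools import groupby


def mapper(lines):
    # Emit one (word, 1) pair per token in the input.
    for line in lines:
        for word in line.strip().split():
            yield word.lower(), 1


def reducer(pairs):
    # Hadoop delivers reducer input sorted by key, so consecutive pairs
    # with the same key can be summed with groupby.
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)


if __name__ == "__main__":
    # In a real job this logic runs as two separate scripts, each reading
    # sys.stdin; here the phases are chained over a small in-memory sample.
    sample = ["the quick fox", "the fox"]
    for word, total in reducer(sorted(mapper(sample))):
        sys.stdout.write("%s\t%d\n" % (word, total))
```

On a cluster, the mapper and reducer would live in separate scripts submitted through the Hadoop Streaming jar with the -mapper and -reducer flags (the jar’s exact path varies by distribution), a workflow Chapter 3 covers in full.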

Chapter 4, In-Memory Computing with Spark

While understanding MapReduce is essential to understanding distributed computing and writing high-performance batch jobs such as ETL, day-to-day interaction and analysis on a Hadoop cluster are usually done with Spark. Here we introduce Spark and show how to program Python Spark applications to run on YARN, either interactively using PySpark or in cluster mode.

Chapter 5, Distributed Analysis and Patterns

In this chapter, we take a practical look at how to write distributed data analysis jobs through the presentation of design patterns and parallel analytical algorithms. Coming into this chapter, you should understand the mechanics of writing Spark and MapReduce jobs; coming out of it, you should feel comfortable actually implementing them.
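As a small taste of the kind of pattern Chapter 5 presents, consider computing a global mean: per-partition means cannot simply be averaged, so each partition instead emits a (sum, count) pair, and those pairs are combined associatively. The sketch below is ours, in plain Python with hypothetical helper names; it illustrates the shape of the pattern rather than any particular Spark or MapReduce API.

```python
"""Sketch of a common distributed design pattern: partial aggregates."""
from functools import reduce


def partial_aggregate(partition):
    # Runs independently on each partition (the "map" side): emit a
    # (sum, count) pair rather than a mean, which cannot be combined.
    return sum(partition), len(partition)


def combine(left, right):
    # Associative and commutative, so it is safe to apply in any order
    # across the cluster (the "reduce" side).
    return left[0] + right[0], left[1] + right[1]


def distributed_mean(partitions):
    # Aggregate the per-partition pairs, then finish with one division.
    total, count = reduce(combine, (partial_aggregate(p) for p in partitions))
    return total / count


if __name__ == "__main__":
    # Three "partitions" standing in for data blocks on three nodes.
    print(distributed_mean([[1, 2, 3], [4, 5], [6]]))
```

The same decomposition into a per-partition step plus an associative combine underlies many of the analytical patterns in that chapter.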

Chapter 6, Data Mining and Warehousing

Here we present an introduction to data management, mining, and warehousing in a distributed context, particularly in relation to traditional database systems. This chapter will focus on Hadoop’s most popular SQL-based querying engine, Hive, as well as its most popular NoSQL database, HBase. Data wrangling is the second step in the data science pipeline, but ingested data needs somewhere to go—and this chapter explores how to manage very large datasets.

Chapter 7, Data Ingestion

Getting data into a distributed system for computation may actually be one of the biggest challenges given the magnitude of both the volume and velocity of data. This chapter explores ingestion techniques from relational databases using Sqoop as a bulk loading tool, as well as the more flexible Apache Flume for ingesting logs and other unstructured data from network sources.

Chapter 8, Analytics with Higher-Level APIs

Here we offer a review of higher-level tools for programming complex Hadoop and Spark applications, in particular Apache Pig and Spark’s DataFrames API. In Part I, we discussed the implementation of MapReduce and Spark for executing distributed jobs, and how to think of algorithms and data pipelines as data flows. Pig allows you to more easily describe those data flows without actually implementing the low-level details in MapReduce. Spark provides integrated modules that seamlessly mix procedural processing with relational queries, opening the door to powerful analytic customizations.

Chapter 9, Machine Learning

Most of the benefits of big data are realized in a machine learning context: a greater variety of features and wider input space mean that pattern recognition techniques are much more effective and personalized. This chapter introduces classification, clustering, and collaborative filtering. Rather than discuss modeling in detail, we will instead get you started on scalable learning techniques using Spark’s MLlib.

Chapter 10, Summary: Doing Distributed Data Science

To conclude, we present a summary of doing distributed data science as a complete view: integrating the tools and techniques that were discussed in isolation in the previous chapters. Data science is not a single activity but rather a lifecycle that involves data ingestion, wrangling, modeling, computation, and operationalization. This chapter discusses architectures and workflows for doing distributed data science at a 20,000-foot view.

Appendix A, Creating a Hadoop Pseudo-Distributed Development Environment

This appendix serves as a guide to setting up a development environment on your local machine in order to program distributed jobs. If you don’t have a cluster available to you, this guide is essential in order to prepare to run the examples provided in the book.

Appendix B, Installing Hadoop Ecosystem Products

An extension to the guide found in Appendix A, this appendix offers instructions for installing the many ecosystem tools and products that we discuss in the book. Although a common methodology for installing services is proposed in Appendix A, this appendix specifically looks at gotchas and caveats for installing the services to run the examples you will find as you read.

As you can see, that’s a lot of topics to cover in such a short book! We hope we have said enough to leave you intrigued and eager to read on for more!

Programming and Code Examples

As the distributed computing aspects of Hadoop have become more mature and better integrated, there has been a shift from the computer science aspects of parallelism toward providing a richer analytical experience. For example, the newest member of the big data ecosystem, Spark, exposes programming APIs in four languages to allow easier adoption by data scientists who are used to tools such as data frames, interactive notebooks, and interpreted languages. Hive and SparkSQL provide another familiar domain-specific language (DSL) in the form of a SQL syntax specifically for querying data on a distributed cluster.

Because our audience is a wide array of data scientists, we have chosen to implement as many of our examples as possible in Python. Python is a general-purpose programming language that has found a home in the data science community due to rich analytical packages such as Pandas and Scikit-Learn. Unfortunately, the primary Hadoop APIs are usually in Java, and we’ve had to jump through some hoops to provide Python examples, but for the most part we’ve been able to expose the ideas in a practical fashion. Therefore, the code in this book will be either MapReduce using Python and Hadoop Streaming, Spark with the PySpark API, or SQL when discussing Hive or Spark SQL. We hope that this will mean a more concise and accessible read for a more general audience.

GitHub Repository

The code examples found in this book can be found as complete, executable examples on our GitHub repository. This repository also contains code from our video tutorial on Hadoop, Hadoop Fundamentals for Data Scientists (O’Reilly).

Because the examples are printed, we may have taken shortcuts or omitted details from the code presented in the book in order to provide a clearer explanation of what is going on. For example, import statements are generally omitted, which means that simple copy and paste may not work. However, the repository contains complete, working code with comments that discuss what is happening.

Also note that the repository is kept up to date; check the README to find code and other changes that have occurred. You can of course fork the repository and modify the code for execution in your own environment—we strongly encourage you to do so!

Executing Distributed Jobs

Hadoop developers often use a “single node cluster” in “pseudo-distributed mode” to perform development tasks. This is usually a virtual machine running a server environment with the various Hadoop daemons. You can access this VM with SSH from your main development box, just as you would access a Hadoop cluster. In order to create a virtual environment, you need some sort of virtualization software, such as VirtualBox, VMware, or Parallels.

Appendix A discusses how to set up an Ubuntu x64 virtual machine with Hadoop, Hive, and Spark in pseudo-distributed mode. Alternatively, distributions of Hadoop such as Cloudera or Hortonworks will also provide a preconfigured virtual environment for you to use. If you have a target environment that you want to use, then we recommend downloading that virtual machine environment. Otherwise, if you’re attempting to learn more about Hadoop operations, configure it yourself!

We should also note that because Hadoop clusters run on open source software, familiarity with Linux and the command line is required. The virtual machines discussed here are all usually accessed from the command line, and many of the examples in this book describe interactions with Hadoop, Spark, Hive, and other tools from the command line. This is one of the primary reasons that analysts avoid using these tools—however, learning the command line is a skill that will serve you well; it’s not too scary, and we suggest you do it!

Permissions and Citation

This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.

We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Data Analytics with Hadoop by Benjamin Bengfort and Jenny Kim (O’Reilly). Copyright 2016 Benjamin Bengfort and Jenny Kim, 978-1491-91370-3.”

If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at permissions@oreilly.com.

Feedback and How to Contact Us

To comment or ask technical questions about this book, send email to bookquestions@oreilly.com.

We recognize that tools and technologies change rapidly, particularly in the big data domain. Unfortunately, it is difficult for a book (especially a print version) to keep pace. We hope that this book will continue to serve you well into the future; however, if you notice a change that breaks an example or an issue in the code, get in touch with us and let us know!

The best method to get in contact with us about code or examples is to leave a note in the form of an issue at Hadoop Fundamentals Issues on GitHub. Alternatively, feel free to send us an email at hadoopfundamentals@gmail.com. We’ll respond as soon as we can, and we really appreciate positive, constructive feedback!

Safari® Books Online

Safari Books Online is an on-demand digital library that delivers expert content in both book and video form from the world’s leading authors in technology and business.

Technology professionals, software developers, web designers, and business and creative professionals use Safari Books Online as their primary resource for research, problem solving, learning, and certification training.

Safari Books Online offers a range of plans and pricing for enterprise, government, education, and individuals.

Members have access to thousands of books, training videos, and prepublication manuscripts in one fully searchable database from publishers like O’Reilly Media, Prentice Hall Professional, Addison-Wesley Professional, Microsoft Press, Sams, Que, Peachpit Press, Focal Press, Cisco Press, John Wiley & Sons, Syngress, Morgan Kaufmann, IBM Redbooks, Packt, Adobe Press, FT Press, Apress, Manning, New Riders, McGraw-Hill, Jones & Bartlett, Course Technology, and hundreds more. For more information about Safari Books Online, please visit us online.

How to Contact Us

Please address comments and questions concerning this book to the publisher:

  • O’Reilly Media, Inc.
  • 1005 Gravenstein Highway North
  • Sebastopol, CA 95472
  • 800-998-9938 (in the United States or Canada)
  • 707-829-0515 (international or local)
  • 707-829-0104 (fax)

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at http://bit.ly/data-analytics-with-hadoop.


For more information about our books, courses, conferences, and news, see our website at http://www.oreilly.com.

Find us on Facebook: http://facebook.com/oreilly

Follow us on Twitter: http://twitter.com/oreillymedia

Watch us on YouTube: http://www.youtube.com/oreillymedia

Acknowledgments

We would like to thank the reviewers who tirelessly offered constructive feedback and criticism on the book throughout the rather long process of development. Thanks to Marck Vaisman, who read the book from the perspective of teaching Hadoop to data scientists. A very special thanks to Konstantinos Xirogiannopoulos, who—despite his busy research schedule—volunteered his time to provide clear, helpful, and above all, positive comments that were a delight to receive.

We would also like to thank our patient, persistent, and tireless editors at O’Reilly. We started the project with Meghan Blanchette, who guided us through a series of false starts. She stuck with us, but unfortunately our project outlasted her time at O’Reilly and she moved on to bigger and better things. We were especially glad, therefore, when Nicole Tache stepped into her shoes and managed to shepherd us back on track. Nicole took us to the end, and without her, this book would not have happened; she has a special knack for sending welcome emails at critical points that get the job done. Everyone at O’Reilly was wonderful to work with, and we’d also like to mention Marie Beaugureau, Amy Jollymore, Ben Lorica, and Mike Loukides, who gave advice and encouragement.

Here in DC, we were supported in an offline fashion by the crew at District Data Labs, who deserve a special shout out, especially Tony Ojeda, Rebecca Bilbro, Allen Leis, and Selma Gomez Orr. They supported our book in a variety of ways, including being the first to purchase the early release, offering feedback, reviewing code, and generally wondering when it would be done, encouraging us to get back to writing!

This book would not have been possible without the contributions of the amazing people in the Hadoop community, many of whom Jenny has the incredible privilege of working alongside every day at Cloudera. Special thanks to the Hue team; the dedication and passion they bring to providing the best Hadoop user experience around is truly extraordinary and inspiring.

To our families and especially our parents, Randy and Lily Bengfort and Wung and Namoak Kim, thank you for your endless encouragement, love, and support. Our parents have instilled in us a mutual zeal for learning and exploration, which has sent us down more than a few rabbit holes, but they also cultivated in us a shared tenacity and perseverance to always find our way to the other end.

Finally, to our spouses—thanks, Patrick and Jacquelyn, for sticking with us. One of us may have said at some point “my marriage wouldn’t survive another book.” Certainly, in the final stages of the writing process, neither of them was thrilled to hear we were still plugging away. Nonetheless, it wouldn’t have gotten done without them (our book wouldn’t have survived without our marriages). Patrick and Jacquelyn offered friendly winks and waves as we were on video calls working out details and doing rewrites. They even read portions, offered advice, and were generally helpful in all ways. Neither of us were book authors before this, and we weren’t sure what we were getting into. Now that we know, we’re so glad they stuck by us.

1 Anand Rajaraman, “More data usually beats better algorithms”, Datawocky, March 24, 2008.
