Preface
I donât like to think I have many regrets, but itâs hard to believe anything good came out of a particular lazy moment in 2011 when I was looking into how to best distribute tough discrete optimization problems over clusters of computers. My advisor explained this newfangled Apache Spark thing he had heard of, and I basically wrote off the concept as too good to be true and promptly got back to writing my undergrad thesis in MapReduce. Since then, Spark and I have both matured a bit, but only one of us has seen a meteoric rise thatâs nearly impossible to avoid making âigniteâ puns about. Cut to a few years later, and it has become crystal clear that Spark is something worth paying attention to.
Sparkâs long lineage of predecessors, from MPI to MapReduce, makes it possible to write programs that take advantage of massive resources while abstracting away the nitty-gritty details of distributed systems. As much as data processing needs have motivated the development of these frameworks, in a way the field of big data has become so related to these frameworks that its scope is defined by what these frameworks can handle. Sparkâs promise is to take this a little furtherâto make writing distributed programs feel like writing regular programs.
Spark is great at giving ETL pipelines huge boosts in performance and easing some of the pain that feeds the MapReduce programmerâs daily chant of despair (âwhy? whyyyyy?â) to the Apache Hadoop gods. But the exciting thing for me about it has always been what it opens up for complex analytics. With a paradigm that supports iterative algorithms and interactive exploration, Spark is finally an open source framework that allows a data scientist to be productive with large data sets.
I think the best way to teach data science is by example. To that end, my colleagues and I have put together a book of applications, trying to touch on the interactions between the most common algorithms, data sets, and design patterns in large-scale analytics. This book isnât meant to be read cover to cover. Page to a chapter that looks like something youâre trying to accomplish, or that simply ignites your interest.
Whatâs in This Book
The first chapter will place Spark within the wider context of data science and big data analytics. After that, each chapter will comprise a self-contained analysis using Spark. The second chapter will introduce the basics of data processing in Spark and Scala through a use case in data cleansing. The next few chapters will delve into the meat and potatoes of machine learning with Spark, applying some of the most common algorithms in canonical applications. The remaining chapters are a bit more of a grab bag and apply Spark in slightly more exotic applicationsâfor example, querying Wikipedia through latent semantic relationships in the text or analyzing genomics data.
The Second Edition
Since the first edition, Spark has experienced a major version upgrade that instated an entirely new core API and sweeping changes in subcomponents like MLlib and Spark SQL. In the second edition, weâve made major renovations to the example code and brought the materials up to date with Sparkâs new best practices.
Using Code Examples
Supplemental material (code examples, exercises, etc.) is available for download at https://github.com/sryza/aas.
This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless youâre reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from OâReilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your productâs documentation does require permission.
We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: "Advanced Analytics with Spark by Sandy Ryza, Uri Laserson, Sean Owen, and Josh Wills (OâReilly). Copyright 2015 Sandy Ryza, Uri Laserson, Sean Owen, and Josh Wills, 978-1-491-91276-8.â
If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at permissions@oreilly.com.
OâReilly Safari
Safari (formerly Safari Books Online) is a membership-based training and reference platform for enterprise, government, educators, and individuals.
Members have access to thousands of books, training videos, Learning Paths, interactive tutorials, and curated playlists from over 250 publishers, including OâReilly Media, Harvard Business Review, Prentice Hall Professional, Addison-Wesley Professional, Microsoft Press, Sams, Que, Peachpit Press, Adobe, Focal Press, Cisco Press, John Wiley & Sons, Syngress, Morgan Kaufmann, IBM Redbooks, Packt, Adobe Press, FT Press, Apress, Manning, New Riders, McGraw-Hill, Jones & Bartlett, and Course Technology, among others.
For more information, please visit http://oreilly.com/safari.
How to Contact Us
Please address comments and questions concerning this book to the publisher:
- OâReilly Media, Inc.
- 1005 Gravenstein Highway North
- Sebastopol, CA 95472
- 800-998-9938 (in the United States or Canada)
- 707-829-0515 (international or local)
- 707-829-0104 (fax)
To comment or ask technical questions about this book, send email to bookquestions@oreilly.com.
For more information about our books, courses, conferences, and news, see our website at http://www.oreilly.com.
Find us on Facebook: https://facebook.com/oreilly
Follow us on Twitter: https://twitter.com/oreillymedia
Watch us on YouTube: https://www.youtube.com/oreillymedia
Acknowledgments
It goes without saying that you wouldnât be reading this book if it were not for the existence of Apache Spark and MLlib. We all owe thanks to the team that has built and open sourced it, and the hundreds of contributors who have added to it.
We would like to thank everyone who spent a great deal of time reviewing the content of the book with expert eyes: Michael Bernico, Adam Breindel, Ian Buss, Parviz Deyhim, Jeremy Freeman, Chris Fregly, Debashish Ghosh, Juliet Hougland, Jonathan Keebler, Nisha Muktewar, Frank Nothaft, Nick Pentreath, Kostas Sakellis, Tom White, Marcelo Vanzin, and Juliet Hougland again. Thanks all! We owe you one. This has greatly improved the structure and quality of the result.
I (Sandy) also would like to thank Jordan Pinkus and Richard Wang for helping me with some of the theory behind the risk chapter.
Thanks to Marie Beaugureau and OâReilly for the experience and great support in getting this book published and into your hands.
Get Advanced Analytics with Spark, 2nd Edition now with the O’Reilly learning platform.
O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.