I wrote the first edition of this book while recovering from disability from a car accident after which I developed chronic pain and lost partial use of my hands. Unable to chop vegetables, I wrote it from bed and the couch on an iPad to get over a failed project that haunted me called Career Explorer. Injured weeks before the ship date, getting the product over the line, staying up for days and doing whatever it took, became a traumatic experience. During the project, we made many mistakes I knew not to make, and I was continuously frustrated. The product bombed. A sense of failure routinely bugged me while I was stuck, horizontal on my back most of the time from intractable chronic pain. Also suffering from a heart condition, missing a third of my heartbeats, I developed dementia. My mind sank to a dark place. I could not easily find a way out. I had to find a way to fix things, to grapple with failure. Strange to say that to fix myself, I wrote a book. I needed to write directions I could give to teammates to make my next project a success. I needed to get this story out of me. More than that, I thought I could bring meaning to my life, most of which was shed by disability, by helping others. By doing something for the greater good. I wanted to ensure that others did not repeat my mistakes. I thought that was worth doing. There was a problem this project illustrated that was bigger than me. Most research sits on a shelf and never gets into the hands of people it can benefit. This book is a prescription and methodology for doing applied research that makes it into the world in the form of a product.
This may sound quite dramatic, but I wanted to put the first edition in personal context before introducing the second. Although it was important to me, of course, the first edition of this book was only a small contribution to the emerging field of data science. But I’m proud of it. I found salvation in its pages, it made me feel right again, and in time I recovered from illness and found a sense of accomplishment that replaced the sting of failure. So that’s the first edition.
In this second edition, I hope to do more. Put simply, I want to take a budding data scientist and accelerate her into an analytics application developer. In doing so, I draw from and reflect upon my experience building analytics applications at three Hadoop shops and one Spark shop. I hope this new edition will become the go-to guide for readers to rapidly learn how to build analytics applications on data of any size, using the lingua franca of data science, Python, and the platform of choice, Spark.
Spark has replaced Hadoop/MapReduce as the default way to process data at scale, so we adopt Spark for this new edition. In addition, the theory and process of the Agile Data Science methodology have been updated to reflect an increased understanding of working in teams. I hope that readers of the first edition will become readers of the second, and that the book will serve Spark users better than the original served Hadoop users. Onward to Agile Data Science...
Agile Data Science has two goals: to provide a how-to guide for building analytics applications with data of any size using Python and Spark, and to help product teams collaborate on building analytics applications in an agile manner that will ensure success.
You can learn the latest on Agile Data Science on the mailing list, firstname.lastname@example.org, or on the web at https://groups.google.com/d/forum/agile-data-science.
There is a web page for this book maintained by the author at http://datasyndrome.com/book which contains the latest updates and related material for readers of the book.
The author of this book, Russell Jurney, has founded a consultancy called Data Syndrome to advance the adoption of the methodology and technology stack outlined in this book. If you need help implementing Agile Data Science within your company, if you need hands-on help building data products, or if you need “big data” training, you can contact the author at email@example.com or on the web at http://datasyndrome.com.
Data Syndrome offers a video course, Realtime Predictive Analytics with Kafka, PySpark, Spark MLlib and Spark Streaming, that builds on the material from chapters 7 and 8 to teach students how to build entire realtime predictive systems with Kafka and Spark Streaming and a web application front-end. For more information, visit http://datasyndrome.com/video or contact firstname.lastname@example.org.
Data Syndrome is developing a complete curriculum for live “big data” training for data science and data engineering teams. Current course offerings are customizable for your needs and include:
Agile Data Science - A three-day course, eight hours per day, covering the construction of full-stack analytics applications. Similar in content to this book, this course trains data scientists to be full-stack application developers.
Realtime Predictive Analytics - A one-day, six-hour course covering the construction of entire realtime predictive systems using Kafka and Spark Streaming with a web application front end.
Introduction to PySpark - A one-day, three-hour course introducing students to basic data processing with Spark through the Python interface, PySpark. Culminates in the construction of a classifier model to predict flight delays using Spark MLlib.
Agile Data Science is a course to help beginners and budding data scientists become productive members of data science and analytics teams. It aims to help engineers, analysts, and data scientists work with big data in an agile way using Spark. It introduces an agile methodology well suited for big data.
This book is targeted at programmers with some exposure to developing software and working with data. Designers and product managers might particularly enjoy Chapters 1, 2, and 6, which would serve as an introduction to the agile process without focusing on running code.
Agile Data Science assumes you are working in a *nix environment. Examples for Windows users aren’t available, but are possible via Cygwin. A user-contributed Linux Vagrant image with all the prerequisites installed is available here. You can quickly boot a Linux machine in VirtualBox using this tool.
This book is organized into two sections. Part I introduces the data- and toolset we will use in the tutorials in Part II. Part I is intentionally brief, taking only enough time to introduce the tools. We go more in-depth into their use in Part II, so don’t worry if you’re a little overwhelmed in Part I. The chapters that compose Part I are as follows:
Introduces the Agile Data Science methodology.
Introduces our toolset, and helps you get it up and running on your own machine.
Describes the dataset used in this book.
Part II is a tutorial in which we build an analytics application using Agile Data Science. It is a notebook-style guide to building an analytics application. We climb the data-value pyramid one level at a time, applying agile principles as we go. We’ll demonstrate a way of building value step by step in small, agile iterations. Part II comprises the following chapters:
Helps you download flight data and then connect or “plumb” flight records through to a web application.
Steps you through how to navigate your data by preparing simple charts in a web application.
Teaches you how to extract entities from your data and parameterize and link between them to create interactive reports.
Takes what you’ve done so far and predicts whether your flight will be on time or late.
Shows how to deploy predictions to ensure they impact real people and systems.
Iteratively improves on the performance of our on-time flight prediction.
Appendix A shows how to manually install our tools.
The following typographical conventions are used in this book:
Italic
Indicates new terms, URLs, email addresses, filenames, and file extensions.
Constant width
Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.
Constant width bold
Shows commands or other text that should be typed literally by the user.
Constant width italic
Shows text that should be replaced with user-supplied values or by values determined by context.
This icon signifies a tip, suggestion, or general note.
This icon indicates a warning or caution.
Supplemental material (code examples, exercises, etc.) is available for download at https://github.com/rjurney/Agile_Data_Code_2.
This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.
We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Agile Data Science by Russell Jurney (O’Reilly). Copyright 2014 Data Syndrome LLC, 978-1-449-32626-5.”
If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at email@example.com.
Technology professionals, software developers, web designers, and business and creative professionals use Safari Books Online as their primary resource for research, problem solving, learning, and certification training.
Safari Books Online offers a range of product mixes and pricing programs for organizations, government agencies, and individuals. Subscribers have access to thousands of books, training videos, and prepublication manuscripts in one fully searchable database from publishers like O’Reilly Media, Prentice Hall Professional, Addison-Wesley Professional, Microsoft Press, Sams, Que, Peachpit Press, Focal Press, Cisco Press, John Wiley & Sons, Syngress, Morgan Kaufmann, IBM Redbooks, Packt, Adobe Press, FT Press, Apress, Manning, New Riders, McGraw-Hill, Jones & Bartlett, Course Technology, and dozens more. For more information about Safari Books Online, please visit us online.
Please address comments and questions concerning this book to the publisher:
We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at http://oreil.ly/agile-data-science.
To comment or ask technical questions about this book, send email to firstname.lastname@example.org.
For more information about our books, courses, conferences, and news, see our website at http://www.oreilly.com.
Find us on Facebook: http://facebook.com/oreilly
Follow us on Twitter: http://twitter.com/oreillymedia
Watch us on YouTube: http://www.youtube.com/oreillymedia