
Agile Data Science 2.0

Book Description

Building analytics products at scale requires a deep investment in people, machines, and time. How can you be sure you’re building the right models that people will pay for? With this hands-on book, you’ll learn a flexible toolset and methodology for building effective analytics applications with Spark.

Using lightweight tools such as Python, PySpark, Elastic MapReduce, MongoDB, Elasticsearch, Doc2Vec, deep learning, D3.js, Leaflet, Docker, and Heroku, your team will create an agile environment for exploring data, starting with an example application that mines flight data into an analytics product.
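As a flavor of the workflow described above, here is a minimal sketch of serializing event records as JSON Lines, the newline-delimited format the book's toolset uses to move records between collection and Spark processing. The field names below are hypothetical examples, not taken from the book's dataset schema.

```python
import json

# Hypothetical flight records; real records would come from the
# on-time performance data the book collects.
flights = [
    {"Carrier": "AA", "FlightNum": "1", "Origin": "JFK", "Dest": "LAX"},
    {"Carrier": "DL", "FlightNum": "202", "Origin": "ATL", "Dest": "SFO"},
]

# Serialize: one JSON document per line (the JSON Lines convention),
# which lets downstream tools like PySpark split and stream the file.
jsonl = "\n".join(json.dumps(f) for f in flights)

# Deserialize: each line parses independently of the others.
parsed = [json.loads(line) for line in jsonl.splitlines()]
assert parsed == flights
```

Because every line is a self-contained document, large files in this format split cleanly across workers, which is why it pairs well with distributed processing.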

Table of Contents

  1. Preface
    1. Agile Data Science Mailing List
    2. Data Syndrome, Product Analytics Consultancy
      1. Realtime Predictive Analytics Video Course
      2. Live Training
    3. Who This Book Is For
    4. How This Book Is Organized
    5. Conventions Used in This Book
    6. Using Code Examples
    7. Safari® Books Online
    8. How to Contact Us
  2. I. Setup
  3. 1. Theory
    1. Introduction
    2. Definition
      1. Methodology as Tweet
      2. Agile Data Science Manifesto
    3. The Problem with the Waterfall
      1. Research vs. Application Development
    4. The Problem with Agile Software
      1. Eventual Quality: Financing Technical Debt
      2. Pull of the Waterfall
    5. The Data Science Process
      1. Setting Expectations
      2. Data Science Team Roles
      3. Recognizing the Opportunity and Problem
      4. Adapting to Change
    6. Notes on Process
      1. Code Review and Pair Programming
      2. Agile Environments: Engineering Productivity
      3. Realizing Ideas with Large-Format Printing
  4. 2. Agile Tools
    1. Scalability = Simplicity
    2. Agile Data Science Data Processing
    3. Local Environment Setup
      1. System Requirements
      2. Setting Up Vagrant
      3. Downloading The Data
    4. EC2 Environment Setup
      1. Downloading The Data
    5. Getting and Running the Code
      1. Getting the Code
      2. Running the Code
      3. Jupyter Notebooks
    6. Touring the Toolset
      1. Agile Stack Requirements
      2. Python 3
      3. Serializing Events with JSON Lines and Parquet
      4. Collecting Data
      5. Data Processing with Spark
      6. Publishing Data with MongoDB
      7. Searching Data with Elasticsearch
      8. Distributed Streams with Apache Kafka
      9. Processing Streams with PySpark Streaming
      10. Machine Learning with scikit-learn and Spark MLlib
      11. Scheduling with Apache Incubating Airflow
      12. Reflecting on our Workflow
      13. Lightweight Web Applications
      14. Presenting Our Data
    7. Conclusion
  5. 3. Data
    1. Air Travel Data
      1. Flight On-Time Performance Data
      2. OpenFlights Database
    2. Weather Data
    3. Data Processing in Agile Data Science
      1. Structured Versus Semistructured Data
    4. SQL vs NoSQL
      1. SQL
      2. NoSQL and Dataflow Programming
      3. Spark: SQL + NoSQL
      4. Schemas in NoSQL
      5. Data Serialization
      6. Extracting and Exposing Features in Evolving Schemas
    5. Conclusion
  6. II. Climbing the Pyramid
  7. 4. Collecting and Displaying Records
    1. Putting It All Together
    2. Collect and Serialize Flight Data
    3. Process and Publish Flight Records
      1. Publishing Flight Records to MongoDB
    4. Presenting Flight Records in a Browser
      1. Serving Flights with Flask and pymongo
      2. Rendering HTML5 with Jinja2
    5. Agile Checkpoint
    6. Listing Flights
      1. Listing Flights with MongoDB
      2. Paginating Data
    7. Searching for Flights
      1. Creating Our Index
      2. Publishing Flights to Elasticsearch
      3. Searching Flights on the Web
    8. Conclusion
  8. 5. Visualizing Data with Charts and Tables
    1. Chart Quality: Iteration is Essential
    2. Scaling a Database in the Publish/Decorate Model
      1. First Order Form
      2. Second Order Form
      3. Third Order Form
      4. Choosing a Form
    3. Exploring Seasonality
      1. Querying and Presenting Flight Volume
    4. Extracting Metal (Airplanes [Entities])
      1. Extracting Tail Numbers
      2. Assessing our Airplanes
    5. Data Enrichment
      1. Reverse Engineering a Web Form
      2. Gathering Tail Numbers
      3. Automating Form Submission
      4. Extracting Data from HTML
      5. Evaluating Enriched Data
    6. Conclusion
  9. 6. Exploring Data with Reports
    1. Extracting Airlines (Entities)
      1. Defining Airlines as Groups of Airplanes Using PySpark
      2. Querying Airline Data in Mongo
      3. Building an Airline Page in Flask
      4. Linking Back to our Airline Page
      5. Creating an All Airlines Home Page
    2. Curating Ontologies of Semi-Structured Data
    3. Improving Airlines
      1. Adding Names to Carrier Codes
      2. Incorporating Wikipedia Content
      3. Publishing Enriched Airlines to Mongo
      4. Enriched Airlines on the Web
    4. Investigating Airplanes (Entities)
      1. SQL Sub-Queries vs. Dataflow Programming
      2. Dataflow Programming without Sub-Queries
      3. Sub-Queries in Spark SQL
      4. Creating an Airplanes Home Page
      5. Adding Search to the Airplanes Page
      6. Creating a Manufacturer Bar Chart
      7. Iterating on a Manufacturer’s Bar Chart
      8. Entity Resolution: Another Chart Iteration
    5. Conclusion
  10. 7. Making Predictions
    1. The Role of Predictions
    2. Predict What?
    3. Introduction to Predictive Analytics
      1. Making Predictions
    4. Exploring Flight Delays
    5. Extracting Features with PySpark
    6. Building a Regression with scikit-learn
      1. Loading our Data
      2. Sampling our Data
      3. Vectorizing our Results
      4. Preparing our Training Data
      5. Vectorizing our Features
      6. Sparse vs Dense Matrices
      7. Preparing an Experiment
      8. Training our Model
      9. Testing our Model
      10. Conclusion
    7. Building a Classifier with Spark MLlib
      1. Loading our Training Data with a Specified Schema
      2. Addressing Nulls
      3. Replacing FlightNum with Route
      4. Bucketizing a Continuous Variable for Classification
      5. Feature Vectorization with pyspark.ml.feature
      6. Classification with Spark ML
    8. Conclusion
  11. 8. Deploying Predictive Systems
    1. Deploying a scikit-learn Application as a Web Service
      1. Saving and Loading scikit-learn Models
      2. Groundwork for Serving Predictions
      3. Creating our Flight Delay Regression API
      4. Testing our API
      5. Pulling our API into our Product
    2. Deploying Spark ML Applications in Batch with Airflow
      1. Gathering Training Data in Production
      2. Training, Storing and Loading Spark ML Models
      3. Creating Prediction Requests in Mongo
      4. Fetching Prediction Requests from MongoDB
      5. Making Predictions in Batch with Spark ML
      6. Storing Predictions in MongoDB
      7. Displaying Batch Prediction Results in our Web Application
      8. Automating our Workflow with Apache Incubating Airflow
      9. Conclusion
    3. Deploying Spark ML via Spark Streaming
      1. Gathering Training Data in Production
      2. Training, Storing and Loading Spark ML Models
      3. Sending Prediction Requests to Kafka
      4. Making Predictions in Spark Streaming
      5. Testing the Entire System
    4. Conclusion
  12. 9. Improving Predictions
    1. Fixing our Prediction Problem
    2. When to Improve Predictions
    3. Improving Predictions
      1. Experimental Adhesion Method: See What Sticks
      2. Establishing Rigorous Metrics for Experiments
      3. Time of Day as a Feature
    4. Incorporating Airplane Data
      1. Extracting Airplane Features
      2. Incorporating Airplane Features into our Classifier Model
    5. Incorporating Flight Time
    6. Conclusion
  13. 10. Climbing The Pyramid: Incorporating the Weather
    1. Incorporating Weather Data
      1. Acquiring Historical Weather Data
      2. Loading Weather Data
      3. Optimizing Hourly Observations
      4. Extracting Weather Entities using RDDs
      5. Presenting Weather Data
      6. Interpreting Weather Conditions
    2. Weather and Flight Delays
    3. Matching Weather Stations to Airports
      1. Assessing Our Airports
      2. Assessing Our Weather Stations
      3. Building Weather Station Addresses
      4. Geocoding Weather Station Addresses
      5. Combining Weather Station Coordinate Sets
      6. Making Pairwise Airport/Station Comparisons
      7. Computing the Closest Station to Each Airport
    4. Associating Weather Reports with Flights
    5. Deploying Our Enriched Model
      1. Deploying the Final Model in Batch
      2. Deploying the Final Model in Realtime
  14. A. Manual Installation
    1. Installing Hadoop
    2. Installing Spark
    3. Installing MongoDB
    4. Installing MongoDB’s Java Driver
    5. Installing mongo-hadoop
      1. Building mongo-hadoop
      2. Installing pymongo_spark
    6. Elasticsearch Installation
    7. Installing Elasticsearch for Hadoop
    8. Setting up our Spark Environment
    9. Installing Kafka
    10. Installing scikit-learn
    11. Installing Zeppelin
  15. Index