Agile Data Science 2.0

Book Description

Building analytics products at scale requires a deep investment in people, machines, and time. How can you be sure you’re building the right models that people will pay for? With this hands-on book, you’ll learn a flexible toolset and methodology for building effective analytics applications with Spark.

Using lightweight tools such as Python, PySpark, Elastic MapReduce, MongoDB, Elasticsearch, Doc2Vec, deep learning, D3.js, Leaflet, Docker, and Heroku, your team will create an agile environment for exploring data, starting with an example application that mines flight data into an analytics product.
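The workflow the book builds up begins with serializing event records, such as individual flights, as JSON Lines (one JSON object per line) before processing them with Spark and publishing them to MongoDB. As a flavor of that first step, here is a minimal sketch of JSON Lines round-tripping in plain Python; the field names are illustrative, not taken from the book's dataset:

```python
import json

# One flight record per line ("JSON Lines"): a lightweight, streamable
# serialization format. These field names are illustrative only.
records = [
    {"Carrier": "AA", "FlightNum": "1", "Origin": "JFK", "Dest": "LAX"},
    {"Carrier": "DL", "FlightNum": "2", "Origin": "ATL", "Dest": "SFO"},
]

# Serialize: one JSON object per line, newline-delimited.
jsonl = "\n".join(json.dumps(r, sort_keys=True) for r in records)

# Deserialize: each line parses independently, so large files can be
# processed line by line without loading everything into memory.
parsed = [json.loads(line) for line in jsonl.splitlines()]
assert parsed == records
```

Because every line is a self-contained document, the same file can be split across machines and consumed in parallel, which is what makes the format a natural fit for Spark input.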

Table of Contents

  1. 1. Agile Tools
    1. Scalability = Simplicity
    2. Agile Data Science Data Processing
    3. Example Environment Setup
      1. Setting Up Vagrant
    4. Touring the Toolset
      1. Setting Up Python
      2. Serializing Events with JSON Lines and Parquet
      3. Collecting Data
      4. Data Processing with Spark
      5. Publishing Data with MongoDB
      6. Searching Data with Elasticsearch
      7. Distributed Streams with Apache Kafka
      8. Processing Streams with PySpark Streaming
      9. Machine Learning with scikit-learn
        1. Why scikit-learn as well as Spark MLlib?
      10. Scheduling with Apache Airflow (Incubating)
      11. Reflecting on our Workflow
      12. Lightweight Web Applications
      13. Presenting Our Data
    5. Conclusion
  2. 2. Data
    1. Air Travel Data
      1. Flight On-Time Performance Data
      2. OpenFlights Database
    2. Weather Data
    3. Data Processing in Agile Data Science
      1. Structured Versus Semistructured Data
    4. SQL vs NoSQL
      1. SQL
      2. NoSQL and Dataflow Programming
      3. Spark: SQL + NoSQL
      4. Schemas in NoSQL
      5. Data Serialization
      6. Extracting and Exposing Features in Evolving Schemas
    5. Batch Versus Realtime: Spark Versus Kafka
    6. Conclusion
  3. I. Climbing the Pyramid
  4. 3. Collecting and Displaying Records
    1. Putting It All Together
    2. Collect and Serialize Flight Data
    3. Process and Publish Flight Records
      1. Publishing Flight Records to MongoDB
    4. Presenting Flight Records in a Browser
      1. Serving Flights with Flask and pymongo
      2. Rendering HTML5 with Jinja2
    5. Agile Checkpoint
    6. Listing Flights
      1. Listing Flights with MongoDB
      2. Paginating Data
    7. Searching for Flights
      1. Publishing flights to Elasticsearch
      2. Searching Flights on the Web
    8. Conclusion
  5. 4. Visualizing Data with Charts and Tables
    1. Chart Quality: Iteration is Essential
    2. Scaling a Database in the Publish/Decorate Model
      1. First Order Form
      2. Second Order Form
      3. Third Order Form
      4. Choosing a Form
    3. Exploring Seasonality
      1. Querying and presenting flight volume
    4. Extracting Metal (Airplanes [Entities])
      1. Extracting Tail Numbers
      2. Assessing our Airplanes
    5. Data Enrichment
      1. Reverse Engineering a Web Form
      2. Gathering Tail Numbers
      3. Automating Form Submission
      4. Extracting Data from HTML
      5. Evaluating Enriched Data
    6. Conclusion
  6. 5. Exploring Data with Reports
    1. Extracting Airlines (Entities)
      1. Defining Airlines as Groups of Airplanes using PySpark
      2. Querying Airline Data in Mongo
      3. Building an Airline Page in Flask
      4. Linking Back to our Airline Page
      5. Creating an All Airlines Home Page
    2. Curating Ontologies of Semi-Structured Data
    3. Improving Airlines
      1. Adding Names to Carrier Codes
      2. Incorporating Wikipedia Content
      3. Publishing Enriched Airlines to Mongo
      4. Enriched Airlines on the Web
    4. Investigating Airplanes (Entities)
      1. SQL Sub-Queries vs. Dataflow Programming
      2. Dataflow Programming without Sub-Queries
      3. Sub-Queries in Spark SQL
      4. Creating an Airplanes Home Page
      5. Adding Search to the Airplanes Page
      6. Creating a Manufacturer Bar Chart
      7. Iterating on a Manufacturer’s Bar Chart
      8. Entity Resolution: Another Chart Iteration
    5. Conclusion
  7. 6. Making Predictions
    1. The Role of Predictions
    2. Predict What?
    3. Introduction to Predictive Analytics
      1. Making Predictions
    4. Exploring Flight Delays
    5. Extracting Features with PySpark
    6. Building a Regression with scikit-learn
      1. Loading our Data
      2. Sampling our Data
      3. Vectorizing our Results
      4. Preparing our Training Data
      5. Vectorizing our Features
      6. Sparse vs Dense Matrices
      7. Preparing an Experiment
      8. Training our Model
      9. Testing our Model
      10. Conclusion
    7. Building a Classifier with Spark MLlib
      1. Loading our Training Data with a Specified Schema
      2. Addressing Nulls
      3. Replacing FlightNum with Route
      4. Bucketizing a Continuous Variable for Classification
      5. Feature Vectorization with pyspark.ml.feature
      6. Classification with Spark ML
    8. Conclusion
  8. 7. Deploying Predictive Systems
    1. Deploying a scikit-learn Application as a Web Service
      1. Saving and Loading scikit-learn Models
      2. Groundwork for Serving Predictions
      3. Creating our Flight Delay Regression API
      4. Testing our API
      5. Pulling our API into our Product
    2. Deploying Spark ML Applications in Batch with Airflow
      1. Gathering Training Data in Production
      2. Training, Storing and Loading Spark ML Models
      3. Creating Prediction Requests in Mongo
      4. Fetching Prediction Requests from MongoDB
      5. Making Predictions in Batch with Spark ML
      6. Storing Predictions in MongoDB
      7. Displaying Batch Prediction Results in our Web Application
      8. Automating our Workflow with Apache Airflow (Incubating)
      9. Conclusion
    3. Deploying Spark ML via Spark Streaming
      1. Gathering Training Data in Production
      2. Training, Storing and Loading Spark ML Models
      3. Sending Prediction Requests to Kafka
      4. Making Predictions in Spark Streaming
      5. Testing the Entire System
    4. Conclusion