Agile Data Science, 2nd Edition

Book Description

Building analytics products at scale requires a deep investment in people, machines, and time. How can you be sure you’re building the right models that people will pay for? With this hands-on book, you’ll learn a flexible toolset and methodology for building effective analytics applications with Spark.

Using lightweight tools such as Python, PySpark, Elastic MapReduce, MongoDB, Elasticsearch, Doc2vec, Deep Learning, D3.js, Leaflet, Docker, and Heroku, your team will create an agile environment for exploring data, starting with an example application to mine flight data into an analytic product.

Table of Contents

  1. 1. Agile Tools
    1. Scalability = Simplicity
    2. Agile Data Science Data Processing
    3. Setting Up Python
      1. Anaconda
      2. Python Virtual Environments
      3. IPython and Jupyter Notebooks
      4. Installing Python modules via requirements.txt
    4. Serializing Events with JSON Lines and Parquet
      1. JSON for Python
    5. Collecting Data
    6. Data Processing with Spark
      1. Installing Hadoop
      2. Installing Spark
      3. Processing data with Spark
    7. Publishing Data with MongoDB
      1. Installing MongoDB
      2. Installing MongoDB’s Java Driver
      3. Installing mongo-hadoop
      4. Pushing Data to MongoDB from PySpark
    8. Searching Data with Elasticsearch
      1. Installation
      2. Elasticsearch and PySpark
      3. Python and Elasticsearch with pyelasticsearch
      4. Setting up our Spark Environment
    9. Machine Learning with scikit-learn
      1. Why scikit-learn and not Spark MLlib?
      2. Installing scikit-learn
    10. Reflecting on our Workflow
    11. Lightweight Web Applications
      1. Python and Flask
    12. Presenting Our Data
      1. Installing Bootstrap
      2. Booting Bootstrap
      3. Visualizing Data with D3.js
    13. Apache Zeppelin
      1. Installing Zeppelin
    14. Conclusion
  2. 2. Data
    1. Air Travel Data
    2. Working with Raw Data
      1. Flight On-Time Performance Data
      2. Structured Versus Semistructured Data
    3. SQL vs NoSQL
      1. SQL
      2. NoSQL and Dataflow Programming
      3. Spark: SQL + NoSQL
      4. Schemas in NoSQL
      5. Data Serialization
      6. Extracting and Exposing Features in Evolving Schemas
    4. Conclusion
  3. 3. Collecting and Displaying Records
    1. Putting It All Together
    2. Collect and Serialize Flight Data
    3. Process and Publish Flight Records
      1. Publishing Flight Records to MongoDB
    4. Presenting Flight Records in a Browser
      1. Serving Flights with Flask and pymongo
      2. Rendering HTML5 with Jinja2
    5. Agile Checkpoint
    6. Listing Flights
      1. Listing Flights with MongoDB
      2. Paginating Data
    7. Searching for Flights
      1. Publishing flights to Elasticsearch
      2. Searching Flights on the Web
    8. Conclusion
  4. 4. Visualizing Data with Charts
    1. Chart Quality: Iteration is Essential
    2. Scaling a Database in the Publish/Decorate Model
      1. First Order Form
      2. Second Order Form
      3. Third Order Form
      4. Choosing a Form
    3. Exploring Seasonality
      1. Querying and presenting flight volume
    4. Extracting Metal (Airplanes [Entities])
      1. Extracting Tail Numbers
      2. Assessing our Airplanes
    5. Data Enrichment
      1. Reverse Engineering a Web Form
      2. Gathering Tail Numbers
      3. Automating Form Submission
      4. Extracting Data from HTML
      5. Evaluating Enriched Data
    6. Conclusion