Agile Data Science, 2nd Edition

Book Description

Building analytics products at scale requires a deep investment in people, machines, and time. How can you be sure you’re building the right models that people will pay for? With this hands-on book, you’ll learn a flexible toolset and methodology for building effective analytics applications with Spark.

Using lightweight tools such as Python, PySpark, Elastic MapReduce, MongoDB, Elasticsearch, Doc2vec, deep learning, D3.js, Leaflet, Docker, and Heroku, your team will create an agile environment for exploring data, starting with an example application that mines flight data into an analytic product.

Table of Contents

  1. 1. Agile Tools
    1. Scalability = Simplicity
    2. Agile Data Science Data Processing
    3. Setting Up Python
      1. Anaconda
      2. Python Virtual Environments
      3. IPython and Jupyter Notebooks
    4. Serializing Events with JSON Lines and Parquet
      1. JSON for Python
    5. Collecting Data
    6. Data Processing with Spark
      1. Installing Hadoop
      2. Installing Spark
      3. Processing Data with Spark
    7. Publishing Data with MongoDB
      1. Installing MongoDB
      2. Installing MongoDB’s Java Driver
      3. Installing mongo-hadoop
      4. Pushing Data to MongoDB from PySpark
    8. Searching Data with Elasticsearch
      1. Installation
      2. Elasticsearch and PySpark
      3. Python and Elasticsearch with pyelasticsearch
    9. Reflecting on Our Workflow
    10. Lightweight Web Applications
      1. Python and Flask
    11. Presenting Our Data
      1. Installing Bootstrap
      2. Booting Bootstrap
      3. Visualizing Data with D3.js
    12. Conclusion
  2. 2. Data
    1. Air Travel Data
    2. Working with Raw Data
      1. Flight On-Time Performance Data
      2. Structured Versus Semistructured Data
    3. SQL vs NoSQL
      1. SQL
      2. NoSQL and Dataflow Programming
      3. Spark: SQL + NoSQL
      4. Schemas in NoSQL
      5. Data Serialization
      6. Extracting and Exposing Features in Evolving Schemas
    4. Conclusion
  3. 3. Collecting and Displaying Records
    1. Putting It All Together
    2. Collect and Serialize Flight Data
    3. Process and Publish Flight Records
      1. Publishing Flight Records to MongoDB
    4. Presenting Flight Records in a Browser
      1. Serving Flights with Flask and pymongo
      2. Rendering HTML5 with Jinja2
    5. Agile Checkpoint
    6. Listing Flights
      1. Listing Flights with MongoDB
      2. Paginating Data
    7. Searching for Flights
      1. Publishing Flights to Elasticsearch
      2. Searching Flights on the Web
    8. Conclusion