Agile Data Science

Book description

Mining big data requires a deep investment in people and time. How can you be sure you’re building the right models? With this hands-on book, you’ll learn a flexible toolset and methodology for building effective analytics applications with Hadoop.

Table of Contents

  1. Preface
    1. About Agile Big Data
      1. Goals of the Book
      2. Why I Wrote This Book
      3. How the Book is Organized
      4. Who’s Agile Big Data For?
    2. Conventions Used in This Book
    3. Using Code Examples
    4. Safari® Books Online
    5. How to Contact Us
  2. I. Setup
    1. 1. Theory
      1. Agile Big Data
      2. Big Words Defined
      3. Agile Big Data Teams
        1. Opportunity and Problem
        2. Adapting to Change
          1. The Power of Generalists
          2. Agile Platforms
          3. Sharing Intermediate Results
      4. Agile Big Data Process
      5. Code Review and Pair Programming
      6. Agile Environments: Engineering Productivity
        1. Collaboration Space
        2. Private Space
        3. Personal Space
      7. Realizing Ideas with Large Format Printing
    2. 2. Data
      1. Introduction
      2. Email
      3. Working with Raw Data
        1. Raw Email
        2. Structured vs. Semi-Structured Data
        3. SQL and NoSQL
          1. SQL
          2. NoSQL
          3. Conclusion
        4. Serialization
        5. Extracting and Exposing Features in Evolving Schemas
        6. Data Pipelines
      4. Data Perspectives
        1. Networks
        2. Time Series
        3. Natural Language
        4. Probability
        5. Conclusion
    3. 3. Agile Tools
      1. Introduction
      2. Example Code
      3. Scalability = Simplicity
      4. Agile Big Data Processing
      5. Setting up a Virtual Environment for Python
      6. Serializing Events with Avro
        1. Avro for Python
          1. Installation
          2. Testing
      7. Collecting Data
      8. Data Processing with Pig
        1. Introduction
        2. Installing Pig
      9. Publishing Data with MongoDB
        1. Introduction
        2. Installing MongoDB
        3. Installing MongoDB’s Java Driver
        4. Installing mongo-hadoop
        5. Pushing data to MongoDB from Pig
      10. Searching Data with ElasticSearch
        1. Installation
        2. ElasticSearch and Pig with Wonderdog
          1. Installing Wonderdog
          2. Wonderdog and Pig
          3. Searching our Data
          4. Python and ElasticSearch with
      11. Reflecting on our Workflow
      12. Lightweight Web Applications
        1. Python and Flask
          1. Flask
          2. Flask Echo
          3. Python and Mongo with
          4. Displaying sent_counts in Flask
        2. Conclusion
      13. Presenting our Data
        1. Introduction
        2. Installing Bootstrap
        3. Booting Bootstrap
        4. Visualizing Data with D3.js and nvd3.js
      14. Summary
    4. 4. To the Cloud!
      1. Introduction
      2. Example Code
      3. GitHub
      4. dotCloud
        1. Echo on dotCloud
        2. Python Workers
      5. Amazon Web Services
        1. Simple Storage Service - S3
        2. Elastic MapReduce
        3. MongoDB as a service
          1. Pushing data from Pig to MongoDB at dotCloud
      6. Instrumentation
        1. Google Analytics
        2. Mortar Data
  3. II. Climbing the Stack
    1. 5. Data Value Stack
      1. Introduction
      2. Climbing the Stack
      3. Agility through the Pyramid
    2. 6. Collecting and Displaying Records
      1. Introduction
      2. Example Code
      3. Putting it all together
      4. Collect and Serialize our Inbox
      5. Process and Publish our Emails
      6. Presenting Emails in a Browser
        1. Serving emails with Flask and pymongo
        2. Rendering HTML5 with Jinja2
      7. Agile Checkpoint
      8. Listing Emails
        1. Listing Emails with MongoDB
        2. Anatomy of a Presentation
          1. Reinventing the Wheel?
          2. Prototyping back from HTML
      9. Searching our Email
        1. Indexing our Email with Pig, ElasticSearch and Wonderdog
        2. Searching our Email on the Web
      10. Conclusion
    3. 7. Visualizing Data with Charts
      1. Introduction
      2. Example Code
      3. Good Charts
      4. Extracting Entities: Email Addresses
        1. Introduction
        2. Extracting Emails
      5. Visualizing Time
    4. 8. Exploring Data with Reports
      1. Introduction
      2. Example Code
      3. Building Reports with Multiple Charts
      4. Linking Records
      5. Extracting Keywords from Emails with TF-IDF
      6. Conclusion
    5. 9. Making Predictions
      1. Introduction
      2. Example Code
      3. Predicting Response Rates to Emails
      4. Personalization
      5. Conclusion
    6. 10. Driving Actions
      1. Introduction
      2. Example Code
      3. Properties of Successful Emails
      4. Better Predictions with Naive Bayes
      5. P(Reply | From & To)
      6. P(Reply | Token)
      7. Making Predictions in Real-Time
      8. Logging Events
      9. Conclusion
  4. About the Author
  5. Copyright