Doing Data Science

Book description

Now that people are aware that data can make the difference in an election or a business model, data science as an occupation is gaining ground. But how can you get started working in a wide-ranging, interdisciplinary field that’s so clouded in hype? This insightful book, based on Columbia University’s Introduction to Data Science class, tells you what you need to know.

Table of Contents

  1. Dedication
  2. Preface
    1. Motivation
    2. Origins of the Class
    3. Origins of the Book
    4. What to Expect from This Book
    5. How This Book Is Organized
    6. How to Read This Book
    7. How Code Is Used in This Book
    8. Who This Book Is For
    9. Prerequisites
    10. Supplemental Reading
    11. About the Contributors
    12. Conventions Used in This Book
    13. Using Code Examples
    14. Safari® Books Online
    15. How to Contact Us
    16. Acknowledgments
  3. 1. Introduction: What Is Data Science?
    1. Big Data and Data Science Hype
    2. Getting Past the Hype
    3. Why Now?
      1. Datafication
    4. The Current Landscape (with a Little History)
      1. Data Science Jobs
    5. A Data Science Profile
    6. Thought Experiment: Meta-Definition
    7. OK, So What Is a Data Scientist, Really?
      1. In Academia
      2. In Industry
  4. 2. Statistical Inference, Exploratory Data Analysis, and the Data Science Process
    1. Statistical Thinking in the Age of Big Data
      1. Statistical Inference
      2. Populations and Samples
      3. Populations and Samples of Big Data
      4. Big Data Can Mean Big Assumptions
        1. Can N=ALL?
        2. Data is not objective
      5. Modeling
        1. What is a model?
        2. Statistical modeling
        3. But how do you build a model?
        4. Probability distributions
        5. Fitting a model
        6. Overfitting
    2. Exploratory Data Analysis
      1. Philosophy of Exploratory Data Analysis
      2. Exercise: EDA
        1. Sample code
    3. The Data Science Process
      1. A Data Scientist’s Role in This Process
    4. Thought Experiment: How Would You Simulate Chaos?
    5. Case Study: RealDirect
      1. How Does RealDirect Make Money?
      2. Exercise: RealDirect Data Strategy
        1. Sample R code
  5. 3. Algorithms
    1. Machine Learning Algorithms
    2. Three Basic Algorithms
      1. Linear Regression
        1. Start by writing something down
        2. Fitting the model
        3. Extending beyond least squares
          1. Adding in modeling assumptions about the errors
          2. Adding other predictors
          3. Transformations
        4. Review
        5. Exercise
      2. k-Nearest Neighbors (k-NN)
        1. Example with credit scores
        2. Similarity or distance metrics
        3. Training and test sets
        4. Pick an evaluation metric
        5. Putting it all together
        6. Choosing k
        7. What are the modeling assumptions?
      3. k-means
        1. 2D version
    3. Exercise: Basic Machine Learning Algorithms
      1. Solutions
        1. Sample R code: Linear regression on the housing dataset
        2. Sample R code: K-NN on the housing dataset
    4. Summing It All Up
    5. Thought Experiment: Automated Statistician
  6. 4. Spam Filters, Naive Bayes, and Wrangling
    1. Thought Experiment: Learning by Example
      1. Why Won’t Linear Regression Work for Filtering Spam?
      2. How About k-nearest Neighbors?
    2. Naive Bayes
      1. Bayes Law
      2. A Spam Filter for Individual Words
      3. A Spam Filter That Combines Words: Naive Bayes
    3. Fancy It Up: Laplace Smoothing
    4. Comparing Naive Bayes to k-NN
    5. Sample Code in bash
    6. Scraping the Web: APIs and Other Tools
    7. Jake’s Exercise: Naive Bayes for Article Classification
      1. Sample R Code for Dealing with the NYT API
  7. 5. Logistic Regression
    1. Thought Experiments
    2. Classifiers
      1. Runtime
      2. You
      3. Interpretability
      4. Scalability
    3. M6D Logistic Regression Case Study
      1. Click Models
      2. The Underlying Math
      3. Estimating α and β
      4. Newton’s Method
      5. Stochastic Gradient Descent
      6. Implementation
      7. Evaluation
    4. Media 6 Degrees Exercise
      1. Sample R Code
  8. 6. Time Stamps and Financial Modeling
    1. Kyle Teague and GetGlue
    2. Timestamps
      1. Exploratory Data Analysis (EDA)
      2. Metrics and New Variables or Features
      3. What’s Next?
    3. Cathy O’Neil
    4. Thought Experiment
    5. Financial Modeling
      1. In-Sample, Out-of-Sample, and Causality
      2. Preparing Financial Data
      3. Log Returns
      4. Example: The S&P Index
      5. Working out a Volatility Measurement
      6. Exponential Downweighting
      7. The Financial Modeling Feedback Loop
      8. Why Regression?
      9. Adding Priors
      10. A Baby Model
      11. Exercise: GetGlue and Timestamped Event Data
      12. Exercise: Financial Data
  9. 7. Extracting Meaning from Data
    1. William Cukierski
      1. Background: Data Science Competitions
      2. Background: Crowdsourcing
    2. The Kaggle Model
      1. A Single Contestant
      2. Their Customers
    3. Thought Experiment: What Are the Ethical Implications of a Robo-Grader?
    4. Feature Selection
      1. Example: User Retention
      2. Filters
      3. Wrappers
        1. Selecting an algorithm
        2. Selection criterion
        3. In practice
      4. Embedded Methods: Decision Trees
      5. Entropy
      6. The Decision Tree Algorithm
      7. Handling Continuous Variables in Decision Trees
      8. Random Forests
      9. User Retention: Interpretability Versus Predictive Power
    5. David Huffaker: Google’s Hybrid Approach to Social Research
      1. Moving from Descriptive to Predictive
      2. Social at Google
      3. Privacy
      4. Thought Experiment: What Is the Best Way to Decrease Concern and Increase Understanding and Control?
  10. 8. Recommendation Engines: Building a User-Facing Data Product at Scale
    1. A Real-World Recommendation Engine
      1. Nearest Neighbor Algorithm Review
      2. Some Problems with Nearest Neighbors
      3. Beyond Nearest Neighbor: Machine Learning Classification
      4. The Dimensionality Problem
      5. Singular Value Decomposition (SVD)
      6. Important Properties of SVD
      7. Principal Component Analysis (PCA)
        1. Theorem: The resulting latent features will be uncorrelated
      8. Alternating Least Squares
        1. Theorem with no proof: The preceding algorithm will converge if your prior is large enough
      9. Fix V and Update U
      10. Last Thoughts on These Algorithms
    2. Thought Experiment: Filter Bubbles
    3. Exercise: Build Your Own Recommendation System
      1. Sample Code in Python
  11. 9. Data Visualization and Fraud Detection
    1. Data Visualization History
      1. Gabriel Tarde
      2. Mark’s Thought Experiment
    2. What Is Data Science, Redux?
      1. Processing
      2. Franco Moretti
    3. A Sample of Data Visualization Projects
    4. Mark’s Data Visualization Projects
      1. New York Times Lobby: Moveable Type
      2. Project Cascade: Lives on a Screen
      3. Cronkite Plaza
      4. eBay Transactions and Books
      5. Public Theater Shakespeare Machine
      6. Goals of These Exhibits
    5. Data Science and Risk
      1. About Square
      2. The Risk Challenge
        1. Detecting suspicious activity using machine learning
      3. The Trouble with Performance Estimation
        1. Defining the error metric
        2. Defining the labels
        3. Challenges in features and learning
      4. Model Building Tips
        1. Code readability and reusability
        2. Get a pair!
        3. Productionizing machine learning models
    6. Data Visualization at Square
    7. Ian’s Thought Experiment
    8. Data Visualization for the Rest of Us
      1. Data Visualization Exercise
  12. 10. Social Networks and Data Journalism
    1. Social Network Analysis at Morningside Analytics
      1. Case-Attribute Data versus Social Network Data
    2. Social Network Analysis
    3. Terminology from Social Networks
      1. Centrality Measures
      2. The Industry of Centrality Measures
    4. Thought Experiment
    5. Morningside Analytics
      1. How Visualizations Help Us Find Schools of Fish
    6. More Background on Social Network Analysis from a Statistical Point of View
      1. Representations of Networks and Eigenvalue Centrality
      2. A First Example of Random Graphs: The Erdos-Renyi Model
      3. A Second Example of Random Graphs: The Exponential Random Graph Model
        1. Inference for ERGMs
        2. Further examples of random graphs: latent space models, small-world networks
    7. Data Journalism
      1. A Bit of History on Data Journalism
      2. Writing Technical Journalism: Advice from an Expert
  13. 11. Causality
    1. Correlation Doesn’t Imply Causation
      1. Asking Causal Questions
      2. Confounders: A Dating Example
    2. OK Cupid’s Attempt
    3. The Gold Standard: Randomized Clinical Trials
    4. A/B Tests
    5. Second Best: Observational Studies
      1. Simpson’s Paradox
      2. The Rubin Causal Model
      3. Visualizing Causality
      4. Definition: The Causal Effect
    6. Three Pieces of Advice
  14. 12. Epidemiology
    1. Madigan’s Background
    2. Thought Experiment
    3. Modern Academic Statistics
    4. Medical Literature and Observational Studies
    5. Stratification Does Not Solve the Confounder Problem
      1. What Do People Do About Confounding Things in Practice?
    6. Is There a Better Way?
    7. Research Experiment (Observational Medical Outcomes Partnership)
    8. Closing Thought Experiment
  15. 13. Lessons Learned from Data Competitions: Data Leakage and Model Evaluation
    1. Claudia’s Data Scientist Profile
      1. The Life of a Chief Data Scientist
      2. On Being a Female Data Scientist
    2. Data Mining Competitions
    3. How to Be a Good Modeler
    4. Data Leakage
      1. Market Predictions
      2. Amazon Case Study: Big Spenders
      3. A Jewelry Sampling Problem
      4. IBM Customer Targeting
      5. Breast Cancer Detection
      6. Pneumonia Prediction
    5. How to Avoid Leakage
    6. Evaluating Models
      1. Accuracy: Meh
      2. Probabilities Matter, Not 0s and 1s
    7. Choosing an Algorithm
    8. A Final Example
    9. Parting Thoughts
  16. 14. Data Engineering: MapReduce, Pregel, and Hadoop
    1. About David Crawshaw
    2. Thought Experiment
    3. MapReduce
    4. Word Frequency Problem
      1. Enter MapReduce
    5. Other Examples of MapReduce
      1. What Can’t MapReduce Do?
    6. Pregel
    7. About Josh Wills
    8. Thought Experiment
    9. On Being a Data Scientist
      1. Data Abundance Versus Data Scarcity
      2. Designing Models
        1. Mind the gap
    10. Economic Interlude: Hadoop
      1. A Brief Introduction to Hadoop
      2. Cloudera
    11. Back to Josh: Workflow
    12. So How to Get Started with Hadoop?
  17. 15. The Students Speak
    1. Process Thinking
    2. Naive No Longer
    3. Helping Hands
    4. Your Mileage May Vary
    5. Bridging Tunnels
    6. Some of Our Work
  18. 16. Next-Generation Data Scientists, Hubris, and Ethics
    1. What Just Happened?
    2. What Is Data Science (Again)?
    3. What Are Next-Gen Data Scientists?
      1. Being Problem Solvers
      2. Cultivating Soft Skills
      3. Being Question Askers
    4. Being an Ethical Data Scientist
    5. Career Advice
  19. Index
  20. Colophon
  21. Copyright