The Data Science Handbook

Book Description

A comprehensive overview of data science covering the analytics, programming, and business skills necessary to master the discipline

Finding a good data scientist has been likened to hunting for a unicorn: the required combination of technical skills is simply very hard to find in one person. In addition, good data science is not just rote application of trainable skill sets; it requires the ability to think flexibly about all these areas and understand the connections between them. This book provides a crash course in data science, combining all the necessary skills into a unified discipline.

Unlike many analytics books, computer science and software engineering are given extensive coverage since they play such a central role in the daily work of a data scientist. The author also describes classic machine learning algorithms, from their mathematical foundations to real-world applications. Visualization tools are reviewed, and their central importance in data science is highlighted. Classical statistics is addressed to help readers think critically about the interpretation of data and its common pitfalls. The clear communication of technical results, which is perhaps the most undertrained of data science skills, is given its own chapter, and all topics are explained in the context of solving real-world data problems. The book also features:

• Extensive sample code and tutorials using Python™ along with its technical libraries

• Core technologies of “Big Data,” including their strengths and limitations and how they can be used to solve real-world problems

• Coverage of the practical realities of the tools, keeping theory to a minimum; however, when theory is presented, it is done in an intuitive way to encourage critical thinking and creativity

• A wide variety of case studies from industry

• Practical advice on the realities of being a data scientist today, including the overall workflow, where time is spent, the types of datasets worked on, and the skill sets needed

The Data Science Handbook is an ideal resource for data analysis methodology and big data software tools. The book is appropriate for people who want to practice data science, but lack the required skill sets. This includes software professionals who need to better understand analytics and statisticians who need to understand software. Modern data science is a unified discipline, and it is presented as such. This book is also an appropriate reference for researchers and entry-level graduate students who need to learn real-world analytics and expand their skill set.

FIELD CADY is the data scientist at the Allen Institute for Artificial Intelligence, where he develops tools that use machine learning to mine scientific literature. He has also worked at Google and several Big Data startups. He has a BS in physics and math from Stanford University, and an MS in computer science from Carnegie Mellon.

Table of Contents

  1. Cover
  2. Title Page
    1. Copyright
    2. Dedication
  3. Preface
    1. Chapter 1: Introduction: Becoming a Unicorn
      1. 1.1 Aren't Data Scientists Just Overpaid Statisticians?
      2. 1.2 How Is This Book Organized?
      3. 1.3 How to Use This Book?
      4. 1.4 Why Is It All in Python™, Anyway?
      5. 1.5 Example Code and Datasets
      6. 1.6 Parting Words
  4. Part I: The Stuff You'll Always Use
    1. Chapter 2: The Data Science Road Map
      1. 2.1 Frame the Problem
      2. 2.2 Understand the Data: Basic Questions
      3. 2.3 Understand the Data: Data Wrangling
      4. 2.4 Understand the Data: Exploratory Analysis
      5. 2.5 Extract Features
      6. 2.6 Model
      7. 2.7 Present Results
      8. 2.8 Deploy Code
      9. 2.9 Iterating
      10. 2.10 Glossary
    2. Chapter 3: Programming Languages
      1. 3.1 Why Use a Programming Language? What Are the Other Options?
      2. 3.2 A Survey of Programming Languages for Data Science
      3. 3.3 Python Crash Course
      4. 3.4 Strings
      5. 3.5 Defining Functions
      6. 3.6 Python's Technical Libraries
      7. 3.7 Other Python Resources
      8. 3.8 Further Reading
      9. 3.9 Glossary
    3. Interlude: My Personal Toolkit
    4. Chapter 4: Data Munging: String Manipulation, Regular Expressions, and Data Cleaning
      1. 4.1 The Worst Dataset in the World
      2. 4.2 How to Identify Pathologies
      3. 4.3 Problems with Data Content
      4. 4.4 Formatting Issues
      5. 4.5 Example Formatting Script
      6. 4.6 Regular Expressions
      7. 4.7 Life in the Trenches
      8. 4.8 Glossary
    5. Chapter 5: Visualizations and Simple Metrics
      1. 5.1 A Note on Python's Visualization Tools
      2. 5.2 Example Code
      3. 5.3 Pie Charts
      4. 5.4 Bar Charts
      5. 5.5 Histograms
      6. 5.6 Means, Standard Deviations, Medians, and Quantiles
      7. 5.7 Boxplots
      8. 5.8 Scatterplots
      9. 5.9 Scatterplots with Logarithmic Axes
      10. 5.10 Scatter Matrices
      11. 5.11 Heatmaps
      12. 5.12 Correlations
      13. 5.13 Anscombe's Quartet and the Limits of Numbers
      14. 5.14 Time Series
      15. 5.15 Further Reading
      16. 5.16 Glossary
    6. Chapter 6: Machine Learning Overview
      1. 6.1 Historical Context
      2. 6.2 Supervised versus Unsupervised
      3. 6.3 Training Data, Testing Data, and the Great Boogeyman of Overfitting
      4. 6.4 Further Reading
      5. 6.5 Glossary
    7. Chapter 7: Interlude: Feature Extraction Ideas
      1. 7.1 Standard Features
      2. 7.2 Features That Involve Grouping
      3. 7.3 Preview of More Sophisticated Features
      4. 7.4 Defining the Feature You Want to Predict
    8. Chapter 8: Machine Learning Classification
      1. 8.1 What Is a Classifier, and What Can You Do with It?
      2. 8.2 A Few Practical Concerns
      3. 8.3 Binary versus Multiclass
      4. 8.4 Example Script
      5. 8.5 Specific Classifiers
      6. 8.6 Evaluating Classifiers
      7. 8.7 Selecting Classification Cutoffs
      8. 8.8 Further Reading
      9. 8.9 Glossary
    9. Chapter 9: Technical Communication and Documentation
      1. 9.1 Several Guiding Principles
      2. 9.2 Slide Decks
      3. 9.3 Written Reports
      4. 9.4 Speaking: What Has Worked for Me
      5. 9.5 Code Documentation
      6. 9.6 Further Reading
      7. 9.7 Glossary
  5. Part II: Stuff You Still Need to Know
    1. Chapter 10: Unsupervised Learning: Clustering and Dimensionality Reduction
      1. 10.1 The Curse of Dimensionality
      2. 10.2 Example: Eigenfaces for Dimensionality Reduction
      3. 10.3 Principal Component Analysis and Factor Analysis
      4. 10.4 Scree Plots and Understanding Dimensionality
      5. 10.5 Factor Analysis
      6. 10.6 Limitations of PCA
      7. 10.7 Clustering
      8. 10.8 Further Reading
      9. 10.9 Glossary
    2. Chapter 11: Regression
      1. 11.1 Example: Predicting Diabetes Progression
      2. 11.2 Least Squares
      3. 11.3 Fitting Nonlinear Curves
      4. 11.4 Goodness of Fit: R² and Correlation
      5. 11.5 Correlation of Residuals
      6. 11.6 Linear Regression
      7. 11.7 LASSO Regression and Feature Selection
      8. 11.8 Further Reading
      9. 11.9 Glossary
    3. Chapter 12: Data Encodings and File Formats
      1. 12.1 Typical File Format Categories
      2. 12.2 CSV Files
      3. 12.3 JSON Files
      4. 12.4 XML Files
      5. 12.5 HTML Files
      6. 12.6 Tar Files
      7. 12.7 GZip Files
      8. 12.8 Zip Files
      9. 12.9 Image Files: Rasterized, Vectorized, and/or Compressed
      10. 12.10 It's All Bytes at the End of the Day
      11. 12.11 Integers
      12. 12.12 Floats
      13. 12.13 Text Data
      14. 12.14 Further Reading
      15. 12.15 Glossary
    4. Chapter 13: Big Data
      1. 13.1 What Is Big Data?
      2. 13.2 Hadoop: The File System and the Processor
      3. 13.3 Using HDFS
      4. 13.4 Example PySpark Script
      5. 13.5 Spark Overview
      6. 13.6 Spark Operations
      7. 13.7 Two Ways to Run PySpark
      8. 13.8 Configuring Spark
      9. 13.9 Under the Hood
      10. 13.10 Spark Tips and Gotchas
      11. 13.11 The MapReduce Paradigm
      12. 13.12 Performance Considerations
      13. 13.13 Further Reading
      14. 13.14 Glossary
    5. Chapter 14: Databases
      1. 14.1 Relational Databases and MySQL®
      2. 14.2 Key-Value Stores
      3. 14.3 Wide Column Stores
      4. 14.4 Document Stores
      5. 14.5 Further Reading
      6. 14.6 Glossary
    6. Chapter 15: Software Engineering Best Practices
      1. 15.1 Coding Style
      2. 15.2 Version Control and Git for Data Scientists
      3. 15.3 Testing Code
      4. 15.4 Test-Driven Development
      5. 15.5 AGILE Methodology
      6. 15.6 Further Reading
      7. 15.7 Glossary
    7. Chapter 16: Natural Language Processing
      1. 16.1 Do I Even Need NLP?
      2. 16.2 The Great Divide: Language versus Statistics
      3. 16.3 Example: Sentiment Analysis on Stock Market Articles
      4. 16.4 Software and Datasets
      5. 16.5 Tokenization
      6. 16.6 Central Concept: Bag-of-Words
      7. 16.7 Word Weighting: TF-IDF
      8. 16.8 n-Grams
      9. 16.9 Stop Words
      10. 16.10 Lemmatization and Stemming
      11. 16.11 Synonyms
      12. 16.12 Part of Speech Tagging
      13. 16.13 Common Problems
      14. 16.14 Advanced NLP: Syntax Trees, Knowledge, and Understanding
      15. 16.15 Further Reading
      16. 16.16 Glossary
    8. Chapter 17: Time Series Analysis
      1. 17.1 Example: Predicting Wikipedia Page Views
      2. 17.2 A Typical Workflow
      3. 17.3 Time Series versus Time-Stamped Events
      4. 17.4 Resampling and Interpolation
      5. 17.5 Smoothing Signals
      6. 17.6 Logarithms and Other Transformations
      7. 17.7 Trends and Periodicity
      8. 17.8 Windowing
      9. 17.9 Brainstorming Simple Features
      10. 17.10 Better Features: Time Series as Vectors
      11. 17.11 Fourier Analysis: Sometimes a Magic Bullet
      12. 17.12 Time Series in Context: The Whole Suite of Features
      13. 17.13 Further Reading
      14. 17.14 Glossary
    9. Chapter 18: Probability
      1. 18.1 Flipping Coins: Bernoulli Random Variables
      2. 18.2 Throwing Darts: Uniform Random Variables
      3. 18.3 The Uniform Distribution and Pseudorandom Numbers
      4. 18.4 Nondiscrete, Noncontinuous Random Variables
      5. 18.5 Notation, Expectations, and Standard Deviation
      6. 18.6 Dependence, Marginal and Conditional Probability
      7. 18.7 Understanding the Tails
      8. 18.8 Binomial Distribution
      9. 18.9 Poisson Distribution
      10. 18.10 Normal Distribution
      11. 18.11 Multivariate Gaussian
      12. 18.12 Exponential Distribution
      13. 18.13 Log-Normal Distribution
      14. 18.14 Entropy
      15. 18.15 Further Reading
      16. 18.16 Glossary
    10. Chapter 19: Statistics
      1. 19.1 Statistics in Perspective
      2. 19.2 Bayesian versus Frequentist: Practical Tradeoffs and Differing Philosophies
      3. 19.3 Hypothesis Testing: Key Idea and Example
      4. 19.4 Multiple Hypothesis Testing
      5. 19.5 Parameter Estimation
      6. 19.6 Hypothesis Testing: t-Test
      7. 19.7 Confidence Intervals
      8. 19.8 Bayesian Statistics
      9. 19.9 Naive Bayesian Statistics
      10. 19.10 Bayesian Networks
      11. 19.11 Choosing Priors: Maximum Entropy or Domain Knowledge
      12. 19.12 Further Reading
      13. 19.13 Glossary
    11. Chapter 20: Programming Language Concepts
      1. 20.1 Programming Paradigms
      2. 20.2 Compilation and Interpretation
      3. 20.3 Type Systems
      4. 20.4 Further Reading
      5. 20.5 Glossary
    12. Chapter 21: Performance and Computer Memory
      1. 21.1 Example Script
      2. 21.2 Algorithm Performance and Big-O Notation
      3. 21.3 Some Classic Problems: Sorting a List and Binary Search
      4. 21.4 Amortized Performance and Average Performance
      5. 21.5 Two Principles: Reducing Overhead and Managing Memory
      6. 21.6 Performance Tip: Use Numerical Libraries When Applicable
      7. 21.7 Performance Tip: Delete Large Structures You Don't Need
      8. 21.8 Performance Tip: Use Built-In Functions When Possible
      9. 21.9 Performance Tip: Avoid Superfluous Function Calls
      10. 21.10 Performance Tip: Avoid Creating Large New Objects
      11. 21.11 Further Reading
      12. 21.12 Glossary
  6. Part III: Specialized or Advanced Topics
    1. Chapter 22: Computer Memory and Data Structures
      1. 22.1 Virtual Memory, the Stack, and the Heap
      2. 22.2 Example C Program
      3. 22.3 Data Types and Arrays in Memory
      4. 22.4 Structs
      5. 22.5 Pointers, the Stack, and the Heap
      6. 22.6 Key Data Structures
      7. 22.7 Further Reading
      8. 22.8 Glossary
    2. Chapter 23: Maximum Likelihood Estimation and Optimization
      1. 23.1 Maximum Likelihood Estimation
      2. 23.2 A Simple Example: Fitting a Line
      3. 23.3 Another Example: Logistic Regression
      4. 23.4 Optimization
      5. 23.5 Gradient Descent and Convex Optimization
      6. 23.6 Convex Optimization
      7. 23.7 Stochastic Gradient Descent
      8. 23.8 Further Reading
      9. 23.9 Glossary
    3. Chapter 24: Advanced Classifiers
      1. 24.1 A Note on Libraries
      2. 24.2 Basic Deep Learning
      3. 24.3 Convolutional Neural Networks
      4. 24.4 Different Types of Layers. What the Heck Is a Tensor?
      5. 24.5 Example: The MNIST Handwriting Dataset
      6. 24.6 Recurrent Neural Networks
      7. 24.7 Bayesian Networks
      8. 24.8 Training and Prediction
      9. 24.9 Markov Chain Monte Carlo
      10. 24.10 PyMC Example
      11. 24.11 Further Reading
      12. 24.12 Glossary
    4. Chapter 25: Stochastic Modeling
      1. 25.1 Markov Chains
      2. 25.2 Two Kinds of Markov Chain, Two Kinds of Questions
      3. 25.3 Markov Chain Monte Carlo
      4. 25.4 Hidden Markov Models and the Viterbi Algorithm
      5. 25.5 The Viterbi Algorithm
      6. 25.6 Random Walks
      7. 25.7 Brownian Motion
      8. 25.8 ARIMA Models
      9. 25.9 Continuous-Time Markov Processes
      10. 25.10 Poisson Processes
      11. 25.11 Further Reading
      12. 25.12 Glossary
    5. Parting Words: Your Future as a Data Scientist
    6. Index
  7. End User License Agreement