Getting Started with Data Science: Making Sense of Data with Analytics

Book Description

Master Data Analytics Hands-On by Solving Fascinating Problems You’ll Actually Enjoy!

Harvard Business Review recently called data science “The Sexiest Job of the 21st Century.” It’s not just sexy: For millions of managers, analysts, and students who need to solve real business problems, it’s indispensable. Unfortunately, there’s been nothing easy about learning data science, until now.

Getting Started with Data Science takes its inspiration from worldwide best-sellers like Freakonomics and Malcolm Gladwell’s Outliers: It teaches through a powerful narrative packed with unforgettable stories.

Murtaza Haider offers informative, jargon-free coverage of basic theory and technique, backed with plenty of vivid examples and hands-on practice opportunities. Everything’s software and platform agnostic, so you can learn data science whether you work with R, Stata, SPSS, or SAS. Best of all, Haider teaches a crucial skillset most data science books ignore: how to tell powerful stories using graphics and tables. Every chapter is built around real research challenges, so you’ll always know why you’re doing what you’re doing.

You’ll master data science by answering fascinating questions, such as:
• Are religious individuals more or less likely to have extramarital affairs?
• Do attractive professors get better teaching evaluations?
• Does the higher price of cigarettes deter smoking?
• What determines housing prices more: lot size or the number of bedrooms?
• How do teenagers and older people differ in the way they use social media?
• Who is more likely to use online dating services?
• Why do some people purchase iPhones and others BlackBerry devices?
• Does the presence of children influence a family’s spending on alcohol?

For each problem, you’ll walk through defining your question and the answers you’ll need; exploring how others have approached similar challenges; selecting your data and methods; generating your statistics; organizing your report; and telling your story. Throughout, the focus is squarely on what matters most: transforming data into insights that are clear, accurate, and can be acted upon.

Table of Contents

  1. About This E-Book
  2. Title Page
  3. Copyright Page
  4. Praise for Getting Started with Data Science
  5. Dedication Page
  6. Contents-at-a-Glance
  7. Contents
  8. Preface
    1. Why I Wrote This Book
    2. Who Should Read This Book?
    3. About the Book
    4. The Book’s Three Key Ingredients: Narrative, Graphs, and Tables
    5. The Story Telling Differentiator
    6. Understanding Analytics in a 24/7 World
    7. A Quick Walkthrough of the Book
  9. Acknowledgments
  10. About the Author
  11. Chapter 1. The Bazaar of Storytellers
    1. Data Science: The Sexiest Job in the 21st Century
    2. Storytelling at Google and Walmart
    3. Getting Started with Data Science
      1. Do We Need Another Book on Analytics?
      2. Repeat, Repeat, Repeat, and Simplify
      3. Chapters’ Structure and Features
      4. Analytics Software Used
    4. What Makes Someone a Data Scientist?
      1. Existential Angst of a Data Scientist
      2. Data Scientists: Rarer Than Unicorns
    5. Beyond the Big Data Hype
      1. Big Data: Beyond Cheerleading
      2. Big Data Hubris
      3. Leading by Miles
      4. Predicting Pregnancies, Missing Abortions
    6. What’s Beyond This Book?
    7. Summary
    8. Endnotes
  12. Chapter 2. Data in the 24/7 Connected World
    1. The Liberated Data: The Open Data
    2. The Caged Data
    3. Big Data Is Big News
    4. It’s Not the Size of Big Data; It’s What You Do with It
    5. Free Data as in Free Lunch
      1. FRED
      2. Quandl
      3. U.S. Census Bureau and Other National Statistical Agencies
    6. Search-Based Internet Data
      1. Google Trends
      2. Google Correlate
    7. Survey Data
      1. PEW Surveys
      2. ICPSR
    8. Summary
    9. Endnotes
  13. Chapter 3. The Deliverable
    1. The Final Deliverable
      1. What Is the Research Question?
      2. What Answers Are Needed?
      3. How Have Others Researched the Same Question in the Past?
      4. What Information Do You Need to Answer the Question?
      5. What Analytical Techniques/Methods Do You Need?
    2. The Narrative
      1. The Report Structure
      2. Have You Done Your Job as a Writer?
    3. Building Narratives with Data
      1. “Big Data, Big Analytics, Big Opportunity”
      2. Urban Transport and Housing Challenges
      3. Human Development in South Asia
      4. The Big Move
    4. Summary
    5. Endnotes
  14. Chapter 4. Serving Tables
    1. 2014: The Year of Soccer and Brazil
      1. Using Percentages Is Better Than Using Raw Numbers
      2. Data Cleaning
      3. Weighted Data
      4. Cross Tabulations
      5. Going Beyond the Basics in Tables
    2. Seeing Whether Beauty Pays
      1. Data Set
      2. What Determines Teaching Evaluations?
      3. Does Beauty Affect Teaching Evaluations?
      4. Putting It All on (in) a Table
    3. Generating Output with Stata
      1. Summary Statistics Using Built-In Stata
      2. Using Descriptive Statistics
      3. Weighted Statistics
      4. Correlation Matrix
      5. Reproducing the Results for the Hamermesh and Parker Paper
      6. Statistical Analysis Using Custom Tables
    4. Summary
    5. Endnotes
  15. Chapter 5. Graphic Details
    1. Telling Stories with Figures
      1. Data Types
    2. Teaching Ratings
    3. The Congested Lives in Big Cities
    4. Summary
    5. Endnotes
  16. Chapter 6. Hypothetically Speaking
    1. Random Numbers and Probability Distributions
    2. Casino Royale: Roll the Dice
    3. Normal Distribution
    4. The Student Who Taught Everyone Else
    5. Statistical Distributions in Action
      1. Z-Transformation
      2. Probability of Getting a High or Low Course Evaluation
      3. Probabilities with Standard Normal Table
    6. Hypothetically Yours
      1. Consistently Better or Happenstance
      2. Mean and Not So Mean Differences
      3. Handling Rejections
    7. The Mean and Kind Differences
      1. Comparing a Sample Mean When the Population SD Is Known
      2. Left Tail Between the Legs
      3. Comparing Means with Unknown Population SD
      4. Comparing Two Means with Unequal Variances
      5. Comparing Two Means with Equal Variances
    8. Worked-Out Examples of Hypothesis Testing
      1. Best Buy–Apple Store Comparison
      2. Assuming Equal Variances
    9. Exercises for Comparison of Means
    10. Regression for Hypothesis Testing
    11. Analysis of Variance
    12. Significantly Correlated
    13. Summary
    14. Endnotes
  17. Chapter 7. Why Tall Parents Don’t Have Even Taller Children
    1. The Department of Obvious Conclusions
      1. Why Regress?
    2. Introducing Regression Models
      1. All Else Being Equal
      2. Holding Other Factors Constant
      3. Spuriously Correlated
      4. A Step-By-Step Approach to Regression
      5. Learning to Speak Regression
      6. The Math Behind Regression
      7. Ordinary Least Squares Method
    3. Regression in Action
      1. This Just In: Bigger Homes Sell for More
      2. Does Beauty Pay? Ask the Students
      3. Survey Data, Weights, and Independence of Observations
      4. What Determines Household Spending on Alcohol and Food
      5. What Influences Household Spending on Food?
    4. Advanced Topics
      1. Homoskedasticity
      2. Multicollinearity
    5. Summary
    6. Endnotes
  18. Chapter 8. To Be or Not to Be
    1. To Smoke or Not to Smoke: That Is the Question
      1. Binary Outcomes
      2. Binary Dependent Variables
      3. Let’s Question the Decision to Smoke or Not
      4. Smoking Data Set
    2. Exploratory Data Analysis
    3. What Makes People Smoke: Asking Regression for Answers
      1. Ordinary Least Squares Regression
      2. Interpreting Models at the Margins
    4. The Logit Model
    5. Interpreting Odds in a Logit Model
    6. Probit Model
      1. Interpreting the Probit Model
      2. Using Zelig for Estimation and Post-Estimation Strategies
    7. Estimating Logit Models for Grouped Data
    8. Using SPSS to Explore the Smoking Data Set
      1. Regression Analysis in SPSS
      2. Estimating Logit and Probit Models in SPSS
    9. Summary
    10. Endnotes
  19. Chapter 9. Categorically Speaking About Categorical Data
    1. What Is Categorical Data?
    2. Analyzing Categorical Data
    3. Econometric Models of Binomial Data
      1. Estimation of Binary Logit Models
      2. Odds Ratio
      3. Log of Odds Ratio
      4. Interpreting Binary Logit Models
      5. Statistical Inference of Binary Logit Models
    4. How I Met Your Mother? Analyzing Survey Data
      1. A Blind Date with the Pew Online Dating Data Set
      2. Demographics of Affection
      3. High-Techies
      4. Romancing the Internet
      5. Dating Models
    5. Multinomial Logit Models
      1. Interpreting Multinomial Logit Models
      2. Choosing an Online Dating Service
      3. Pew Phone Type Model
      4. Why Some Women Work Full-Time and Others Don’t
    6. Conditional Logit Models
      1. Random Utility Model
      2. Independence From Irrelevant Alternatives
      3. Interpretation of Conditional Logit Models
      4. Estimating Logit Models in SPSS
    7. Summary
    8. Endnotes
  20. Chapter 10. Spatial Data Analytics
    1. Fundamentals of GIS
    2. GIS Platforms
      1. Freeware GIS
      2. GIS Data Structure
    3. GIS Applications in Business Research
      1. Retail Research
      2. Hospitality and Tourism Research
      3. Lifestyle Data: Consumer Health Profiling
      4. Competitor Location Analysis
      5. Market Segmentation
    4. Spatial Analysis of Urban Challenges
      1. The Hard Truths About Public Transit in North America
      2. Toronto Is a City Divided into the Haves, Will Haves, and Have Nots
      3. Income Disparities in Urban Canada
      4. Where Is Toronto’s Missing Middle Class? It Has Suburbanized Out of Toronto
    5. Adding Spatial Analytics to Data Science
    6. Race and Space in Chicago
      1. Developing Research Questions
      2. Race, Space, and Poverty
      3. Race, Space, and Commuting
      4. Regression with Spatial Lags
    7. Summary
    8. Endnotes
  21. Chapter 11. Doing Serious Time with Time Series
    1. Introducing Time Series Data and How to Visualize It
    2. How Is Time Series Data Different?
    3. Starting with Basic Regression Models
    4. What Is Wrong with Using OLS Models for Time Series Data?
      1. Newey–West Standard Errors
      2. Regressing Prices with Robust Standard Errors
    5. Time Series Econometrics
      1. Stationary Time Series
      2. Autocorrelation Function (ACF)
      3. Partial Autocorrelation Function (PCF)
      4. White Noise Tests
      5. Augmented Dickey Fuller Test
    6. Econometric Models for Time Series Data
      1. Correlation Diagnostics
      2. Invertible Time Series and Lag Operators
      3. The ARMA Model
      4. ARIMA Models
      5. Distributed Lag and VAR Models
    7. Applying Time Series Tools to Housing Construction
      1. Macro-Economic and Socio-Demographic Variables Influencing Housing Starts
    8. Estimating Time Series Models to Forecast New Housing Construction
      1. OLS Models
      2. Distributed Lag Model
      3. Out-of-Sample Forecasting with Vector Autoregressive Models
      4. ARIMA Models
    9. Summary
    10. Endnotes
  22. Chapter 12. Data Mining for Gold
    1. Can Cheating on Your Spouse Kill You?
      1. Are Cheating Men Alpha Males?
      2. UnFair Comments: New Evidence Critiques Fair’s Research
    2. Data Mining: An Introduction
    3. Seven Steps Down the Data Mine
      1. Establishing Data Mining Goals
      2. Selecting Data
      3. Preprocessing Data
      4. Transforming Data
      5. Storing Data
      6. Mining Data
      7. Evaluating Mining Results
    4. Rattle Your Data
      1. What Does Religiosity Have to Do with Extramarital Affairs?
      2. The Principal Components of an Extramarital Affair
      3. Will It Rain Tomorrow? Using PCA For Weather Forecasting
      4. Do Men Have More Affairs Than Females?
      5. Two Kinds of People: Those Who Have Affairs, and Those Who Don’t
      6. Models to Mine Data with Rattle
    5. Summary
    6. Endnotes
  23. Index
  24. Code Snippets