Principles of Data Science - Second Edition

Book description

Learn the techniques and math you need to start making sense of your data

Key Features

  • Strengthen your coding knowledge with the theory you need for practical insight into data science and analysis
  • More than just a math class: you'll perform real-world data science tasks using Python
  • Gain the best insights and transform your data to extract tangible value from it

Book Description

Need to turn programming skills into effective data science skills? This book helps you connect mathematics, programming, and business analysis. You'll feel confident asking, and answering, complex and sophisticated questions of your data, and turning abstract, raw statistics into actionable ideas.

Working through the data science pipeline, you'll clean and prepare data and learn effective data mining strategies and techniques, gaining a comprehensive view of how the pieces of the data science puzzle fit together. You'll learn the fundamentals of computational mathematics and statistics, along with the pseudo-code used by data scientists and analysts. You'll also learn machine learning, discover statistical models that help you control and navigate even the densest datasets, and build powerful visualizations that communicate what your data means.

What you will learn

  • Understand the five most important steps of data science
  • Use your data intelligently and learn how to handle it with care
  • Bridge the gap between mathematics and programming
  • Clean your data and drive actionable results using statistical models, calculus, and probability
  • Build and evaluate baseline machine learning models
  • Explore effective metrics to determine the success of your machine learning models
  • Create data visualizations that communicate actionable insights
  • Apply machine learning concepts to your problems and make actual predictions

Who this book is for

If you are an aspiring data scientist who wants to take your first steps in data science, this book is for you. It will also help if you have basic math skills but want to apply them in data science, or if you have good programming skills but lack the necessary math. Some knowledge of Python programming is beneficial.

Table of contents

  1. Principles of Data Science - Second Edition
    1. Table of Contents
    2. Principles of Data Science - Second Edition
      1. Why subscribe?
      2. PacktPub.com
    3. Contributors
      1. About the authors
      2. About the reviewers
      3. Packt is searching for authors like you
    4. Preface
      1. Who this book is for
      2. What this book covers
      3. To get the most out of this book
        1. Download the example code files
        2. Download the color images
        3. Conventions used
      4. Get in touch
        1. Reviews
    5. 1. How to Sound Like a Data Scientist
      1. What is data science?
        1. Basic terminology
        2. Why data science?
        3. Example – xyz123 Technologies
      2. The data science Venn diagram
        1. The math
          1. Example – spawner-recruit models
        2. Computer programming
      3. Why Python?
        1. Python practices
        2. Example of basic Python
        3. Example – parsing a single tweet
        4. Domain knowledge
      4. Some more terminology
      5. Data science case studies
        1. Case study – automating government paper pushing
          1. Fire all humans, right?
        2. Case study – marketing dollars
        3. Case study – what's in a job description?
      6. Summary
    6. 2. Types of Data
      1. Flavors of data
      2. Why look at these distinctions?
      3. Structured versus unstructured data
        1. Example of data pre-processing
          1. Word/phrase counts
          2. Presence of certain special characters
          3. The relative length of text
          4. Picking out topics
      4. Quantitative versus qualitative data
        1. Example – coffee shop data
        2. Example – world alcohol consumption data
        3. Digging deeper
      5. The road thus far
      6. The four levels of data
        1. The nominal level
          1. Mathematical operations allowed
          2. Measures of center
          3. What data is like at the nominal level
        2. The ordinal level
          1. Examples
          2. Mathematical operations allowed
          3. Measures of center
        3. Quick recap and check
        4. The interval level
          1. Example
          2. Mathematical operations allowed
          3. Measures of center
          4. Measures of variation
            1. Standard deviation
        5. The ratio level
          1. Examples
          2. Measures of center
          3. Problems with the ratio level
      7. Data is in the eye of the beholder
      8. Summary
    7. 3. The Five Steps of Data Science
      1. Introduction to data science
      2. Overview of the five steps
        1. Asking an interesting question
        2. Obtaining the data
        3. Exploring the data
        4. Modeling the data
        5. Communicating and visualizing the results
      3. Exploring the data
        1. Basic questions for data exploration
        2. Dataset 1 – Yelp
          1. DataFrames
          2. Series
          3. Exploration tips for qualitative data
            1. Nominal level columns
            2. Filtering in pandas
            3. Ordinal level columns
        3. Dataset 2 – Titanic
      4. Summary
    8. 4. Basic Mathematics
      1. Mathematics as a discipline
      2. Basic symbols and terminology
        1. Vectors and matrices
          1. Quick exercises
          2. Answers
        2. Arithmetic symbols
          1. Summation
          2. Proportional
          3. Dot product
        3. Graphs
        4. Logarithms/exponents
        5. Set theory
      3. Linear algebra
        1. Matrix multiplication
          1. How to multiply matrices
      4. Summary
    9. 5. Impossible or Improbable - A Gentle Introduction to Probability
      1. Basic definitions
      2. Probability
      3. Bayesian versus Frequentist
        1. Frequentist approach
          1. The law of large numbers
      4. Compound events
      5. Conditional probability
      6. The rules of probability
        1. The addition rule
        2. Mutual exclusivity
        3. The multiplication rule
        4. Independence
        5. Complementary events
      7. A bit deeper
      8. Summary
    10. 6. Advanced Probability
      1. Collectively exhaustive events
      2. Bayesian ideas revisited
        1. Bayes' theorem
        2. More applications of Bayes' theorem
          1. Example – Titanic
          2. Example – medical studies
      3. Random variables
        1. Discrete random variables
          1. Types of discrete random variables
        2. Binomial random variables
          1. Geometric random variables
          2. Poisson random variable
          3. Continuous random variables
      4. Summary
    11. 7. Basic Statistics
      1. What are statistics?
      2. How do we obtain and sample data?
        1. Obtaining data
          1. Observational
          2. Experimental
        2. Sampling data
          1. Probability sampling
          2. Random sampling
          3. Unequal probability sampling
        3. How do we measure statistics?
          1. Measures of center
          2. Measures of variation
            1. Definition
            2. Example – employee salaries
          3. Measures of relative standing
            1. The insightful part – correlations in data
        4. The empirical rule
        5. Summary
    12. 8. Advanced Statistics
      1. Point estimates
      2. Sampling distributions
      3. Confidence intervals
      4. Hypothesis tests
        1. Conducting a hypothesis test
        2. One sample t-tests
          1. Example of a one-sample t-test
          2. Assumptions of the one-sample t-test
        3. Type I and type II errors
        4. Hypothesis testing for categorical variables
          1. Chi-square goodness of fit test
            1. Assumptions of the chi-square goodness of fit test
            2. Example of a chi-square test for goodness of fit
          2. Chi-square test for association/independence
        5. Assumptions of the chi-square independence test
      5. Summary
    13. 9. Communicating Data
      1. Why does communication matter?
      2. Identifying effective and ineffective visualizations
        1. Scatter plots
        2. Line graphs
        3. Bar charts
        4. Histograms
        5. Box plots
      3. When graphs and statistics lie
        1. Correlation versus causation
        2. Simpson's paradox
        3. If correlation doesn't imply causation, then what does?
      4. Verbal communication
        1. It's about telling a story
        2. On the more formal side of things
      5. The why/how/what strategy of presenting
      6. Summary
    14. 10. How to Tell If Your Toaster Is Learning – Machine Learning Essentials
      1. What is machine learning?
        1. Example – facial recognition
      2. Machine learning isn't perfect
      3. How does machine learning work?
      4. Types of machine learning
        1. Supervised learning
        2. Example – heart attack prediction
          1. It's not only about predictions
          2. Types of supervised learning
            1. Regression
            2. Classification
          3. Data is in the eyes of the beholder
        3. Unsupervised learning
          1. Reinforcement learning
          2. Overview of the types of machine learning
      5. How does statistical modeling fit into all of this?
      6. Linear regression
        1. Adding more predictors
        2. Regression metrics
      7. Logistic regression
      8. Probability, odds, and log odds
        1. The math of logistic regression
      9. Dummy variables
      10. Summary
    15. 11. Predictions Don't Grow on Trees - or Do They?
      1. Naive Bayes classification
      2. Decision trees
        1. How does a computer build a regression tree?
        2. How does a computer fit a classification tree?
      3. Unsupervised learning
        1. When to use unsupervised learning
      4. k-means clustering
        1. Illustrative example – data points
        2. Illustrative example – beer!
      5. Choosing an optimal number for K and cluster validation
        1. The Silhouette Coefficient
        2. Feature extraction and principal component analysis
      6. Summary
    16. 12. Beyond the Essentials
      1. The bias/variance tradeoff
        1. Errors due to bias
        2. Error due to variance
          1. Example – comparing body and brain weight of mammals
        3. Two extreme cases of bias/variance tradeoff
          1. Underfitting
          2. Overfitting
        4. How bias/variance play into error functions
      2. K folds cross-validation
      3. Grid searching
        1. Visualizing training error versus cross-validation error
      4. Ensembling techniques
        1. Random forests
        2. Comparing random forests with decision trees
      5. Neural networks
        1. Basic structure
      6. Summary
    17. 13. Case Studies
      1. Case study 1 – Predicting stock prices based on social media
        1. Text sentiment analysis
        2. Exploratory data analysis
          1. Regression route
          2. Classification route
        3. Going beyond with this example
      2. Case study 2 – Why do some people cheat on their spouses?
      3. Case study 3 – Using TensorFlow
        1. TensorFlow and neural networks
      4. Summary
    18. 14. Microsoft Azure Databricks
      1. The Microsoft data science environment
        1. What exactly are Spark and PySpark?
      2. Basic Azure Databricks use
        1. Setting up our first cluster
        2. Case study 1 – bike-sharing usage prediction using parallelization in Azure Databricks
        3. Case study 2 – Using MLlib in Azure Databricks to predict credit card fraud
          1. Using the MLlib Grid Search module to tune hyperparameters
          2. Case study 3 – Using Azure Databricks to optimize our hyperparameter tuning
          3. How to add Python libraries to your cluster
          4. Using spark_sklearn to build an MNIST classifier
      3. Summary
    19. Another Book You May Enjoy
      1. Leave a review – let other readers know what you think
    20. Index

Product information

  • Title: Principles of Data Science - Second Edition
  • Author(s): Sinan Ozdemir, Sunil Kakade
  • Release date: December 2018
  • Publisher(s): Packt Publishing
  • ISBN: 9781789804546