Data Analysis with R

Book Description

Load, wrangle, and analyze your data using the world's most powerful statistical programming language

About This Book

  • Load, manipulate and analyze data from different sources
  • Gain a deeper understanding of fundamentals of applied statistics
  • A practical guide to performing data analysis on real-world data

Who This Book Is For

Whether you are learning data analysis for the first time or want to deepen the understanding you already have, this book will prove to be an invaluable resource. If you are looking for a book that takes you all the way from the fundamentals to the application of advanced and effective analytics methodologies, and you have some prior programming experience and a mathematical background, then this book is for you.

What You Will Learn

  • Navigate the R environment
  • Describe and visualize the behavior of data and relationships between data
  • Gain a thorough understanding of statistical reasoning and sampling
  • Employ hypothesis tests to draw inferences from your data
  • Learn Bayesian methods for estimating parameters
  • Perform regression to predict continuous variables
  • Apply powerful classification methods to predict categorical data
  • Handle missing data gracefully using multiple imputation
  • Identify and manage problematic data points
  • Employ parallelization and Rcpp to scale your analyses to larger data
  • Put best practices into effect to make your job easier and facilitate reproducibility

In Detail

Frequently the tool of choice for academics, R has spread deep into the private sector and can be found in the production pipelines at some of the most advanced and successful enterprises. The power and domain-specificity of R allow the user to express complex analytics easily, quickly, and succinctly. With over 7,000 user-contributed packages, it’s easy to find support for the latest and greatest algorithms and techniques.

Starting with the basics of R and statistical reasoning, Data Analysis with R dives into advanced predictive analytics, showing how to apply those techniques to real-world data through real-world examples.

Packed with engaging problems and exercises, this book begins with a review of R and its syntax. From there, get to grips with the fundamentals of applied statistics and build on this knowledge to perform sophisticated and powerful analytics. Solve the difficulties relating to performing data analysis in practice and find solutions to working with “messy data”, large data, communicating results, and facilitating reproducibility.

This book is engineered to be an invaluable resource through many stages of anyone’s career as a data analyst.

Style and approach

Learn data analysis through engaging examples and fun exercises, with a gentle and friendly but comprehensive "learn-by-doing" approach.

Downloading the example code for this book: you can download the example code files for all Packt books you have purchased from your Packt account. If you purchased this book elsewhere, you can register with Packt to obtain the code files.

Table of Contents

  1. Data Analysis with R
    1. Table of Contents
    2. Data Analysis with R
    3. Credits
    4. About the Author
    5. About the Reviewer
      1. Support files, eBooks, discount offers, and more
        1. Why subscribe?
        2. Free access for Packt account holders
    7. Preface
      1. What this book covers
      2. What you need for this book
      3. Who this book is for
      4. Conventions
      5. Reader feedback
      6. Customer support
        1. Downloading the example code
        2. Downloading the color images of this book
        3. Errata
        4. Piracy
        5. Questions
    8. 1. RefresheR
      1. Navigating the basics
        1. Arithmetic and assignment
        2. Logicals and characters
        3. Flow of control
      2. Getting help in R
      3. Vectors
        1. Subsetting
        2. Vectorized functions
        3. Advanced subsetting
        4. Recycling
      4. Functions
      5. Matrices
      6. Loading data into R
      7. Working with packages
      8. Exercises
      9. Summary
    9. 2. The Shape of Data
      1. Univariate data
      2. Frequency distributions
      3. Central tendency
      4. Spread
      5. Populations, samples, and estimation
      6. Probability distributions
      7. Visualization methods
      8. Exercises
      9. Summary
    10. 3. Describing Relationships
      1. Multivariate data
      2. Relationships between a categorical and a continuous variable
      3. Relationships between two categorical variables
      4. The relationship between two continuous variables
        1. Covariance
        2. Correlation coefficients
        3. Comparing multiple correlations
      5. Visualization methods
        1. Categorical and continuous variables
        2. Two categorical variables
        3. Two continuous variables
        4. More than two continuous variables
      6. Exercises
      7. Summary
    11. 4. Probability
      1. Basic probability
      2. A tale of two interpretations
      3. Sampling from distributions
        1. Parameters
        2. The binomial distribution
      4. The normal distribution
        1. The three-sigma rule and using z-tables
      5. Exercises
      6. Summary
    12. 5. Using Data to Reason About the World
      1. Estimating means
      2. The sampling distribution
      3. Interval estimation
        1. How did we get 1.96?
      4. Smaller samples
      5. Exercises
      6. Summary
    13. 6. Testing Hypotheses
      1. Null Hypothesis Significance Testing
        1. One and two-tailed tests
        2. When things go wrong
        3. A warning about significance
        4. A warning about p-values
      2. Testing the mean of one sample
        1. Assumptions of the one sample t-test
      3. Testing two means
        1. Don't be fooled!
        2. Assumptions of the independent samples t-test
      4. Testing more than two means
        1. Assumptions of ANOVA
      5. Testing independence of proportions
      6. What if my assumptions are unfounded?
      7. Exercises
      8. Summary
    14. 7. Bayesian Methods
      1. The big idea behind Bayesian analysis
      2. Choosing a prior
      3. Who cares about coin flips
      4. Enter MCMC – stage left
      5. Using JAGS and runjags
      6. Fitting distributions the Bayesian way
      7. The Bayesian independent samples t-test
      8. Exercises
      9. Summary
    15. 8. Predicting Continuous Variables
      1. Linear models
      2. Simple linear regression
      3. Simple linear regression with a binary predictor
        1. A word of warning
      4. Multiple regression
      5. Regression with a non-binary predictor
      6. Kitchen sink regression
      7. The bias-variance trade-off
        1. Cross-validation
        2. Striking a balance
      8. Linear regression diagnostics
        1. Second Anscombe relationship
        2. Third Anscombe relationship
        3. Fourth Anscombe relationship
      9. Advanced topics
      10. Exercises
      11. Summary
    16. 9. Predicting Categorical Variables
      1. k-Nearest Neighbors
        1. Using k-NN in R
          1. Confusion matrices
          2. Limitations of k-NN
      2. Logistic regression
        1. Using logistic regression in R
      3. Decision trees
      4. Random forests
      5. Choosing a classifier
        1. The vertical decision boundary
        2. The diagonal decision boundary
        3. The crescent decision boundary
        4. The circular decision boundary
      6. Exercises
      7. Summary
    17. 10. Sources of Data
      1. Relational Databases
        1. Why didn't we just do that in SQL?
      2. Using JSON
      3. XML
      4. Other data formats
      5. Online repositories
      6. Exercises
      7. Summary
    18. 11. Dealing with Messy Data
      1. Analysis with missing data
        1. Visualizing missing data
        2. Types of missing data
          1. So which one is it?
        3. Unsophisticated methods for dealing with missing data
          1. Complete case analysis
          2. Pairwise deletion
          3. Mean substitution
          4. Hot deck imputation
          5. Regression imputation
          6. Stochastic regression imputation
        4. Multiple imputation
          1. So how does mice come up with the imputed values?
            1. Methods of imputation
        5. Multiple imputation in practice
      2. Analysis with unsanitized data
        1. Checking for out-of-bounds data
        2. Checking the data type of a column
        3. Checking for unexpected categories
        4. Checking for outliers, entry errors, or unlikely data points
        5. Chaining assertions
      3. Other messiness
        1. OpenRefine
        2. Regular expressions
        3. tidyr
      4. Exercises
      5. Summary
    19. 12. Dealing with Large Data
      1. Wait to optimize
      2. Using a bigger and faster machine
      3. Be smart about your code
        1. Allocation of memory
        2. Vectorization
      4. Using optimized packages
      5. Using another R implementation
      6. Use parallelization
        1. Getting started with parallel R
        2. An example of (some) substance
      7. Using Rcpp
      8. Be smarter about your code
      9. Exercises
      10. Summary
    20. 13. Reproducibility and Best Practices
      1. R Scripting
        1. RStudio
        2. Running R scripts
        3. An example script
        4. Scripting and reproducibility
      2. R projects
      3. Version control
      4. Communicating results
      5. Exercises
      6. Summary
    21. Index