Live Online training

Foundational data science with R

Mastering basic statistical analysis, summary, and visualization using R

Colin Gillespie

This course provides a firm foundation on the fundamentals of data science using R, with a focus on key statistical methods, exploratory data analysis, and visualizations.

Before worrying about advanced analytics and neural nets, it is important to master the core skills. While this is certainly not a mathematical course, we won’t shy away from giving insight into the underlying mathematical theory. This invaluable online course will give you a solid grounding in the fundamental data science skills you need.

What you'll learn, and how you can apply it

At the end of this live, online training, you’ll understand:

  • How to summarize data sets with key statistics
  • Which statistics are optimal for large data sets
  • The trade-offs between different summary measures
  • The importance of color, transparency, and shape in data visualizations
  • Mathematical distributions, and how they relate to “real” data
  • How key algorithms work

And you’ll be able to:

  • Summarize data sets
  • Graphically describe data
  • Compare groups of data using principled statistical techniques
  • Describe relationships among data sets with correlation and regression models
  • Use insight to predict future values

This training course is for you because...

You are a:

  • Programmer interested in data science, but with little or no statistical or mathematical background.
  • Manager who wants to summarize data sets.
  • Someone who uses data, but doesn’t have the necessary training to analyze and summarize it.

Prerequisites

No experience with R is necessary, but participants are expected to understand basic programming in another language, e.g. Python, MATLAB, C, or Java.
The course will be taught using R, but the focus is on the methods, rather than programming.

About your instructor

  • Colin Gillespie is a Senior Lecturer in Statistics at Newcastle University, UK, and the co-author of Efficient R Programming (O’Reilly). His research interests are high-performance statistical computing and Bayesian statistics. He is regularly employed as a consultant by Jumping Rivers and has been teaching R since 2005 at a variety of levels, ranging from beginning to advanced programming.

Schedule

The timeframes are only estimates and may vary according to how the class is progressing.

DAY 1

Introduction and course overview (20 minutes)

  • Introduction
  • Course overview

Condensing data with numerical summaries (90 minutes)

Measures of location

  • Mean, median, mode
  • Example
  • Exercise / Q&A (25 minutes)

Measures of spread

  • Variance, standard deviation, quartiles, range
  • Example
  • Exercise / Q&A (25 minutes)

Streaming data

  • Mean vs median
  • Variance vs quartiles
  • Example
  • Exercise / Q&A (10 minutes)
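
The outline does not spell out why the mean and variance are contrasted with the median and quartiles for streaming data. One common reading is that the mean and variance can be updated one value at a time in constant memory, whereas an exact median or quartile generally needs the data to be kept. The sketch below illustrates that idea with Welford's online algorithm. It is written in Python purely as an illustration (the course itself uses R); the class name `RunningStats` and the data values are made up for this example.

```python
import math


class RunningStats:
    """Welford's online algorithm: update mean and variance one value at a time."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the current mean

    def update(self, x):
        # Only three numbers are stored, however long the stream is.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Sample variance (divides by n - 1); undefined for fewer than 2 values.
        return self._m2 / (self.n - 1) if self.n > 1 else math.nan


# Made-up values standing in for a data stream.
stream = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

stats = RunningStats()
for value in stream:
    stats.update(value)

print(stats.mean, stats.variance)
```

No comparable constant-memory update exists for the exact median, which is one reason the choice of summary measure matters when the data arrive as a stream.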

Break (5 min)

What, why and how of visualisation (90 minutes)

Scatter plot

  • Colors
  • Number of points: should you summarize?
  • Transparency
  • Log scales
  • Examples
  • Exercise / Q&A (25 minutes)

Histogram

  • How to determine the number of bins
  • Examples

Barplot

  • Ordinal data
  • Examples

Boxplot

  • Great for comparison
  • Examples
  • Exercise / Q&A (25 minutes)

Wrap-up

DAY 2

The normal distribution: what’s the point? (30 minutes)

  • Where does the normal distribution come from?
  • Shape: the famous bell-shaped curve
  • Key parameters
  • The 2 standard deviations rule
  • Scaling data: (data - mean) / sd
  • Example
  • Exercise/Q&A (10 minutes)
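
As a rough illustration of the scaling step listed above, (data - mean) / sd, the sketch below standardizes a small set of made-up values and checks how many fall within 2 standard deviations of the mean. For normally distributed data, roughly 95% of values are expected to do so. The example is written in Python purely as an illustration (the course itself uses R), and the variable names and data are assumptions for this example.

```python
import statistics

# Made-up measurements; any numeric data would do.
heights = [150.0, 160.0, 165.0, 170.0, 175.0, 180.0, 190.0]

mean = statistics.mean(heights)
sd = statistics.stdev(heights)  # sample standard deviation (divides by n - 1)

# Scaling: (data - mean) / sd. The result has mean 0 and standard deviation 1.
scaled = [(x - mean) / sd for x in heights]

# Proportion of observations within 2 standard deviations of the mean.
within_two_sd = sum(abs(z) <= 2 for z in scaled) / len(scaled)

print(scaled)
print(within_two_sd)
```

Scaled values are unit-free, so variables measured on different scales can be compared directly.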

Break (5 min)

How to compare groups (60 minutes)

  • The t-test
  • The t-distribution
  • Assumptions: normality, independence
  • Example: OK Cupid data. Are the “daters’” heights different from those of the standard population?
  • The central limit theorem (basically, don’t worry about normality too much if your data set is big enough)
  • Confidence intervals: standard error vs standard deviation
  • Example
  • Exercise/Q&A
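
The difference between the standard deviation and the standard error can be sketched in a few lines. The standard deviation describes the spread of individual observations, while the standard error, sd / sqrt(n), describes the uncertainty in the sample mean. The example below is in Python purely as an illustration (the course itself uses R). The sample values and `reference_mean` are hypothetical and are not taken from the OK Cupid data. The interval uses a normal quantile as an approximation; for a sample this small, a t-distribution quantile would give a wider and more appropriate interval.

```python
import math
import statistics

# Hypothetical heights in cm and a hypothetical population mean to compare against.
sample = [172.0, 168.0, 181.0, 175.0, 169.0, 178.0, 174.0, 171.0]
reference_mean = 170.0

n = len(sample)
mean = statistics.mean(sample)
sd = statistics.stdev(sample)  # spread of individual observations
se = sd / math.sqrt(n)         # uncertainty in the sample mean

# One-sample t statistic: distance of the sample mean from the reference
# value, measured in standard errors.
t_stat = (mean - reference_mean) / se

# Approximate 95% confidence interval for the mean, using a normal quantile
# (about 1.96) rather than the t-distribution quantile.
z = statistics.NormalDist().inv_cdf(0.975)
ci = (mean - z * se, mean + z * se)

print(sd, se, t_stat, ci)
```

The standard error shrinks as n grows while the standard deviation does not, which is why larger samples give narrower confidence intervals.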

Break (5 min)

Capturing relationships with linear regression (90 minutes)

  • Correlation: linear relationship between two variables
  • Examples
  • Exercise/Q&A
  • Simple linear regression
  • Assumptions
  • Residuals: observed - expected
  • Examples
  • Exercise/Q&A
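
To make the terms above concrete, the sketch below computes a correlation coefficient, fits a simple linear regression by least squares, and forms the residuals as observed minus expected (fitted) values. It is written in Python purely as an illustration (the course itself uses R), and the data are made up for this example.

```python
import math
import statistics

# Made-up paired observations.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

mean_x = statistics.mean(x)
mean_y = statistics.mean(y)

# Sums of squares and cross-products around the means.
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

# Pearson correlation: strength of the linear relationship, between -1 and 1.
r = sxy / math.sqrt(sxx * syy)

# Least-squares estimates for the line y = intercept + slope * x.
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Residuals: observed - expected (fitted) values.
fitted = [intercept + slope * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

print(r, slope, intercept)
print(residuals)
```

With an intercept in the model, least-squares residuals sum to zero, so it is their pattern rather than their total that is used to check the assumptions.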

Wrap-up (5 min)