Foundational data science with R
Mastering basic statistical analysis, summary, and visualization using R
This course provides a firm foundation in the fundamentals of data science using R, with a focus on key statistical methods, exploratory data analysis, and visualizations.
Before worrying about advanced analytics and neural nets, it is important to master the core skills. While this is certainly not a mathematical course, we won’t shy away from giving insight into the underlying mathematical theory. This invaluable online course will give you a solid grounding in the fundamental data science skills you need.
What you'll learn and how you can apply it
At the end of this live, online training, you’ll understand:
 How to summarize data sets with key statistics
 Which statistics are optimal for large data sets
 The tradeoffs between different summary measures
 The importance of color, transparency and shape in data visualisations
 Mathematical distributions, and how they relate to “real” data
 How key algorithms work
And you’ll be able to:
 Summarize data sets
 Graphically describe data
 Compare groups of data using principled statistical techniques
 Describe relationships among data sets with correlation and regression models
 Use insight to predict future values
This training course is for you because...
You are a:
 Programmer, interested in data science but with little or no statistical or mathematical background.
 Manager who wants to summarize data sets.
 Someone who uses data, but doesn’t have the necessary training to analyze and summarize it.
Prerequisites
No experience with R is necessary, but participants are expected to understand basic programming in another language, e.g. Python, MATLAB, C, or Java.
The course will be taught using R, but the focus is on the methods, rather than programming.
About your instructor

Colin Gillespie is a Senior Lecturer in Statistics at Newcastle University, UK, and the coauthor of Efficient R Programming (O’Reilly). His research interests are high-performance statistical computing and Bayesian statistics. He is regularly employed as a consultant by Jumping Rivers and has been teaching R since 2005 at a variety of levels, ranging from beginning to advanced programming.
Schedule
The timeframes are only estimates and may vary according to how the class is progressing.
DAY 1
Introduction and course overview (20 minutes)
 Introduction
 Course overview
Condensing data with numerical summaries (90 minutes)
Measures of location
 Mean, median, mode
 Example
 Exercise / Q&A (25 minutes)
Measures of spread
 Variance, standard deviation, quartiles, range
 Example
 Exercise / Q&A (25 minutes)
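As a rough illustration of the location and spread measures listed above, the following R sketch computes each one on a small made-up vector. The data and variable names are assumptions chosen for illustration; they are not taken from the course material.

```r
# Hypothetical sample: nine typical values plus one large outlier
x <- c(2, 3, 3, 4, 5, 5, 5, 6, 7, 60)

# Measures of location
x_mean   <- mean(x)    # pulled upwards by the outlier
x_median <- median(x)  # much less sensitive to the outlier

# Base R has no built-in statistical mode, so tabulate the values
counts <- table(x)
x_mode <- as.numeric(names(counts)[which.max(counts)])

# Measures of spread
x_var   <- var(x)                            # sample variance (divides by n - 1)
x_sd    <- sd(x)                             # square root of the variance
x_quart <- quantile(x, c(0.25, 0.5, 0.75))   # quartiles
x_range <- range(x)                          # smallest and largest value
```

Comparing `x_mean` with `x_median` on data like this shows why the choice of summary matters when outliers are present.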
Streaming data
 Mean vs median
 Variance vs quartiles
 Example
 Exercise / Q&A (10 minutes)
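One plausible reading of the streaming comparison above is that the mean and variance can be updated one observation at a time in constant memory, whereas exact medians and quartiles need the full data set. The sketch below uses Welford's online update for the mean and variance; the function name `update_stats` and the sample values are hypothetical and not part of the course material.

```r
# Welford-style running mean and variance: constant memory per update
update_stats <- function(state, x) {
  n     <- state$n + 1
  delta <- x - state$mean
  mean  <- state$mean + delta / n
  m2    <- state$m2 + delta * (x - mean)  # running sum of squared deviations
  list(n = n, mean = mean, m2 = m2)
}

stream <- c(4, 7, 13, 16)  # hypothetical values arriving one at a time
state  <- list(n = 0, mean = 0, m2 = 0)
for (x in stream) {
  state <- update_stats(state, x)
}

running_mean <- state$mean
running_var  <- state$m2 / (state$n - 1)  # sample variance
```

The results agree with `mean(stream)` and `var(stream)`, but no past values need to be stored.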
Break (5 min)
What, why and how of visualisation (90 minutes)
Scatter plot
Colors
 Number of points: should you summarize?
 Transparency
 Log scales
 Examples
 Exercise/Q&A (25 minutes)
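The scatter plot topics above (many points, transparency, log scales) can be sketched in base R as follows. The simulated data, the colour choice, and the helper name `plot_dense` are assumptions made for illustration only.

```r
set.seed(42)
n <- 5000
x <- rlnorm(n)                    # hypothetical right-skewed, positive data
y <- x * rlnorm(n, sdlog = 0.5)

# Semi-transparent colour so that dense regions appear darker
pt_col <- adjustcolor("steelblue", alpha.f = 0.2)

# Scatter plot with both axes on a log scale
plot_dense <- function(x, y) {
  plot(x, y, log = "xy", pch = 16, col = pt_col,
       xlab = "x (log scale)", ylab = "y (log scale)")
}
```

Calling `plot_dense(x, y)` draws the plot; with many overlapping points, transparency reveals where the data are concentrated.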
Histogram
 How to determine the number of bins
 Examples
 Barplot
 Ordinal data
 Examples
 Boxplot
 Great for comparison
 Examples
 Exercises/Q&A (25 minutes)
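For the histogram topic above, base R ships several rules of thumb for choosing the number of bins. The following is a minimal sketch on simulated data (an assumption, not course data) comparing two of them.

```r
set.seed(1)
x <- rnorm(200)  # hypothetical sample

# Two common rules for the number of bins
bins_sturges <- nclass.Sturges(x)  # ceiling(log2(n) + 1)
bins_fd      <- nclass.FD(x)       # Freedman-Diaconis, based on the IQR

# hist() accepts a rule name; plot = FALSE returns the binning only
h <- hist(x, breaks = "FD", plot = FALSE)
```

Note that `hist()` treats the requested number of bins as a suggestion and may adjust it to obtain tidy break points.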
Wrap up
DAY 2
The normal distribution: what’s the point? (30 minutes)
 Where does the normal distribution come from?
 Shape: the famous bell-shaped curve
 Key parameters
 The 2 standard deviations rule
 Scaling data
 (Data - mean)/sd
 Example
 Exercise/Q&A (10 minutes)
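The scaling step and the two standard deviations rule listed above can be checked numerically. This is a sketch on simulated data; the mean of 170 and standard deviation of 10 are arbitrary assumptions.

```r
set.seed(1)
x <- rnorm(10000, mean = 170, sd = 10)  # hypothetical measurements

# Scaling: subtract the mean, then divide by the standard deviation
z <- (x - mean(x)) / sd(x)

# scale() performs the same transformation (it returns a matrix)
z2 <- as.numeric(scale(x))

# Proportion of observations within 2 standard deviations of the mean
within_2sd <- mean(abs(z) < 2)

# Theoretical value for a normal distribution, roughly 0.954
theoretical <- pnorm(2) - pnorm(-2)
```

After scaling, `z` has mean 0 and standard deviation 1, and `within_2sd` should be close to `theoretical`.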
Break (5 min)
How to compare groups (60 minutes)
 The t-test
 The t-distribution
 Assumptions: normality, independence
 Example:
 OK Cupid data: are the “daters’” heights different from those of the standard population?
 The central limit theorem (basically, don’t worry about normality too much if your data set is big enough)
 Confidence intervals:
 Standard error vs standard deviation
 Example
 Exercise/Q&A
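To make the link between the standard error, the t-statistic, and the confidence interval concrete, here is a sketch of a one-sample t-test computed by hand and with `t.test()`. The heights are simulated and the reference mean `mu0` is an arbitrary assumption; neither comes from the OK Cupid data.

```r
set.seed(2)
heights <- rnorm(50, mean = 178, sd = 7)  # hypothetical sample, in cm
mu0     <- 175                            # hypothetical population mean

n  <- length(heights)
se <- sd(heights) / sqrt(n)  # standard error of the mean, not the sd

# t-statistic: distance from mu0 measured in standard errors
t_stat <- (mean(heights) - mu0) / se

# 95% confidence interval based on the t-distribution
ci <- mean(heights) + c(-1, 1) * qt(0.975, df = n - 1) * se

# The built-in test reports the same quantities
fit <- t.test(heights, mu = mu0)
```

Here `fit$statistic` matches `t_stat` and `fit$conf.int` matches `ci`. The standard deviation describes the spread of individual heights, while the standard error describes the uncertainty in the sample mean.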
Break (5 min)
Capturing relationships with linear regression (90 minutes)
 Correlation: linear relationship between two variables
 Examples
 Exercise/Q&A
 Simple linear regression
 Assumptions
 Residuals: observed - expected
 Examples
 Exercise/Q&A
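The correlation and simple linear regression topics above can be sketched with R's built-in `cars` data set (speed and stopping distance). The choice of data set and the new speed of 21 are assumptions made for illustration; the course may use different examples.

```r
# Correlation: strength of the linear relationship between two variables
r <- cor(cars$speed, cars$dist)

# Simple linear regression: dist = intercept + slope * speed + error
fit <- lm(dist ~ speed, data = cars)

# Residuals: observed minus expected (fitted) values
res <- cars$dist - fitted(fit)

# Predict the stopping distance at a new, hypothetical speed
pred <- predict(fit, newdata = data.frame(speed = 21))
```

Plotting `res` against `fitted(fit)` is a common way to check the regression assumptions, since any visible pattern suggests the straight-line model is inadequate.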
Wrap up (5 min)