Learning Path: R Programming for Data Analysts

Video Description

15+ Hours of Video Instruction

R Programming Data Analyst Learning Path is a tour through the most important parts of R, the statistical programming language, from the very basics to complex modeling. It covers reading data, programming basics, visualization, data munging, regression, classification, clustering, modern machine learning, network analysis, web graphics, and techniques for dealing with large data, both in memory and in databases.


This 15-hour video teaches you how to program in R even if you are unfamiliar with statistical techniques. It starts with the basics of using R and progresses into data manipulation and model building. Users learn through hands-on practice with the code and techniques. New material covers chaining commands, faster data manipulation, new ways to read rectangular data into R, testing code, and the hot package Shiny.

Based on a course on R and Big Data taught by the author at Columbia

  • Designed from the ground up to help viewers quickly overcome R’s learning curve
  • Packed with hands-on practice opportunities and realistic, downloadable code examples
  • Presented by an author with unsurpassed experience teaching statistical programming and modeling to novices
  • For every potential R user: programmers, data scientists, DBAs, marketers, quants, scientists, policymakers, and many others

About the Instructor

Jared P. Lander is the Chief Data Scientist of Lander Analytics, the organizer of the New York Open Statistical Programming Meetup (formerly the R Meetup), and an adjunct professor of Statistics at Columbia University. With a master's from Columbia University in statistics and a bachelor's from Muhlenberg College in mathematics, he has experience in both academic research and industry. He specializes in data management, multilevel models, machine learning, generalized linear models, visualization, and statistical computing. He is the author of R for Everyone, a book about R programming geared toward data scientists and non-statisticians alike. Very active in the data community, Jared is a frequent speaker at conferences, universities, and meetups around the world. He is a member of the Strata New York selection committee.

Skill Level

  • Beginner
  • Intermediate
  • Advanced

What You Will Learn

  • Installing R
  • Basic math
  • Working with variables and different data types
  • Matrix algebra
  • data.frames
  • Reading data
  • Data aggregation and manipulation
  • plyr
  • dplyr
  • Making statistical graphs
  • Manipulate text
  • Automatically generate reports and slideshows
  • Display data with popular JavaScript libraries
  • Build Shiny dashboards
  • Build R packages
  • Incorporate C++ for faster code
  • Basic statistics
  • Linear models
  • Generalized linear models
  • Model validation
  • Decision trees
  • Random forests
  • Bootstrap
  • Time series analysis
  • Clustering
  • Network analysis
  • Automatic parameter tuning
  • Bayesian regression using Stan

Who Should Take This Course

Part 1 of the lessons is geared toward people who are new to either R or programming in general.
Part 2 is for R programmers who already have an intermediate level of knowledge, such as that gained from reading R for Everyone or from viewing Part 1.

Course Requirements

  • Basic Programming Skills

Table of Contents

Part 1: R as a Tool

Lesson 1. Getting Started with R
R can only be used after installation, which fortunately is just as simple as installing any other program. In this lesson, you learn where to download R, how to decide on the best version, and how to install it, and you get familiar with its environment, using RStudio as a front end. We also take a look at the package system.

Lesson 2. The Basic Building Blocks in R
R is a flexible and robust programming language, and using it requires understanding how it handles data. We learn about performing basic math in R, storing various types of data (numeric, integer, character, and time-based) in variables, and calling functions on the data.

Lesson 3. Advanced Data Structures in R
Like many other languages, R offers more complex storage mechanisms such as vectors, arrays, matrices, and lists. We take a look at those and the data.frame, a special storage type that strongly resembles a spreadsheet and is part of what makes working with data in R such a pleasure.

Lesson 4. Reading Data into R
Data is abundant in the world, so analyzing it is just a matter of getting the data into R. There are many ways of doing so, the most common being reading from a CSV file or database. We cover these techniques, and also importing from other statistical tools, scraping websites, and reading Excel files.

Lesson 5. Making Statistical Graphs
Visualizing data is a crucial part of data science, both in the discovery phase and when reporting results. R has long been known for its capability to produce compelling plots, and Hadley Wickham’s ggplot2 package makes it even easier to produce better-looking graphics. We cover histograms, scatterplots, boxplots, line charts, and more, in both base graphics and ggplot2, and then explore the newer packages ggvis and rCharts.

Lesson 6. Basics of Programming
R has all the standard components of a programming language, such as functions, if statements, and loops, all with their own caveats and quirks. We start with the requisite “Hello, World!” function and learn about arguments to functions, the regular if statement and the vectorized version, and how to build loops and why they should be avoided.

Lesson 7. Data Munging
Data scientists often bemoan that 80% of their work is manipulating data. As such, R has many tools for this, which are, contrary to what Python users may say, easy to use. We see how R excels at group operations using apply, lapply, and the plyr package. We also take a look at its facilities for joining, combining, and rearranging data. Then we speed that up with tidyr, data.table, and dplyr.

Lesson 8. In-Depth with dplyr
dplyr has become such an indispensable tool, nearly superseding plyr, that it is worth extra attention. So we examine its select, filter, mutate, group_by, and summarize functions, among others.

Lesson 9. Manipulating Strings
Text data is becoming more pervasive in the world, and fortunately, R provides ways for both combining text and ripping it apart, which we walk through. We also examine R’s extensive regular expression capabilities.

Lesson 10. Reports and Slideshows with knitr
Successfully delivering the results of an analysis can be just as important as the analysis itself, so it is important to communicate them effectively. In this lesson, we learn how to use knitr and rmarkdown to write both static and interactive results in the form of PDF documents, websites, HTML5 slideshows, and even Word documents.

Lesson 11. Include HTML Widgets in HTML Documents
Recent years have seen the advance of JavaScript-powered displays of information, and the htmlwidgets package enables R to take advantage of arbitrary JavaScript libraries. In particular, we look at datatable for a tabular display of data, bokeh for rich web plots, and leaflet for powerful mapping.

Lesson 12. Shiny
Built by RStudio, Shiny is a tool for building interactive data displays and dashboards all within R. This allows the R programmer to convey results in a compelling, user-rich experience in a language he or she is familiar with.

Lesson 13. Package Building
Building packages is a great way to contribute back to the R community, and doing so has never been easier thanks to Hadley Wickham's devtools package. This lesson covers all the requirements for a package and how to go about authoring, testing, and distributing one.

Lesson 14. Rcpp for Faster Code
Sometimes pure R code is not fast enough, and extra speed is required. Rcpp enables R programmers to seamlessly integrate C++ code into their R code. We go over the basics of getting the two languages working together, write some speedy functions in C++, and even integrate C++ into R packages.

Part 2: R for Statistics, Modeling, and Machine Learning

Lesson 15. Basic Statistics
Naturally, R has all the basics when it comes to statistics such as means, variance, correlation, t-tests, and ANOVAs. We look at all the different ways those can be computed.

Lesson 16. Linear Models
The workhorse of statistics is regression and its extensions. This consists of linear models, generalized linear models—including logistic and Poisson regression—and survival models. We look at how to fit these models in R and how to evaluate them using measures such as mean squared error, deviance, and AIC.

Lesson 17. Other Models
Beyond regression, there are many other types of models that can be fit to data. Models covered include regularization with the elastic net, Bayesian shrinkage, nonlinear models such as nonlinear least squares, splines and generalized additive models, decision trees, and random forests.

Lesson 18. Time Series
Special care must be taken with data where there is time-based correlation, otherwise known as autocorrelation. We look at some common methods for dealing with time series such as ARIMA, VAR, and GARCH.

Lesson 19. Clustering
A focal point of modern machine learning is clustering, the partitioning of data into groups. We explore three popular methods: K-means, K-medoids, and hierarchical clustering.

Lesson 20. More Machine Learning
Two areas seeing increasing interest are recommendation engines and text mining, which we illustrate with RecommenderLab, RTextTools, and the irlba package for fast matrix factorization.

Lesson 21. Network Analysis
The world is rich with network data that are nicely studied with graphical models. We show you how to analyze and visualize networks using the igraph package.

Lesson 22. Automatic Parameter Tuning with Caret
Machine learning models often have parameters that need tuning, and the chosen values can significantly affect the quality of the model. The Caret package, by Max Kuhn, makes finding optimal parameter values easy.

Lesson 23. Fit a Bayesian Model with RStan
Bayesian data analysis uses simulations to fit both simple and complex models. Andrew Gelman’s new language, Stan, makes this faster and easier than ever before. We explore this by fitting a simple linear regression and varying-intercept multilevel model.

About LiveLessons Video Training

The LiveLessons Video Training series publishes hundreds of hands-on, expert-led video tutorials covering a wide selection of technology topics designed to teach you the skills you need to succeed. This professional and personal technology video series features world-leading author instructors published by your trusted technology brands: Addison-Wesley, Cisco Press, IBM Press, Pearson IT Certification, Prentice Hall, Sams, and Que. Topics include: IT Certification, Programming, Web Development, Mobile Development, Home and Office Technologies, Business and Management, and more. View all LiveLessons on InformIT at: http://www.informit.com/livelessons.

Table of Contents

  1. Introduction
    1. Part 1: R as a Tool—Introduction 00:03:24
  2. Lesson 1: Getting Started with R
    1. Learning objectives 00:00:28
    2. 1.1 Download and Install R 00:06:23
    3. 1.2 Work in the R Environment 00:18:50
    4. 1.3 Install and load packages 00:04:49
  3. Lesson 2: The Basic Building Blocks in R
    1. Learning objectives 00:00:26
    2. 2.1 Use R as a calculator 00:03:43
    3. 2.2 Work with variables 00:04:11
    4. 2.3 Understand the different data types 00:11:33
    5. 2.4 Store data in vectors 00:16:36
    6. 2.5 Call functions 00:04:03
  4. Lesson 3: Advanced Data Structures in R
    1. Learning objectives 00:00:25
    2. 3.1 Create and access information in data.frames 00:17:20
    3. 3.2 Create and access information in lists 00:10:57
    4. 3.3 Create and access information in matrices 00:08:02
  5. Lesson 4: Reading Data into R
    1. Learning objectives 00:00:26
    2. 4.1 Read a CSV into R 00:05:58
    3. 4.2 Read an Excel Spreadsheet into R 00:04:39
    4. 4.3 Read from databases 00:05:59
    5. 4.4 Read data files from other statistical tools 00:01:17
    6. 4.5 Load binary R files 00:04:40
    7. 4.6 Load data included with R 00:01:49
    8. 4.7 Scrape data from the web 00:02:28
    9. 4.8 Read XML data 00:27:23
  6. Lesson 5: Making Statistical Graphs
    1. Learning objectives 00:00:34
    2. 5.1 Find the diamonds in the data 00:01:13
    3. 5.2 Make histograms with base graphics 00:01:30
    4. 5.3 Make scatterplots with base graphics 00:02:01
    5. 5.4 Make boxplots with base graphics 00:01:39
    6. 5.5 Get familiar with ggplot2 00:02:30
    7. 5.6 Plot histograms and densities with ggplot2 00:03:52
    8. 5.7 Make scatterplots with ggplot2 00:05:12
    9. 5.8 Make boxplots and violin plots with ggplot2 00:04:24
    10. 5.9 Make line plots 00:08:21
    11. 5.10 Create small multiples 00:04:01
    12. 5.11 Control colors and shapes 00:01:19
    13. 5.12 Add themes to graphs 00:02:18
    14. 5.13 Use Web graphics 00:29:48
  7. Lesson 6: Basics of Programming
    1. Learning objectives 00:00:26
    2. 6.1 Write the classic "Hello, World!" example 00:02:05
    3. 6.2 Understand the basics of function arguments 00:10:32
    4. 6.3 Return a value from a function 00:02:47
    5. 6.4 Gain flexibility with do.call 00:03:46
    6. 6.5 Use "if" statements to control program flow 00:02:08
    7. 6.6 Stagger "if" statements with "else" 00:05:33
    8. 6.7 Check multiple statements with switch 00:03:52
    9. 6.8 Run checks on entire vectors 00:05:17
    10. 6.9 Check compound statements 00:05:41
    11. 6.10 Iterate with a for loop 00:06:07
    12. 6.11 Iterate with a while loop 00:01:31
    13. 6.12 Control loops with break and next 00:02:05
  8. Lesson 7: Data Munging
    1. Learning objectives 00:00:34
    2. 7.1 Repeat an operation on a matrix using apply 00:04:46
    3. 7.2 Repeat an operation on a list 00:03:05
    4. 7.3 Apply a function over multiple lists with mapply 00:04:34
    5. 7.4 Perform group summaries with the aggregate function 00:05:27
    6. 7.5 Do group operations with the plyr Package 00:17:18
    7. 7.6 Combine datasets 00:03:51
    8. 7.7 Join datasets 00:05:56
    9. 7.8 Switch storage paradigms 00:05:11
    10. 7.9 Use tidyr 00:02:50
    11. 7.10 Get faster group operations 00:22:03
  9. Lesson 8: In-Depth with dplyr
    1. Learning objectives 00:00:22
    2. 8.1 Use tbl 00:01:49
    3. 8.2 Use select to choose columns 00:03:08
    4. 8.3 Use filter to choose rows 00:03:39
    5. 8.4 Use slice to choose rows 00:01:09
    6. 8.5 Use mutate to change or create columns 00:02:40
    7. 8.6 Use summarize for quick computation on tbl 00:01:35
    8. 8.7 Use group_by to split the data 00:02:35
    9. 8.8 Apply arbitrary functions with do 00:06:50
  10. Lesson 9: Manipulating Strings
    1. Learning objectives 00:00:20
    2. 9.1 Combine strings together 00:07:28
    3. 9.2 Extract text 00:32:01
  11. Lesson 10: Reports and Slideshows with knitr
    1. Learning objectives 00:00:28
    2. 10.1 Understand the basics of LaTeX 00:07:16
    3. 10.2 Weave R code into LaTeX using knitr 00:05:33
    4. 10.3 Understand the basics of Markdown 00:02:45
    5. 10.4 Understand the basics of RMarkdown 00:04:55
    6. 10.5 Weave R code into Markdown using knitr 00:02:53
    7. 10.6 Convert Markdown files to Word 00:01:30
    8. 10.7 Convert Markdown to PDF 00:01:25
    9. 10.8 Create slideshows with RMarkdown 00:03:10
    10. 10.9 Write equations with RMarkdown 00:07:14
  12. Lesson 11: Include HTML Widgets in HTML Documents
    1. Learning objectives 00:00:27
    2. 11.1 Work with datatables of tabular data 00:06:10
    3. 11.2 Use rbokeh 00:08:31
    4. 11.3 Use Leaflet for mapping 00:07:11
  13. Lesson 12: Shiny
    1. Learning objectives 00:00:22
    2. 12.1 Use shiny objects in a markdown document 00:13:57
    3. 12.2 Work with ui.r and server.r files 00:08:21
  14. Lesson 13: Package Building
    1. Learning objectives 00:00:23
    2. 13.1 Understand the folder structure and files in a package 00:05:25
    3. 13.2 Write and document functions 00:07:32
    4. 13.3 Check and build a package 00:02:10
    5. 13.4 Test R code 00:06:58
    6. 13.5 Submit a package to CRAN 00:00:46
  15. Lesson 14: Rcpp for Faster Code
    1. Learning objectives 00:00:29
    2. 14.1 Understand the basics of C++ with R 00:01:47
    3. 14.2 Write a C++ function for R 00:04:35
    4. 14.3 Use Rcpp syntactic sugar 00:05:51
    5. 14.4 Sum in C++ 00:05:35
    6. 14.5 Write a package in R 00:09:37
    7. 14.6 Write a package with C++ code 00:06:05
  16. Summary
    1. Part 1: R as a Tool—Summary 00:01:02
  17. Introduction
    1. Part 2: R for Statistics, Modeling and Machine Learning—Introduction 00:02:14
  18. Lesson 15: Basic Statistics
    1. Learning objectives 00:00:19
    2. 15.1 Draw numbers from probability distributions 00:21:10
    3. 15.2 Calculate averages, standard deviations and correlations 00:16:13
    4. 15.3 Compare samples with t-tests and analysis of variance 00:18:59
  19. Lesson 16: Linear Models
    1. Learning objectives 00:00:28
    2. 16.1 Fit simple linear models 00:10:15
    3. 16.2 Explore the data 00:08:33
    4. 16.3 Fit multiple regression models 00:19:17
    5. 16.4 Fit logistic regression 00:10:06
    6. 16.5 Fit Poisson regression 00:07:05
    7. 16.6 Analyze survival data 00:12:01
    8. 16.7 Assess model quality with residuals 00:05:16
    9. 16.8 Compare models 00:07:18
    10. 16.9 Judge accuracy using cross-validation 00:09:06
    11. 16.10 Estimate uncertainty with the bootstrap 00:06:23
    12. 16.11 Choose variables using stepwise selection 00:02:42
  20. Lesson 17: Other Models
    1. Learning objectives 00:00:27
    2. 17.1 Select variables and improve predictions with the elastic net 00:14:15
    3. 17.2 Decrease uncertainty with weakly informative priors 00:08:53
    4. 17.3 Fit nonlinear least squares 00:05:16
    5. 17.4 Use Splines 00:06:49
    6. 17.5 Use GAMs 00:05:24
    7. 17.6 Fit decision trees to make a random forest 00:06:34
  21. Lesson 18: Time Series
    1. Learning objectives 00:00:20
    2. 18.1 Understand ACF and PACF 00:07:16
    3. 18.2 Fit and assess ARIMA models 00:05:14
    4. 18.3 Use VAR for multivariate time series 00:08:06
    5. 18.4 Use GARCH for better volatility modeling 00:09:24
  22. Lesson 19: Clustering
    1. Learning objectives 00:00:20
    2. 19.1 Partition data with k-means 00:12:26
    3. 19.2 Robustly cluster, even with categorical data, with PAM 00:02:13
    4. 19.3 Perform hierarchical clustering 00:05:38
  23. Lesson 20: More Machine Learning
    1. Learning objectives 00:00:20
    2. 20.1 Build a recommendation engine with RecommenderLab 00:13:13
    3. 20.2 Mine text with RTextTools 00:09:13
    4. 20.3 Perform matrix factorization using irlba 00:04:05
  24. Lesson 21: Network Analysis
    1. Learning objectives 00:00:18
    2. 21.1 Get started with igraph 00:08:16
    3. 21.2 Read edgelists 00:07:12
    4. 21.3 Understand common graph metrics 00:10:12
    5. 21.4 Use centrality measures 00:06:00
    6. 21.5 Utilize more graph operations 00:04:16
  25. Lesson 22: Automatic Parameter Tuning with Caret
    1. Learning objectives 00:00:19
    2. 22.1 Establish optimal tree depth for rpart 00:06:18
    3. 22.2 Choose the best number of trees for a random forest 00:03:35
  26. Lesson 23: Fit a Bayesian Model with RStan
    1. Learning objectives 00:00:25
    2. 23.1 Understand the Stan computing paradigm 00:01:33
    3. 23.2 Fit a simple regression model 00:06:53
    4. 23.3 Fit a multilevel model with Stan 00:06:43
  27. Summary
    1. Part 2: R for Statistics, Modeling and Machine Learning—Summary 00:00:49