Functions for Big Data Sets

If you’re working with a very large data set, you may not have enough memory to use the standard regression functions. Luckily, the biglm package (an add-on package available from CRAN) provides alternatives to lm and glm that fit models in bounded memory by processing the data in chunks. These functions are slower than the standard regression functions, but they will work when the data is too large for lm or glm to handle in memory:

library(biglm)
# substitute for lm; works on data frames
biglm(formula, data, weights=NULL, sandwich=FALSE)
# substitute for glm; works on data frames
bigglm(formula, data, family=gaussian(),
       weights=NULL, sandwich=FALSE, maxit=8, tolerance=1e-7,
       start=NULL, quiet=FALSE, ...)
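
As a rough illustration of the chunked approach, the sketch below fits a model with biglm on one piece of a data frame and then folds in the remaining pieces with update. The built-in trees data set, the three-way split, and the variable names are assumptions chosen for illustration; in practice each chunk would more likely be read from a file too large to load at once.

```r
library(biglm)

# Built-in data set, split into three chunks to mimic data that
# arrives in pieces (for example, read from a large file in blocks).
data(trees)
model_formula <- log(Volume) ~ log(Girth) + log(Height)
chunk1 <- trees[1:10, ]
chunk2 <- trees[11:20, ]
chunk3 <- trees[21:31, ]

# Fit on the first chunk, then fold in the remaining chunks.
fit <- biglm(model_formula, data = chunk1)
fit <- update(fit, chunk2)
fit <- update(fit, chunk3)

# Reference fit with lm on the full data, for comparison only.
full_fit <- lm(model_formula, data = trees)

print(coef(fit))
print(coef(full_fit))
```

The two sets of coefficients should agree up to numerical precision, since the chunked fit has seen the same rows as the lm fit.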

It’s even possible to use bigglm on a table stored in a database. To do this, open a database connection using RODBC or RSQLite, then call bigglm with the data argument set to the connection and the tablename argument naming the table on which to evaluate the formula:

bigglm(formula, data, family=gaussian(),
       tablename, ..., chunksize=5000)
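
A minimal sketch of the database workflow follows, assuming the DBI and RSQLite packages are installed. The in-memory SQLite database, the table name "cars", and the use of the built-in mtcars data are all hypothetical choices made for the example; a real application would connect to an existing database instead.

```r
library(biglm)
library(DBI)
library(RSQLite)

# Hypothetical setup: an in-memory SQLite database holding the built-in
# mtcars data in a table named "cars".
con <- dbConnect(SQLite(), ":memory:")
dbWriteTable(con, "cars", mtcars)

# Fit the model against the table, fetching 10 rows at a time.
# The small chunksize is only there to force several chunks on this
# tiny data set.
db_fit <- bigglm(mpg ~ wt + hp, data = con, family = gaussian(),
                 tablename = "cars", chunksize = 10)
print(coef(db_fit))

dbDisconnect(con)
```

With a real table, a larger chunksize (such as the default of 5000) would normally be a better trade-off between memory use and the number of fetches.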
