Subsets

Often, you’ll be provided with too much data. For example, suppose that you were working with patient records at a hospital. You might want to analyze healthcare records for patients between 5 and 13 years of age who were treated for asthma during the past 3 years. To do this, you need to take a subset of the data rather than examine the whole database.

Other times, you might have more relevant data than you can use. For example, suppose that you were looking at a logistics operation that fills billions of orders every year. R holds the data it works on in memory, so it might not be able to load the entire database. In most cases, you can get statistically significant results from a tiny fraction of the data; even millions of orders might be more than you need.
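
As a rough illustration of working with a fraction of the data, the sketch below draws a random sample of rows from a data frame. The orders data frame, its columns, and the sample size are all made up for this example; in practice the rows would come from the real database.

```r
# Hypothetical data: one row per order (names and values are illustrative only)
set.seed(42)  # fix the random seed so the sample is reproducible
orders <- data.frame(
  order.id = 1:100000,
  amount   = runif(100000, min = 1, max = 500)
)

# Draw 1,000 distinct row numbers at random (sampling without replacement)
sampled.rows <- sample(nrow(orders), size = 1000)

# Keep only the sampled rows; the empty slot after the comma keeps all columns
orders.sample <- orders[sampled.rows, ]

nrow(orders.sample)
```

Because sample() draws without replacement by default, no order appears twice in orders.sample.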

Bracket Notation

One way to take a subset of a data set is to use the bracket notation. As you may recall, you can select rows in a data frame by providing a vector of logical values. If you can write a simple expression describing the set of rows to select from a data frame, you can provide this as an index.

For example, suppose that we wanted to select only batting data from 2008. The column batting.w.names$yearID contains the year associated with each row, so we could calculate a vector of logical values describing which rows to keep with the expression batting.w.names$yearID==2008. Now, we just have to index the data frame batting.w.names with this vector to select only rows for the year 2008:

> batting.w.names.2008 <- batting.w.names[batting.w.names$yearID==2008,] ...
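
To make the two steps easier to see, here is a minimal sketch that uses a small stand-in data frame. The name batting.toy and all of its values are invented for illustration; only the yearID column name comes from the text above.

```r
# Hypothetical stand-in for batting.w.names (all values are made up)
batting.toy <- data.frame(
  playerID = c("a01", "b02", "c03", "d04"),
  yearID   = c(2007, 2008, 2008, 2006),
  HR       = c(12, 30, 5, 18)
)

# Step 1: build a logical vector with one element per row.
# Here only rows 2 and 3 have yearID equal to 2008.
keep <- batting.toy$yearID == 2008
keep

# Step 2: index the rows with that vector.
# The empty slot after the comma means "keep every column".
batting.toy.2008 <- batting.toy[keep, ]
batting.toy.2008
```

Storing the logical vector in its own variable is optional, but it makes it easy to check how many rows will be selected (for example, with sum(keep)) before taking the subset.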


R in a Nutshell by Joseph Adler
