Often, you’ll be provided with too much data. For example, suppose that you were working with patient records at a hospital. You might want to analyze healthcare records for patients between 5 and 13 years of age who were treated for asthma during the past 3 years. To do this, you need to take a subset of the data rather than examine the whole database.
Other times, you might have too much relevant data. For example, suppose that you were looking at a logistics operation that fills billions of orders every year. R can hold only a certain number of records in memory and might not be able to hold the entire database. In most cases, you can get statistically significant results with a tiny fraction of the data; even millions of orders might be too many.
One way to take a subset of a data set is to use the bracket notation. As you may recall, you can select rows in a data frame by providing a vector of logical values. If you can write a simple expression describing the set of rows to select from a data frame, you can provide this as an index.
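As a minimal sketch of this idea, the following uses a small made-up data frame (the name `orders` and its columns are hypothetical, chosen only for illustration) to show that an explicit logical vector and an expression that evaluates to one select the same rows.

```r
# Hypothetical toy data, for illustration only
orders <- data.frame(
  id     = 1:5,
  region = c("east", "west", "east", "north", "west"),
  amount = c(120, 75, 300, 40, 210)
)

# An explicit logical vector: one TRUE/FALSE value per row
keep <- c(TRUE, FALSE, TRUE, FALSE, FALSE)
by.vector <- orders[keep, ]

# An expression that evaluates to the same logical vector
by.expression <- orders[orders$region == "east", ]

# Both approaches select rows 1 and 3
identical(by.vector, by.expression)
```

Leaving the position after the comma empty keeps every column; only the rows are filtered.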
For example, suppose that we wanted to select only batting data from 2008. The column batting.w.names$yearID contains the year associated with each row, so we could calculate a vector of logical values describing which rows to keep with the expression batting.w.names$yearID==2008. Now we just have to index the data frame batting.w.names with this vector to select only rows for the year 2008:
> batting.w.names.2008 <- batting.w.names[batting.w.names$yearID==2008,] ...
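One caveat is worth sketching with a small stand-in data frame. The name `batting.toy` and its values below are hypothetical, and the real batting.w.names data may contain no missing years. If the column being tested contains NA, the comparison yields NA for that row, and logical indexing then returns a row filled with NA values. Wrapping the test in which() is one way to keep only the rows where the test is TRUE.

```r
# Hypothetical stand-in for batting.w.names, for illustration only
batting.toy <- data.frame(
  playerID = c("a01", "b02", "c03", "d04"),
  yearID   = c(2007, 2008, NA, 2008),
  HR       = c(12, 30, 5, 18)
)

# Logical indexing: the row whose yearID is NA comes back as an all-NA row
with.na <- batting.toy[batting.toy$yearID == 2008, ]
nrow(with.na)

# which() returns only the positions where the test is TRUE, so NAs are dropped
no.na <- batting.toy[which(batting.toy$yearID == 2008), ]
nrow(no.na)
```

Here `with.na` has one more row than `no.na`, because the missing year is carried through as a row of NA values.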