Binning Data

Another common data transformation is to group a set of observations into bins based on the value of a specific variable. For example, suppose that you had some time series data where time was measured in days, but you wanted to summarize the data by month. There are several functions available for binning numeric data in R.

Shingles

We briefly mentioned shingles earlier, in the section "Shingles." Shingles are a way to represent intervals in R. Unlike the levels of a factor, the intervals can overlap, like roof shingles (hence the name). They are used extensively in the lattice package when you want to use a numeric variable as a conditioning variable.

To create shingles in R, use the shingle function:

shingle(x, intervals=sort(unique(x)))

To specify where to separate the bins, use the intervals argument. You can use a numeric vector to indicate the breaks or a two-column matrix, where each row represents a specific interval.
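As a minimal sketch of the matrix form, the following builds a shingle with three overlapping intervals. The data vector and the interval endpoints are made up for illustration and are not from the original text:

```r
library(lattice)

# Illustrative data: ten numeric observations (assumed values)
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

# Each row of the matrix is one interval: (lower bound, upper bound).
# The intervals deliberately overlap, which a factor could not represent.
iv <- rbind(c(0, 4),
            c(3, 7),
            c(6, 10))

s <- shingle(x, intervals = iv)

# Inspect the intervals attached to the shingle
levels(s)
```

Because the intervals overlap, an observation such as 3.5 would fall into both the first and the second interval.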

To create shingles where the number of observations is approximately the same in each bin, you can use the equal.count function, which passes its additional arguments to co.intervals:

equal.count(x, ...)
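A short sketch of equal.count follows. The number and overlap arguments shown here belong to co.intervals; the simulated data and the choice of four non-overlapping bins are assumptions made only for this example:

```r
library(lattice)

# Illustrative data: 100 simulated values (assumed, not from the text)
set.seed(1)
x <- rnorm(100)

# Ask for 4 intervals with no overlap between neighbors.
# Without overlap = 0, co.intervals defaults to overlapping intervals.
s <- equal.count(x, number = 4, overlap = 0)

# Printing the shingle reports each interval and how many
# observations fall inside it
print(s, showValues = FALSE)
```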

Cut

The function cut is useful for taking a continuous variable and splitting it into discrete pieces. Here is the default form of cut for use with numeric vectors:

# numeric form
cut(x, breaks, labels = NULL,
    include.lowest = FALSE, right = TRUE, dig.lab = 3,
    ordered_result = FALSE, ...)
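The following sketch shows how breaks, labels, and right interact. The ages vector and the group labels are hypothetical values chosen for illustration:

```r
# Illustrative data (assumed values)
ages <- c(3, 17, 18, 42, 65, 90)

# With the default right = TRUE, intervals are closed on the right:
# (0,18], (18,65], (65,Inf]
groups <- cut(ages,
              breaks = c(0, 18, 65, Inf),
              labels = c("child", "adult", "senior"))

# With right = FALSE, intervals are closed on the left instead:
# [0,18), [18,65), [65,Inf)
groups_left <- cut(ages,
                   breaks = c(0, 18, 65, Inf),
                   labels = c("child", "adult", "senior"),
                   right = FALSE)

# Compare how the boundary values 18 and 65 are assigned
data.frame(ages, groups, groups_left)
```

The result is a factor, so the boundary values 18 and 65 land in different groups depending on which side of each interval is closed.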

There is also a version of cut for manipulating Date objects:

# Date form
cut(x, breaks, labels = NULL, start.on.monday = TRUE,
    right = FALSE, ...)
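For the Date form, breaks can also be a time unit given as a string, such as "month". This sketch returns to the earlier motivation of summarizing daily data by month; the dates are invented for illustration:

```r
# Illustrative daily observations (assumed dates)
dates <- as.Date(c("2009-01-05", "2009-01-28",
                   "2009-02-14",
                   "2009-03-01", "2009-03-10"))

# Group each date into the calendar month that contains it.
# Each level is labeled with the first day of its month.
months <- cut(dates, breaks = "month")

# Count the observations per month
table(months)
```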

The cut function takes a numeric vector as input and ...
