Chapter 3. Sampling Statistics and Model Training in R

Sampling and machine learning go hand in hand. In machine learning, we typically begin with a large dataset that we want to use to predict something. We usually split this data into a training set, build a model on it, and then apply the fully trained model to a test set to see how it performs. In some instances, it might be very difficult to run a machine learning model on an entire dataset, whereas we might achieve comparable accuracy by training on a small sample of it and testing when appropriate. This could be due to the size of the data, for example.
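To make this concrete, here is a minimal sketch of a random train/test split in base R. The built-in iris dataset, the 70/30 split ratio, and the linear model are illustrative assumptions, not something prescribed by the text.

```
# A minimal train/test split sketch in base R.
# The iris dataset and the 70/30 ratio are illustrative assumptions.
set.seed(123)                                  # make the random split reproducible

n <- nrow(iris)
train_idx <- sample(n, size = round(0.7 * n))  # randomly pick 70% of row indices

train_set <- iris[train_idx, ]                 # rows used to fit the model
test_set  <- iris[-train_idx, ]                # held-out rows used to evaluate it

# Fit a simple model on the training set only ...
model <- lm(Sepal.Length ~ Sepal.Width + Petal.Length, data = train_set)

# ... then see how it behaves on data it has never seen.
predictions <- predict(model, newdata = test_set)
head(predictions)
```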

First, let’s define some statistical terms. A population is the entire collection (or universe) of things under consideration. A sample is a portion of the population that we select for analysis. For example, we could start with a full dataset, break off a chunk as a sample, and do our training there. Another way to look at it is that the data we are given to start with might itself be only a sample of a much broader dataset.
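As a quick sketch of the population/sample distinction in R, consider the toy example below; the simulated heights, population size, and sample size are made up purely for illustration.

```
# A toy "population": heights (in cm) of one million individuals.
# The values are simulated purely for illustration.
set.seed(42)
population <- rnorm(1e6, mean = 170, sd = 10)

# Draw a sample of 1,000 observations from that population.
my_sample <- sample(population, size = 1000)

# The sample mean should sit close to the population mean.
mean(population)
mean(my_sample)
```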

Polling data is an example of sampling, and is typically gathered by asking questions of people from specific demographics. By design, polling data can cover only a subset of a country's general population, because it would be quite an achievement to ask everyone in a country what their favorite color might be. If we have a country with a population of 100 million and we conduct a poll that has 30 million ...
