Chapter 3. Sampling Statistics and Model Training in R

Sampling and machine learning go hand in hand. In machine learning, we typically begin with a large dataset that we want to use to predict something. We usually split this data into a training set, build a model on it, and then apply the fully trained model to a test set to see how it performs. In some instances, it might be very difficult to run a machine learning model on an entire dataset, whereas we might achieve comparable accuracy by training on a small sample of it and testing when appropriate. This could be due to the size of the data, for example.
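To make this concrete, here is a minimal sketch of a random train/test split in base R. The built-in iris dataset, the 70/30 split ratio, and the linear model are illustrative assumptions, not something prescribed by the text.

```
# A minimal train/test split sketch in base R.
# The iris dataset and the 70/30 ratio are illustrative assumptions.
set.seed(123)                                  # make the random split reproducible

n <- nrow(iris)
train_idx <- sample(n, size = round(0.7 * n))  # randomly pick 70% of row indices

train_set <- iris[train_idx, ]                 # rows used to fit the model
test_set  <- iris[-train_idx, ]                # held-out rows used to evaluate it

# Fit a simple model on the training set only ...
model <- lm(Sepal.Length ~ Sepal.Width + Petal.Length, data = train_set)

# ... then see how it behaves on data it has never seen.
predictions <- predict(model, newdata = test_set)
head(predictions)
```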

First, let’s define some statistical terms. A population is the entire collection (or universe) of things under consideration. A sample is a portion of the population that we select for analysis. For example, we could start with a full dataset, break off a chunk as a sample, and do our training there. Another way to look at it is that the data we are given to start with might itself be only a sample of a much broader dataset.
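As a quick sketch of the population/sample distinction in R, consider the toy example below; the simulated heights, population size, and sample size are made up purely for illustration.

```
# A toy "population": heights (in cm) of one million individuals.
# The values are simulated purely for illustration.
set.seed(42)
population <- rnorm(1e6, mean = 170, sd = 10)

# Draw a sample of 1,000 observations from that population.
my_sample <- sample(population, size = 1000)

# The sample mean should sit close to the population mean.
mean(population)
mean(my_sample)
```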

Polling data is an example of sampling, and is typically gathered by asking questions of people from specific demographics. By design, polling data can cover only a subset of a country's general population, because it would be quite an achievement to ask everyone in a country what their favorite color might be. If we have a country with a population of 100 million and we conduct a poll that has 30 million ...
