Chapter 9. Modeling Data

In this chapter, we’ll perform the fourth step of the OSEMN model (and the last step to require a computer): modeling data. Generally speaking, to model data is to create an abstract or higher-level description of your data. As with creating visualizations, it means taking a step back from the individual data points.

Visualizations, on the one hand, are characterized by shapes, positions, and colors such that we can interpret them by looking at them. Models, on the other hand, are internally characterized by a bunch of numbers, which means that computers can use them, for example, to make predictions about new data points. (We can still visualize models so that we can try to understand them and see how they are performing.)

In this chapter, we’ll consider four common types of algorithms to model data:

Dimensionality reduction
Clustering
Regression
Classification

These four types of algorithms come from the field of machine learning. As such, we’re going to change our vocabulary a bit. Let’s assume that we have a CSV file, also known as a data set. Each row, except for the header, is considered to be a data point. For simplicity, we assume that each column that contains numerical values is an input feature. If a data point also contains a nonnumerical field, such as the species column in the Iris data set, then that field is known as the data point’s label.
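
To make this vocabulary concrete, here is a minimal Python sketch that splits a small CSV into input features and labels using the rule described above: numerical columns become features, and nonnumerical columns become labels. The inline sample, the column names, and the helper names (`is_number`, `split_features_and_labels`) are assumptions made for illustration; they are not taken from the book.

```python
import csv
import io

# A tiny inline sample in the style of the Iris data set.
# The column names are assumed for illustration.
SAMPLE_CSV = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
"""


def is_number(value):
    """Return True if the string can be parsed as a float."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def split_features_and_labels(csv_text):
    """Treat all-numerical columns as features and the rest as labels."""
    # Every row after the header is one data point.
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    columns = list(rows[0].keys())
    feature_columns = [c for c in columns if all(is_number(r[c]) for r in rows)]
    label_columns = [c for c in columns if c not in feature_columns]
    features = [[float(r[c]) for c in feature_columns] for r in rows]
    labels = [[r[c] for c in label_columns] for r in rows]
    return feature_columns, label_columns, features, labels


feature_columns, label_columns, features, labels = split_features_and_labels(SAMPLE_CSV)
print(len(features), "data points")
print("input features:", feature_columns)
print("label:", label_columns)
```

The header row only supplies column names, so it is not counted as a data point. Classifying a column by whether every value parses as a number is the same simplification the text makes; real data sets may contain numerical columns that are really identifiers or categories.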

The first two types of algorithms (dimensionality reduction and clustering) are most often ...

Data Science at the Command Line by Jeroen Janssens
