Splitting the dataset into training, cross-validation, and testing

To trust a statistical model, we need confidence that it accurately abstracts the phenomenon we are studying. To gain that confidence, we test the model to see how well it performs. We cannot assess its accuracy on the same dataset that we used for training: the model has already seen those observations, so the resulting estimate would be overly optimistic.

In this recipe, you will learn how to quickly split your dataset into two subsets: one used solely to train the model and the other used to test it.
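As a rough illustration of the idea, here is a minimal sketch of a random train/test split using pandas and NumPy. It is not the recipe's own code, which is not shown in this excerpt. The function name split_data, the train_fraction and seed parameters, and the toy columns are all hypothetical and chosen only for this example.

```python
import numpy as np
import pandas as pd


def split_data(data, train_fraction=0.7, seed=42):
    """Randomly split a DataFrame into training and testing subsets.

    Hypothetical helper for illustration; the names and defaults are assumptions.
    """
    rng = np.random.default_rng(seed)
    # Shuffle the row positions so the split does not depend on row order.
    shuffled = rng.permutation(len(data))
    # Number of rows that go to the training subset.
    n_train = int(round(train_fraction * len(data)))
    train = data.iloc[shuffled[:n_train]]
    test = data.iloc[shuffled[n_train:]]
    return train, test


# Toy data standing in for the DataFrame read from the database.
data = pd.DataFrame({
    'feature': np.arange(100, dtype=float),
    'label': np.arange(100) % 2,
})

train, test = split_data(data, train_fraction=0.7)
print(len(train), len(test))
```

Every row ends up in exactly one of the two subsets, so the testing data is never seen during training. Fixing the seed makes the split reproducible between runs.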

Getting ready

To execute this recipe, you will need pandas, SQLAlchemy, and NumPy. No other prerequisites are required.

How to do it…

We read our data from the PostgreSQL database and store it in the data DataFrame. ...
