In this example, we use the KDD Cup '99 dataset (provided by Scikit-Learn), which contains logs generated by an intrusion detection system exposed to both normal and malicious network activity. We focus only on the smtp sub-dataset, which is the smallest one, because, as explained before, the training process can be very long. This dataset is not particularly complex and can be classified successfully with simpler methods; the example is purely didactic and is meant to show how to work with this kind of data.
The first step is to load the dataset, encode the labels (which are strings), and standardize the values:
from sklearn.datasets import fetch_kddcup99 ...
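Since the listing above is abbreviated, the three steps (loading, label encoding, and standardization) can be sketched as follows. This is a minimal sketch rather than the original code: the helper names `preprocess` and `load_smtp` are hypothetical, the `random_state` value is arbitrary, and the cast to `float64` assumes the raw feature array may be returned with an object dtype.

```python
import numpy as np
from sklearn.datasets import fetch_kddcup99
from sklearn.preprocessing import LabelEncoder, StandardScaler


def preprocess(X_raw, y_raw):
    """Encode string labels as integers and standardize the features.

    Hypothetical helper, introduced here only for illustration.
    """
    # The raw feature array may have object dtype, so cast it to floats first.
    X = np.asarray(X_raw).astype(np.float64)

    # Map each distinct label to an integer index (0, 1, ...).
    encoder = LabelEncoder()
    y = encoder.fit_transform(np.asarray(y_raw))

    # Rescale every feature to zero mean and unit variance.
    scaler = StandardScaler()
    X = scaler.fit_transform(X)
    return X, y, encoder


def load_smtp(random_state=1000):
    """Fetch the smtp subset (downloaded on first use) and preprocess it.

    Hypothetical helper; the random_state value is an arbitrary assumption.
    """
    data = fetch_kddcup99(subset="smtp", shuffle=True, random_state=random_state)
    return preprocess(data.data, data.target)
```

Calling `X, y, encoder = load_smtp()` downloads the data on first use; afterwards, `encoder.classes_` holds the original labels, so the integer codes in `y` can be mapped back when inspecting predictions.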