Now that we have deployed working models that predict flight delays, let's “make believe” that user feedback has shown our prediction to be useful, and that it is valuable enough for its quality to matter. In that case, it is time to iteratively improve the quality of our prediction. If a prediction is valuable enough, this becomes a full-time job for one or more people.
In this chapter we will tune our Spark ML classifier and also do additional feature engineering to improve prediction quality. In doing so, we will show you how to iteratively improve predictions.
Code examples for this chapter are available at Agile_Data_Code_2/ch09. Clone the repository and follow along!
git clone https://github.com/rjurney/Agile_Data_Code_2.git
The notebook itself is very long, and we tried many things to fix our model. It turned out we had made a mistake. We were using OneHotEncoder on top of the output of StringIndexerModel when we were encoding our nominal/categorical string features. This is how you should encode features for models other than decision trees, but it turns out that for decision tree models, you are supposed to take the string indexes from StringIndexerModel and directly compose them with ...