Now that we have deployed working models predicting flight delays, let us ‘make believe’ that user feedback has shown our prediction to be useful, and valuable enough that its quality matters. When that is the case, it is time to iteratively improve the quality of the prediction. If a prediction is valuable enough, this becomes a full-time job for one or more people.
In this chapter we will tune our Spark ML classifier and also do additional feature engineering to improve prediction quality. In doing so, we will show you how to iteratively improve predictions.
Code examples for this chapter are available at https://github.com/rjurney/Agile_Data_Code_2/tree/master/ch09. Clone the repository and follow along!
git clone https://github.com/rjurney/Agile_Data_Code_2.git
At this point we realized that our model was always predicting one class, no matter the
input. We began by investigating that in a Jupyter Notebook at
ch09/Debugging Prediction Problems.ipynb.
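As a minimal sketch of how this symptom can be confirmed, the snippet below tallies the predicted labels and flags the case where only one class ever appears. It assumes the predictions have already been collected into a plain Python list; the function names and the sample values are hypothetical and are not taken from the notebook. On a Spark DataFrame of predictions, grouping by the prediction column and counting would give the same picture.

```python
from collections import Counter


def prediction_distribution(predictions):
    """Return the share of each predicted class label."""
    counts = Counter(predictions)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def is_degenerate(predictions):
    """Return True when every prediction falls in a single class."""
    return len(set(predictions)) == 1


# Hypothetical predictions, for illustration only
sample = [1.0, 1.0, 1.0, 1.0]
print(prediction_distribution(sample))
print(is_degenerate(sample))
```

A distribution with a single label is a strong hint that something is wrong upstream of the classifier, such as the feature encoding, rather than with the tuning of the classifier itself.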
The notebook itself is very long, and we tried many things to fix our model. It turned out
we had made a mistake: we were applying OneHotEncoder to the output of StringIndexerModel
when encoding our nominal/categorical string features. This is how you should encode features
for models other than decision trees, but for decision tree models you are supposed to take
the string indexes from