  • Jocelyn Tong thinks this is interesting:

As you can see, the test set generated using stratified sampling has income category proportions almost identical to those in the full dataset, whereas the test set generated using purely random sampling is quite skewed.


Cover of Hands-On Machine Learning with Scikit-Learn and TensorFlow


But what if several features show a similar pattern, each with some extreme values? How can we do stratified sampling that takes all of these features into account at the same time?
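One possible answer, sketched below, is to bin each feature into a few categories and combine the bins into a single stratum label, then stratify on that label. This is only an illustration under assumptions: the data is synthetic, the column names (`income`, `house_age`, `stratum`) and the bin counts are made up for the example, and it is not taken from the book.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 10_000

# Synthetic, right-skewed features (hypothetical column names).
df = pd.DataFrame({
    "income": rng.lognormal(mean=1.0, sigma=0.6, size=n),
    "house_age": rng.exponential(scale=20.0, size=n),
})

# Bin each feature into a small number of quantile-based categories.
df["income_cat"] = pd.qcut(df["income"], q=4, labels=False)
df["age_cat"] = pd.qcut(df["house_age"], q=3, labels=False)

# Combine the per-feature bins into one stratum label (4 x 3 = 12 strata).
df["stratum"] = df["income_cat"].astype(str) + "_" + df["age_cat"].astype(str)

# Stratify the split on the combined label.
train, test = train_test_split(
    df, test_size=0.2, stratify=df["stratum"], random_state=42
)

# Compare stratum proportions in the full data and in the test set.
full_props = df["stratum"].value_counts(normalize=True)
test_props = test["stratum"].value_counts(normalize=True)
max_gap = (full_props - test_props).abs().max()
print(f"largest proportion gap across strata: {max_gap:.4f}")
```

The number of strata is the product of the bin counts, so it grows quickly as features are added. With many features or many bins, some strata end up with very few rows, and `train_test_split` raises an error when a stratum has fewer than two members. Keeping the bins coarse, or stratifying only on the one or two most important features, avoids that.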

Normally, after drawing histograms of the numerical variables, the first thing that comes to mind when we see skewed variables is to transform them so that they are approximately normally distributed.
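As a small illustration of that kind of transformation, the sketch below applies a log transform to a synthetic right-skewed variable and compares the sample skewness before and after. The variable name `income`, the log-normal distribution, and the `skewness` helper are assumptions made for the example.

```python
import numpy as np

def skewness(x):
    """Sample skewness: third central moment over the cubed standard deviation."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5

rng = np.random.default_rng(0)

# Synthetic, strictly positive, right-skewed variable (assumed log-normal).
income = rng.lognormal(mean=1.0, sigma=0.8, size=50_000)

# The log of a log-normal variable is normal, so the skew should nearly vanish.
log_income = np.log(income)

raw_skew = skewness(income)
log_skew = skewness(log_income)
print(f"skewness before: {raw_skew:.2f}, after log transform: {log_skew:.2f}")
```

Real data is rarely exactly log-normal, so the result is usually only approximately symmetric. For variables that contain zeros, `np.log1p` is a common substitute for `np.log`.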

I am not sure when we should use stratified sampling. Since we usually have really large datasets, maybe stratified sampling is not that practical.
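One rough way to explore this question is to simulate purely random splits at two dataset sizes and measure how far the test set's category proportions drift from those of the full data. The category shares, the sizes, and the helper name `mean_random_split_gap` below are all assumptions chosen for the illustration.

```python
import numpy as np

def mean_random_split_gap(n_rows, test_frac=0.2, n_trials=200, seed=0):
    """Average, over many random splits, of the largest gap between
    category proportions in the full data and in the test set."""
    rng = np.random.default_rng(seed)
    probs = np.array([0.05, 0.30, 0.35, 0.20, 0.10])  # assumed category shares
    n_test = int(n_rows * test_frac)
    gaps = []
    for _ in range(n_trials):
        # Draw a category for every row, then pick test rows purely at random.
        cats = rng.choice(len(probs), size=n_rows, p=probs)
        test_idx = rng.choice(n_rows, size=n_test, replace=False)
        full = np.bincount(cats, minlength=len(probs)) / n_rows
        test = np.bincount(cats[test_idx], minlength=len(probs)) / n_test
        gaps.append(np.abs(full - test).max())
    return float(np.mean(gaps))

small_gap = mean_random_split_gap(500)
large_gap = mean_random_split_gap(50_000)
print(f"mean largest gap, 500 rows:    {small_gap:.4f}")
print(f"mean largest gap, 50,000 rows: {large_gap:.4f}")
```

In this simulation the gap shrinks as the dataset grows, which suggests stratification matters most for small datasets or rare categories. Stratifying a large dataset is not expensive, though, so size alone does not make it impractical.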