We have seen that four passengers paid a much higher fare than the others. There are several ways to deal with such outliers. One is to bin the variable into a series of ranges: for instance, below 20, 20 to 50, 50 to 100, ..., and over 200. Binning can be done with Amazon ML recipes. Another is to cap the fare at a threshold, such as the 95th percentile. However, we decided that these large fare values were legitimate and that we ought to keep them. We can still create a new variable, log_fare, with a more compact range and a less skewed distribution by taking the log of the fare plus 1 (the offset keeps the log defined when the fare is 0):
select fare, log(fare +1, 2) as log_fare from titanic;
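To compare the three options mentioned above (binning, capping, and the log transform), here is a minimal Python sketch. It is not part of the Amazon ML or SQL workflow. The fare values, the cap threshold, the function names, and the 100 to 200 bin (which fills in the elided middle of the range list) are all assumptions made for illustration, and the log is taken in base 2, which is how we read the query above.

```python
import bisect
import math

# Hypothetical fares for illustration only; not taken from the Titanic dataset.
SAMPLE_FARES = [0.0, 7.25, 20.0, 71.28, 151.55, 263.0, 511.0]

# Bin edges based on the ranges in the text; the 100-200 bin is an assumption.
BIN_EDGES = [20, 50, 100, 200]
BIN_LABELS = ["<20", "20-50", "50-100", "100-200", "200+"]

# Hypothetical cap; in practice this could be the 95th percentile of the fares.
CAP_THRESHOLD = 200.0


def bin_fare(fare):
    """Return the label of the range the fare falls into."""
    return BIN_LABELS[bisect.bisect_right(BIN_EDGES, fare)]


def cap_fare(fare, threshold=CAP_THRESHOLD):
    """Replace any fare above the threshold with the threshold itself."""
    return min(fare, threshold)


def log_fare(fare):
    """Base-2 log of (fare + 1), assumed to mirror the SQL expression."""
    return math.log2(fare + 1)


for fare in SAMPLE_FARES:
    print(fare, bin_fare(fare), cap_fare(fare), round(log_fare(fare), 2))
```

Binning and capping both discard information about how large the extreme fares are, whereas the log transform keeps their ordering and only compresses the scale: a fare of 511 becomes log2(512) = 9.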
The log_fare ...