Part of the trick that makes ANNs perform as well as they do is the use of nonlinear activation functions. A first thought is simply to use a step function. In this case, an output from a particular neuron occurs only when the input exceeds zero. The problem with the step function is that it is not usefully differentiable: it consists only of flat sections, where the gradient is zero, and it is discontinuous at zero, where the gradient is undefined. Gradient-based training therefore has no error signal to propagate.
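As a minimal sketch of this point (assuming NumPy is available), we can write out the step function and its derivative directly; wherever the derivative exists it is zero, so backpropagation receives nothing to work with:

```python
import numpy as np

def step(x):
    # Heaviside step: outputs 1 only when the input exceeds zero
    return np.where(x > 0, 1.0, 0.0)

def step_grad(x):
    # The gradient is 0 on every flat section and undefined at x = 0,
    # so gradient-based learning gets no useful error signal
    return np.zeros_like(x)

x = np.array([-2.0, -0.5, 0.5, 2.0])
print(step(x))       # [0. 0. 1. 1.]
print(step_grad(x))  # [0. 0. 0. 0.]
```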
Another method is to use a linear activation function; however, this restricts our output to a linear function of the input as well, no matter how many layers we stack, because a composition of linear functions is itself linear. This is not what we want, since we need to model highly nonlinear real-world data. It turns out that we can inject nonlinearity into our networks by using nonlinear activation functions.
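A short sketch of the collapse argument, again assuming NumPy: two stacked linear layers reduce exactly to a single linear layer, while inserting a nonlinearity (tanh is used here purely for illustration) between them breaks that equivalence:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

# Two stacked linear layers...
two_linear = W2 @ (W1 @ x + b1) + b2
# ...collapse to one linear layer with W = W2 @ W1 and b = W2 @ b1 + b2
one_linear = (W2 @ W1) @ x + (W2 @ b1 + b2)
print(np.allclose(two_linear, one_linear))  # True

# A nonlinearity between the layers prevents this collapse
nonlinear = W2 @ np.tanh(W1 @ x + b1) + b2
print(np.allclose(two_linear, nonlinear))   # False
```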