
A guest post by Ian Goodfellow, a Ph.D. student studying deep learning at Yoshua Bengio’s LISA lab in Montréal, Québec. He is one of the lead developers of Pylearn2, an open source machine learning research library.

In an earlier guest post I introduced the Pylearn2 machine learning research library and showed how to use it to solve a regression problem. In a second post I showed you how to use Pylearn2 to do constrained optimization. In this tip, I’ll show you how to use momentum in Pylearn2.

What is momentum?

Momentum is an important technique for improving the performance of stochastic gradient descent.

Typical stochastic gradient descent applies this learning rule to each minibatch of data:

`params := params - learning_rate * gradient`

where `params` is a vector of parameters defining the model, `learning_rate` is a scalar controlling how fast the parameters should be updated, and `gradient` is the gradient of the objective function with respect to the parameters.
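The plain SGD update can be sketched in NumPy as follows (the function name and toy values are illustrative, not part of the Pylearn2 API):

```python
import numpy as np

# Plain SGD step: move against the gradient, scaled by the learning rate.
def sgd_update(params, gradient, learning_rate):
    return params - learning_rate * gradient

params = np.array([1.0, -2.0])
gradient = np.array([0.5, -0.5])
print(sgd_update(params, gradient, learning_rate=0.1))  # [ 0.95 -1.95]
```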

A more effective learning rule is to use momentum. When using momentum, you move the parameters with a certain velocity at every time step. This velocity decays exponentially over time, while the gradient can add to it. Since the velocity is a vector, the gradient is able to change its direction as well as its magnitude. The momentum learning rule is:

`velocity := momentum_constant * velocity - learning_rate * gradient`

`params := params + velocity`

where `momentum_constant` is a scalar controlling how quickly the velocity decays. `momentum_constant` should be at least `0`. If it is exactly `0`, then the momentum learning rule reduces to the standard gradient descent learning rule. `momentum_constant` should also be strictly less than `1`.
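A minimal NumPy sketch of this rule, applied to a toy quadratic objective (the function and variable names are illustrative, not Pylearn2 API):

```python
import numpy as np

def momentum_update(params, velocity, gradient, learning_rate, momentum_constant):
    # The velocity decays by momentum_constant, and the scaled gradient adds to it.
    velocity = momentum_constant * velocity - learning_rate * gradient
    # The parameters move with the current velocity.
    params = params + velocity
    return params, velocity

# Toy example: minimize f(w) = w**2, whose gradient is 2 * w.
w, v = np.array([5.0]), np.zeros(1)
for _ in range(100):
    w, v = momentum_update(w, v, 2 * w, learning_rate=0.05, momentum_constant=0.5)
print(abs(w[0]))  # the iterate approaches the minimum at 0
```

Note that with `momentum_constant` set to `0`, the velocity is just `-learning_rate * gradient`, which recovers the standard rule above.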

The use of momentum is often neglected but should not be. Stochastic gradient descent with momentum is the definitive state of the art in deep neural network training. Essentially, anyone using stochastic gradient descent to train neural networks should be sure to also use momentum. An excellent scientific paper on the subject is available here.

How do you use momentum in Pylearn2?

Using momentum in Pylearn2 is very simple. If you’re already using the `SGD` class to do stochastic gradient descent training of a neural network, all you need to do is specify the `learning_rule` argument of the `SGD` class. To use momentum, set this argument to a `Momentum` object. A good value for its `init_momentum` argument is `0.5`.
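In a Pylearn2 YAML training description, this looks roughly like the following sketch (the `!obj:` class paths are my assumption of where `SGD` and `Momentum` live; check them against your Pylearn2 version):

```yaml
# Sketch of the algorithm section of a Train YAML file; other SGD
# arguments (dataset, termination criterion, etc.) are omitted.
algorithm: !obj:pylearn2.training_algorithms.sgd.SGD {
    learning_rate: .05,
    learning_rule: !obj:pylearn2.training_algorithms.learning_rule.Momentum {
        init_momentum: .5,
    },
}
```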

That will make sure the learning algorithm uses the momentum learning rule, but the `momentum_constant` will stay the same for every step of learning. You probably want this value to increase over time. You can do this by installing a `MomentumAdjustor` object into the `extensions` argument of the `Train` object. This object will be run once per training epoch to gradually increase the `momentum_constant`.
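A hedged sketch of what that extension might look like in the same YAML file (the parameter names `start`, `saturate`, and `final_momentum` are my recollection of the `MomentumAdjustor` interface, so verify them against your Pylearn2 version):

```yaml
# Sketch of the extensions section of a Train YAML file.
extensions: [
    !obj:pylearn2.training_algorithms.learning_rule.MomentumAdjustor {
        start: 1,            # first epoch at which to adjust the momentum
        saturate: 20,        # epoch at which final_momentum is reached
        final_momentum: .99, # value the momentum ramps up to
    },
]
```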

Here is some example code showing how to use Jobman and Pylearn2 to run a hyperparameter search for good settings to train a neural net on the MNIST dataset using momentum, and here is some example code to run the same search without momentum. If you plot test set error against training time for the two methods, you’ll see that the best test set error is obtained with momentum.

Conclusion

Note that momentum is not a magic bullet: most of the jobs using momentum perform worse than most of the jobs not using it. When momentum is tuned correctly, however, it improves performance, as shown by the fact that the best job in the figure above does use momentum.
