A guest post by Ian Goodfellow, a Ph.D. student studying deep learning at Yoshua Bengio’s LISA lab in Montréal, Québec. He is one of the lead developers of Pylearn2, an open source machine learning research library.
In my previous guest post I introduced the Pylearn2 machine learning research library and showed how to use it to solve a regression problem. In this post I’ll show how to use Pylearn2 to do constrained optimization.
What is constrained optimization?
All of the machine learning problems solved by Pylearn2 can be viewed as optimization problems. In each learning problem, we have a set of parameters and a cost function. The parameters are adjusted to minimize the cost function. Usually the cost function is the negative log-likelihood of the data (so that minimizing the cost is equivalent to maximizing the likelihood). Depending upon the problem, the parameters may be weights of a neural network, the mean and variance of a Gaussian distribution, or the weights of a Boltzmann machine. Usually when we solve an optimization problem, the optimization algorithm can set any of these parameters to any value.
However, in many cases, we want to limit the range of values some of the parameters take on. This is called constrained optimization. One simple example is when only some of the values are valid. For example, the variance of a Gaussian distribution must never be zero or negative, so we often constrain it to be larger than some small constant.
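As a rough illustration of the variance example, here is a minimal sketch in plain Python (not Pylearn2 code). It fits the variance of a Gaussian with a fixed mean by gradient descent on the negative log-likelihood and, after every step, moves the proposed variance back above a small constant. The function names, the learning rate, and `min_variance` are all assumptions chosen for illustration.

```python
def nll_grad_variance(data, mean, variance):
    """Gradient of the average Gaussian negative log-likelihood w.r.t. the variance."""
    n = len(data)
    mean_sq_dev = sum((x - mean) ** 2 for x in data) / n
    return 0.5 / variance - mean_sq_dev / (2.0 * variance ** 2)


def fit_variance(data, mean, variance=1.0, learning_rate=0.1,
                 min_variance=0.01, n_steps=50):
    """Gradient descent on the variance with a projection after each step."""
    for _ in range(n_steps):
        proposed = variance - learning_rate * nll_grad_variance(data, mean, variance)
        # Projection: the unconstrained step may propose an invalid value,
        # so move it back into the allowed region variance >= min_variance.
        variance = max(proposed, min_variance)
    return variance


# Every point equals the mean, so the unconstrained optimum is variance = 0.
# The constraint keeps the estimate at the lower bound instead.
fitted = fit_variance([2.0, 2.0, 2.0], mean=2.0)
```

Without the `max` call, the same loop would eventually propose a negative variance for this data, which is not a valid Gaussian parameter.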
How does constrained optimization in Pylearn2 work?
Many optimization algorithms exist for solving specialized kinds of constrained optimization problems very efficiently. In Pylearn2, we use a small number of generic algorithms that can solve generic constrained optimization problems. For some kinds of problems (such as convex problems with simple constraints) the generic algorithms are not very efficient compared to the specialized algorithms. However, neural network and Boltzmann machine cost functions usually do not fall into the class of problems for which specialized solvers exist.
In Pylearn2, to make constrained optimization as generic and broadly supported as possible, we implement it as a post-processing step of other off-the-shelf algorithms. To understand how this works, it’s good to understand how shared variables work in Theano. If you’re not already familiar with shared variables, check out the tutorial on the subject. All Pylearn2 parameters are represented as Theano shared variables. All of the Pylearn2 optimization algorithms work by constructing Theano update dictionaries that explain how to perform a single step of learning. These updates are symbolic expressions, like those used in a computer algebra system. For example, the Pylearn2 SGD class implements stochastic gradient descent by producing an updates dictionary that says to update each shared variable representing a set of parameters to itself minus the learning rate times the gradient of the cost with respect to that variable. At the time the updates dictionary is constructed, we haven’t actually performed any calculations with specific numeric values yet; we’ve only made a description of what calculations will need to be performed.
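The deferred nature of an updates dictionary can be mimicked without Theano. The sketch below is a toy analogy in plain Python, not the Pylearn2 SGD class: each dictionary value is a function describing how to compute a parameter's new value, and nothing numeric happens until `apply_updates` is called. All names here (`build_sgd_updates`, `apply_updates`, the quadratic cost) are hypothetical.

```python
def build_sgd_updates(grads, learning_rate):
    """Return {parameter name: rule}; each rule is a deferred description."""
    updates = {}
    for name, grad_fn in grads.items():
        # Default arguments bind the current loop values to this rule.
        def rule(values, name=name, grad_fn=grad_fn):
            return values[name] - learning_rate * grad_fn(values)
        updates[name] = rule
    return updates


def apply_updates(values, updates):
    """Evaluate every rule on the old values and return the new values."""
    new_values = dict(values)
    # Every rule reads the old values, so all parameters change simultaneously.
    new_values.update({name: rule(values) for name, rule in updates.items()})
    return new_values


# Toy cost: (w - 3)^2 + (b + 1)^2, with hand-written gradients.
grads = {
    "w": lambda values: 2.0 * (values["w"] - 3.0),
    "b": lambda values: 2.0 * (values["b"] + 1.0),
}
updates = build_sgd_updates(grads, learning_rate=0.1)   # no arithmetic yet
params = apply_updates({"w": 0.0, "b": 0.0}, updates)   # one SGD step
```

In Theano the dictionary values are symbolic expressions rather than Python functions, but the separation between describing a step and executing it is the same idea.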
This is where we have the opportunity to introduce constrained optimization. Before we compile the updates dictionary into an executable Theano function, we pass the updates dictionary to the Model’s censor_updates method. When you implement your own Model, you can make this method modify the dictionary to enforce any constraint that you like. The idea is that the optimization algorithm proposes a move in parameter space, then you project that move back into the region of parameter space that respects the constraint.
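The propose-then-project pattern can be sketched with a toy model in plain Python. This is an analogy under stated assumptions, not the Pylearn2 Model class: `ToyModel`, `train_step`, and the use of plain functions as update rules are all hypothetical stand-ins for Theano expressions and compiled functions.

```python
class ToyModel:
    """Hypothetical model with one constrained parameter, 'variance'."""

    def __init__(self, min_variance):
        self.min_variance = min_variance

    def censor_updates(self, updates):
        # Modify the dictionary in place: keep the key, and replace the
        # proposed value with a projected version of that same proposal.
        if "variance" in updates:
            proposed = updates["variance"]
            updates["variance"] = lambda values: max(proposed(values),
                                                     self.min_variance)


def train_step(model, values, updates):
    """Let the model censor the proposed updates, then execute them."""
    model.censor_updates(updates)
    new_values = dict(values)
    new_values.update({name: rule(values) for name, rule in updates.items()})
    return new_values


# The "optimizer" proposes a step that would make the variance negative.
proposed_updates = {
    "variance": lambda values: values["variance"] - 0.2,
    "mean": lambda values: values["mean"] + 0.5,
}
state = train_step(ToyModel(min_variance=0.01),
                   {"variance": 0.05, "mean": 1.0},
                   proposed_updates)
```

The unconstrained parameter `mean` passes through untouched, while the proposed `variance` is pulled back to the boundary of the allowed region.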
What can you do with constrained optimization?
One exciting example of constrained optimization in action is limiting the norm of the weight vector of each hidden unit in a multilayer perceptron. This way of constraining matrices to improve generalization on machine learning problems was first advocated by Nathan Srebro and Adi Shraibman in this article. Later, Geoffrey Hinton and his collaborators used this approach to regularize the weight matrices of multilayer perceptrons, and with it obtained state-of-the-art results on several benchmark tasks in their original paper on dropout. My colleagues and I also found it very useful on several computer vision benchmark datasets in our paper on maxout.
Several Pylearn2 MLP Layer classes support this kind of regularization, as can be seen in mlp.py. Here is an example of how to implement such a constraint on a parameter matrix:
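    import theano.tensor as T

    def censor_updates(self, updates):
        W = self.W
        if W in updates:
            # Symbolic expression for the value the optimizer proposes for W
            updated_W = updates[W]
            # Norm of each column of the proposed weight matrix
            col_norms = T.sqrt(T.sum(T.sqr(updated_W), axis=0))
            # Leave small norms alone; cap large ones at max_col_norm
            desired_norms = T.clip(col_norms, 0, self.max_col_norm)
            # Rescale each column; 1e-7 guards against division by zero
            constrained_W = updated_W * (desired_norms / (1e-7 + col_norms))
            updates[W] = constrained_W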
Note that it’s important that the dictionary key uses W and the dictionary value uses constrained_W. A common mistake is to use W as the value, which prevents any learning from happening.
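To check numerically what the rescaling does, here is a NumPy analogue of the Theano expressions above. It is a sketch for experimentation rather than Pylearn2 code; the function name `clip_column_norms` and the example matrix are made up for illustration, and it assumes NumPy is installed.

```python
import numpy as np


def clip_column_norms(W, max_col_norm, eps=1e-7):
    """Rescale each column of W so that its L2 norm is at most max_col_norm."""
    col_norms = np.sqrt(np.sum(np.square(W), axis=0))
    # Norms below the limit are left as they are; larger ones are capped.
    desired_norms = np.clip(col_norms, 0, max_col_norm)
    # Broadcasting multiplies every column by its own scale factor.
    return W * (desired_norms / (eps + col_norms))


# The first column has norm 5.0 (too large), the second has norm 0.5 (fine).
W = np.array([[3.0, 0.3],
              [4.0, 0.4]])
constrained = clip_column_norms(W, max_col_norm=2.0)
new_norms = np.sqrt(np.sum(np.square(constrained), axis=0))
```

Only the first column is shrunk, and its direction is preserved; the second column changes by no more than the tiny amount introduced by `eps`.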
This YAML file demonstrates how to train a deep rectifier net with dropout and max norm constraints on the MNIST dataset. This file is a template, allowing different values to be filled in. You can disable the max norm constraints by setting them to very high values (or commenting them out). It can be interesting to compare performance with and without norm constraints.
In this graph, I’ve plotted the test set misclassification rate over time for a net both with and without norm constraints. The net with norm constraints makes significantly fewer mistakes:
This image was made with Pylearn2’s plot_monitor.py script run on the output of two different nets trained with the above YAML template.
Many other techniques are possible with constrained optimization, but this is one of the simplest and most useful.