Optimizer

Next, we define the optimizer, which is based on the Adam optimizer.
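
As a minimal sketch of this step, assuming TensorFlow 2.x with the tf.keras API (the model architecture, layer sizes, and the 0.001 learning rate are illustrative assumptions, not values from the text):

```python
import tensorflow as tf

# Define the Adam optimizer; 0.001 is the Keras default learning rate (alpha).
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)

# Attach it to a model at compile time (the architecture here is assumed).
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(10,)),
    tf.keras.layers.Dense(1)
])
model.compile(optimizer=optimizer, loss='mse')
```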

Adam differs from classical stochastic gradient descent. Stochastic gradient descent maintains a single learning rate (termed alpha) for all weight updates, and the learning rate does not change during training.

Adam, by contrast, maintains a learning rate for each network weight (parameter) and adapts it separately as learning unfolds. It computes individual adaptive learning rates for different parameters from estimates of the first and second moments of the gradients.
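
For reference, the standard Adam update rule from Kingma and Ba (2015) makes this concrete; the symbols follow the original paper rather than anything defined earlier in this section:

```latex
% Adam update for each parameter \theta at step t, with gradient g_t
\begin{align*}
m_t &= \beta_1 m_{t-1} + (1 - \beta_1)\, g_t
      && \text{first moment (mean) estimate} \\
v_t &= \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2
      && \text{second moment (uncentered variance) estimate} \\
\hat{m}_t &= \frac{m_t}{1 - \beta_1^t}, \qquad
\hat{v}_t = \frac{v_t}{1 - \beta_2^t}
      && \text{bias correction} \\
\theta_t &= \theta_{t-1} - \frac{\alpha\, \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
      && \text{per-parameter update}
\end{align*}
```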

Adam combines the advantages of two other extensions of stochastic gradient descent.

The adaptive gradient algorithm (AdaGrad) maintains a per-parameter learning rate that improves performance on problems with sparse gradients. Root mean square propagation (RMSProp) also maintains per-parameter learning rates, adapting them based on the average of recent magnitudes of the gradients for each weight. A sketch of both update rules follows.
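
The per-parameter bookkeeping of these two methods can be sketched in a few lines of NumPy; the function names, argument names, and hyperparameter values below are illustrative assumptions, not code from the text:

```python
import numpy as np

def adagrad_step(param, grad, accum, lr=0.01, eps=1e-8):
    # AdaGrad: accumulate the sum of squared gradients per parameter,
    # so frequently updated parameters get a smaller effective learning rate.
    accum += grad ** 2
    param -= lr * grad / (np.sqrt(accum) + eps)
    return param, accum

def rmsprop_step(param, grad, avg_sq, lr=0.001, rho=0.9, eps=1e-8):
    # RMSProp: keep an exponential moving average of squared gradients,
    # so the per-parameter learning rate tracks recent gradient magnitudes.
    avg_sq = rho * avg_sq + (1 - rho) * grad ** 2
    param -= lr * grad / (np.sqrt(avg_sq) + eps)
    return param, avg_sq
```

Adam keeps RMSProp's moving average of squared gradients and adds a moving average of the gradients themselves, together with the bias correction shown earlier.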
