One condition of our method, which can also be construed as a limitation, is that it requires knowledge of the global minimum \(f^\star\) of the objective function. Nevertheless, the method remains applicable to a large class of problems. In deep learning in particular, regularization is now commonly performed indirectly with dropout15 or batch normalization16 rather than with weight decay, so under mean squared error or cross-entropy loss the global minimum is simply 0. This is the case for all our experiments, and we show that Eve can improve over other methods in optimizing complex, practical models.
We now conduct experiments to compare Eve with other popular optimizers used in deep learning. We use the same hyperparameter settings (described in Figure 1) for all experiments. We also conduct an experiment to study the behavior of Eve with respect to the new hyperparameters \(\beta_3\) and \(c\). For each experiment, we use the same random number seed when comparing different methods; this ensures identical weight initializations (we use the scheme proposed by Glorot and Bengio17 for all experiments) and identical mini-batch splits. In all experiments, we use cross-entropy as the loss function, and since the models do not have explicit regularization, \(f^\star\) is set to 0 when training with Eve.
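As a rough sketch of this setup in Keras (the two-layer model, input size, and learning rate below are illustrative placeholders rather than the architectures used in our experiments; since the Eve implementation itself is not listed here, Adam stands in at the compile step):

```python
import numpy as np
import tensorflow as tf

SEED = 0  # the same seed is reused for every optimizer being compared

np.random.seed(SEED)
tf.random.set_seed(SEED)

# Illustrative two-layer classifier; all weights use Glorot (Xavier) initialization.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),
    tf.keras.layers.Dense(256, activation="relu",
                          kernel_initializer=tf.keras.initializers.GlorotUniform(seed=SEED)),
    tf.keras.layers.Dense(10, activation="softmax",
                          kernel_initializer=tf.keras.initializers.GlorotUniform(seed=SEED)),
])

# Cross-entropy loss with no explicit regularization, so the global minimum f* is 0.
# Adam stands in here; Eve would be swapped in, with f* = 0 supplied to it.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss="categorical_crossentropy")
```

Reusing the same seed before building and training each model is what keeps the Glorot initializations and mini-batch orderings identical across optimizers.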
First, we compare Eve with other optimizers for training a Convolutional Neural Network (CNN). The optimizers we compare against are Adam, Adamax, RMSprop, Adagrad, Adadelta, and SGD with Nesterov momentum18 (momentum \(0.9\)). The learning rate was searched over \(\{1\times 10^{-6}\), \(5\times 10^{-6}\), \(1\times 10^{-5}\), \(5\times 10^{-5}\), \(1\times 10^{-4}\), \(5\times 10^{-4}\), \(1\times 10^{-3}\), \(5\times 10^{-3}\), \(1\times 10^{-2}\), \(5\times 10^{-2}\), \(1\times 10^{-1}\}\), and the value that led to the lowest final loss was selected for reporting results. For Adagrad, Adamax, and Adadelta, we additionally searched over the prescribed default learning rates (\(10^{-2}\), \(2\times 10^{-3}\), and \(1\), respectively).
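For concreteness, the learning rate selection amounts to a loop of the following form (a sketch; `train_model` is a stand-in for a function that trains the model from the fixed seed with a given learning rate and returns the final training loss):

```python
# Candidate learning rates searched for every optimizer.
LEARNING_RATES = [1e-6, 5e-6, 1e-5, 5e-5, 1e-4, 5e-4, 1e-3, 5e-3, 1e-2, 5e-2, 1e-1]

def select_learning_rate(train_model, extra_defaults=()):
    """Return the learning rate that gives the lowest final training loss.

    `extra_defaults` holds the prescribed default rates additionally searched
    for Adagrad (1e-2), Adamax (2e-3), and Adadelta (1.0).
    """
    candidates = list(LEARNING_RATES) + list(extra_defaults)
    final_losses = {lr: train_model(lr) for lr in candidates}
    return min(final_losses, key=final_losses.get)
```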
The model is a deep residual network19 with 16 convolutional layers. The network is regularized with batch normalization and dropout, and contains about 680,000 parameters, making it representative of a practical model.
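As a point of reference, the building block of such a network might look roughly as follows; this is only a generic residual block with batch normalization and dropout, not the exact architecture used in our experiments.

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters, dropout_rate=0.5):
    """Generic residual block with batch normalization and dropout.

    Assumes the input tensor already has `filters` channels so that the
    shortcut can be added without a projection.
    """
    shortcut = x
    y = layers.Conv2D(filters, 3, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)
    y = layers.Dropout(dropout_rate)(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    return layers.Activation("relu")(layers.Add()([shortcut, y]))
```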
Figure 2(a) shows the results of training this model on the CIFAR 100 dataset20 for 100 epochs with a batch size of 128. We see that Eve outperforms all other algorithms by a large margin: it quickly surpasses the other methods and converges to a much lower final loss.
We also compare our method with other optimizers for training Recurrent Neural Networks (RNNs). We compare against the same algorithms as in the previous experiment and conduct the learning rate search over the same set of values.
We construct an RNN for the character-level language modeling task on Penn Treebank (PTB)21. Specifically, the model consists of a 2-layer character-level Gated Recurrent Unit (GRU)22 with hidden layers of size 256 and 0.5 dropout between the layers. The sequence length is fixed at 100 characters, and the vocabulary is kept at its original size.
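A minimal Keras sketch of such a model is shown below; the embedding size and the softmax output over the character vocabulary are assumptions made for illustration.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_char_gru_lm(vocab_size, seq_len=100, hidden=256, dropout_rate=0.5):
    """2-layer character-level GRU language model with dropout between layers."""
    inputs = tf.keras.Input(shape=(seq_len,))
    x = layers.Embedding(vocab_size, hidden)(inputs)
    x = layers.GRU(hidden, return_sequences=True)(x)
    x = layers.Dropout(dropout_rate)(x)  # 0.5 dropout between the two GRU layers
    x = layers.GRU(hidden, return_sequences=True)(x)
    outputs = layers.Dense(vocab_size, activation="softmax")(x)  # next-character distribution
    return tf.keras.Model(inputs, outputs)
```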
The results for training this model are shown in Figure 2(b). Different optimizers performed similarly on this task, with Eve achieving slightly higher loss than Adam and Adamax.
We also empirically compare Eve with three common learning rate decay policies: exponential (\(\alpha_t = \alpha_1 \exp(-\gamma t)\)), \(1/t\) (\(\alpha_t = \alpha_1 / (1 + \gamma t)\)), and \(1/\sqrt{t}\) (\(\alpha_t = \alpha_1 / \sqrt{1 + \gamma t}\)). We consider the same CIFAR 100 classification task described in Section 4.1 and use the same CNN model. We applied each of the three decay policies to Adam and tuned both the initial learning rate and the decay strength; the learning rate was again searched over the same set of values as in the previous experiments.
For the decay strength \(\gamma\), we searched over a different set of values for each decay policy, chosen such that the final learning rate after 100 epochs would be \(\alpha_1 / k\), where \(k\) is in \(\{1 \times 10^{4}\), \(5 \times 10^{3}\), \(1 \times 10^{3}\), \(5 \times 10^{2}\), \(1 \times 10^{2}\), \(5 \times 10^{1}\), \(1 \times 10^{1}\), \(5 \times 10^{0}\}\).
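Concretely, for a given \(k\) the decay strength that makes the learning rate equal \(\alpha_1 / k\) after \(T = 100\) epochs has a closed form for each policy (assuming \(t\) counts epochs): \(\gamma = \ln(k)/T\) for exponential decay, \(\gamma = (k-1)/T\) for \(1/t\) decay, and \(\gamma = (k^2-1)/T\) for \(1/\sqrt{t}\) decay. A small helper makes this explicit:

```python
import math

def decay_strength(policy, k, T=100):
    """Solve for gamma so that the learning rate after T epochs is alpha_1 / k."""
    if policy == "exponential":      # alpha_1 * exp(-gamma * T) = alpha_1 / k
        return math.log(k) / T
    if policy == "1/t":              # alpha_1 / (1 + gamma * T) = alpha_1 / k
        return (k - 1) / T
    if policy == "1/sqrt(t)":        # alpha_1 / sqrt(1 + gamma * T) = alpha_1 / k
        return (k ** 2 - 1) / T
    raise ValueError(f"unknown policy: {policy}")
```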
Figure 3(a) compares Eve with the best exponential decay, the best \(1/t\) decay, and the best \(1/\sqrt{t}\) decay applied to Adam. We see that using decay closes some of the gap between the two algorithms, but Eve still converges faster. Moreover, such a decay policy requires careful tuning of the decay strength: as seen in Figure 3(b), the performance of Adam varies considerably across decay strengths. Eve achieves similar or better performance without tuning an additional hyperparameter.
In this experiment, we study the behavior of Eve with respect to the two hyperparameters introduced over Adam: \(\beta_3\) and \(c\). We use the previously presented ResNet model on CIFAR 100, and an RNN model trained for question answering on question 14 (picked randomly) of the bAbI-10k dataset23. The question answering model composes two separate GRUs (each with hidden layers of size 256), one for question sentences and one for story passages.
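A rough sketch of this model is given below; the shared embedding, the concatenation of the two encodings, and the softmax over the answer vocabulary are assumptions made for illustration, since the exact composition is not detailed here.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_qa_model(vocab_size, story_len, question_len, hidden=256):
    """Question-answering model with two GRU encoders (a sketch)."""
    story = tf.keras.Input(shape=(story_len,))
    question = tf.keras.Input(shape=(question_len,))
    embed = layers.Embedding(vocab_size, hidden)          # assumed shared embedding
    story_vec = layers.GRU(hidden)(embed(story))          # encodes the story passage
    question_vec = layers.GRU(hidden)(embed(question))    # encodes the question
    merged = layers.Concatenate()([story_vec, question_vec])
    answer = layers.Dense(vocab_size, activation="softmax")(merged)  # single-word answer
    return tf.keras.Model([story, question], answer)
```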
We trained the models using Eve with \(\beta_3\) in \(\{0\), \(0.00001\), \(0.0001\), \(0.001\), \(0.01\), \(0.1\), \(0.3\), \(0.5\), \(0.7\), \(0.9\), \(0.99\), \(0.999\), \(0.9999\), \(0.99999\}\), and \(c\) in \(\{2\), \(5\), \(10\), \(15\), \(20\), \(50\), \(100\}\). For each \((\beta_3, c)\) pair, we picked the best learning rate from the same set of values used in previous experiments. We also used Adam with the best learning rate chosen from the same set as Eve.
Figure 4 shows the loss curves for each hyperparameter pair, and that of Adam. The bold line in the figure is for \((\beta_3, c) = (0.999, 10)\), which are the default values. For these particular cases, we see that for almost all settings of the hyperparameters, Eve outperforms Adam, and the default values lead to performance close to the best. In general, for different models and/or tasks, not all hyperparameter settings lead to improved performance over Adam, and we did not observe any consistent trend in the performance across hyperparameters. However, the default values suggested in this paper consistently lead to good performance on a variety of tasks. We also note that the default hyperparameter values were not selected based on this experiment, but through an informal initial search using a smaller model.
We proposed a new algorithm, Eve, for stochastic gradient-based optimization. Our algorithm builds on adaptive methods that maintain a separate learning rate for each parameter, and in addition adaptively tunes a global learning rate using feedback from the objective function. It is simple to implement and efficient, both computationally and in terms of memory.
Through experiments with CNNs and RNNs, we showed that Eve outperforms other state-of-the-art optimizers in training large neural network models. We also compared Eve with learning rate decay methods and showed that Eve can achieve similar or better performance with far less tuning. Finally, we studied the hyperparameters of Eve and found that a range of choices leads to performance improvements over Adam.
One limitation of our method is that it requires knowledge of the global minimum of the objective function. One possible way to address this is to use an estimate of the minimum and to update this estimate as training progresses; a similar approach has been used with Polyak step sizes in the subgradient method.
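For concreteness, a Polyak step takes the form \(\alpha_t = (f(x_t) - \hat{f}^\star) / \lVert g_t \rVert^2\), where \(\hat{f}^\star\) is the current estimate of the minimum. A minimal sketch of the subgradient method with such an estimated step size is given below; the particular estimate-update rule (keeping \(\hat{f}^\star\) a small margin below the best value seen so far) is one common variant, not a prescription of this paper.

```python
import numpy as np

def subgradient_polyak_estimated(f, grad, x0, steps=1000, delta=1e-3):
    """Subgradient method with a Polyak step size based on an estimated minimum."""
    x = np.asarray(x0, dtype=float)
    f_best = f(x)
    for _ in range(steps):
        g = grad(x)
        g_norm_sq = float(np.dot(g, g))
        if g_norm_sq == 0.0:
            break
        f_hat_star = f_best - delta                    # running estimate of the minimum
        x = x - ((f(x) - f_hat_star) / g_norm_sq) * g  # Polyak step
        f_best = min(f_best, f(x))
    return x
```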
Duchi, Hazan, and Singer, “Adaptive Subgradient Methods for Online Learning and Stochastic Optimization”.
Polyak, “Introduction to Optimization. Translations Series in Mathematics and Engineering”.
Glorot and Bengio, “Understanding the Difficulty of Training Deep Feedforward Neural Networks”.
Nesterov, “A Method for Unconstrained Convex Minimization Problem with the Rate of Convergence O(1/k^2)”.
He et al., “Deep Residual Learning for Image Recognition”.
Krizhevsky and Hinton, “Learning Multiple Layers of Features from Tiny Images”.
Marcus, Marcinkiewicz, and Santorini, “Building a Large Annotated Corpus of English”.
Chung et al., “Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling”.