.. _sec_scheduler:
Learning Rate Scheduling
========================
So far we primarily focused on optimization *algorithms* for how to
update the weight vectors rather than on the *rate* at which they are
being updated. Nonetheless, adjusting the learning rate is often just as
important as the actual algorithm. There are a number of aspects to
consider:
- Most obviously, the *magnitude* of the learning rate matters. If it is
  too large, optimization diverges; if it is too small, training takes too
  long or we end up with a suboptimal result. We saw
  previously that the condition number of the problem matters (see
  e.g., :numref:`sec_momentum` for details). Intuitively it is the
  ratio of the amount of change in the least sensitive direction
  vs. the most sensitive one.
- Secondly, the rate of decay is just as important. If the learning
  rate remains large, we may simply end up bouncing around the minimum
  and thus not reach optimality. :numref:`sec_minibatch_sgd`
  discussed this in some detail and we analyzed performance guarantees
  in :numref:`sec_sgd`. In short, we want the rate to decay, but
  probably more slowly than :math:`\mathcal{O}(t^{-\frac{1}{2}})`, which
  would be a good choice for convex problems.
- Another aspect that is equally important is *initialization*. This
pertains both to how the parameters are set initially (review
:numref:`sec_numerical_stability` for details) and also how they
evolve initially. This goes under the moniker of *warmup*, i.e., how
rapidly we start moving towards the solution initially. Large steps
in the beginning might not be beneficial, in particular since the
initial set of parameters is random. The initial update directions
might be quite meaningless, too.
- Lastly, there are a number of optimization variants that perform
cyclical learning rate adjustment. This is beyond the scope of the
current chapter. We recommend the reader to review details in
:cite:t:`Izmailov.Podoprikhin.Garipov.ea.2018`, e.g., how to obtain
better solutions by averaging over an entire *path* of parameters.
Given that managing learning rates requires a fair amount of detail,
most deep learning frameworks have tools to deal with this
automatically. In the current chapter we will review the effects that
different schedules have on accuracy and also show how this can be
managed efficiently via a *learning rate scheduler*.
Toy Problem
-----------
We begin with a toy problem that is cheap enough to compute easily, yet
sufficiently nontrivial to illustrate some of the key aspects. For that
we pick a slightly modernized version of LeNet (``relu`` instead of
``sigmoid`` activation, MaxPooling rather than AveragePooling), as
applied to Fashion-MNIST. Moreover, we hybridize the network for
performance. Since most of the code is standard we just introduce the
basics without further detailed discussion. See :numref:`chap_cnn` for
a refresher as needed.
.. code:: python
%matplotlib inline
import math
import torch
from torch import nn
from torch.optim import lr_scheduler
from d2l import torch as d2l
def net_fn():
model = nn.Sequential(
nn.Conv2d(1, 6, kernel_size=5, padding=2), nn.ReLU(),
nn.MaxPool2d(kernel_size=2, stride=2),
nn.Conv2d(6, 16, kernel_size=5), nn.ReLU(),
nn.MaxPool2d(kernel_size=2, stride=2),
nn.Flatten(),
nn.Linear(16 * 5 * 5, 120), nn.ReLU(),
nn.Linear(120, 84), nn.ReLU(),
nn.Linear(84, 10))
return model
loss = nn.CrossEntropyLoss()
device = d2l.try_gpu()
batch_size = 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size=batch_size)
# The code is almost identical to `d2l.train_ch6` defined in the
# lenet section of chapter convolutional neural networks
def train(net, train_iter, test_iter, num_epochs, loss, trainer, device,
scheduler=None):
net.to(device)
animator = d2l.Animator(xlabel='epoch', xlim=[0, num_epochs],
legend=['train loss', 'train acc', 'test acc'])
for epoch in range(num_epochs):
metric = d2l.Accumulator(3) # train_loss, train_acc, num_examples
for i, (X, y) in enumerate(train_iter):
net.train()
trainer.zero_grad()
X, y = X.to(device), y.to(device)
y_hat = net(X)
l = loss(y_hat, y)
l.backward()
trainer.step()
with torch.no_grad():
metric.add(l * X.shape[0], d2l.accuracy(y_hat, y), X.shape[0])
train_loss = metric[0] / metric[2]
train_acc = metric[1] / metric[2]
if (i + 1) % 50 == 0:
animator.add(epoch + i / len(train_iter),
(train_loss, train_acc, None))
test_acc = d2l.evaluate_accuracy_gpu(net, test_iter)
animator.add(epoch+1, (None, None, test_acc))
if scheduler:
if scheduler.__module__ == lr_scheduler.__name__:
# Using PyTorch In-Built scheduler
scheduler.step()
else:
# Using custom defined scheduler
for param_group in trainer.param_groups:
param_group['lr'] = scheduler(epoch)
print(f'train loss {train_loss:.3f}, train acc {train_acc:.3f}, '
f'test acc {test_acc:.3f}')
.. code:: python
%matplotlib inline
from mxnet import autograd, gluon, init, lr_scheduler, np, npx
from mxnet.gluon import nn
from d2l import mxnet as d2l
npx.set_np()
net = nn.HybridSequential()
net.add(nn.Conv2D(channels=6, kernel_size=5, padding=2, activation='relu'),
nn.MaxPool2D(pool_size=2, strides=2),
nn.Conv2D(channels=16, kernel_size=5, activation='relu'),
nn.MaxPool2D(pool_size=2, strides=2),
nn.Dense(120, activation='relu'),
nn.Dense(84, activation='relu'),
nn.Dense(10))
net.hybridize()
loss = gluon.loss.SoftmaxCrossEntropyLoss()
device = d2l.try_gpu()
batch_size = 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size=batch_size)
# The code is almost identical to `d2l.train_ch6` defined in the
# lenet section of chapter convolutional neural networks
def train(net, train_iter, test_iter, num_epochs, loss, trainer, device):
net.initialize(force_reinit=True, ctx=device, init=init.Xavier())
animator = d2l.Animator(xlabel='epoch', xlim=[0, num_epochs],
legend=['train loss', 'train acc', 'test acc'])
for epoch in range(num_epochs):
metric = d2l.Accumulator(3) # train_loss, train_acc, num_examples
for i, (X, y) in enumerate(train_iter):
X, y = X.as_in_ctx(device), y.as_in_ctx(device)
with autograd.record():
y_hat = net(X)
l = loss(y_hat, y)
l.backward()
trainer.step(X.shape[0])
metric.add(l.sum(), d2l.accuracy(y_hat, y), X.shape[0])
train_loss = metric[0] / metric[2]
train_acc = metric[1] / metric[2]
if (i + 1) % 50 == 0:
animator.add(epoch + i / len(train_iter),
(train_loss, train_acc, None))
test_acc = d2l.evaluate_accuracy_gpu(net, test_iter)
animator.add(epoch + 1, (None, None, test_acc))
print(f'train loss {train_loss:.3f}, train acc {train_acc:.3f}, '
f'test acc {test_acc:.3f}')
.. code:: python
%matplotlib inline
import math
import tensorflow as tf
from tensorflow.keras.callbacks import LearningRateScheduler
from d2l import tensorflow as d2l
def net():
    return tf.keras.models.Sequential([
        tf.keras.layers.Conv2D(filters=6, kernel_size=5, activation='relu',
                               padding='same'),
        tf.keras.layers.MaxPool2D(pool_size=2, strides=2),
        tf.keras.layers.Conv2D(filters=16, kernel_size=5,
                               activation='relu'),
        tf.keras.layers.MaxPool2D(pool_size=2, strides=2),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(120, activation='relu'),
        tf.keras.layers.Dense(84, activation='relu'),
        tf.keras.layers.Dense(10)])
batch_size = 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size=batch_size)
# The code is almost identical to `d2l.train_ch6` defined in the
# lenet section of chapter convolutional neural networks
def train(net_fn, train_iter, test_iter, num_epochs, lr,
          device=d2l.try_gpu(), custom_callback=False):
device_name = device._device_name
strategy = tf.distribute.OneDeviceStrategy(device_name)
with strategy.scope():
optimizer = tf.keras.optimizers.SGD(learning_rate=lr)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
net = net_fn()
net.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])
callback = d2l.TrainCallback(net, train_iter, test_iter, num_epochs,
device_name)
if custom_callback is False:
net.fit(train_iter, epochs=num_epochs, verbose=0,
callbacks=[callback])
else:
net.fit(train_iter, epochs=num_epochs, verbose=0,
callbacks=[callback, custom_callback])
return net
Let’s have a look at what happens if we invoke this algorithm with
default settings, such as a learning rate of :math:`0.3`, and train for
:math:`30` epochs. Note how the training accuracy keeps on
increasing while progress in terms of test accuracy stalls beyond a
point. The gap between the two curves indicates overfitting.
.. code:: python
lr, num_epochs = 0.3, 30
net = net_fn()
trainer = torch.optim.SGD(net.parameters(), lr=lr)
train(net, train_iter, test_iter, num_epochs, loss, trainer, device)
.. parsed-literal::
:class: output
train loss 0.145, train acc 0.944, test acc 0.877
.. figure:: output_lr-scheduler_1dfeb6_15_1.svg
.. code:: python
lr, num_epochs = 0.3, 30
net.initialize(force_reinit=True, ctx=device, init=init.Xavier())
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': lr})
train(net, train_iter, test_iter, num_epochs, loss, trainer, device)
.. parsed-literal::
:class: output
train loss 0.160, train acc 0.939, test acc 0.884
.. figure:: output_lr-scheduler_1dfeb6_18_1.svg
.. code:: python
lr, num_epochs = 0.3, 30
train(net, train_iter, test_iter, num_epochs, lr)
.. parsed-literal::
:class: output
loss 0.228, train acc 0.916, test acc 0.890
51109.0 examples/sec on /GPU:0
.. figure:: output_lr-scheduler_1dfeb6_21_2.svg
Schedulers
----------
One way of adjusting the learning rate is to set it explicitly at each
step. This is conveniently achieved by the optimizer's learning rate
setter, e.g., the ``set_learning_rate`` method in Gluon or an assignment
to ``param_groups`` in PyTorch. We could adjust it downward after every
epoch (or even after every minibatch), e.g., in a dynamic manner in
response to how optimization is progressing.
.. code:: python
lr = 0.1
trainer.param_groups[0]["lr"] = lr
print(f'learning rate is now {trainer.param_groups[0]["lr"]:.2f}')
.. parsed-literal::
:class: output
learning rate is now 0.10
.. code:: python
trainer.set_learning_rate(0.1)
print(f'learning rate is now {trainer.learning_rate:.2f}')
.. parsed-literal::
:class: output
learning rate is now 0.10
.. code:: python
lr = 0.1
dummy_model = tf.keras.models.Sequential([tf.keras.layers.Dense(10)])
dummy_model.compile(tf.keras.optimizers.SGD(learning_rate=lr), loss='mse')
print(f'learning rate is now {dummy_model.optimizer.lr.numpy():.2f}')
.. parsed-literal::
:class: output
learning rate is now 0.10
More generally we want to define a scheduler. When invoked with the
number of updates, it returns the appropriate value of the learning
rate. Let’s define a simple one that sets the learning rate to
:math:`\eta = \eta_0 (t + 1)^{-\frac{1}{2}}`.
.. code:: python
class SquareRootScheduler:
def __init__(self, lr=0.1):
self.lr = lr
def __call__(self, num_update):
return self.lr * pow(num_update + 1.0, -0.5)
Let’s plot its behavior over a range of values.
.. code:: python
scheduler = SquareRootScheduler(lr=0.1)
d2l.plot(torch.arange(num_epochs), [scheduler(t) for t in range(num_epochs)])
.. figure:: output_lr-scheduler_1dfeb6_51_0.svg
Now let’s see how this plays out for training on Fashion-MNIST. We
simply provide the scheduler as an additional argument to the training
algorithm.
.. code:: python
net = net_fn()
trainer = torch.optim.SGD(net.parameters(), lr)
train(net, train_iter, test_iter, num_epochs, loss, trainer, device,
scheduler)
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
train loss 0.273, train acc 0.900, test acc 0.886
.. figure:: output_lr-scheduler_1dfeb6_63_1.svg
.. code:: python
trainer = gluon.Trainer(net.collect_params(), 'sgd',
{'lr_scheduler': scheduler})
train(net, train_iter, test_iter, num_epochs, loss, trainer, device)
.. parsed-literal::
:class: output
train loss 0.523, train acc 0.810, test acc 0.815
.. figure:: output_lr-scheduler_1dfeb6_66_1.svg
.. code:: python
train(net, train_iter, test_iter, num_epochs, lr,
custom_callback=LearningRateScheduler(scheduler))
.. parsed-literal::
:class: output
loss 0.388, train acc 0.858, test acc 0.847
51521.6 examples/sec on /GPU:0
.. figure:: output_lr-scheduler_1dfeb6_69_2.svg
This worked quite a bit better than previously. Two things stand out:
first, the curve was noticeably smoother than before; second, there was
less overfitting. Unfortunately it is not well understood in *theory*
why certain strategies lead to less overfitting. There is
some argument that a smaller stepsize will lead to parameters that are
closer to zero and thus simpler. However, this does not explain the
phenomenon entirely since we do not really stop early but simply reduce
the learning rate gently.
Policies
--------
While we cannot possibly cover the entire variety of learning rate
schedulers, we attempt to give a brief overview of popular policies
below. Common choices are polynomial decay and piecewise constant
schedules. Beyond that, cosine learning rate schedules have been found
to work well empirically on some problems. Lastly, on some problems it
is beneficial to warm up the optimizer prior to using large learning
rates.
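For concreteness, polynomial decay can be written as a small callable in
the same style as the ``SquareRootScheduler`` above. The sketch below is
modeled loosely on MXNet's ``lr_scheduler.PolyScheduler``; the parameter
names and the quadratic default are illustrative choices here, not a
fixed API.

```python
class PolyScheduler:
    """Polynomial decay from base_lr toward final_lr over max_update steps:
    eta_t = final_lr + (base_lr - final_lr) * (1 - t / max_update) ** power
    """
    def __init__(self, max_update, base_lr=0.1, final_lr=0.0, power=2):
        self.max_update = max_update
        self.base_lr = base_lr
        self.final_lr = final_lr
        self.power = power

    def __call__(self, num_update):
        # Pin the rate at final_lr once max_update has been reached
        t = min(num_update, self.max_update)
        decay = (1 - t / self.max_update) ** self.power
        return self.final_lr + (self.base_lr - self.final_lr) * decay
```

The rate decays smoothly to ``final_lr`` and then stays there; setting
``power=1`` recovers plain linear decay.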
Factor Scheduler
~~~~~~~~~~~~~~~~
One alternative to a polynomial decay would be a multiplicative one,
that is :math:`\eta_{t+1} \leftarrow \eta_t \cdot \alpha` for
:math:`\alpha \in (0, 1)`. To prevent the learning rate from decaying
beyond a reasonable lower bound the update equation is often modified to
:math:`\eta_{t+1} \leftarrow \mathop{\mathrm{max}}(\eta_{\mathrm{min}}, \eta_t \cdot \alpha)`.
.. code:: python
class FactorScheduler:
def __init__(self, factor=1, stop_factor_lr=1e-7, base_lr=0.1):
self.factor = factor
self.stop_factor_lr = stop_factor_lr
self.base_lr = base_lr
def __call__(self, num_update):
self.base_lr = max(self.stop_factor_lr, self.base_lr * self.factor)
return self.base_lr
scheduler = FactorScheduler(factor=0.9, stop_factor_lr=1e-2, base_lr=2.0)
d2l.plot(torch.arange(50), [scheduler(t) for t in range(50)])
.. figure:: output_lr-scheduler_1dfeb6_75_0.svg
This can also be accomplished by a built-in scheduler in MXNet via the
``lr_scheduler.FactorScheduler`` object. It takes a few more parameters,
such as the warmup period, the warmup mode (linear or constant), and the
maximum number of desired updates. Going forward we will use the
built-in schedulers as appropriate and only explain their functionality
here. As illustrated, it is fairly straightforward to build your own
scheduler if needed.
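PyTorch's counterpart of such a multiplicative decay is the built-in
``lr_scheduler.ExponentialLR`` (it lacks the lower bound of the
``FactorScheduler`` above). A minimal sketch; the tiny ``nn.Linear``
model exists only so that an optimizer can be constructed:

```python
import torch
from torch import nn
from torch.optim import lr_scheduler

model = nn.Linear(2, 1)  # throwaway model, only used to build the optimizer
trainer = torch.optim.SGD(model.parameters(), lr=2.0)
# Multiplies the learning rate by gamma after every scheduler step,
# i.e., eta_{t+1} = eta_t * gamma
scheduler = lr_scheduler.ExponentialLR(trainer, gamma=0.9)

lrs = []
for _ in range(5):
    lrs.append(trainer.param_groups[0]['lr'])
    trainer.step()      # no gradients populated, so parameters stay put
    scheduler.step()
# lrs now decays geometrically: 2.0, 1.8, 1.62, ...
```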
Multi Factor Scheduler
~~~~~~~~~~~~~~~~~~~~~~
A common strategy for training deep networks is to keep the learning
rate piecewise constant and to decrease it by a given amount every so
often. That is, given a set of times when to decrease the rate, such as
:math:`s = \{5, 10, 20\}`, we decrease
:math:`\eta_{t+1} \leftarrow \eta_t \cdot \alpha` whenever
:math:`t \in s`. Assuming that the values are halved at each step we can
implement this as follows.
.. code:: python
net = net_fn()
trainer = torch.optim.SGD(net.parameters(), lr=0.5)
scheduler = lr_scheduler.MultiStepLR(trainer, milestones=[15, 30], gamma=0.5)
def get_lr(trainer, scheduler):
lr = scheduler.get_last_lr()[0]
trainer.step()
scheduler.step()
return lr
d2l.plot(torch.arange(num_epochs), [get_lr(trainer, scheduler)
for t in range(num_epochs)])
.. figure:: output_lr-scheduler_1dfeb6_87_0.svg
.. code:: python
scheduler = lr_scheduler.MultiFactorScheduler(step=[15, 30], factor=0.5,
base_lr=0.5)
d2l.plot(np.arange(num_epochs), [scheduler(t) for t in range(num_epochs)])
.. figure:: output_lr-scheduler_1dfeb6_90_0.svg
.. code:: python
class MultiFactorScheduler:
    def __init__(self, step, factor, base_lr):
        self.step = step
        self.factor = factor
        self.base_lr = base_lr

    def __call__(self, epoch):
        if epoch in self.step:
            self.base_lr = self.base_lr * self.factor
        return self.base_lr
scheduler = MultiFactorScheduler(step=[15, 30], factor=0.5, base_lr=0.5)
d2l.plot(tf.range(num_epochs), [scheduler(t) for t in range(num_epochs)])
.. figure:: output_lr-scheduler_1dfeb6_93_0.svg
The intuition behind this piecewise constant learning rate schedule is
that one lets optimization proceed until a stationary point has been
reached in terms of the distribution of weight vectors. Then (and only
then) do we decrease the rate so as to obtain a higher quality proxy
to a good local minimum. The example below shows how this can produce
ever so slightly better solutions.
.. code:: python
train(net, train_iter, test_iter, num_epochs, loss, trainer, device,
scheduler)
.. parsed-literal::
:class: output
train loss 0.194, train acc 0.927, test acc 0.869
.. figure:: output_lr-scheduler_1dfeb6_99_1.svg
.. code:: python
trainer = gluon.Trainer(net.collect_params(), 'sgd',
{'lr_scheduler': scheduler})
train(net, train_iter, test_iter, num_epochs, loss, trainer, device)
.. parsed-literal::
:class: output
train loss 0.194, train acc 0.927, test acc 0.887
.. figure:: output_lr-scheduler_1dfeb6_102_1.svg
.. code:: python
train(net, train_iter, test_iter, num_epochs, lr,
custom_callback=LearningRateScheduler(scheduler))
.. parsed-literal::
:class: output
loss 0.234, train acc 0.912, test acc 0.891
51585.5 examples/sec on /GPU:0
.. figure:: output_lr-scheduler_1dfeb6_105_2.svg
Cosine Scheduler
~~~~~~~~~~~~~~~~
A rather perplexing heuristic was proposed by
:cite:t:`Loshchilov.Hutter.2016`. It relies on the observation that we
might not want to decrease the learning rate too drastically in the
beginning and moreover, that we might want to “refine” the solution in
the end using a very small learning rate. This results in a cosine-like
schedule with the following functional form for learning rates in the
range :math:`t \in [0, T]`.
.. math:: \eta_t = \eta_T + \frac{\eta_0 - \eta_T}{2} \left(1 + \cos(\pi t/T)\right)
Here :math:`\eta_0` is the initial learning rate and :math:`\eta_T` is
the target rate at time :math:`T`. Furthermore, for :math:`t > T` we
simply pin the value to :math:`\eta_T` without increasing it again. In
the following example, we set the maximum update step :math:`T = 20`.
.. code:: python
class CosineScheduler:
def __init__(self, max_update, base_lr=0.01, final_lr=0,
warmup_steps=0, warmup_begin_lr=0):
self.base_lr_orig = base_lr
self.max_update = max_update
self.final_lr = final_lr
self.warmup_steps = warmup_steps
self.warmup_begin_lr = warmup_begin_lr
self.max_steps = self.max_update - self.warmup_steps
def get_warmup_lr(self, epoch):
increase = (self.base_lr_orig - self.warmup_begin_lr) \
* float(epoch) / float(self.warmup_steps)
return self.warmup_begin_lr + increase
def __call__(self, epoch):
if epoch < self.warmup_steps:
return self.get_warmup_lr(epoch)
if epoch <= self.max_update:
self.base_lr = self.final_lr + (
self.base_lr_orig - self.final_lr) * (1 + math.cos(
math.pi * (epoch - self.warmup_steps) / self.max_steps)) / 2
return self.base_lr
scheduler = CosineScheduler(max_update=20, base_lr=0.3, final_lr=0.01)
d2l.plot(torch.arange(num_epochs), [scheduler(t) for t in range(num_epochs)])
.. figure:: output_lr-scheduler_1dfeb6_111_0.svg
.. code:: python
scheduler = lr_scheduler.CosineScheduler(max_update=20, base_lr=0.3,
final_lr=0.01)
d2l.plot(np.arange(num_epochs), [scheduler(t) for t in range(num_epochs)])
.. figure:: output_lr-scheduler_1dfeb6_114_0.svg
In the context of computer vision this schedule *can* lead to improved
results. Note, though, that such improvements are not guaranteed (as can
be seen below).
.. code:: python
net = net_fn()
trainer = torch.optim.SGD(net.parameters(), lr=0.3)
train(net, train_iter, test_iter, num_epochs, loss, trainer, device,
scheduler)
.. parsed-literal::
:class: output
train loss 0.159, train acc 0.942, test acc 0.904
.. figure:: output_lr-scheduler_1dfeb6_123_1.svg
.. code:: python
trainer = gluon.Trainer(net.collect_params(), 'sgd',
{'lr_scheduler': scheduler})
train(net, train_iter, test_iter, num_epochs, loss, trainer, device)
.. parsed-literal::
:class: output
train loss 0.343, train acc 0.878, test acc 0.859
.. figure:: output_lr-scheduler_1dfeb6_126_1.svg
.. code:: python
train(net, train_iter, test_iter, num_epochs, lr,
custom_callback=LearningRateScheduler(scheduler))
.. parsed-literal::
:class: output
loss 0.264, train acc 0.904, test acc 0.880
51258.5 examples/sec on /GPU:0
.. figure:: output_lr-scheduler_1dfeb6_129_2.svg
Warmup
~~~~~~
In some cases initializing the parameters is not sufficient to guarantee
a good solution. This is particularly a problem for some advanced
network designs that may lead to unstable optimization problems. We
could address this by choosing a sufficiently small learning rate to
prevent divergence in the beginning. Unfortunately this means that
progress is slow. Conversely, a large learning rate initially leads to
divergence.
A rather simple fix for this dilemma is to use a warmup period during
which the learning rate *increases* to its initial maximum and to cool
down the rate until the end of the optimization process. For simplicity
one typically uses a linear increase for this purpose. This leads to a
schedule of the form indicated below.
.. code:: python
scheduler = CosineScheduler(20, warmup_steps=5, base_lr=0.3, final_lr=0.01)
d2l.plot(torch.arange(num_epochs), [scheduler(t) for t in range(num_epochs)])
.. figure:: output_lr-scheduler_1dfeb6_135_0.svg
.. code:: python
scheduler = lr_scheduler.CosineScheduler(20, warmup_steps=5, base_lr=0.3,
final_lr=0.01)
d2l.plot(np.arange(num_epochs), [scheduler(t) for t in range(num_epochs)])
.. figure:: output_lr-scheduler_1dfeb6_138_0.svg
Note that the network converges better initially (in particular observe
the performance during the first 5 epochs).
.. code:: python
net = net_fn()
trainer = torch.optim.SGD(net.parameters(), lr=0.3)
train(net, train_iter, test_iter, num_epochs, loss, trainer, device,
scheduler)
.. parsed-literal::
:class: output
train loss 0.181, train acc 0.934, test acc 0.901
.. figure:: output_lr-scheduler_1dfeb6_147_1.svg
.. code:: python
trainer = gluon.Trainer(net.collect_params(), 'sgd',
{'lr_scheduler': scheduler})
train(net, train_iter, test_iter, num_epochs, loss, trainer, device)
.. parsed-literal::
:class: output
train loss 0.348, train acc 0.874, test acc 0.871
.. figure:: output_lr-scheduler_1dfeb6_150_1.svg
.. code:: python
train(net, train_iter, test_iter, num_epochs, lr,
custom_callback=LearningRateScheduler(scheduler))
.. parsed-literal::
:class: output
loss 0.274, train acc 0.899, test acc 0.880
50584.3 examples/sec on /GPU:0
.. figure:: output_lr-scheduler_1dfeb6_153_2.svg
Warmup can be applied to any scheduler (not just cosine). For a more
detailed discussion of learning rate schedules and many more experiments
see also :cite:`Gotmare.Keskar.Xiong.ea.2018`. In particular they find
that a warmup phase limits the amount of divergence of parameters in
very deep networks. This makes intuitive sense, since we would expect
significant divergence due to random initialization in those parts of
the network that take the most time to make progress in the beginning.
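To make that generality concrete, warmup can be bolted onto any callable
scheduler of the kind defined in this section. ``WarmupScheduler`` below
is a name introduced here purely for illustration (it is not a framework
API); the ``SquareRootScheduler`` from above is repeated so that the
snippet is self-contained.

```python
class SquareRootScheduler:
    def __init__(self, lr=0.1):
        self.lr = lr

    def __call__(self, num_update):
        return self.lr * pow(num_update + 1.0, -0.5)


class WarmupScheduler:
    """Linearly increase the rate to base_scheduler(0) over warmup_steps
    epochs, then delegate to the wrapped scheduler."""
    def __init__(self, base_scheduler, warmup_steps=5, warmup_begin_lr=0.0):
        self.base_scheduler = base_scheduler
        self.warmup_steps = warmup_steps
        self.warmup_begin_lr = warmup_begin_lr

    def __call__(self, epoch):
        if epoch < self.warmup_steps:
            # Linear ramp from warmup_begin_lr to the base scheduler's start
            target = self.base_scheduler(0)
            return self.warmup_begin_lr + (
                target - self.warmup_begin_lr) * epoch / self.warmup_steps
        return self.base_scheduler(epoch - self.warmup_steps)
```

Wrapping ``SquareRootScheduler(lr=0.1)`` with ``warmup_steps=5`` ramps
the rate from 0 to 0.1 over the first five epochs and then follows the
square-root decay.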
Summary
-------
- Decreasing the learning rate during training can lead to improved
accuracy and (most perplexingly) reduced overfitting of the model.
- A piecewise decrease of the learning rate whenever progress has
plateaued is effective in practice. Essentially this ensures that we
converge efficiently to a suitable solution and only then reduce the
inherent variance of the parameters by reducing the learning rate.
- Cosine schedulers are popular for some computer vision problems. See,
  e.g., GluonCV for details of such a scheduler.
- A warmup period before optimization can prevent divergence.
- Optimization serves multiple purposes in deep learning. Besides
minimizing the training objective, different choices of optimization
algorithms and learning rate scheduling can lead to rather different
amounts of generalization and overfitting on the test set (for the
same amount of training error).
Exercises
---------
1. Experiment with the optimization behavior for a given fixed learning
rate. What is the best model you can obtain this way?
2. How does convergence change if you change the exponent of the
decrease in the learning rate? Use ``PolyScheduler`` for your
convenience in the experiments.
3. Apply the cosine scheduler to large computer vision problems, e.g.,
training ImageNet. How does it affect performance relative to other
schedulers?
4. How long should warmup last?
5. Can you connect optimization and sampling? Start by using results
from :cite:t:`Welling.Teh.2011` on Stochastic Gradient Langevin
Dynamics.