Which Pruning Schedule Should I Use?
A Comparative Study of Pruning Schedules
In a previous post, we talked about pruning schedules. More specifically, we introduced the two most common ones, One-Shot Pruning and Iterative Pruning, as well as another interesting schedule: Automated Gradual Pruning [1].
Of all of those, the most commonly used pruning schedule is Iterative Pruning, because of its simplicity. The process can be summarized as follows (the pruning step itself is sketched in code right after the list):
- Train the network until convergence
- Prune the network
- Fine-tune it to recover lost performance
- Repeat from step 2 until the desired sparsity is reached.
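To make the pruning step concrete, here is a minimal, self-contained sketch of one such step using plain PyTorch magnitude pruning. This is only an illustration of the idea; the rest of this post relies on fasterai's callback instead.
import torch.nn as nn
import torch.nn.utils.prune as prune

# A toy model, pruned locally (layer by layer) by weight magnitude
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 10))

for module in model.modules():
    if isinstance(module, nn.Linear):
        # remove the 50% smallest-magnitude weights of this layer
        prune.l1_unstructured(module, name='weight', amount=0.5)

# In an iterative schedule, each such pruning step is followed by a
# fine-tuning phase before pruning again.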
It thus consists of several cycles of pruning/fine-tuning, usually with a smaller learning rate than for the initial training. But is it really better than the other schedules? And could we come up with an even better one?
Moreover, when using such a schedule, several questions remain open, for example:
- Should we really wait until the network has converged before pruning?
- When fine-tuning, should we wait until the network has recovered the lost performance before performing a new pruning step?
- How many pruning/fine-tuning cycles should we do?
- How should we choose the learning rate for fine-tuning?
In this post, we'll try to provide some answers to those questions. In particular, we will compare the pruning schedules above and introduce a new one.
We will use the Imagenette dataset for our experiments. All we then need is to import fastai [2] and the sparse module from fasterai [3]:
from fastai.vision.all import *
from fasterai.sparse.all import *
We will first compare the results of the common schedules introduced earlier. To keep the comparison fair, all the schedules are evaluated under the same fixed training budget and with the same hyperparameters.
The One-Shot Pruning schedule is pretty simple. It consists of 3 phases:
- Train the network
- Prune a portion of the weights
- Fine-Tune the remaining weights
Given that our training budget is fixed, we have to decide whether to put more budget into the initial training phase, i.e. step 1, or into the fine-tuning phase, i.e. step 3. We empirically find that training for $40\%$ of the budget and fine-tuning for the remaining time gives the best results, which suggests that the fine-tuning step is slightly more important than the training one. This can be explained by the fact that pruning too late in training makes it difficult for the network to recover the lost performance, since the learning rate is small towards the end of training.
If we plot the sparsity of our model over the course of training, One-Shot Pruning looks like this:
We train our model at $0\%$ sparsity for $40\%$ of our training budget, prune it, then fine-tune it for the remaining time. To do this, all that is needed is the fasterai SparsifyCallback:
learn = Learner(dls, resnet18(num_classes=dls.c), metrics=accuracy)
# Prune to 95% sparsity, at the weight granularity, locally in each layer, keeping the
# weights with the largest final magnitude, using the one-shot schedule from epoch 4.
sp_cb = SparsifyCallback(95, 'weight', 'local', large_final, one_shot, start_epoch=4)
learn.fit(10, 1e-3, cbs=sp_cb)
One-Shot Pruning is often seen as the simplest pruning schedule, so we will use the results above as a baseline.
As described above, the Iterative Pruning schedule can be broken down as:
- Train the network until convergence
- Prune the network
- Fine-tune it to recover lost performance
- Repeat from step 2 until the desired sparsity is reached.
Iterative Pruning is slightly different from One-Shot Pruning: the pruning doesn't happen in one step but in several cycles, alternating phases of pruning and fine-tuning. We found that, given a fixed budget, allocating $20\%$ of the training budget to initial training provides the best results. For simplicity, we use the same budget for each fine-tuning phase.
learn = Learner(dls, resnet18(num_classes=dls.c), metrics=accuracy)
sp_cb = SparsifyCallback(95, 'weight', 'local', large_final, iterative, start_epoch=2)
learn.fit_one_cycle(10, 1e-3, cbs=sp_cb)
As we can see, Iterative Pruning leads to worse results than plain One-Shot Pruning. How come? This is because we imposed a fixed training budget and, as several works have reported, Iterative Pruning requires a significantly longer fine-tuning process to produce a well-performing pruned network [4].
The main problem with the previous schedules is the discontinuity that happens at each pruning step: when pruning is performed, the sparsity of the network suddenly jumps, making it very difficult for the network to recover its previous performance. More recently, Automated Gradual Pruning was introduced, which gradually varies the amount of pruning over time, thus making the pruning process "smoother". However, it still requires setting a starting point, which we found works best at around $20\%$ of the training budget.
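For reference, the AGP sparsity follows a cubic law. An annealing function in the same style as the fastai ones could be sketched as below; this is an illustration of the cubic schedule from the AGP paper, not necessarily fasterai's exact sched_agp implementation.
# Sketch of a cubic, AGP-style annealing function: sparsity rises quickly
# at first, then flattens out towards the end of pruning.
def sched_agp_cubic(start, end, pos):
    # pos goes from 0 (start of pruning) to 1 (end of pruning)
    return end + (start - end) * (1 - pos) ** 3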
learn = Learner(dls, resnet18(num_classes=dls.c), metrics=accuracy)
sp_cb = SparsifyCallback(95, 'weight', 'local', large_final, sched_agp, start_epoch=2)
learn.fit_one_cycle(10, 1e-3, cbs=sp_cb)
With AGP, we are able to outperform One-Shot Pruning. The smoother pruning probably makes it easier for the network to accommodate the increase in sparsity.
What other "smooth" schedules can we think of? Fasterai lets you try the annealing schedules available by default in fastai, so let's give them a shot!
Those default schedules are (their formulas are sketched right after this list):
- Annealing Linear
- Annealing Exponential
- Annealing Cosine
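These annealing functions roughly compute the following; the sketches below are only for illustration, and the exact fastai implementations may differ slightly. Note that the exponential one requires a strictly positive starting value, which is why we pass start_sparsity=0.0001 when using it.
import math

# Rough sketches of the formulas behind fastai's annealing schedules
def lin_sketch(start, end, pos): return start + pos * (end - start)
def cos_sketch(start, end, pos): return start + (1 + math.cos(math.pi * (1 - pos))) * (end - start) / 2
def exp_sketch(start, end, pos): return start * (end / start) ** pos  # needs start > 0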
learn = Learner(dls, resnet18(num_classes=dls.c), metrics=accuracy)
sp_cb = SparsifyCallback(95, 'weight', 'local', large_final, sched_lin)
learn.fit_one_cycle(10, 1e-3, cbs=sp_cb)
The linear schedule looks OK until the very last iterations. As we have seen with Iterative Pruning, the network needs a bit of fine-tuning after weights are pruned, which doesn't happen here: we keep pruning until the very end, so the sparsity of the network never settles.
learn = Learner(dls, resnet18(num_classes=dls.c), metrics=accuracy)
sp_cb = SparsifyCallback(95, 'weight', 'local', large_final, sched_exp, start_sparsity=0.0001)
learn.fit_one_cycle(10, 1e-3, cbs=sp_cb)
The exponential schedule gives even worse results. This was to be expected, as the increase in sparsity mostly happens at the end of training (from $24\%$ to $95\%$ in the last epoch), giving the network even less time to recover.
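We can sanity-check that figure with the exponential formula sketched earlier, assuming a starting sparsity of 0.0001 and a 10-epoch run with no start_epoch:
start, end = 0.0001, 95
for pos in (0.9, 1.0):  # beginning and end of the last epoch of a 10-epoch run
    print(f"{pos:.1f} -> {start * (end / start) ** pos:.1f}% sparsity")
# prints roughly 24% at 90% of training and 95% at the very end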
learn = Learner(dls, resnet18(num_classes=dls.c), metrics=accuracy)
sp_cb = SparsifyCallback(95, 'weight', 'local', large_final, sched_cos)
learn.fit_one_cycle(10, 1e-3, cbs=sp_cb)
The cosine schedule is a bit better, but we can still see a drop in performance at the end, again because the sparsity of the network never settles.
So what can we do from here?
From what we have seen, Automated Gradual Pruning is the technique that works best so far. AGP possesses a long "tail", allowing the network to be fine-tuned with almost no increase in sparsity towards the end of pruning, something the default fastai schedules definitely lack.
Can we modify the previous schedules to obtain a similar behaviour? What if we artificially add a tail to our cosine schedule?
In fasterai, this can be done by passing the end_epoch argument, which corresponds to the epoch at which we stop pruning. In this case, it means that we have 3 entire epochs during which the sparsity doesn't change, so the fine-tuning can be more effective.
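A sketch of what this could look like with our 10-epoch budget; the value end_epoch=7 is an assumption here, chosen so that the last 3 epochs are pure fine-tuning:
learn = Learner(dls, resnet18(num_classes=dls.c), metrics=accuracy)
# cosine schedule, but the sparsity stops increasing after epoch 7
sp_cb = SparsifyCallback(95, 'weight', 'local', large_final, sched_cos, end_epoch=7)
learn.fit_one_cycle(10, 1e-3, cbs=sp_cb)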
As we can see, this kind of schedule allows our network to reach a performance similar to AGP.
It has recently been shown that the most critical phase in the training of a neural network happens during the very first iterations [5], and that applying regularization after that initial transient phase has little effect on the final performance of the network [6].
As network pruning removes weights, reducing the capacity of the network, it can be seen as a kind of regularization. One should thus apply pruning early in training to take advantage of its regularization effect, but must do so very carefully so as not to irremediably damage the network during this brittle period.
Can we create a schedule that gets the best of both worlds, i.e. one that starts pruning slowly right from the start and keeps a long fine-tuning phase at the end? The cosine schedule with a tail seemed like a good start, but it lacks some flexibility.
We thus introduce the One-Cycle Pruning schedule which, as the name suggests, possesses only a single cycle of pruning, spanning the whole training. The sparsity along the training is given by:

$$s_t = s_i + (s_f - s_i) \cdot \frac{1 + e^{-\alpha + \beta}}{1 + e^{-\alpha t + \beta}}$$

with $s_t$ the level of sparsity at training step $t$ (expressed as a fraction of the total training), and $s_i$ and $s_f$ respectively the initial and final levels of sparsity.
This schedule can be customized by varying the slope of the pruning (the $\alpha$ parameter) or its offset (the $\beta$ parameter), but we have found that good default values are respectively $14$ and $5$.
To use it with fasterai, we only need to create the corresponding function:
import numpy as np

def sched_onecycle(start, end, pos, α=14, β=5):
    out = (1 + np.exp(-α + β)) / (1 + np.exp(-α * pos + β))
    return start + (end - start) * out
Then use it in the Callback:
learn = Learner(dls, resnet18(num_classes=dls.c), metrics=accuracy)
sp_cb = SparsifyCallback(95, 'weight', 'local', large_final, sched_onecycle)
learn.fit_one_cycle(10, 1e-3, cbs=sp_cb)
As we can see, such a schedule allows our network to reach a higher performance given our training budget.
In this blog post, we experimented with several pruning schedules and showed that, under a strict and fixed training budget, One-Cycle Pruning performs best. If the training budget doesn't matter, then Iterative Pruning might be a good default option.
Feel free to experiment as well, and maybe come up with your own pruning schedule that perfectly fits your task!
If you notice any mistake or possible improvement, please contact me! If you found this post useful, please consider citing it as:
@article{hubens2021schedule,
title = "Which Pruning Schedule Should I Use?",
author = "Hubens, Nathan",
journal = "nathanhubens.github.io",
year = "2021",
url = "https://nathanhubens.github.io/posts/deep%20learning/2021/06/15/OneCycle.html"
}