The curse of racing for high performance

While deep learning models keep getting better in terms of performance, they also tend to get bigger and more expensive to compute. Until recently, it could seem that state-of-the-art results were achieved with the good ol' “stack more layers!” approach. Indeed, if you take a look at the history of state-of-the-art models on ImageNet, you will notice that each year, the new best results were achieved with a deeper network.

It seems that we are obsessed with getting the best possible results, leading to models that can involve hundreds of millions of parameters! But what's the point of having a top-performing network if we cannot use it?

There has to be a better way to solve our tasks...

After all, as the philosopher William of Occam argued in the principle named after him, Occam's razor:

“Pluralitas non est ponenda sine necessitate.”

Or in other words:

“Simpler solutions should be favored over more complex ones.”

One way to do so is to use a more parameter-efficient architecture. There is very active research in that field, and we can name some notable architectures that are competitive with their parameter-heavy equivalents (SqueezeNet1, MobileNet2, Xception3, EfficientNet4, ...).

Do we really need a sledgehammer to crack a nut?

What if I told you that you can still use your favorite architecture, but in a more efficient way? After all, neural networks are meant to run either in resource-constrained environments (mobile phones, autonomous cars, drones, ...) or on servers, and in both cases, any gain in computation or memory is beneficial.

Recently, there has been a very interesting paper by Frankle and Carbin5, where they introduce the lottery ticket hypothesis. This hypothesis states that, in a network, only a subset of the parameters is needed to achieve optimal performance. The whole difficulty is then to find that particular subnetwork; the authors used a pruning technique consisting of pruning the trained model, then “rewinding” the remaining weights to their original initialization. Only that subnetwork, with that specific set of initial weights, is able to reach the same level of accuracy as the entire network.
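
The prune-and-rewind loop can be sketched in plain NumPy. This is an illustration only: the function name `find_winning_ticket` and the `train_fn` stand-in are made up here, and real lottery-ticket experiments train a full network for many epochs between pruning rounds.

```python
import numpy as np

def find_winning_ticket(init_weights, train_fn, prune_frac=0.2, rounds=3):
    """Sketch of iterative magnitude pruning with weight rewinding.

    init_weights: flat array of initial weights
    train_fn:     stand-in for a full training run; returns trained weights
    """
    mask = np.ones_like(init_weights)
    for _ in range(rounds):
        # train the current subnetwork (pruned weights stay at zero)
        trained = train_fn(init_weights * mask) * mask
        # prune the smallest-magnitude surviving weights
        alive = np.abs(trained[mask == 1])
        threshold = np.quantile(alive, prune_frac)
        mask[np.abs(trained) < threshold] = 0.0
        # "rewind": the survivors go back to their original initialization,
        # which is what init_weights * mask gives on the next round
    return mask, init_weights * mask

# toy usage: "training" just scales weights, so pruning keeps the largest inits
rng = np.random.default_rng(0)
w0 = rng.normal(size=100)
mask, ticket = find_winning_ticket(w0, train_fn=lambda w: 2 * w)
```
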

This discovery has huge implications, as it would mean that the only advantage of parameter-heavy networks is to provide more initialization configurations and thus more chances to contain those “winning tickets”.


The inspiration behind neural network pruning comes from how our own brain evolves during our life. Between birth and adulthood, the number of synapses (the structures that allow a neuron to transmit an electrical or chemical signal to another neuron) varies greatly. Our brain experiences a large amount of growth during infancy, then basically follows a “use it or lose it” process. The result is synaptic pruning, which removes any synapse that is not needed, in order to reach an optimal amount depending on our needs.


In the case of neural networks, the principle of pruning is to remove connections that are considered unimportant, while keeping the network's performance unchanged. Pruning is actually quite an old idea (like most ideas in deep learning), but it is an active field of research nowadays.

It dates back to the 1990s, with the most popular works at that time being Optimal Brain Damage6 and Optimal Brain Surgeon7. Pruning was then popularized by Han et al.8 with their 2015 paper.


Pruning thus consists of inducing sparsity in the weights of the network.

“Every block of stone has a statue inside it and it is the task of the sculptor to discover it.” - Michelangelo

Like Michelangelo with his blocks of stone, we will carve our neural networks to bring out their beauty, making them as sparse as possible while preserving their original performance.


Neural network pruning can come in many forms, represented in the image below:


You can either be very precise and remove each weight independently, or remove bigger chunks at a time. The more fine-grained (or unstructured) the pruning, the more precise the process, but the harder it is to get any actual acceleration. On the other hand, removing bigger chunks at a time (structured pruning) is less precise, but makes life much easier for sparse matrix computation libraries. The granularity of pruning is thus a trade-off between precision and acceleration.
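
The difference between the two granularities can be sketched on a toy convolution weight tensor. This is a minimal NumPy illustration, not a library API: unstructured pruning zeroes individual weights, while structured pruning drops whole filters, which is what lets you build a smaller dense tensor.

```python
import numpy as np

# Toy conv weight tensor: 4 filters, 3 input channels, 3x3 kernels
rng = np.random.default_rng(42)
w = rng.normal(size=(4, 3, 3, 3))

# Unstructured (fine-grained): zero out the 50% smallest individual weights
thresh = np.median(np.abs(w))
unstructured_mask = (np.abs(w) >= thresh).astype(float)

# Structured (coarse): remove the 2 filters with the smallest L1 norm
filter_norms = np.abs(w).reshape(4, -1).sum(axis=1)
keep = np.sort(np.argsort(filter_norms)[2:])    # indices of surviving filters
structured_mask = np.zeros_like(w)
structured_mask[keep] = 1.0

# Both reach ~50% sparsity, but only the structured version yields a
# dense, smaller tensor that standard dense libraries can accelerate
pruned_dense = w[keep]                          # shape (2, 3, 3, 3)
```
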

I guess it's a matter of preference: are you more of a cubist or a High Renaissance deep learning artist?

Apart from the granularity of pruning, you also have to choose when to remove the weights.


The timing and scheduling that you will adopt to prune your network will highly impact its final performance.

The three most commonly used schedules are:

  • One-shot Pruning
  • Iterative Pruning
  • Automated Gradual Pruning

The most basic idea is to start from a trained model, prune it to the desired sparsity, then optionally fine-tune the network to accommodate the removal of some of its parameters. This technique is known as One-shot Pruning. However, a simple change to that technique provides far better results: perform the pruning over several steps, each followed by some fine-tuning. That technique is called Iterative Pruning and, while leading to better results, can sometimes be very time-consuming and computationally intensive, especially if the number of parameters removed at each iteration is low. There has also been some research9 into incorporating pruning directly inside the training loop, periodically removing weights. This technique is called Automated Gradual Pruning.
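
The first two schedules can be contrasted in a short NumPy sketch. The `finetune` function is a stand-in (real fine-tuning would update the surviving weights by gradient descent); the point is the shape of the loop, not the training itself.

```python
import numpy as np

def prune_to(weights, mask, sparsity):
    """Zero out the smallest-magnitude weights until `sparsity` is reached."""
    k = int(sparsity * weights.size)
    new_mask = mask.ravel().copy()
    new_mask[np.argsort(np.abs(weights).ravel())[:k]] = 0.0
    return new_mask.reshape(weights.shape)

def finetune(weights, mask):
    """Stand-in for a few epochs of training on the surviving weights."""
    return weights * mask

rng = np.random.default_rng(0)
w = rng.normal(size=1000)

# One-shot: a single cut straight to 80% sparsity, then one fine-tuning phase
one_shot_mask = prune_to(w, np.ones_like(w), 0.8)

# Iterative: several smaller cuts, fine-tuning in between each
w_it, it_mask = w.copy(), np.ones_like(w)
for sparsity in (0.2, 0.4, 0.6, 0.8):
    it_mask = prune_to(w_it, it_mask, sparsity)
    w_it = finetune(w_it, it_mask)
```

Both loops end at the same sparsity; in practice the iterative version wins because each fine-tuning phase lets the network recover before the next cut.
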


In the case of Automated Gradual Pruning, the schedule proposed by the authors is the following:

$$s_t = s_f + (s_i - s_f)\left(1 - \frac{t - t_0}{n\,\Delta t}\right)^3 \quad \text{for } t \in \{t_0,\, t_0 + \Delta t,\, \dots,\, t_0 + n\Delta t\}$$

where $s_i$ is the initial sparsity (usually $0$), $s_f$ the final sparsity, $t_0$ the training step at which pruning starts, and the sparsity is increased every $\Delta t$ steps over $n$ pruning steps.

This schedule prunes heavily early in training, with the pruning rate slowly decreasing as training progresses.
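
The cubic schedule from the Automated Gradual Pruning paper9 can be computed in a few lines (parameter names $s_i$, $s_f$, $t_0$, $n$, $\Delta t$ follow the paper; the function name here is mine):

```python
import numpy as np

def agp_sparsity(t, s_i=0.0, s_f=0.8, t0=0, n=100, dt=1):
    """Cubic sparsity schedule: fast pruning early, slowing down over time."""
    t = np.clip(t, t0, t0 + n * dt)
    return s_f + (s_i - s_f) * (1 - (t - t0) / (n * dt)) ** 3

steps = np.arange(0, 101)
schedule = agp_sparsity(steps)
# schedule starts at s_i = 0, rises steeply, and flattens out at s_f = 0.8
```
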


Now that we know how and when to remove our parameters, we have to decide which ones to remove.

There exist many ways to evaluate the importance of weights, but the two most common are:

  • Weight Magnitude Pruning
  • Gradient Magnitude Pruning

While being extremely simple, weight magnitude pruning has been found to be very effective. It simply consists of computing the $L_1$-norm, i.e. $\sum_{i}\left|x_{i}\right|$, of the weights (or group/kernel/filter, depending on the granularity), and removing the ones with the lowest values. In the case of gradient magnitude pruning, the only change is that we multiply the weights by their corresponding gradients before computing the $L_1$-norm of the result.
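
The two criteria differ only in the score they assign to each weight, as this small NumPy sketch shows (at weight-level granularity, the $L_1$-norm of a single element is just its absolute value):

```python
import numpy as np

rng = np.random.default_rng(1)
w = rng.normal(size=(8, 8))        # weights
g = rng.normal(size=(8, 8))        # gradients w.r.t. those weights
k = int(0.5 * w.size)              # prune 50% of the weights

# Weight magnitude: score each weight by |w|
weight_scores = np.abs(w)
# Gradient magnitude: score each weight by |w * g|
grad_scores = np.abs(w * g)

def mask_from_scores(scores, k):
    """Zero out the k lowest-scoring entries."""
    mask = np.ones(scores.size)
    mask[np.argsort(scores.ravel())[:k]] = 0.0
    return mask.reshape(scores.shape)

wm_mask = mask_from_scores(weight_scores, k)
gm_mask = mask_from_scores(grad_scores, k)
# the two criteria generally select different weights to remove
```
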

Those criteria can be evaluated locally, i.e. each layer is pruned until the desired sparsity is reached, resulting in equally sparse layers, or globally, i.e. the weights are compared across the whole network, resulting in the same overall sparsity but with layers of different individual sparsity levels.
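
The local/global distinction is easy to see on two toy layers with different weight scales. This sketch (layer names invented) uses a per-layer threshold for local pruning and a single network-wide threshold for global pruning:

```python
import numpy as np

rng = np.random.default_rng(2)
# conv2's weights are ~10x smaller in magnitude than conv1's
layers = {"conv1": rng.normal(0, 1.0, 100), "conv2": rng.normal(0, 0.1, 100)}
sparsity = 0.5

# Local: each layer pruned to 50% using its own threshold
local_masks = {}
for name, w in layers.items():
    thr = np.quantile(np.abs(w), sparsity)
    local_masks[name] = (np.abs(w) >= thr).astype(float)

# Global: a single threshold computed over all weights in the network
all_w = np.concatenate([np.abs(w) for w in layers.values()])
global_thr = np.quantile(all_w, sparsity)
global_masks = {n: (np.abs(w) >= global_thr).astype(float)
                for n, w in layers.items()}
# globally, the small-magnitude conv2 gets pruned much harder than conv1
```
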


In order to report how well a pruning technique is doing, you need metrics to evaluate it.

To avoid any ambiguity in the metrics used, Davis Blalock and Jose Javier Gonzalez Ortiz10 have proposed a library to unify the way we report metrics. More specifically, they propose to report:

  • Compression Ratio, which should be computed as: Compression Ratio = total_params/nonzero_params, with total_params being the original number of parameters in the network and nonzero_params the number of non-zero weights after pruning.

  • Theoretical Speedup, which should be computed as: Speedup = total_flops/nonzero_flops, with total_flops being the amount of FLOPs in the original model and nonzero_flops the amount of FLOPs of the remaining non-zero weights.
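
Both metrics are simple ratios; the helper names below are mine, but the formulas are exactly the two definitions above:

```python
def compression_ratio(total_params, nonzero_params):
    """Compression Ratio = total_params / nonzero_params."""
    return total_params / nonzero_params

def theoretical_speedup(total_flops, nonzero_flops):
    """Speedup = total_flops / nonzero_flops."""
    return total_flops / nonzero_flops

# e.g. keeping 100k of 1M weights gives a 10x compression ratio,
# but the speedup depends on *where* those weights were removed
ratio = compression_ratio(1_000_000, 100_000)    # 10.0
```
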

It is important to report both metrics, as the speedup greatly depends on where in the network the pruning is performed. Indeed, for the same compression ratio, two similar architectures can have widely different speedup values. This is because the FLOPs of a convolution highly depend on the size of its input, which varies along the network. Most parameters are usually contained toward the end of the network, while most of the computation is performed in the early layers, which is why early downsampling is so widely used.

The graph below shows how the number of parameters and FLOPs evolve across the VGG16 network:


What it shows is that the last 3 layers hold $48\%$ of the parameters while being responsible for only $9\%$ of the total FLOPs in the network. For that reason, to obtain the same speedup, you will need to remove many more parameters in late layers than in early layers.
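
A back-of-the-envelope calculation makes the imbalance concrete. The sketch below counts only the multiply-accumulates of the convolution weights (biases and activations ignored), comparing an early and a late VGG16 convolution:

```python
def conv_stats(c_in, c_out, k, h_out, w_out):
    """Parameters and multiply-accumulate count of a k x k convolution."""
    params = c_out * c_in * k * k
    macs = params * h_out * w_out    # each weight is used once per output position
    return params, macs

# Early VGG16 conv (64 -> 64 channels on a 224x224 map): few params, many MACs
p_early, f_early = conv_stats(64, 64, 3, 224, 224)    # ~37k params, ~1.8G MACs
# Late VGG16 conv (512 -> 512 channels on a 14x14 map): many params, fewer MACs
p_late, f_late = conv_stats(512, 512, 3, 14, 14)      # ~2.4M params, ~0.46G MACs
```

The late layer has roughly 64x more parameters but about 4x fewer MACs, which is exactly why equal compression ratios can produce very different speedups.
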

Another remaining problem is that the reported speedup is a theoretical value, meaning you will rarely observe such a speedup in practice, mainly because common deep learning libraries do not accelerate sparse matrix computations, or require dedicated hardware to do so. The easiest way to guarantee an inference speedup is to physically remove the weights (which can only be done for entire filters11); you don't need to take care of sparse matrix computations if you don't have the matrix anymore! The way to perform that operation for a layer $i$ is the following:

(figure: physically removing a filter from layer $i$)

As you remove a whole filter, this doesn't really introduce sparsity in the network; you instead change a hyperparameter (the number of filters in the layer). Moreover, if you prune a single filter in layer $i$, the corresponding feature map won't exist anymore. Thus, in layer $i+1$, the kernels corresponding to the deleted feature map have to be removed too. So pruning a filter saves parameters and computations both in the current layer and in the following one!
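
The bookkeeping is just two array deletions, sketched here on toy NumPy tensors in the usual `(out_channels, in_channels, kH, kW)` layout:

```python
import numpy as np

rng = np.random.default_rng(3)
w_i  = rng.normal(size=(8, 4, 3, 3))    # layer i:   4 -> 8 channels
w_i1 = rng.normal(size=(16, 8, 3, 3))   # layer i+1: 8 -> 16 channels

# Physically remove the filter of layer i with the smallest L1 norm
norms = np.abs(w_i).reshape(w_i.shape[0], -1).sum(axis=1)
drop = int(np.argmin(norms))
w_i_pruned = np.delete(w_i, drop, axis=0)        # now (7, 4, 3, 3)

# Its feature map no longer exists, so the matching input kernels
# of layer i+1 must be removed as well
w_i1_pruned = np.delete(w_i1, drop, axis=1)      # now (16, 7, 3, 3)
```
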

That's all! Thank you for reading, I hope that you found this little tour over neural network pruning interesting and, more importantly, useful.

If you notice any mistake or possible improvement, please contact me! If you found this post useful, please consider citing it as:

@article{hubens2020pruning,
  title   = "Neural Network Pruning",
  author  = "Hubens, Nathan",
  journal = "",
  year    = "2020",
  url     = ""
}