Prune Transformers

This example code is taken from the fastai docs.

```python
# Setup: fastai, fasterai and the HuggingFace transformers library
from fastai.text.all import *
from fasterai.sparse.all import *
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

pretrained_weights = 'gpt2'
tokenizer = GPT2TokenizerFast.from_pretrained(pretrained_weights)
model = GPT2LMHeadModel.from_pretrained(pretrained_weights)
```

```python
path = untar_data(URLs.WIKITEXT_TINY)
```
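The data-preparation cells are hidden in the rendered notebook. The snippet below is a sketch adapted from the fastai transformers tutorial that the text refers to; the `TransformersTokenizer` helper and the `bs`/`seq_len` values are assumptions, but it produces the `dls` and the `DropOutput` callback used in the `Learner` below.

```python
# Sketch of the hidden data-prep cells (adapted from the fastai
# transformers tutorial; batch size and sequence length are assumptions)
df_train = pd.read_csv(path/'train.csv', header=None)
df_valid = pd.read_csv(path/'test.csv', header=None)
all_texts = np.concatenate([df_train[0].values, df_valid[0].values])

class TransformersTokenizer(Transform):
    def __init__(self, tokenizer): self.tokenizer = tokenizer
    def encodes(self, x):
        toks = self.tokenizer.tokenize(x)
        return tensor(self.tokenizer.convert_tokens_to_ids(toks))
    def decodes(self, x): return TitledStr(self.tokenizer.decode(x.cpu().numpy()))

splits = [range_of(df_train), list(range(len(df_train), len(all_texts)))]
tls = TfmdLists(all_texts, TransformersTokenizer(tokenizer), splits=splits, dl_type=LMDataLoader)
dls = tls.dataloaders(bs=4, seq_len=256)

# GPT-2 returns a tuple (logits, past); keep only the logits for the loss
class DropOutput(Callback):
    def after_pred(self): self.learn.pred = self.pred[0]
```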
Let’s create our fastai Learner.

```python
learn = Learner(dls, model, loss_func=CrossEntropyLossFlat(), cbs=[DropOutput], metrics=Perplexity())
```
And let’s try to extend a given prompt with the pretrained model.
= "\n = Unicorn = \n \n A unicorn is a magical creature with a rainbow tail and a horn" prompt
= learn.model.generate(inp, max_length=40, num_beams=5, temperature=1.5) preds
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
```python
tokenizer.decode(preds[0].cpu().numpy())
```
'\n = Unicorn = \n \n A unicorn is a magical creature with a rainbow tail and a horn on its head.\n\nA unicorn is a magical creature with a rainbow tail and a horn'
```python
learn.validate()
```
(#2) [3.695716619491577,40.2744255065918]
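The two numbers are the validation loss and the perplexity metric; perplexity is simply the exponential of the cross-entropy loss, which is why they are consistent:

```python
import math
math.exp(3.695716619491577)  # ≈ 40.2744, the perplexity reported above
```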
```python
learn.fit_one_cycle(1, 1e-4)
```
| epoch | train_loss | valid_loss | perplexity | time |
|---|---|---|---|---|
| 0 | 3.124115 | 2.844266 | 17.188944 | 07:50 |
```python
prompt_ids = tokenizer.encode(prompt)
inp = tensor(prompt_ids)[None]

preds = learn.model.generate(inp.cuda(), max_length=40, num_beams=5, temperature=1.5)
tokenizer.decode(preds[0].cpu().numpy())
```
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument index in method wrapper__index_select)
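The generation fails here because the input ids and part of the model ended up on different devices. A generic way around this kind of mismatch (a sketch, not part of the original notebook) is to put the model and the input on one explicit device before generating:

```python
# Hypothetical fix: make sure model and input share a device
learn.model.cuda()
preds = learn.model.generate(inp.cuda(), max_length=40, num_beams=5, temperature=1.5)
tokenizer.decode(preds[0].cpu().numpy())
```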
Make it sparse!
Let’s now see what happens if we retrain our model, this time introducing sparsity.

```python
learn = Learner(dls, model, loss_func=CrossEntropyLossFlat(), cbs=[DropOutput], metrics=Perplexity())
```
Unfortunately, the transformer model uses a custom layer, `Conv1D`, which is not part of PyTorch. To overcome this problem, we have to add this layer to our `Granularities` class, so that it knows what to sparsify. Here, `Conv1D` behaves like a `Linear` layer, i.e. its weights form a matrix defined by the dimensions `(nf, nx)`, stored transposed relative to `nn.Linear`.
```python
doc(Conv1D)
```

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

Conv1D

Conv1D(nf, nx)

1D-convolutional layer as defined by Radford et al. for OpenAI GPT (and also used in GPT-2). Basically works like a linear layer but the weights are transposed.

Args:
- nf (`int`): The number of output features.
- nx (`int`): The number of input features.
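As a quick sanity check (a sketch, not from the original notebook; it assumes `Conv1D` is importable from `transformers.pytorch_utils`, as in recent versions of the library), we can verify that `Conv1D` computes the same thing as a `Linear` layer with transposed weights:

```python
import torch
from transformers.pytorch_utils import Conv1D  # transformers.modeling_utils in older versions

conv = Conv1D(nf=8, nx=4)              # 4 input features -> 8 output features
lin = torch.nn.Linear(4, 8)
lin.weight.data = conv.weight.data.T   # Conv1D stores its weight as (nx, nf)
lin.bias.data = conv.bias.data

x = torch.randn(2, 4)
assert torch.allclose(conv(x), lin(x), atol=1e-6)
```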
We can thus add the `Conv1D` granularity by using the `add_granularity` method, indicating the target module and the corresponding granularities it can handle (the same as `Linear`, so we can reuse them).

```python
Granularities.add_granularity(Conv1D, Granularities._granularities_Linear)
```
Let’s now define our `SparsifyCallback`. Let’s say we want to make our model 30% sparse by removing the lowest-magnitude weights in each `Conv1D` layer (the `large_final` criterion keeps the weights with the largest final value).

```python
sp_cb = SparsifyCallback(sparsity=30, granularity='weight', context='local', criteria=large_final, schedule=one_cycle, layer_type=Conv1D)
```
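As a toy illustration of that criterion (a sketch; it assumes `large_final` ranks weights by final magnitude and zeroes out the smallest ones):

```python
import torch

w = torch.tensor([0.05, -0.8, 0.3, -0.02, 0.6])
k = int(0.4 * w.numel())                # prune 40% of the weights here
threshold = w.abs().kthvalue(k).values  # k-th smallest magnitude
mask = (w.abs() > threshold).float()
print(w * mask)  # tensor([ 0.0000, -0.8000,  0.3000, -0.0000,  0.6000])
```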
We now only have to pass our callback to fastai.
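For example (a sketch: the epoch count and learning rate are assumptions, matching the earlier dense run):

```python
learn.fit_one_cycle(1, 1e-4, cbs=sp_cb)
```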
And we can check the prediction for the same prompt as before.
```python
prompt_ids = tokenizer.encode(prompt)
inp = tensor(prompt_ids)[None]

preds = learn.model.generate(inp.cuda(), max_length=40, num_beams=5, temperature=1.5)
tokenizer.decode(preds[0].cpu().numpy())
```
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
'\n = Unicorn = \n \n A unicorn is a magical creature with a rainbow tail and a horn @-@ shaped head. The unicorn is a member of the <unk> <unk>'
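To double-check the result (a sketch, not from the original notebook), we can count the zeroed weights across the `Conv1D` layers:

```python
# Count zeros in the pruned Conv1D weight matrices
zeros, total = 0, 0
for m in learn.model.modules():
    if isinstance(m, Conv1D):
        zeros += (m.weight == 0).sum().item()
        total += m.weight.numel()
print(f'Conv1D sparsity: {100 * zeros / total:.1f}%')  # should be close to 30%
```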
That’s it! You now have a sparse Transformer that performs on par with the dense model. However, this model is currently not any more efficient speed- and storage-wise. To get such a speed-up, I suggest you take a look at the granularity section.