Super Weights in LLMs: How Pruning Them Destroys an LLM's Ability to Generate Text
Super weights are crucial to the performance of LLMs and can have an outsized impact on a model's behaviour
Article in a Nutshell
The key takeaways from this article are:
Outlier parameters are a small number of parameters that are disproportionately important to the performance of an LLM. In a model with billions of parameters, outliers make up a minuscule share, say 0.01% of the total parameter count, but even this translates to hundreds of thousands of parameters.
The authors [1] identify “super weights” as a subset of outlier parameters. Pruning as few as a single super weight can ‘destroy an LLM’s ability to generate text – increasing perplexity by 3 orders of magnitude and reducing zero-shot accuracy to guessing’.
The authors [1] state that removing the other outliers, some of which are larger in magnitude than the super weight itself, affects the performance of the LLM by no more than a few percentage points. Interestingly, removing a single super weight causes an accuracy drop much greater than the effect of removing all the other outliers combined.
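The "0.01% is still a lot of parameters" point is simple arithmetic, sketched below. The model sizes (1 billion and 7 billion parameters) are illustrative assumptions, and outlier_count is a hypothetical helper, not something defined in the paper.

```python
def outlier_count(total_params: int, outlier_fraction: float) -> int:
    """Number of parameters that a given fraction of the model corresponds to."""
    return round(total_params * outlier_fraction)


# 0.01% expressed as a fraction is 0.0001.
FRACTION = 0.0001

# Model sizes are assumptions chosen only for illustration.
for total in (1_000_000_000, 7_000_000_000):
    print(f"{total:,} parameters -> {outlier_count(total, FRACTION):,} outliers")
```

Against counts of this size, a single super weight is a vanishingly small part of the model, which is what makes its effect so striking.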
Introduction
The authors make an interesting observation about super weights: ‘pruning the super weight destroys quality by dampening the super activation and shifting almost all logit probability mass to stop words’. For example, given the prompt “Summer is hot. Winter is ”, the original model, with the super weight intact, correctly predicts the next token “cold” with a high probability of 81.4%. However, when the authors remove the super weight, the model’s top prediction becomes the stop word “the”, with a low probability of only 9.0%. ‘This indicates that the super weight is essential for the model to make a correct and confident prediction of meaningful words’ [1].
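To make the "dampening" effect concrete, here is a minimal toy sketch in plain Python. It is not the authors' code and uses no real model. The matrix W is a made-up stand-in for a down-projection, the input x is invented, and W[1][2] is simply assumed to play the role of the super weight.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


# Toy down-projection: 3 output channels, 4 input channels (all values invented).
W = [
    [0.02, -0.01, 0.03, 0.01],
    [0.01, 0.02, 2.50, -0.02],  # W[1][2] is the assumed "super weight"
    [-0.03, 0.01, 0.02, 0.02],
]
# Toy input with one unusually large channel (index 2).
x = [0.5, -0.4, 120.0, 0.3]

y_original = matvec(W, x)

# Prune (zero out) the single assumed super weight and recompute.
W_pruned = [row[:] for row in W]
W_pruned[1][2] = 0.0
y_pruned = matvec(W_pruned, x)

print("largest |output| before pruning:", max(abs(v) for v in y_original))
print("largest |output| after pruning: ", max(abs(v) for v in y_pruned))
```

In this toy setup, zeroing one entry removes the single very large output value while the other outputs keep their small magnitudes. That mirrors, in miniature, the dampening of the super activation described above.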
Characteristics of Super Weights
Super weights are always found in the mlp.down_proj weight, always in an early layer.
A super activation, a term defined by the authors [1], is an exceptionally large-magnitude activation amplified by the super weight.
Regardless of the prompt, the authors observed that the ‘super activation persists throughout the model at exactly the same magnitude and position’ [1].
Super weights suppress the likelihood of generating stop words.
Super weights and super activations are collectively referred to as super outliers by the authors [1], and they have been observed to play a critical role in model quality. The authors show that when super outliers are preserved, simple round-to-nearest quantization becomes noticeably more effective, so that preserving super outliers improves compression quality [1].
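The quantization point can be illustrated with a small sketch. It assumes a symmetric absmax round-to-nearest scheme at 4 bits and a made-up weight vector containing one outlier. The helper names (rtn_quantize, rtn_keep_largest) are hypothetical, and the hold-out-and-restore step is a simplification rather than the authors' exact procedure.

```python
def rtn_quantize(weights, bits=4):
    """Symmetric absmax round-to-nearest: quantize, then dequantize."""
    qmax = 2 ** (bits - 1) - 1
    largest = max(abs(w) for w in weights)
    if largest == 0:
        return list(weights)
    scale = largest / qmax
    return [round(w / scale) * scale for w in weights]


def rtn_keep_largest(weights, bits=4):
    """Hold the largest-magnitude weight out of quantization, then restore it."""
    idx = max(range(len(weights)), key=lambda i: abs(weights[i]))
    held_out = weights[idx]
    remaining = list(weights)
    remaining[idx] = 0.0  # keep the outlier from inflating the scale
    dequantized = rtn_quantize(remaining, bits)
    dequantized[idx] = held_out  # restore the outlier in full precision
    return dequantized


def mean_abs_error(original, approx):
    return sum(abs(a - b) for a, b in zip(original, approx)) / len(original)


# Invented weights: seven ordinary values and one outlier (2.5).
weights = [0.11, -0.07, 0.04, -0.12, 0.09, 2.5, -0.03, 0.06]

err_plain = mean_abs_error(weights, rtn_quantize(weights))
err_keep = mean_abs_error(weights, rtn_keep_largest(weights))
print("plain RTN error:           ", err_plain)
print("outlier-preserving error:  ", err_keep)
```

With plain round-to-nearest, the outlier stretches the quantization scale so far that the ordinary weights collapse to zero. Holding the outlier out lets the scale fit the remaining values, which is the intuition behind preserving super outliers during compression.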
Identifying Super Weights
According to the authors [1], ‘Super weights can be located by detecting the spikes in the mlp.down_proj inputs and outputs distributions across the layers. This detection only requires a single input prompt, rather than a set of validation data or use-case examples.’ See Figure 3 below.
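The spike-detection idea can be sketched on toy data. This assumes the per-layer mlp.down_proj inputs and outputs for a single prompt have already been collected, for example with forward hooks in a deep learning framework; here they are replaced by invented lists. The find_spike helper, the threshold ratio of 10, and the row/column mapping (output spike channel gives the row, input spike channel gives the column) are assumptions made for illustration.

```python
from statistics import median


def layer_peaks(activations_per_layer):
    """For each layer, return (max |activation|, channel index of that max)."""
    peaks = []
    for acts in activations_per_layer:
        channel = max(range(len(acts)), key=lambda i: abs(acts[i]))
        peaks.append((abs(acts[channel]), channel))
    return peaks


def find_spike(activations_per_layer, ratio=10.0):
    """Return (layer, channel) of a peak at least `ratio` times the median peak."""
    peaks = layer_peaks(activations_per_layer)
    typical = median(p for p, _ in peaks)
    layer = max(range(len(peaks)), key=lambda i: peaks[i][0])
    magnitude, channel = peaks[layer]
    if typical > 0 and magnitude >= ratio * typical:
        return layer, channel
    return None


# Invented down_proj inputs and outputs for 4 layers (one list per layer).
down_proj_inputs = [
    [0.4, -0.2, 0.3, 0.1],
    [0.5, 0.2, -95.0, 0.3],  # spike in layer 1, input channel 2
    [0.3, -0.4, 0.2, 0.1],
    [0.2, 0.1, -0.3, 0.4],
]
down_proj_outputs = [
    [0.6, -0.5, 0.2],
    [0.4, 310.0, -0.3],  # spike in layer 1, output channel 1
    [0.5, 0.3, -0.6],
    [-0.2, 0.4, 0.5],
]

in_spike = find_spike(down_proj_inputs)
out_spike = find_spike(down_proj_outputs)

# Assumed mapping: output spike channel -> row, input spike channel -> column.
if in_spike and out_spike and in_spike[0] == out_spike[0]:
    candidate = (in_spike[0], out_spike[1], in_spike[1])  # (layer, row, col)
else:
    candidate = None
print("candidate super weight (layer, row, col):", candidate)
```

Comparing each layer's peak with the median peak is just one simple way to flag a spike. Plotting the per-layer maxima and inspecting them by eye would serve the same purpose.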
References
[1] The research paper on super weights in large language models. Readers who want to learn more about the topic can refer to this original paper by the authors.