Super Weights in LLMs: How Pruning Them Destroys an LLM's Ability to Generate Text
Super weights are crucial to the performance of LLMs and can have an outsized impact on a model's behaviour.
I blog about the latest machine learning research topics that have an immediate impact on the work we data scientists and machine learning engineers do every day. Share the newsletter with your friends so that we all grow together.
Article in a Nutshell
The following are the key takeaways from this article:
Outlier parameters are a small number of parameters that are disproportionately important to the performance of an LLM. A billion-parameter LLM will have only a minuscule fraction of outlier parameters, say 0.01% of the total parameter count, but even that translates to hundreds of thousands of parameters.
The authors [1] point to the presence of “super weights”, a tiny subset of outlier parameters. Pruning as few as a single super weight can ‘destroy an LLM’s ability to generate text – increasing perplexity by 3 orders of magnitude and reducing zero-shot accuracy to guessing’ [1].
The authors [1] also state that removing other outliers, even ones that are sometimes larger in magnitude than the super weight itself, affects the performance of the LLM by no more than a few percentage points. Interestingly, removing a single super weight causes an accuracy drop that is much greater than the effect of removing all other non-super-weight outliers combined.
Introduction
The authors made an interesting observation about super weights: ‘pruning the super weight destroys quality by dampening the super activation and shifting almost all logit probability mass to stop words’ [1]. Example: given the prompt “Summer is hot. Winter is”, the original model, with the super weight intact, correctly predicts the next token “cold” with a high probability of 81.4%. When the authors removed the super weight, however, the model’s top prediction became the stopword “the”, with a low, unconfident probability of 9.0%. This indicates that the super weight is essential for the model to make a correct and confident prediction of meaningful words [1].
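To make this effect concrete, here is a minimal sketch (using PyTorch and Hugging Face transformers) of what such a pruning experiment can look like: zero out a single element of an mlp.down_proj matrix and compare the next-token prediction before and after. The model name and the (layer, row, column) coordinate below are my own placeholders for illustration; the actual coordinate has to be found with the detection procedure described later in this article.

```python
# Sketch: prune one candidate "super weight" and compare next-token predictions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"       # assumed model for illustration
layer_idx, row_idx, col_idx = 2, 3968, 7003   # hypothetical super-weight coordinate

tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

prompt = "Summer is hot. Winter is"

def top_prediction(model):
    """Return the most likely next token and its probability for the prompt."""
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]
    probs = torch.softmax(logits.float(), dim=-1)
    p, idx = probs.max(dim=-1)
    return tok.decode(idx), p.item()

print("with super weight:   ", top_prediction(model))

# Prune (zero out) the single weight in mlp.down_proj of the chosen layer.
w = model.model.layers[layer_idx].mlp.down_proj.weight
with torch.no_grad():
    w[row_idx, col_idx] = 0.0

print("without super weight:", top_prediction(model))
```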
Characteristics of Super Weights
The super weight is always found in the mlp.down_proj weight, and always in an early layer. The “super activation”, a term defined by the authors [1], is an exceptionally large-magnitude activation that is amplified by the super weight.
Regardless of the prompt, it was observed that the ‘super activation persists throughout the model at exactly the same magnitude and position’ [1].
Super weights suppress the likelihood of generating stop words.
Super weights and super activations are collectively referred to as super outliers by the authors [1], and they have been observed to play a critical role in model quality. The authors show that preserving super outliers noticeably improves the effectiveness of round-to-nearest quantization and, more generally, improves compression quality [1].
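As a rough illustration of the “preserve super outliers” idea, here is a sketch of plain round-to-nearest (RTN) weight quantization that holds a single super weight out and restores it in full precision after dequantization. This is a simplified sketch of the concept, not the authors' exact recipe, and the coordinate in the usage comment is a hypothetical placeholder.

```python
# Sketch: per-output-channel symmetric RTN quantization that preserves one super weight.
import torch

def rtn_quantize_preserving_super_weight(weight, super_coord=None, n_bits=8):
    """Quantize a 2-D weight matrix with RTN; optionally restore one element in full precision."""
    w = weight.float()
    qmax = 2 ** (n_bits - 1) - 1
    # One scale per output channel (row), based on the max absolute value in that row.
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    w_q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    w_dq = w_q * scale                      # dequantized weights
    if super_coord is not None:
        r, c = super_coord
        w_dq[r, c] = w[r, c]                # restore the super weight at full precision
    return w_dq.to(weight.dtype)

# Hypothetical usage on a Llama-style layer (coordinate is an assumption):
# layer = model.model.layers[2].mlp.down_proj
# layer.weight.data = rtn_quantize_preserving_super_weight(layer.weight.data, (3968, 7003))
```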
Identifying Super Weights
According to the authors [1], ‘Super weights can be located by detecting the spikes in the mlp.down_proj inputs and outputs distributions across the layers. This detection only requires a single input prompt, rather than a set of validation data or use-case examples’ (see Figure 3 in [1]).
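Below is a minimal sketch of that detection idea: run a single prompt through the model, hook every mlp.down_proj, and record where the largest input and output activation spikes occur. The model name and prompt are my own assumptions for illustration; the intuition is that an output spike in channel r of a layer, fed by an input spike in channel c, points at weight element [r, c] of that layer's down_proj.

```python
# Sketch: locate candidate super weights from down_proj input/output spikes on one prompt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"   # assumed model for illustration
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

records = []

def make_hook(layer_idx):
    def hook(module, inputs, output):
        x = inputs[0][0]                   # down_proj input:  [seq_len, intermediate_size]
        y = output[0]                      # down_proj output: [seq_len, hidden_size]
        in_val, in_idx = x.abs().max(), x.abs().argmax()
        out_val, out_idx = y.abs().max(), y.abs().argmax()
        records.append((layer_idx,
                        in_val.item(), in_idx.item() % x.shape[-1],
                        out_val.item(), out_idx.item() % y.shape[-1]))
    return hook

handles = [layer.mlp.down_proj.register_forward_hook(make_hook(i))
           for i, layer in enumerate(model.model.layers)]

with torch.no_grad():
    inputs = tok("Summer is hot. Winter is", return_tensors="pt").to(model.device)
    model(**inputs)

for h in handles:
    h.remove()

# Layers whose max |input| or |output| is an extreme outlier are candidates;
# (output channel, input channel) of such a layer is the candidate super-weight coordinate.
for layer_idx, in_val, in_ch, out_val, out_ch in records:
    print(f"layer {layer_idx:2d}: max|in|={in_val:8.1f} (ch {in_ch:5d})  "
          f"max|out|={out_val:8.1f} (ch {out_ch:5d})")
```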
References
For learning more on the topic, readers can refer to the original research paper: [1] “The Super Weight in Large Language Models”.
I am happy to hear what readers have to say on this article. Please leave a comment below.