Cracking LLM Quantization: A New Metric and a Bold Approach

Quantizing Large Language Models (LLMs) is no longer optional for developers; it is a necessity. As models keep growing, efficient inference depends on compressing them without sacrificing accuracy. The main obstacle is activation outliers: a handful of unusually large values that can send accuracy plummeting, especially at low bit widths.

The Outlier Dilemma

Activation outliers are the Achilles' heel of LLM quantization. Recent attempts to squash them with linear transformations across feature dimensions have faltered. The analysis behind this work shows that the outliers persist, with magnitude still concentrated in a few places after the transformation. If you can't squash them, is there a way to work around them?
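To make the problem concrete, here is a minimal sketch of how one outlier wrecks low-bit quantization. It assumes plain symmetric, per-tensor, round-to-nearest 4-bit quantization, which is a textbook baseline and not the scheme used by any particular method. The vectors and function names are made up for illustration.

```python
def quantize_dequantize(values, bits=4):
    """Symmetric per-tensor round-to-nearest quantization, then dequantization."""
    qmax = 2 ** (bits - 1) - 1              # 7 for signed 4-bit
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return list(values)
    scale = max_abs / qmax                  # one scale shared by the whole tensor
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def mean_squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Illustrative activations: seven ordinary values, plus one slot that is
# either ordinary (0.8) or a large outlier (40.0).
ordinary = [0.9, -0.7, 0.4, -0.2, 0.6, -0.5, 0.3]
without_outlier = ordinary + [0.8]
with_outlier = ordinary + [40.0]

# Measure the error only on the seven ordinary values in both cases.
err_without = mean_squared_error(ordinary, quantize_dequantize(without_outlier)[:7])
err_with = mean_squared_error(ordinary, quantize_dequantize(with_outlier)[:7])

print(f"error without outlier: {err_without:.5f}")
print(f"error with outlier:    {err_with:.5f}")
```

Because the outlier sets the shared scale, every ordinary value in the second vector rounds to zero: the grid is simply too coarse to see them. That is the effect that gets worse as the bit width shrinks.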

Introducing 'Flatness'

Enter 'Flatness', a novel metric aimed at quantifying the distribution of these outliers. By modeling the mathematical relationship between quantization error and outliers, Flatness makes it possible to identify a theoretically optimal solution. Out of this comes Bidirectional Diagonal Quantization (BDQ), a framework that disperses the outliers across matrix dimensions using diagonal operations.
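The article does not give the formula for Flatness or the details of BDQ's diagonal operations, so the following is only a rough sketch of the general idea that a diagonal rescaling can redistribute magnitude without changing a layer's output. It uses a generic per-channel scale (the square root of the activation-to-weight peak ratio) and a peak-to-mean ratio as a hypothetical stand-in for a flatness score. Neither is taken from BDQ, and the matrices are invented.

```python
import math


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def peak_to_mean(matrix):
    """Hypothetical flatness proxy (NOT the paper's metric):
    max |x| divided by mean |x|; 1.0 would be perfectly flat."""
    mags = [abs(v) for row in matrix for v in row]
    return max(mags) / (sum(mags) / len(mags))


# Rows are tokens, columns are channels; channel 2 carries the outliers.
X = [[0.5, -0.8, 30.0, 0.3],
     [-0.6, 0.4, -28.0, 0.7],
     [0.2, 0.9, 32.0, -0.5]]
W = [[0.2, -0.1],
     [0.3, 0.4],
     [-0.05, 0.02],
     [0.1, -0.3]]

# One scale per channel: the diagonal of an (assumed) matrix D.
d = []
for j in range(len(W)):
    x_peak = max(abs(row[j]) for row in X)
    w_peak = max(abs(v) for v in W[j])
    d.append(math.sqrt(x_peak / w_peak))

# X' = X D^-1 and W' = D W, so that X' W' equals X W.
X_scaled = [[row[j] / d[j] for j in range(len(d))] for row in X]
W_scaled = [[W[j][k] * d[j] for k in range(len(W[0]))] for j in range(len(W))]

Y = matmul(X, W)
Y_scaled = matmul(X_scaled, W_scaled)

print(f"peak-to-mean before: {peak_to_mean(X):.2f}")
print(f"peak-to-mean after:  {peak_to_mean(X_scaled):.2f}")
```

The output of the layer is unchanged up to floating-point error, while the activation matrix becomes noticeably flatter. The cost is that some of the magnitude moves into the weights, which is why the choice of where and how to spread outliers matters.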

BDQ: The New Benchmark

BDQ isn't just another entry in the quantization field; it sets a new benchmark, with less than a 1% accuracy drop under W4A4 quantization (4-bit weights, 4-bit activations) on the LLaMA-3-8B model. In the more challenging W2A4KV16 setting (2-bit weights, 4-bit activations, 16-bit KV cache), BDQ narrows the performance gap by 39.1% on the DeepSeek-R1-Distill-LLaMA-70B model.
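For a sense of what these bit widths buy, here is a back-of-the-envelope estimate of weight storage. In labels such as W4A4, the number after W is the weight bit width. The parameter counts below are the nominal 8 billion and 70 billion implied by the model names, not exact figures, and the estimate ignores overhead such as quantization scales, activations, and the KV cache.

```python
def weight_gigabytes(num_params, bits_per_weight):
    """Approximate weight storage in GB (10^9 bytes), ignoring scale/zero-point overhead."""
    return num_params * bits_per_weight / 8 / 1e9


# Nominal parameter counts taken from the model names (assumption).
params_8b = 8e9
params_70b = 70e9

print(f"8B  at 16-bit: {weight_gigabytes(params_8b, 16):.1f} GB")   # 16.0 GB
print(f"8B  at 4-bit:  {weight_gigabytes(params_8b, 4):.1f} GB")    # 4.0 GB
print(f"70B at 16-bit: {weight_gigabytes(params_70b, 16):.1f} GB")  # 140.0 GB
print(f"70B at 2-bit:  {weight_gigabytes(params_70b, 2):.1f} GB")   # 17.5 GB
```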

So, why should you care? BDQ demonstrates that, with the right approach, LLM quantization can be both efficient and accurate. That makes it a tangible step forward in practical AI deployment.

The Road Ahead

While BDQ's results are impressive, the broader question remains: how well does this approach carry over to diverse LLM architectures? As more models emerge, the framework's scalability and adaptability will be put to the test. But for now, BDQ stands as a testament to the potential of innovative matrix transformations in surmounting the quantization challenge.