Revolutionizing LLM Quantization: TORQ's Leap Forward

By Marcus Diallo | May 20, 2026

As the Microscaling FP4 format faces challenges in LLM quantization, TORQ emerges as a promising solution, enhancing accuracy without training. Here's why it matters.

When it comes to deploying large language models (LLMs) efficiently, the Microscaling FP4 (MXFP4) format seemed like a promising player. But the story looks different from Nairobi. MXFP4 trades dynamic range against hardware efficiency, and that trade-off hits a snag on activation quantization, where accuracy drops significantly.

Cracking the Code of MXFP4

The root of MXFP4's issue lies in two structural imbalances: extreme inter-block variance and problematic intra-block codebook usage. These imbalances cause LLMs to lose precision, making them less reliable in practice. This isn't about replacing workers. It's about reach. But how can we reach a solution?
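To make the block-scaling problem concrete, here is a minimal sketch of MXFP4 round-to-nearest quantization in NumPy. It assumes the standard Microscaling layout (32 elements sharing one power-of-two scale, each element stored as FP4 E2M1). The function name `mxfp4_rtn` and the toy data are hypothetical and are not taken from TORQ. The example shows one way an imbalance can hurt: a single large value in a block pushes the shared scale up so far that every small value in that block rounds to zero.

```python
import numpy as np

# Non-negative magnitudes representable by FP4 E2M1, the MXFP4 element type.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
BLOCK = 32  # MXFP4 shares one power-of-two scale across 32 elements


def mxfp4_rtn(x):
    """Quantize a 1-D array to MXFP4 by nearest grid point, then dequantize.

    Illustrative only: ties resolve to the smaller magnitude, and the
    length of x must be a multiple of BLOCK.
    """
    x = np.asarray(x, dtype=np.float64)
    blocks = x.reshape(-1, BLOCK)
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    safe = np.where(amax > 0, amax, 1.0)  # avoid log2(0) for all-zero blocks
    # Shared exponent: floor(log2(amax)) minus the largest E2M1 exponent (2).
    scale = 2.0 ** (np.floor(np.log2(safe)) - 2)
    # Values beyond the largest grid point saturate at 6 * scale.
    scaled = np.clip(np.abs(blocks) / scale, 0.0, FP4_GRID[-1])
    idx = np.abs(scaled[..., None] - FP4_GRID).argmin(axis=-1)
    return (np.sign(blocks) * FP4_GRID[idx] * scale).reshape(x.shape)


def rel_error_small(x):
    """Relative error on every element except index 0 (the outlier slot)."""
    q = mxfp4_rtn(x)
    return np.linalg.norm(q[1:] - x[1:]) / np.linalg.norm(x[1:])


rng = np.random.default_rng(0)
calm = rng.normal(0.0, 1.0, BLOCK)  # a well-behaved block
spiky = calm.copy()
spiky[0] = 200.0                    # same block with one outlier

print("error on small values, calm block: ", rel_error_small(calm))
print("error on small values, spiky block:", rel_error_small(spiky))
```

In the spiky block the shared scale becomes 32, so the smallest nonzero representable magnitude is 16 and the ordinary values are wiped out. This is the kind of loss a rebalancing step has to prevent.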

Enter TORQ: A New Hope

The proposed TORQ framework tackles these challenges head-on without requiring additional training. It's a Post-Training Quantization (PTQ) framework, which means it is applied to a model that has already been trained, with no retraining or fine-tuning. TORQ employs orthogonal rotation strategies at both macro and micro levels to restore balance.
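Why can a rotation be applied after training at all? For any orthogonal matrix Q, multiplying the activations by Q and the weights by its transpose leaves a linear layer's output unchanged, so the rotation only changes what the quantizer sees. The sketch below checks that identity numerically. The shapes and the random rotation are hypothetical stand-ins; the article does not describe how TORQ picks its rotations.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
x = rng.normal(size=(4, d))   # hypothetical activations (tokens x channels)
w = rng.normal(size=(d, 16))  # hypothetical weight matrix

# Random orthogonal matrix from a QR decomposition, standing in for
# whatever rotation a quantization method would actually choose.
q, _ = np.linalg.qr(rng.normal(size=(d, d)))

y_ref = x @ w                # original layer output
y_rot = (x @ q) @ (q.T @ w)  # rotated activations times counter-rotated weights

print("max output difference:", np.abs(y_ref - y_rot).max())
```

Because the outputs match up to floating-point error, the rotated weights can be computed once offline, and only the rotated tensors need to be quantized.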

At the macro level, TORQ uses the Schur-Horn theorem to redistribute activation energy across blocks. This keeps high-variance blocks from distorting the picture and preserves the precision of smaller elements. At the micro level, it optimizes how the MXFP4 codebook's capacity is used within each block, so that less information is lost to rounding. The farmer I spoke with put it simply: accuracy is everything.
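The Schur-Horn theorem says, in part, that a symmetric matrix can be rotated so that its diagonal takes any values majorized by its eigenvalues, including the perfectly flat case where every entry equals the average. The sketch below illustrates that flat case with a normalized Hadamard matrix applied to a toy diagonal covariance. The Hadamard matrix is a swapped-in, well-known rotation used purely for illustration, and the variances are made up; this is not TORQ's construction.

```python
import numpy as np


def hadamard(n):
    """Normalized Sylvester Hadamard matrix; n must be a power of two."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)  # scaling makes the matrix orthogonal


BLOCK = 32
n = 2 * BLOCK

# Toy per-channel variances: the first block carries 100x the energy.
variances = np.ones(n)
variances[:BLOCK] = 100.0
cov = np.diag(variances)

h = hadamard(n)
cov_rot = h @ cov @ h.T  # covariance of the rotated activations


def block_energy(c):
    """Sum of per-channel variances inside each block of 32 channels."""
    return np.diag(c).reshape(-1, BLOCK).sum(axis=1)


print("block energy before rotation:", block_energy(cov))
print("block energy after rotation: ", block_energy(cov_rot))
```

The total energy (the trace) and the eigenvalues are untouched by the rotation; only their distribution across channels changes, so both blocks end up with the same energy.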

Results That Speak Volumes

The numbers don't lie. Experiments on models like LLaMA3 and Qwen3 show TORQ's prowess. On the Qwen3-32B model, perplexity on WikiText dropped to 8.43, close to the 7.61 achieved by the higher-precision BF16 format. Accuracy, meanwhile, rose from 38.40% with direct round-to-nearest (RTN) quantization to 73.63%. Silicon Valley designs it. The question is where it works.

But why should we care about yet another quantization method? The answer is simple: efficiency. In regions where computational power and resources are limited, these gains mean LLMs become more accessible, practical, and valuable. Automation doesn't mean the same thing everywhere. So, is TORQ the magic bullet for LLM quantization? Let's see how it holds up in the field.
