AI Pulse
2026-06-02 · research

Trimming the Fat: Two Approaches to Faster LLM Inference

Two recent papers attack LLM inference overhead from different angles. NanoSpec shrinks the active vocabulary in speculative decoding's projection step from roughly 30k tokens to roughly 3k without sacrificing draft quality. InfoMerge tackles the token explosion in video LLMs, and the attention cost that comes with it, by compressing visual tokens based on information content rather than temporal proximity.

Inference speed has become the unglamorous but consequential front in LLM development. Faster generation means lower cost, tighter latency budgets, and the practical difference between a feature shipping or not. Two papers out this week address the problem from opposite ends of the stack — one targeting the vocabulary projection bottleneck in text generation, the other attacking the token flood in video understanding.

Speculative Decoding’s Hidden Tax

Speculative decoding, in which a small draft model proposes tokens that a larger verifier accepts or rejects, is one of the more elegant throughput hacks in modern inference. The draft model runs fast, and the verifier checks its proposals in a single parallel pass. The catch is that the draft model's final linear projection, over a vocabulary that can exceed 100k tokens, still has to run at every drafting step, and it is expensive.
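For readers who have not seen the mechanism spelled out, here is a minimal sketch of one draft-then-verify round in its greedy form. Everything in it is an illustrative assumption: `toy_draft` and `toy_verifier` are hypothetical stand-ins for real models, and production systems verify all proposed positions in one batched forward pass rather than in a loop.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy next-token function


def speculative_step(prefix: List[Token], draft: NextToken,
                     verifier: NextToken, k: int) -> List[Token]:
    """Run one draft-then-verify round and return the tokens it appends."""
    # 1. The draft model proposes k tokens autoregressively (the cheap part).
    proposed: List[Token] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # 2. The verifier checks each proposed position in order.
    accepted: List[Token] = []
    for tok in proposed:
        target = verifier(prefix + accepted)
        if tok != target:
            # First mismatch: keep the verifier's token and end the round.
            accepted.append(target)
            return accepted
        accepted.append(tok)

    # 3. Every proposal matched, so the verifier contributes one extra token.
    accepted.append(verifier(prefix + accepted))
    return accepted


# Hypothetical toy models: the verifier counts upward modulo 10,
# and the draft agrees with it except right after the token 3.
def toy_verifier(seq: List[Token]) -> Token:
    return (seq[-1] + 1) % 10


def toy_draft(seq: List[Token]) -> Token:
    return 0 if seq[-1] == 3 else (seq[-1] + 1) % 10


print(speculative_step([0], toy_draft, toy_verifier, k=4))  # [1, 2, 3, 4]
```

In the printed round the draft gets three tokens right and one wrong, and the output still matches what the verifier alone would have produced. The speedup comes from how many tokens are settled per verifier pass, which is why the cost of each drafting step matters.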

Existing workarounds prune the vocabulary statically or with coarse granularity, which forces practitioners to keep roughly 30k tokens active just to preserve draft quality. Against a 100k-token vocabulary, that leaves about 30% of the projection cost in place.
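A rough back-of-the-envelope calculation shows what is at stake. The sketch below counts multiply-accumulate operations for the output projection of a single token. The hidden size of 4096 and the 128k full vocabulary are assumptions chosen for illustration, not figures from either paper; the 30k and 3k sizes are the ones quoted in this article.

```python
def lm_head_macs(hidden_size: int, vocab_size: int) -> int:
    """Multiply-accumulates for one token's output projection (hidden x vocab)."""
    return hidden_size * vocab_size


HIDDEN = 4096  # assumed hidden size, for illustration only

full = lm_head_macs(HIDDEN, 128_000)       # assumed full vocabulary
pruned_30k = lm_head_macs(HIDDEN, 30_000)  # static pruning, per the article
pruned_3k = lm_head_macs(HIDDEN, 3_000)    # per-query vocabulary, per the article

for name, macs in [("full", full), ("30k", pruned_30k), ("3k", pruned_3k)]:
    print(f"{name:>4}: {macs:>11,} MACs ({macs / full:.1%} of full)")
```

Under these assumptions the 30k vocabulary still costs about a quarter of the full projection, while a 3k vocabulary costs a tenth of that again. Real speedups depend on memory bandwidth and the rest of the draft model, so treat this as an upper-level estimate rather than a benchmark.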

NanoSpec: Accelerating Speculative Decoding using Minimalist In-Context Vocabularies takes a different route. Rather than committing to a fixed pruned vocabulary at training time, NanoSpec dynamically selects a minimal vocabulary per query, dropping the active vocabulary to around 3k tokens. The authors describe the approach as “training-free,” which matters a great deal for adoption: there is no fine-tuning cost to amortize, and the technique can be applied to existing draft models as a drop-in optimization.

The core insight is that the relevant vocabulary for any given context is far smaller than the full token set. Most tokens are simply not candidates given the preceding text, so computing their logits wastes computation. By identifying this minimal in-context vocabulary on the fly, NanoSpec breaks the trade-off that forced earlier methods to accept bloated active vocabularies as the price of quality.
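The saving itself is easy to picture: compute logits only for the rows of the output projection that belong to the active vocabulary. The sketch below assumes the active token ids are already given. How NanoSpec actually picks them is not described in this article, so `active_ids`, the toy `LM_HEAD` matrix, and both function names are hypothetical.

```python
from typing import Dict, Iterable, List, Sequence


def logits_over_subset(hidden: Sequence[float],
                       lm_head: List[Sequence[float]],
                       active_ids: Iterable[int]) -> Dict[int, float]:
    """Compute logits only for the token ids in active_ids."""
    return {
        tid: sum(h * w for h, w in zip(hidden, lm_head[tid]))
        for tid in active_ids
    }


def greedy_token(hidden: Sequence[float],
                 lm_head: List[Sequence[float]],
                 active_ids: Iterable[int]) -> int:
    """Pick the highest-logit token among the active ids."""
    logits = logits_over_subset(hidden, lm_head, active_ids)
    return max(logits, key=logits.get)


# Toy projection matrix: 6 "tokens", hidden size 2 (illustrative values).
LM_HEAD = [
    [0.1, 0.0],
    [0.9, 0.2],
    [-0.5, 0.3],
    [0.4, 0.4],
    [0.0, -1.0],
    [0.2, 0.1],
]
HIDDEN_STATE = [1.0, 0.5]

print(greedy_token(HIDDEN_STATE, LM_HEAD, range(6)))   # full vocabulary
print(greedy_token(HIDDEN_STATE, LM_HEAD, [1, 3, 5]))  # half the rows, same answer
```

The second call does half the work and returns the same token because the best candidate is in the active set. If the selection misses it, as with `[3, 5]` here, the draft proposes a different token and the verifier is more likely to reject it, which is why the quality of the per-query selection is the whole game.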

Video LLMs and the Token Explosion

The second paper operates in a different modality but confronts an analogous problem. Video understanding models must process sequences of frames. The number of visual tokens grows with every frame sampled, and self-attention cost grows roughly quadratically with that token count. The naive response, sampling fewer frames, discards information that may matter.
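One way to unpack the scaling is to separate token count from attention work. The sketch below assumes each frame contributes a fixed number of visual tokens and that attention compares every token with every other. The figure of 196 tokens per frame is an illustrative assumption, not a number from the paper.

```python
def visual_tokens(frames: int, tokens_per_frame: int) -> int:
    """Total visual tokens when every frame contributes the same number."""
    return frames * tokens_per_frame


def attention_pairs(num_tokens: int) -> int:
    """Token-to-token comparisons in full self-attention (n squared)."""
    return num_tokens * num_tokens


TOKENS_PER_FRAME = 196  # assumed, e.g. a 14 x 14 patch grid

for frames in (8, 16, 32, 64):
    n = visual_tokens(frames, TOKENS_PER_FRAME)
    print(f"{frames:>2} frames -> {n:>6,} tokens -> {attention_pairs(n):>12,} pairs")
```

Under these assumptions, going from 8 to 64 frames multiplies the token count by 8 and the pairwise attention work by 64. That gap is why compressing tokens, rather than dropping frames, is an attractive lever.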

InfoMerge: Information-aware Token Compression for Efficient Video Large Language Models argues that most existing compression methods are working with the wrong signal. Those methods identify redundancy by measuring similarity between adjacent frames, which is a temporal heuristic. It can fail in both directions: two frames can look nearly identical while encoding very different semantic content, and two frames can look different while being semantically interchangeable.

InfoMerge instead allocates token budgets according to information content — compressing tokens that carry redundant semantic signal regardless of their temporal position, and preserving tokens that carry unique content even if they appear in a long uniform-looking sequence. The approach is also training-free, and it enables longer effective video context by making the computational budget go further.
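The difference between a temporal signal and a content signal can be shown with a toy comparison. To be clear, this is not InfoMerge's algorithm: it uses cosine similarity to already-kept tokens as a crude stand-in for "information content," and it drops redundant tokens instead of merging them. All names and the 0.95 threshold are assumptions for illustration.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def keep_adjacent(tokens: List[Vector], threshold: float = 0.95) -> List[int]:
    """Temporal heuristic: drop a token only if it resembles the previous one."""
    return [
        i for i, tok in enumerate(tokens)
        if i == 0 or cosine(tok, tokens[i - 1]) < threshold
    ]


def keep_novel(tokens: List[Vector], threshold: float = 0.95) -> List[int]:
    """Content heuristic: drop a token if it resembles any token already kept."""
    kept: List[int] = []
    for i, tok in enumerate(tokens):
        if all(cosine(tok, tokens[j]) < threshold for j in kept):
            kept.append(i)
    return kept


# Toy clip: scene A, a brief cut to scene B, then back to scene A.
A, B = [1.0, 0.0], [0.0, 1.0]
clip = [A, A, B, A, A]

print(keep_adjacent(clip))  # [0, 2, 3]
print(keep_novel(clip))     # [0, 2]
```

The adjacent-frame rule keeps the return to scene A because it only looks one step back, while the content rule recognizes it as something already represented. A real method would need a better measure of information than raw similarity, but the example shows why temporal position alone is a weak proxy for redundancy.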

A Pattern Worth Noting

Both papers share a structural feature that is worth naming: they are training-free. This is not a coincidence. The cost and friction of fine-tuning a production model — even a draft model or a vision encoder — is high enough that optimizations requiring it face a steep adoption barrier. Methods that treat the existing model as a black box and operate on the inference path are far more likely to land in real deployments.

The other common thread is a shift from coarse, static heuristics to dynamic, content-aware decisions. Static vocabulary pruning and temporal frame sampling are both approximations that fail in the cases that matter most. Dynamically selecting what to compute — guided by the actual content of the input — is the more principled path, and these papers suggest it is also the more efficient one.

Neither result is a revolution. Speculative decoding already existed; video token compression already existed. But the marginal gains here are the kind that compound: a roughly 10x reduction in active vocabulary for speculative decoding (from about 30k to about 3k tokens) and meaningful token savings in video LLMs are both changes that show up in infrastructure costs and user-facing latency in production systems.
