Speculative Decoding Accelerator with Dynamic Top-K Projection

Prototype that implements the NanoSpec core idea — dynamically shrinking the draft model’s vocabulary projection at inference time — and benchmarks the speedup.

Difficulty: 1 week | Stack: Python, PyTorch, Transformers, Triton (optional, for a custom kernel), wandb for tracking

Who this is for

LLM inference engineers who want a working proof of concept they can measure and extend. The goal is to demonstrate that training-free vocabulary pruning can cut draft latency by 3-5× on the projection layer alone.

Build steps

Set up a baseline speculative decoding loop: GPT-2 small as draft, GPT-2 large as verifier, targeting 4-token drafts per verifier call.
Profile the baseline with torch.profiler to separate projection time (the lm_head matmul) from attention time, and check whether projection is the dominant cost at large vocabulary sizes.
Implement dynamic top-K vocabulary selection: after the last transformer layer, use the hidden state to run a cheap ‘selector’ (cosine similarity against token embeddings clustered offline into ~3k centroids) to pick active tokens, then run the projection only over those rows.
Replace the baseline draft model’s lm_head call with the dynamic projection and verify that the acceptance rate (tokens accepted / tokens drafted) degrades by less than 2% on a held-out test set (WikiText-103).
Run a sweep over K={500, 1000, 3000, 10000, 30000} measuring wall-clock tokens/sec and acceptance rate, plot the Pareto frontier, and write a short findings README.
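
The selector and pruned-projection steps above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the NanoSpec implementation: the function names `build_cluster_index` and `dynamic_topk_logits` are hypothetical, the offline clustering is a plain spherical k-means written inline, random weights stand in for the draft model's `lm_head.weight`, everything runs on CPU for a single position, and the selector takes whole clusters until at least K tokens are active (so the active set can slightly exceed K).

```python
import torch
import torch.nn.functional as F


def build_cluster_index(lm_head_weight, n_clusters, n_iters=10, seed=0):
    """Offline step: cluster unit-normalized token embeddings by cosine similarity.

    lm_head_weight: [vocab, d_model]. Returns (centroids [C, d], assign [vocab], sizes [C]).
    """
    gen = torch.Generator().manual_seed(seed)
    emb = F.normalize(lm_head_weight.detach().float(), dim=-1)
    init = torch.randperm(emb.shape[0], generator=gen)[:n_clusters]
    centroids = emb[init].clone()
    for _ in range(n_iters):
        # Assign each token to its most similar centroid, then recompute centroids.
        assign = (emb @ centroids.T).argmax(dim=-1)
        sums = torch.zeros_like(centroids).index_add_(0, assign, emb)
        nonempty = torch.bincount(assign, minlength=n_clusters) > 0
        centroids[nonempty] = F.normalize(sums[nonempty], dim=-1)
    assign = (emb @ centroids.T).argmax(dim=-1)
    sizes = torch.bincount(assign, minlength=n_clusters)
    return centroids, assign, sizes


def dynamic_topk_logits(hidden, lm_head_weight, centroids, assign, sizes, k):
    """Online step: score centroids, activate clusters until >= k tokens, project those rows.

    hidden: [d_model] final hidden state for one position.
    Returns (active_ids, logits) where logits[i] belongs to token active_ids[i].
    """
    n_clusters = centroids.shape[0]
    h = F.normalize(hidden.detach().float(), dim=-1)
    order = (centroids @ h).argsort(descending=True)  # cheap selector: C dot products
    cum = sizes[order].cumsum(dim=0)
    # Smallest number of best-ranked clusters whose sizes sum to at least k.
    n_take = min(int((cum < k).sum().item()) + 1, n_clusters)
    chosen = torch.zeros(n_clusters, dtype=torch.bool)
    chosen[order[:n_take]] = True
    # O(vocab) mask lookup; a tuned version would precompute per-cluster token lists.
    active_ids = chosen[assign].nonzero(as_tuple=True)[0]
    logits = lm_head_weight[active_ids] @ hidden  # projection over active rows only
    return active_ids, logits


if __name__ == "__main__":
    torch.manual_seed(0)
    W = torch.randn(5000, 64)  # stand-in for draft.lm_head.weight
    centroids, assign, sizes = build_cluster_index(W, n_clusters=100)
    hidden = torch.randn(64)
    ids, logits = dynamic_topk_logits(hidden, W, centroids, assign, sizes, k=500)
    draft_token = ids[logits.argmax()]  # map back to a full-vocabulary token id
    print(ids.numel(), int(draft_token))
```

Returning the active token ids alongside the logits keeps the rest of the speculative loop unchanged: sampling happens over the pruned logits, and the chosen index is mapped back to a full-vocabulary id before verification.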

Risks

Clustering token embeddings offline is straightforward, but the centroid count (3k) is a hyperparameter: too few and the acceptance rate collapses, too many and the selector cost eats the speedup. Plan an afternoon for tuning.
A custom sparse matmul for the dynamic projection may not beat PyTorch’s dense mm until batch sizes are large enough for good GPU occupancy. You may need Triton, or you may have to accept that the win only shows at larger vocabularies (70k+, e.g. Llama-3 at 128k).
GPT-2’s vocabulary (50k) is small compared to Llama-3’s (128k), so the speedup will be modest. For a more realistic demo, use Llama-3.2-1B as draft and Llama-3.2-3B as verifier via the llama.cpp Python bindings.
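
To make the centroid-count and vocabulary-size risks above concrete, here is a rough back-of-the-envelope estimate. It assumes the projection cost scales with the number of weight rows touched (C centroid rows for the selector plus K active rows, versus all V rows for the dense lm_head), so the hidden size cancels. The helper name `ideal_projection_speedup` is hypothetical, and the result is an idealized upper bound on the projection-layer speedup, not a wall-clock prediction: it ignores gather overhead, kernel launch cost, and occupancy.

```python
def ideal_projection_speedup(vocab_size, n_centroids, k):
    """Dense rows divided by (selector rows + active rows); d_model cancels out."""
    active = min(k, vocab_size)
    return vocab_size / (n_centroids + active)


# Vocabulary sizes of the two draft-model candidates named in this brief.
VOCABS = {"GPT-2 small": 50257, "Llama-3.2-1B": 128256}

if __name__ == "__main__":
    for name, vocab in VOCABS.items():
        for k in (500, 1000, 3000, 10000, 30000):
            bound = ideal_projection_speedup(vocab, n_centroids=3000, k=k)
            print(f"{name:>13}  K={k:>5}  ideal speedup <= {bound:.2f}x")
```

Under these assumptions, with 3k centroids the bound for GPT-2 is roughly 8.4× at K=3000 and roughly 3.9× at K=10000, while the same settings on a 128k vocabulary give roughly 21.4× and 9.9×. The selector term also shows why the centroid count matters: at K=3000, half of the remaining work is the selector itself.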

Business Angle