Speculative Decoding Accelerator with Dynamic Top-K Projection
Prototype that implements the NanoSpec core idea — dynamically shrinking the draft model’s vocabulary projection at inference time — and benchmarks the speedup.
Difficulty: 1 week | Stack: Python, PyTorch, Transformers, Triton (optional, for a custom kernel), wandb for tracking
Who this is for
LLM inference engineers who want a working proof of concept they can measure and extend, one that tests whether training-free vocabulary pruning can cut draft latency by 3-5× on the projection layer alone.
Build steps
- Set up a baseline speculative decoding loop: GPT-2 small as draft, GPT-2 large as verifier, targeting 4-token drafts per verifier call.
- Profile the baseline with torch.profiler to separate projection time (the lm_head matmul) from attention time, and check whether the projection is the dominant cost at large vocabulary sizes.
- Implement dynamic top-K vocabulary selection: after the last transformer layer, use the hidden state to run a cheap ‘selector’ (cosine similarity against token embeddings clustered offline into ~3k centroids) to pick active tokens, then run the projection only over those rows.
- Replace the baseline draft model’s lm_head call with the dynamic projection and verify that the acceptance rate (tokens accepted / tokens drafted) degrades by less than 2% on held-out text (the WikiText-103 test split).
- Run a sweep over K={500, 1000, 3000, 10000, 30000} measuring wall-clock tokens/sec and acceptance rate, plot the Pareto frontier, and write a short findings README.
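The selector and pruned projection from the build steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the NanoSpec implementation: the function names `build_centroid_index` and `dynamic_topk_logits` are hypothetical, spherical k-means stands in for whatever offline clustering you choose, and turning centroid scores into exactly K tokens by ranking every token by its centroid's score is one possible rule among several. Random tensors stand in for the model weights; with GPT-2 in Transformers the lm_head weight is tied to the input embeddings, so the same matrix can serve as both the clustering input and the projection.

```python
import torch
import torch.nn.functional as F


def build_centroid_index(embeddings, num_centroids, iters=10, seed=0):
    """Offline step: spherical k-means over token embeddings (an assumed choice).

    Returns (centroids [C, d], assignment [V]), where assignment[v] is the
    centroid id that token v belongs to.
    """
    gen = torch.Generator().manual_seed(seed)
    x = F.normalize(embeddings.float(), dim=-1)
    init = torch.randperm(x.size(0), generator=gen)[:num_centroids]
    centroids = x[init].clone()
    for _ in range(iters):
        # Assign each token to its most similar centroid (cosine similarity).
        assignment = (x @ centroids.T).argmax(dim=-1)
        sums = torch.zeros_like(centroids).index_add_(0, assignment, x)
        counts = torch.bincount(assignment, minlength=num_centroids)
        nonempty = counts > 0  # empty clusters keep their previous centroid
        centroids[nonempty] = F.normalize(sums[nonempty], dim=-1)
    assignment = (x @ centroids.T).argmax(dim=-1)
    return centroids, assignment


def dynamic_topk_logits(hidden, lm_head_weight, centroids, assignment, k):
    """Project one hidden state onto only k vocabulary rows.

    hidden: [d], lm_head_weight: [V, d]. Returns (active_ids [k], logits [k]).
    """
    # Cheap selector: C dot products instead of V.
    centroid_scores = centroids @ F.normalize(hidden.float(), dim=-1)  # [C]
    # Each token inherits its centroid's score (a gather, no matmul).
    token_scores = centroid_scores[assignment]                         # [V]
    active_ids = token_scores.topk(k).indices                          # [k]
    # Projection restricted to the selected rows of the lm_head matrix.
    logits = lm_head_weight[active_ids] @ hidden                       # [k]
    return active_ids, logits


# Toy demo with random weights standing in for a real lm_head.
torch.manual_seed(0)
V, d, C, K = 2000, 64, 50, 200
W = torch.randn(V, d)
centroids, assignment = build_centroid_index(W, C)
hidden = W[123] + 0.1 * torch.randn(d)
active_ids, logits = dynamic_topk_logits(hidden, W, centroids, assignment, K)
full_logits = W @ hidden  # dense baseline, for comparison only
```

The draft distribution is then a softmax over `logits` alone, so tokens outside `active_ids` get zero draft probability. Speculative sampling remains valid for any draft distribution, which means pruning should show up as a lower acceptance rate rather than as incorrect output; that is the quantity the K sweep measures.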
Risks
- Clustering token embeddings offline is straightforward but the centroid count (3k) is a hyperparameter — too few and acceptance rate collapses, too many and you lose the speedup; plan an afternoon for tuning.
- A custom row-gathered matmul for the dynamic projection may not beat PyTorch’s dense mm until batch sizes are large enough for good GPU occupancy; you may need Triton, or accept that the win only shows at larger vocabularies (70k+ tokens, e.g. Llama 3 at 128k).
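- GPT-2’s vocabulary (50k) is small compared to Llama 3’s (128k), so the speedup will be modest; for a more realistic demo, use Llama-3.2-1B as draft and Llama-3.2-3B as verifier. If you go through the llama.cpp Python bindings, first check that they expose the final hidden state before the output projection; if they do not, load the models through Transformers instead.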
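To see how vocabulary size, centroid count, and K trade off before writing any kernel, a back-of-envelope count of multiply-accumulates (MACs) per drafted token helps. The sketch below assumes the dense projection costs V × d MACs and the dynamic path costs (C + K) × d (selector plus pruned projection). It ignores the gather, the top-K selection, and memory bandwidth, so treat the ratios as optimistic upper bounds rather than predicted speedups. The function names are hypothetical; the vocabulary sizes are the published tokenizer sizes for GPT-2 (50,257) and Llama 3 (128,256).

```python
def dense_macs(vocab_size, hidden_size):
    """MACs for the full lm_head projection of one hidden state."""
    return vocab_size * hidden_size


def dynamic_macs(num_centroids, k, hidden_size):
    """MACs for the centroid selector plus the projection over k rows."""
    return (num_centroids + k) * hidden_size


def ideal_speedup(vocab_size, hidden_size, num_centroids, k):
    """Idealized projection speedup; the hidden size cancels out."""
    return dense_macs(vocab_size, hidden_size) / dynamic_macs(
        num_centroids, k, hidden_size
    )


NUM_CENTROIDS = 3000
SWEEP_K = (500, 1000, 3000, 10000, 30000)
MODELS = {
    "gpt2 (V=50257, d=768)": (50257, 768),
    # d=2048 is assumed for Llama-3.2-1B; it does not affect the ratio.
    "llama-3.2-1b (V=128256, d=2048)": (128256, 2048),
}

for name, (vocab_size, hidden_size) in MODELS.items():
    for k in SWEEP_K:
        ratio = ideal_speedup(vocab_size, hidden_size, NUM_CENTROIDS, k)
        print(f"{name}  K={k:>5}  ideal projection speedup ~{ratio:.2f}x")
```

Under this model the ideal ratio is simply V / (C + K), which makes the first and third risks concrete: a large centroid count eats into the gain at small K, and the smaller GPT-2 vocabulary caps the achievable ratio well below what a 128k vocabulary allows.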
Business Angle
A drop-in PyTorch benchmark kit that measures whether NanoSpec-style vocabulary pruning cuts draft-model projection latency 3–5×, sold as a one-time purchase to LLM inference engineers who need a credible PoC to justify infra changes.
Customer: A solo ML engineer or small-team inference lead at a Series A–C AI startup who is already running speculative decoding (e.g., vLLM or TGI) and needs hard numbers to pitch their CTO on switching draft-model architecture — not a researcher, but a practitioner who ships prod systems.
Pricing: one-time purchase at $49; about $800 in month 1 (16 sales), $300 per month passive by month 3