AI Pulse

Speculative Decoding Accelerator with Dynamic Top-K Projection

Prototype that implements the NanoSpec core idea — dynamically shrinking the draft model’s vocabulary projection at inference time — and benchmarks the speedup.

Difficulty: 1-week | Stack: Python, PyTorch, Transformers, Triton (optional, for a custom kernel), wandb for tracking

Who this is for

LLM inference engineers who want a working proof of concept they can measure and extend, one that demonstrates how training-free vocabulary pruning can cut draft latency by 3-5× on the projection layer alone.

Build steps

  1. Set up a baseline speculative decoding loop: GPT-2 small as draft, GPT-2 large as verifier, targeting 4-token drafts per verifier call.
  2. Profile the baseline with torch.profiler to separate projection time (the lm_head matmul) from attention time, and check whether the projection is the dominant cost at large vocabulary sizes.
  3. Implement dynamic top-K vocabulary selection: cluster the token embeddings offline into ~3k centroids; at inference time, after the last transformer layer, feed the hidden state to a cheap ‘selector’ that scores those centroids by cosine similarity and picks the K active tokens, then run the projection over only those rows of lm_head.
  4. Replace the baseline draft model’s lm_head call with the dynamic projection and verify that the acceptance rate (tokens accepted / tokens drafted) drops by less than 2% on a held-out test set (WikiText-103).
  5. Run a sweep over K={500, 1000, 3000, 10000, 30000} measuring wall-clock tokens/sec and acceptance rate, plot the Pareto frontier, and write a short findings README.
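
A minimal sketch of the selector and restricted projection from step 3, on random weights so it runs without downloading a model. Everything here is an assumption for illustration rather than the NanoSpec method itself: the function names (cluster_embeddings, select_active_tokens, dynamic_projection) are hypothetical, the offline clustering is a few iterations of spherical k-means, and the rule for turning centroid scores into K tokens (take member tokens of the best-scoring clusters first) is one plausible reading of the step.

```python
import torch
import torch.nn.functional as F


def cluster_embeddings(weight, n_centroids, n_iters=10, seed=0):
    """Offline: spherical k-means over the rows of the lm_head weight."""
    gen = torch.Generator().manual_seed(seed)
    unit = F.normalize(weight, dim=-1)
    init = torch.randperm(unit.size(0), generator=gen)[:n_centroids]
    centroids = unit[init].clone()
    for _ in range(n_iters):
        # Assign every token to its nearest centroid by cosine similarity.
        assign = (unit @ centroids.T).argmax(dim=-1)
        sums = torch.zeros_like(centroids).index_add_(0, assign, unit)
        nonempty = torch.bincount(assign, minlength=n_centroids) > 0
        # Empty clusters keep their previous centroid.
        centroids[nonempty] = F.normalize(sums[nonempty], dim=-1)
    assign = (unit @ centroids.T).argmax(dim=-1)
    return centroids, assign


def select_active_tokens(hidden, centroids, assign, k):
    """Online: score centroids against the hidden state, then take tokens
    from the best-scoring clusters first until k tokens are selected."""
    sims = F.normalize(hidden, dim=-1) @ centroids.T      # (n_centroids,)
    order = sims.argsort(descending=True)
    rank = torch.empty_like(order)
    rank[order] = torch.arange(order.numel(), device=order.device)
    token_rank = rank[assign]                             # cluster rank per token
    # A sort over the vocabulary keeps the sketch short; a real selector
    # would precompute per-cluster token lists instead.
    return torch.sort(token_rank, stable=True).indices[:k]


def dynamic_projection(hidden, weight, active):
    """Compute logits only for the active rows of the projection matrix."""
    return hidden @ weight[active].T


torch.manual_seed(0)
vocab, d_model, n_centroids, k = 2000, 64, 50, 300   # toy sizes, not GPT-2's
weight = torch.randn(vocab, d_model)                  # stand-in for lm_head.weight
hidden = torch.randn(d_model)                         # stand-in for the final hidden state

centroids, assign = cluster_embeddings(weight, n_centroids)
active = select_active_tokens(hidden, centroids, assign, k)
logits_k = dynamic_projection(hidden, weight, active)

full_logits = hidden @ weight.T
draft_token = active[logits_k.argmax()]               # map back to a vocabulary id
print(active.shape, logits_k.shape, int(draft_token))
```

The restricted logits match the corresponding entries of the full projection, so the only quality loss comes from tokens the selector leaves out of the active set; that is what the acceptance-rate check in step 4 measures.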

Risks

  • Clustering token embeddings offline is straightforward, but the centroid count (3k) is a hyperparameter: too few and the acceptance rate collapses, too many and you lose the speedup. Plan an afternoon for tuning.
  • A custom sparse (row-gathered) matmul for the dynamic projection may not beat PyTorch’s dense matmul until batch sizes are large enough to keep the GPU occupied. You may need a Triton kernel, or have to accept that the win only shows at larger vocabularies (70k+ tokens, such as Llama-3’s 128k).
  • GPT-2’s vocabulary (50k) is small compared to Llama’s (128k), so the speedup will be modest. For a more realistic demo, use Llama-3.2-1B as draft and Llama-3.2-3B as verifier via llama.cpp Python bindings (loading them through Transformers instead keeps lm_head reachable from PyTorch).
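
The matmul risk above can be checked early with a micro-benchmark before writing any kernel. The sketch below is an assumption-laden stand-in, not a result: it uses a random weight matrix at GPT-2’s vocabulary size with a reduced hidden width (256, chosen to keep memory small), picks the active rows at random instead of with a selector, and times a single-row projection, so the printed ratios only indicate whether gather-plus-matmul is competitive with the dense matmul on your hardware.

```python
import time

import torch


def time_fn(fn, reps=20, warmup=3):
    """Average wall-clock seconds per call, synchronizing if on GPU."""
    for _ in range(warmup):
        fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(reps):
        fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / reps


device = "cuda" if torch.cuda.is_available() else "cpu"
torch.manual_seed(0)
vocab, d_model = 50_257, 256                  # GPT-2 vocab size; width is an assumption
weight = torch.randn(vocab, d_model, device=device)
hidden = torch.randn(1, d_model, device=device)

# Baseline: dense projection over the whole vocabulary.
dense_s = time_fn(lambda: hidden @ weight.T)

results = {}
for k in (500, 1000, 3000, 10000, 30000):
    # Random active set; the row gather is timed together with the matmul.
    active = torch.randperm(vocab, device=device)[:k]
    gathered_s = time_fn(lambda: hidden @ weight[active].T)
    results[k] = dense_s / gathered_s         # > 1 means the pruned path is faster
    print(f"K={k:>6}: dense/gathered time ratio = {results[k]:.2f}")
```

Because the gather copies K rows on every call, the ratio can fall below 1 at large K; that is the case where a fused Triton kernel or a larger vocabulary is needed for the pruning to pay off.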

Business Angle

A drop-in PyTorch benchmark kit that proves NanoSpec-style vocabulary pruning cuts draft-model latency 3–5×, sold as a one-time purchase to LLM inference engineers who need a credible PoC to justify infra changes.

Customer: A solo ML engineer or small-team inference lead at a Series A–C AI startup who is already running speculative decoding (e.g., vLLM or TGI) and needs hard numbers to pitch their CTO on switching draft-model architecture — not a researcher, but a practitioner who ships prod systems.

Pricing: one-time purchase at $49; about $800 in month 1 (16 sales at $49 is $784), $300 passive by month 3
