Context Vocabulary Scope Visualizer
Interactive tool that shows, token by token, how small the ‘active’ vocabulary really is for any given prompt.
Difficulty: weekend | Stack: Python, Transformers (HuggingFace), Gradio, Plotly
Who this is for
ML engineers and researchers who want intuition for the NanoSpec insight: seeing empirically that, for a typical real context, more than 90% of vocabulary logits contribute negligible probability, which is what motivates dynamic pruning work.
Build steps
- Load a small causal LM (GPT-2 medium or TinyLlama) via HuggingFace and run greedy decoding on a user-supplied prompt.
- At each generation step, capture the full pre-softmax logit vector over the vocabulary and record which tokens fall in the top-K for K = 50, 3,000, and 30,000 (cap K at the model’s vocabulary size: 50,257 for GPT-2, 32,000 for TinyLlama).
- Compute cumulative probability mass vs. number of top-ranked tokens and plot the curve, marking the ‘knee’; the expectation to verify is that roughly 3k tokens capture 99%+ of the mass.
- Build a Gradio interface with a text input, a token-step slider, and a Plotly bar chart of the top-100 active tokens colored by rank.
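- Add a side-by-side latency estimate panel comparing the projected compute cost of the vocabulary-sized output projection at full vocab (100k), static pruning (30k), and NanoSpec size (3k) using FLOP counts. The 100k figure is a projection for larger-vocabulary models, since GPT-2 and TinyLlama have smaller vocabularies.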
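The capture, cumulative-mass, and FLOP steps above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the function names are hypothetical, the decoding loop skips the KV cache for brevity, and the FLOP estimate assumes only the output projection scales with vocabulary size (about 2 × hidden size × vocab size per token).

```python
import math


def capture_step_logits(prompt, model_name="gpt2", max_new_tokens=20):
    """Greedy-decode and return one pre-softmax logit list per generated token."""
    # Imported lazily so the pure-Python helpers below work without torch installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).eval()
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    steps = []
    with torch.no_grad():
        for _ in range(max_new_tokens):
            logits = model(ids).logits[0, -1]  # scores for the next token
            steps.append(logits.tolist())
            # Greedy pick: append the argmax token and continue.
            ids = torch.cat([ids, logits.argmax().view(1, 1)], dim=1)
    return steps


def sorted_probs(logits):
    """Softmax the logits and return probabilities in descending order."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return sorted((e / total for e in exps), reverse=True)


def mass_at_k(logits, ks=(50, 3000, 30000)):
    """Probability mass captured by the top-K tokens, for each K."""
    probs = sorted_probs(logits)
    return {k: min(1.0, sum(probs[:k])) for k in ks}


def smallest_k_for_mass(logits, target=0.99):
    """Smallest K whose top-K tokens reach the target cumulative mass."""
    running = 0.0
    for k, p in enumerate(sorted_probs(logits), start=1):
        running += p
        if running >= target:
            return k
    return len(logits)


def lm_head_flops(hidden_size, vocab_size):
    """Approximate per-token FLOPs of the output projection (assumed 2 * d * V)."""
    return 2 * hidden_size * vocab_size
```

Under these assumptions, `smallest_k_for_mass` applied to each step gives the per-step ‘knee’, and the ratio `lm_head_flops(d, 100_000) / lm_head_flops(d, 3_000)` is about 33× for any hidden size `d`. It bounds the savings in the output layer only, not in the whole forward pass.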
Risks
- Storing a full logit vector for every step adds up on top of the model weights, so stick to models ≤7B (a 7B model in fp16 already needs roughly 14 GB for weights alone) or you’ll OOM on a consumer GPU.
- Vocabulary ‘knees’ vary significantly by domain (code vs. prose vs. math), so a single demo prompt may not generalize. You need diverse examples to tell a compelling story.
- Gradio’s real-time slider updates can be sluggish if you’re re-running inference per step; pre-compute all steps and cache results instead.
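The pre-compute-and-cache advice for the slider can be sketched like this. It is a minimal sketch under stated assumptions: `make_step_lookup` and `compute_steps` are hypothetical names, and `compute_steps` stands for whatever function runs inference once per prompt and returns a tuple with one entry per generation step.

```python
from functools import lru_cache


def make_step_lookup(compute_steps, max_prompts=16):
    """Wrap an expensive per-prompt computation so slider moves only index into it.

    compute_steps(prompt) is assumed to return a tuple of per-step results
    (for example, top-100 tokens and probabilities for each generated token).
    """
    # Inference runs at most once per distinct prompt; later calls hit the cache.
    cached = lru_cache(maxsize=max_prompts)(compute_steps)

    def lookup(prompt, step):
        steps = cached(prompt)
        # Clamp the slider value so an out-of-range step never raises.
        step = max(0, min(int(step), len(steps) - 1))
        return steps[step]

    return lookup
```

The returned `lookup` would be the function the slider’s change handler calls, so dragging the slider costs a tuple index rather than a forward pass. Prompts must be hashable (plain strings are) for `lru_cache` to work.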
Business Angle
A hosted interactive playground that lets ML engineers see vocabulary sparsity in real prompts, making the case for NanoSpec-style dynamic pruning without requiring anyone to read a paper.
Customer: ML infrastructure engineer at a startup or mid-size AI company (5–200 people) who is tasked with cutting inference costs on an LLM deployment and needs to justify pruning/speculative-decoding experiments to a skeptical tech lead.
Pricing: freemium, targeting $800 MRR within 4 months (16 teams × $50/mo Pro tier)