AI Pulse

Context Vocabulary Scope Visualizer

Interactive tool that shows, token by token, how small the ‘active’ vocabulary really is for any given prompt.

Difficulty: weekend | Stack: Python, Transformers (Hugging Face), Gradio, Plotly

Who this is for

ML engineers and researchers who want intuition for the NanoSpec insight: seeing empirically that, for typical real contexts, more than 90% of the vocabulary carries negligible probability mass, which motivates dynamic vocabulary pruning.

Build steps

  1. Load a small causal LM (GPT-2 medium, with a 50,257-token vocabulary, or TinyLlama, with 32,000) via Hugging Face Transformers and decode greedily from a user-supplied prompt, one forward pass per generated token.
  2. At each generation step, capture the full pre-softmax logit vector over the vocabulary and record which tokens fall in the top-K, sweeping K across 50, 3,000, and 30,000.
  3. Compute cumulative probability mass vs. vocabulary size (top-K) and plot the curve; mark the ‘knee’ where, per the NanoSpec claim, roughly 3k tokens capture 99%+ of the probability mass.
  4. Build a Gradio interface with a text input, a token-step slider, and a Plotly bar chart of the top-100 active tokens colored by rank.
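  5. Add a side-by-side latency estimate panel comparing projected compute cost at a full 100k vocabulary (larger than either demo model’s own), NanoSpec size (3k), and static pruning (30k), using FLOP counts for the output projection (about 2 × d_model × vocabulary size per token).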
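
Steps 1 and 2 can be sketched as a plain greedy loop that keeps every step's raw logits. This is a minimal sketch, assuming NumPy and PyTorch; the helper names (`greedy_capture`, `topk_ids`, `make_hf_step_fn`) are hypothetical. The loop only sees a `step_fn` callable, so swapping GPT-2 medium for a TinyLlama checkpoint means changing one model id, and the loop itself can be exercised without downloading weights.

```python
from typing import Callable, Sequence

import numpy as np

# Hypothetical interface: maps the token ids so far to the next-token logits (pre-softmax).
StepFn = Callable[[Sequence[int]], np.ndarray]


def greedy_capture(step_fn: StepFn, prompt_ids: Sequence[int], n_steps: int):
    """Greedy decode for n_steps, keeping every step's raw logit vector."""
    ids = list(prompt_ids)
    captured = []
    for _ in range(n_steps):
        logits = np.asarray(step_fn(ids), dtype=np.float32)
        captured.append(logits)
        ids.append(int(np.argmax(logits)))  # greedy: append the top-1 token
    # Generated ids, plus a (n_steps, vocab) array of logits.
    return ids[len(prompt_ids):], np.stack(captured)


def topk_ids(logits: np.ndarray, ks=(50, 3_000, 30_000)) -> dict:
    """Token ids in the top-K by logit for each K in the sweep (unordered within K)."""
    vocab = logits.shape[-1]
    out = {}
    for k in ks:
        kk = min(k, vocab)  # TinyLlama's 32k vocab is barely larger than K=30k
        out[k] = np.argpartition(-logits, kk - 1)[:kk]
    return out


def make_hf_step_fn(model_name: str = "gpt2-medium"):
    """Wire a Hugging Face causal LM into a step_fn (downloads weights on first use)."""
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).eval()

    @torch.no_grad()
    def step_fn(ids):
        # Re-encodes the full context every step (no KV cache): slow but simple.
        out = model(torch.tensor([list(ids)]))
        return out.logits[0, -1].float().numpy()

    return tokenizer, step_fn
```

Typical use: `tokenizer, step_fn = make_hf_step_fn("gpt2-medium")`, then `generated, logits = greedy_capture(step_fn, tokenizer(prompt)["input_ids"], 32)`. Skipping the KV cache trades speed for simplicity, which is usually acceptable for short demo generations.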
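
For step 3, the curve is the running sum of the softmax probabilities sorted in descending order; the smallest K at which it reaches 99% is the number to report for the ‘knee’. A sketch under the same NumPy assumption, with hypothetical helper names; Plotly is imported only inside the plotting helper, and the x axis is log-scaled because K spans 50 to 30,000.

```python
import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    z = np.asarray(logits, dtype=np.float64)
    p = np.exp(z - z.max())  # subtract the max for numerical stability
    return p / p.sum()


def cumulative_mass(logits: np.ndarray) -> np.ndarray:
    """cum[k-1] is the probability mass held by the k most likely tokens."""
    return np.cumsum(np.sort(softmax(logits))[::-1])


def mass_at_k(logits: np.ndarray, ks=(50, 3_000, 30_000)) -> dict:
    """Mass captured at each K in the sweep (K is clipped to the vocabulary size)."""
    cum = cumulative_mass(logits)
    return {k: float(cum[min(k, len(cum)) - 1]) for k in ks}


def k_for_mass(logits: np.ndarray, target: float = 0.99) -> int:
    """Smallest K whose top-K tokens hold at least `target` of the mass."""
    cum = cumulative_mass(logits)
    return int(min(np.searchsorted(cum, target) + 1, len(cum)))


def mass_curve_figure(step_logits: np.ndarray):
    """Plotly line chart of mass captured vs. K, one trace per generation step."""
    import plotly.graph_objects as go  # deferred so the numeric helpers run without Plotly

    fig = go.Figure()
    for i, logits in enumerate(step_logits):
        cum = cumulative_mass(logits)
        fig.add_trace(go.Scatter(x=np.arange(1, len(cum) + 1), y=cum,
                                 mode="lines", name=f"step {i}"))
    fig.update_xaxes(type="log", title_text="K (top-K vocabulary size)")
    fig.update_yaxes(title_text="cumulative probability mass")
    return fig
```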
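
Step 4's interface might look like the following sketch. It assumes a hypothetical `capture(prompt, n_steps)` callable returning an `(n_steps, vocab)` logit array (for instance, a wrapper around the greedy loop from step 2) and a `decode(token_id)` callable such as `lambda i: tokenizer.decode([i])`; all helper names are illustrative. Inference runs once per click of Run, and the slider only re-renders from the array cached in `gr.State`, which sidesteps the sluggish-slider risk listed under Risks.

```python
import numpy as np


def softmax(logits):
    z = np.asarray(logits, dtype=np.float64)
    p = np.exp(z - z.max())  # max-shift for numerical stability
    return p / p.sum()


def top_tokens(logits: np.ndarray, n: int = 100):
    """Ids and probabilities of the n most likely tokens, most likely first."""
    probs = softmax(logits)
    idx = np.argsort(probs)[::-1][:n]
    return idx, probs[idx]


def top_tokens_figure(logits, decode, n=100):
    """Bar chart of the top-n tokens; bar color encodes rank, hover shows the token."""
    import plotly.graph_objects as go  # deferred: only needed when rendering

    idx, probs = top_tokens(logits, n)
    ranks = np.arange(1, len(idx) + 1)
    fig = go.Figure(go.Bar(
        x=ranks, y=probs,
        hovertext=[repr(decode(int(i))) for i in idx],
        marker=dict(color=ranks, colorscale="Viridis"),
    ))
    fig.update_layout(xaxis_title="rank", yaxis_title="probability")
    return fig


def build_demo(capture, decode, n_steps=32):
    """capture(prompt, n_steps) -> (n_steps, vocab) logits; decode(id) -> str."""
    import gradio as gr

    with gr.Blocks() as demo:
        prompt = gr.Textbox(label="Prompt")
        run = gr.Button("Run")
        step = gr.Slider(minimum=0, maximum=n_steps - 1, value=0, step=1,
                         label="Generation step")
        plot = gr.Plot()
        cache = gr.State()  # every step's logits, computed once per prompt

        def on_run(text, i):
            logits = capture(text, n_steps)  # the only place inference runs
            return logits, top_tokens_figure(logits[int(i)], decode)

        def on_step(logits, i):
            if logits is None:  # slider moved before any prompt was run
                return None
            return top_tokens_figure(logits[int(i)], decode)

        run.click(on_run, inputs=[prompt, step], outputs=[cache, plot])
        step.change(on_step, inputs=[cache, step], outputs=plot)
    return demo
```

`build_demo(capture, decode).launch()` would start the app locally.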
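
For step 5, vocabulary size only changes the cost of the output projection, roughly 2 × d_model × V FLOPs per token. The sketch below sets that against a rough estimate for the rest of the network, using GPT-2 medium's shape (24 layers, d_model = 1024) and the common 12 × n_layers × d_model² approximation for non-embedding parameters. The approximation, the helper names, and the 100k "full" vocabulary (GPT-2 itself has 50,257 tokens) are assumptions. FLOPs are only a proxy for latency: batch-1 decoding is often limited by memory bandwidth, and the cost of choosing a dynamic subset is not counted here.

```python
def lm_head_flops(d_model: int, vocab: int) -> int:
    """Output projection (d_model x vocab matmul), counting 2 FLOPs per multiply-add."""
    return 2 * d_model * vocab


def body_flops(n_layers: int, d_model: int) -> int:
    """Rough per-token cost of the transformer body: 2 FLOPs per non-embedding
    parameter, with parameters approximated as 12 * n_layers * d_model**2.
    Ignores attention over the context, which grows with sequence length."""
    return 2 * 12 * n_layers * d_model**2


def vocab_cost_table(n_layers: int, d_model: int, vocabs: dict) -> dict:
    body = body_flops(n_layers, d_model)
    rows = {}
    for name, v in vocabs.items():
        head = lm_head_flops(d_model, v)
        rows[name] = {"head_flops": head,
                      "total_flops": body + head,
                      "head_share": head / (body + head)}
    return rows


# GPT-2 medium shape (24 layers, d_model = 1024); vocab sizes from step 5.
table = vocab_cost_table(24, 1024, {
    "full (100k)": 100_000,
    "static pruning (30k)": 30_000,
    "NanoSpec (3k)": 3_000,
})
for name, row in table.items():
    print(f"{name:>22}: head {row['head_flops'] / 1e6:7.1f} MFLOPs/token, "
          f"{row['head_share']:.1%} of total")
```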

Risks

  • Memory: every forward pass materializes logits for each position (sequence length × vocabulary size), and a 7B model’s weights alone take about 14 GB in fp16, so stick to models ≤7B or you’ll OOM on a consumer GPU.
  • Vocabulary ‘knees’ vary significantly by domain (code vs. prose vs. math), so a single demo prompt may not generalize; you need diverse examples to tell a compelling story.
  • Gradio’s real-time slider updates can be sluggish if you re-run inference on every slider move; pre-compute all steps once per prompt and cache the results instead.
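
One way to act on the memory and caching risks (a sketch with hypothetical helper names, assuming NumPy) is to run inference once per prompt, immediately shrink each step's logits to the numbers the UI shows, and serve slider moves from that list. The byte-count helpers make the trade-off explicit: for GPT-2's 50,257-token vocabulary, 64 steps of fp32 logits come to about 12.9 MB, while a 7B model's fp16 weights need about 14 GB, so the weights, not the cached logits, are what exhaust a consumer GPU.

```python
import numpy as np


def summarize_step(logits, ks=(50, 3_000, 30_000), top_n=100):
    """Reduce one step's full logit vector to what the UI actually displays."""
    z = np.asarray(logits, dtype=np.float64)
    p = np.exp(z - z.max())  # stable softmax
    p /= p.sum()
    order = np.argsort(p)[::-1]  # token ids, most likely first
    cum = np.cumsum(p[order])
    return {
        "top_ids": order[:top_n],
        "top_probs": p[order[:top_n]].astype(np.float32),
        "mass_at_k": {k: float(cum[min(k, len(cum)) - 1]) for k in ks},
    }


def precompute(step_logits, **kwargs):
    """Summarize every generation step once so slider moves become list lookups."""
    return [summarize_step(row, **kwargs) for row in step_logits]


def full_logit_cache_bytes(n_steps: int, vocab: int, bytes_per_value: int = 4) -> int:
    """Memory to keep every step's raw logits (fp32 by default)."""
    return n_steps * vocab * bytes_per_value


def weight_bytes(n_params: int, bytes_per_param: int = 2) -> int:
    """Memory for model weights alone (fp16 by default)."""
    return n_params * bytes_per_param
```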

Business Angle

A hosted interactive playground that lets ML engineers viscerally see vocabulary sparsity in real prompts, making the NanoSpec/dynamic-pruning case without anyone having to read a paper.

Customer: ML infrastructure engineer at a startup or mid-size AI company (5–200 people) who is tasked with cutting inference costs on an LLM deployment and needs to justify pruning/speculative-decoding experiments to a skeptical tech lead.

Pricing: freemium; target of $800 MRR within 4 months (16 teams × $50/mo Pro tier)
