AI Pulse

Sparse vs. Dense Attention Diff Visualizer

An interactive web app that loads a small transformer, lets you toggle between full and DeepSeek-style sparse attention masks, and shows in real time which token pairs are dropped and how outputs shift.

Difficulty: 1-week | Stack: Python, PyTorch, Transformers (HuggingFace), Gradio, matplotlib, einops

Who this is for

ML students and curious engineers who have read about MLA and GLM-5.1’s adoption of sparse attention but want an intuition for what ‘sparse’ actually means at the attention-weight level. The project turns an abstract architectural claim into a tangible experiment.

Build steps

  1. Load a small open-weight model (GPT-2 or Qwen-0.5B) and attach a forward hook to each attention layer that captures the raw attention scores (the QK dot-product logits) before softmax.
  2. Implement a configurable sparse mask generator with sliding-window, top-k-per-query, and block-sparse patterns matching DeepSeek’s published design. Apply the mask to the QK dot-product scores before softmax, setting dropped entries to -inf so they receive zero weight.
  3. Build a Gradio interface with a text input, a mask-type dropdown, and a sparsity slider (% of attention edges dropped); show the full vs. masked attention heatmap side by side using matplotlib.
  4. Run the forward pass under both regimes and display the decoded output alongside a per-layer perplexity delta (for example, the change in perplexity when only that layer is masked) so users can directly observe quality degradation at various sparsity levels.
  5. Add an export button that saves an animated GIF stepping through layers, useful for blog posts or conference talks explaining architectural trade-offs.
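
The mask generator in step 2 could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not DeepSeek's published pattern: the function names are hypothetical, the block-sparse variant is a plain block-diagonal mask, and every mask is combined with a causal mask so each query always keeps at least itself. Scores are assumed to have shape [heads, query, key].

```python
import torch

def _positions(seq_len: int):
    # Column vector of query positions and row vector of key positions.
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    return i, j

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean [T, T] mask (True = keep): each query sees its last `window` keys."""
    i, j = _positions(seq_len)
    return (j <= i) & (j > i - window)

def block_sparse_mask(seq_len: int, block: int) -> torch.Tensor:
    """Boolean [T, T] mask: a query sees earlier keys in its own block only.
    Assumption: a simple block-diagonal stand-in, not a published design."""
    i, j = _positions(seq_len)
    return (j <= i) & (i // block == j // block)

def topk_mask(scores: torch.Tensor, k: int) -> torch.Tensor:
    """Keep the k highest-scoring causal keys per query. scores: [..., T, T]."""
    seq_len = scores.shape[-1]
    causal = torch.tril(
        torch.ones(seq_len, seq_len, dtype=torch.bool, device=scores.device)
    )
    causal_scores = scores.masked_fill(~causal, float("-inf"))
    idx = causal_scores.topk(min(k, seq_len), dim=-1).indices
    keep = torch.zeros_like(scores, dtype=torch.bool).scatter_(-1, idx, True)
    # Early queries have fewer than k causal keys; drop the -inf picks.
    return keep & causal

def apply_mask(scores: torch.Tensor, keep: torch.Tensor) -> torch.Tensor:
    """Set dropped edges to -inf before softmax so they get exactly zero weight."""
    return torch.softmax(scores.masked_fill(~keep, float("-inf")), dim=-1)

def dropped_fraction(keep: torch.Tensor) -> float:
    """Fraction of causal attention edges removed by the mask."""
    seq_len = keep.shape[-1]
    causal = torch.tril(
        torch.ones(seq_len, seq_len, dtype=torch.bool, device=keep.device)
    )
    kept = (keep & causal).sum().item()
    total = causal.sum().item() * (keep.numel() // (seq_len * seq_len))
    return 1.0 - kept / total

if __name__ == "__main__":
    torch.manual_seed(0)
    scores = torch.randn(2, 6, 6)  # stand-in for captured QK scores
    keep = sliding_window_mask(6, 3)
    weights = apply_mask(scores, keep)
    print(weights.shape, f"dropped: {dropped_fraction(keep):.0%}")
```

A value like `dropped_fraction` is one way to tie the sparsity slider in step 3 to a concrete number: the slider can drive `window`, `block`, or `k`, and the UI can report the resulting share of dropped edges.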

Risks

  • Applying post-hoc sparse masks to a model trained with full attention is not equivalent to training with sparse attention. The results show degradation curves, not the behavior of a model trained natively with sparse attention, so the UI must state this limitation clearly.
  • Attention weight hooking differs significantly across model families (some fuse QKV projections, some use flash attention that never materializes the weight matrix); you may need to patch model code rather than use a generic hook.
  • Rendering heatmaps for long sequences (>512 tokens) becomes slow and unreadable; you will need to cap input length or aggregate attention heads early.
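
For the long-sequence risk above, one possible mitigation is to average over heads and block-pool the attention map before plotting. The sketch below assumes post-softmax weights of shape [heads, T, T]; the function name `downsample_attention` and the `max_side` default are hypothetical choices, not part of any library.

```python
import math

import torch
import torch.nn.functional as F

def downsample_attention(attn: torch.Tensor, max_side: int = 128) -> torch.Tensor:
    """Reduce [heads, T, T] attention weights to a [S, S] map with S <= max_side.

    Heads are averaged first, then square blocks of cells are mean-pooled,
    so each plotted pixel shows the average weight inside its block.
    """
    mean_heads = attn.float().mean(dim=0)  # [T, T]
    seq_len = mean_heads.shape[-1]
    if seq_len <= max_side:
        return mean_heads
    stride = math.ceil(seq_len / max_side)
    # avg_pool2d expects a batch and channel dimension, hence [None, None].
    pooled = F.avg_pool2d(mean_heads[None, None], kernel_size=stride, ceil_mode=True)
    return pooled[0, 0]

if __name__ == "__main__":
    torch.manual_seed(0)
    attn = torch.softmax(torch.randn(4, 512, 512), dim=-1)
    small = downsample_attention(attn, max_side=128)
    print(small.shape)  # 512 / 128 gives 4x4 blocks
```

The resulting 2-D tensor can be passed to matplotlib's `imshow` after `.numpy()`. Mean-pooling hides individual sharp peaks, so a max-pool variant may be preferable when the goal is to spot which single token pairs survive the mask.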