AI Pulse

Negation Ablation Sandbox

An interactive notebook that lets you ablate late-layer attention heads in a transformer and watch negation accuracy change in real time.

Difficulty: weekend | Stack: Python, TransformerLens, Jupyter / Marimo, Plotly, HuggingFace Datasets

Who this is for

Interpretability researchers and curious ML practitioners who want hands-on intuition for how specific attention heads suppress or promote correct negation handling, by replicating and extending published mechanistic findings on negation.

Build steps

  1. Curate a small negation test set (100-200 examples) using existing datasets like MultiNLI filtered for negation words, plus hand-crafted minimal pairs (e.g. ‘Paris is the capital of France’ vs. ‘Paris is not the capital of France’).
  2. Use TransformerLens to run GPT-2-Medium or a similar model on this set with full hook access; log per-head attention patterns and the baseline negation accuracy.
  3. Implement a head-ablation loop: zero out or mean-ablate attention heads one at a time (or in ranked groups by layer) and re-evaluate negation accuracy after each ablation.
  4. Plot a heatmap (layer × head) of accuracy delta from ablation using Plotly; highlight heads where ablation improves negation accuracy — these are the ‘shortcut promoters’.
  5. Package as a Marimo reactive notebook so users can toggle individual heads on/off via checkboxes and see accuracy update live.
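
Steps 3 and 4 can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: `zero_ablation_hooks`, `ablation_sweep`, `toy_evaluate`, and the `evaluate` callable are hypothetical names, and the hook names assume TransformerLens's `blocks.{layer}.attn.hook_z` activation with layout [batch, position, head, d_head]. The toy evaluator stands in for a real model run so the loop itself can be executed anywhere.

```python
from typing import Callable, Dict, FrozenSet, List, Tuple

Head = Tuple[int, int]  # (layer, head index)


def zero_ablation_hooks(ablated: FrozenSet[Head]) -> List[Tuple[str, Callable]]:
    """Build (hook_name, hook_fn) pairs that zero the chosen heads' outputs.

    Assumption: the pairs are handed to TransformerLens via
    model.run_with_hooks(tokens, fwd_hooks=...), and hook_z has shape
    [batch, position, head, d_head].
    """
    heads_by_layer: Dict[int, List[int]] = {}
    for layer, head in sorted(ablated):
        heads_by_layer.setdefault(layer, []).append(head)

    def make_hook(heads: List[int]) -> Callable:
        def hook_fn(z, hook):
            # Zero-ablation; mean-ablation would write a precomputed mean instead.
            z[:, :, heads, :] = 0.0
            return z
        return hook_fn

    return [
        (f"blocks.{layer}.attn.hook_z", make_hook(heads))
        for layer, heads in heads_by_layer.items()
    ]


def ablation_sweep(
    n_layers: int,
    n_heads: int,
    evaluate: Callable[[FrozenSet[Head]], float],
) -> Tuple[float, List[List[float]]]:
    """Ablate one head at a time; return baseline accuracy and a layer x head grid of deltas."""
    baseline = evaluate(frozenset())
    deltas = [
        [evaluate(frozenset({(layer, head)})) - baseline for head in range(n_heads)]
        for layer in range(n_layers)
    ]
    return baseline, deltas


def toy_evaluate(ablated: FrozenSet[Head]) -> float:
    """Stand-in evaluator: pretends that ablating head (1, 2) improves accuracy."""
    return 0.5 + (0.125 if (1, 2) in ablated else 0.0)


baseline, deltas = ablation_sweep(n_layers=2, n_heads=4, evaluate=toy_evaluate)
```

A real `evaluate` would run the model over the test set with `zero_ablation_hooks(ablated)` installed and return the fraction of examples answered correctly. The resulting `deltas` grid can be passed to Plotly as `go.Heatmap(z=deltas)`. Taking the ablated set as an explicit, immutable argument keeps each evaluation free of hidden state, which also suits a reactive notebook.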

Risks

  • Ablating heads that handle multiple functions simultaneously will produce confounded results — a head that hurts negation may also be critical for syntax, making interpretation ambiguous.
  • Small test sets (100-200 examples) will produce noisy accuracy deltas: a single example is worth 0.5 to 1 percentage point, so differences under about 5 percentage points are likely noise and can mislead you about which heads actually matter.
  • Marimo’s reactive execution model can create infinite loops or stale state when ablation state is stored in mutable globals, so cell dependencies need careful design.
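
One way to check whether an accuracy delta is distinguishable from noise is a paired bootstrap over per-example correctness. The sketch below is an assumption about how you might do it, not part of the original project plan; `paired_bootstrap_delta` and its parameters are hypothetical names. It relies on the same examples being scored with and without the ablation.

```python
import random
from typing import Sequence, Tuple


def paired_bootstrap_delta(
    base_correct: Sequence[bool],
    ablated_correct: Sequence[bool],
    n_resamples: int = 2000,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (observed delta, lower, upper) for ablated minus baseline accuracy.

    The bounds form an approximate 95% percentile interval obtained by
    resampling examples with replacement.
    """
    if len(base_correct) != len(ablated_correct) or not base_correct:
        raise ValueError("need two non-empty, equal-length correctness lists")

    n = len(base_correct)
    rng = random.Random(seed)
    # Per-example change: +1 fixed by the ablation, -1 broken, 0 unchanged.
    diffs = [int(a) - int(b) for a, b in zip(ablated_correct, base_correct)]
    observed = sum(diffs) / n

    samples = []
    for _ in range(n_resamples):
        resample = [diffs[rng.randrange(n)] for _ in range(n)]
        samples.append(sum(resample) / n)
    samples.sort()

    lo_idx = int(0.025 * n_resamples)
    hi_idx = max(lo_idx, int(0.975 * n_resamples) - 1)
    return observed, samples[lo_idx], samples[hi_idx]


observed, lower, upper = paired_bootstrap_delta(
    base_correct=[True, False, True, False],
    ablated_correct=[True, True, True, False],
)
```

If the interval between `lower` and `upper` includes zero, the head's effect cannot be separated from sampling noise at that test-set size.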

Business Angle

A $49 one-time Jupyter/Marimo notebook toolkit for mechanistic interpretability researchers to ablate attention heads and visualize negation accuracy in real time

Customer: PhD students and postdocs in ML interpretability labs (Anthropic, EleutherAI, independent researchers) who have read the ROME/MEMIT/negation papers and want to reproduce or extend findings without spending a week wiring up TransformerLens from scratch

Pricing: one-time, roughly $390–$490 in first 90 days (8–10 sales at $49)
