Negation Ablation Sandbox
An interactive notebook that lets you ablate late-layer attention heads in a transformer and watch negation accuracy change in real time.
Difficulty: weekend | Stack: Python, TransformerLens, Jupyter / Marimo, Plotly, HuggingFace Datasets
Who this is for
Interpretability researchers and curious ML practitioners who want hands-on intuition for how specific attention heads suppress or promote correct negation handling, by directly replicating and extending the mechanistic negation findings.
Build steps
- Curate a small negation test set (100-200 examples) using existing datasets like MultiNLI filtered for negation words, plus hand-crafted minimal pairs (e.g. ‘Paris is the capital of France’ vs. ‘Paris is not the capital of France’).
- Use TransformerLens to run GPT-2 Medium or a similar model on this set with full hook access; log per-head attention patterns and a baseline negation accuracy.
- Implement a head-ablation loop: zero out or mean-ablate attention heads one at a time (or in ranked groups by layer) and re-evaluate negation accuracy after each ablation.
- Plot a heatmap (layer × head) of accuracy delta from ablation using Plotly; highlight heads where ablation improves negation accuracy — these are the ‘shortcut promoters’.
- Package as a Marimo reactive notebook so users can toggle individual heads on/off via checkboxes and see accuracy update live.
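The ablation loop in the steps above can be sketched roughly as follows. This is a minimal sketch, not a reference implementation. It assumes "negation accuracy" means the model assigns a higher next-token logit to a correct single-token answer than to an incorrect one; that metric, the `pairs` format, and the helper names `make_evaluator`, `ablation_deltas`, and `to_grid` are illustrative assumptions, not part of the original plan. Only zero-ablation is shown; mean-ablation would replace the zero with a cached mean activation.

```python
from functools import partial
from typing import Callable, Dict, List, Sequence, Tuple

Head = Tuple[int, int]  # (layer index, head index)


def zero_head(z, hook, head: int):
    """Zero one head's output at hook_z; z is [batch, pos, head_index, d_head]."""
    z[:, :, head, :] = 0.0
    return z


def make_evaluator(model, pairs: Sequence[Tuple[str, str, str]]):
    """Build evaluate(ablated_heads) -> accuracy for a HookedTransformer.

    `pairs` is a hypothetical list of (prompt, correct_token, wrong_token)
    strings, where both answers tokenize to a single token.
    """
    from transformer_lens import utils  # lazy import: only needed with a real model

    def evaluate(ablated: Sequence[Head]) -> float:
        # One forward hook per ablated head, attached at that layer's hook_z.
        hooks = [
            (utils.get_act_name("z", layer), partial(zero_head, head=head))
            for layer, head in ablated
        ]
        correct = 0
        for prompt, good, bad in pairs:
            logits = model.run_with_hooks(model.to_tokens(prompt), fwd_hooks=hooks)
            last = logits[0, -1]  # logits at the final position
            good_id = model.to_single_token(good)
            bad_id = model.to_single_token(bad)
            correct += int(last[good_id] > last[bad_id])
        return correct / len(pairs)

    return evaluate


def ablation_deltas(
    evaluate: Callable[[List[Head]], float], n_layers: int, n_heads: int
) -> Dict[Head, float]:
    """Accuracy change from ablating each head alone (positive: ablation helps)."""
    baseline = evaluate([])
    return {
        (layer, head): evaluate([(layer, head)]) - baseline
        for layer in range(n_layers)
        for head in range(n_heads)
    }


def to_grid(deltas: Dict[Head, float], n_layers: int, n_heads: int) -> List[List[float]]:
    """Arrange deltas as rows = layers, columns = heads, ready for a heatmap."""
    return [[deltas[(l, h)] for h in range(n_heads)] for l in range(n_layers)]
```

With a real model, one would load it via `HookedTransformer.from_pretrained("gpt2-medium")` and call `ablation_deltas(make_evaluator(model, pairs), model.cfg.n_layers, model.cfg.n_heads)`, ideally with gradients disabled since only forward passes are needed. The grid from `to_grid` can then be passed to a heatmap function such as Plotly's `px.imshow`. Keeping `evaluate` as a plain callable also means the notebook's checkboxes can pass in an arbitrary list of heads rather than one head at a time.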
Risks
- Ablating heads that handle multiple functions simultaneously will produce confounded results — a head that hurts negation may also be critical for syntax, making interpretation ambiguous.
- Small test sets (100-200 examples) will produce noisy accuracy deltas; differences of less than about 5 percentage points are likely noise and can mislead about which heads actually matter.
- Marimo’s reactive execution model can create infinite loops or stale state when ablation state is stored in mutable globals — requires careful cell dependency design.
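To put a rough number on the noise risk above, here is a small sketch using only the standard library. It computes the binomial standard error of an accuracy estimate and a paired bootstrap confidence interval for the accuracy delta between baseline and ablated runs on the same examples. The function names, the 2000-resample default, and the 95% level are assumptions chosen for illustration.

```python
import math
import random
from typing import Sequence, Tuple


def accuracy_standard_error(p: float, n: int) -> float:
    """Binomial standard error of an accuracy p measured on n examples."""
    return math.sqrt(p * (1.0 - p) / n)


def paired_bootstrap_ci(
    base_correct: Sequence[bool],
    ablated_correct: Sequence[bool],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap CI for (ablated accuracy - baseline accuracy).

    Both sequences hold per-example correctness for the same examples,
    in the same order, so each resample keeps the pairing intact.
    """
    if len(base_correct) != len(ablated_correct) or not base_correct:
        raise ValueError("need two non-empty sequences of equal length")
    rng = random.Random(seed)
    n = len(base_correct)
    # Per-example change: +1 fixed by ablation, -1 broken, 0 unchanged.
    diffs = [int(a) - int(b) for a, b in zip(ablated_correct, base_correct)]
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi
```

In the worst case of 50% accuracy, `accuracy_standard_error(0.5, 100)` is 0.05 and `accuracy_standard_error(0.5, 200)` is about 0.035, which is in line with treating deltas under roughly 5 points as noise. If the interval from `paired_bootstrap_ci` includes zero for a given head, that head's delta is probably not worth highlighting in the heatmap.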
Business Angle
A $49 one-time Jupyter/Marimo notebook toolkit for mechanistic interpretability researchers to ablate attention heads and visualize negation accuracy in real time
Customer: PhD students and postdocs in ML interpretability labs (Anthropic, EleutherAI, independent researchers) who have read the ROME/MEMIT/negation papers and want to reproduce or extend findings without spending a week wiring up TransformerLens from scratch
Pricing: one-time purchase at $49; target of roughly $400 to $490 in the first 90 days (8–10 sales at $49)