Negation Ablation Sandbox

An interactive notebook that lets you ablate late-layer attention heads in a transformer and watch negation accuracy change in real time.

Difficulty: weekend | Stack: Python, TransformerLens, Jupyter / Marimo, Plotly, HuggingFace Datasets

Who this is for

Interpretability researchers and curious ML practitioners who want hands-on intuition for how specific attention heads suppress or promote correct negation handling, by replicating and extending prior mechanistic findings on negation.

Build steps

Curate a small negation test set (100-200 examples) using existing datasets like MultiNLI filtered for negation words, plus hand-crafted minimal pairs (e.g. ‘Paris is the capital of France’ vs. ‘Paris is not the capital of France’).
Use TransformerLens to run GPT-2 Medium (or a similar model) on this set with full hook access; log per-head attention patterns and the baseline accuracy.
Implement a head-ablation loop: zero out or mean-ablate attention heads one at a time (or in ranked groups by layer) and re-evaluate negation accuracy after each ablation.
Plot a heatmap (layer × head) of the accuracy delta from ablation using Plotly; highlight heads whose ablation improves negation accuracy, since these are the candidate ‘shortcut promoters’.
Package as a Marimo reactive notebook so users can toggle individual heads on/off via checkboxes and see accuracy update live.
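
The ablation loop and the layer × head grid described in the steps above could be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the helper names (`zero_head_hook`, `accuracy_delta_grid`, `make_gpt2_scorer`) are hypothetical, and the scoring rule (an example counts as correct when the logit of the expected next token beats that of a distractor token) is an assumption that does not come from the brief. The TransformerLens part is imported lazily so the grid logic can be exercised without downloading a model.

```python
from functools import partial
from typing import Callable, List, Optional, Tuple

Head = Tuple[int, int]  # (layer, head)


def zero_head_hook(z, hook, head: int):
    """Zero-ablate one head. `z` is the per-head attention output,
    shaped [batch, pos, head_index, d_head] in TransformerLens."""
    z[:, :, head, :] = 0.0
    return z


def accuracy_delta_grid(
    score: Callable[[Optional[Head]], float], n_layers: int, n_heads: int
) -> List[List[float]]:
    """grid[layer][head] = accuracy with that head ablated minus baseline."""
    baseline = score(None)
    return [
        [score((layer, head)) - baseline for head in range(n_heads)]
        for layer in range(n_layers)
    ]


def make_gpt2_scorer(examples, model_name: str = "gpt2-medium"):
    """Build a `score` callable backed by TransformerLens.

    `examples` is a list of (prompt, correct_token, wrong_token) strings.
    Assumption: both answer strings tokenize to exactly one token
    (to_single_token raises otherwise).
    """
    import torch
    from transformer_lens import HookedTransformer, utils

    model = HookedTransformer.from_pretrained(model_name)

    @torch.no_grad()
    def score(ablated: Optional[Head]) -> float:
        hooks = []
        if ablated is not None:
            layer, head = ablated
            hooks = [(utils.get_act_name("z", layer),
                      partial(zero_head_hook, head=head))]
        n_correct = 0
        for prompt, good, bad in examples:
            logits = model.run_with_hooks(model.to_tokens(prompt), fwd_hooks=hooks)
            last = logits[0, -1]  # next-token logits at the final position
            good_id = model.to_single_token(good)
            bad_id = model.to_single_token(bad)
            n_correct += int(last[good_id] > last[bad_id])
        return n_correct / len(examples)

    return score, model.cfg.n_layers, model.cfg.n_heads
```

With a scorer in hand, `accuracy_delta_grid(score, n_layers, n_heads)` yields a nested list that should be usable as the input to a Plotly heatmap (for instance `plotly.express.imshow` with a diverging colour scale centred on zero, so that positive deltas stand out). GPT-2 Medium has 24 layers of 16 heads, so a full single-head sweep means 384 passes over the test set; mean-ablation would need a different hook than the zero-ablation shown here.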

Risks

Ablating heads that serve several functions at once will produce confounded results: a head that hurts negation may also be critical for syntax, which makes the accuracy change hard to interpret.
Small test sets (100-200 examples) will produce noisy accuracy deltas; differences of less than about 5 percentage points are likely noise and can give a misleading picture of which heads actually matter.
Marimo’s reactive execution model does not track in-place mutations and rejects cyclic cell dependencies, so ablation state stored in mutable globals can go stale without triggering a re-run; this requires careful cell dependency design.
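
One way to gauge the noise risk from small test sets is to put a confidence interval on each accuracy delta. The sketch below uses a paired bootstrap over examples; the function name `paired_bootstrap_ci` and its defaults are illustrative assumptions, and the brief itself does not prescribe any particular statistical test.

```python
import random
from typing import Sequence, Tuple


def paired_bootstrap_ci(
    base_correct: Sequence[bool],
    ablated_correct: Sequence[bool],
    n_boot: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (delta, lower, upper) for ablated accuracy minus baseline accuracy.

    Both inputs hold per-example correctness on the same examples in the
    same order, so resampling keeps each example's pair together.
    """
    if len(base_correct) != len(ablated_correct) or len(base_correct) == 0:
        raise ValueError("inputs must be non-empty and of equal length")

    # Per-example difference: +1 ablation fixed it, -1 broke it, 0 unchanged.
    diffs = [int(a) - int(b) for a, b in zip(ablated_correct, base_correct)]
    n = len(diffs)
    rng = random.Random(seed)

    # Resample examples with replacement and record the mean difference.
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_boot))

    lower = means[int((alpha / 2) * n_boot)]
    upper = means[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return sum(diffs) / n, lower, upper
```

A reasonable reading is to treat a head as interesting only when its interval excludes zero. Because a full sweep tests hundreds of heads, some intervals will exclude zero by chance, so the heads this flags are candidates to re-test on fresh examples rather than confirmed findings.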
