Steering Vector Leakage Auditor

CLI tool that measures cross-concept contamination when applying activation steering vectors to a local LLM.

Difficulty: weekend | Stack: Python, TransformerLens, nnsight, Llama-3.2-1B or Gemma-2B, matplotlib

Who this is for

Alignment researchers and ML engineers who use activation steering and need empirical data on whether their vectors bleed into adjacent concept dimensions.

Build steps

Load a small open-weight model (Llama-3.2-1B) via TransformerLens; extract residual stream activations for 2-3 concept pairs (e.g., sentiment vs. formality, factual vs. fictional).
Compute steering vectors via mean-difference on contrastive prompt sets; train a logistic-regression probe for each target concept and check it on held-out activations.
Apply the steering vector for concept A, then measure how far the concept B probe's confidence shifts on the steered activations; that shift is the leakage score.
Sweep steering vector magnitudes (0.5x–3x); plot leakage score vs. magnitude per concept pair to show cylindrical instability empirically.
Output a JSON report + matplotlib heatmap: rows=source concept, cols=affected concept, cells=leakage delta.
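
The build steps above can be sketched end to end on synthetic data. This is a minimal sketch, not the tool itself: Gaussian vectors stand in for residual stream activations, a small gradient-descent probe stands in for a library logistic regression, and every function name, dimension, and sample count here is a hypothetical choice made for illustration.

```python
import json
import numpy as np

def mean_difference_vector(pos_acts, neg_acts):
    """Steering vector: mean positive-class activation minus mean negative-class activation."""
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

def fit_logistic_probe(X, y, lr=0.1, steps=500):
    """Tiny gradient-descent logistic regression (illustrative stand-in). Returns (weights, bias)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        grad = p - y
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def probe_confidence(X, w, b):
    """Probe's predicted probability of the positive class for each activation row."""
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))

def leakage_sweep(eval_acts, steer_vec, probe, scales):
    """Mean confidence shift of `probe` when `steer_vec` is added at each scale."""
    w, b = probe
    base = probe_confidence(eval_acts, w, b).mean()
    return {
        s: float(probe_confidence(eval_acts + s * steer_vec, w, b).mean() - base)
        for s in scales
    }

# Synthetic setup (assumption): concept B's direction partially overlaps concept A's.
rng = np.random.default_rng(0)
d = 16
dir_a = np.zeros(d)
dir_a[0] = 1.0
dir_b = np.zeros(d)
dir_b[0], dir_b[1] = 0.3, 1.0

def sample(direction, n):
    """Draw n fake activations centred on 2 * direction."""
    return rng.normal(size=(n, d)) + 2.0 * direction

steer_a = mean_difference_vector(sample(dir_a, 200), sample(-dir_a, 200))

X_b = np.vstack([sample(dir_b, 200), sample(-dir_b, 200)])
y_b = np.concatenate([np.ones(200), np.zeros(200)])
probe_b = fit_logistic_probe(X_b, y_b)

# Neutral activations that were not used to fit the probe.
eval_acts = rng.normal(size=(100, d))
report = leakage_sweep(eval_acts, steer_a, probe_b, [0.0, 0.5, 1.0, 2.0, 3.0])

# One cell of the source-by-affected report, serialised as JSON.
print(json.dumps({"source": "A", "affected": "B",
                  "leakage_by_scale": {str(s): v for s, v in report.items()}}, indent=2))
```

In a real run, the synthetic `sample` calls would be replaced by activations captured from the model, and one such sweep per ordered concept pair would fill the heatmap cells.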

Risks

Small models may not have cleanly separable concept representations, so the leakage signal can be indistinguishable from noise; use at least 500 contrastive pairs per concept.
TransformerLens hook API changes between versions can silently break activation capture.
Probe accuracy on the base concepts may be too low (<70%) to give meaningful leakage deltas; validate probes before running leakage sweeps.
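
One way to guard against these risks is a preflight check that runs before any sweep. The sketch below is an assumption about how such a gate might look, using only the standard library: the thresholds come from the list above, while the function names and the `transformer-lens` distribution name are hypothetical and should be checked against the actual install.

```python
from importlib import metadata

MIN_PROBE_ACCURACY = 0.70     # below this, leakage deltas are not meaningful
MIN_CONTRASTIVE_PAIRS = 500   # below this, signal may be indistinguishable from noise

def probe_accuracy(predicted_probs, labels, threshold=0.5):
    """Fraction of held-out examples the probe classifies correctly."""
    correct = sum((p >= threshold) == bool(y) for p, y in zip(predicted_probs, labels))
    return correct / len(labels)

def preflight_problems(accuracy, n_pairs):
    """Return a list of human-readable reasons not to run the leakage sweep."""
    problems = []
    if accuracy < MIN_PROBE_ACCURACY:
        problems.append(f"probe accuracy {accuracy:.2f} is below {MIN_PROBE_ACCURACY:.2f}")
    if n_pairs < MIN_CONTRASTIVE_PAIRS:
        problems.append(f"only {n_pairs} contrastive pairs, need {MIN_CONTRASTIVE_PAIRS}")
    return problems

def installed_version(package):
    """Installed version of a distribution, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

# Toy held-out predictions: two of the four are correct.
acc = probe_accuracy([0.9, 0.2, 0.6, 0.4], [1, 0, 0, 1])
for problem in preflight_problems(acc, n_pairs=120):
    print("SKIP SWEEP:", problem)

# Recording the library version in the report makes hook API drift easier to trace.
print("transformer-lens version:", installed_version("transformer-lens"))
```

Writing the recorded version into the JSON report alongside the leakage scores lets a later reader tell whether two runs used the same hook API.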
