Steering Vector Leakage Auditor
CLI tool that measures cross-concept contamination when applying activation steering vectors to a local LLM.
Difficulty: weekend | Stack: Python, TransformerLens, nnsight, Llama-3.2-1B or Gemma-2B, matplotlib
Who this is for
Alignment researchers and ML engineers who use activation steering and need empirical data on whether their vectors bleed into adjacent concept dimensions.
Build steps
- Load a small open-weight model (Llama-3.2-1B) via TransformerLens; extract residual stream activations for 2-3 concept pairs (e.g., sentiment vs. formality, factual vs. fictional).
- Compute steering vectors via mean-difference on contrastive prompt sets; train a logistic-regression probe for each target concept on activations held out from the steering-vector computation.
- Apply the steering vector for concept A, then measure how far the concept B probe's confidence shifts on concept B activations. That shift is the leakage score.
- Sweep steering vector magnitudes (0.5x–3x); plot leakage score vs. magnitude per concept pair to show cylindrical instability empirically.
- Output a JSON report + matplotlib heatmap: rows=source concept, cols=affected concept, cells=leakage delta.
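The build steps above can be sketched end to end on synthetic data. This is a minimal illustration, not the tool's implementation: random Gaussian vectors stand in for residual-stream activations (no model is loaded), the gradient-descent probe stands in for a library logistic regression, and every name here (`make_concept_data`, `leakage_score`, the "orthogonal" and "entangled" concept labels) is hypothetical.

```python
import json
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def make_concept_data(rng, direction, n_per_class, dim, noise=0.5, gap=2.0):
    """Synthetic stand-in for activations: positives are shifted along `direction`."""
    neg = rng.normal(0.0, noise, size=(n_per_class, dim))
    pos = rng.normal(0.0, noise, size=(n_per_class, dim)) + gap * direction
    return pos, neg

def mean_difference_vector(pos, neg):
    """Steering vector: mean positive activation minus mean negative activation."""
    return pos.mean(axis=0) - neg.mean(axis=0)

def fit_logistic_probe(X, y, lr=0.5, steps=500):
    """Logistic regression by plain gradient descent (assumed stand-in for a library fit)."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        err = sigmoid(X @ w + b) - y
        w -= lr * (X.T @ err) / len(y)
        b -= lr * err.mean()
    return w, b

def leakage_score(acts_b, steer_a, probe_b, scale):
    """Mean shift in concept B probe confidence after adding scale * steer_a."""
    w, b = probe_b
    base = sigmoid(acts_b @ w + b).mean()
    steered = sigmoid((acts_b + scale * steer_a) @ w + b).mean()
    return float(steered - base)

rng = np.random.default_rng(0)
dim, n = 16, 500
e0, e1 = np.eye(dim)[0], np.eye(dim)[1]
scales = [0.5, 1.0, 2.0, 3.0]

# Concept A lives along e0; build its steering vector from contrastive sets.
pos_a, neg_a = make_concept_data(rng, e0, n, dim)
steer_a = mean_difference_vector(pos_a, neg_a)

# Two hypothetical concept B variants: one orthogonal to A, one sharing a direction with A.
directions_b = {"orthogonal": e1, "entangled": (e0 + e1) / np.sqrt(2)}

report = {}
for name, dir_b in directions_b.items():
    pos_b, neg_b = make_concept_data(rng, dir_b, n, dim)
    X = np.vstack([pos_b, neg_b])
    y = np.concatenate([np.ones(n), np.zeros(n)])
    idx = rng.permutation(len(y))
    train, held = idx[:700], idx[700:]  # fit the probe on one split, score leakage on the other
    probe_b = fit_logistic_probe(X[train], y[train])
    report[name] = {str(s): leakage_score(X[held], steer_a, probe_b, s) for s in scales}

print(json.dumps(report, indent=2))
```

In a real run, the synthetic arrays would be replaced by activations captured from the model, and the nested `report` dict maps directly onto the heatmap layout (source concept by affected concept, one value per magnitude).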
Risks
- Small models may not have cleanly separable concept representations, making the leakage signal indistinguishable from noise. Plan for at least 500 contrastive pairs per concept.
- TransformerLens hook API changes between versions can silently break activation capture.
- Probe accuracy on the base concepts may be too low (<70%) to give meaningful leakage deltas. Validate probes before running leakage sweeps.
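One way to guard against the low-accuracy probe risk is a gate that runs before any sweep. The sketch below assumes each probe is a `(weights, bias)` pair for a linear classifier and uses the 70% figure from the list above as its cutoff; `split_by_probe_quality` and the demo data are hypothetical.

```python
import numpy as np

MIN_PROBE_ACCURACY = 0.70  # cutoff taken from the risk list; adjust as needed

def probe_accuracy(w, b, X, y):
    """Fraction of held-out examples a linear probe classifies correctly."""
    preds = (X @ w + b) > 0.0
    return float(np.mean(preds == (y == 1)))

def split_by_probe_quality(probes, heldout, min_acc=MIN_PROBE_ACCURACY):
    """Return (usable, rejected) dicts mapping concept name to held-out accuracy."""
    usable, rejected = {}, {}
    for concept, (w, b) in probes.items():
        X, y = heldout[concept]
        acc = probe_accuracy(w, b, X, y)
        (usable if acc >= min_acc else rejected)[concept] = acc
    return usable, rejected

# Toy one-dimensional data: the first probe points the right way, the second is inverted.
X_demo = np.array([[1.0], [2.0], [-1.0], [-2.0]])
y_demo = np.array([1, 1, 0, 0])
probes_demo = {
    "sentiment": (np.array([1.0]), 0.0),
    "formality": (np.array([-1.0]), 0.0),
}
heldout_demo = {name: (X_demo, y_demo) for name in probes_demo}

usable, rejected = split_by_probe_quality(probes_demo, heldout_demo)
print("usable:", usable)
print("rejected:", rejected)
```

Concepts that land in `rejected` would be excluded from the leakage heatmap or flagged in the JSON report, since a leakage delta measured with an unreliable probe is hard to interpret.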
Business angle
SaaS CLI + hosted dashboard that audits activation steering vectors for cross-concept contamination, giving alignment researchers shareable leakage reports.
Customer: an independent alignment researcher or ML safety engineer at a small lab (1-5 people) who runs steering experiments on local LLMs weekly, publishes findings, and needs reproducible evidence that their vectors aren't polluting adjacent concept dimensions. This is not a big-lab employee with an infra team.
Pricing: freemium. Target: $800 MRR within 4 months (16 paying users at $50/mo for cloud report storage and shareable audit URLs); the CLI stays free and open source.