AI Pulse
← Projects · weekend

Steering Vector Leakage Auditor

CLI tool that measures cross-concept contamination when applying activation steering vectors to a local LLM.

Difficulty: weekend | Stack: Python, TransformerLens, nnsight, Llama-3.2-1B or Gemma-2B, matplotlib

Who this is for

Alignment researchers and ML engineers who use activation steering and need empirical data on whether their vectors bleed into adjacent concept dimensions.

Build steps

  1. Load a small open-weight model (Llama-3.2-1B) via TransformerLens; extract residual stream activations for 2-3 concept pairs (e.g., sentiment vs. formality, factual vs. fictional).
  2. Compute steering vectors via mean-difference on contrastive prompt sets; build a concept probe for each target concept using logistic regression on held-out activations.
  3. Apply steering vector for concept A; measure probe confidence shift on concept B activations — this is the leakage score.
  4. Sweep steering vector magnitudes (0.5x–3x); plot leakage score vs. magnitude per concept pair to show cylindrical instability empirically.
  5. Output a JSON report + matplotlib heatmap: rows=source concept, cols=affected concept, cells=leakage delta.

Risks

  • Small models may not have cleanly separable concept representations, making leakage signal indistinguishable from noise — need at least 500 contrastive pairs per concept.
  • TransformerLens hook API changes between versions can silently break activation capture.
  • Probe accuracy on the base concepts may be too low (<70%) to give meaningful leakage deltas — need to validate probes before running leakage sweeps.

Business Angle

SaaS CLI + hosted dashboard that audits activation steering vectors for cross-concept contamination, giving alignment researchers shareable leakage reports.

Customer: Independent alignment researcher or ML safety engineer at small lab (1-5 people) who runs steering experiments on local LLMs weekly, publishes findings, and needs reproducible evidence that their vectors aren't polluting adjacent concept dimensions — not a big-lab employee with infra team.

Pricing: freemium — $800 MRR in 4 months (16 paying users at $50/mo for cloud report storage + shareable audit URLs; CLI stays free/OSS)

Full business breakdown →