Model Edit Reversal Curse Auditor
A testing harness that applies knowledge edits to a model and automatically checks whether the edit propagates to reversed and paraphrased queries.
Difficulty: 1-week | Stack: Python, FastAPI, ROME / MEMIT (model editing libraries), HuggingFace Transformers, SQLite, Streamlit
Who this is for
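AI safety researchers and teams using model editing in production who need to know whether an edit is truly internalized or only surface-level. The reversal curse is the failure mode where a model that correctly answers "A is B" cannot answer the reversed "B is A" query; the auditor catches it before it causes incorrect outputs in deployment.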
Build steps
- Integrate ROME or MEMIT to apply single-fact edits to GPT-2 XL or Llama 3 8B; wrap it in a simple Python API that accepts (subject, relation, old_object, new_object) and returns an edited model copy.
- For each edit, auto-generate a test suite of 4 query types: forward (Who is X married to?), reversed (Who is married to Y?), paraphrased forward, and multi-hop (What country does X’s spouse come from?).
- Run all 4 query types against both the original and the edited model; for each output, log exact match and semantic similarity (computed with a small embedding model) against the expected new fact.
- Store results in SQLite: edit ID, query type, pass/fail, confidence score, and whether the reversal specifically failed while forward passed — flagging classic reversal curse instances.
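- Build a Streamlit dashboard showing per-edit propagation scores as a 2×2 grid (forward vs. reverse × direct vs. paraphrase), with a red/yellow/green trust rating per edit. Filling the reverse-paraphrase cell requires a paraphrased reversed query in addition to the 4 query types in the test suite; report multi-hop results alongside the grid.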
- Add a FastAPI endpoint so the auditor can be called programmatically as part of a CI pipeline before any edited model is promoted to staging.
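The query-generation, scoring, and storage steps above can be sketched as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: the `Edit` dataclass, the `TEMPLATES` table, the `audit_results` schema, and the `generate` callable (a stand-in for the edited model's text generation) are all hypothetical names introduced for illustration. Scoring here is a simple case-insensitive containment check, and the confidence column is a placeholder for the embedding-based similarity score.

```python
import sqlite3
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Edit:
    edit_id: str
    subject: str
    relation: str
    old_object: str
    new_object: str


# Hypothetical templates: query type -> (prompt template, key of expected answer).
TEMPLATES = {
    "spouse": {
        "forward": ("Who is {subject} married to?", "new_object"),
        "reversed": ("Who is married to {new_object}?", "subject"),
        "paraphrased_forward": ("Name the spouse of {subject}.", "new_object"),
        "multi_hop": ("What country does {subject}'s spouse come from?", "hop_answer"),
    }
}


def build_suite(edit: Edit, hop_answer: Optional[str] = None):
    """Return (query_type, prompt, expected_answer) triples for one edit."""
    fields = {"subject": edit.subject, "new_object": edit.new_object,
              "hop_answer": hop_answer}
    suite = []
    for qtype, (template, answer_key) in TEMPLATES[edit.relation].items():
        if fields[answer_key] is None:
            continue  # multi-hop needs a second fact the edit tuple does not carry
        suite.append((qtype, template.format(**fields), fields[answer_key]))
    return suite


def contains_answer(output: str, expected: str) -> bool:
    """Case-insensitive containment check, a loose form of exact match."""
    return expected.strip().lower() in output.strip().lower()


def audit(edit: Edit, generate: Callable[[str], str],
          conn: sqlite3.Connection, hop_answer: Optional[str] = None) -> bool:
    """Run the suite, store one row per query, return the reversal-curse flag."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS audit_results ("
        "edit_id TEXT, query_type TEXT, passed INTEGER, "
        "confidence REAL, reversal_curse INTEGER)"
    )
    results = {}
    for qtype, prompt, expected in build_suite(edit, hop_answer):
        results[qtype] = contains_answer(generate(prompt), expected)
    # Classic reversal curse: the forward query passes while the reversed one fails.
    curse = results["forward"] and not results["reversed"]
    conn.executemany(
        "INSERT INTO audit_results VALUES (?, ?, ?, ?, ?)",
        # Confidence is a placeholder (1.0 or 0.0) until a similarity score is wired in.
        [(edit.edit_id, q, int(p), 1.0 if p else 0.0, int(curse))
         for q, p in results.items()],
    )
    conn.commit()
    return curse
```

To try it without a real model, pass any function from prompt to string as `generate` and an in-memory database from `sqlite3.connect(":memory:")`. Keeping the model behind a plain callable lets the same audit run against both the original and the edited model.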
Risks
- ROME and MEMIT can be unstable on larger models (>13B parameters) and may corrupt unrelated facts during editing, which makes it hard to separate reversal-curse failures from general edit degradation.
- Template-generated reversed and paraphrased queries can read unnaturally, so a model may fail them because of the phrasing rather than missing knowledge, which confounds the results.
- Semantic similarity scoring using embeddings is sensitive to threshold choice: a threshold set too low lets real failures pass, while one set too high fires false alarms. Either error undermines the auditor's usefulness as a trust signal.
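One way to make the threshold risk concrete is to sweep candidate thresholds over a small hand-labeled calibration set and count both error types at each value. The sketch below assumes each calibration item is a (similarity_score, is_correct) pair, where is_correct is a human judgment of whether the output conveys the expected fact. The function names and the toy scores are hypothetical and purely illustrative, not measured results.

```python
def sweep_thresholds(scored, thresholds):
    """For each threshold, count missed failures and false alarms.

    scored: iterable of (similarity_score, is_correct) pairs.
    An output "passes" the auditor when its score is >= the threshold.
    """
    scored = list(scored)
    rows = []
    for t in thresholds:
        # Wrong output that still passes: the auditor misses a real failure.
        missed = sum(1 for s, ok in scored if not ok and s >= t)
        # Correct output that is rejected: the auditor fires a false alarm.
        false_alarms = sum(1 for s, ok in scored if ok and s < t)
        rows.append((t, missed, false_alarms))
    return rows


def pick_threshold(scored, thresholds):
    """Pick the threshold with the fewest total errors (ties go to the lower value)."""
    return min(sweep_thresholds(scored, thresholds),
               key=lambda row: (row[1] + row[2], row[0]))[0]


# Synthetic calibration set, for illustration only.
toy_scores = [(0.92, True), (0.81, True), (0.77, False),
              (0.60, True), (0.55, False), (0.30, False)]

for threshold, missed, false_alarms in sweep_thresholds(toy_scores, [0.5, 0.7, 0.8, 0.9]):
    print(f"threshold={threshold:.1f} missed={missed} false_alarms={false_alarms}")
```

Weighting the two error types equally is itself a design choice; for a CI gate, missed failures are probably costlier than false alarms, so a weighted sum may fit better.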
Business Angle
A hosted audit service that stress-tests model edits for reversal-curse failures before they ship to production.
Customer: ML engineer at a Series A–C AI startup who owns a RAG or fine-tuning pipeline and has recently started using ROME/MEMIT to patch factual errors in a deployed model without full retraining. Typically works solo or in a 2-person ML team with no dedicated safety hire.
Pricing: SaaS subscription, targeting $800 MRR within 4 months (8 paying teams at $100/mo)