Model Edit Reversal Curse Auditor

A testing harness that applies knowledge edits to a model and automatically checks whether the edit propagates to reversed and paraphrased queries.

Difficulty: 1 week | Stack: Python, FastAPI, ROME / MEMIT (model editing methods), Hugging Face Transformers, SQLite, Streamlit

Who this is for

AI safety researchers and teams using model editing in production who need to know whether an edit is truly internalized or just surface-level — catching the reversal curse before it causes incorrect outputs in deployment.

Build steps

Integrate ROME or MEMIT to apply single-fact edits to GPT-2 XL or Llama 3 8B; wrap the editor in a simple Python API that accepts (subject, relation, old_object, new_object) and returns an edited copy of the model.
For each edit, auto-generate a test suite of 4 query types: forward (Who is X married to?), reversed (Who is married to Y?), paraphrased forward, and multi-hop (What country does X’s spouse come from?).
Run all 4 query types against both the original and edited model; log exact-match and semantic similarity (using a small embedding model) between model output and expected new fact.
Store results in SQLite: edit ID, query type, pass/fail, confidence score, and a flag for whether the reversed query failed while the forward query passed, which marks a classic reversal curse instance.
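Build a Streamlit dashboard showing per-edit propagation scores as a 2×2 grid (forward vs. reverse × direct vs. paraphrase), with a red/yellow/green trust rating per edit. Filling the reverse × paraphrase cell requires adding a paraphrased reversed query to the four query types; multi-hop results sit outside the grid.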
Add a FastAPI endpoint so the auditor can be called programmatically as part of a CI pipeline before any edited model is promoted to staging.
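
The query-generation, scoring, and storage steps above can be sketched in a few dozen lines. This is a minimal illustration under stated assumptions, not a reference implementation: the `Edit` dataclass, the template fields, the `results` table schema, and the `generate` callable (a stand-in for whatever wraps the edited model) are all hypothetical names. It uses a loose containment check in place of exact match, and leaves out multi-hop queries, semantic similarity, and the confidence score.

```python
import sqlite3
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class Edit:
    """Hypothetical record for one applied edit and its query templates."""
    edit_id: str
    subject: str
    new_object: str
    forward_template: str      # uses {subject}
    reverse_template: str      # uses {object}
    paraphrase_template: str   # uses {subject}


def build_queries(edit: Edit) -> Dict[str, Tuple[str, str]]:
    """Map each query type to (prompt, expected answer)."""
    return {
        "forward": (edit.forward_template.format(subject=edit.subject), edit.new_object),
        # The reversed query asks about the new object and expects the subject back.
        "reversed": (edit.reverse_template.format(object=edit.new_object), edit.subject),
        "paraphrased_forward": (
            edit.paraphrase_template.format(subject=edit.subject),
            edit.new_object,
        ),
    }


def answer_matches(output: str, expected: str) -> bool:
    """Case-insensitive containment check; a loose stand-in for exact match."""
    return expected.strip().lower() in output.strip().lower()


def init_db(conn: sqlite3.Connection) -> None:
    # Assumed schema: one row per (edit, query type).
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results ("
        "edit_id TEXT, query_type TEXT, passed INTEGER, reversal_curse INTEGER)"
    )


def audit_edit(edit: Edit, generate: Callable[[str], str], conn: sqlite3.Connection) -> bool:
    """Run every query through `generate`, store the outcomes, return the reversal flag."""
    outcomes = {
        qtype: answer_matches(generate(prompt), expected)
        for qtype, (prompt, expected) in build_queries(edit).items()
    }
    # Reversal curse: the forward query passes but the reversed query fails.
    reversal_curse = outcomes["forward"] and not outcomes["reversed"]
    conn.executemany(
        "INSERT INTO results VALUES (?, ?, ?, ?)",
        [(edit.edit_id, q, int(ok), int(reversal_curse)) for q, ok in outcomes.items()],
    )
    conn.commit()
    return reversal_curse


def stub_generate(prompt: str) -> str:
    # Stand-in for an edited model that only answers subject-first prompts.
    return "Bob" if "Alice" in prompt else "I don't know"


edit = Edit(
    "e1", "Alice", "Bob",
    "Who is {subject} married to?",
    "Who is married to {object}?",
    "Who is the spouse of {subject}?",
)
conn = sqlite3.connect(":memory:")
init_db(conn)
flagged = audit_edit(edit, stub_generate, conn)
```

In a real run, `generate` would call the edited model (for example through a Hugging Face text-generation wrapper), and the same `audit_edit` call would be repeated against the original model for comparison.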

Risks

ROME and MEMIT have known instability on larger models (>13B parameters) and may corrupt unrelated facts during editing, making it hard to isolate the reversal curse from general edit degradation.
Template-generated reversed and paraphrased queries can read unnaturally, so a model may fail them because of awkward grammar rather than missing knowledge, which confounds the results.
Semantic similarity scoring using embeddings is sensitive to threshold choice; a bad threshold means the auditor either misses real failures or fires false alarms, undermining its usefulness as a trust signal.
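
One way to make the threshold risk concrete is to sweep candidate thresholds over a small hand-labeled sample and watch the two error types trade off. The sketch below assumes such a sample exists as (similarity, truly_correct) pairs; the function name and the toy values are made up for illustration and are not measurements.

```python
from typing import List, Tuple


def error_rates(scored: List[Tuple[float, bool]], threshold: float) -> Tuple[float, float]:
    """Return (missed_failure_rate, false_alarm_rate) at a given threshold.

    scored: (similarity, truly_correct) pairs from a hand-labeled sample.
    An output counts as a pass when its similarity is >= threshold.
    """
    wrong = [s for s, ok in scored if not ok]
    right = [s for s, ok in scored if ok]
    # Missed failure: a wrong output that still scores high enough to pass.
    missed = sum(s >= threshold for s in wrong) / len(wrong) if wrong else 0.0
    # False alarm: a correct output that scores too low and is marked failed.
    false_alarm = sum(s < threshold for s in right) / len(right) if right else 0.0
    return missed, false_alarm


# Toy values, invented purely to show the trade-off.
labeled = [(0.92, True), (0.81, True), (0.64, True), (0.70, False), (0.35, False)]
for t in (0.5, 0.75, 0.9):
    print(t, error_rates(labeled, t))
```

With these toy values, the lowest threshold lets half of the wrong outputs pass, while the highest rejects two of the three correct ones, so the threshold is better calibrated on labeled examples than picked by hand.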

Business Angle