Observational Equivalence Test Generator

Automatically generate pairs of test cases that look identical on the surface but differ semantically, to catch agents that game shallow checks.

Difficulty: weekend | Stack: Python, ast, hypothesis, OpenAI API or Anthropic SDK, Pytest

Who this is for

Developers building agent-based coding assistants who need to verify that the agent did not merely produce code that passes the tests while violating the semantic intent.

Build steps

Accept a function spec (docstring + type hints) and a reference implementation as input
Use an LLM to generate N alternative implementations that differ from the reference on the surface: some genuinely equivalent to it, some subtly wrong (off-by-one errors, wrong edge-case handling)
Auto-generate property-based tests via Hypothesis that probe semantic correctness beyond the obvious happy path
Run both the agent-under-test output and the generated alternatives against the ordinary unit tests and the property tests; flag any implementation that passes the unit tests but fails the property tests
Output a report: which agent outputs are observationally equivalent to reference vs. merely test-passing
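The last three steps can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the `last_n` function, the two hand-written candidates (standing in for LLM-generated alternatives), and the helper names `passes_unit_tests`, `passes_property_tests`, and `classify` are all hypothetical. "Observationally equivalent" here only means "agrees with the reference on every input Hypothesis sampled", which is evidence rather than proof.

```python
from typing import Callable, Dict, List

from hypothesis import given, settings, strategies as st

Impl = Callable[[List[int], int], List[int]]


def reference(xs: List[int], n: int) -> List[int]:
    """Return the last n elements of xs (n >= 0)."""
    return xs[max(len(xs) - n, 0):]


def candidate_loop(xs: List[int], n: int) -> List[int]:
    # Surface-different but equivalent: walks the list backwards.
    out: List[int] = []
    for i in range(len(xs) - 1, -1, -1):
        if len(out) == n:
            break
        out.append(xs[i])
    out.reverse()
    return out


def candidate_slice(xs: List[int], n: int) -> List[int]:
    # Subtly wrong: xs[-0:] is the whole list, so n == 0 misbehaves.
    return xs[-n:]


def passes_unit_tests(impl: Impl) -> bool:
    # Shallow, happy-path checks of the kind an agent might game.
    return impl([1, 2, 3], 2) == [2, 3] and impl([1, 2, 3], 5) == [1, 2, 3]


def passes_property_tests(impl: Impl) -> bool:
    # Property: the candidate agrees with the reference on sampled inputs.
    @settings(max_examples=200, deadline=None, database=None, derandomize=True)
    @given(st.lists(st.integers()), st.integers(min_value=0, max_value=10))
    def agrees_with_reference(xs: List[int], n: int) -> None:
        assert impl(list(xs), n) == reference(list(xs), n)

    try:
        agrees_with_reference()
    except Exception:  # a falsified property or a crash inside the candidate
        return False
    return True


def classify(impl: Impl) -> str:
    if not passes_unit_tests(impl):
        return "fails unit tests"
    if not passes_property_tests(impl):
        return "test-passing only"
    return "observationally equivalent"


candidates: Dict[str, Impl] = {
    "candidate_loop": candidate_loop,
    "candidate_slice": candidate_slice,
}
report = {name: classify(impl) for name, impl in candidates.items()}
for name, verdict in report.items():
    print(f"{name}: {verdict}")
```

Comparing against the reference is the simplest property to generate automatically; properties derived from the docstring alone (for example, "the result is never longer than n") would be a natural extension.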

Risks

LLM-generated ‘wrong’ alternatives may accidentally be correct, polluting the adversarial set — needs human spot-check or a formal verifier for simple numeric functions
Hypothesis shrinking can be slow for complex input types; the example budget needs a cap (for instance via Hypothesis's max_examples setting) to stay under 16h
Only works well for pure functions — stateful agents or I/O-heavy code requires significant extra scaffolding
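One way to bound the Hypothesis cost mentioned above is a settings profile that caps the number of examples and leaves out the shrink phase. The sketch below assumes a profile name (`"capped"`) and a budget of 50 examples, both chosen purely for illustration; skipping shrinking trades smaller counterexamples for speed, which may or may not be acceptable for the final report.

```python
from typing import List

from hypothesis import Phase, given, settings, strategies as st

# Hypothetical profile: at most 50 examples per property, no per-example
# deadline, no on-disk example database, and no shrink phase.
settings.register_profile(
    "capped",
    max_examples=50,
    deadline=None,
    database=None,
    phases=[Phase.explicit, Phase.reuse, Phase.generate],
)
settings.load_profile("capped")

calls: List[int] = []  # records the size of each generated input


@given(st.lists(st.integers()))
def sorting_is_idempotent(xs: List[int]) -> None:
    calls.append(len(xs))
    assert sorted(sorted(xs)) == sorted(xs)


# Properties defined after load_profile pick up the capped settings.
sorting_is_idempotent()
print(f"examples run: {len(calls)}")
```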

Business Angle