Observational Equivalence Test Generator
Automatically generate pairs of test cases that are surface-identical but semantically different, to catch agents gaming shallow checks.
Difficulty: weekend | Stack: Python, ast, hypothesis, OpenAI API or Anthropic SDK, Pytest
Who this is for
Developers writing agent-based coding assistants who need to verify the agent didn’t just produce code that passes tests while violating semantic intent
Build steps
- Accept a function spec (docstring + type hints) and a reference implementation as input
- Use an LLM to generate N surface-different alternative implementations: some semantically equivalent to the reference, some subtly wrong (off-by-one errors, mishandled edge cases)
- Auto-generate property-based tests via Hypothesis that probe semantic correctness beyond the obvious happy path
- Run both the agent-under-test output and the generated alternatives against the property tests; flag any that pass unit tests but fail property tests
- Output a report: which agent outputs are observationally equivalent to reference vs. merely test-passing
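The last two build steps can be sketched roughly as follows. This is a minimal illustration, not the tool itself: the spec ("return the largest element of a non-empty list"), the candidate functions, and the helper names `passes_unit_tests`, `find_counterexample`, and `classify` are all hypothetical. To stay dependency-free, the sketch uses the standard library's seeded `random` sampler as a stand-in for Hypothesis-generated inputs; in the real pipeline a Hypothesis strategy would take its place.

```python
import random
from typing import Callable, Dict, List, Optional

Fn = Callable[[List[int]], int]

def reference_max(xs: List[int]) -> int:
    # Reference implementation for the hypothetical spec.
    return max(xs)

def candidate_loop(xs: List[int]) -> int:
    # Surface-different but semantically equivalent: explicit loop.
    best = xs[0]
    for x in xs[1:]:
        if x > best:
            best = x
    return best

def candidate_off_by_one(xs: List[int]) -> int:
    # Subtly wrong: the loop never inspects the last element.
    best = xs[0]
    for x in xs[:-1]:
        if x > best:
            best = x
    return best

def candidate_first(xs: List[int]) -> int:
    # Plainly wrong: returns the first element.
    return xs[0]

# A deliberately shallow unit-test suite (happy path only).
UNIT_CASES = [([3, 1, 2], 3), ([1, 4, 2], 4)]

def passes_unit_tests(fn: Fn) -> bool:
    return all(fn(list(xs)) == expected for xs, expected in UNIT_CASES)

def find_counterexample(fn: Fn, reference: Fn,
                        trials: int = 500, seed: int = 0) -> Optional[List[int]]:
    # Differential check: compare fn with the reference on sampled inputs.
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.randint(-5, 5) for _ in range(rng.randint(1, 6))]
        if fn(list(xs)) != reference(list(xs)):
            return xs
    return None

def classify(fn: Fn, reference: Fn) -> str:
    if not passes_unit_tests(fn):
        return "fails-unit-tests"
    if find_counterexample(fn, reference) is not None:
        return "merely-test-passing"
    # Equivalent only on the sampled inputs; this is evidence, not proof.
    return "observationally-equivalent"

candidates: Dict[str, Fn] = {
    "loop": candidate_loop,
    "off_by_one": candidate_off_by_one,
    "first": candidate_first,
}
report = {name: classify(fn, reference_max) for name, fn in candidates.items()}
```

The same classification can double as a sanity check on the adversarial set: an alternative that the LLM labeled "wrong" but that lands in the observationally-equivalent bucket is a candidate for manual review.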
Risks
- LLM-generated ‘wrong’ alternatives may accidentally be correct, polluting the adversarial set; this calls for a human spot-check, or a formal verifier for simple numeric functions
- Hypothesis shrinking can be slow for complex input types; cap the example budget to stay under 16h
- Only works well for pure functions — stateful agents or I/O-heavy code requires significant extra scaffolding
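Since the approach assumes pure functions, one option is a cheap pre-screen with `ast` (already in the stack) before any tests are generated. The sketch below is a heuristic under stated assumptions, not a purity proof: the function name `impurity_signals` and the `IMPURE_CALLS` list are hypothetical, and the check only catches `global`/`nonlocal` statements and direct calls to a few obviously side-effecting builtins.

```python
import ast
from typing import List, Tuple

# Hypothetical, intentionally short list of builtins that imply I/O.
IMPURE_CALLS = {"open", "print", "input"}

def impurity_signals(source: str) -> List[Tuple[int, str]]:
    """Return (line number, description) pairs for obvious impurity hints."""
    signals: List[Tuple[int, str]] = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.Global, ast.Nonlocal)):
            # Shared mutable state declared inside the function body.
            signals.append((node.lineno, f"{type(node).__name__.lower()} statement"))
        elif (isinstance(node, ast.Call)
              and isinstance(node.func, ast.Name)
              and node.func.id in IMPURE_CALLS):
            # Direct call to a builtin that performs I/O.
            signals.append((node.lineno, f"call to {node.func.id}()"))
    # ast.walk is breadth-first, so sort to report in source order.
    return sorted(signals)

pure_src = "def f(xs):\n    return sorted(xs)[0]\n"
impure_src = (
    "def g(path):\n"
    "    global CACHE\n"
    "    with open(path) as fh:\n"
    "        return fh.read()\n"
)
```

An empty result does not guarantee purity (method calls, imported modules, and mutation of arguments all slip through), so the screen is best treated as a fast way to reject clearly unsuitable inputs.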
Business Angle
Pytest plugin that auto-generates semantically-adversarial test pairs to catch AI coding agents gaming shallow test suites
Customer: Solo dev or small team building an AI coding assistant (Cursor competitor, internal code-gen tool, or agent framework) who ships evals as part of their product quality loop — not academia, not enterprise QA
Pricing: open-core — $800 MRR in 4 months (16 teams × $50/mo pro tier)