AI Pulse

Observational Equivalence Test Generator

Automatically generate pairs of test cases that are surface-identical but semantically different, to catch agents gaming shallow checks.

Difficulty: weekend | Stack: Python, ast, hypothesis, OpenAI API or Anthropic SDK, Pytest

Who this is for

Developers building agent-based coding assistants who need to verify that the agent did not simply produce code that passes the tests while violating the semantic intent.

Build steps

  1. Accept a function spec (docstring + type hints) and a reference implementation as input
  2. Use an LLM to generate N surface-different alternative implementations: some semantically equivalent to the reference, some subtly wrong (off-by-one errors, mishandled edge cases)
  3. Auto-generate property-based tests via Hypothesis that probe semantic correctness beyond the obvious happy path
  4. Run the agent-under-test output and the generated alternatives against both the unit tests and the property tests; flag any that pass the unit tests but fail the property tests
  5. Output a report: which agent outputs are observationally equivalent to the reference vs. merely test-passing
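The steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a finished implementation: the LLM call in step 2 is stubbed with two hand-written variants, and every name here (`reference`, `equivalent_variant`, `off_by_one_variant`, `passes_unit_tests`, `passes_property_tests`, `classify`) as well as the unit-test cases are hypothetical. The property test simply checks agreement with the reference on generated inputs.

```python
from typing import Callable, Dict, List

from hypothesis import given, settings, strategies as st

Impl = Callable[[List[int], int, int], int]


def reference(xs: List[int], lo: int, hi: int) -> int:
    """Count the elements of xs that lie in the inclusive range [lo, hi]."""
    return sum(1 for x in xs if lo <= x <= hi)


# Hand-written stand-ins for the LLM-generated alternatives (assumption).
def equivalent_variant(xs: List[int], lo: int, hi: int) -> int:
    # Different surface form, same behavior as the reference.
    count = 0
    for x in xs:
        if not (x < lo or x > hi):
            count += 1
    return count


def off_by_one_variant(xs: List[int], lo: int, hi: int) -> int:
    # Subtly wrong: treats hi as exclusive.
    return len([x for x in xs if lo <= x < hi])


def passes_unit_tests(impl: Impl) -> bool:
    # A deliberately shallow, happy-path suite (hypothetical cases).
    cases = [(([1, 5, 9], 2, 8), 1), (([], 0, 10), 0), (([3, 4], 0, 10), 2)]
    try:
        return all(impl(*args) == expected for args, expected in cases)
    except Exception:
        return False


small = st.integers(min_value=-5, max_value=5)


def passes_property_tests(impl: Impl) -> bool:
    # max_examples caps the example budget; derandomize keeps runs repeatable.
    @settings(max_examples=200, deadline=None, database=None, derandomize=True)
    @given(xs=st.lists(small, max_size=6), lo=small, hi=small)
    def agrees_with_reference(xs: List[int], lo: int, hi: int) -> None:
        assert impl(xs, lo, hi) == reference(xs, lo, hi)

    try:
        agrees_with_reference()
    except Exception:
        # Any failure, including a crash in the candidate, counts as a mismatch.
        return False
    return True


def classify(candidates: Dict[str, Impl]) -> Dict[str, str]:
    report = {}
    for name, impl in candidates.items():
        unit_ok = passes_unit_tests(impl)
        prop_ok = passes_property_tests(impl)
        if unit_ok and prop_ok:
            report[name] = "observationally equivalent"
        elif unit_ok:
            report[name] = "merely test-passing"
        else:
            report[name] = "fails unit tests"
    return report


report = classify({
    "equivalent_variant": equivalent_variant,
    "off_by_one_variant": off_by_one_variant,
})
```

Note that "observationally equivalent" here only means "no disagreement found on the sampled inputs". A small integer range is used on purpose so that boundary collisions such as `x == hi` are generated often.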

Risks

  • LLM-generated ‘wrong’ alternatives may accidentally be correct, polluting the adversarial set; this needs a human spot-check or, for simple numeric functions, a formal verifier
  • Hypothesis shrinking can be slow for complex input types; the example budget needs to be capped to stay under 16h
  • Only works well for pure functions — stateful agents or I/O-heavy code requires significant extra scaffolding
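For the first risk, one cheap stand-in for a formal verifier is a bounded exhaustive check: enumerate every input in a small finite domain and keep a ‘wrong’ alternative in the adversarial set only if a counterexample exists. The sketch below assumes simple integer functions; `find_counterexample` and the three example functions are hypothetical names, and passing the check proves nothing outside the enumerated domain.

```python
from itertools import product
from typing import Callable, Iterable, Optional, Tuple


def find_counterexample(
    reference: Callable[..., int],
    candidate: Callable[..., int],
    domain: Iterable[int],
    arity: int,
) -> Optional[Tuple[int, ...]]:
    """Return the first input where candidate disagrees with reference, else None."""
    values = list(domain)
    for args in product(values, repeat=arity):
        expected = reference(*args)
        try:
            actual = candidate(*args)
        except Exception:
            return args  # a crash also distinguishes the candidate
        if actual != expected:
            return args
    return None


def reference_abs_diff(a: int, b: int) -> int:
    return abs(a - b)


def mutant_swapped(a: int, b: int) -> int:
    # Intended as a 'wrong' alternative, but actually equivalent.
    return max(a, b) - min(a, b)


def mutant_signed(a: int, b: int) -> int:
    # Wrong whenever a < b.
    return a - b


domain = range(-3, 4)
accidentally_correct = find_counterexample(reference_abs_diff, mutant_swapped, domain, 2)
truly_wrong = find_counterexample(reference_abs_diff, mutant_signed, domain, 2)
```

A mutant for which the check returns `None` is a candidate for removal from the adversarial set, or at least for a human spot-check.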

Business Angle

Pytest plugin that auto-generates semantically-adversarial test pairs to catch AI coding agents gaming shallow test suites

Customer: Solo dev or small team building an AI coding assistant (Cursor competitor, internal code-gen tool, or agent framework) who ships evals as part of their product quality loop — not academia, not enterprise QA

Pricing: open-core — $800 MRR in 4 months (16 teams × $50/mo pro tier)
