Rule Induction Arena
A text-adventure benchmark harness that generates hidden-rule puzzles, runs multiple LLMs through them, and scores rule-induction capability across difficulty tiers.
Difficulty: 1-week | Stack: Python, Pydantic, Anthropic SDK, OpenAI SDK, SQLite + SQLModel, Rich (terminal UI), Pytest
Who this is for
AI researchers and developers who want a cheap, reproducible way to measure how well a model generalizes abstract rules to new situations — motivated directly by the text-game benchmarking paper in the cluster.
Build steps
- Design a puzzle schema (Pydantic models): each puzzle has a hidden rule (e.g., ‘odd-numbered doors always lead to traps’), an observation sequence, and a novel test scenario where the rule must be applied.
- Write a procedural generator that parameterizes rules across three tiers — surface (color/shape), relational (positional, sequential), and compositional (conjunctions of two rules) — producing 100+ unique puzzles per tier.
- Build an agent loop that feeds each puzzle’s observations to an LLM one at a time, asks it to state its current hypothesis, and finally applies the hypothesis to the test scenario; record the full transcript.
- Implement a judge (a second LLM call with a strict rubric, run at temperature 0 so verdicts are as repeatable as possible) that scores each final answer as correct, partial, or wrong and stores the results in SQLite.
- Add a Rich-powered leaderboard CLI command that renders per-model, per-tier accuracy, average hypothesis quality score, and median token cost.
- Write a Pytest suite that validates the generator (no generated puzzle has an ambiguous rule) and the judge (its verdicts match the human labels on all 20 puzzles in a gold set).
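The first two build steps can be sketched roughly as follows. This is a minimal illustration, not a reference design: every name here (`Door`, `Rule`, `Puzzle`, `generate_tier`) is hypothetical, the door-and-trap domain simply extends the "odd-numbered doors" example, and standard-library dataclasses stand in for the Pydantic models so the sketch runs without extra dependencies.

```python
from __future__ import annotations

import random
from dataclasses import dataclass
from typing import Callable

COLORS = ("red", "green", "blue")


@dataclass(frozen=True)
class Door:
    number: int
    color: str


@dataclass(frozen=True)
class Rule:
    tier: str                        # "surface", "relational", or "compositional"
    description: str                 # natural-language statement of the hidden rule
    is_trap: Callable[[Door], bool]  # ground-truth predicate


@dataclass(frozen=True)
class Puzzle:
    rule: Rule
    observations: tuple[tuple[Door, bool], ...]  # (door, was_trap) pairs, in order
    test_door: Door                              # novel scenario, never observed
    expected: bool                               # ground truth for the test scenario


def surface_rule(color: str) -> Rule:
    return Rule("surface", f"{color} doors are traps", lambda d: d.color == color)


def relational_rule(threshold: int) -> Rule:
    return Rule("relational", f"doors numbered above {threshold} are traps",
                lambda d: d.number > threshold)


def compositional_rule(a: Rule, b: Rule) -> Rule:
    # Conjunction of two simpler rules.
    return Rule("compositional", f"({a.description}) and ({b.description})",
                lambda d: a.is_trap(d) and b.is_trap(d))


def make_puzzle(rule: Rule, rng: random.Random, n_obs: int = 6) -> Puzzle:
    # Draw distinct doors; the last one is held out as the test scenario.
    doors: list[Door] = []
    while len(doors) < n_obs + 1:
        door = Door(rng.randint(1, 20), rng.choice(COLORS))
        if door not in doors:
            doors.append(door)
    *seen, test_door = doors
    observations = tuple((d, rule.is_trap(d)) for d in seen)
    return Puzzle(rule, observations, test_door, rule.is_trap(test_door))


def generate_tier(tier: str, count: int, seed: int = 0) -> list[Puzzle]:
    # A fixed seed keeps the generated set reproducible across runs.
    rng = random.Random(seed)
    puzzles: list[Puzzle] = []
    seen_keys = set()
    while len(puzzles) < count:
        surface = surface_rule(rng.choice(COLORS))
        relational = relational_rule(rng.randint(5, 15))
        rule = {
            "surface": surface,
            "relational": relational,
            "compositional": compositional_rule(surface, relational),
        }[tier]
        puzzle = make_puzzle(rule, rng)
        # Skip exact duplicates so every puzzle in the tier is unique.
        key = (rule.description, puzzle.observations, puzzle.test_door)
        if key not in seen_keys:
            seen_keys.add(key)
            puzzles.append(puzzle)
    return puzzles
```

Calling `generate_tier("compositional", 100, seed=1)` returns 100 distinct puzzles whose `expected` field is computed from the hidden rule rather than written by hand. This sketch does not check for ambiguity; that would need a separate pass over the generated set.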
Risks
- Procedurally generated puzzles can be accidentally ambiguous — multiple rules fit the observation sequence — making ground-truth scoring unreliable without costly human review.
- LLM judges are inconsistent on partial-credit cases; without a calibrated rubric tested against human annotators, the leaderboard scores may not reflect real capability differences.
- Compositional-tier puzzles may be too hard for current models, compressing all scores toward zero and making model differentiation impossible at the top tier.
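One way to reduce the ambiguity risk without human review is to enumerate a candidate rule space and reject any puzzle whose observations fit more than one candidate. The sketch below assumes a small, hand-written hypothesis space over `(number, color)` doors; the names `candidate_rules`, `consistent_rules`, and `is_unambiguous` are hypothetical. The check is only as good as the enumerated space: a rule outside it can still make a puzzle ambiguous.

```python
from typing import Callable, Dict, List, Tuple

Door = Tuple[int, str]               # (number, color); illustrative representation
Observation = Tuple[Door, bool]      # (door, was_trap)
Predicate = Callable[[Door], bool]


def candidate_rules() -> Dict[str, Predicate]:
    # Assumed hypothesis space; a real generator would derive this from its own rule templates.
    rules: Dict[str, Predicate] = {}
    for color in ("red", "green", "blue"):
        rules[f"{color} doors are traps"] = lambda d, c=color: d[1] == c
    for threshold in range(1, 20):
        rules[f"doors above {threshold} are traps"] = lambda d, t=threshold: d[0] > t
    rules["odd-numbered doors are traps"] = lambda d: d[0] % 2 == 1
    return rules


def consistent_rules(observations: List[Observation],
                     candidates: Dict[str, Predicate]) -> List[str]:
    # A rule is consistent if it predicts the outcome of every observation.
    return [
        name
        for name, predicate in candidates.items()
        if all(predicate(door) == was_trap for door, was_trap in observations)
    ]


def is_unambiguous(observations: List[Observation],
                   candidates: Dict[str, Predicate]) -> bool:
    # Accept a puzzle only when exactly one candidate rule fits its observations.
    return len(consistent_rules(observations, candidates)) == 1
```

For example, the observations `[((3, "red"), True), ((8, "blue"), False)]` fit both the red-door rule and the odd-number rule, so the puzzle would be rejected. Swapping the second observation for `((4, "red"), False)` leaves only the odd-number rule.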
Business Angle
A self-hosted benchmark harness that pits LLMs against hidden-rule text puzzles and gives AI researchers a reproducible leaderboard they can run locally for pennies.
Customer: Independent ML researcher or senior AI engineer at a 5-50 person AI startup who runs model evaluations weekly, is frustrated that MMLU/HumanEval are saturated and gamed, and wants a cheap internal benchmark they control — not a hosted leaderboard they can't customize.
Pricing: open-core, targeting roughly $800 MRR within 4 months (8 teams × $99/mo = $792, for hosted result storage, multi-user dashboards, and private puzzle packs)