Rule Induction Arena

A text-adventure benchmark harness that generates hidden-rule puzzles, runs multiple LLMs through them, and scores rule-induction capability across difficulty tiers.

Difficulty: 1-week | Stack: Python, Pydantic, Anthropic SDK, OpenAI SDK, SQLite + SQLModel, Rich (terminal UI), Pytest

Who this is for

AI researchers and developers who want a cheap, reproducible way to measure how well a model generalizes abstract rules to new situations. The project is motivated directly by the text-game benchmarking paper in the cluster.

Build steps

Design a puzzle schema (Pydantic models): each puzzle has a hidden rule (e.g., ‘odd-numbered doors always lead to traps’), an observation sequence, and a novel test scenario where the rule must be applied.
Write a procedural generator that parameterizes rules across three tiers: surface (color/shape), relational (positional, sequential), and compositional (conjunctions of two rules). It should produce 100+ unique puzzles per tier.
Build an agent loop that feeds each puzzle’s observations to an LLM one at a time, asks it to state its current hypothesis, and finally applies the hypothesis to the test scenario; record the full transcript.
Implement a judge (a second LLM call with a strict rubric, made as repeatable as possible, for example by sampling at temperature 0) that scores each final answer as correct/partial/wrong and stores results in SQLite.
Add a Rich-powered leaderboard CLI command that renders per-model, per-tier accuracy, average hypothesis quality score, and median token cost.
Write a Pytest suite that validates the generator (no puzzle has an ambiguous rule) and the judge (its verdicts match the human labels on a gold set of 20 puzzles).
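
The first two build steps can be sketched roughly as follows. This is a minimal illustration, not the project's actual schema: standard-library dataclasses stand in for the Pydantic models so the sketch has no dependencies, and the door world, the names (Door, Observation, Puzzle, RULES, generate_puzzle), and the assignment of each example rule to a tier are all assumptions made for the example.

```python
import random
from dataclasses import dataclass
from enum import Enum


class Tier(str, Enum):
    SURFACE = "surface"
    RELATIONAL = "relational"
    COMPOSITIONAL = "compositional"


@dataclass(frozen=True)
class Door:
    number: int
    color: str


@dataclass(frozen=True)
class Observation:
    door: Door
    is_trap: bool


@dataclass(frozen=True)
class Puzzle:
    tier: Tier
    rule_text: str                          # hidden from the model under test
    observations: tuple[Observation, ...]   # shown to the model one at a time
    test_door: Door                         # novel scenario, never observed
    expected_trap: bool                     # ground truth for scoring


COLORS = ("red", "green", "blue")

# Hypothetical rule templates, one per tier: (description, predicate).
RULES = {
    Tier.SURFACE: ("red doors always lead to traps",
                   lambda d: d.color == "red"),
    Tier.RELATIONAL: ("odd-numbered doors always lead to traps",
                      lambda d: d.number % 2 == 1),
    Tier.COMPOSITIONAL: ("odd-numbered red doors always lead to traps",
                         lambda d: d.color == "red" and d.number % 2 == 1),
}


def generate_puzzle(tier: Tier, seed: int, n_obs: int = 6) -> Puzzle:
    """Build one puzzle; the same (tier, seed) always yields the same puzzle."""
    rng = random.Random(seed)
    rule_text, rule = RULES[tier]
    all_doors = [Door(n, c) for n in range(1, 9) for c in COLORS]
    # Sample without replacement so the test door is never among the observations.
    *seen, test_door = rng.sample(all_doors, n_obs + 1)
    observations = tuple(Observation(d, rule(d)) for d in seen)
    return Puzzle(tier, rule_text, observations, test_door, rule(test_door))
```

Seeding the generator keeps runs reproducible across models. Note that this sketch does nothing to guarantee that the sampled observations pin down the rule; a real generator would need to reject or resample puzzles that fail an ambiguity check.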

Risks

Procedurally generated puzzles can be accidentally ambiguous (multiple rules fit the observation sequence), making ground-truth scoring unreliable without costly human review.
LLM judges are inconsistent on partial-credit cases; without a calibrated rubric tested against human annotators, the leaderboard scores may not reflect real capability differences.
Compositional-tier puzzles may be too hard for current models, compressing all scores toward zero and making model differentiation impossible at the top tier.
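
One way to catch part of the ambiguity risk automatically is to enumerate a candidate rule space and check which rules fit the observations. The sketch below assumes a toy door world where a door is a (number, color) tuple and the candidate space is a handful of atomic rules plus their pairwise conjunctions; the rule space and all function names are hypothetical, and a puzzle that passes is only unambiguous relative to that space.

```python
from itertools import combinations

COLORS = ("red", "green", "blue")


def atomic_rules():
    """Name -> predicate over a door given as (number, color)."""
    rules = {f"{c} doors are traps": (lambda d, c=c: d[1] == c) for c in COLORS}
    rules["odd-numbered doors are traps"] = lambda d: d[0] % 2 == 1
    rules["even-numbered doors are traps"] = lambda d: d[0] % 2 == 0
    return rules


def candidate_rules():
    """Atomic rules plus every conjunction of two atomic rules."""
    atoms = atomic_rules()
    rules = dict(atoms)
    for (na, fa), (nb, fb) in combinations(atoms.items(), 2):
        rules[f"({na}) AND ({nb})"] = lambda d, fa=fa, fb=fb: fa(d) and fb(d)
    return rules


def consistent_rules(observations):
    """Names of candidate rules that reproduce every (door, is_trap) pair."""
    return [name for name, rule in candidate_rules().items()
            if all(rule(door) == is_trap for door, is_trap in observations)]


def is_ambiguous(observations, test_door):
    """True if rules that fit the observations disagree on the test door."""
    rules = candidate_rules()
    predictions = {rules[name](test_door) for name in consistent_rules(observations)}
    return len(predictions) > 1


few = [((1, "red"), True), ((2, "blue"), False)]
more = few + [((3, "green"), True), ((4, "red"), False)]
print(consistent_rules(few))            # three rules still fit
print(is_ambiguous(few, (3, "blue")))   # True: "red" and "odd" disagree
print(consistent_rules(more))           # only the odd-numbered rule survives
```

Here is_ambiguous only flags puzzles where the surviving rules give different answers on the test scenario, which is what final-answer scoring needs. A stricter generator test could instead require len(consistent_rules(observations)) == 1.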
