AI Pulse

Rule Induction Arena

A text-adventure benchmark harness that generates hidden-rule puzzles, runs multiple LLMs through them, and scores rule-induction capability across difficulty tiers.

Difficulty: 1-week | Stack: Python, Pydantic, Anthropic SDK, OpenAI SDK, SQLite + SQLModel, Rich (terminal UI), Pytest

Who this is for

AI researchers and developers who want a cheap, reproducible way to measure how well a model generalizes abstract rules to new situations — motivated directly by the text-game benchmarking paper in the cluster.

Build steps

  1. Design a puzzle schema (Pydantic models): each puzzle has a hidden rule (e.g., ‘odd-numbered doors always lead to traps’), an observation sequence, and a novel test scenario where the rule must be applied.
  2. Write a procedural generator that parameterizes rules across three tiers — surface (color/shape), relational (positional, sequential), and compositional (conjunctions of two rules) — producing 100+ unique puzzles per tier.
  3. Build an agent loop that feeds each puzzle’s observations to an LLM one at a time, asks it to state its current hypothesis, and finally applies the hypothesis to the test scenario; record the full transcript.
  4. Implement a judge (a second LLM call with a strict rubric, run at temperature 0 so scoring is as repeatable as the provider allows) that scores each final answer as correct/partial/wrong and stores results in SQLite.
  5. Add a Rich-powered leaderboard CLI command that renders per-model, per-tier accuracy, average hypothesis quality score, and median token cost.
  6. Write a Pytest suite that validates the generator (no puzzle has an ambiguous rule) and the judge (its scores match the human labels on every puzzle in a 20-puzzle gold set).
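
Steps 1 and 2 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a reference implementation: stdlib dataclasses stand in for the Pydantic models named in the stack, and every name here (Observation, Puzzle, RULES, generate_puzzle) is hypothetical. Each tier carries a single example rule; a real generator would parameterize many more to reach 100+ puzzles per tier.

```python
import random
from dataclasses import dataclass


# Hypothetical schema: stdlib dataclasses stand in for Pydantic models.
@dataclass(frozen=True)
class Observation:
    door_number: int
    door_color: str
    is_trap: bool


@dataclass(frozen=True)
class Puzzle:
    tier: str
    rule_name: str  # the hidden rule; never shown to the model under test
    observations: tuple[Observation, ...]
    test_door: tuple[int, str]  # novel scenario, not among the observations
    expected_is_trap: bool


COLORS = ("red", "blue", "green")

# Hypothetical rule catalog, one example rule per tier.
RULES = {
    "surface": {"red_doors_are_traps": lambda n, c: c == "red"},
    "relational": {"odd_doors_are_traps": lambda n, c: n % 2 == 1},
    "compositional": {
        "odd_and_red_doors_are_traps": lambda n, c: n % 2 == 1 and c == "red"
    },
}


def generate_puzzle(tier: str, rule_name: str, seed: int, n_obs: int = 6) -> Puzzle:
    """Build one puzzle; the seed makes generation reproducible."""
    rng = random.Random(seed)
    rule = RULES[tier][rule_name]
    all_doors = [(n, c) for n in range(1, 21) for c in COLORS]
    # Sampling without replacement guarantees the test door is unseen.
    *seen, test_door = rng.sample(all_doors, n_obs + 1)
    observations = tuple(Observation(n, c, rule(n, c)) for n, c in seen)
    return Puzzle(tier, rule_name, observations, test_door, rule(*test_door))


puzzle = generate_puzzle("relational", "odd_doors_are_traps", seed=7)
```

Seeding a local random.Random instance, rather than the module-level generator, keeps each puzzle reproducible from its (tier, rule, seed) triple, which is what lets the same puzzle set be replayed against several models.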

Risks

  • Procedurally generated puzzles can be accidentally ambiguous — multiple rules fit the observation sequence — making ground-truth scoring unreliable without costly human review.
  • LLM judges are inconsistent on partial-credit cases; without a calibrated rubric tested against human annotators, the leaderboard scores may not reflect real capability differences.
  • Compositional-tier puzzles may be too hard for current models, compressing all scores toward zero and making model differentiation impossible at the top tier.
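
The first risk can be partly mitigated with an automated check before any human review. The sketch below is one possible approach under stated assumptions: it enumerates a small, hypothetical candidate rule space (CANDIDATE_RULES) and flags a puzzle when rules that fit every observation disagree on the test scenario. It only detects ambiguity relative to the rules it enumerates, so it narrows the need for human review rather than replacing it.

```python
from typing import Callable

Door = tuple[int, str]  # (door number, door color)
Rule = Callable[[int, str], bool]

# Hypothetical candidate space; a real harness would enumerate every rule
# the generator is able to produce.
CANDIDATE_RULES: dict[str, Rule] = {
    "odd_doors_are_traps": lambda n, c: n % 2 == 1,
    "red_doors_are_traps": lambda n, c: c == "red",
    "odd_and_red_doors_are_traps": lambda n, c: n % 2 == 1 and c == "red",
}


def consistent_rules(observations: list[tuple[Door, bool]]) -> list[str]:
    """Return the names of candidate rules that explain every observation."""
    return [
        name
        for name, rule in CANDIDATE_RULES.items()
        if all(rule(n, c) == is_trap for (n, c), is_trap in observations)
    ]


def is_ambiguous(observations: list[tuple[Door, bool]], test_door: Door) -> bool:
    """True when fitting rules give different answers for the test door."""
    predictions = {
        CANDIDATE_RULES[name](*test_door) for name in consistent_rules(observations)
    }
    return len(predictions) > 1


# Two observations that all three candidate rules explain equally well.
weak = [((1, "red"), True), ((2, "blue"), False)]
# A third observation rules out the two color-based candidates.
strong = weak + [((3, "blue"), True)]

weak_flag = is_ambiguous(weak, (3, "blue"))      # True
strong_flag = is_ambiguous(strong, (4, "red"))   # False
```

A generator could reject or extend any puzzle this check flags, and the same function could back the Pytest assertion that no puzzle has an ambiguous rule.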

Business Angle

A self-hosted benchmark harness that pits LLMs against hidden-rule text puzzles and gives AI researchers a reproducible leaderboard they can run locally for pennies.

Customer: Independent ML researcher or senior AI engineer at a 5-50 person AI startup who runs model evaluations weekly, is frustrated that MMLU/HumanEval are saturated and gamed, and wants a cheap internal benchmark they control — not a hosted leaderboard they can't customize.

Pricing: open-core, targeting roughly $800 MRR within 4 months (8 teams × $99/mo = $792, for hosted result storage, multi-user dashboards, and private puzzle packs)
