RL Environment Spec Generator
A web app that takes a natural-language task description and generates a complete reinforcement learning environment specification — reward function, observation space, termination conditions, and verification harness.
Difficulty: 1-week | Stack: Next.js, TypeScript, Vercel AI SDK, shadcn/ui, Zod, Python (generated output), Claude API
Who this is for
ML engineers and researchers who want to bootstrap a new RL environment without writing boilerplate from scratch. The generated spec targets Gymnasium and is a starting point to review and complete, not a finished environment.
Build steps
- Design a Zod schema for the RL environment spec: observation space definition, action space, reward function signature, termination conditions, and a verification test suite skeleton.
- Build a Next.js form where users describe their task in plain English, optionally annotate verifiability properties (objective truth, noise level), and submit.
- Stream the structured spec generation through the Vercel AI SDK using Claude with structured output mode, rendering each section as it arrives in an editable code block.
- Add a ‘verifiability audit’ panel that scores the generated reward function against Verifier’s Law criteria and flags reward functions that are likely too noisy or too sparse.
- Generate a downloadable Python file with a Gymnasium-compatible Env class stub, populated with the spec’s values and TODO comments for user logic.
- Include three built-in example tasks (code formatting, receipt parsing, math proof checking) so users can see the tool’s output before committing their own description.
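The download step can be sketched as plain string templating. This is a minimal sketch in Python, not the tool's actual generator: EnvSpec, render_env_stub, check_syntax, the field names, and the pinned version 1.0.0 are illustrative assumptions, and a real implementation would derive the fields from the Zod schema. It renders a Gymnasium-style Env subclass as source text and parses the result with ast, so a syntactically broken file is never offered for download.

```python
import ast
from dataclasses import dataclass
from textwrap import dedent


@dataclass
class EnvSpec:
    """Hypothetical subset of the spec; field names are assumptions."""
    class_name: str
    obs_shape: tuple[int, ...]
    n_actions: int
    max_steps: int
    gymnasium_version: str = "1.0.0"  # placeholder pin, choose your own


def render_env_stub(spec: EnvSpec) -> str:
    """Render a Gymnasium-style Env subclass as Python source text."""
    # Reject names that would produce invalid (or injected) code.
    if not spec.class_name.isidentifier():
        raise ValueError(f"invalid class name: {spec.class_name!r}")
    return dedent(f'''\
        # Spec draft: requires human review before use.
        # Generated against gymnasium=={spec.gymnasium_version}
        import gymnasium as gym
        import numpy as np
        from gymnasium import spaces


        class {spec.class_name}(gym.Env):
            def __init__(self):
                super().__init__()
                self.observation_space = spaces.Box(
                    low=-np.inf, high=np.inf, shape={spec.obs_shape!r}, dtype=np.float32
                )
                self.action_space = spaces.Discrete({spec.n_actions})
                self.max_steps = {spec.max_steps}
                self._steps = 0

            def reset(self, *, seed=None, options=None):
                super().reset(seed=seed)
                self._steps = 0
                obs = np.zeros({spec.obs_shape!r}, dtype=np.float32)  # TODO: initial observation
                return obs, {{}}

            def step(self, action):
                self._steps += 1
                obs = np.zeros({spec.obs_shape!r}, dtype=np.float32)  # TODO: transition logic
                reward = 0.0  # TODO: reward function
                terminated = False  # TODO: task-specific termination
                truncated = self._steps >= self.max_steps
                return obs, reward, terminated, truncated, {{}}
        ''')


def check_syntax(source: str) -> bool:
    """Return True if the generated source parses as valid Python."""
    try:
        ast.parse(source)
    except SyntaxError:
        return False
    return True


if __name__ == "__main__":
    demo = EnvSpec("ReceiptParsingEnv", obs_shape=(4,), n_actions=3, max_steps=100)
    source = render_env_stub(demo)
    print(source if check_syntax(source) else "generated stub failed to parse")
```

Parsing only confirms the file is well-formed Python. It says nothing about whether the reward or termination logic is right for the task, which is why the header comment marks the file as a draft.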
Risks
- Generated reward functions will often be syntactically correct but semantically wrong for the user’s domain — frame the output explicitly as a ‘spec draft’ requiring human review, not a final artifact.
- Gymnasium’s API surface changes between versions; pin to a specific version in the generated output and note it prominently.
- Users may describe tasks with fundamentally low verifiability (e.g., ‘write a good poem’) — the verifiability audit panel must catch these and surface a clear warning rather than generating a junk spec.
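The last risk suggests gating generation on the audit result. Below is a minimal sketch in Python of such a gate, under stated assumptions: the reward_density field, both threshold values, and the rule that a missing objective truth blocks generation are hypothetical choices for illustration, not criteria taken from Verifier’s Law itself.

```python
from dataclasses import dataclass


@dataclass
class VerifiabilityAnnotations:
    """Hypothetical audit inputs; names and ranges are assumptions."""
    has_objective_truth: bool  # is there a checkable ground truth?
    noise_level: float         # 0.0 (deterministic) to 1.0 (very noisy)
    reward_density: float      # fraction of steps with a nonzero reward


# Illustrative thresholds only; tune them against real tasks.
MAX_NOISE = 0.3
MIN_DENSITY = 0.05


def audit(a: VerifiabilityAnnotations) -> list[str]:
    """Return human-readable warnings; an empty list means no flags."""
    warnings = []
    if not a.has_objective_truth:
        warnings.append("No objective truth: the reward cannot be checked against ground truth.")
    if a.noise_level > MAX_NOISE:
        warnings.append(f"Reward is likely too noisy (noise {a.noise_level:.2f} > {MAX_NOISE}).")
    if a.reward_density < MIN_DENSITY:
        warnings.append(f"Reward is likely too sparse (density {a.reward_density:.2f} < {MIN_DENSITY}).")
    return warnings


def should_block_generation(a: VerifiabilityAnnotations) -> bool:
    """Assumed policy: refuse to generate a spec when nothing is checkable."""
    return not a.has_objective_truth


if __name__ == "__main__":
    poem = VerifiabilityAnnotations(has_objective_truth=False, noise_level=0.8, reward_density=0.01)
    for message in audit(poem):
        print("WARNING:", message)
    print("blocked:", should_block_generation(poem))
```

Keeping the warnings separate from the blocking decision lets the panel show every flag for a borderline task while still refusing outright only when there is no ground truth to verify against.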