AI Pulse

Self-Improvement Loop Sandbox

A small automated research pipeline in which a language model iteratively rewrites its own few-shot prompts and measures whether downstream task performance actually improves from run to run.

Difficulty: 1-week | Stack: Python, OpenAI API (or Anthropic API), LangChain, SQLite, Rich (CLI dashboard)

Who this is for

Developers and AI safety practitioners who want hands-on empirical evidence of how hard (or easy) LLM self-improvement actually is in a controlled, observable setting. The project directly instantiates the blog post’s core argument.

Build steps

  1. Pick a narrow, evaluable task with an automated scorer: e.g., solving grade-school math word problems (GSM8K subset) or generating syntactically valid SQL from natural language
  2. Implement a baseline few-shot prompt and an automated eval loop that scores the model on a fixed 50-question eval set, storing each run's score and prompt version in SQLite
  3. Add a ‘self-improver’ step: after each eval, pass the failing examples plus the current prompt to the model with instructions to propose a revised prompt; record the proposed change
  4. Run 10-20 improvement iterations automatically, tracking score trajectory, prompt diff size, and the frequency of regressions (cases where the new prompt scores worse)
  5. Render a Rich CLI dashboard showing the iteration history, a diff of each prompt mutation, and a running chart of pass-rate — making the ‘slow climb with frequent regressions’ dynamic visible
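
Steps 2 through 4 can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the `ModelFn` callable, the `runs` table layout, the exact-match scorer, and the wording of the improver instruction are all assumptions made for this sketch. The model calls are passed in as plain functions, so the loop can be exercised with stubs before any real API client (OpenAI, Anthropic, or a LangChain wrapper) is wired in.

```python
import difflib
import sqlite3
from typing import Callable

# Hypothetical interface: takes a full prompt string, returns the model's text reply.
ModelFn = Callable[[str], str]
Question = tuple[str, str]  # (question text, expected answer)


def evaluate(model: ModelFn, prompt: str, questions: list[Question]) -> tuple[float, list[Question]]:
    """Score a prompt by exact match; return (pass_rate, failing questions)."""
    if not questions:
        raise ValueError("questions must not be empty")
    failures = []
    for question, expected in questions:
        answer = model(f"{prompt}\n\nQ: {question}\nA:").strip()
        if answer != expected:
            failures.append((question, expected))
    return 1 - len(failures) / len(questions), failures


def diff_size(old: str, new: str) -> int:
    """Count lines added or removed between two prompt versions."""
    delta = difflib.ndiff(old.splitlines(), new.splitlines())
    return sum(1 for line in delta if line.startswith(("+ ", "- ")))


def run_loop(model: ModelFn, improver: ModelFn, prompt: str,
             questions: list[Question], iterations: int,
             db: sqlite3.Connection) -> int:
    """Run eval + self-improve iterations; return the number of regressions."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS runs "
        "(iteration INTEGER, prompt TEXT, pass_rate REAL, diff_size INTEGER)"
    )
    regressions = 0
    previous_rate = None
    previous_prompt = prompt
    for i in range(iterations):
        rate, failures = evaluate(model, prompt, questions)
        db.execute(
            "INSERT INTO runs VALUES (?, ?, ?, ?)",
            (i, prompt, rate, diff_size(previous_prompt, prompt)),
        )
        db.commit()
        # A regression is an iteration whose prompt scores worse than the one before.
        if previous_rate is not None and rate < previous_rate:
            regressions += 1
        previous_rate, previous_prompt = rate, prompt
        if not failures:
            break  # nothing left to show the improver (a choice made for this sketch)
        failed = "\n".join(f"Q: {q}\nExpected: {a}" for q, a in failures)
        prompt = improver(
            "Revise this few-shot prompt so the failing cases pass. "
            "Reply with the new prompt only.\n\n"
            f"CURRENT PROMPT:\n{prompt}\n\nFAILING CASES:\n{failed}"
        )
    return regressions
```

Note that this sketch always adopts the proposed prompt, even when it scores worse, so regressions show up in the trajectory instead of being hidden. Reverting to the best prompt so far is an alternative design, but it would change what the dashboard in step 5 displays.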

Risks

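  • API costs can escalate quickly: 20 iterations × (50 eval calls + 1 self-improver call) comes to roughly 1,020 requests per run, and the self-improver prompt grows with the number of failing examples. Set a hard budget cap and use a cheap model (e.g., gpt-4o-mini) for the eval loop to control spend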
  • The model may overfit the prompt to the eval set rather than genuinely improving reasoning. Use a separate validation set that the self-improver never sees to catch this, and document it as a finding, since it illustrates a real self-improvement failure mode
  • Results are heavily prompt-sensitive and may not generalize; frame the project as an exploration tool, not a definitive experiment, and make the task and eval set swappable
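
The first two risks can be guarded against with a few lines of bookkeeping. The sketch below is illustrative only: `planned_calls`, `BudgetGuard`, and `overfit_gap` are hypothetical names, and per-token prices are supplied by the caller, since real prices vary by provider and model and change over time.

```python
from dataclasses import dataclass


def planned_calls(iterations: int, eval_questions: int) -> int:
    """Each iteration makes one call per eval question plus one self-improver call."""
    return iterations * (eval_questions + 1)


@dataclass
class BudgetGuard:
    """Hard cap on estimated spend; prices are caller-supplied assumptions."""
    cap_usd: float
    spent_usd: float = 0.0

    def charge(self, input_tokens: int, output_tokens: int,
               usd_per_1k_input: float, usd_per_1k_output: float) -> None:
        """Record the estimated cost of one call, or raise if it would break the cap."""
        cost = (input_tokens * usd_per_1k_input
                + output_tokens * usd_per_1k_output) / 1000
        if self.spent_usd + cost > self.cap_usd:
            raise RuntimeError(f"budget cap of ${self.cap_usd:.2f} would be exceeded")
        self.spent_usd += cost


def overfit_gap(eval_rate: float, validation_rate: float) -> float:
    """Positive gap: the prompt does better on the set it was tuned against."""
    return eval_rate - validation_rate
```

Calling `charge` before each request makes the loop stop with an exception instead of silently overspending. Logging `overfit_gap` per iteration alongside the pass-rate gives a direct view of whether gains on the eval set carry over to the validation set.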