Language-Agnostic SWE Mini-Bench Runner
A local benchmark executor that pulls real GitHub issues across Python, TypeScript, and Go repos, sandboxes each in Docker, runs a code agent, then verifies the patch with the repo’s own test suite.
Difficulty: 1-week | Stack: Python, Docker SDK, GitPython, Claude or GPT-4o function-calling, GitHub API (PyGithub), Rich (CLI output)
Who this is for
Individual developers who want to measure whether their custom coding agent actually generalises across languages before they build on top of it, without paying for a hosted eval service.
Build steps
- Curate a seed list of 30–50 GitHub issues (10–15 per language) that have a merged fix PR and a passing CI test suite you can reproduce locally; store metadata in a JSON registry.
- Write a Docker-based sandbox manager: for each task, clone the repo at the pre-fix commit, mount a working directory, and expose a tool interface (read_file, write_file, run_tests) the agent calls via function-calling.
- Implement the agent loop: the agent receives the issue body, calls tools iteratively, and signals completion; capture the final diff and run the repo’s test command inside the container.
- Score each run: PASS (tests green + no regression), PARTIAL (some tests pass), FAIL; log token usage and wall-clock time per task.
- Build a CLI dashboard with Rich that shows a live pass-rate table broken down by language and a summary JSON for tracking regressions across agent versions.
- Add a --compare flag that runs two agent configs head-to-head on the same task set and prints a side-by-side diff of per-language scores.
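The scoring and comparison steps above can be sketched roughly as follows. This is a minimal illustration, not a prescribed design: the names TaskResult, Verdict, score, pass_rate_by_language, and compare are hypothetical, and the split into "target tests" and "regressions" is one assumed reading of PASS, PARTIAL, and FAIL.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(str, Enum):
    PASS = "PASS"
    PARTIAL = "PARTIAL"
    FAIL = "FAIL"


@dataclass
class TaskResult:
    """One agent run on one task (field names are illustrative assumptions)."""
    task_id: str
    language: str
    target_passed: int    # tests tied to the issue that pass after the patch
    target_total: int     # tests tied to the issue in total
    regressions: int      # previously green tests that fail after the patch
    tokens: int = 0       # token usage logged for the run
    seconds: float = 0.0  # wall-clock time for the run


def score(r: TaskResult) -> Verdict:
    """Assumed rule: PASS needs every target test green and no regressions."""
    all_green = r.target_total > 0 and r.target_passed == r.target_total
    if all_green and r.regressions == 0:
        return Verdict.PASS
    if r.target_passed > 0:
        return Verdict.PARTIAL
    return Verdict.FAIL


def pass_rate_by_language(results: list[TaskResult]) -> dict[str, float]:
    """Fraction of tasks scored PASS, grouped by language."""
    totals: dict[str, int] = {}
    passes: dict[str, int] = {}
    for r in results:
        totals[r.language] = totals.get(r.language, 0) + 1
        if score(r) is Verdict.PASS:
            passes[r.language] = passes.get(r.language, 0) + 1
    return {lang: passes.get(lang, 0) / n for lang, n in totals.items()}


def compare(a: list[TaskResult], b: list[TaskResult]) -> dict[str, tuple]:
    """Per-language pass rates for two agent configs, side by side."""
    ra, rb = pass_rate_by_language(a), pass_rate_by_language(b)
    return {lang: (ra.get(lang), rb.get(lang)) for lang in sorted(set(ra) | set(rb))}
```

Under this reading, a patch that turns every target test green but breaks an unrelated test scores PARTIAL rather than PASS. Whether that case should count as FAIL instead is a judgment call worth settling before comparing agent versions.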
Risks
- Reproducing CI environments locally is fragile: dependency pinning, OS-level native libraries, and Makefile quirks will break roughly 20% of tasks, so budget time for a manual triage pass on the seed list.
- GitHub rate limits will throttle issue and PR fetching during the curation phase; cache all API responses to disk on first fetch.
- Docker-in-Docker or volume-mount permission issues on macOS/Windows will cause silent test failures that look like agent failures; test the sandbox layer independently before wiring in the agent.
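The advice to cache API responses on first fetch could look something like the sketch below. It is deliberately independent of any GitHub client: cached_fetch is a hypothetical helper, and the fetch argument stands in for whatever call actually hits the API. It assumes the response has already been reduced to JSON-serialisable data.

```python
import hashlib
import json
from pathlib import Path
from typing import Any, Callable


def cached_fetch(key: str, fetch: Callable[[], Any], cache_dir: Path) -> Any:
    """Return the cached value for `key`, calling `fetch` only on a cache miss.

    `key` is any stable string identifying the request (for example a repo
    name plus an issue number). `fetch` must return JSON-serialisable data.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hash the key so arbitrary strings map to safe file names.
    name = hashlib.sha256(key.encode("utf-8")).hexdigest() + ".json"
    path = cache_dir / name
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    data = fetch()
    path.write_text(json.dumps(data), encoding="utf-8")
    return data
```

A curation script would wrap each issue or PR lookup in a call like this, so re-running the script after a crash or a rate-limit pause costs no further API requests for items already on disk.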
Business Angle
Self-hosted benchmark runner that proves your coding agent works across Python, TypeScript, and Go before you ship it to customers.
Customer: A solo AI developer or two-person founding team that has built a custom coding agent (a wrapper around Claude/GPT-4o) and is about to pitch it to its first 10 enterprise or dev-tool customers. They need a credible eval story but can't afford $5k/month for hosted eval platforms.
Pricing: $99 one-time license, targeting about $1,200 in the first 60 days (12 licenses); then reassess whether a $19/mo 'new issues feed' add-on has legs.