Language-Agnostic SWE Mini-Bench Runner

A local benchmark executor that pulls real GitHub issues across Python, TypeScript, and Go repos, sandboxes each in Docker, runs a code agent, then verifies the patch with the repo’s own test suite.

Difficulty: 1-week | Stack: Python, Docker SDK, GitPython, Claude or GPT-4o function-calling, GitHub API (PyGithub), Rich (CLI output)

Who this is for

Individual developers who want to measure whether their custom coding agent actually generalises across languages before they build on top of it, without paying for a hosted eval service.

Build steps

Curate a seed list of 30–50 GitHub issues (10–15 per language) that have a merged fix PR and a passing CI test suite you can reproduce locally; store each task's metadata (for example repo URL, issue number, pre-fix commit, and test command) in a JSON registry.
Write a Docker-based sandbox manager: for each task, clone the repo at the pre-fix commit, mount a working directory, and expose a tool interface (read_file, write_file, run_tests) the agent calls via function-calling.
Implement the agent loop: the agent receives the issue body, calls tools iteratively, and signals completion; capture the final diff and run the repo’s test command inside the container.
Score each run as PASS (the issue's target tests go green with no regressions in the rest of the suite), PARTIAL (only some target tests pass), or FAIL; log token usage and wall-clock time per task.
Build a CLI dashboard with Rich that shows a live pass-rate table broken down by language, and write a summary JSON file for tracking regressions across agent versions.
Add a --compare flag that runs two agent configs head-to-head on the same task set and prints a side-by-side diff of per-language scores.
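
The scoring and comparison steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a prescribed design: TaskResult, score, pass_rate_by_language, and compare are hypothetical names, and the rule that a run with all target tests green but at least one regression counts as PARTIAL is an assumption the brief does not settle.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Verdict(str, Enum):
    PASS = "PASS"
    PARTIAL = "PARTIAL"
    FAIL = "FAIL"


@dataclass(frozen=True)
class TaskResult:
    """One agent run on one task (hypothetical record layout)."""
    task_id: str
    language: str
    target_passed: int  # target tests that pass after the agent's patch
    target_total: int   # target tests associated with the issue
    regressions: int    # tests that passed before the patch and fail after it
    tokens: int
    seconds: float


def score(r: TaskResult) -> Verdict:
    """Map raw test counts to a verdict.

    Assumption: a regression downgrades an otherwise green run to PARTIAL.
    """
    all_green = r.target_total > 0 and r.target_passed == r.target_total
    if all_green and r.regressions == 0:
        return Verdict.PASS
    if r.target_passed > 0:
        return Verdict.PARTIAL
    return Verdict.FAIL


def pass_rate_by_language(results):
    """Return {language: fraction of tasks scored PASS}."""
    counts = defaultdict(lambda: [0, 0])  # language -> [passes, total]
    for r in results:
        counts[r.language][1] += 1
        if score(r) is Verdict.PASS:
            counts[r.language][0] += 1
    return {lang: passes / total for lang, (passes, total) in counts.items()}


def compare(results_a, results_b):
    """Pair per-language pass rates of two agent configs for side-by-side output."""
    a = pass_rate_by_language(results_a)
    b = pass_rate_by_language(results_b)
    return {lang: (a.get(lang, 0.0), b.get(lang, 0.0)) for lang in sorted(set(a) | set(b))}
```

The dictionary returned by compare can feed a Rich table directly, one row per language, and the same structure can be dumped to the summary JSON file so that later agent versions are compared against it.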

Risks

Reproducing CI environments locally is fragile: dependency pinning, OS-level native libraries, and Makefile quirks can be expected to break roughly 20% of tasks, so budget time for a manual triage pass on the seed list.
GitHub rate limits will throttle issue and PR fetching during the curation phase; cache all API responses to disk on first fetch.
Docker-in-Docker or volume-mount permission issues on macOS/Windows will cause silent test failures that look like agent failures; test the sandbox layer independently before wiring in the agent.
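
The disk cache suggested for the curation phase could look roughly like the sketch below. It assumes each API response can be serialised as JSON; cached_fetch, the fetch callable, and cache_dir are hypothetical names, and the actual network call (through PyGithub or any HTTP client) is deliberately left to the caller.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def cached_fetch(url: str, fetch: Callable[[str], dict], cache_dir: Path) -> dict:
    """Return the response for url, calling fetch only on a cache miss.

    fetch is any caller-supplied function that takes a URL and returns
    JSON-serialisable data; this sketch does not talk to the network itself.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hash the URL so the cache file name is safe on every filesystem.
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    path = cache_dir / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    payload = fetch(url)
    path.write_text(json.dumps(payload), encoding="utf-8")
    return payload
```

Because the cache key depends only on the URL, re-running the curation script after a rate-limit interruption repeats no request that already succeeded. Deleting the cache directory forces a fresh fetch.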
