Tool-Teaching Benchmark Harness
Empirical testbed measuring how many examples an agent needs to reliably use a novel tool it has never seen
Difficulty: 1-week | Stack: Python, Claude API, OpenAI API, pytest, DuckDB, Plotly
Who this is for
Agent framework developers and ML engineers who need data on in-context tool learning vs fine-tuning tradeoffs before committing to an architecture
Build steps
- Define 5 synthetic ‘novel’ tools with full JSON schemas (tools that definitely aren’t in training data — e.g., a fake internal ticketing API, a fictional IoT sensor protocol)
- Build a test runner that injects N tool-use examples into the system prompt (N = 0, 1, 3, 5, 10, 20), then runs 50 tasks that require correct tool calls at each N
- Score each run: exact schema match, argument type correctness, correct tool selected vs wrong tool selected, hallucinated arguments
- Store all results in DuckDB — schema: (model, tool_id, n_examples, task_id, score_breakdown, latency, token_cost)
- Generate Plotly dashboard showing learning curves per tool and per model, with cost-per-correct-call overlay
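The scoring step above could be sketched roughly as follows. This is a minimal illustration using only the standard library, not a full JSON Schema validator. The schema layout (a `parameters` object with `properties` and `required`), the `ScoreBreakdown` fields, and the `create_ticket` tool are all assumptions made for the example; it only handles flat schemas where every property declares a primitive `type`.

```python
from dataclasses import dataclass

# JSON Schema primitive type names mapped to Python types.
# Assumption: flat schemas only, every property declares a "type".
JSON_TYPES = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "array": list,
    "object": dict,
}


@dataclass
class ScoreBreakdown:
    """Hypothetical per-call score record, one field per rubric item."""
    correct_tool: bool
    types_ok: bool
    missing_required: list
    hallucinated_args: list
    exact_match: bool


def type_matches(value, json_type):
    """Check one value against a JSON Schema primitive type name."""
    # bool is a subclass of int in Python, so reject it for numeric types.
    if json_type in ("integer", "number") and isinstance(value, bool):
        return False
    return isinstance(value, JSON_TYPES[json_type])


def score_call(call, expected, schema):
    """Score one model tool call against the expected call and the tool schema."""
    props = schema["parameters"]["properties"]
    required = schema["parameters"].get("required", [])
    args = call.get("arguments", {})

    correct_tool = call.get("name") == expected["name"]
    # Arguments the model produced that the schema does not define.
    hallucinated = sorted(k for k in args if k not in props)
    # Required arguments the model left out.
    missing = sorted(k for k in required if k not in args)
    # Type check only the arguments the schema knows about.
    types_ok = all(
        type_matches(v, props[k]["type"]) for k, v in args.items() if k in props
    )
    exact = correct_tool and args == expected["arguments"]
    return ScoreBreakdown(correct_tool, types_ok, missing, hallucinated, exact)


# Illustrative fake ticketing tool (hypothetical, not a real API).
TICKET_SCHEMA = {
    "name": "create_ticket",
    "parameters": {
        "type": "object",
        "properties": {
            "title": {"type": "string"},
            "priority": {"type": "integer"},
        },
        "required": ["title", "priority"],
    },
}
expected = {"name": "create_ticket",
            "arguments": {"title": "VPN down", "priority": 2}}
call = {"name": "create_ticket",
        "arguments": {"title": "VPN down", "priority": "2", "assignee": "bob"}}
result = score_call(call, expected, TICKET_SCHEMA)
```

Here the model picked the right tool but passed `priority` as a string and invented an `assignee` argument, so the breakdown records the three failure modes separately instead of collapsing them into one pass/fail bit.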
Risks
- Synthetic tools may accidentally resemble tools in the training data. Validate by checking zero-shot performance (N = 0): if a model scores above 60% without examples, the tool isn’t novel enough
- 50 tasks × 6 N-values × 2 models = 600 API calls per tool, or 3,000 across all 5 tools; costs add up fast, so build result caching from the start
- Scoring ‘correct’ tool use is harder than it looks: partial-credit logic needs an explicit rubric, or the results will be misleading
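The run-count arithmetic and the caching advice in the second risk can be illustrated with a small sketch. It enumerates the run grid for one tool and wraps API calls in a cache keyed by a hash of the run parameters. The model names, `cache_key`, `ResultCache`, and the JSON-file storage are hypothetical choices for the example; a real harness might cache in DuckDB instead.

```python
import hashlib
import itertools
import json
from pathlib import Path

N_VALUES = [0, 1, 3, 5, 10, 20]   # example counts from the build steps
MODELS = ["model-a", "model-b"]   # placeholder model names (assumption)
TASK_IDS = range(50)              # 50 tasks per N


def runs_for_tool(tool_id):
    """Enumerate every (model, tool, n_examples, task) run for one tool."""
    return [
        (model, tool_id, n, task)
        for model, n, task in itertools.product(MODELS, N_VALUES, TASK_IDS)
    ]


def cache_key(model, tool_id, n_examples, task_id, prompt):
    """Stable hash of everything that determines the API response."""
    payload = json.dumps([model, tool_id, n_examples, task_id, prompt])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ResultCache:
    """Minimal cache: in-memory dict, optionally persisted to a JSON file."""

    def __init__(self, path=None):
        self.path = Path(path) if path else None
        if self.path and self.path.exists():
            self.data = json.loads(self.path.read_text())
        else:
            self.data = {}

    def get_or_call(self, key, call_fn):
        """Return the cached result, calling call_fn only on a cache miss."""
        if key not in self.data:
            self.data[key] = call_fn()
            if self.path:
                self.path.write_text(json.dumps(self.data))
        return self.data[key]
```

Including the full prompt in the key means that changing the injected examples invalidates the cached entry, while re-running an unchanged configuration costs nothing. With sampling temperature above zero, a cached response is one sample rather than a deterministic answer, so the key may also need a seed or trial index.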
Business Angle
Hosted benchmark SaaS that measures how many in-context examples an LLM agent needs to reliably invoke a new tool — so teams skip the guesswork before picking fine-tune vs. few-shot architecture
Customer: ML engineer at a 2-10 person AI startup who is building a product-facing agent and needs to decide whether to fine-tune a model on proprietary tool schemas or rely on few-shot prompting — they have a GitHub account, read agent papers on weekends, and are blocked by lack of empirical data
Pricing: saas-mrr, roughly $800 MRR in 4 months (8 teams × $99/mo = $792)