Tool-Teaching Benchmark Harness

Empirical testbed measuring how many in-context examples an agent needs before it reliably uses a tool it has never seen

Difficulty: 1-week | Stack: Python, Claude API, OpenAI API, pytest, DuckDB, Plotly

Who this is for

Agent framework developers and ML engineers who need data on the tradeoffs between in-context tool learning and fine-tuning before committing to an architecture

Define 5 synthetic ‘novel’ tools with full JSON schemas (tools that definitely aren’t in training data — e.g., a fake internal ticketing API, a fictional IoT sensor protocol)
Build a test runner that injects N tool-use examples into the system prompt (N = 0, 1, 3, 5, 10, 20), then runs 50 tasks at each N, every task requiring a correct tool call
Score each run on four dimensions: exact schema match, argument type correctness, correct vs. wrong tool selected, and hallucinated arguments
Store all results in DuckDB — schema: (model, tool_id, n_examples, task_id, score_breakdown, latency, token_cost)
Generate Plotly dashboard showing learning curves per tool and per model, with cost-per-correct-call overlay
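
The scoring step above can be sketched as follows. This is a minimal illustration, not a definitive rubric: the tool name `zephyr_ticket_create`, its fields, the `ScoreBreakdown` dataclass, and the `score_call` helper are all hypothetical, and "exact schema match" is assumed here to mean correct tool, no unknown arguments, no missing required arguments, and correct types.

```python
from dataclasses import dataclass

# Hypothetical synthetic tool; the name and fields are invented for illustration.
TICKET_TOOL = {
    "name": "zephyr_ticket_create",
    "parameters": {
        "type": "object",
        "properties": {
            "queue": {"type": "string"},
            "severity": {"type": "integer"},
            "summary": {"type": "string"},
        },
        "required": ["queue", "summary"],
    },
}

# Mapping from JSON Schema primitive type names to Python types.
JSON_TYPES = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "object": dict,
    "array": list,
}


@dataclass
class ScoreBreakdown:
    correct_tool: bool
    types_ok: bool
    hallucinated_args: list
    missing_required: list
    schema_match: bool


def _type_ok(value, json_type):
    # bool is a subclass of int in Python, so reject it for numeric types.
    if json_type in ("integer", "number") and isinstance(value, bool):
        return False
    return isinstance(value, JSON_TYPES[json_type])


def score_call(expected_tool, call):
    """Score one model tool call against the tool the task expected."""
    props = expected_tool["parameters"]["properties"]
    required = expected_tool["parameters"].get("required", [])
    args = call.get("arguments", {})

    hallucinated = sorted(k for k in args if k not in props)
    missing = sorted(k for k in required if k not in args)
    types_ok = all(
        _type_ok(v, props[k]["type"]) for k, v in args.items() if k in props
    )
    correct_tool = call.get("name") == expected_tool["name"]
    # Assumed definition of "exact schema match": every check passes.
    schema_match = correct_tool and types_ok and not hallucinated and not missing
    return ScoreBreakdown(correct_tool, types_ok, hallucinated, missing, schema_match)
```

Keeping each dimension as its own field, rather than collapsing to a single number, lets the `score_breakdown` column hold the full record so a partial-credit rubric can be changed later without re-running any API calls.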

Synthetic tools may accidentally resemble real tools from the training data. Validate by checking each model’s zero-shot (N = 0) performance: if it scores above 60% without examples, the tool isn’t novel enough
50 tasks × 6 N-values × 2 models = 600 API calls per tool, or 3,000 across all 5 tools. Costs add up fast, so build result caching from the start
Scoring ‘correct’ tool use is harder than it looks: partial-credit logic needs an explicit rubric, or the results will be misleading
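
The first two pitfalls can be made concrete with a short sketch. Everything named here (`planned_calls`, `cache_key`, `ResultCache`, `is_novel`) is a hypothetical helper, and the cache is an in-memory dictionary for illustration only; a real harness would presumably persist results so that reruns survive a restart.

```python
import hashlib
import json

N_VALUES = [0, 1, 3, 5, 10, 20]
TASKS_PER_N = 50
NOVELTY_THRESHOLD = 0.60  # zero-shot success above this means "not novel enough"


def planned_calls(n_models, n_tools=1):
    """API calls needed for a full sweep: tasks x N-values x models x tools."""
    return TASKS_PER_N * len(N_VALUES) * n_models * n_tools


def cache_key(model, tool_id, n_examples, task_id, prompt):
    """Deterministic key: identical inputs always hash to the same string."""
    payload = json.dumps([model, tool_id, n_examples, task_id, prompt])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ResultCache:
    """In-memory cache that only invokes call_fn on a miss."""

    def __init__(self):
        self._store = {}
        self.misses = 0

    def get_or_call(self, key, call_fn):
        if key not in self._store:
            self.misses += 1
            self._store[key] = call_fn()
        return self._store[key]


def is_novel(zero_shot_results):
    """zero_shot_results: list of booleans, one per N = 0 task."""
    if not zero_shot_results:
        raise ValueError("need at least one zero-shot result")
    success_rate = sum(zero_shot_results) / len(zero_shot_results)
    return success_rate <= NOVELTY_THRESHOLD
```

Including the full prompt in the cache key means that editing an injected example invalidates only the affected entries, at the cost of hashing a longer string per call.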