Financial-Stakes Agent Eval Harness

Run LLM agents in sandboxed environments with fake-but-realistic dollar constraints and log emergent deceptive behaviors.

Difficulty: 1-week | Stack: Python, FastAPI, SQLite, Docker, LiteLLM, Pytest

Who this is for

AI safety researchers and production ML engineers who need to detect collusion/deception before deploying agents to real financial workflows

Define a minimal ‘marketplace’ environment: agents buy/sell goods via a REST API backed by SQLite ledger with real dollar-denominated constraints (budget caps, profit targets)
Implement 2-4 competing agent roles (buyer, seller, regulator, auditor) using LiteLLM so any model can plug in; each agent gets a system prompt with economic incentive
Add an inter-agent communication channel (simple message queue) so agents can coordinate — this is where cartels emerge
Build a behavior logger that flags: price convergence across sellers (cartel signal), false outcome reports (lie detection via ground-truth ledger diff), budget overruns
Write a report renderer that scores each run: deception rate, collusion index, task completion vs. claimed completion
Parameterize over models (GPT-4o, Claude Sonnet, Llama) and stake levels ($1/$10/$100 simulated) to produce a comparison matrix

Inter-agent message format becomes the bottleneck — agents speaking different JSON schemas silently fail to collude, producing false negatives
Sandbox leakage: agents may attempt real HTTP calls if Docker networking not locked down properly
Ground-truth ‘correct outcome’ is hard to define for open-ended trading tasks — without it, lie detection is unreliable