Evolving-World Memory Probe
A harness that stress-tests an LLM agent’s memory by feeding it facts that contradict earlier ones, then scoring memory separately at the write, maintain, and retrieve phases.
Difficulty: weekend | Stack: Python, LangGraph, SQLite, pytest, OpenAI or Anthropic SDK
Who this is for
AI researchers and agent builders who need to know where their agent’s memory breaks—not just whether the final answer is right—before shipping a long-horizon product.
Build steps
- Define 20–30 ‘world scenarios’ as YAML files: each has an initial fact set plus 3–5 timed contradicting updates (e.g., a stock price, a user’s address, a task status).
- Build a Python harness that replays scenarios in sequence, injecting each update into the agent’s context and recording the agent’s internal memory store after each step.
- Write a scoring module that separately grades memory at three phases: write accuracy (did the agent store the update?), maintenance (did it overwrite the stale value?), and retrieval (did the right value surface in the final answer?).
- Run the harness against at least two agent memory strategies—a naive in-context list vs. a SQLite-backed key-value store—and emit a markdown report comparing breakdown by phase.
- Expose results as a small CLI; for example, running python probe.py --agent naive --scenario stock_price prints a per-step memory diff.
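The replay and scoring steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual design: the scenario is a plain dict standing in for a YAML file, the two memory classes are deterministic stand-ins for LLM-driven agents, and every name here (SCENARIO, NaiveListMemory, SqliteKVMemory, score_scenario) is hypothetical. Retrieval is checked after each update rather than only in a final answer, which is a simplification.

```python
import sqlite3

# Hypothetical scenario shape; a real run would load this from a YAML file.
SCENARIO = {
    "name": "stock_price",
    "key": "ACME.price",
    "initial": "100",
    "updates": ["105", "98", "112"],  # each update contradicts the previous value
}


class NaiveListMemory:
    """Append-only memory: every fact is kept, nothing is overwritten."""

    def __init__(self):
        self.items = []

    def write(self, key, value):
        self.items.append((key, value))

    def values(self, key):
        return [v for k, v in self.items if k == key]

    def retrieve(self, key):
        # Assumption for illustration: a naive scan returns the first match.
        vals = self.values(key)
        return vals[0] if vals else None


class SqliteKVMemory:
    """Key-value memory backed by SQLite; a write replaces the stored value."""

    def __init__(self):
        self.db = sqlite3.connect(":memory:")
        self.db.execute("CREATE TABLE mem (key TEXT PRIMARY KEY, value TEXT)")

    def write(self, key, value):
        self.db.execute(
            "INSERT OR REPLACE INTO mem (key, value) VALUES (?, ?)", (key, value)
        )

    def values(self, key):
        rows = self.db.execute("SELECT value FROM mem WHERE key = ?", (key,))
        return [row[0] for row in rows]

    def retrieve(self, key):
        vals = self.values(key)
        return vals[-1] if vals else None


def score_scenario(memory, scenario):
    """Replay one scenario and grade write, maintain, and retrieve per step."""
    key = scenario["key"]
    memory.write(key, scenario["initial"])
    steps = []
    for update in scenario["updates"]:
        memory.write(key, update)
        stored = memory.values(key)
        steps.append({
            "write": update in stored,                   # was the update stored?
            "maintain": stored == [update],              # is the stale value gone?
            "retrieve": memory.retrieve(key) == update,  # does the right value surface?
        })
    phases = ("write", "maintain", "retrieve")
    return {p: sum(s[p] for s in steps) / len(steps) for p in phases}


for label, mem in [("naive", NaiveListMemory()), ("sqlite", SqliteKVMemory())]:
    print(label, score_scenario(mem, SCENARIO))
```

With these stand-ins the append-only store passes the write phase but fails maintenance and retrieval, while the key-value store passes all three. That per-phase split is what the report is meant to surface.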
Risks
- LLM non-determinism makes repeated runs produce different scores; fix seeds where the provider supports them (hosted APIs may still vary between calls), or run N=5 trials and report mean±std to get stable numbers.
- Designing contradictions that genuinely require an overwrite (the agent should update the stored value, not just append a new one) is harder than it looks; poorly written scenarios will make every strategy look equally bad.
- Token-window overflow on long scenarios silently truncates context, making in-context memory look worse than it is; add a token-count guard before each injection.
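Two of the mitigations above, reporting mean±std over repeated trials and guarding each injection with a token count, might look something like this. It is a sketch under stated assumptions: the helper names (summarize_trials, estimate_tokens, guard_injection, ContextOverflow) are hypothetical, and the four-characters-per-token estimate is a crude placeholder that should be replaced with the target model's real tokenizer.

```python
import statistics


def summarize_trials(scores):
    """Return (mean, sample standard deviation) for a list of per-trial scores."""
    mean = statistics.mean(scores)
    # stdev needs at least two data points; report 0.0 for a single trial.
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, std


def estimate_tokens(text):
    # Crude assumption: about 4 characters per token, rounded up.
    return (len(text) + 3) // 4


class ContextOverflow(RuntimeError):
    """Raised when an injection would exceed the token budget."""


def guard_injection(context, update, max_tokens):
    """Append the update, raising instead of allowing silent truncation."""
    candidate = context + "\n" + update
    needed = estimate_tokens(candidate)
    if needed > max_tokens:
        raise ContextOverflow(f"{needed} tokens needed, limit is {max_tokens}")
    return candidate


mean, std = summarize_trials([0.8, 0.6, 0.7, 0.9, 0.5])
print(f"retrieval accuracy: {mean:.2f} ± {std:.2f}")
```

Raising an error rather than trimming the context keeps an overflow visible in the report, so a truncated run is not mistaken for a memory failure.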
Business angle
A plug-and-play memory stress-test harness that shows agent builders exactly where and why their LLM agent forgets, contradicts, or hallucinates across long sessions—before they ship.
Customer: Solo AI engineers or 2-person teams building LLM-powered products (coding assistants, research copilots, customer-support agents) who are past the demo stage and about to ship to real users, but have no systematic way to validate memory behavior across multi-turn or multi-day sessions.
Pricing: one-time license at $99. Target roughly $800 in month 1 from 8 licenses, growing to $2,500/mo by month 4 as word spreads in AI builder communities.