Evolving-World Memory Probe

A harness that stress-tests an LLM agent’s memory by feeding it facts that contradict earlier ones, then scoring recall separately for the write, maintain, and retrieve phases.

Difficulty: weekend | Stack: Python, LangGraph, SQLite, pytest, OpenAI or Anthropic SDK

Who this is for

AI researchers and agent builders who need to know where their agent’s memory breaks—not just whether the final answer is right—before shipping a long-horizon product.

Build steps

Define 20–30 ‘world scenarios’ as YAML files: each has an initial fact set plus 3–5 timed contradicting updates (e.g., a stock price, a user’s address, a task status).
Build a Python harness that replays scenarios in sequence, injecting each update into the agent’s context and recording the agent’s internal memory store after each step.
Write a scoring module that grades memory separately in three phases: write accuracy (did the agent store the update?), maintenance (did it overwrite the stale value?), and retrieval (did the correct value surface in the final answer?).
Run the harness against at least two agent memory strategies (a naive in-context list and a SQLite-backed key-value store) and emit a Markdown report comparing their scores phase by phase.
Expose results as a small CLI; for example, running python probe.py --agent naive --scenario stock_price prints a per-step memory diff.
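
The scoring and comparison steps above can be sketched without any LLM in the loop. The snippet below is a minimal illustration, not the project's implementation: the Scenario dataclass (a stand-in for one parsed YAML file), the two memory classes, and the score function are all hypothetical names, and the naive strategy's "return the first match" retrieval is a deliberately simple assumption standing in for an agent that surfaces a stale value.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class Scenario:
    """Hypothetical in-memory mirror of one YAML scenario file."""
    name: str
    key: str
    initial: str
    updates: list[str]  # contradicting values in time order; assumed distinct and non-empty


class NaiveListMemory:
    """Append-only list: stale values are never removed."""

    def __init__(self):
        self.items = []

    def write(self, key, value):
        self.items.append((key, value))

    def dump(self, key):
        return [v for k, v in self.items if k == key]

    def retrieve(self, key):
        # Assumption for illustration: the first (oldest) match wins.
        values = self.dump(key)
        return values[0] if values else None


class SqliteKVMemory:
    """Key-value store: each write overwrites the previous value for that key."""

    def __init__(self):
        self.db = sqlite3.connect(":memory:")
        self.db.execute("CREATE TABLE kv (key TEXT PRIMARY KEY, value TEXT)")

    def write(self, key, value):
        self.db.execute(
            "INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)", (key, value)
        )

    def dump(self, key):
        rows = self.db.execute("SELECT value FROM kv WHERE key = ?", (key,))
        return [row[0] for row in rows]

    def retrieve(self, key):
        values = self.dump(key)
        return values[0] if values else None


def score(scenario, memory):
    """Replay one scenario and grade the write, maintain, and retrieve phases."""
    memory.write(scenario.key, scenario.initial)
    write_hits = maintain_hits = 0
    stale = [scenario.initial]
    for value in scenario.updates:
        memory.write(scenario.key, value)
        stored = memory.dump(scenario.key)  # snapshot of the store after this step
        write_hits += value in stored  # write: the new value was stored
        maintain_hits += not any(s in stored for s in stale)  # maintain: no stale value left
        stale.append(value)
    n = len(scenario.updates)
    return {
        "write": write_hits / n,
        "maintain": maintain_hits / n,
        "retrieve": float(memory.retrieve(scenario.key) == scenario.updates[-1]),
    }


scenario = Scenario("stock_price", "ACME", "100", ["105", "98", "110"])
for label, memory in [("naive", NaiveListMemory()), ("sqlite", SqliteKVMemory())]:
    print(label, score(scenario, memory))
```

In a real harness the write calls would come from the agent rather than from the scorer, so write accuracy could fall below 1.0. Here both stubs always store the update, which isolates the maintenance and retrieval differences between the two strategies.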

Risks

LLM non-determinism means repeated runs produce different scores; fix seeds where the provider supports them, or run N=5 trials and report mean±std, to get stable numbers.
Designing contradictions that genuinely require an overwrite (the agent should update, not just append) is harder than it looks; poorly written scenarios will make every strategy look equally bad.
Token-window overflow on long scenarios silently truncates context, making in-context memory look worse than it is; add a token-count guard before each injection.
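
Two of the mitigations above, reporting mean±std over repeated trials and guarding the token budget before each injection, might look roughly like this. It is a sketch under stated assumptions: summarize_trials, estimate_tokens, guard_injection, and ContextOverflow are hypothetical names, and the four-characters-per-token estimate is a crude heuristic rather than a real tokenizer, so a real harness would swap in the tokenizer for the model under test.

```python
import statistics


def summarize_trials(scores):
    """Return (mean, sample standard deviation) for one phase across trials."""
    mean = statistics.mean(scores)
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, std


def estimate_tokens(text, chars_per_token=4):
    # Assumption: rough character-based estimate, NOT a real tokenizer.
    return -(-len(text) // chars_per_token)  # ceiling division


class ContextOverflow(Exception):
    """Raised when an injection would exceed the configured token budget."""


def guard_injection(context, update, max_tokens):
    """Append an update to the context, failing loudly instead of risking truncation."""
    new_context = context + "\n" + update
    needed = estimate_tokens(new_context)
    if needed > max_tokens:
        raise ContextOverflow(
            f"{needed} estimated tokens exceeds budget of {max_tokens}"
        )
    return new_context


# Hypothetical retrieval scores from N=5 trials of one scenario.
mean, std = summarize_trials([1.0, 0.8, 1.0, 0.6, 0.8])
print(f"retrieve: {mean:.2f}±{std:.2f}")
```

Raising an exception rather than trimming the context keeps an overflow from being scored as a memory failure; the affected scenario can then be shortened or reported separately.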

Business Angle