AI Pulse

Evolving-World Memory Probe

A harness that stress-tests an LLM agent’s memory by feeding it facts that contradict earlier ones, then scoring memory separately at the write, maintain, and retrieve phases.

Difficulty: weekend | Stack: Python, LangGraph, SQLite, pytest, OpenAI or Anthropic SDK

Who this is for

AI researchers and agent builders who need to know where their agent’s memory breaks—not just whether the final answer is right—before shipping a long-horizon product.

Build steps

  1. Define 20–30 ‘world scenarios’ as YAML files: each has an initial fact set plus 3–5 timed contradicting updates (e.g., a stock price, a user’s address, a task status).
  2. Build a Python harness that replays scenarios in sequence, injecting each update into the agent’s context and recording the agent’s internal memory store after each step.
  3. Write a scoring module that grades memory separately in three phases: write accuracy (did the agent store the update?), maintenance (did it overwrite the stale value?), and retrieval (did the right value surface in the final answer?).
  4. Run the harness against at least two agent memory strategies (a naive in-context list vs. a SQLite-backed key-value store) and emit a markdown report comparing where each one breaks down, phase by phase.
  5. Expose results as a small CLI: running python probe.py --agent naive --scenario stock_price should print a per-step memory diff.
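
The three-phase scoring in step 3 and the per-step diff in step 5 can be sketched roughly as below. This is a minimal illustration, not a reference implementation: the names PhaseScores, score_scenario, and memory_diff are hypothetical, the scenario is a plain dict standing in for a parsed YAML file, and each memory snapshot is assumed to be a simple list of the values the agent currently holds for one key. Retrieval is checked with naive substring matching, which a real harness would likely need to replace.

```python
from dataclasses import dataclass


@dataclass
class PhaseScores:
    write: float     # fraction of updates whose new value was stored
    maintain: float  # fraction of steps with no stale value left in memory
    retrieve: float  # 1.0 if the final answer contains the latest value


def score_scenario(scenario, snapshots, final_answer):
    """Grade one replayed scenario (hypothetical data layout).

    scenario: dict with 'initial' (str) and 'updates' (list of str)
    snapshots: one list of stored values per update, taken after that update
    final_answer: the agent's answer text at the end of the scenario
    """
    updates = scenario["updates"]
    stale = {scenario["initial"]}  # values that should no longer be stored
    wrote = clean = 0
    for new_value, stored in zip(updates, snapshots):
        if new_value in stored:
            wrote += 1
        # A step is "clean" if no superseded value is still in memory.
        if not any(s in stored for s in stale if s != new_value):
            clean += 1
        stale.add(new_value)  # becomes stale once the next update arrives
    n = len(updates)
    retrieved = float(updates[-1] in final_answer)  # crude substring check
    return PhaseScores(write=wrote / n, maintain=clean / n, retrieve=retrieved)


def memory_diff(before, after):
    """Per-step diff of two memory snapshots, preserving order."""
    return {
        "added": [v for v in after if v not in before],
        "removed": [v for v in before if v not in after],
    }


scenario = {"initial": "100", "updates": ["105", "98", "110"]}
# Naive in-context list: appends every fact and never deletes.
naive = [["100", "105"], ["100", "105", "98"], ["100", "105", "98", "110"]]
# Key-value store: each update overwrites the previous value.
kv = [["105"], ["98"], ["110"]]

naive_scores = score_scenario(scenario, naive, "The price is 110.")
kv_scores = score_scenario(scenario, kv, "The price is 110.")
```

In this toy run both strategies get the final answer right, but only the maintenance score shows that the naive list never discards stale values. That is the kind of gap the per-phase breakdown is meant to expose.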

Risks

  • LLM non-determinism makes repeated runs produce different scores; set temperature to 0 and fix a seed where the provider supports one, then still run N=5 trials and report mean±std, since hosted models are not guaranteed to be deterministic.
  • Designing contradictions that genuinely require an update (the agent should overwrite, not just append) is harder than it looks; poorly written scenarios will make every strategy look equally bad.
  • Token-window overflow on long scenarios silently truncates context, making in-context memory look worse than it is; add a token-count guard before each injection.
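
Two of the mitigations above, reporting mean±std over repeated trials and guarding against silent truncation, might look something like this. It is a sketch under stated assumptions: summarize_trials, estimate_tokens, guard_injection, and ContextOverflow are hypothetical names, and the four-characters-per-token estimate is a rough heuristic that should be swapped for the provider's actual tokenizer.

```python
import statistics


def summarize_trials(scores):
    """Return (mean, sample standard deviation) for per-trial scores."""
    mean = statistics.mean(scores)
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, std


def estimate_tokens(text):
    # Assumption: roughly 4 characters per token for English text.
    # Replace with a real tokenizer for accurate counts.
    return max(1, len(text) // 4)


class ContextOverflow(Exception):
    """Raised when an injection would exceed the token budget."""


def guard_injection(context, update, max_tokens):
    """Append an update to the context, failing loudly on overflow
    instead of letting the model's window truncate it silently."""
    total = estimate_tokens(context) + estimate_tokens(update)
    if total > max_tokens:
        raise ContextOverflow(
            f"{total} estimated tokens exceeds budget of {max_tokens}"
        )
    return context + "\n" + update


# Five trials of the same scenario (illustrative numbers, not real results).
mean, std = summarize_trials([0.8, 0.6, 1.0, 0.8, 0.8])
report_line = f"retrieve: {mean:.2f} ± {std:.2f}"
```

Raising an exception rather than trimming the context keeps overflow visible in the report, so a truncated run is not mistaken for a memory failure.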

Business Angle

A plug-and-play memory stress-test harness that shows agent builders exactly where and why their LLM agent forgets, contradicts, or hallucinates across long sessions—before they ship.

Customer: Solo AI engineers or 2-person teams building LLM-powered products (coding assistants, research copilots, customer-support agents) who are past the demo stage and about to ship to real users, but have no systematic way to validate memory behavior across multi-turn or multi-day sessions.

Pricing: one-time license at $99. Target about $800 in month 1 (8 licenses × $99 = $792), growing to $2,500/mo by month 4 as word spreads in AI builder communities.
