Multi-Hop RAG with Evolving Evidence Tracker

A multi-hop question-answering tool that maintains a running ‘evidence ledger’ across retrieval iterations to avoid contradicting or re-fetching already-established facts.

Difficulty: 1-week | Stack: Python, LlamaIndex, OpenAI API (GPT-4o), Pydantic, SQLite, Streamlit

Who this is for

Researchers and analysts in legal, scientific, or financial domains who need to answer complex multi-part questions over large document corpora.

Build steps

Define a Pydantic EvidenceLedger model that stores extracted facts (claim, source chunk ID, confidence) accumulated across retrieval hops.
Build a retrieval loop: at each hop, embed the residual question (original question minus already-answered sub-questions) and retrieve new chunks.
After each retrieval, run an LLM extraction step that reads the ledger and new chunks, adds confirmed facts, flags contradictions, and identifies remaining open sub-questions.
Persist the ledger in SQLite so multi-hop chains are inspectable and resumable; surface the chain-of-evidence in the UI.
Build a Streamlit UI showing the question, the evolving ledger per hop, and the final synthesized answer with provenance links back to source chunks.
Evaluate on a small benchmark (e.g., 2WikiMultiHopQA subset) and compare answer accuracy against a flat single-hop RAG baseline.
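
The ledger, persistence, and hop loop described in the steps above could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: standard-library dataclasses stand in for the Pydantic model so the example has no dependencies, and the names Fact, residual_query, init_db, save_hop, load_latest, and run_hops, the ledger_hops table, and the retrieve/extract callables are all hypothetical. In a real build, retrieve would wrap a LlamaIndex retriever and extract would wrap the GPT-4o extraction call.

```python
import json
import sqlite3
from dataclasses import asdict, dataclass, field
from typing import Callable, List, Optional


@dataclass
class Fact:
    claim: str
    source_chunk_id: str
    confidence: float


@dataclass
class EvidenceLedger:
    question: str
    facts: List[Fact] = field(default_factory=list)
    contradictions: List[str] = field(default_factory=list)
    open_subquestions: List[str] = field(default_factory=list)

    def residual_query(self) -> str:
        # Assumed heuristic: query only for what is still open;
        # fall back to the original question on the first hop.
        return " ".join(self.open_subquestions) or self.question


def init_db(conn: sqlite3.Connection) -> None:
    # One row per (run, hop) keeps every intermediate ledger inspectable.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ledger_hops ("
        "run_id TEXT, hop INTEGER, ledger_json TEXT, "
        "PRIMARY KEY (run_id, hop))"
    )
    conn.commit()


def save_hop(conn: sqlite3.Connection, run_id: str, hop: int,
             ledger: EvidenceLedger) -> None:
    conn.execute(
        "INSERT OR REPLACE INTO ledger_hops VALUES (?, ?, ?)",
        (run_id, hop, json.dumps(asdict(ledger))),
    )
    conn.commit()


def load_latest(conn: sqlite3.Connection,
                run_id: str) -> Optional[EvidenceLedger]:
    # Returns the most recent hop's ledger so a chain can be resumed.
    row = conn.execute(
        "SELECT ledger_json FROM ledger_hops "
        "WHERE run_id = ? ORDER BY hop DESC LIMIT 1",
        (run_id,),
    ).fetchone()
    if row is None:
        return None
    data = json.loads(row[0])
    data["facts"] = [Fact(**f) for f in data["facts"]]
    return EvidenceLedger(**data)


def run_hops(ledger: EvidenceLedger,
             retrieve: Callable[[str], list],
             extract: Callable[[EvidenceLedger, list], EvidenceLedger],
             conn: sqlite3.Connection,
             run_id: str,
             max_hops: int = 4) -> EvidenceLedger:
    for hop in range(1, max_hops + 1):
        chunks = retrieve(ledger.residual_query())
        ledger = extract(ledger, chunks)  # the LLM step lives behind this
        save_hop(conn, run_id, hop, ledger)
        if not ledger.open_subquestions:
            break  # nothing left to answer
    return ledger
```

Storing each hop as its own row, rather than overwriting a single ledger, is one way to make the chain both resumable (load_latest) and replayable hop by hop in the UI. The max_hops cap is an assumed safeguard against chains that never close their open sub-questions.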

Risks

LLM extraction quality degrades on long ledgers: context-window pressure causes the model to drop or hallucinate earlier facts once the chain grows beyond 5-6 hops.
Contradiction detection is unreliable: the model may miss genuine conflicts between retrieved passages, especially with numeric or date-heavy claims.
Latency compounds with each hop; a 6-hop chain over GPT-4o can take 30+ seconds and incur significant API spend, making real-time use impractical without caching.
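
The caching mentioned in the latency risk could look something like the sketch below, which stores LLM responses in SQLite keyed by a hash of the model name and prompt. Everything here is an assumption for illustration: cached_call, the llm_cache table, and the call_llm callable are hypothetical names, and call_llm stands in for whatever function actually issues the API request.

```python
import hashlib
import sqlite3
from typing import Callable


def cached_call(conn: sqlite3.Connection, model: str, prompt: str,
                call_llm: Callable[[str], str]) -> str:
    conn.execute(
        "CREATE TABLE IF NOT EXISTS llm_cache "
        "(key TEXT PRIMARY KEY, response TEXT)"
    )
    # Key on model + prompt so switching models never returns a stale answer.
    key = hashlib.sha256(f"{model}\n{prompt}".encode("utf-8")).hexdigest()
    row = conn.execute(
        "SELECT response FROM llm_cache WHERE key = ?", (key,)
    ).fetchone()
    if row is not None:
        return row[0]  # cache hit: no API call, no added latency
    response = call_llm(prompt)
    conn.execute("INSERT INTO llm_cache VALUES (?, ?)", (key, response))
    conn.commit()
    return response
```

An exact-match cache like this only helps when the same prompt recurs, for example when re-running a benchmark or resuming a chain. It does nothing for first-time questions.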
