Multi-Hop RAG with Evolving Evidence Tracker
A multi-hop question-answering tool that maintains a running ‘evidence ledger’ across retrieval iterations to avoid contradicting or re-fetching already-established facts.
Difficulty: 1-week | Stack: Python, LlamaIndex, OpenAI API (GPT-4o), Pydantic, SQLite, Streamlit
Who this is for
Researchers and analysts in legal, scientific, or financial domains who need to answer complex multi-part questions over large document corpora.
Build steps
- Define a Pydantic EvidenceLedger model that stores extracted facts (claim, source chunk ID, confidence) accumulated across retrieval hops.
- Build a retrieval loop: at each hop, embed the residual question (original question minus already-answered sub-questions) and retrieve new chunks.
- After each retrieval, run an LLM extraction step that reads the ledger and new chunks, adds confirmed facts, flags contradictions, and identifies remaining open sub-questions.
- Persist the ledger in SQLite so multi-hop chains are inspectable and resumable; surface the chain-of-evidence in the UI.
- Build a Streamlit UI showing the question, the evolving ledger per hop, and the final synthesized answer with provenance links back to source chunks.
- Evaluate on a small benchmark (e.g., 2WikiMultiHopQA subset) and compare answer accuracy against a flat single-hop RAG baseline.
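The ledger, the retrieval loop, and the SQLite persistence described in the steps above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a reference implementation: it uses standard-library dataclasses as a stand-in for the Pydantic model so it runs without dependencies, and `retrieve` and `extract` are hypothetical callables standing in for the LlamaIndex retriever and the GPT-4o extraction step. The `ledger_hops` table name and the field names are likewise assumptions.

```python
import json
import sqlite3
from dataclasses import asdict, dataclass, field
from typing import Callable

Chunk = tuple[str, str]  # (chunk_id, text); an assumed shape for retrieved chunks


@dataclass
class Fact:
    claim: str
    source_chunk_id: str
    confidence: float


@dataclass
class EvidenceLedger:
    question: str
    open_subquestions: list[str]
    facts: list[Fact] = field(default_factory=list)
    contradictions: list[str] = field(default_factory=list)

    def residual_question(self) -> str:
        # Only the still-open sub-questions drive the next retrieval.
        return " ".join(self.open_subquestions)


def save_hop(conn: sqlite3.Connection, run_id: str, hop: int,
             ledger: EvidenceLedger) -> None:
    # One row per hop, so a chain can be inspected or resumed later.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ledger_hops ("
        "run_id TEXT, hop INTEGER, ledger_json TEXT, "
        "PRIMARY KEY (run_id, hop))"
    )
    conn.execute(
        "INSERT OR REPLACE INTO ledger_hops VALUES (?, ?, ?)",
        (run_id, hop, json.dumps(asdict(ledger))),
    )
    conn.commit()


def run_chain(
    ledger: EvidenceLedger,
    retrieve: Callable[[str], list[Chunk]],
    extract: Callable[[EvidenceLedger, list[Chunk]], EvidenceLedger],
    conn: sqlite3.Connection,
    run_id: str,
    max_hops: int = 6,
) -> EvidenceLedger:
    seen_chunk_ids: set[str] = set()
    for hop in range(max_hops):
        if not ledger.open_subquestions:
            break  # every sub-question has been answered
        candidates = retrieve(ledger.residual_question())
        # Skip chunks that an earlier hop already fed to the extractor.
        chunks = [c for c in candidates if c[0] not in seen_chunk_ids]
        if not chunks:
            break  # nothing new to read; stop rather than loop idly
        seen_chunk_ids.update(chunk_id for chunk_id, _ in chunks)
        ledger = extract(ledger, chunks)
        save_hop(conn, run_id, hop, ledger)
    return ledger
```

In a real build, `extract` would prompt the model with the serialized ledger plus the new chunks and parse a structured response back into the ledger. Swapping the dataclasses for Pydantic models would add validation of that response.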
Risks
- LLM extraction quality degrades on long ledgers — context window pressure causes the model to drop or hallucinate earlier facts when the chain grows beyond 5-6 hops.
- Contradiction detection is unreliable: the model may miss genuine conflicts between retrieved passages, especially with numeric or date-heavy claims.
- Latency compounds with each hop; a 6-hop chain over GPT-4o can take 30+ seconds and incur significant API spend, making real-time use impractical without caching.
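One way to approach the caching mentioned in the latency risk is a small response cache keyed by a hash of the model name and prompt, stored in the same SQLite database. The sketch below is illustrative only: `HopCache`, the `hop_cache` table, and the `call` parameter (a hypothetical wrapper around the actual LLM request) are assumptions, not part of the original design.

```python
import hashlib
import json
import sqlite3
from typing import Callable


def cache_key(model: str, prompt: str) -> str:
    # Hash the model and prompt together so a model change invalidates entries.
    payload = json.dumps({"model": model, "prompt": prompt}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class HopCache:
    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn
        self.hits = 0
        self.misses = 0
        conn.execute(
            "CREATE TABLE IF NOT EXISTS hop_cache ("
            "key TEXT PRIMARY KEY, response TEXT NOT NULL)"
        )

    def get_or_call(self, model: str, prompt: str,
                    call: Callable[[str], str]) -> str:
        key = cache_key(model, prompt)
        row = self.conn.execute(
            "SELECT response FROM hop_cache WHERE key = ?", (key,)
        ).fetchone()
        if row is not None:
            self.hits += 1
            return row[0]  # served from SQLite, no API call made
        self.misses += 1
        response = call(prompt)  # the slow, billable step
        self.conn.execute(
            "INSERT INTO hop_cache VALUES (?, ?)", (key, response)
        )
        self.conn.commit()
        return response
```

A cache like this only helps when the exact same prompt recurs, for example when a chain is resumed or a benchmark is re-run; it does not shorten the first pass through a new question.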
Business Angle
Multi-hop RAG evidence tracker for legal and compliance researchers drowning in large document sets
Customer: Solo compliance analyst or legal researcher at a small law firm or boutique consultancy (1–10 person team), handling due diligence, regulatory review, or case research across 500–5,000 internal documents — technically comfortable enough to use a web UI but not a Python dev
Pricing: saas-mrr, targeting about $800 MRR within 4 months (8 customers at $99/mo, i.e. $792)