Multimodal RAG Evaluator

An evaluation harness that checks whether a RAG pipeline correctly grounds answers in retrieved audio/video content, not just text chunks.

Difficulty: 1-week | Stack: Python, OpenAI Whisper (transcription), LlamaIndex, FastAPI, SQLite (result store), Pytest

Who this is for

Developers building RAG products over podcasts, YouTube transcripts, or instructional videos who currently have no way to measure multimodal retrieval quality separately from generation quality.

Build steps

Build an ingestion pipeline: take a set of YouTube/podcast URLs, run Whisper to transcribe, timestamp-chunk the transcripts, and store chunks with their source timestamps in a LlamaIndex vector store.
Author a small golden eval set (30–50 QA pairs) where each answer is provably grounded in a specific audio segment—include the expected source timestamp range as the ground-truth citation.
Implement two scorers: (a) citation precision, a retrieval metric: does the retrieved chunk overlap the expected timestamp window? (b) answer faithfulness, a generation metric: does the generated answer contradict the audio source?
Add a ‘text-only baseline’ mode that strips timestamps and treats transcripts as plain text, then compare its scores against the timestamp-aware run to show the modality gap that a text-only evaluation misses.
Expose results via a minimal FastAPI endpoint returning JSON so the harness can slot into a CI pipeline, failing the build if faithfulness drops below a configurable threshold.
Write a pytest test, backed by a fixture that loads a tiny 5-question smoke set, that runs the full eval in under 60 seconds so it’s usable as a pre-commit check.
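
The chunking and citation-precision steps above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the `Chunk` class, the function names, and the 30-second window are hypothetical, and the segment dictionaries (`start`, `end`, `text`) are assumed to resemble the per-segment output of a Whisper transcription. Citation precision is taken here to mean the fraction of retrieved chunks that overlap the gold timestamp window; a stricter definition (full containment, or a minimum overlap ratio) would be equally defensible.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """A transcript chunk with its source time span, in seconds (hypothetical type)."""
    source_id: str
    start: float
    end: float
    text: str


def _merge(source_id, buf):
    """Join buffered segments into one chunk spanning first start to last end."""
    text = " ".join(s["text"].strip() for s in buf)
    return Chunk(source_id, buf[0]["start"], buf[-1]["end"], text)


def chunk_segments(source_id, segments, max_seconds=30.0):
    """Group consecutive transcript segments into chunks of roughly max_seconds.

    `segments` is a list of dicts with "start", "end", and "text" keys
    (an assumed layout). A single segment longer than max_seconds
    becomes its own chunk.
    """
    chunks, buf = [], []
    for seg in segments:
        # Close the current chunk if adding this segment would exceed the window.
        if buf and seg["end"] - buf[0]["start"] > max_seconds:
            chunks.append(_merge(source_id, buf))
            buf = []
        buf.append(seg)
    if buf:
        chunks.append(_merge(source_id, buf))
    return chunks


def overlaps(chunk, source_id, gold_start, gold_end):
    """True if the chunk is from the gold source and intersects the gold window."""
    return (chunk.source_id == source_id
            and chunk.start < gold_end
            and chunk.end > gold_start)


def citation_precision(retrieved, source_id, gold_start, gold_end):
    """Fraction of retrieved chunks that overlap the gold timestamp window."""
    if not retrieved:
        return 0.0
    hits = sum(overlaps(c, source_id, gold_start, gold_end) for c in retrieved)
    return hits / len(retrieved)


# Usage: three segments become two chunks, (0-28 s) and (28-45 s).
segments = [
    {"start": 0.0, "end": 12.0, "text": "Welcome to the show."},
    {"start": 12.0, "end": 28.0, "text": "Today we cover vector search."},
    {"start": 28.0, "end": 45.0, "text": "First, embeddings."},
]
chunks = chunk_segments("ep1", segments, max_seconds=30.0)
# Gold citation is 30-40 s, so only the second chunk counts as a hit.
print(citation_precision(chunks, "ep1", 30.0, 40.0))  # 0.5
```

In text-only baseline mode the same retrieved chunks would carry no `start`/`end` values, so this scorer cannot be computed at all; that missing signal is one concrete way to report the gap between the two modes.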

Risks

Whisper transcription errors compound into retrieval errors—poor audio quality can make a correct retrieval look wrong; log transcription confidence scores alongside eval results to separate the two failure modes.
Golden eval sets go stale when source videos are edited or deleted; store a local audio snapshot and SHA-256 hash of each source file at ingestion time to detect drift.
Faithfulness scoring with an LLM judge is itself biased by what the judge model already knows: the judge may ‘know’ the answer independently of the retrieved chunk and score against that knowledge instead. Validate the judge with adversarial distractors, where the chunk contains deliberate factual noise, and check that its verdict follows the chunk.
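
The snapshot-and-hash check for stale sources can be sketched with the standard library alone. This is an illustrative outline under stated assumptions: the manifest layout (source id mapped to a file name and a SHA-256 hex digest) and the function names are hypothetical, and the manifest is assumed to be written at ingestion time.

```python
import hashlib
from pathlib import Path


def sha256_file(path, block_size=1 << 20):
    """Hash a file in fixed-size blocks so large audio files are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def find_drift(manifest, audio_dir):
    """Return the source ids whose file is missing or no longer matches its recorded hash.

    `manifest` maps a source id to {"file": <name>, "sha256": <hex digest>};
    this layout is hypothetical.
    """
    drifted = []
    for source_id, entry in manifest.items():
        path = Path(audio_dir) / entry["file"]
        if not path.is_file() or sha256_file(path) != entry["sha256"]:
            drifted.append(source_id)
    return sorted(drifted)
```

Pointing `audio_dir` at a directory of freshly downloaded copies would flag sources that changed upstream, while pointing it at the local snapshots verifies that the stored audio is still intact. Golden QA pairs tied to a flagged source id can then be skipped or marked for review instead of silently lowering the scores.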

Business Angle