RAG Parser Canary Suite

A test harness that stress-tests document parsers (PDF, HTML, DOCX) for silent extraction failures and measures the factual accuracy of downstream retrieval.

Difficulty: 1-week | Stack: Python, LangChain / LlamaIndex, pdfplumber, pytest, SQLite, OpenAI embeddings

Who this is for

Teams running production RAG pipelines who need to catch parser regressions before they silently corrupt retrieval and generate confident wrong answers.

Build steps

Collect a golden corpus: 20-30 docs across PDF (tables, multi-column, scanned), HTML, and DOCX, with ground-truth extracted text manually verified.
Run each doc through 3+ parsers (pdfplumber, pymupdf, unstructured, markitdown) and diff extracted text against ground truth; store results in SQLite.
Build a retrieval accuracy probe: embed chunks, run 50 factual questions with known answers, and score the results with an LLM judge (GPT-4o-mini), recording hit@3 and faithfulness.
Add a CI-friendly CLI: canary run --parser pdfplumber --threshold 0.85 exits non-zero on regression.
Generate an HTML report showing per-doc extraction diffs and per-question retrieval failures, highlighting which parser failed which doc type.
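
The diff, storage, and threshold steps above can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the function names, the extraction_results table, and the word-level similarity metric (difflib.SequenceMatcher) are all assumptions, as is the choice to apply the threshold per document rather than to an average.

```python
import difflib
import sqlite3


def extraction_score(extracted: str, truth: str) -> float:
    """Word-level similarity in [0, 1] between parser output and ground truth."""
    # Splitting on whitespace means layout-only differences do not lower the score.
    matcher = difflib.SequenceMatcher(
        None, extracted.split(), truth.split(), autojunk=False
    )
    return matcher.ratio()


def open_db(path: str) -> sqlite3.Connection:
    """Open the results database and create the (hypothetical) results table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS extraction_results ("
        "doc_id TEXT, parser TEXT, score REAL, "
        "PRIMARY KEY (doc_id, parser))"
    )
    return conn


def store_result(conn: sqlite3.Connection, doc_id: str, parser: str, score: float) -> None:
    """Record one parser's score for one document, overwriting any earlier run."""
    conn.execute(
        "INSERT OR REPLACE INTO extraction_results VALUES (?, ?, ?)",
        (doc_id, parser, score),
    )
    conn.commit()


def gate(conn: sqlite3.Connection, parser: str, threshold: float) -> int:
    """Return an exit code: 0 if every doc meets the threshold, 1 otherwise."""
    rows = conn.execute(
        "SELECT doc_id, score FROM extraction_results "
        "WHERE parser = ? ORDER BY doc_id",
        (parser,),
    ).fetchall()
    if not rows:
        # Assumption: a canary with no data should fail loudly, not pass silently.
        print(f"no results recorded for parser {parser!r}")
        return 1
    failures = [(doc_id, score) for doc_id, score in rows if score < threshold]
    for doc_id, score in failures:
        print(f"REGRESSION {parser} {doc_id}: {score:.3f} < {threshold}")
    return 1 if failures else 0
```

A CLI entry point could parse --parser and --threshold with argparse and pass the return value of gate() to sys.exit(), which gives CI the non-zero exit on regression.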

Risks

The LLM judge used for faithfulness scoring is itself unreliable on edge cases; have a human spot-check 10% of the eval set.
Scanned PDFs require OCR, and adding Tesseract or AWS Textract blows the scope; start with text-native docs.
Ground-truth creation is the actual bottleneck: manually verifying 20 docs to a high standard can take 4-6 hours.
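
One way to make the 10% human spot-check repeatable is to draw a seeded sample of question IDs, so the same items are reviewed on every run. The sketch below is only an illustration; the function name, the ID format, and the fixed seed are assumptions.

```python
import math
import random


def spot_check_sample(question_ids, fraction=0.10, seed=0):
    """Pick a reproducible subset of question IDs for human review."""
    ids = sorted(question_ids)  # sort first so input order does not change the sample
    if not ids:
        return []
    # Round up so small eval sets still get at least one reviewed item;
    # the small epsilon guards against floating-point noise in the product.
    k = max(1, math.ceil(len(ids) * fraction - 1e-9))
    return sorted(random.Random(seed).sample(ids, k))
```

With the 50-question eval set described above, this selects 5 questions. Comparing the human verdicts on those items with the judge's scores gives a rough check on how far the judge can be trusted.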
