RAG Parser Canary Suite
A test harness that stress-tests document parsers (PDFs, HTML, DOCX) for silent extraction failures and measures the factual accuracy of downstream retrieval.
Difficulty: 1-week | Stack: Python, LangChain / LlamaIndex, pdfplumber, pytest, SQLite, OpenAI embeddings
Who this is for
Teams running production RAG pipelines who need to catch parser regressions before they silently corrupt retrieval and generate confident wrong answers.
Build steps
- Collect a golden corpus: 20-30 docs across PDF (tables, multi-column, scanned), HTML, and DOCX, with ground-truth extracted text manually verified.
- Run each doc through 3+ parsers where its format allows (pdfplumber, pymupdf, unstructured, markitdown; pdfplumber handles PDF only), diff the extracted text against ground truth, and store the results in SQLite.
- Build a retrieval accuracy probe: embed chunks, run 50 factual questions with known answers, and record hit@3 for retrieval plus a faithfulness score from an LLM judge (GPT-4o-mini).
- Add a CI-friendly CLI: canary run --parser pdfplumber --threshold 0.85 exits non-zero on regression.
- Generate an HTML report showing per-doc extraction diffs and per-question retrieval failures, highlighting which parser failed which doc type.
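The core of the build steps above can be sketched in a few functions. This is a minimal illustration, not a reference implementation: the function names, the extraction table schema, the --db flag, and the choice of a token-level difflib ratio as the extraction score are all assumptions, and the parsing and embedding stages are left out entirely.

```python
import argparse
import difflib
import sqlite3
import sys


def extraction_score(extracted: str, truth: str) -> float:
    """Similarity in [0, 1] between whitespace-split token sequences."""
    a, b = extracted.split(), truth.split()
    # autojunk=False keeps frequent tokens from being discarded on long docs.
    return difflib.SequenceMatcher(None, a, b, autojunk=False).ratio()


def hit_at_k(retrieved_ids: list[str], relevant_id: str, k: int = 3) -> bool:
    """True if the chunk holding the known answer is in the top k results."""
    return relevant_id in retrieved_ids[:k]


def init_db(conn: sqlite3.Connection) -> None:
    # Hypothetical schema: one score per (parser, document) pair.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS extraction ("
        "parser TEXT, doc_id TEXT, score REAL, "
        "PRIMARY KEY (parser, doc_id))"
    )
    conn.commit()


def record(conn: sqlite3.Connection, parser: str, doc_id: str, score: float) -> None:
    """Store or overwrite the latest score for this parser and document."""
    conn.execute(
        "INSERT OR REPLACE INTO extraction VALUES (?, ?, ?)",
        (parser, doc_id, score),
    )
    conn.commit()


def gate(conn: sqlite3.Connection, parser: str, threshold: float) -> int:
    """Return exit code 1 if any stored score is below threshold, else 0."""
    rows = conn.execute(
        "SELECT doc_id, score FROM extraction WHERE parser = ? AND score < ?",
        (parser, threshold),
    ).fetchall()
    for doc_id, score in rows:
        print(f"REGRESSION {parser} {doc_id}: {score:.3f} < {threshold}",
              file=sys.stderr)
    return 1 if rows else 0


def main(argv=None) -> int:
    """Assumed CLI shape: canary run --parser NAME --threshold FLOAT."""
    ap = argparse.ArgumentParser(prog="canary")
    sub = ap.add_subparsers(dest="command", required=True)
    run = sub.add_parser("run")
    run.add_argument("--parser", required=True)
    run.add_argument("--threshold", type=float, default=0.85)
    run.add_argument("--db", default="canary.db")  # hypothetical flag
    args = ap.parse_args(argv)
    conn = sqlite3.connect(args.db)
    init_db(conn)
    # A full run would parse the corpus and call record() here first.
    return gate(conn, args.parser, args.threshold)
```

In CI, the value returned by main would be passed to sys.exit so that a score under the threshold fails the job. A token-level ratio is only one possible score; table-heavy documents may need a structure-aware comparison instead.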
Risks
- The LLM judge used for faithfulness scoring is itself unreliable on edge cases; plan a human spot-check of 10% of the eval set.
- Scanned PDFs require OCR; adding Tesseract or AWS Textract would blow the 1-week scope, so start with text-native docs and defer the scanned PDFs in the corpus.
- Ground-truth creation is the actual bottleneck — 20 docs can take 4-6 hours to manually verify at quality.
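One way to make the 10% human spot-check repeatable is to draw a seeded random sample of question IDs and then measure how often the judge agrees with the reviewer. The sketch below is illustrative only; the function names, the fixed seed, and the boolean label format are assumptions rather than part of the project spec.

```python
import random


def spot_check_sample(question_ids, fraction=0.10, seed=0):
    """Pick a reproducible random subset of the eval set for human review."""
    ids = sorted(question_ids)  # sort so the sample does not depend on input order
    if not ids:
        return []
    n = max(1, round(len(ids) * fraction))
    return random.Random(seed).sample(ids, n)


def judge_agreement(judge_labels: dict, human_labels: dict) -> float:
    """Fraction of human-reviewed items where the LLM judge gave the same label."""
    shared = [q for q in human_labels if q in judge_labels]
    if not shared:
        raise ValueError("no overlapping items between judge and human labels")
    return sum(judge_labels[q] == human_labels[q] for q in shared) / len(shared)
```

With the 50-question eval set this selects 5 questions per run. A low agreement rate on that sample suggests the judge scores should not be trusted as a CI gate without further review.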
Business Angle
Automated regression suite that catches silent RAG parser failures before they corrupt production retrieval
Customer: Solo ML engineer or founding engineer at a 5-20 person startup running a production RAG product (legal tech, HR, fintech) — personally on-call when retrieval hallucinates, no dedicated QA team
Pricing: saas-mrr, targeting about $800 MRR in 4 months (8 customers at $99/mo = $792)