AI Pulse

RAG Parser Canary Suite

A test harness that stress-tests document parsers (PDF, HTML, DOCX) for silent extraction failures and measures the factual accuracy of downstream retrieval.

Difficulty: 1-week | Stack: Python, LangChain / LlamaIndex, pdfplumber, pytest, SQLite, OpenAI embeddings

Who this is for

Teams running production RAG pipelines who need to catch parser regressions before they silently corrupt retrieval and produce confidently wrong answers.

Build steps

  1. Collect a golden corpus: 20-30 docs across PDF (tables, multi-column, scanned), HTML, and DOCX, with ground-truth extracted text manually verified.
  2. Run each doc through 3+ parsers (pdfplumber, pymupdf, unstructured, markitdown) and diff extracted text against ground truth; store results in SQLite.
  3. Build a retrieval accuracy probe: embed chunks, run 50 factual questions with known answers, score with an LLM judge (GPT-4o-mini), and record hit@3 and faithfulness.
  4. Add a CI-friendly CLI: canary run --parser pdfplumber --threshold 0.85 exits non-zero on regression.
  5. Generate an HTML report showing per-doc extraction diffs and per-question retrieval failures, highlighting which parser failed which doc type.
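
Steps 2 and 4 can be sketched together as follows. This is a minimal illustration, not a prescribed design: the token-level difflib similarity, the results table layout, the helper names, and the --db flag are all assumptions made for the example. It also only gates on scores already stored in SQLite; actually running the parsers is left out.

```python
import argparse
import difflib
import sqlite3


def extraction_score(extracted: str, truth: str) -> float:
    """Similarity in [0, 1] between whitespace-split token sequences."""
    a, b = extracted.split(), truth.split()
    return difflib.SequenceMatcher(None, a, b, autojunk=False).ratio()


def ensure_schema(conn: sqlite3.Connection) -> None:
    # Hypothetical table layout: one row per (document, parser) pair.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results "
        "(doc_id TEXT, parser TEXT, score REAL)"
    )
    conn.commit()


def record(conn: sqlite3.Connection, doc_id: str, parser: str, score: float) -> None:
    conn.execute("INSERT INTO results VALUES (?, ?, ?)", (doc_id, parser, score))
    conn.commit()


def failing_docs(conn: sqlite3.Connection, parser: str, threshold: float):
    """Return (doc_id, score) rows whose score falls below the threshold."""
    rows = conn.execute(
        "SELECT doc_id, score FROM results "
        "WHERE parser = ? AND score < ? ORDER BY doc_id",
        (parser, threshold),
    )
    return rows.fetchall()


def main(argv=None, conn=None) -> int:
    ap = argparse.ArgumentParser(prog="canary")
    sub = ap.add_subparsers(dest="command", required=True)
    run = sub.add_parser("run")
    run.add_argument("--parser", required=True)
    run.add_argument("--threshold", type=float, default=0.85)
    run.add_argument("--db", default="canary.db")  # assumed flag, not in the brief
    args = ap.parse_args(argv)

    if conn is None:
        conn = sqlite3.connect(args.db)
    ensure_schema(conn)

    failures = failing_docs(conn, args.parser, args.threshold)
    for doc_id, score in failures:
        print(f"REGRESSION {doc_id}: {score:.3f} < {args.threshold}")
    # A non-zero return value is what lets CI fail the build.
    return 1 if failures else 0
```

Calling sys.exit(main()) from a script entry point would turn the return value into the process exit code. Token-level matching is only one possible scoring choice; it ignores whitespace differences but still penalizes dropped table cells or reordered columns.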
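
The hit@3 metric from step 3 could be computed along these lines. The brief does not say how a hit is decided, so this sketch assumes a hit means the known answer string appears in one of the top three retrieved chunks; matching on gold chunk IDs would be an equally reasonable choice. Function names are hypothetical, and faithfulness scoring (which needs the LLM judge) is not covered.

```python
def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so layout differences do not matter."""
    return " ".join(text.lower().split())


def hit_at_k(ranked_chunks, answer: str, k: int = 3) -> bool:
    """True if the known answer appears in any of the top-k retrieved chunks."""
    needle = _normalize(answer)
    return any(needle in _normalize(chunk) for chunk in ranked_chunks[:k])


def hit_rate(results, k: int = 3) -> float:
    """results: list of (ranked_chunks, known_answer) pairs, one per question."""
    if not results:
        return 0.0
    hits = sum(hit_at_k(chunks, answer, k) for chunks, answer in results)
    return hits / len(results)
```

Running the same questions against chunks produced by each parser gives one hit rate per parser, which makes a drop caused by a bad extraction directly comparable.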

Risks

  • The LLM judge used for faithfulness scoring is itself unreliable on edge cases; plan a human spot-check of 10% of the eval set.
  • Scanned PDFs require OCR, and adding Tesseract or AWS Textract blows the 1-week scope; start with text-native docs only.
  • Ground-truth creation is the real bottleneck: manually verifying 20 docs to a high standard can take 4-6 hours.
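
The 10% human spot-check from the first risk is easier to act on if the sample is reproducible. A minimal sketch, assuming each question has a stable string ID; the function name and the fixed seed are illustrative choices, not part of the brief.

```python
import math
import random


def spot_check_sample(question_ids, fraction: float = 0.10, seed: int = 0):
    """Pick a reproducible subset of question IDs for human review."""
    ids = sorted(question_ids)  # sort so input order does not change the sample
    # Round before ceil to avoid float noise such as 5.000000001 becoming 6.
    n = min(len(ids), math.ceil(round(fraction * len(ids), 9)))
    return sorted(random.Random(seed).sample(ids, n))
```

With the 50-question eval set this selects 5 questions. Keeping the seed fixed means reviewers see the same questions on every run, so judge drift shows up as a change in agreement rather than a change in sample.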

Business Angle

Automated regression suite that catches silent RAG parser failures before they corrupt production retrieval

Customer: Solo ML engineer or founding engineer at a 5-20 person startup running a production RAG product (legal tech, HR, fintech), who is personally on call when the product hallucinates and has no dedicated QA team

Pricing: SaaS subscription, targeting roughly $800 MRR within 4 months (8 customers at $99/mo, i.e. $792)
