A self-hosted evaluation harness that measures multimodal RAG retrieval quality for developers building over audio/video content.
Customer: Solo Python developer or 2-person AI startup team building a RAG product on top of podcasts, YouTube transcripts, or instructional video libraries — they’ve shipped an MVP but have no idea if their retrieval is actually finding the right clips vs. just returning vaguely related text chunks.
Problem: When a RAG pipeline ingests transcribed audio/video, retrieval quality and generation quality are conflated — a fluent answer can still be grounded in the wrong segment. There’s no off-the-shelf tool that isolates retrieval recall at the chunk/timestamp level for multimodal sources, so developers fly blind when tuning chunk size, overlap, or embedding models.
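To make "retrieval recall at the chunk/timestamp level" concrete, here is a minimal sketch of one way it could be computed. It is an illustration under stated assumptions, not the product's actual implementation: the names Segment, overlap_ratio, and recall_at_k, the 50% overlap threshold, and the sample timestamps are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """A span of one audio/video source, in seconds (hypothetical schema)."""
    source_id: str
    start: float
    end: float


def overlap_ratio(retrieved: Segment, gold: Segment) -> float:
    """Fraction of the gold segment's duration covered by the retrieved segment."""
    if retrieved.source_id != gold.source_id:
        return 0.0
    gold_len = gold.end - gold.start
    if gold_len <= 0:
        return 0.0
    covered = min(retrieved.end, gold.end) - max(retrieved.start, gold.start)
    return max(0.0, covered) / gold_len


def recall_at_k(retrieved, gold, k=5, min_overlap=0.5):
    """Share of gold segments matched by at least one of the top-k results.

    A gold segment counts as found when some retrieved segment covers at
    least `min_overlap` of it. The 0.5 default is an arbitrary assumption.
    """
    if not gold:
        return 0.0
    top_k = retrieved[:k]
    hits = sum(
        1 for g in gold
        if any(overlap_ratio(r, g) >= min_overlap for r in top_k)
    )
    return hits / len(gold)


# Made-up example: two relevant clips, three ranked results.
gold = [Segment("ep1", 120.0, 150.0), Segment("ep1", 600.0, 630.0)]
retrieved = [
    Segment("ep1", 110.0, 140.0),  # covers 20 of 30 gold seconds: a hit
    Segment("ep2", 600.0, 630.0),  # right timestamps, wrong episode: a miss
    Segment("ep1", 300.0, 330.0),  # right episode, wrong timestamps: a miss
]
print(recall_at_k(retrieved, gold, k=3))  # 0.5
```

Because the score depends only on which segments came back, it is unaffected by how fluent the generated answer is, which is the separation the problem statement asks for.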
Pricing: one-time purchase at $49. Target: $800 in sales within 3 months (about 16 purchases; 16 × $49 = $784).
Why now
Recent research on RAG evaluation frameworks is showing that text-only metrics leave systematic blind spots, particularly around modality-specific retrieval. Developers shipping podcast or video RAG products feel this gap most sharply as they move from prototype to production and start getting user complaints they can’t diagnose.
Go-to-market
- Post a detailed ‘how I measured retrieval quality on a podcast RAG app’ write-up on the LlamaIndex Discord and their GitHub Discussions — show the actual gap between naive transcript chunking vs. timestamp-aware retrieval with real numbers. End with a link to the repo/product.
- Ship a free open-source CLI tool (e.g. mmrag-eval) on GitHub that runs one eval suite against a sample podcast dataset. Include a $49 ‘Pro’ tier as a PyPI package that unlocks the full test harness, result dashboard, and custom metric plugins.
- Find 5 developers on Indie Hackers or X/Twitter who have publicly posted about building podcast or video search products and DM them offering a free eval run against their pipeline in exchange for a testimonial and honest feedback.
- Write a short comparison post benchmarking RAGAS vs. this tool on a video dataset — publish on Towards Data Science or a personal blog, cross-post to Hacker News ‘Show HN’ with a reproducible Colab notebook attached.
Moat (or lack thereof)
No real moat. RAGAS could add multimodal support in a minor release, and any competent Python dev could replicate the core logic in a weekend. The real edge is time-to-market plus accumulated benchmark datasets and golden sets: if you build and share good reference evaluation datasets for common use cases (e.g. a ‘podcast QA eval set’), they become a mildly sticky asset. But be honest: this is a first-mover advantage measured in weeks, not years.
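As a rough illustration of what a shareable golden set might look like, the sketch below parses and validates a JSONL file in which each line pairs a question with the clip that answers it. The field names (question, source_id, start, end), the function name parse_golden_set, and the sample record are all assumptions made for this example, not an established format.

```python
import json

# Hypothetical minimal schema for one golden-set record.
REQUIRED_FIELDS = {"question", "source_id", "start", "end"}


def parse_golden_set(jsonl_text):
    """Parse JSONL golden-set text and return a list of validated records.

    Raises ValueError when a record lacks a required field or has a
    start time that is not before its end time.
    """
    records = []
    for line_no, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines between records
        record = json.loads(line)
        missing = REQUIRED_FIELDS.difference(record)
        if missing:
            raise ValueError(f"line {line_no}: missing fields {sorted(missing)}")
        if not record["start"] < record["end"]:
            raise ValueError(f"line {line_no}: start must be before end")
        records.append(record)
    return records


# Placeholder record for illustration only; not real benchmark data.
sample = (
    '{"question": "Where does the host explain chunk overlap?", '
    '"source_id": "ep1", "start": 120.0, "end": 150.0}\n'
)
print(len(parse_golden_set(sample)))  # 1
```

A plain, line-oriented format like this keeps the dataset easy to diff, review, and extend by contributors, which matters if the dataset itself is meant to be the sticky part.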