Vocabulary-Free Sparse Retrieval Experimenter
A benchmarking harness that compares standard BM25 sparse retrieval against learned sparse methods (SPLADE, FLOPS-regularized models) on a custom document corpus to quantify the vocabulary-mismatch problem.
Difficulty: weekend | Stack: Python, Pyserini, SPLADE (HuggingFace), datasets (HuggingFace), Pandas, matplotlib
Who this is for
ML engineers evaluating retrieval components for RAG who want evidence-based justification for moving beyond BM25 to learned sparse methods.
Build steps
- Index a domain-specific corpus (e.g., a subset of PubMed abstracts or legal opinions) using Pyserini for BM25 and export to a standard BEIR-compatible format.
- Run the same corpus through a pre-trained SPLADE model (available on HuggingFace) to generate learned sparse vectors and index them.
- Design 20-30 test queries that deliberately use vocabulary absent from the relevant documents while remaining semantically equivalent to them (synonyms, abbreviations, paraphrases).
- Evaluate both retrievers on MRR@10 and Recall@20, broken down by query type (vocabulary match vs. vocabulary mismatch), and log results to a Pandas DataFrame.
- Plot recall curves and a failure matrix (retriever by query category) showing where each method breaks down, producing a one-page report developers can use to justify retrieval stack decisions.
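The evaluation step can be sketched in plain Python as follows. This is a minimal sketch, not a reference implementation: the function names (`reciprocal_rank`, `recall_at_k`, `evaluate`), the input layout (ranked doc-id lists per query, a set of relevant ids per query, one `match`/`mismatch` label per query), and the toy ids at the bottom are all assumptions made for illustration. The toy rankings are invented to exercise the code and are not benchmark results.

```python
from collections import defaultdict


def reciprocal_rank(ranked_ids, relevant_ids, k=10):
    """Return 1/rank of the first relevant doc in the top k, else 0.0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked_ids, relevant_ids, k=20):
    """Return the fraction of relevant docs that appear in the top k."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


def evaluate(runs, qrels, query_types):
    """Average MRR@10 and Recall@20 per (retriever, query type).

    runs:        {retriever: {query_id: [doc ids, best first]}}  (assumed layout)
    qrels:       {query_id: set of relevant doc ids}
    query_types: {query_id: "match" or "mismatch"}
    """
    buckets = defaultdict(list)
    for retriever, run in runs.items():
        for qid, ranked in run.items():
            relevant = qrels[qid]
            buckets[(retriever, query_types[qid])].append(
                (reciprocal_rank(ranked, relevant), recall_at_k(ranked, relevant))
            )

    rows = []
    for (retriever, qtype), scores in sorted(buckets.items()):
        n = len(scores)
        rows.append({
            "retriever": retriever,
            "query_type": qtype,
            "n_queries": n,
            "mrr@10": sum(s[0] for s in scores) / n,
            "recall@20": sum(s[1] for s in scores) / n,
        })
    return rows


# Toy inputs, invented purely to show the expected shapes.
qrels = {"q1": {"d1"}, "q2": {"d7", "d9"}}
query_types = {"q1": "match", "q2": "mismatch"}
runs = {
    "bm25": {"q1": ["d1", "d2"], "q2": ["d3", "d4"]},
    "splade": {"q1": ["d2", "d1"], "q2": ["d9", "d3", "d7"]},
}
rows = evaluate(runs, qrels, query_types)
for row in rows:
    print(row)
```

The returned list of dicts can be passed directly to `pandas.DataFrame(rows)` for logging and plotting. Keeping the metric functions free of any retriever dependency means the same code scores both the BM25 and the SPLADE runs.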
Risks
- SPLADE inference is GPU-memory-hungry; running it on CPU for a large corpus (100k+ docs) can take hours and may be impractical for a weekend build.
- Constructing meaningful vocabulary-mismatch test queries by hand is time-consuming and introduces experimenter bias, while automatic paraphrase generation often produces queries that are too easy.
- Publicly available SPLADE checkpoints are typically trained on MS MARCO; domain shift to niche corpora (medical, legal) may make results look worse than a properly fine-tuned sparse model would achieve.
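One way to reduce the query-construction risk is to check mechanically whether a candidate query really avoids the vocabulary of its target document. The sketch below is one possible check, not part of the original plan: the stopword list, the `lexical_overlap` and `label_query` helpers, and the zero-overlap threshold are all illustrative assumptions.

```python
import re

# Small illustrative stopword list (an assumption; swap in your own).
STOPWORDS = frozenset({
    "a", "an", "and", "the", "of", "in", "for", "to", "is", "are",
    "on", "with", "what", "how", "does",
})


def content_terms(text):
    """Lowercase, split on non-alphanumerics, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {t for t in tokens if t not in STOPWORDS}


def lexical_overlap(query, document):
    """Fraction of the query's content terms that also occur in the document."""
    q_terms = content_terms(query)
    if not q_terms:
        return 0.0
    return len(q_terms & content_terms(document)) / len(q_terms)


def label_query(query, relevant_doc, threshold=0.0):
    """Label a query 'mismatch' only if its overlap is at or below the threshold."""
    if lexical_overlap(query, relevant_doc) <= threshold:
        return "mismatch"
    return "match"


# Invented example texts, used only to show the behavior.
doc = "Myocardial infarction outcomes in elderly patients"
paraphrase = "heart attack prognosis for older adults"
near_copy = "myocardial infarction in the elderly"

print(label_query(paraphrase, doc), lexical_overlap(paraphrase, doc))
print(label_query(near_copy, doc), lexical_overlap(near_copy, doc))
```

This check compares exact surface forms with no stemming, so it will flag "patient" versus "patients" as a mismatch even though a stemming BM25 analyzer would likely match them. Running candidate queries through the same analyzer as the BM25 index would give a stricter filter.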
Business Angle
A hosted benchmarking tool that generates a vocabulary-mismatch audit report comparing BM25 vs. SPLADE on your own document corpus, delivered as a PDF in 10 minutes.
Customer: ML engineer or AI lead at a 5–50 person startup who inherited or built a RAG pipeline on BM25 and is getting pressure from product to improve retrieval quality, but doesn't have weeks to run ablation studies themselves.
Pricing: one-time purchase at $99 per report; target of about $1,500 in sales within 3 months (~15 reports).