AI Pulse

Vocabulary-Free Sparse Retrieval Experimenter

A benchmarking harness that compares standard BM25 sparse retrieval against learned sparse retrieval (SPLADE and similar models trained with a FLOPS sparsity regularizer) on a custom document corpus, to quantify the vocabulary-mismatch problem.

Difficulty: weekend | Stack: Python, Pyserini, SPLADE (HuggingFace), datasets (HuggingFace), Pandas, matplotlib

Who this is for

ML engineers evaluating retrieval components for RAG who want evidence-based justification for moving beyond BM25 to learned sparse methods.

Build steps

  1. Prepare a domain-specific corpus (e.g., a subset of PubMed abstracts or legal opinions) in a BEIR-compatible format (corpus, queries, relevance judgments), then index it with Pyserini for BM25.
  2. Run the same corpus through a pre-trained SPLADE model (available on HuggingFace) to generate learned sparse vectors and index them.
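  3. Design 20-30 test queries that are semantically equivalent to document content but deliberately use vocabulary absent from the documents (synonyms, abbreviations, paraphrases). Keep a few vocabulary-matched queries as a control so the breakdown in step 4 has both categories.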
  4. Evaluate both retrievers on MRR@10 and Recall@20, broken down by query type (vocabulary match vs. vocabulary mismatch), and log results to a Pandas DataFrame.
  5. Plot recall curves and a heatmap of failure counts per method and query category, producing a one-page report developers can use to justify retrieval stack decisions.
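
The evaluation in step 4 can be sketched in plain Python before wiring in Pyserini or SPLADE. The block below is a minimal sketch, not the harness itself: the input shapes (`runs`, `qrels`, `query_types`), the function names, and the toy data are all assumptions made for illustration, and the numbers are not benchmark results.

```python
from collections import defaultdict


def reciprocal_rank(ranked_ids, relevant_ids, k=10):
    """Return 1/rank of the first relevant doc in the top k, or 0.0 if none."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked_ids, relevant_ids, k=20):
    """Return the fraction of relevant docs that appear in the top k."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


def evaluate(runs, qrels, query_types):
    """Average MRR@10 and Recall@20 per (retriever, query type).

    runs:        {retriever: {query_id: [doc_id, ...] ranked best-first}}
    qrels:       {query_id: set of relevant doc_ids}
    query_types: {query_id: "match" or "mismatch"}
    """
    buckets = defaultdict(list)
    for retriever, run in runs.items():
        for qid, ranked in run.items():
            rel = qrels[qid]
            buckets[(retriever, query_types[qid])].append(
                (reciprocal_rank(ranked, rel, 10), recall_at_k(ranked, rel, 20))
            )
    rows = []
    for (retriever, qtype), scores in sorted(buckets.items()):
        n = len(scores)
        rows.append({
            "retriever": retriever,
            "query_type": qtype,
            "n_queries": n,
            "mrr@10": sum(s[0] for s in scores) / n,
            "recall@20": sum(s[1] for s in scores) / n,
        })
    return rows


# Toy data, invented for illustration only (not real benchmark results).
qrels = {"q1": {"d1"}, "q2": {"d2", "d3"}}
query_types = {"q1": "match", "q2": "mismatch"}
runs = {
    "bm25": {"q1": ["d1", "d9"], "q2": ["d7", "d8"]},
    "splade": {"q1": ["d9", "d1"], "q2": ["d2", "d5", "d3"]},
}
rows = evaluate(runs, qrels, query_types)
for row in rows:
    print(row)
```

Because `rows` is a list of flat dictionaries, it can be passed directly to `pandas.DataFrame(rows)` for logging and plotting. With only 20-30 queries, each per-category average rests on a handful of examples, so reporting `n_queries` next to every score keeps the comparison honest.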

Risks

  • SPLADE inference is GPU-memory-hungry; running it on CPU for a large corpus (100k+ docs) can take hours and may be impractical for a weekend build.
  • Constructing meaningful vocabulary-mismatch test queries by hand is time-consuming and introduces experimenter bias, while automatic paraphrase generation often produces queries that are too easy.
  • Publicly released SPLADE checkpoints are trained on MS MARCO; domain shift to niche corpora (medical, legal) may make results look worse than a properly fine-tuned sparse model would achieve.
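
The first risk can be sized with a back-of-envelope estimate before committing a weekend. The sketch below assumes a BERT-style WordPiece vocabulary of 30,522 entries and float32 logits; the batch size, sequence length, and the CPU throughput of 5 documents per second are hypothetical placeholders to be replaced with values measured on your own hardware.

```python
def logits_bytes(batch_size, seq_len, vocab_size=30522, bytes_per_value=4):
    """Size in bytes of one (batch, seq_len, vocab) logits tensor.

    vocab_size=30522 assumes a BERT-style WordPiece vocabulary;
    bytes_per_value=4 assumes float32.
    """
    return batch_size * seq_len * vocab_size * bytes_per_value


def encoding_hours(num_docs, docs_per_second):
    """Wall-clock hours to encode num_docs at a measured throughput."""
    return num_docs / docs_per_second / 3600


# Illustrative settings (assumptions, not measurements).
batch_bytes = logits_bytes(batch_size=32, seq_len=256)
print(f"logits per batch: {batch_bytes / 1024**3:.2f} GiB")

# Hypothetical CPU throughput of 5 docs/second for a 100k-document corpus.
hours = encoding_hours(num_docs=100_000, docs_per_second=5)
print(f"encoding time: {hours:.1f} hours")
```

The logits tensor is only one of the activations held in memory, so actual usage will be higher than this figure. Timing a few hundred documents first and plugging the measured throughput into `encoding_hours` gives a more reliable estimate than any assumed rate.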

Business Angle

A hosted benchmarking tool that generates a vocabulary-mismatch audit report comparing BM25 vs. SPLADE on your own document corpus — delivered as a PDF in 10 minutes.

Customer: ML engineer or AI lead at a 5–50 person startup who inherited or built a RAG pipeline on BM25 and is getting pressure from product to improve retrieval quality, but doesn't have weeks to run ablation studies themselves.

Pricing: one-time, $99 per report. Target: about $1,500 in sales within 3 months (~15 reports).
