Vocabulary-Free Sparse Retrieval Experimenter

A benchmarking harness that compares standard BM25 sparse retrieval against learned sparse methods (SPLADE and other FLOPS-regularized models) on a custom document corpus to quantify the vocabulary-mismatch problem.

Difficulty: weekend | Stack: Python, Pyserini, SPLADE (HuggingFace), datasets (HuggingFace), Pandas, matplotlib

Who this is for

ML engineers evaluating retrieval components for RAG who want evidence-based justification for moving beyond BM25 to learned sparse methods.

Build steps

Index a domain-specific corpus (e.g., a subset of PubMed abstracts or legal opinions) with Pyserini for BM25, and export the corpus to a BEIR-compatible format (corpus, queries, and qrels files) so both retrievers read identical inputs.
Run the same corpus through a pre-trained SPLADE model (available on HuggingFace) to generate learned sparse vectors and index them.
Design 20-30 test queries that deliberately use vocabulary absent from the documents but semantically equivalent to their wording (synonyms, abbreviations, paraphrases).
Evaluate both retrievers on MRR@10 and Recall@20, broken down by query type (vocabulary match vs. vocabulary mismatch), and log results to a Pandas DataFrame.
Plot recall curves and a method-by-category failure matrix showing which query categories each retriever fails on, then assemble a one-page report developers can use to justify retrieval stack decisions.
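
The evaluation step can be sketched in plain Python before any plotting. This is a minimal sketch, not a definitive implementation: the function names, the shape of the `runs`, `qrels`, and `query_types` dictionaries, and the toy document IDs are all assumptions made for illustration, and the numbers it prints are not experimental results.

```python
from collections import defaultdict


def reciprocal_rank(ranked_ids, relevant_ids, k=10):
    """Return 1/rank of the first relevant doc within the top k, else 0.0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked_ids, relevant_ids, k=20):
    """Fraction of the relevant docs that appear in the top k."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


def evaluate(runs, qrels, query_types):
    """Average MRR@10 and Recall@20 per (retriever, query type).

    Assumed (hypothetical) input shapes:
      runs:        {retriever: {query_id: [doc_id, ...] ranked best-first}}
      qrels:       {query_id: set of relevant doc_ids}
      query_types: {query_id: "match" or "mismatch"}
    """
    buckets = defaultdict(list)
    for retriever, run in runs.items():
        for qid, ranked in run.items():
            relevant = qrels[qid]
            buckets[(retriever, query_types[qid])].append(
                (reciprocal_rank(ranked, relevant), recall_at_k(ranked, relevant))
            )

    rows = []
    for (retriever, qtype), scores in sorted(buckets.items()):
        n = len(scores)
        rows.append({
            "retriever": retriever,
            "query_type": qtype,
            "n_queries": n,
            "mrr@10": sum(rr for rr, _ in scores) / n,
            "recall@20": sum(rec for _, rec in scores) / n,
        })
    return rows


# Toy inputs, invented purely to exercise the functions.
runs = {
    "bm25": {"q1": ["d1", "d2"], "q2": ["d9", "d8"]},
    "splade": {"q1": ["d2", "d1"], "q2": ["d8", "d4"]},
}
qrels = {"q1": {"d1"}, "q2": {"d4"}}
query_types = {"q1": "match", "q2": "mismatch"}

rows = evaluate(runs, qrels, query_types)
for row in rows:
    print(row)
```

Each row is a flat dictionary, so the list can be handed to `pandas.DataFrame(rows)` for logging and then pivoted on `query_type` for the plots.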

Risks

SPLADE inference is GPU-memory-hungry; running it on CPU for a large corpus (100k+ docs) can take hours and may be impractical for a weekend build.
Constructing meaningful vocabulary-mismatch test queries by hand is time-consuming and introduces experimenter bias, while the obvious shortcut, automatic paraphrase generation, often produces queries that are too easy.
The publicly released SPLADE checkpoints are trained on MS MARCO; domain shift to niche corpora (medical, legal) may make results look worse than a properly fine-tuned sparse model would achieve.
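
One way to reduce the query-construction risk is to measure lexical overlap between each candidate query and its relevant document, and to label a query as a mismatch only when the overlap is low. The sketch below assumes exact token matching, a tiny illustrative stopword list, and an arbitrary `max_overlap` threshold of 0.2; the helper names and the example sentence are hypothetical.

```python
import re

# Tiny illustrative stopword list; a real run would use a fuller one.
STOPWORDS = {"a", "an", "and", "for", "in", "is", "of", "on", "the", "to", "with"}


def content_terms(text):
    """Lowercase, split on non-alphanumerics, and drop stopwords."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}


def term_overlap(query, document):
    """Fraction of the query's content terms that also occur in the document."""
    q_terms = content_terms(query)
    if not q_terms:
        return 0.0
    return len(q_terms & content_terms(document)) / len(q_terms)


def label_query(query, document, max_overlap=0.2):
    """Label a query 'mismatch' when few of its terms occur in the document.

    max_overlap is an assumed threshold, not a recommended value.
    """
    return "mismatch" if term_overlap(query, document) <= max_overlap else "match"


# Invented example document and queries.
doc = "Myocardial infarction is treated with aspirin in the emergency department."
print(label_query("aspirin for myocardial infarction", doc))
print(label_query("heart attack ASA therapy", doc))
```

Because this compares raw tokens, it treats inflected forms as different words. An index that stems terms would count some of them as matches, so applying the same analysis used at indexing time gives a stricter check.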
