Probe-Based Topic Coherence Benchmark Generator

A library and REST API that auto-generates held-out evaluation probes distinguishing thematic vs. taxonomic coherence for any topic model, replacing monolithic NPMI with axis-aware metrics.

Difficulty: 1-month | Stack: Python, FastAPI, PostgreSQL, sentence-transformers, Hugging Face datasets, pytest, React (optional frontend)

Who this is for

Topic modeling researchers who publish results and need a reproducible, axis-aware coherence metric that doesn’t conflate the two semantic types — enabling apples-to-apples comparisons between LDA-family and PLM-augmented models.

Build steps

Curate and ingest three psycholinguistic datasets (SimLex-999, MEN, and the WordSim-353 relatedness/similarity split) into PostgreSQL; store each word pair with its human rating and an axis label (thematic for relatedness ratings, taxonomic for similarity ratings).
Build a probe generator: given a topic’s top words, query the DB for all known pairs whose words both appear in the topic, then synthesize additional probes with a nearest-neighbor lookup in a shared embedding space (as foils, find pairs that are high-thematic/low-taxonomic and pairs that are low-thematic/high-taxonomic).
Implement two axis-specific coherence metrics: T-Coherence (thematic), based on PMI over a reference corpus, and X-Coherence (taxonomic), based on the WordNet depth of each probe pair’s least common subsumer (LCS), normalized by the probe pair’s depth in the taxonomy.
Expose a FastAPI endpoint: POST /evaluate accepts a topic model’s topic-word matrix as JSON and returns per-topic T-Coherence, X-Coherence, and an overall axis-bias score. Add a /calibrate endpoint that lets users upload a domain corpus to recompute the PMI reference statistics.
Write a benchmark harness that runs five baseline configurations (LDA with 20, 50, and 100 topics; NMF; BERTopic) against the API on two standard corpora (20 Newsgroups and AG News) and outputs a reproducible leaderboard CSV.
Publish the library to PyPI with a companion paper-ready results table and a pytest suite covering edge cases (single-word topics, OOV words, degenerate distributions).
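
As a rough illustration of the two metrics named in the build steps, the sketch below computes a mean document-level PMI for T-Coherence and a Wu-Palmer-style ratio (twice the LCS depth over the summed depths of the two words) for X-Coherence. Everything concrete here is an assumption made for illustration and is not specified by the plan: the function names, the document-level co-occurrence window, the epsilon smoothing for pairs that never co-occur, the choice to skip out-of-vocabulary words, and the toy parent map standing in for WordNet.

```python
import math
from collections import Counter
from itertools import combinations


def t_coherence(top_words, documents, eps=1e-12):
    """Mean PMI over topic word pairs, using document-level co-occurrence.

    `documents` is the reference corpus as a list of token lists.
    Returns None when no pair can be scored.
    """
    vocab = set(top_words)
    n_docs = len(documents)
    word_df, pair_df = Counter(), Counter()
    for doc in documents:
        present = sorted(vocab.intersection(doc))
        word_df.update(present)
        pair_df.update(combinations(present, 2))  # keys are sorted tuples

    scores = []
    for a, b in combinations(sorted(vocab), 2):
        if word_df[a] == 0 or word_df[b] == 0:
            continue  # out-of-vocabulary word: skip the pair (assumption)
        p_a, p_b = word_df[a] / n_docs, word_df[b] / n_docs
        p_ab = pair_df[(a, b)] / n_docs
        scores.append(math.log((p_ab + eps) / (p_a * p_b)))
    return sum(scores) / len(scores) if scores else None


# Toy hypernym tree standing in for WordNet (child -> parent, root -> None).
TOY_PARENTS = {
    "entity": None,
    "animal": "entity",
    "vehicle": "entity",
    "dog": "animal",
    "cat": "animal",
    "car": "vehicle",
}


def path_to_root(word, parents):
    """Return [word, parent, ..., root]; its length is the word's depth."""
    path = []
    while word is not None:
        path.append(word)
        word = parents[word]
    return path


def lcs_depth_ratio(a, b, parents):
    """2 * depth(LCS) / (depth(a) + depth(b)), with the root at depth 1."""
    path_a, path_b = path_to_root(a, parents), path_to_root(b, parents)
    ancestors_a = set(path_a)
    # Walking b's path upward, the first shared node is the deepest one.
    lcs = next(node for node in path_b if node in ancestors_a)
    return 2 * len(path_to_root(lcs, parents)) / (len(path_a) + len(path_b))


def x_coherence(top_words, parents):
    """Mean LCS-depth ratio over all pairs of words found in the taxonomy."""
    known = sorted(w for w in set(top_words) if w in parents)
    scores = [lcs_depth_ratio(a, b, parents) for a, b in combinations(known, 2)]
    return sum(scores) / len(scores) if scores else None


reference = [["cat", "dog"], ["cat", "dog"], ["car", "road"], ["car", "road"]]
print(t_coherence(["cat", "dog"], reference))
print(x_coherence(["cat", "dog", "car"], TOY_PARENTS))
```

In a real implementation the parent map would presumably be replaced by WordNet synset lookups, which raises a word-sense selection question that this toy tree sidesteps.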

Risks

Psycholinguistic datasets have small vocabulary coverage (on the order of a few hundred to roughly a thousand unique words each), so for topics with specialized vocabulary the probe generator will produce too few valid pairs for a statistically reliable score. The output needs a minimum-probe-count threshold and a confidence interval.
PMI-based T-Coherence is sensitive to the choice of reference corpus. If users calibrate on a corpus from a very different domain, scores become incomparable across studies, so the API must version and hash the reference corpus and warn when scores from different calibration runs are compared.
Synthetic probe generation via embedding nearest neighbors can introduce bias: the embedding model itself blends thematic and taxonomic signal, so ‘high-thematic/low-taxonomic’ foils may not be clean. At least a sampled subset of the synthetic probes needs a human validation pass before benchmark numbers are published.
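
Two of the mitigations above, the minimum-probe-count threshold with a confidence interval and the hashing of the reference corpus, could look roughly like the sketch below. The function names, the default threshold of 30 probes, the percentile bootstrap, and the SHA-256 fingerprint are all illustrative assumptions; the plan does not fix any of them.

```python
import hashlib
import random
import statistics


def score_with_confidence(pair_scores, min_probes=30, n_boot=1000,
                          alpha=0.05, seed=0):
    """Mean probe score with a percentile-bootstrap confidence interval.

    Withholds the score (reliable=False) when fewer than `min_probes`
    probe pairs were available. `min_probes=30` is an arbitrary default.
    """
    n = len(pair_scores)
    if n < min_probes:
        return {"score": None, "ci": None, "n_probes": n, "reliable": False}

    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    boot_means = sorted(
        statistics.fmean(rng.choices(pair_scores, k=n)) for _ in range(n_boot)
    )
    low = boot_means[int((alpha / 2) * n_boot)]
    high = boot_means[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return {
        "score": statistics.fmean(pair_scores),
        "ci": (low, high),
        "n_probes": n,
        "reliable": True,
    }


def corpus_fingerprint(documents):
    """SHA-256 hex digest over the documents, in order.

    A NUL byte separates documents so that ["a b", "c"] and ["a b c"]
    hash differently.
    """
    digest = hashlib.sha256()
    for doc in documents:
        digest.update(doc.encode("utf-8"))
        digest.update(b"\x00")
    return digest.hexdigest()


result = score_with_confidence([float(i) for i in range(50)])
print(result["score"], result["ci"], result["reliable"])
print(corpus_fingerprint(["first document", "second document"]))
```

Returning the fingerprint alongside every score would let a client detect that two results came from different calibration runs before comparing them.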