Topic Semantic Axis Auditor
A CLI tool that scores each topic from a trained model on a thematic-relatedness vs. taxonomic-similarity axis so researchers know what geometry their model actually learned.
Difficulty: weekend | Stack: Python, gensim, sentence-transformers, NLTK/WordNet, rich (CLI output)
Who this is for
NLP researchers and data scientists who use topic models for downstream tasks and need to know whether their topics are associative clusters or categorical clusters before drawing inferences.
Build steps
- Load top-N words per topic from a serialized LDA or BERTopic model using gensim or BERTopic’s built-in export.
- For each within-topic word pair, compute two scores: (a) WordNet path_similarity as a proxy for taxonomic similarity and (b) cosine similarity from a static Word2Vec/GloVe embedding as a proxy for thematic relatedness.
- Aggregate per-topic: produce a 2D score (mean taxonomic, mean thematic) and a ratio that places each topic on the spectrum.
- Render a rich CLI table and an optional matplotlib scatter plot showing all topics positioned on the two axes.
- Add a --compare flag to run the same audit on two model files side by side (e.g. LDA vs BERTopic on the same corpus) and print a diff summary.
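The pair-scoring and aggregation steps above can be sketched as follows. This is a minimal illustration, not a definitive implementation: `score_topic` and `Scorer` are hypothetical names, the scorers are passed in as plain callables so the sketch runs without WordNet or embedding files, skipping pairs that return None is one possible policy, and defining the ratio as taxonomic over thematic is an assumption.

```python
from itertools import combinations
from statistics import mean
from typing import Callable, Optional, Sequence

# A scorer maps a word pair to a similarity, or None when it cannot score the pair.
Scorer = Callable[[str, str], Optional[float]]


def score_topic(words: Sequence[str], taxonomic: Scorer, thematic: Scorer) -> dict:
    """Return mean taxonomic and thematic scores, plus their ratio, for one topic."""
    tax_scores, them_scores = [], []
    for a, b in combinations(words, 2):  # every unordered within-topic pair
        t = taxonomic(a, b)
        m = thematic(a, b)
        if t is not None:
            tax_scores.append(t)
        if m is not None:
            them_scores.append(m)
    # Assumption: unscored pairs are skipped; topics with no scored pairs get 0.0.
    mean_tax = mean(tax_scores) if tax_scores else 0.0
    mean_them = mean(them_scores) if them_scores else 0.0
    # Assumption: ratio = taxonomic / thematic; None when the thematic mean is not positive.
    ratio = mean_tax / mean_them if mean_them > 0 else None
    return {"taxonomic": mean_tax, "thematic": mean_them, "ratio": ratio}
```

In a real run, `taxonomic` could wrap NLTK's `path_similarity` over the synsets of each word, and `thematic` could wrap gensim's `KeyedVectors.similarity`. Keeping them as injected callables also makes it easy to swap proxies later.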
Risks
- WordNet coverage is sparse for domain-specific corpora (medical, legal, code); many word pairs will return None and skew the taxonomic score toward zero, so the tool needs an explicit fallback (e.g. zero-fill with a warning).
- Static embeddings (Word2Vec/GloVe) conflate the two axes themselves; the thematic proxy will be noisy unless you use a relatedness-specific dataset like MEN as a calibration check.
- Topic models with very short top-word lists (k < 5, i.e. at most 6 pairs) produce too few pairs for stable mean scores, so the tool needs a minimum-k guard and clear uncertainty indicators in its output.
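The coverage and minimum-k risks above could be handled together in the aggregation step. The sketch below is one hedged way to do it: `summarize_scores` and `MIN_TOP_WORDS` are hypothetical names, the default of 5 simply mirrors the k < 5 threshold mentioned above, and zero-filling is the example fallback from the list, not the only option.

```python
from typing import Optional, Sequence

MIN_TOP_WORDS = 5  # hypothetical default, mirroring the k < 5 threshold


def summarize_scores(scores: Sequence[Optional[float]], n_words: int,
                     min_words: int = MIN_TOP_WORDS) -> dict:
    """Zero-fill unscored pairs and report coverage so readers can judge reliability."""
    if n_words < min_words or not scores:
        # Refuse to report a mean when there are too few pairs to trust it.
        return {"mean": None, "coverage": None,
                "warning": f"only {n_words} top words; need at least {min_words}"}
    covered = sum(1 for s in scores if s is not None)
    coverage = covered / len(scores)
    # Zero-fill fallback: a pair with no score counts as 0.0.
    filled = [0.0 if s is None else s for s in scores]
    warning = None
    if covered < len(scores):
        missing = len(scores) - covered
        warning = f"{missing} of {len(scores)} pairs had no score and were zero-filled"
    return {"mean": sum(filled) / len(filled), "coverage": coverage, "warning": warning}
```

Reporting coverage next to the mean lets the CLI flag topics whose taxonomic score is low only because most pairs were missing from WordNet.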
Business Angle
A paid CLI audit tool that tells NLP researchers whether their topic model learned thematic or taxonomic structure — before they publish or ship.
Customer: Academic NLP researcher or industry data scientist (e.g., a PhD student or ML engineer at a mid-size company) who uses BERTopic, CTM, or LDA for downstream tasks like document routing, trend detection, or content recommendation — and has to justify their model choice to a PI, stakeholder, or reviewer.
Pricing: one-time license at $40–50; target of $400 in sales within 3 months (roughly 8–10 licenses)