Topic Semantic Axis Auditor

A CLI tool that scores each topic from a trained model on a thematic-relatedness vs. taxonomic-similarity axis so researchers know what geometry their model actually learned.

Difficulty: weekend | Stack: Python, gensim, sentence-transformers, NLTK/WordNet, rich (CLI output)

Who this is for

NLP researchers and data scientists who use topic models for downstream tasks and need to know whether their topics are associative clusters or categorical clusters before drawing inferences.

Build steps

Load top-N words per topic from a serialized LDA or BERTopic model using gensim or BERTopic’s built-in export.
For each within-topic word pair, compute two scores: (a) WordNet path_similarity, computed over the words' synsets (e.g. the maximum across synset pairs), as a proxy for taxonomic similarity, and (b) cosine similarity from a static Word2Vec/GloVe embedding as a proxy for thematic relatedness.
Aggregate per topic: produce a 2D score (mean taxonomic, mean thematic) and the ratio of the two means, which places each topic on the spectrum.
Render a rich CLI table and an optional matplotlib scatter plot showing all topics positioned on the two axes.
Add a --compare flag to run the same audit on two model files side by side (e.g. LDA vs. BERTopic on the same corpus) and print a diff summary.
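
The scoring and aggregation steps above can be sketched as follows. This is a minimal illustration, not a finished implementation: the names mean_pair_score, audit_topic, and table_scorer are hypothetical, the ratio is assumed to be thematic over taxonomic, and the two scorers are toy lookup tables with made-up numbers so the block runs without WordNet data or embedding files.

```python
from itertools import combinations
from statistics import mean
from typing import Callable, Dict, FrozenSet, Optional, Sequence, Tuple

# Hypothetical interface: a pair scorer returns a float, or None when it
# cannot score the pair (e.g. a word missing from WordNet or the vocabulary).
PairScorer = Callable[[str, str], Optional[float]]


def mean_pair_score(words: Sequence[str], scorer: PairScorer) -> Optional[float]:
    """Mean score over all unordered within-topic word pairs, skipping None."""
    scores = [scorer(a, b) for a, b in combinations(words, 2)]
    scored = [s for s in scores if s is not None]
    return mean(scored) if scored else None


def audit_topic(
    words: Sequence[str], taxonomic: PairScorer, thematic: PairScorer
) -> Tuple[Optional[float], Optional[float], Optional[float]]:
    """Return (mean taxonomic, mean thematic, thematic / taxonomic ratio)."""
    tax = mean_pair_score(words, taxonomic)
    the = mean_pair_score(words, thematic)
    # The ratio is undefined when either mean is missing or the divisor is 0.
    if tax is None or the is None or tax == 0:
        return tax, the, None
    return tax, the, the / tax


def table_scorer(table: Dict[FrozenSet[str], float]) -> PairScorer:
    """Build a toy scorer from a dict keyed by unordered word pairs."""
    return lambda a, b: table.get(frozenset((a, b)))


# Made-up numbers for illustration only, not real WordNet or embedding scores.
toy_taxonomic = table_scorer({
    frozenset(("cat", "dog")): 0.2,
    frozenset(("cat", "leash")): 0.05,
    frozenset(("dog", "leash")): 0.05,
})
toy_thematic = table_scorer({
    frozenset(("cat", "dog")): 0.8,
    frozenset(("cat", "leash")): 0.3,
    frozenset(("dog", "leash")): 0.7,
})

print(audit_topic(["cat", "dog", "leash"], toy_taxonomic, toy_thematic))
```

In a real run, the taxonomic scorer would presumably wrap NLTK's Synset.path_similarity and the thematic scorer gensim's KeyedVectors.similarity. Passing the scorers in as plain functions keeps the aggregation logic testable without either dependency.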

Risks

WordNet coverage is sparse for domain-specific corpora (medical, legal, code): many word pairs will return None, and treating them as zero skews the taxonomic score toward zero. The tool needs an explicit, visible fallback (e.g. zero-fill with a warning).
Static embeddings (Word2Vec/GloVe) conflate the two axes themselves; the thematic proxy will be noisy unless you use a relatedness-specific dataset like MEN as a calibration check.
Topic models with very short top-word lists (N < 5) produce too few pairs for stable mean scores (N = 4 yields only 6 pairs), so the tool needs a minimum-N guard and clear uncertainty indicators in its output.
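
One way the guards described in these risks might look is sketched below. Everything here is an assumption for illustration: the function name summarize_pairs, the minimum list length of 5, and the 0.5 coverage threshold are hypothetical. The sketch also skips unscored pairs and reports coverage instead of zero-filling, which is a different fallback from the one suggested above. The standard error is only a rough indicator, since pairwise scores that share words are not independent.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Dict, Optional, Sequence

MIN_WORDS = 5        # assumed minimum top-word list length
MIN_COVERAGE = 0.5   # assumed minimum share of pairs that must be scored


def summarize_pairs(
    scores: Sequence[Optional[float]],
    n_words: int,
    min_words: int = MIN_WORDS,
    min_coverage: float = MIN_COVERAGE,
) -> Dict[str, object]:
    """Summarize pair scores with coverage and a rough uncertainty estimate."""
    scored = [s for s in scores if s is not None]
    coverage = len(scored) / len(scores) if scores else 0.0
    summary: Dict[str, object] = {
        "n_pairs": len(scores),
        "coverage": coverage,
        "mean": mean(scored) if scored else None,
        # stdev needs at least two data points.
        "stderr": stdev(scored) / sqrt(len(scored)) if len(scored) >= 2 else None,
    }
    # Flag the topic as unreliable rather than silently reporting a mean.
    summary["reliable"] = n_words >= min_words and coverage >= min_coverage
    return summary


# Illustrative input: two of four pairs could not be scored.
print(summarize_pairs([0.2, None, 0.4, None], n_words=4))
```

A CLI table could then dim or annotate any row whose reliable flag is False and show coverage next to the mean.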

Business Angle