Semantic Geometry Side-by-Side Viewer
A Streamlit app that runs LDA and BERTopic on the same uploaded corpus and visually contrasts the thematic vs. taxonomic signature of each model’s topics using psycholinguistic benchmark anchors.
Difficulty: 1 week | Stack: Python, spaCy, gensim, BERTopic, sentence-transformers, Streamlit, Plotly, pandas, Docker
Who this is for
Social scientists and computational humanists who must choose between or combine LDA and neural topic models and want concrete evidence of the semantic difference before committing to an interpretation.
Build steps
- Build a Streamlit file-upload interface that accepts a plain-text or CSV corpus; tokenize and preprocess with spaCy (lemmatize, remove stopwords).
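- Train LDA (via MALLET for quality; gensim 4.x removed the LdaMallet wrapper, so either pin gensim 3.8 or fall back to gensim's native LdaModel) and BERTopic (with a sentence-transformers backbone) on the same preprocessed corpus with a matched number of topics k. BERTopic infers its topic count from HDBSCAN clustering, so matching k means reducing its topics afterwards (e.g. with the nr_topics parameter).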
- For each model, score every topic on the thematic/taxonomic axes: use MEN word pairs (rated for relatedness) as ‘thematic anchors’ and SimLex-999 pairs (rated for similarity rather than association) as ‘taxonomic anchors’; embed each topic’s top words and the anchor words in the same embedding space, then score the topic against each anchor set by cosine similarity.
- Render a side-by-side Plotly scatter where each bubble is a topic, x-axis = taxonomic score, y-axis = thematic score, colored by model; clicking a bubble shows its top words.
- Add a ‘topic alignment’ panel that pairs the most similar LDA and BERTopic topics and shows their axis divergence with an explanation blurb (e.g. ‘BERTopic topic 3 is 40% more taxonomic than its LDA counterpart’).
- Package with a Docker Compose file so a single command (docker compose up) launches the app and it can be shared without a local Python install.
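The scoring step above leaves the exact alignment measure open. A minimal sketch of one possible reading follows: average the cosine similarity between a topic's centroid and every anchor word. The function names (axis_score, centroid) and the 2-d toy embeddings are invented for illustration; in the app the vectors would presumably come from the sentence-transformers model, and the anchor words from the MEN and SimLex-999 pair lists.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def axis_score(top_words, anchor_words, embed):
    """Mean cosine similarity between the topic centroid and each anchor word.

    Words missing from `embed` are skipped. Returns None when either side
    has no embedded words, so the caller can show a warning instead of a score.
    """
    topic_vecs = [embed[w] for w in top_words if w in embed]
    anchor_vecs = [embed[w] for w in anchor_words if w in embed]
    if not topic_vecs or not anchor_vecs:
        return None
    c = centroid(topic_vecs)
    return sum(cosine(c, a) for a in anchor_vecs) / len(anchor_vecs)

# Toy 2-d embeddings, made up for this example only.
toy_embed = {
    "dog": [1.0, 0.0], "cat": [0.9, 0.1],
    "leash": [0.1, 0.9], "bone": [0.2, 0.8],
}
topic = ["dog", "cat"]
taxonomic = axis_score(topic, ["cat"], toy_embed)            # stand-in for SimLex-999 anchors
thematic = axis_score(topic, ["leash", "bone"], toy_embed)   # stand-in for MEN anchors
```

With these toy vectors the topic scores higher on the taxonomic axis than on the thematic one. The two scores map directly onto the x and y coordinates of a bubble in the scatter plot.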
Risks
- BERTopic training time scales poorly past ~50k documents without a GPU, so the UI needs a corpus-size cap. At the other extreme, a corpus too small for stable HDBSCAN clusters needs its own warning, tied to min_cluster_size.
- MEN and SimLex-999 are general-English datasets, so domain corpora (biomedical, legal) will have low anchor overlap, making the axis scores unreliable; the app must surface a ‘low anchor coverage’ warning prominently.
- Matching LDA and BERTopic topics for the alignment panel is non-trivial: Jensen-Shannon divergence over top-word distributions saturates at its maximum when the two models’ top-word vocabularies barely overlap, so the panel may need to fall back to cosine similarity between embedding centroids.
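The last two risks can be sketched in a few lines. The example below is an illustration under stated assumptions, not a prescribed design: it treats each topic as a word-to-probability dict renormalized over its top words, uses a hypothetical Jaccard threshold (min_overlap=0.2) to decide when Jensen-Shannon divergence is trustworthy, and falls back to centroid cosine otherwise. The anchor_coverage helper and its 0.6 cut-off are likewise invented for the example.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2, so bounded by 1) between two
    word -> probability dicts, evaluated over the union of their vocabularies."""
    total = 0.0
    for w in set(p) | set(q):
        pw, qw = p.get(w, 0.0), q.get(w, 0.0)
        mw = 0.5 * (pw + qw)
        if pw > 0:
            total += 0.5 * pw * math.log2(pw / mw)
        if qw > 0:
            total += 0.5 * qw * math.log2(qw / mw)
    return total

def jaccard(a, b):
    """Jaccard overlap between two collections of words."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def topic_similarity(p, q, centroid_p, centroid_q, min_overlap=0.2):
    """Return (similarity, method): 1 - JSD when the top-word sets overlap
    enough, otherwise cosine similarity of the topics' embedding centroids."""
    if jaccard(p, q) >= min_overlap:
        return 1.0 - js_divergence(p, q), "jsd"
    return cosine(centroid_p, centroid_q), "centroid"

def anchor_coverage(top_words, anchor_vocab):
    """Fraction of a topic's top words that occur in the anchor vocabulary."""
    if not top_words:
        return 0.0
    return sum(1 for w in top_words if w in anchor_vocab) / len(top_words)

# Toy topics with disjoint vocabularies: JSD hits its maximum even though
# the topics are clearly related, so the centroid fallback is used.
lda_topic = {"court": 0.6, "judge": 0.4}
bert_topic = {"tribunal": 0.5, "magistrate": 0.5}
sim, method = topic_similarity(lda_topic, bert_topic, [0.9, 0.1], [0.8, 0.2])

# Hypothetical threshold for the 'low anchor coverage' warning.
show_warning = anchor_coverage(list(lda_topic), {"dog", "cat", "judge"}) < 0.6
```

In the app, the pairwise similarities would fill an LDA-by-BERTopic matrix from which the best pairs are read off. Recording the method alongside each score lets the alignment panel say which measure produced a given match.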