Cultural Commonsense Probe Harness
A CLI tool that stress-tests any LLM against culturally-grounded commonsense questions you author, then surfaces per-language failure heatmaps.
Difficulty: weekend | Stack: Python, LiteLLM, Pydantic, Rich (CLI), Plotext or Matplotlib
Who this is for
NLP engineers and multilingual product teams who need evidence, beyond BLEU or accuracy scores, that their model reasons correctly in non-English cultural contexts before shipping.
Build steps
- Define a JSON schema for culturally-grounded question sets: {language, country, question, correct_answer, distractors, cultural_note}. Seed with 20–30 hand-authored examples across 5+ languages to validate the schema.
- Build a LiteLLM-backed runner that sends each question as a multiple-choice prompt, collects the model’s choice, and logs pass/fail alongside metadata (language, language family, country).
- Implement a scoring aggregator that groups results by language and language family, then renders a heatmap in the terminal (Plotext) or saves a PNG (Matplotlib).
- Add a --compare flag to run two models head-to-head and diff their per-language scores, highlighting the languages where one model outperforms the other on culturally grounded questions.
- Write a minimal contribution guide so colleagues can add new question sets in their own language via a YAML template that maps onto the same schema, following the participatory construction approach of Global PIQA.
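The build steps above can be sketched end to end in a few functions. This is a minimal illustration under stated assumptions, not the tool's actual design: it uses a stdlib dataclass as a stand-in for the Pydantic model, and the names Question, build_prompt, run_probe, accuracy_by, and diff_scores are hypothetical. The model call is passed in as a plain callable (ask) so the sketch runs offline; in the real runner that callable would wrap a LiteLLM completion request.

```python
from __future__ import annotations

import random
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Question:
    """One probe item; fields mirror the schema from the first build step."""
    language: str
    country: str
    question: str
    correct_answer: str
    distractors: tuple[str, ...]
    cultural_note: str = ""


def build_prompt(q: Question, rng: random.Random) -> tuple[str, str]:
    """Return (prompt, correct_letter) with the answer options shuffled."""
    options = [q.correct_answer, *q.distractors]
    rng.shuffle(options)  # avoid the correct answer always sitting at "A"
    letters = "ABCDEFGH"[: len(options)]
    lines = [q.question]
    lines += [f"{letter}. {option}" for letter, option in zip(letters, options)]
    lines.append("Answer with a single letter.")
    return "\n".join(lines), letters[options.index(q.correct_answer)]


def run_probe(questions, ask: Callable[[str], str], seed: int = 0) -> list[dict]:
    """Send every question through `ask` and log pass/fail with metadata."""
    rng = random.Random(seed)
    results = []
    for q in questions:
        prompt, correct = build_prompt(q, rng)
        reply = ask(prompt).strip().upper()
        results.append(
            {"language": q.language, "country": q.country,
             "passed": reply[:1] == correct}
        )
    return results


def accuracy_by(results: list[dict], key: str) -> dict[str, float]:
    """Group results by a metadata key and return the pass rate per group."""
    totals = defaultdict(lambda: [0, 0])  # group -> [passed, seen]
    for r in results:
        totals[r[key]][0] += r["passed"]
        totals[r[key]][1] += 1
    return {group: passed / seen for group, (passed, seen) in totals.items()}


def diff_scores(a: dict[str, float], b: dict[str, float]) -> dict[str, float]:
    """Per-group score of model A minus model B, for groups both models ran."""
    return {g: round(a[g] - b[g], 3) for g in sorted(a.keys() & b.keys())}
```

The dictionary returned by accuracy_by is the input a heatmap renderer would consume, and diff_scores is the core of a --compare report: positive values mark languages where the first model scores higher. Shuffling the options per question is a design choice to reduce position bias in multiple-choice prompting.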
Risks
- Hand-authored questions may inadvertently encode the author’s own cultural blind spots, defeating the purpose. Mitigate by getting at least one native-speaker review per language before treating results as ground truth.
- Multiple-choice prompting elicits different behaviors across model families; a model might ‘guess’ correctly without genuine reasoning, inflating scores. Add a chain-of-thought logging option to spot this.
- Rate limits and costs from the providers behind LiteLLM can spike quickly when running many models × many questions. Cap the default test size and add a --dry-run mode that prints the prompt count and estimated cost.
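The dry-run mitigation above comes down to simple arithmetic: prompts = capped questions × models, and cost = tokens × price. The sketch below is one possible shape for it, not a definitive implementation. The names Pricing and dry_run_estimate, the default cap of 50, and the average token counts are all assumptions, and per-token prices must be supplied from each provider's current price sheet.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """USD per 1,000 tokens (assumed inputs; take real values from the provider)."""
    input_per_1k: float
    output_per_1k: float


def dry_run_estimate(n_questions, pricing_by_model, avg_prompt_tokens,
                     avg_reply_tokens, max_questions=50):
    """Estimate prompt count and cost without calling any model."""
    n = min(n_questions, max_questions)  # cap the default test size
    per_model = {}
    for model, price in pricing_by_model.items():
        tokens_cost = (avg_prompt_tokens * price.input_per_1k
                       + avg_reply_tokens * price.output_per_1k)
        per_model[model] = round(n * tokens_cost / 1000, 4)
    return {
        "prompts": n * len(pricing_by_model),
        "cost_usd": per_model,
        "total_usd": round(sum(per_model.values()), 4),
    }
```

Enabling chain-of-thought logging raises the average reply length considerably, so the estimate should be rerun with a larger avg_reply_tokens when that option is on.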
Business angle
Sell the Cultural Commonsense Probe Harness as a pay-per-report CLI tool to NLP engineers who need to ship multilingual LLM features without culturally embarrassing incidents.
Customer: NLP engineer or ML lead at a 10–50 person startup that just added a non-English language to their LLM product (e.g., Japanese customer support bot, Arabic legal assistant) and has no internal eval infrastructure beyond accuracy scores
Pricing: one-time purchase at $99 per report; about $800 in month 3 from 8 report purchases, targeting $2k/mo by month 6