Cultural Commonsense Probe Harness

A CLI tool that stress-tests any LLM against culturally-grounded commonsense questions you author, then surfaces per-language failure heatmaps.

Difficulty: weekend | Stack: Python, LiteLLM, Pydantic, Rich (CLI), Plotext or Matplotlib

Who this is for

NLP engineers and multilingual product teams who need evidence—beyond BLEU/accuracy—that their model actually reasons correctly in non-English cultural contexts before shipping.

Build steps

Define a JSON schema for culturally-grounded question sets: {language, country, question, correct_answer, distractors, cultural_note}. Seed with 20–30 hand-authored examples across 5+ languages to validate the schema.
Build a LiteLLM-backed runner that sends each question as a multiple-choice prompt, collects the model’s choice, and logs pass/fail alongside metadata (language, language family, country).
Implement a scoring aggregator that groups results by language and language family, then renders a heatmap in the terminal (Plotext) or saves a PNG (Matplotlib).
Add a —compare flag to run two models head-to-head and diff their per-language scores, highlighting where one model outperforms the other culturally.
Write a minimal contribution guide so colleagues can add new question sets in their own language via a YAML template, mimicking the participatory construction lesson from Global PIQA.

Risks

Hand-authored questions may inadvertently encode the author’s own cultural blind spots, defeating the purpose—mitigate by getting at least one native-speaker review per language before treating results as ground truth.
Multiple-choice prompting elicits different behaviors across model families; a model might ‘guess’ correctly without genuine reasoning, inflating scores—add a chain-of-thought logging option to spot this.
LiteLLM rate limits and costs can spike quickly when running many models × many questions—cap default test size and add a —dry-run mode that prints prompt count and estimated cost.

Cultural Commonsense Probe Harness

Who this is for

Build steps

Risks

Business Angle