Rubric-Driven Creative Quality Scorer
A web app that lets you define structured rubrics for subjective creative tasks, then compares rubric-based LLM scoring against pairwise human preference to show where the two methods diverge.
Difficulty: weekend | Stack: Python, FastAPI, Anthropic SDK, HTMX, SQLite
Who this is for
Prompt engineers and content teams evaluating LLM-generated copy, stories, or marketing text who need reproducible quality scores rather than gut-feel comparisons.
Build steps
- Build a rubric editor: a simple HTMX form where users define up to 6 scored dimensions (e.g., Originality 1–5, Tone Match 1–5) with per-dimension descriptions stored in SQLite.
- Implement a rubric-scorer: for each piece of creative text submitted, send a structured prompt to Claude asking it to score each dimension and return JSON; store scores with the rubric ID and text ID.
- Add a pairwise comparison mode: show the user two outputs side-by-side and record which they prefer; store the preference in SQLite alongside the rubric scores for the same pair.
- Build a divergence report: a simple table listing the pairs where the rubric total score favors one output but the human pairwise preference picked the other, surfacing the blind spots of each method.
- Expose a /export endpoint that returns all scores and preferences as CSV so users can do their own analysis in a notebook.
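The scoring and divergence steps above can be sketched in a few functions. This is a minimal illustration, not a definitive implementation: the `RUBRIC` dict, the `Pair` dataclass, and the function names are all hypothetical, and it assumes the model's reply is a flat JSON object mapping each dimension name to an integer score. Ties in rubric totals are treated as "no prediction" and left out of the report, which is one possible design choice among several.

```python
import json
from dataclasses import dataclass

# Hypothetical rubric: dimension name -> (min, max) score bounds.
RUBRIC = {"Originality": (1, 5), "Tone Match": (1, 5)}


def parse_scores(reply_text, rubric):
    """Validate a model reply expected to be a JSON object of dimension -> int score."""
    data = json.loads(reply_text)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object of scores")
    scores = {}
    for name, (low, high) in rubric.items():
        value = data.get(name)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise ValueError(f"missing or non-integer score for {name!r}")
        if not low <= value <= high:
            raise ValueError(f"score for {name!r} out of range: {value}")
        scores[name] = value
    return scores


@dataclass
class Pair:
    text_a: str        # ID of the first text shown
    text_b: str        # ID of the second text shown
    human_choice: str  # "a" or "b", as recorded in pairwise mode


def rubric_winner(total_a, total_b):
    """Return "a" or "b" for the higher rubric total, or None on a tie."""
    if total_a == total_b:
        return None
    return "a" if total_a > total_b else "b"


def divergent_pairs(pairs, totals):
    """List pairs where the rubric total and the human preference disagree."""
    rows = []
    for pair in pairs:
        predicted = rubric_winner(totals[pair.text_a], totals[pair.text_b])
        if predicted is not None and predicted != pair.human_choice:
            rows.append((pair.text_a, pair.text_b, predicted, pair.human_choice))
    return rows
```

Each returned row holds the two text IDs, the rubric's predicted winner, and the human's choice, which maps directly onto the columns of the report table and the CSV export.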
Risks
- LLM judges can be self-consistent without matching actual human taste. A divergence report full of disagreement is the point, but users may distrust the tool rather than revise their rubric; ship with a worked example to set expectations.
- Rubric dimensions can be correlated (Clarity and Conciseness often move together), inflating composite scores; warn users when two dimensions have a Pearson correlation above 0.8 across their eval set.
- Weekend scope can explode if multi-user support is added. Keep auth out of scope, treat the app as a single-user local tool, and document this as a known limitation.
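The correlated-dimension warning from the risks above could be computed along these lines. This is a sketch under stated assumptions: `scores_by_dimension` is a hypothetical mapping from dimension name to that dimension's scores across the eval set (same text order in every list), and the function names are illustrative. Pearson correlation is undefined when a dimension never varies, so that case is skipped rather than flagged.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length score lists, or None if undefined."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # a dimension with constant scores has no defined correlation
    return cov / sqrt(var_x * var_y)


def correlated_dimensions(scores_by_dimension, threshold=0.8):
    """Return (dim_a, dim_b, r) for every pair whose correlation exceeds the threshold."""
    warnings = []
    for (name_a, xs), (name_b, ys) in combinations(scores_by_dimension.items(), 2):
        r = pearson(xs, ys)
        if r is not None and r > threshold:
            warnings.append((name_a, name_b, round(r, 3)))
    return warnings
```

With only a handful of scored texts the coefficient is noisy, so it may be worth requiring a minimum eval-set size before showing the warning.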
Business angle
Sell rubric-based LLM evaluation as a productized service to content teams that need audit-proof quality scores for AI-generated copy.
Customer: Solo or small-team content ops manager at a 10-50 person DTC or SaaS company who ships 50-200 AI-generated pieces per month (product descriptions, email copy, blog posts) and is being asked by their CMO to prove quality isn't slipping as they scale with AI.
Pricing: SaaS subscription (MRR). Target of roughly $800 MRR within 4 months: 8 customers at $99/mo ($792) on a 500-eval/month plan.