Rubric-Driven Creative Quality Scorer

A web app that lets you define structured rubrics for subjective creative tasks, then compares rubric-based LLM scores against pairwise human preferences to reveal where the two methods diverge.

Difficulty: weekend | Stack: Python, FastAPI, Anthropic SDK, HTMX, SQLite

Who this is for

Prompt engineers and content teams evaluating LLM-generated copy, stories, or marketing text who need reproducible quality scores rather than gut-feel comparisons.

Build steps

Build a rubric editor: a simple HTMX form where users define up to 6 scored dimensions (e.g., Originality 1–5, Tone Match 1–5) with per-dimension descriptions stored in SQLite.
Implement a rubric scorer: for each piece of creative text submitted, send a structured prompt to Claude asking it to score every dimension and return the scores as JSON; store each score with its rubric ID and text ID.
Add a pairwise comparison mode: show the user two outputs side-by-side and record which they prefer; store the preference in SQLite alongside the rubric scores for the same pair.
Build a divergence report: a simple table listing the pairs where the rubric total score predicts one winner but the human pairwise preference picked the other, surfacing the blind spots of each method.
Expose a /export endpoint that returns all scores and preferences as CSV so users can do their own analysis in a notebook.
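
The scoring and divergence steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the table and column names, the dimension dict keys ("name", "description", "min", "max"), and the helper functions build_prompt, parse_scores, and find_divergences are all hypothetical. The actual call to Claude through the Anthropic SDK is left out; parse_scores only assumes the reply text is a JSON object mapping dimension names to integers.

```python
import json
import sqlite3

# Hypothetical schema; table and column names are illustrative only.
SCHEMA = """
CREATE TABLE dimensions (rubric_id INTEGER, name TEXT, description TEXT,
                         min_score INTEGER, max_score INTEGER);
CREATE TABLE scores (rubric_id INTEGER, text_id INTEGER,
                     dimension TEXT, score INTEGER);
CREATE TABLE preferences (rubric_id INTEGER, text_a INTEGER,
                          text_b INTEGER, preferred INTEGER);
"""


def build_prompt(dimensions, text):
    """Build a prompt asking for one integer score per rubric dimension."""
    lines = [
        f"- {d['name']} ({d['min']}-{d['max']}): {d['description']}"
        for d in dimensions
    ]
    return (
        "Score the text on each dimension below. Reply with only a JSON "
        "object mapping each dimension name to an integer score.\n"
        + "\n".join(lines)
        + "\n\nText:\n"
        + text
    )


def parse_scores(raw, dimensions):
    """Validate a model reply: every dimension present, integer, in range."""
    data = json.loads(raw)
    scores = {}
    for d in dimensions:
        value = data[d["name"]]  # KeyError if the model skipped a dimension
        is_int = isinstance(value, int) and not isinstance(value, bool)
        if not is_int or not d["min"] <= value <= d["max"]:
            raise ValueError(f"bad score for {d['name']}: {value!r}")
        scores[d["name"]] = value
    return scores


def find_divergences(conn, rubric_id):
    """Return pairs where the rubric total and the human pick disagree.

    Pairs with tied totals are skipped because the rubric predicts no winner.
    """
    query = """
        WITH totals AS (
            SELECT text_id, SUM(score) AS total
            FROM scores WHERE rubric_id = ? GROUP BY text_id
        )
        SELECT p.text_a, p.text_b, a.total, b.total, p.preferred
        FROM preferences p
        JOIN totals a ON a.text_id = p.text_a
        JOIN totals b ON b.text_id = p.text_b
        WHERE p.rubric_id = ?
          AND a.total != b.total
          AND p.preferred !=
              CASE WHEN a.total > b.total THEN p.text_a ELSE p.text_b END
    """
    return conn.execute(query, (rubric_id, rubric_id)).fetchall()
```

Validating the reply before storing it keeps malformed or out-of-range scores out of the divergence report. Each row returned by find_divergences holds the two text IDs, their rubric totals, and the ID the human preferred.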

Risks

LLM judges can be self-consistent without being aligned with actual human taste. The divergence report may show high disagreement, which is the point, but users may distrust the tool rather than update their rubric; ship with a worked example to set expectations.
Rubric dimensions can be correlated (Clarity and Conciseness often move together), inflating composite scores; warn users when two dimensions have a Pearson correlation above 0.8 across their eval set.
Weekend scope can explode if multi-user support is added, so keep auth out of scope and treat this as a single-user local tool; document that as a known limitation.
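
The correlated-dimensions warning described above could be computed roughly like this. It is a sketch, assuming scores are already grouped into one list per dimension with the texts in the same order; the function names pearson and correlated_pairs and the input shape are hypothetical, and only the 0.8 threshold comes from the brief.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length score lists.

    Returns None when either list has zero variance, since the
    coefficient is undefined in that case.
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / sqrt(var_x * var_y)


def correlated_pairs(scores_by_dimension, threshold=0.8):
    """List dimension pairs whose correlation exceeds the threshold."""
    flagged = []
    for (name_a, xs), (name_b, ys) in combinations(scores_by_dimension.items(), 2):
        r = pearson(xs, ys)
        if r is not None and r > threshold:
            flagged.append((name_a, name_b, round(r, 3)))
    return flagged
```

With only a handful of scored texts the coefficient is noisy, so the warning is probably best shown once the eval set reaches a reasonable size. This sketch flags only positive correlations, matching the "above 0.8" wording.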
