Style-Codebook Writing Assistant

A CLI tool that learns your writing style, compresses it into a codebook embedding, and injects it as a compact prefix into every LLM call—no prompt bloat.

Difficulty: weekend | Stack: Python, sentence-transformers, scikit-learn, Anthropic Claude API, SQLite, Typer

Who this is for

Writers, developers writing docs, or anyone who wants LLM outputs that sound like them—without pasting 10 example paragraphs into every prompt.

Build steps

1. Collect 20-50 writing samples from the user (emails, docs, notes), split them into chunks, and embed each chunk with a sentence-transformer model (e.g. all-MiniLM-L6-v2).
2. Fit a small k-means codebook (k=32–64) over embeddings from a diverse seed corpus (e.g. WikiText + CommonCrawl samples) to establish shared style archetypes.
3. Assign each user sample to its nearest centroid and represent the user as a histogram of assignment counts over the k centroids; store this as a short integer vector in SQLite.
4. At inference time, decode the user vector back into a soft embedding (a count-weighted average of the centroids), format it as a compact natural-language style description (e.g. ‘concise, active voice, avoids jargon’) via a small lookup table, and prepend that description to the system prompt.
5. Build a Typer CLI with three commands: style learn <files>, style run <prompt>, and style compare, which A/B-tests output with and without the user model.
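
The assignment, storage, and decoding steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: it reads the user vector as per-centroid assignment counts, uses toy 2-D points in place of real sentence embeddings, and hard-codes four centroids where a real build would take them from a fitted scikit-learn KMeans model (its cluster_centers_ attribute). The function names, the table schema, and the descriptor lookup keyed by centroid index are all hypothetical.

```python
import json
import math
import sqlite3
from collections import Counter

# Toy 2-D stand-ins for real centroids (assumption: a real codebook has k=32-64).
CENTROIDS = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

# Hypothetical hand-written lookup table: centroid index -> style descriptor.
DESCRIPTORS = {0: "concise", 1: "active voice", 2: "avoids jargon", 3: "formal"}


def nearest_centroid(vec, centroids):
    """Return the index of the centroid closest to vec (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(vec, centroids[i]))


def user_vector(sample_embeddings, centroids):
    """Count how many samples land on each centroid: a length-k integer vector."""
    counts = Counter(nearest_centroid(v, centroids) for v in sample_embeddings)
    return [counts.get(i, 0) for i in range(len(centroids))]


def soft_embedding(counts, centroids):
    """Decode counts into a count-weighted average of the centroids."""
    total = sum(counts)  # assumes at least one sample was assigned
    dim = len(centroids[0])
    return [
        sum(c * cen[d] for c, cen in zip(counts, centroids)) / total
        for d in range(dim)
    ]


def style_prefix(counts, descriptors, top_n=2):
    """Build a short description from the most frequently used centroids."""
    ranked = sorted(range(len(counts)), key=lambda i: counts[i], reverse=True)
    picked = [descriptors[i] for i in ranked[:top_n] if counts[i] > 0]
    return "Write in this style: " + ", ".join(picked) + "."


def save_user(conn, name, counts):
    """Persist the integer vector as JSON text (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS users (name TEXT PRIMARY KEY, counts TEXT)"
    )
    conn.execute(
        "INSERT OR REPLACE INTO users VALUES (?, ?)", (name, json.dumps(counts))
    )
    conn.commit()


def load_user(conn, name):
    """Load a stored vector, or None if the user is unknown."""
    row = conn.execute(
        "SELECT counts FROM users WHERE name = ?", (name,)
    ).fetchone()
    return json.loads(row[0]) if row else None


samples = [(0.1, 0.0), (0.2, 0.1), (0.9, 0.1), (0.0, 0.2), (0.1, 0.9)]
conn = sqlite3.connect(":memory:")
save_user(conn, "demo", user_vector(samples, CENTROIDS))

counts = load_user(conn, "demo")             # [3, 1, 1, 0]
soft = soft_embedding(counts, CENTROIDS)     # weighted centroid average
prefix = style_prefix(counts, DESCRIPTORS)   # text to prepend to the system prompt
```

In this sketch the description is read directly from the highest-count centroids rather than from the soft embedding; the source does not say how the lookup table is keyed, so either reading is a design choice to validate against real samples.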

Risks

Sentence embeddings may capture stylistic nuance (tone, humor) less reliably than they capture syntactic features; output may feel generic if the codebook isn’t seeded with stylistically diverse text.
The natural-language decoding of the embedding (step 4) is a heuristic; mismatched descriptions silently degrade quality without any error signal.
With only 20-50 samples, the user vector is noisy—rare or domain-specific styles may collapse to the wrong centroid.
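
One rough way to see how noisy a small sample is would be to bootstrap the per-sample centroid assignments and check how often the dominant centroid survives resampling. The sketch below is an illustrative check, not part of the build steps; the function names, the number of rounds, and the example assignment lists are all hypothetical.

```python
import random
from collections import Counter


def top_centroid(assignments):
    """Most frequent centroid index in a list of per-sample assignments."""
    return Counter(assignments).most_common(1)[0][0]


def bootstrap_agreement(assignments, n_rounds=200, seed=0):
    """Fraction of bootstrap resamples whose top centroid matches the original."""
    rng = random.Random(seed)
    reference = top_centroid(assignments)
    hits = 0
    for _ in range(n_rounds):
        # Resample with replacement, keeping the sample size fixed.
        resample = rng.choices(assignments, k=len(assignments))
        hits += top_centroid(resample) == reference
    return hits / n_rounds


# 20 samples with one clearly dominant centroid.
stable = [3] * 18 + [7] * 2
# 20 samples spread almost evenly over three centroids.
split = [3] * 7 + [7] * 7 + [12] * 6

stable_score = bootstrap_agreement(stable)
split_score = bootstrap_agreement(split)
```

A low agreement score suggests the user vector would change noticeably with a different handful of samples, which is a hint to collect more text before trusting the generated style description.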
