Style-Codebook Writing Assistant
A CLI tool that learns your writing style, compresses it into a codebook embedding, and injects it as a compact prefix into every LLM call—no prompt bloat.
Difficulty: weekend | Stack: Python, sentence-transformers, scikit-learn, Anthropic Claude API, SQLite, Typer
Who this is for
Writers, developers writing docs, or anyone who wants LLM outputs that sound like them—without pasting 10 example paragraphs into every prompt.
Build steps
- Collect 20-50 writing samples from the user (emails, docs, notes) and embed each chunk with a sentence-transformer model (e.g. all-MiniLM-L6-v2).
- Fit a small k-means codebook (k=32–64) over embeddings from a diverse seed corpus (e.g. WikiText + CommonCrawl samples) to establish shared archetypes.
- Assign each user sample to its nearest centroid; represent the user as a histogram of centroid assignments (how many samples landed on each centroid), and store this as a short integer vector in SQLite.
- At inference time, decode the user vector back into a soft embedding, format it as a compact natural-language style description (e.g. ‘concise, active voice, avoids jargon’) via a small lookup table, and prepend to the system prompt.
- Build a Typer CLI: style learn <files>, style run <prompt>, and style compare to A/B output with vs. without the user model.
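The assignment, storage, and decoding steps above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not a reference implementation: the 2-D CENTROIDS stand in for the centroids a fitted k-means model would provide, SAMPLES stands in for real sentence embeddings, and the TRAITS lookup table, the user_style table name, and all function names are hypothetical. For brevity it maps the most frequent centroids straight to traits and skips the intermediate soft embedding.

```python
import json
import math
import sqlite3
from collections import Counter

# Hypothetical 2-D centroids; in the real tool these would come from
# a k-means model fitted on the seed corpus (k = 32-64, higher-dimensional).
CENTROIDS = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

# Hypothetical hand-written lookup table: centroid index -> style trait.
TRAITS = {0: "concise", 1: "active voice", 2: "avoids jargon"}

# Toy stand-ins for the embeddings of a user's writing samples.
SAMPLES = [[0.9, 0.1], [0.8, -0.2], [0.7, 0.3],
           [0.1, 0.9], [0.2, 0.8], [-0.9, 0.0]]


def nearest_centroid(vec, centroids):
    """Index of the centroid closest to vec (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(vec, centroids[i]))


def user_histogram(sample_vecs, centroids):
    """Count how many samples land on each centroid."""
    counts = Counter(nearest_centroid(v, centroids) for v in sample_vecs)
    return [counts.get(i, 0) for i in range(len(centroids))]


def save_user(conn, user_id, hist):
    """Store the integer vector as JSON text in SQLite."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS user_style "
        "(user_id TEXT PRIMARY KEY, hist TEXT NOT NULL)"
    )
    conn.execute(
        "INSERT OR REPLACE INTO user_style VALUES (?, ?)",
        (user_id, json.dumps(hist)),
    )
    conn.commit()


def load_user(conn, user_id):
    """Return the stored vector, or None if the user is unknown."""
    row = conn.execute(
        "SELECT hist FROM user_style WHERE user_id = ?", (user_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None


def style_prefix(hist, traits, top_n=2):
    """Describe the top_n most frequent centroids in plain language."""
    ranked = sorted(range(len(hist)), key=lambda i: hist[i], reverse=True)
    chosen = [traits[i] for i in ranked[:top_n] if hist[i] > 0 and i in traits]
    return "Write in this style: " + ", ".join(chosen) + "."


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    save_user(conn, "me", user_histogram(SAMPLES, CENTROIDS))
    print(style_prefix(load_user(conn, "me"), TRAITS))
```

The returned string is what would be prepended to the system prompt. Storing the histogram as JSON text keeps the schema trivial; a BLOB or one column per centroid would work equally well.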
Risks
- Sentence embeddings may not capture stylistic nuance (tone, humor) as well as syntactic features—output may feel generic if the codebook isn’t seeded with stylistically diverse text.
- The natural-language decoding of the embedding (step 4) is a heuristic; mismatched descriptions silently degrade quality without any error signal.
- With only 20-50 samples, the user vector is noisy—rare or domain-specific styles may collapse to the wrong centroid.
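One possible way to put a number on the small-sample noise described above is to bootstrap the user's centroid assignments and check how often the dominant centroid stays the same. The sketch below is an illustration only; the function names and the choice of 500 resampling rounds are assumptions, not part of the original design.

```python
import random
from collections import Counter


def top_centroid(assignments):
    """Most frequent centroid index; ties break toward the lower index."""
    counts = Counter(assignments)
    return min(counts, key=lambda i: (-counts[i], i))


def bootstrap_stability(assignments, n_rounds=500, seed=0):
    """Fraction of bootstrap resamples whose top centroid matches the full sample."""
    rng = random.Random(seed)
    reference = top_centroid(assignments)
    hits = 0
    for _ in range(n_rounds):
        # Resample with replacement, keeping the sample size fixed.
        resample = rng.choices(assignments, k=len(assignments))
        if top_centroid(resample) == reference:
            hits += 1
    return hits / n_rounds


if __name__ == "__main__":
    clear_style = [0] * 18 + [1] * 2   # 20 samples, one dominant centroid
    mixed_style = [0] * 11 + [1] * 9   # 20 samples, nearly split
    print(bootstrap_stability(clear_style))
    print(bootstrap_stability(mixed_style))
```

A score near 1.0 suggests the dominant centroid is robust to which samples happened to be collected; a noticeably lower score is a hint to gather more samples before trusting the style description.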
Business Angle
A CLI writing assistant that encodes your personal style once and silently applies it to every AI draft—no copy-pasting examples.
Customer: Solo developer-advocates and technical bloggers (think: one-person DevRel, indie OSS maintainers, Substack writers with a technical bent) who publish 2–4 long-form pieces per month and already use Claude or GPT daily but hate that every output sounds like the same corporate AI voice.
Pricing: freemium, targeting $600 MRR within 4 months (roughly 30 paid users at $20/mo)