Novelty Memory Bot for Your Reading List
A CLI tool that scores each new paper you add against your personal reading history, flagging genuine novelty vs. incremental rehash.
Difficulty: weekend | Stack: Python, Claude API (claude-haiku for embeddings/scoring), SQLite + sqlite-vec or ChromaDB, arXiv API, Typer (CLI)
Who this is for
Researchers and technical leads who read 10-30 papers a week and waste time on papers that repackage known ideas — the tool surfaces only papers that introduce something not already in their personal corpus.
Build steps
- Build a CLI command
add <arXiv_id>that fetches abstract + metadata from the arXiv API and stores it in a local ChromaDB collection with vector embeddings. - On each
add, retrieve the top-5 most similar past papers by cosine similarity and send them alongside the new abstract to Claude with a prompt: ‘Given these prior papers the user has read, what — if anything — is genuinely novel in the new one? Score novelty 1-10 and list specific novel claims.’ - Persist novelty scores and Claude’s reasoning in SQLite so the user can query
list --min-novelty 7to surface high-value papers. - Add a
digestcommand that batches the last N unreviewed papers and returns a ranked reading list ordered by novelty score. - Export to markdown or Obsidian-compatible notes with novelty annotations embedded as frontmatter.
Risks
- Embedding similarity alone can miss novelty in cross-domain papers (a physics technique applied to biology looks dissimilar but may be well-known in its origin field) — Claude’s reasoning helps but isn’t foolproof.
- ChromaDB persistence across sessions can silently corrupt if the process is killed mid-write; need WAL mode or periodic export to a flat JSON backup.
- For users with large existing libraries (500+ papers), the initial bulk-import + embedding pass can take hours and cost non-trivial API dollars — must warn upfront and add a
--dry-runflag.
Business Angle
A CLI tool that scores arXiv papers for genuine novelty against your personal reading history, so researchers stop re-reading recycled ideas
Customer: Solo ML researcher or technical lead at a seed-stage AI startup — reads 15-25 papers/week on arXiv, tracks papers in Notion or Zotero, constantly annoyed that 40% of papers are incremental tweaks on things they already know cold
Pricing: one-time — $800 in first 60 days via lifetime licenses, then reassess
Full business breakdown →