Watermark Robustness Sandbox

An interactive web tool that lets you embed a token-level watermark into LLM output, then attack it with paraphrasing and synonym substitution to measure survival rate.

Difficulty: 1-week | Stack: Python, FastAPI, HuggingFace Transformers, Next.js, Tailwind CSS

Who this is for

Developers building attribution pipelines for synthetic content (news, legal drafts, code) who want empirical data on how robust a chosen watermarking scheme is before committing to it in production.

Build steps

Implement two watermarking schemes: the classic Kirchenbauer green-list scheme and a simplified seed-pooling variant (inspired by WaterSearch) that spreads the signal across multiple token windows.
Build a FastAPI backend with three endpoints: /generate (watermarked text), /detect (returns p-value and scheme confidence), and /attack (runs paraphrase via a small model + synonym swap and returns post-attack detectability).
Create a Next.js UI with a split-pane layout: the left pane shows the watermarked text with ‘green-list’ tokens highlighted; the right pane shows the attack output and the change in detectability score.
Add a comparison table that benchmarks both schemes on: text quality (perplexity delta), detection AUC, and survival rate under three attack intensities.
Write a one-page methodology note auto-generated as PDF from the run results, suitable for sharing with a compliance team.
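
The green-list scheme, the /detect p-value, and the /attack survival check can be sketched with a toy, model-free version. This is a minimal sketch under stated assumptions, not a reference implementation: it decides green-list membership by hashing (key, previous token, candidate token) instead of permuting a real vocabulary, and toy_generate stands in for an LLM with uniform logits plus a bias delta on green tokens. The names is_green, toy_generate, detect, and toy_substitution_attack, the key string, and the values of GAMMA and VOCAB_SIZE are all illustrative.

```python
import hashlib
import math
import random

GAMMA = 0.5        # assumed fraction of the vocabulary on the green list
VOCAB_SIZE = 1000  # toy vocabulary size (assumption, not a real tokenizer)


def is_green(prev_token: int, token: int, key: str = "demo-key",
             gamma: float = GAMMA) -> bool:
    """Hash (key, previous token, token) to a number in [0, 1); green if below gamma."""
    digest = hashlib.sha256(f"{key}:{prev_token}:{token}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 < gamma


def toy_generate(n: int, delta: float = 2.0, seed: int = 0,
                 key: str = "demo-key", gamma: float = GAMMA) -> list[int]:
    """Stand-in for an LLM with uniform logits: adding delta to green logits
    makes the next token green with probability p_green."""
    rng = random.Random(seed)
    p_green = gamma * math.exp(delta) / (gamma * math.exp(delta) + 1 - gamma)
    tokens = [rng.randrange(VOCAB_SIZE)]
    for _ in range(n):
        want_green = rng.random() < p_green
        while True:  # rejection-sample a token of the wanted colour
            cand = rng.randrange(VOCAB_SIZE)
            if is_green(tokens[-1], cand, key, gamma) == want_green:
                tokens.append(cand)
                break
    return tokens


def detect(tokens: list[int], key: str = "demo-key",
           gamma: float = GAMMA) -> tuple[float, float]:
    """Return (z-score, one-sided p-value) for the count of green tokens."""
    t = len(tokens) - 1  # number of (previous, current) pairs scored
    if t < 1:
        raise ValueError("need at least two tokens")
    green = sum(is_green(p, c, key, gamma) for p, c in zip(tokens, tokens[1:]))
    z = (green - gamma * t) / math.sqrt(t * gamma * (1 - gamma))
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # upper tail of N(0, 1)
    return z, p_value


def toy_substitution_attack(tokens: list[int], rate: float,
                            seed: int = 1) -> list[int]:
    """Crude stand-in for synonym swapping: replace each token with a random
    one with probability `rate`."""
    rng = random.Random(seed)
    return [rng.randrange(VOCAB_SIZE) if rng.random() < rate else t
            for t in tokens]


if __name__ == "__main__":
    marked = toy_generate(200, delta=2.0)
    for rate in (0.0, 0.25, 0.5):
        z, p = detect(toy_substitution_attack(marked, rate))
        print(f"attack rate {rate:.2f}: z = {z:.2f}, p = {p:.3g}")
```

In a real build, is_green would be replaced by a seeded partition of the tokenizer vocabulary applied inside the model's logits processing, and the attack would come from the paraphrase model and a synonym resource. The z-test in detect is the part that carries over largely unchanged.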

Risks

Seed-pooling schemes require careful parameter tuning: a naive implementation may produce barely detectable watermarks that pass unit tests but fail on real, diverse text.
Paraphrase attacks that use a separate model introduce a confound: the measured survival rate depends on the quality of the attack model, not just on the strength of the watermark.
Perplexity as a quality metric can be gamed; you may need human eval or a reference-free metric to make quality claims credible.
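
The "barely detectable" risk can be made concrete with a quick power calculation. The sketch below assumes detection uses a one-sided z-test on the green-token count, as in the green-list scheme; the function names, the threshold of 4.0, and the green fraction of 0.5 are illustrative choices, not values taken from any particular implementation.

```python
import math


def expected_z(green_rate: float, n_tokens: int, gamma: float = 0.5) -> float:
    """Expected z-score when a fraction `green_rate` of scored tokens is green
    and a fraction `gamma` would be green by chance."""
    return (green_rate - gamma) * math.sqrt(n_tokens) / math.sqrt(gamma * (1 - gamma))


def tokens_needed(green_rate: float, z_threshold: float = 4.0,
                  gamma: float = 0.5) -> float:
    """Smallest token count whose expected z-score reaches `z_threshold`.
    Returns math.inf when the green rate is not above chance."""
    if green_rate <= gamma:
        return math.inf
    return math.ceil((z_threshold * math.sqrt(gamma * (1 - gamma))
                      / (green_rate - gamma)) ** 2)


if __name__ == "__main__":
    for rate in (0.55, 0.65, 0.85):
        print(rate, round(expected_z(rate, 200), 2), tokens_needed(rate))
```

Under these assumptions, a watermark that only lifts the green rate from 0.50 to 0.55 needs roughly 1,600 tokens to reach the threshold, so short unit-test strings with a strong bias can pass while realistic outputs after an attack do not.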
