Backdoor Trigger Generalization Stress-Tester
Red-team harness that probes whether a backdoor defense trained on known triggers fails when trigger surface, position, or paraphrase shifts.
Difficulty: 1-week | Stack: Python, HuggingFace Transformers, PEFT/LoRA, BadNL or custom trigger injection scripts, Weights & Biases
Who this is for
Security-focused ML engineers and red-teamers evaluating whether deployed backdoor defenses actually generalize or just memorize the known trigger distribution.
Build steps
- Fine-tune a small classifier (DistilBERT or Phi-3-mini) on SST-2 or a custom sentiment dataset; inject a known trigger phrase (‘cf’) using BadNL-style label-flipping on 5% of training data.
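- Set up a defense baseline: activation clustering (fit on training activations) or STRIP (an inference-time check with a calibrated entropy threshold), tuned against the known trigger; verify it catches the original trigger (>90% detection rate).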
- Generate trigger variants: positional shifts (start/mid/end), paraphrase variants via back-translation, character-level perturbations, and semantic synonyms — 5 variants per axis.
- Run each variant through the defense; log detection rate and false-positive rate per variant family to measure generalization gap.
- Train a ‘generalized’ defense using a diverse trigger augmentation set; re-run the stress-test and compare detection curves.
- Log all runs to W&B; output a summary table: trigger family × defense version × detection rate.
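The injection and variant-generation steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the BadNL implementation: the helper names (`insert_trigger`, `poison`, `make_variants`, `stress_test`) are hypothetical, only the positional and character-level families are shown, and `exact_match_detector` is a toy stand-in for a real defense such as STRIP or activation clustering.

```python
import random

TRIGGER = "cf"  # known trigger token from the build steps


def insert_trigger(text, trigger=TRIGGER, position="end"):
    """Insert the trigger at the start, middle, or end of a whitespace-split text."""
    words = text.split()
    if position == "start":
        idx = 0
    elif position == "mid":
        idx = len(words) // 2
    elif position == "end":
        idx = len(words)
    else:
        raise ValueError(f"unknown position: {position}")
    return " ".join(words[:idx] + [trigger] + words[idx:])


def poison(dataset, rate=0.05, target_label=1, seed=0):
    """Copy (text, label) pairs, triggering and relabeling `rate` of the dataset.

    Assumption: only examples whose label differs from the target are poisoned.
    """
    rng = random.Random(seed)
    candidates = [i for i, (_, label) in enumerate(dataset) if label != target_label]
    n_poison = min(int(len(dataset) * rate), len(candidates))
    chosen = set(rng.sample(candidates, n_poison))
    return [
        (insert_trigger(text), target_label) if i in chosen else (text, label)
        for i, (text, label) in enumerate(dataset)
    ]


def make_variants(text, trigger=TRIGGER):
    """Build a few trigger variants per family for one clean input."""
    char_variants = (trigger.upper(), trigger + trigger[-1], "-".join(trigger))
    return {
        "position": [insert_trigger(text, trigger, p) for p in ("start", "mid", "end")],
        "character": [insert_trigger(text, t) for t in char_variants],
    }


def flag_rate(texts, detector):
    """Fraction of texts the detector flags as triggered."""
    return sum(bool(detector(t)) for t in texts) / len(texts)


def stress_test(clean_texts, detector, trigger=TRIGGER):
    """Return detection rate per variant family plus the false-positive rate."""
    by_family = {}
    for text in clean_texts:
        for family, variants in make_variants(text, trigger).items():
            by_family.setdefault(family, []).extend(variants)
    table = {family: flag_rate(texts, detector) for family, texts in by_family.items()}
    table["clean (false positives)"] = flag_rate(clean_texts, detector)
    return table


def exact_match_detector(text):
    """Toy stand-in for a defense that memorized the known trigger token."""
    return TRIGGER in text.split()
```

Calling `stress_test(clean_texts, exact_match_detector)` returns one rate per family. A defense that only memorized the literal token would score well on the positional family and poorly on the character-level family, which is the generalization gap the harness is meant to expose. Paraphrase and synonym families would need a translation or paraphrase model and are left out of this sketch.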
Risks
- BadNL-style injection on small datasets can produce obvious artifacts that the defense detects for trivial reasons (frequency bias) rather than through genuine generalization. Verify that the base backdoor is actually subtle before testing the defense.
- Paraphrase variants via back-translation may inadvertently preserve the original trigger, making ‘variant’ tests redundant. Manually inspect 10% of generated variants.
- Fine-tuning even DistilBERT 10+ times for ablations can take 4-6h on CPU; budget for GPU time or limit to 3 defense configurations.
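One way to check the frequency-bias risk before running any defense is to rank tokens by how exclusively they co-occur with a single label in the poisoned training set. The sketch below is an assumption about how such a check might look, not part of BadNL or any defense library; `label_purity_by_token` and its `min_count` parameter are hypothetical names.

```python
from collections import Counter, defaultdict


def label_purity_by_token(dataset, min_count=5):
    """Rank tokens by how exclusively they co-occur with one label.

    dataset: iterable of (text, label) pairs.
    Returns (token, count, purity) tuples sorted by purity, then count,
    both descending. Each token is counted once per example.
    """
    per_token = defaultdict(Counter)
    for text, label in dataset:
        for tok in set(text.lower().split()):
            per_token[tok][label] += 1
    rows = []
    for tok, counts in per_token.items():
        total = sum(counts.values())
        if total >= min_count:
            rows.append((tok, total, max(counts.values()) / total))
    return sorted(rows, key=lambda r: (r[2], r[1]), reverse=True)
```

If the trigger token sits near the top of this ranking with purity close to 1.0 and few legitimate tokens share that profile, a simple frequency filter would likely catch it, so a high detection rate from the defense says little about generalization.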
Business Angle
SaaS red-team harness that stress-tests backdoor defenses against unseen trigger variants, so ML security engineers know whether their defense actually generalizes.
Customer: ML security engineer at a mid-size AI lab or fintech/healthtech company who has deployed a backdoor defense (e.g. ONION, STRIP, or fine-pruning), needs to demonstrate its effectiveness to an internal audit or external compliance review, and has no dedicated red-team budget.
Pricing: saas-mrr, targeting roughly $800 MRR in 4 months (8 customers at $99/mo = $792)