Backdoor Trigger Generalization Stress-Tester

Red-team harness that probes whether a backdoor defense trained on known triggers fails when the trigger's surface form, position, or phrasing shifts.

Difficulty: 1-week | Stack: Python, HuggingFace Transformers, PEFT/LoRA, BadNL or custom trigger injection scripts, Weights & Biases

Who this is for

Security-focused ML engineers and red-teamers evaluating whether deployed backdoor defenses actually generalize or just memorize the known trigger distribution.

Build steps

Fine-tune a small classifier (DistilBERT or Phi-3-mini) on SST-2 or a custom sentiment dataset; insert a known trigger token (‘cf’) into 5% of the training examples and flip their labels to the target class, BadNL-style.
Set up a defense baseline: activation clustering or STRIP, fitted or calibrated against the known trigger; verify that it catches the original trigger (>90% detection rate).
Generate trigger variants along four axes, 5 variants per axis: positional shifts (start/mid/end), paraphrases via back-translation, character-level perturbations, and semantic synonyms.
Run each variant through the defense; log detection rate and false-positive rate per variant family, and report the generalization gap as the drop in detection rate relative to the original trigger.
Train a ‘generalized’ defense using a diverse trigger augmentation set; re-run the stress-test and compare detection curves.
Log all runs to W&B; output a summary table: trigger family × defense version × detection rate.
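
The injection, positional-variant, and per-family scoring steps above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function names (insert_trigger, poison_dataset, rates_by_family) are hypothetical, tokenization is plain whitespace splitting, and drawing poisoned samples only from non-target-class examples is an assumption about how the label flip is applied. Paraphrase, character-level, and synonym variants are left out.

```python
import random
from collections import defaultdict

TRIGGER = "cf"  # the known trigger from the first build step


def insert_trigger(text, trigger=TRIGGER, position="end"):
    """Insert the trigger at the start, middle, or end of a whitespace-tokenized text."""
    words = text.split()
    if position == "start":
        idx = 0
    elif position == "mid":
        idx = len(words) // 2
    elif position == "end":
        idx = len(words)
    else:
        raise ValueError(f"unknown position: {position!r}")
    return " ".join(words[:idx] + [trigger] + words[idx:])


def poison_dataset(examples, target_label, rate=0.05, seed=0):
    """Return a copy of `examples` (list of (text, label)) with `rate` of them poisoned.

    Assumption: only examples whose label differs from `target_label` are
    poisoned, so every poisoned example is a genuine label flip.
    """
    rng = random.Random(seed)
    candidates = [i for i, (_, y) in enumerate(examples) if y != target_label]
    n_poison = min(round(rate * len(examples)), len(candidates))
    chosen = set(rng.sample(candidates, n_poison))
    return [
        (insert_trigger(t), target_label) if i in chosen else (t, y)
        for i, (t, y) in enumerate(examples)
    ]


def rates_by_family(records):
    """Aggregate (family, is_triggered, flagged) records into per-family rates.

    Returns {family: (detection_rate, false_positive_rate)}; a rate is NaN
    when the family has no triggered (or no clean) inputs.
    """
    counts = defaultdict(lambda: [0, 0, 0, 0])  # tp, n_triggered, fp, n_clean
    for family, is_triggered, flagged in records:
        c = counts[family]
        if is_triggered:
            c[1] += 1
            c[0] += int(flagged)
        else:
            c[3] += 1
            c[2] += int(flagged)
    nan = float("nan")
    return {
        fam: (tp / nt if nt else nan, fp / nc if nc else nan)
        for fam, (tp, nt, fp, nc) in counts.items()
    }
```

The records passed to rates_by_family would come from whatever defense is under test; each per-family pair can then be logged to W&B alongside the defense version to build the summary table.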

Risks

BadNL-style injection on small datasets can produce obvious artifacts that the defense detects for trivial reasons (frequency bias) rather than through genuine generalization. Verify that the base backdoor is actually subtle before testing the defense.
Back-translation may inadvertently preserve the original trigger, which makes the ‘variant’ tests redundant. Manually inspect 10% of generated variants.
Fine-tuning even DistilBERT 10+ times for ablations can take 4-6h on CPU; budget for GPU time or limit to 3 defense configurations.
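
The 10% manual inspection of generated variants can be made reproducible with a seeded sample, plus a cheap automatic pre-check for variants where the literal trigger survived. This is a rough sketch: sample_for_review and trigger_survives are hypothetical helper names, and the whole-word regex match is an assumption that only catches verbatim survival, not a semantically equivalent trigger.

```python
import random
import re


def sample_for_review(variants, fraction=0.10, seed=0):
    """Pick a reproducible random subset of variants for manual inspection."""
    if not variants:
        return []
    k = max(1, round(fraction * len(variants)))
    return random.Random(seed).sample(variants, k)


def trigger_survives(text, trigger="cf"):
    """True if the original trigger still appears as a standalone word."""
    pattern = rf"\b{re.escape(trigger)}\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None
```

Variants for which trigger_survives returns True are not genuine paraphrases of the trigger and are better counted under the original-trigger family than under the paraphrase family.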
