Confidence-Gated Distillation Trainer

A training script that replicates confidence-gated teacher distillation, filtering noisy teacher tokens before they reach the student model.

Difficulty: 1-week | Stack: Python, PyTorch, Hugging Face Transformers, TRL, Weights & Biases

Who this is for

Researchers and engineers training small reasoning models via knowledge distillation whose students lose accuracy to noisy teacher rollouts. This project directly implements the confidence-gating idea from the second paper.

Build steps

Set up a teacher model (e.g. DeepSeek-R1-Distill-Qwen-7B) and a student model (e.g. Qwen2.5-1.5B) that share a tokenizer, plus a verifiable reward task such as GSM8K; even within one tokenizer family, confirm that special tokens and logit dimensions line up before comparing distributions
Generate teacher rollouts and score the teacher's confidence per token or per reasoning step (e.g. softmax entropy or top-1 probability)
Implement a gating function that masks or down-weights teacher supervision tokens below a confidence threshold before computing the distillation loss
Wrap the gated distillation loss in a TRL-compatible custom trainer and add a W&B panel that tracks gate activation rate and reward per batch
Run ablations comparing ungated and gated distillation on the held-out GSM8K test split, logging final accuracy and training loss curves
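
A minimal sketch of the confidence-scoring step, assuming PyTorch and that teacher logits are available at every rollout position. The function names are hypothetical, and the entropy normalization (1 minus entropy divided by log of the vocabulary size, so both signals land in [0, 1]) is an illustrative choice rather than something the project prescribes. `step_ids` assumes you have already split each rollout into reasoning steps (for example on newlines) and mapped every token to its step index.

```python
import math

import torch
import torch.nn.functional as F


def token_confidence(teacher_logits: torch.Tensor) -> dict:
    """Per-token confidence from teacher logits of shape [batch, seq_len, vocab].

    Returns two [batch, seq_len] tensors in [0, 1], higher = more confident:
    the top-1 probability, and 1 - entropy / log(vocab) (normalized entropy).
    """
    log_probs = F.log_softmax(teacher_logits.float(), dim=-1)
    probs = log_probs.exp()
    top1 = probs.max(dim=-1).values
    # torch.special.entr computes -p * log(p) elementwise and returns 0 where p == 0.
    entropy = torch.special.entr(probs).sum(dim=-1)
    entropy_conf = 1.0 - entropy / math.log(teacher_logits.size(-1))
    return {"top1": top1, "entropy_conf": entropy_conf}


def step_confidence(token_conf: torch.Tensor, step_ids: torch.Tensor, num_steps: int) -> torch.Tensor:
    """Mean token confidence per reasoning step for one sequence.

    token_conf: [seq_len] float scores; step_ids: [seq_len] long, the step each token belongs to.
    """
    zeros = torch.zeros(num_steps, dtype=token_conf.dtype, device=token_conf.device)
    sums = zeros.clone().index_add_(0, step_ids, token_conf)
    counts = zeros.clone().index_add_(0, step_ids, torch.ones_like(token_conf))
    # Steps with no tokens keep a score of 0 instead of dividing by zero.
    return sums / counts.clamp(min=1)
```

Which of the two signals to gate on is itself worth including in the ablations.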
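
One way to write the gating function, sketched under several assumptions: the student is teacher-forced on the teacher's rollout tokens so both models produce logits at the same positions, the per-token distillation objective is forward KL(teacher ‖ student), and `below_weight` is a hypothetical parameter that switches between hard masking (0.0) and down-weighting (a value between 0 and 1). Slicing both logit tensors to their common width is a defensive assumption for checkpoints whose output vocabularies are padded to different sizes.

```python
import torch
import torch.nn.functional as F


def gated_distillation_loss(
    student_logits: torch.Tensor,  # [batch, seq, vocab_s], with grad
    teacher_logits: torch.Tensor,  # [batch, seq, vocab_t], from a no_grad teacher pass
    confidence: torch.Tensor,      # [batch, seq], higher = more confident
    loss_mask: torch.Tensor,       # [batch, seq], 1 for supervised completion tokens, 0 otherwise
    threshold: float = 0.5,
    below_weight: float = 0.0,     # hypothetical knob: 0.0 = hard mask, (0, 1) = down-weight
):
    # Keep only the logit columns both models share (assumed padding beyond that).
    v = min(student_logits.size(-1), teacher_logits.size(-1))
    s_logp = F.log_softmax(student_logits[..., :v].float(), dim=-1)
    t_logp = F.log_softmax(teacher_logits[..., :v].detach().float(), dim=-1)

    # Forward KL(teacher || student) per token, summed over the vocabulary.
    per_token_kl = (t_logp.exp() * (t_logp - s_logp)).sum(dim=-1)

    mask = loss_mask.float()
    passed = (confidence >= threshold).float()
    # Tokens that clear the threshold get weight 1; the rest get below_weight.
    weights = (passed + (1.0 - passed) * below_weight) * mask
    loss = (per_token_kl * weights).sum() / weights.sum().clamp(min=1e-8)

    # Gate activation rate here = fraction of supervised tokens the gate filtered.
    gate_rate = ((1.0 - passed) * mask).sum() / mask.sum().clamp(min=1.0)
    return loss, {"gate_activation_rate": gate_rate.item()}
```

In a TRL setup, a function like this would typically be called from an overridden `compute_loss` in a `transformers.Trainer` subclass (TRL's trainers extend `Trainer`), with the returned stats passed to `self.log` so they reach W&B when `report_to="wandb"` is set.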
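
For the verifiable reward and the final accuracy numbers, note that GSM8K reference solutions end with a line of the form `#### <answer>`. The sketch below assumes a simple last-number heuristic for model completions; `extract_final_number`, `gsm8k_reward`, and `accuracy` are hypothetical helpers, and a stricter answer format (for example requiring `####` or `\boxed{}` in the completion) may be preferable in practice.

```python
import re
from typing import List, Optional

# Integers or decimals, optionally negative, allowing thousands separators.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_final_number(text: str) -> Optional[str]:
    """Answer after the last '####' if present, otherwise the last number in the text."""
    if "####" in text:
        text = text.split("####")[-1]
    matches = _NUMBER.findall(text)
    if not matches:
        return None
    return matches[-1].replace(",", "")


def gsm8k_reward(completion: str, reference: str) -> float:
    """Binary verifiable reward: 1.0 if the final numbers match numerically, else 0.0."""
    pred, gold = extract_final_number(completion), extract_final_number(reference)
    if pred is None or gold is None:
        return 0.0
    return 1.0 if abs(float(pred) - float(gold)) < 1e-6 else 0.0


def accuracy(completions: List[str], references: List[str]) -> float:
    """Mean reward over paired completions and references."""
    rewards = [gsm8k_reward(c, r) for c, r in zip(completions, references)]
    return sum(rewards) / max(len(rewards), 1)
```

The same `gsm8k_reward` can feed the per-batch reward panel during training and the held-out accuracy in the ablation, so both runs are scored identically.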

Risks

Generating teacher rollouts at scale is expensive: without quantization (bitsandbytes, AWQ), the teacher model alone may exceed a single consumer GPU’s VRAM
Choosing the confidence threshold is non-obvious: set it too aggressively and you discard valid signal, too leniently and noisy tokens still pass through, so budget time for a threshold sweep
TRL’s SFTTrainer and custom loss hooks change across minor versions; mismatched TRL/Transformers versions can silently break gradient flow
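
Before launching a full training sweep, it can help to see how many tokens each candidate threshold would filter. A minimal sketch, assuming confidence scores have already been computed on a held-out batch of teacher rollouts; the helper names are hypothetical. `sweep_thresholds` reports the fraction of supervised tokens each threshold keeps, and `quantile_thresholds` picks thresholds that would drop a chosen fraction of tokens.

```python
from typing import Dict, List, Sequence

import torch


def sweep_thresholds(confidence: torch.Tensor, loss_mask: torch.Tensor,
                     thresholds: Sequence[float]) -> Dict[float, float]:
    """Fraction of supervised tokens each threshold would keep."""
    valid = confidence[loss_mask.bool()]
    return {tau: (valid >= tau).float().mean().item() for tau in thresholds}


def quantile_thresholds(confidence: torch.Tensor, loss_mask: torch.Tensor,
                        drop_fractions: Sequence[float]) -> List[float]:
    """Thresholds that would filter out roughly the given fraction of supervised tokens."""
    valid = confidence[loss_mask.bool()].float()
    q = torch.tensor(list(drop_fractions), dtype=valid.dtype, device=valid.device)
    return torch.quantile(valid, q).tolist()
```

Sweeping over drop fractions (say 5%, 10%, 20%) rather than raw thresholds keeps runs comparable across the two confidence signals, whose scales differ. `torch.quantile` may reject very large inputs, so subsample if you pool many rollouts.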
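
Because version mismatches can break gradient flow without an error, a cheap guard is to record installed versions with every run and, after the first backward pass, check that each trainable student parameter actually received a gradient. A minimal sketch with hypothetical helper names; a parameter whose gradient is legitimately zero would also be flagged, so treat the result as a warning rather than proof of a bug.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, Iterable, List, Optional

import torch


def installed_versions(
    packages: Iterable[str] = ("torch", "transformers", "trl"),
) -> Dict[str, Optional[str]]:
    """Installed package versions (None if a package is missing)."""
    out: Dict[str, Optional[str]] = {}
    for pkg in packages:
        try:
            out[pkg] = version(pkg)
        except PackageNotFoundError:
            out[pkg] = None
    return out


def params_without_grad(model: torch.nn.Module) -> List[str]:
    """Trainable parameters whose gradient is missing or all zeros after backward()."""
    return [
        name
        for name, p in model.named_parameters()
        if p.requires_grad and (p.grad is None or not torch.any(p.grad != 0))
    ]
```

The versions dict can be stored in the W&B run config (for example with `wandb.config.update`), which makes a silent breakage easier to trace back to a specific upgrade.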
