AI Pulse

Confidence-Gated Distillation Trainer

A training script that replicates confidence-gated teacher distillation, filtering noisy teacher tokens before they reach the student model.

Difficulty: 1-week | Stack: Python, PyTorch, Hugging Face Transformers, TRL, Weights & Biases

Who this is for

Researchers and engineers training small reasoning models via knowledge distillation who are losing performance to noisy teacher rollouts — this directly implements the confidence-gating idea from the second paper.

Build steps

  1. Set up a teacher model (e.g. DeepSeek-R1-Distill-Qwen-7B) and a student model (e.g. Qwen2.5-1.5B) with a shared tokenizer, plus a verifiable reward task such as GSM8K
  2. Generate teacher rollouts and compute a confidence score per token or per reasoning step (e.g. top-1 probability, or softmax entropy, where lower entropy means higher confidence)
  3. Implement a gating function that masks or down-weights teacher supervision tokens below a confidence threshold before computing the distillation loss
  4. Wrap the gated distillation loss into a TRL-compatible custom trainer, adding a W&B panel that tracks gate activation rate and reward per batch
  5. Run ablations comparing ungated distillation vs. gated distillation on a held-out GSM8K test split and log final accuracy and training loss curves
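
Steps 2 and 3 can be sketched in plain PyTorch as below. This is a minimal illustration under stated assumptions, not the method from the referenced paper: the function names token_confidence and gated_distillation_loss, the default threshold of 0.5, the hard (0/1) gate, and the choice of forward KL as the distillation loss are all hypothetical choices made for the example. It also assumes teacher and student logits share the same vocabulary and are already aligned position by position.

```python
import math

import torch
import torch.nn.functional as F


def token_confidence(teacher_logits: torch.Tensor, mode: str = "top1") -> torch.Tensor:
    """Per-token teacher confidence in [0, 1], shape (batch, seq).

    Assumes finite logits of shape (batch, seq, vocab).
    """
    log_probs = F.log_softmax(teacher_logits.float(), dim=-1)
    probs = log_probs.exp()
    if mode == "top1":
        # Probability of the teacher's most likely token.
        return probs.max(dim=-1).values
    if mode == "entropy":
        # Normalized entropy mapped so that 1 = fully confident, 0 = uniform.
        entropy = -(probs * log_probs).sum(dim=-1)
        return 1.0 - entropy / math.log(teacher_logits.size(-1))
    raise ValueError(f"unknown confidence mode: {mode}")


def gated_distillation_loss(
    student_logits: torch.Tensor,
    teacher_logits: torch.Tensor,
    attention_mask: torch.Tensor,
    threshold: float = 0.5,   # hypothetical default; sweep it in practice
    temperature: float = 1.0,
    mode: str = "top1",
):
    """Forward KL(teacher || student), averaged over tokens that pass the gate.

    Returns (loss, gate_rate), where gate_rate is the fraction of
    non-padding tokens whose teacher confidence met the threshold.
    """
    teacher_logits = teacher_logits.detach()  # no gradient into the teacher
    confidence = token_confidence(teacher_logits, mode)
    gate = (confidence >= threshold) & attention_mask.bool()

    t_logp = F.log_softmax(teacher_logits.float() / temperature, dim=-1)
    s_logp = F.log_softmax(student_logits.float() / temperature, dim=-1)
    per_token_kl = (t_logp.exp() * (t_logp - s_logp)).sum(dim=-1)

    gate_f = gate.to(per_token_kl.dtype)
    kept = gate_f.sum()
    # clamp avoids division by zero when every token in the batch is gated out
    loss = (per_token_kl * gate_f).sum() / kept.clamp(min=1.0)
    loss = loss * temperature**2  # conventional scaling for temperature-softened KL
    gate_rate = kept / attention_mask.sum().clamp(min=1)
    return loss, gate_rate
```

The returned gate_rate is the quantity step 4 would log as the gate activation rate. A soft variant that down-weights instead of masking could multiply per_token_kl by the confidence itself rather than by a 0/1 gate.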

Risks

  • Generating teacher rollouts at scale is expensive — without quantization (bitsandbytes, AWQ) the teacher model alone may exceed a single consumer GPU’s VRAM
  • Choosing the confidence threshold is non-obvious; too aggressive and you discard valid signal, too lenient and noisy tokens still pass through — budget time for a threshold sweep
  • TRL’s SFTTrainer and custom loss hooks change across minor versions; pin the TRL and Transformers versions together, since a mismatch can silently break gradient flow
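
Before spending GPU hours on a full threshold sweep, it can help to check how much teacher supervision each candidate threshold would keep. The helper below is a small stdlib-only sketch; retention_by_threshold is a hypothetical name, and the confidence values are assumed to come from a sample of teacher rollouts scored in step 2.

```python
def retention_by_threshold(confidences, thresholds):
    """Fraction of teacher tokens kept (confidence >= t) for each threshold t."""
    if not confidences:
        raise ValueError("need at least one confidence score")
    n = len(confidences)
    return {t: sum(c >= t for c in confidences) / n for t in thresholds}


# Illustrative values only, not measured data.
sample = [0.1, 0.4, 0.6, 0.9]
print(retention_by_threshold(sample, [0.0, 0.5, 0.95]))
# {0.0: 1.0, 0.5: 0.5, 0.95: 0.0}
```

Thresholds that keep nearly everything or nearly nothing are unlikely to be informative, so the sweep can concentrate on the range in between.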

Business Angle

A plug-and-play distillation training library that filters noisy teacher tokens so ML engineers stop throwing GPU hours at broken knowledge transfer pipelines.

Customer: ML engineer or applied researcher at a startup or research lab (2–20 people) who is fine-tuning a sub-7B reasoning model using a large teacher like DeepSeek-R1 or Qwen-72B, has already burned ≥$500 in compute on runs that underperform SFT baselines, and suspects noisy rollouts are the culprit but doesn't have time to implement filtering from scratch.

Pricing: open-core, targeting roughly $800 MRR within 4 months (8 teams at $99/mo for a tier with hosted experiment tracking and priority support; core library stays MIT)
