A plug-and-play distillation training library that filters noisy teacher tokens so ML engineers stop throwing GPU hours at broken knowledge-transfer pipelines.

Customer: ML engineer or applied researcher at a startup or research lab (2–20 people) who is fine-tuning a sub-7B reasoning model using a large teacher like DeepSeek-R1 or Qwen-72B, has already burned ≥$500 in compute on runs that underperform SFT baselines, and suspects noisy rollouts are the culprit but doesn’t have time to implement filtering from scratch.

Problem: Knowledge distillation from frontier reasoning models is theoretically compelling but practically fragile: noisy, low-confidence teacher tokens pollute student gradients and erase reasoning gains. Wiring confidence-gated filtering correctly into HuggingFace/TRL’s training loop is non-trivial, so most teams either re-derive it badly or skip it entirely.
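To make "confidence-gated filtering" concrete, here is a minimal pure-Python sketch of one way it could work. Everything specific in it is an assumption rather than the library's actual API: the function name gated_distillation_loss, the use of the teacher's top-token probability as the confidence signal, the 0.5 threshold, and forward KL as the per-token loss. In a real training loop the same logic would run as masked tensor operations on batched logits.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one token position's logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def gated_distillation_loss(teacher_logits, student_logits, threshold=0.5):
    """Mean KL(teacher || student) over confident teacher positions only.

    Hypothetical helper for illustration. A position is kept when the
    teacher's top-token probability is >= threshold (an assumed confidence
    measure); all other positions contribute nothing to the loss.
    Returns (loss, kept_fraction).
    """
    if len(teacher_logits) != len(student_logits):
        raise ValueError("teacher and student must cover the same positions")

    total_kl = 0.0
    kept = 0
    for t_row, s_row in zip(teacher_logits, student_logits):
        t_logp = log_softmax(t_row)
        # Gate: skip positions where the teacher itself is unsure.
        if math.exp(max(t_logp)) < threshold:
            continue
        s_logp = log_softmax(s_row)
        # Forward KL: sum_i p_i * (log p_i - log q_i)
        total_kl += sum(math.exp(lp) * (lp - lq) for lp, lq in zip(t_logp, s_logp))
        kept += 1

    n = len(teacher_logits)
    loss = total_kl / kept if kept else 0.0
    kept_fraction = kept / n if n else 0.0
    return loss, kept_fraction
```

Calling the function with threshold=0.0 disables the gate, which gives the ungated baseline for a side-by-side comparison. The returned kept_fraction is worth logging: if it drops near zero, the threshold is discarding almost all of the teacher signal.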

Pricing: open-core. Target is roughly $800 MRR within 4 months (8 teams at $99/mo = $792, for a paid tier covering hosted experiment tracking and priority support; core library stays MIT)

Why now

Two papers landed in the same cluster showing that naive SFT+RL coupling and unfiltered distillation both actively hurt reasoning benchmarks, so practitioners are looking for solutions right now. The reproducibility window (before big labs productize this) is ~6 months.

Go-to-market

Post a detailed write-up on the HuggingFace forums and r/MachineLearning walking through a concrete benchmark (e.g., GSM8K or MATH) showing gated vs. ungated distillation loss curves — link to the open-source repo at the bottom.
Open a GitHub repo with a dead-simple ‘drop-in replacement for TRL’s SFTTrainer’ API, pin a Discord invite, and personally DM the 10–15 most active commenters in recent distillation-related HF discussion threads.
Reach out directly to 5 ML-focused Discord servers (e.g., EleutherAI, HuggingFace, Nous Research community) and offer free async support to anyone who tries the library. Collect testimonials and failure cases to improve the tool.
Once you have 3+ real users with measurable wins, write a ‘how we saved X GPU-hours’ case study and submit it to The Batch, Interconnects newsletter, and Ahead of AI for distribution to the exact practitioner audience.

Moat (or lack thereof)

No real moat. This is a well-scoped open-source library anyone could clone in a week once the papers are widely read. The defensible edge is purely execution speed and community trust: if you ship first, collect real benchmark results, and become the default citation in distillation blog posts, you get organic distribution. The paid tier survives only if support and hosted experiment tooling genuinely save teams time — which is a service moat, not a tech moat.