Audit tool that tells ML practitioners which training samples should get SFT vs RL updates—before they waste GPU budget on interference-damaged runs.

Customer: Independent ML engineer or small-team AI startup (2–5 people) fine-tuning reasoning models (e.g., Qwen, Mistral, DeepSeek) on domain-specific data—medical Q&A, legal reasoning, code—who runs training on rented A100s and can’t afford to discover the SFT/RL interference problem after a $300 run.

Problem: They naively mix SFT and RL losses on the same dataset, tank their reward signal without understanding why, and have no visibility into which samples are ‘too easy’ (SFT waste) or ‘too hard’ (RL signal collapse) until results disappoint. There’s no off-the-shelf tool to audit this before training.
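
A minimal sketch of what the audit could compute, assuming difficulty is scored by the base model's pass rate over a handful of rollouts per sample. The names (SampleStats, label_sample, audit), the 0.9 / 0.1 thresholds, and the pass-rate scoring itself are illustrative assumptions, not the method from either paper. One plausible routing, also an assumption, is to send 'too hard' samples to SFT and in-range samples to RL.

```python
from collections import Counter
from dataclasses import dataclass

LABELS = ("too_easy", "too_hard", "in_range")


@dataclass
class SampleStats:
    """Hypothetical per-sample rollout results from the base model."""
    sample_id: str
    num_correct: int   # rollouts judged correct by some verifier (assumed to exist)
    num_rollouts: int  # total rollouts drawn for this sample


def label_sample(stats, easy_threshold=0.9, hard_threshold=0.1):
    """Label one sample by pass rate. Thresholds are illustrative defaults."""
    if stats.num_rollouts <= 0:
        raise ValueError("num_rollouts must be positive")
    pass_rate = stats.num_correct / stats.num_rollouts
    if pass_rate >= easy_threshold:
        return "too_easy"   # model already solves it: little left to learn
    if pass_rate <= hard_threshold:
        return "too_hard"   # almost no successful rollouts: weak RL signal
    return "in_range"       # mixed outcomes: rollouts carry a usable signal


def audit(samples, easy_threshold=0.9, hard_threshold=0.1):
    """Return the fraction of the dataset that falls under each label."""
    if not samples:
        raise ValueError("samples must not be empty")
    counts = Counter(
        label_sample(s, easy_threshold, hard_threshold) for s in samples
    )
    return {label: counts[label] / len(samples) for label in LABELS}


if __name__ == "__main__":
    demo = [
        SampleStats("a", 8, 8),
        SampleStats("b", 0, 8),
        SampleStats("c", 4, 8),
        SampleStats("d", 3, 8),
    ]
    print(audit(demo))
```

The per-label fractions are the kind of summary the dashboard and the case study would report. Turning them into a dollar figure for wasted GPU time would need a further assumption about how cost scales with sample count.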

Pricing: $99 one-time license. Target: about $1,200 in sales in month 3 (roughly 12 licenses at $99 each).

Why now

The interference-between-SFT-and-RL finding is actively circulating in the post-training research cluster right now (mid-2026). Practitioners are reading these papers and immediately asking ‘do I have this problem?’ The confidence-gated distillation paper gives a concrete algorithmic hook (difficulty/confidence scoring) that maps directly into a tool. The window before this becomes a built-in HuggingFace feature is 6–12 months.

Go-to-market

Post a demo GIF + 3-paragraph explanation on r/MachineLearning and Hacker News (Show HN) tying the tool directly to the two papers (the SFT/RL interference finding and confidence-gated distillation). Name the papers in the title so the people who read them self-identify as your buyers.
Ship a free open-source version on GitHub with the core difficulty-labeling logic; gate the Gradio dashboard, batch export (CSV/JSON), and multi-run comparison behind a $99 one-time Gumroad license.
Write one concrete case study: ‘I ran this on [open dataset], here’s what the SFT/RL split looked like, here’s the estimated GPU waste’—post it as a thread on X/Twitter tagging the paper authors to seed organic reach.
DM 10–15 ML engineers who publicly posted about fine-tuning reasoning models in the last 60 days (LinkedIn, X), offering a free license in exchange for 15 minutes of feedback, to seed word-of-mouth from actual users.

Moat (or lack thereof)

No real moat. This is a 2–4 week build that any competent ML engineer could replicate. The advantage is being first-to-ship while the papers are hot and practitioners are Googling for exactly this. HuggingFace or Axolotl could absorb this as a built-in feature within a year. The realistic exit is either (a) a few thousand dollars in one-time sales before commoditization, or (b) using it as a credibility signal to land ML consulting clients at $150–250/hr.