A plug-and-play Python library that replaces fixed SFT→RL sequences with sample-level dynamic scheduling, sold as open-core with a paid experiment dashboard.

Customer: Solo ML engineer or 2-person AI startup fine-tuning Llama/Mistral/Qwen variants for a vertical use-case (legal, code, math) who has a GPU budget but zero infra team and is losing days to hand-rolled training loop hacks.

Problem: Every serious post-training experiment requires re-implementing the same scheduling logic from scratch — loss mixing ratios, curriculum signals, reward gating — resulting in brittle one-off scripts that can’t be compared, reproduced, or iterated on quickly.
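
To make "sample-level dynamic scheduling" concrete, here is a minimal sketch of the kind of logic that keeps getting rewritten per experiment. It is an illustration under stated assumptions, not the library's API or either paper's method: `SampleSignal`, `sft_weight`, `mixed_loss`, the `gate` threshold, and the linear curriculum are all hypothetical, and rewards are assumed to lie in [0, 1].

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class SampleSignal:
    """Per-sample quantities a training loop is assumed to have computed already."""
    sft_loss: float  # supervised (cross-entropy) loss for this sample
    rl_loss: float   # policy-gradient style loss for this sample
    reward: float    # scalar reward, assumed to lie in [0, 1]


def sft_weight(reward: float, progress: float, gate: float = 0.5) -> float:
    """Return the SFT share of the mixed loss for one sample, in [0, 1].

    Curriculum signal: the SFT share decays linearly as progress goes 0 -> 1.
    Reward gating: a sample whose reward is below `gate` keeps at least half
    of its weight on supervision, however far training has progressed.
    Both rules are illustrative assumptions.
    """
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must be in [0, 1]")
    base = 1.0 - progress
    if reward < gate:
        return max(base, 0.5)
    return base


def mixed_loss(batch: Sequence[SampleSignal], progress: float) -> float:
    """Average the per-sample convex mix of SFT and RL losses over a batch."""
    if not batch:
        raise ValueError("batch must not be empty")
    total = 0.0
    for sample in batch:
        w = sft_weight(sample.reward, progress)
        total += w * sample.sft_loss + (1.0 - w) * sample.rl_loss
    return total / len(batch)
```

The point of the design is that the mixing ratio is decided per sample rather than per training phase, so a fixed SFT→RL hand-off becomes just one special case (a weight that depends only on `progress`).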

Pricing: open-core, targeting roughly $800 MRR within 4 months (8 teams × $99/mo = $792 on the Pro tier, which covers the W&B-style run comparison dashboard and priority Discord support).

Why now

Two high-signal papers (one showing that naive SFT+RL coupling hurts reward, one showing that confidence-gated distillation filters noisy supervision) appeared in the same cluster of recent preprints, signaling that the field is actively questioning fixed-schedule defaults. Practitioners are likely to start searching for solutions within weeks of reading them.

Go-to-market

Write a 1,500-word blog post titled ‘Why Your SFT→RL Pipeline Is Sabotaging Itself (and a fix)’ that references the two papers, benchmarks naive mixing vs. the scheduler on a small public task (GSM8K or MATH), and ends with a GitHub link — post to Hacker News, r/MachineLearning, and the Hugging Face Discord simultaneously.
Start a cooperative-scheduler thread in the TRL GitHub Discussions, or post a well-scoped feature-request issue linking your repo. TRL maintainers get notified, and practitioners already active there see it organically.
DM 10 ML engineers who publicly complained about SFT+RL instability on Twitter/X in the last 90 days (search ‘GRPO loss spike’ or ‘reward hacking SFT’) offering a free 30-min setup call in exchange for honest feedback and a testimonial.
Ship a one-command Hydra config that wires into an existing Hugging Face TrainingArguments object (the class TRL's trainer configs extend) so that adoption is a 3-line change. Record a 4-minute Loom showing a before/after training curve and pin it as the repo README hero.

Moat (or lack thereof)

No real moat — this is a Python library and a determined ML engineer can read the paper and replicate the core in a weekend. The defensible surface is speed-of-iteration (you ship the next scheduler variant before they finish theirs), community stickiness around the dashboard and shared config library, and the reputation that accrues from being the go-to reference implementation when practitioners cite the technique. That’s a 12-month head start, not a moat.