AI Pulse

Cooperative SFT+RL Interleaving Scheduler

A modular training loop that schedules SFT and GRPO/RLVR updates per sample based on periodically refreshed difficulty estimates, instead of naively mixing the two losses.

Difficulty: 1-month | Stack: Python, PyTorch, TRL, vLLM, Ray, Weights & Biases, Hydra

Who this is for

ML engineers running multi-stage post-training pipelines on open models who want to move beyond fixed SFT→RL sequences and test cooperative scheduling without building all infrastructure from scratch.

Build steps

  1. Implement a difficulty estimator that runs lightweight pass@k rollouts (via vLLM) every N steps and maintains a per-sample difficulty score cache updated incrementally
  2. Build a sample router that reads the difficulty cache and assigns each mini-batch sample to an SFT path, an RLVR path (GRPO), or a skip path based on configurable thresholds
  3. Implement the cooperative training loop using Ray for distributed rollout generation and a custom PyTorch training step that applies the correct loss per routed sample
  4. Add Hydra-based config for threshold schedules, mixing ratios, and routing strategies so ablations (fixed mix vs. dynamic gate vs. pure RL) can be launched from a single config file
  5. Benchmark on a reasoning task (GSM8K or MATH500), produce learning curves per regime, and export a report comparing cooperative scheduling against a naive joint-loss baseline
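
Steps 1 and 2 can be sketched in plain Python as below. This is a minimal illustration and not a reference implementation: the names DifficultyCache and route, the EMA smoothing, the default thresholds (0.1 and 0.9), and the routing direction (rarely solved samples go to SFT, almost always solved samples are skipped, everything in between goes to RLVR) are all assumptions, since the steps above only say that routing is threshold-based. The pass counts would come from the vLLM rollouts plus a verifier, which are left out here.

```python
from dataclasses import dataclass, field
from enum import Enum


class Route(Enum):
    SFT = "sft"
    RLVR = "rlvr"
    SKIP = "skip"


@dataclass
class DifficultyCache:
    """Per-sample pass-rate estimates, updated incrementally with an EMA."""

    alpha: float = 0.5  # weight on the newest estimate (assumed value)
    pass_rate: dict = field(default_factory=dict)  # sample_id -> smoothed pass rate
    last_step: dict = field(default_factory=dict)  # sample_id -> step of last refresh

    def update(self, sample_id, num_correct, k, step):
        """Fold in one round of k rollouts, num_correct of which passed the verifier."""
        fresh = num_correct / k
        old = self.pass_rate.get(sample_id)
        if old is None:
            self.pass_rate[sample_id] = fresh
        else:
            self.pass_rate[sample_id] = (1 - self.alpha) * old + self.alpha * fresh
        self.last_step[sample_id] = step


def route(cache, sample_id, low=0.1, high=0.9):
    """Map a smoothed pass rate to a training path (thresholds are hypothetical)."""
    rate = cache.pass_rate.get(sample_id)
    if rate is None or rate < low:
        return Route.SFT  # unseen or almost never solved: imitate the reference answer
    if rate > high:
        return Route.SKIP  # almost always solved: little learning signal left
    return Route.RLVR  # mixed outcomes: rewards within a rollout group differ


if __name__ == "__main__":
    cache = DifficultyCache(alpha=0.5)
    cache.update("q1", num_correct=0, k=8, step=0)
    cache.update("q2", num_correct=4, k=8, step=0)
    cache.update("q3", num_correct=8, k=8, step=0)
    print([route(cache, s).value for s in ("q1", "q2", "q3")])
    # prints ['sft', 'rlvr', 'skip']
```

Keeping the thresholds as plain function arguments means they can later be driven by the Hydra config from step 4 without changing the router itself.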

Risks

  • Maintaining a live difficulty cache introduces training non-stationarity: scores computed early in training go stale as the model improves and can route samples incorrectly
  • Ray + vLLM orchestration adds significant operational complexity; debugging distributed deadlocks or NCCL errors can consume a large fraction of the one-month budget
  • GRPO reward variance can spike when sample routing changes the effective batch distribution, so reward normalization needs to be re-tuned whenever the router thresholds are adjusted
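
Two of these risks lend themselves to small guards, sketched below under stated assumptions. The helper names stale_ids and group_relative_advantages, the max_age parameter, and the eps value are hypothetical. The normalization shown is the usual group-relative form (reward minus group mean, divided by group standard deviation); it uses the population standard deviation here, and real implementations may differ on that choice.

```python
from statistics import mean, pstdev


def stale_ids(last_step, current_step, max_age):
    """Return sample ids whose difficulty score is older than max_age steps.

    last_step maps sample_id -> training step at which the score was last refreshed.
    These ids are candidates for re-scoring before the router trusts them again.
    """
    return [sid for sid, step in last_step.items() if current_step - step > max_age]


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one rollout group (all rollouts of the same prompt).

    When every rollout gets the same reward the standard deviation is zero and all
    advantages come out as 0.0, so the group contributes no policy gradient.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, see the note above
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    print(stale_ids({"q1": 0, "q2": 900}, current_step=1000, max_age=500))
    # prints ['q1']
    print(group_relative_advantages([1, 1, 1, 1]))
    # prints [0.0, 0.0, 0.0, 0.0]
```

The zero-advantage case is one reason routing matters for variance: if the router sends mostly all-pass or all-fail prompts down the RLVR path, many groups carry no signal and the effective batch shrinks.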

Business Angle

A plug-and-play Python library that replaces fixed SFT→RL sequences with sample-level dynamic scheduling, sold as open-core with a paid experiment dashboard.

Customer: Solo ML engineer or 2-person AI startup fine-tuning Llama/Mistral/Qwen variants for a vertical use-case (legal, code, math) who has a GPU budget but zero infra team and is losing days to hand-rolled training loop hacks.

Pricing: open-core, targeting about $800 MRR within 4 months (8 teams × $99/mo = $792) from a Pro tier that includes the W&B-style run comparison dashboard and priority Discord support
