Cooperative SFT+RL Interleaving Scheduler

A modular training loop that dynamically schedules SFT and GRPO/RLVR updates per sample based on real-time difficulty estimates, replacing naive loss mixing.

Difficulty: 1-month | Stack: Python, PyTorch, TRL, vLLM, Ray, Weights & Biases, Hydra

Who this is for

ML engineers running multi-stage post-training pipelines on open models who want to move beyond fixed SFT→RL sequences and test cooperative scheduling without building all of the infrastructure from scratch.

Build steps

Implement a difficulty estimator that runs lightweight pass@k rollouts (via vLLM) every N steps and maintains an incrementally updated cache of per-sample difficulty scores
Build a sample router that reads the difficulty cache and assigns each mini-batch sample to an SFT path, an RLVR path (GRPO), or a skip path based on configurable thresholds
Implement the cooperative training loop using Ray for distributed rollout generation and a custom PyTorch training step that applies the correct loss per routed sample
Add Hydra-based config for threshold schedules, mixing ratios, and routing strategies so ablations (fixed mix vs. dynamic gate vs. pure RL) can be launched from a single config file
Benchmark on a reasoning task (GSM8K or MATH500), produce learning curves per regime, and export a report comparing cooperative scheduling against a naive joint-loss baseline
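
To make the first two build steps concrete, here is a minimal sketch of a difficulty cache and a threshold router in plain Python. Everything in it is an assumption made for illustration: the names DifficultyCache, Route, and route, the exponential-moving-average update, the default thresholds, and above all the routing direction (samples the model almost never solves go to SFT, mid-difficulty samples go to RLVR, samples it almost always solves are skipped). The card does not fix any of these choices, and the pass@k counts would come from vLLM rollouts that are not shown here.

```python
from dataclasses import dataclass, field
from enum import Enum


class Route(Enum):
    SFT = "sft"
    RLVR = "rlvr"
    SKIP = "skip"


@dataclass
class DifficultyCache:
    """Per-sample difficulty, stored as an EMA of (1 - pass rate). Hypothetical helper."""

    alpha: float = 0.5  # weight of the newest estimate (assumed value)
    scores: dict = field(default_factory=dict)
    last_step: dict = field(default_factory=dict)

    def update(self, sample_id, num_correct, k, step):
        """Fold one pass@k measurement (num_correct out of k rollouts) into the cache."""
        difficulty = 1.0 - num_correct / k
        prev = self.scores.get(sample_id)
        if prev is None:
            self.scores[sample_id] = difficulty
        else:
            self.scores[sample_id] = (1 - self.alpha) * prev + self.alpha * difficulty
        self.last_step[sample_id] = step

    def is_stale(self, sample_id, step, max_age):
        """True if the sample was never scored or was last scored more than max_age steps ago."""
        last = self.last_step.get(sample_id)
        return last is None or step - last > max_age


def route(cache, sample_id, easy_below=0.2, hard_above=0.8):
    """Assign one sample to a training path from its cached difficulty.

    Assumed policy: too hard -> SFT, too easy -> skip, in between -> RLVR.
    Unscored samples default to RLVR, which is also an assumption.
    """
    score = cache.scores.get(sample_id)
    if score is None:
        return Route.RLVR
    if score > hard_above:
        return Route.SFT
    if score < easy_below:
        return Route.SKIP
    return Route.RLVR


cache = DifficultyCache(alpha=0.5)
cache.update("a", num_correct=0, k=8, step=0)  # never solved
cache.update("b", num_correct=8, k=8, step=0)  # always solved
cache.update("c", num_correct=4, k=8, step=0)  # solved half the time
routes = {sid: route(cache, sid) for sid in ("a", "b", "c")}
```

The is_stale check gives the estimator a simple way to decide which samples to re-score on its next pass; max_age would be one more knob to expose in the Hydra config.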

Risks

Maintaining a live difficulty cache introduces training non-stationarity: scores computed early in training go stale as the model improves and can route samples to the wrong path
Ray + vLLM orchestration adds significant operational complexity; debugging distributed deadlocks or NCCL errors can consume a large fraction of the one-month budget
GRPO reward variance can spike when sample routing changes the effective batch distribution, so reward normalization needs to be re-tuned whenever the router thresholds are adjusted
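
The reward-variance risk is easier to reason about with the normalization written out. The sketch below computes group-relative advantages in the usual GRPO form, (reward - group mean) / (group std + eps), plus a small monitor that reports what fraction of rollout groups carry any learning signal. The function names, the eps value, and the use of the population standard deviation are assumptions for illustration; check the exact convention of the trainer you use (TRL or otherwise) rather than relying on this.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize the rewards of one rollout group (all completions for one prompt)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumed convention
    return [(r - mean) / (std + eps) for r in rewards]


def signal_fraction(groups):
    """Fraction of groups whose rewards are not all identical.

    Groups with identical rewards yield all-zero advantages and so
    contribute no gradient under this normalization.
    """
    if not groups:
        return 0.0
    informative = sum(1 for g in groups if len(set(g)) > 1)
    return informative / len(groups)


mixed = group_relative_advantages([1.0, 0.0, 1.0, 0.0])    # prompt solved half the time
all_solved = group_relative_advantages([1.0, 1.0, 1.0, 1.0])  # prompt always solved
frac = signal_fraction([[1.0, 0.0, 1.0, 0.0], [1.0, 1.0, 1.0, 1.0]])
```

Logging a value like signal_fraction next to the reward statistics (for example to Weights & Biases) is one way to see whether a threshold change has shifted the RLVR batch toward groups with no variance.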
