AI Pulse

SFT-RL Sample Gating Dashboard

A local tool that labels training samples by difficulty and visualizes which should receive SFT vs RL gradient updates.

Difficulty: weekend | Stack: Python, Hugging Face Transformers, Datasets, Gradio, Matplotlib

Who this is for

ML practitioners fine-tuning reasoning models who want to audit their training data before naively mixing SFT and RL losses. The dashboard helps them avoid the interference problem described in the paper.

Build steps

  1. Load a small open reasoning dataset (e.g. GSM8K or MATH) and a base model (e.g. Qwen2.5-1.5B) locally
  2. Estimate per-sample model confidence with pass@1 (greedy decoding) and pass@k (sampled decoding), then bucket samples into ‘already solved’, ‘near-solvable’, and ‘out-of-reach’
  3. Implement a gating function that assigns each sample a recommended regime label: SFT, RL, or Skip
  4. Build a Gradio UI that shows sample text, difficulty bucket, regime label, and gradient-conflict risk score side by side
  5. Export a filtered JSONL split per regime so the user can feed them into separate training stages
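
Steps 2, 3, and 5 can be sketched in plain Python once each sample has a greedy-decoding result and a count of correct sampled completions. The sketch below uses the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for n samples with c correct. Everything else is an assumption made for illustration: the cutoff values, the record fields (greedy_correct, n, c), the function names, and above all the bucket-to-regime mapping (already solved to Skip, near-solvable to RL, out-of-reach to SFT), which the project description does not specify.

```python
import json
from math import comb
from pathlib import Path


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n sampled completions, c of them correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures: every size-k subset contains a correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def bucket(greedy_correct: bool, n: int, c: int, k: int = 4,
           solved_cutoff: float = 0.9, reach_cutoff: float = 0.2) -> str:
    """Assign a difficulty bucket. Both cutoffs are hypothetical defaults."""
    if greedy_correct and c / n >= solved_cutoff:
        return "already solved"
    if greedy_correct or pass_at_k(n, c, k) >= reach_cutoff:
        return "near-solvable"
    return "out-of-reach"


# Assumed mapping, not taken from the project description.
REGIME_FOR_BUCKET = {
    "already solved": "Skip",
    "near-solvable": "RL",
    "out-of-reach": "SFT",
}


def split_by_regime(records: list[dict]) -> dict[str, list[dict]]:
    """Label each record and group the labeled copies by regime."""
    splits: dict[str, list[dict]] = {"SFT": [], "RL": [], "Skip": []}
    for rec in records:
        b = bucket(rec["greedy_correct"], rec["n"], rec["c"])
        regime = REGIME_FOR_BUCKET[b]
        splits[regime].append({**rec, "bucket": b, "regime": regime})
    return splits


def write_jsonl_splits(splits: dict[str, list[dict]], out_dir: str) -> dict[str, Path]:
    """Write one JSONL file per regime (sft.jsonl, rl.jsonl, skip.jsonl)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = {}
    for regime, rows in splits.items():
        path = out / f"{regime.lower()}.jsonl"
        with path.open("w", encoding="utf-8") as fh:
            for row in rows:
                fh.write(json.dumps(row, ensure_ascii=False) + "\n")
        paths[regime] = path
    return paths
```

Keeping the cutoffs as keyword arguments means the dashboard could expose them as sliders and re-run the labeling without touching the decoding results, which are the expensive part.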

Risks

  • Pass@k estimation is slow without a GPU — on CPU even a 1.5B model will make the bucketing step impractically long for large datasets
  • Difficulty bucketing thresholds are heuristic; wrong cutoffs will mislabel samples and undermine the whole point of the tool
  • Gradio’s state management can get messy when dataset size grows — large datasets may need pagination or async loading to stay responsive
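
Two of these risks lend themselves to small mitigations, sketched here under stated assumptions. The first helper counts how many samples land in each bucket for several candidate cutoff pairs, which gives a rough sense of how sensitive the labels are to the thresholds. The second slices rows into pages so a UI callback only hands one page to the table at a time. The function names, the simplified labeling rule (raw pass rate only), and the default page size are hypothetical; no Gradio calls are shown, and how the page function gets wired into an event handler is left open.

```python
from collections import Counter


def label(pass_rate: float, solved_cutoff: float, reach_cutoff: float) -> str:
    """Simplified bucketing on the raw pass rate c/n (an illustrative rule)."""
    if pass_rate >= solved_cutoff:
        return "already solved"
    if pass_rate >= reach_cutoff:
        return "near-solvable"
    return "out-of-reach"


def sweep(pass_rates: list[float],
          cutoff_pairs: list[tuple[float, float]]) -> dict[tuple[float, float], Counter]:
    """Bucket counts for each (solved_cutoff, reach_cutoff) pair."""
    return {
        pair: Counter(label(rate, pair[0], pair[1]) for rate in pass_rates)
        for pair in cutoff_pairs
    }


def page(rows: list, page_index: int, page_size: int = 50) -> tuple[list, int, int]:
    """Return (rows on the page, clamped page index, total page count)."""
    if page_size < 1:
        raise ValueError("page_size must be at least 1")
    last = max(0, (len(rows) - 1) // page_size)
    page_index = min(max(page_index, 0), last)  # clamp out-of-range requests
    start = page_index * page_size
    return rows[start:start + page_size], page_index, last + 1
```

If the bucket counts shift sharply between neighbouring cutoff pairs, that is a sign the labels near the boundary deserve manual review before exporting the splits.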

Business Angle

Audit tool that tells ML practitioners which training samples should get SFT vs RL updates—before they waste GPU budget on interference-damaged runs.

Customer: Independent ML engineer or small-team AI startup (2–5 people) fine-tuning reasoning models (e.g., Qwen, Mistral, DeepSeek) on domain-specific data—medical Q&A, legal reasoning, code—who runs training on rented A100s and can't afford to discover the SFT/RL interference problem after a $300 run.

Pricing: one-time license at $99 each. Roughly 12 licenses in month 3 comes to about $1,200 in one-time sales.
