InfoDensity Reasoning Compressor
Fine-tune a small LLM to produce information-dense chain-of-thought (CoT) traces, using an RL reward that combines token efficiency and correctness
Difficulty: 1-week | Stack: Python, PyTorch, TRL, Hugging Face transformers, vLLM, GSM8K or MATH dataset
Who this is for
ML engineers who run reasoning models at scale and pay per token. Target: compress reasoning traces by 40-60% without a drop in accuracy
Build steps
- Fine-tune baseline (Qwen2.5-3B or Phi-3-mini) on GSM8K with standard SFT to get chain-of-thought traces
- Implement dual reward: correctness check (answer match) + information density proxy (unique n-gram ratio / trace length)
- Run GRPO or PPO via TRL with the combined reward signal; log both metrics separately in W&B
- Eval on held-out GSM8K and MATH: compare the trace-length vs. accuracy curve against the SFT baseline and a naive length-penalty baseline
- Build minimal FastAPI endpoint that returns compressed trace + final answer, expose token savings per request
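A minimal sketch of the dual reward from the second build step, in plain Python. It assumes completions end with a GSM8K-style `#### <number>` answer line, reads the density proxy as distinct n-grams divided by total n-grams in the trace, and splits on whitespace instead of using the model tokenizer. The names `extract_answer`, `correctness`, `density`, `combined_reward` and the `density_weight` default are illustrative and not part of TRL or any other library.

```python
import re
from typing import Optional

# Matches integers or decimals, optionally negative, with thousands separators.
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_answer(text: str) -> Optional[str]:
    """Return the number after the last '####' marker, or None if absent."""
    if "####" not in text:
        return None
    tail = text.rsplit("####", 1)[1]
    match = NUMBER.search(tail)
    return match.group(0).replace(",", "") if match else None


def correctness(trace: str, gold: str) -> float:
    """1.0 if the trace's final answer equals the gold answer numerically."""
    pred, ref = extract_answer(trace), extract_answer(gold)
    if pred is None or ref is None:
        return 0.0
    return 1.0 if float(pred) == float(ref) else 0.0


def density(trace: str, n: int = 3) -> float:
    """Distinct n-grams divided by total n-grams (whitespace tokens)."""
    tokens = trace.split()
    if len(tokens) < n:
        return 0.0
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams)


def combined_reward(trace: str, gold: str,
                    density_weight: float = 0.3, n: int = 3) -> float:
    """Correctness gates the density bonus (assumed design, not the only one)."""
    correct = correctness(trace, gold)
    return correct * (1.0 + density_weight * density(trace, n))


if __name__ == "__main__":
    gold = "She sold 48 clips in April and half as many in May. #### 72"
    trace = "48 / 2 = 24 so 48 + 24 = 72 #### 72"
    print(correctness(trace, gold), density(trace), combined_reward(trace, gold))
```

Gating the density bonus on correctness is one possible design, chosen here so that a terse but wrong trace never outscores a correct one; a weighted sum is the obvious alternative. Note that this proxy penalizes repetition rather than length itself, which is one reason to compare it against the naive length-penalty baseline. TRL's `GRPOTrainer` takes custom reward callables through its `reward_funcs` argument, so a thin wrapper over `combined_reward` should fit, though the wrapper signature should be checked against the installed TRL version.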
Risks
- Density proxy (n-gram ratio) may not correlate with actual reasoning quality — need ablations against human eval or stronger judge model
- RL training on small GPU (< 24GB VRAM) unstable with PPO; GRPO more tractable but still requires careful KL tuning
- GSM8K saturation means gains look small — may need harder benchmark (MATH level 4-5) to see signal
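One way to probe the first risk is to rank-correlate the density proxy with scores from a human rater or a stronger judge model over the same traces. The sketch below computes Spearman's rank correlation in plain Python, with tied values given average ranks. The function names are illustrative, and the score lists are expected to come from your own eval runs; none are supplied here.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the average ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return math.nan  # undefined when one side is constant
    return cov / math.sqrt(vx * vy)
```

A correlation near zero between proxy scores and judge scores would suggest the n-gram ratio is not tracking reasoning quality, which is the case the ablation is meant to catch.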
Business angle
Fine-tune small LLMs to compress reasoning traces 40-60% via RL, sold as a drop-in model or fine-tuning service to ML teams paying per-token inference bills
Customer: ML engineer at a Series A-C startup running GPT-4o or Claude for reasoning-heavy workflows (coding assistants, math tutors, agentic pipelines) — their inference bill is $3k-$20k/month and their CTO is asking why
Pricing: one-time fine-tune jobs first, then a subscription. Target about $2k in month 1 (2-3 jobs at $500-$1k each) and $4k MRR by month 3 via a small SaaS tier hosting compressed model checkpoints