InfoDensity Reasoning Compressor

Fine-tune a small LLM to produce information-dense chain-of-thought (CoT) traces, using an RL reward that combines token efficiency and correctness

Difficulty: 1 week | Stack: Python, PyTorch, Hugging Face TRL and transformers, vLLM, GSM8K or MATH dataset

Who this is for

ML engineers running reasoning models at scale who pay per token. The target is to compress reasoning traces by 40-60% without an accuracy drop.

Fine-tune a baseline model (Qwen2.5-3B or Phi-3-mini) on GSM8K with standard SFT so that it produces chain-of-thought traces
Implement a dual reward: a correctness check (answer match) plus an information-density proxy (unique n-gram ratio / trace length)
Run GRPO or PPO via TRL with the combined reward signal; log correctness and density separately in W&B
Evaluate on held-out GSM8K and MATH: compare the trace-length vs. accuracy curve against the SFT baseline and a naive length-penalty baseline
Build a minimal FastAPI endpoint that returns the compressed trace and final answer, and exposes token savings per request
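
The dual reward and the naive length-penalty baseline from the steps above could be sketched roughly as follows. This is a minimal illustration, not a reference implementation. All function names, the whitespace tokenization, the bigram default, the "last number in the text" answer extraction, the 0.3 density weight, and the choice to gate the density bonus on correctness are assumptions made for the example. The density proxy is read here as unique n-grams divided by total n-grams, which is one possible interpretation of "unique n-gram ratio / trace length".

```python
import re


def extract_final_answer(text):
    """Return the last number found in text as a string, or None.

    Assumption: the final answer is the last number in the trace. This also
    handles GSM8K-style gold answers that end in '#### <number>'.
    """
    matches = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    if not matches:
        return None
    return matches[-1].replace(",", "")


def correctness_reward(completion, gold):
    """1.0 if the extracted answers match numerically, else 0.0."""
    pred = extract_final_answer(completion)
    ref = extract_final_answer(gold)
    if pred is None or ref is None:
        return 0.0
    return 1.0 if abs(float(pred) - float(ref)) < 1e-6 else 0.0


def density_reward(completion, n=2):
    """Unique n-grams divided by total n-grams, in [0, 1].

    Assumption: whitespace tokens stand in for model tokens.
    """
    tokens = completion.lower().split()
    if len(tokens) < n:
        return 0.0
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams)


def combined_reward(completion, gold, density_weight=0.3):
    """Return the combined reward plus both components for separate logging.

    Assumption: the density bonus only applies to correct answers, so a
    terse but wrong trace earns nothing.
    """
    correct = correctness_reward(completion, gold)
    density = density_reward(completion)
    reward = correct * (1.0 + density_weight * density)
    return {"reward": reward, "correct": correct, "density": density}


def length_penalty_reward(completion, gold, alpha=0.2, budget=256):
    """Naive comparison baseline: correctness minus a capped length penalty."""
    correct = correctness_reward(completion, gold)
    used = min(len(completion.split()) / budget, 1.0)
    return correct - alpha * used
```

Recent TRL releases let GRPOTrainer take custom reward callables through its reward_funcs argument. The expected callable signature depends on the TRL version, so the wrapper is left out of this sketch. Returning the components alongside the combined value keeps correctness and density available as separate metrics.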

The density proxy (n-gram ratio) may not correlate with actual reasoning quality, so it needs ablations against human eval or a stronger judge model
RL training on a small GPU (< 24 GB VRAM) is unstable with PPO; GRPO is more tractable but still requires careful KL tuning
GSM8K saturation means gains may look small; a harder benchmark (MATH level 4-5) may be needed to see signal
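
One way to probe the first risk above (whether the density proxy tracks reasoning quality) is to compute a rank correlation between per-trace density scores and quality scores from human raters or a judge model. The sketch below implements Spearman correlation in plain Python. The function names and the example numbers are hypothetical and are not results from this project.

```python
def average_ranks(values):
    """Assign 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return 0.0  # one side is constant, so correlation is undefined
    return cov / (vx * vy) ** 0.5


# Hypothetical example: density scores vs. judge ratings for five traces.
density_scores = [0.42, 0.55, 0.61, 0.70, 0.88]
judge_ratings = [2, 3, 3, 5, 4]
rho = spearman(density_scores, judge_ratings)
```

A correlation near zero or negative on a labeled sample would suggest the proxy is rewarding something other than reasoning quality, which is the case the ablation is meant to catch.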