AI Pulse

InfoDensity Reasoning Compressor

Fine-tune a small LLM to produce information-dense CoT traces, using an RL reward that combines token efficiency and correctness

Difficulty: 1-week | Stack: Python, PyTorch, TRL, Hugging Face Transformers, vLLM, GSM8K or MATH dataset

Who this is for

ML engineers who run reasoning models at scale and pay per token. The target is to compress reasoning traces by 40-60% without an accuracy drop

Build steps

  1. Fine-tune baseline (Qwen2.5-3B or Phi-3-mini) on GSM8K with standard SFT to get chain-of-thought traces
  2. Implement dual reward: correctness check (final answer matches the reference) + information density proxy (number of unique n-grams divided by trace length in tokens)
  3. Run GRPO or PPO via TRL with the combined reward signal; log both metrics separately in W&B
  4. Eval on held-out GSM8K and MATH: plot accuracy against trace length and compare with the SFT baseline and a naive length-penalty baseline
  5. Build minimal FastAPI endpoint that returns compressed trace + final answer, expose token savings per request
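
The dual reward in step 2 can be sketched in plain Python as follows. This is a minimal sketch under several assumptions that are not part of the project description: the final answer is taken to be the last number in the trace, the `answer` column is assumed to already hold only the final numeric string (raw GSM8K answers need preprocessing first), tokens are whitespace-split words rather than model tokens, and the density term is gated on correctness with a hypothetical `density_weight` of 0.3. The `combined_reward(completions, answer, **kwargs)` shape is meant to resemble a TRL-style custom reward function operating on plain-string completions; check the TRL documentation for the exact convention before wiring it in.

```python
import re
from typing import List, Optional


def extract_answer(trace: str) -> Optional[str]:
    """Return the last number in the trace with commas stripped (assumed answer format)."""
    numbers = re.findall(r"-?\d[\d,]*(?:\.\d+)?", trace)
    if not numbers:
        return None
    return numbers[-1].replace(",", "")


def correctness_reward(trace: str, gold: str) -> float:
    """1.0 if the extracted answer equals the reference numerically, else 0.0."""
    pred = extract_answer(trace)
    if pred is None:
        return 0.0
    try:
        return 1.0 if abs(float(pred) - float(gold.replace(",", ""))) < 1e-6 else 0.0
    except ValueError:
        return 0.0


def density_reward(trace: str, n: int = 3) -> float:
    """Unique n-grams divided by trace length in whitespace tokens (a crude proxy)."""
    tokens = trace.split()
    if len(tokens) < n:
        return 0.0
    ngrams = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    return len(ngrams) / len(tokens)


def combined_reward(
    completions: List[str],
    answer: List[str],
    density_weight: float = 0.3,  # hypothetical default, tune per experiment
    **kwargs,
) -> List[float]:
    """Combine both signals; density only counts when the answer is correct."""
    rewards = []
    for trace, gold in zip(completions, answer):
        correct = correctness_reward(trace, gold)
        rewards.append(correct * (1.0 + density_weight * density_reward(trace)))
    return rewards
```

Gating the density term on correctness is one design choice among several: it keeps a short but wrong trace from outscoring a long correct one. Returning the two components separately as well would make it easier to log both metrics, as step 3 asks.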

Risks

  • Density proxy (n-gram ratio) may not correlate with actual reasoning quality — need ablations against human eval or stronger judge model
  • RL training on small GPU (< 24GB VRAM) unstable with PPO; GRPO more tractable but still requires careful KL tuning
  • GSM8K saturation means gains look small — may need harder benchmark (MATH level 4-5) to see signal
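
One way to probe the first risk is to score a sample of traces with both the density proxy and a stronger judge (or human raters), then check whether the two rankings agree. The sketch below computes a Spearman rank correlation in plain Python; the function names are illustrative, and how the judge scores are collected is left open.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return 0.0  # one side is constant, so no ranking signal
    return cov / (var_x * var_y) ** 0.5
```

Calling `spearman(proxy_scores, judge_scores)` on paired per-trace scores gives a value between -1 and 1. A value near zero would suggest the n-gram proxy is not tracking what the judge considers good reasoning.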

Business Angle

Fine-tune small LLMs to compress reasoning traces 40-60% via RL, sold as a drop-in model or fine-tuning service to ML teams paying per-token inference bills

Customer: ML engineer at a Series A-C startup running GPT-4o or Claude for reasoning-heavy workflows (coding assistants, math tutors, agentic pipelines) — their inference bill is $3k-$20k/month and their CTO is asking why

Pricing: starts with one-time fine-tune jobs at $500-$1k each (2-3 jobs, about $2k in month 1), then adds a small SaaS tier hosting compressed model checkpoints, targeting $4k MRR by month 3
