Managed cloud runs + consulting for ML engineers who want step-level RL fine-tuning without building the scaffolding themselves.

Customer: ML engineer (solo or small team) at a seed/Series A AI startup building a domain-specific agent — e.g. a coding agent, document-processing agent, or tool-use agent — who has binary outcome labels (‘did it succeed?’) but no annotation budget for step-level feedback and no RL infra expertise.

Problem: Wiring vLLM + TRL + PEFT + a trajectory buffer + a learned critic into a coherent RL loop is 2-4 weeks of painful plumbing that is orthogonal to the actual research. Most applied teams give up and stay with SFT or prompt engineering, leaving significant quality gains on the table.
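To make the plumbing concrete, here is a minimal sketch of one of the pieces named above, the trajectory buffer, holding episodes that carry only a binary outcome label. The names `Step`, `Trajectory`, and `TrajectoryBuffer` are hypothetical and chosen for illustration; this is not the harness's actual interface, and a real buffer would also store token IDs, log-probs, and critic outputs.

```python
import random
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List


@dataclass
class Step:
    observation: str  # what the agent saw (hypothetical field)
    action: str       # what the agent did (hypothetical field)


@dataclass
class Trajectory:
    steps: List[Step] = field(default_factory=list)
    success: bool = False  # the only label available: did the episode succeed?


class TrajectoryBuffer:
    """Fixed-capacity store of episodes; the oldest episodes are evicted first."""

    def __init__(self, capacity: int = 1000) -> None:
        self._items: Deque[Trajectory] = deque(maxlen=capacity)

    def add(self, trajectory: Trajectory) -> None:
        self._items.append(trajectory)

    def sample(self, batch_size: int) -> List[Trajectory]:
        # Sample without replacement, capped at the current buffer size.
        k = min(batch_size, len(self._items))
        return random.sample(list(self._items), k)

    def success_rate(self) -> float:
        if not self._items:
            return 0.0
        return sum(t.success for t in self._items) / len(self._items)

    def __len__(self) -> int:
        return len(self._items)


buffer = TrajectoryBuffer(capacity=2)
buffer.add(Trajectory([Step("task prompt", "open file")], success=False))
buffer.add(Trajectory([Step("task prompt", "run tests")], success=True))
buffer.add(Trajectory([Step("task prompt", "apply patch")], success=True))
```

With a capacity of 2, the third `add` evicts the first (failed) episode, so the buffer ends up holding only the two successful ones.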

Pricing: open-core. Target: $1,500 MRR within 4 months, reached by either 3 consulting engagements at a $500/month retainer or ~10 hosted experiment runs per month at $150/run.

Why now

GRPO, DAPO, and process reward models all landed in the last 6 months and created a wave of practitioners who have read the papers but can’t operationalize them. The gap between ‘I understand this works’ and ‘I have it running on my task’ is at its widest right now, and the OSS tooling (TRL, PEFT) has just become good enough to build on.

Go-to-market

Publish the harness as OSS on GitHub with a single-command demo on a well-known agentic benchmark (e.g. ALFWorld or WebArena-lite) and a clear before/after reward curve. This is your proof-of-concept artifact.
Write one long-form technical post on the Hugging Face blog or Substack walking through how the critic learns to densify sparse signals, aimed at the audience searching for ‘reward shaping LLM agent’. Cross-post to r/MachineLearning and the EleutherAI Discord.
DM 10 ML engineers on Twitter/X who have publicly complained about sparse rewards or RL fine-tuning pain. Offer a free 1-hour ‘bring your task’ session in exchange for a testimonial. Convert 2-3 to paid retainers.
Add a ‘Run in the cloud’ button to the README (Replicate or Modal) so researchers with no GPU budget can try it in 10 minutes for ~$5. This is your low-friction top-of-funnel and surfaces which tasks people actually care about.
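The "densify sparse signals" idea at the centre of the technical post can be illustrated with a small sketch. It assumes one common approach, potential-based shaping from critic value differences, which may differ from what the harness actually does. The function name `densify_rewards` and the example values are hypothetical; `values[t]` stands in for a learned critic's estimate of eventual success before step `t`.

```python
from typing import List


def densify_rewards(values: List[float], outcome: float, gamma: float = 1.0) -> List[float]:
    """Turn one binary outcome into per-step rewards using critic estimates.

    values[t] is the (hypothetical) critic's estimate of eventual success
    before step t. The value after the final step is replaced by the true
    outcome, so each step is credited with the change in predicted success.
    """
    if not values:
        return []
    # Next-step value for every step; the true outcome closes the episode.
    next_values = values[1:] + [outcome]
    return [gamma * nv - v for v, nv in zip(values, next_values)]


# Assumed example: the critic's confidence rises, dips, then the episode succeeds.
values = [0.2, 0.5, 0.4, 0.9]
rewards = densify_rewards(values, outcome=1.0)
```

With `gamma=1.0` the per-step rewards telescope, so they sum to `outcome - values[0]`. The step where the critic's estimate dropped receives a negative reward, which is the step-level signal that binary outcome labels alone do not provide.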

Moat (or lack thereof)

No meaningful moat. The harness is replicable by any competent ML engineer in a few weeks, and Hugging Face or a well-funded lab could absorb this into TRL directly. The real defensibility is speed — being the go-to OSS reference for dense reward synthesis before a bigger player standardizes it, and accumulating task-specific benchmark results that take time to reproduce. Treat this as a consulting wedge and reputation play, not a durable SaaS business.