SaaS tool that auto-generates branch-aware DPO/RLHF training datasets from your agentic LLM workflows — no ML infrastructure team required.

Customer: Solo ML engineer or research-adjacent indie hacker who is fine-tuning an open-source model (Llama, Mistral, Qwen) for a specific agentic task — e.g., coding assistant, customer support bot, or tool-use agent — and currently hand-curating preference pairs in spreadsheets or flat JSONL files because they can’t afford a full data flywheel setup.

Problem: Fine-tuning open-source LLMs for agentic tasks requires structured, step-level preference data (DPO pairs, process reward signals), but existing tools produce flat trajectory logs that ignore branching decisions. Practitioners either skip RLHF entirely or spend weeks hacking together custom pipelines that break with every model swap.

Pricing: saas-mrr — $800 MRR in 4 months (16 paying users at $49/mo on a ‘Researcher’ tier with 10k rollouts/month)

Why now

The research cluster around step-aligned and branch-aware RL for agents (RLVR, DPO variants, process reward models) is about 12–18 months ahead of available tooling. Practitioners are actively reading these papers and trying to implement them but have no off-the-shelf pipeline — the gap between paper and production is the product.

Go-to-market

Post a detailed technical write-up on Substack/HuggingFace blog titled ‘How to generate DPO training data from branching agent rollouts’ — release the core tree-sampling library as open-source on GitHub with a link to the hosted dashboard for dataset export and visualization. This is the top-of-funnel.
Drop the GitHub repo + a 2-minute Loom demo in r/LocalLLaMA, Alignment Forum, and the EleutherAI Discord where the target persona already hangs out — focus the message on ‘stop hand-labeling preference pairs.’
DM 10–15 people who have publicly posted about fine-tuning open-source models for agentic tasks on Twitter/X or HuggingFace discussions; offer free beta access in exchange for a 20-minute feedback call and a testimonial.
Launch on HuggingFace Spaces with a live demo that lets anyone paste an agent prompt and see a branching rollout tree + exported DPO JSONL — no sign-up required. Capture emails at the export step.

Moat (or lack thereof)

No real moat. The core tree-sampling logic is replicable in a weekend by anyone who read the same papers. The only durable advantages are: (1) being first to have a clean hosted UI so practitioners don’t have to self-host, and (2) accumulating integrations (Axolotl, LLaMA-Factory, Unsloth export formats) that create switching friction. Expect a well-funded competitor or a HuggingFace-native feature to eat this space within 18 months — the play is to grow fast, stay lean, and potentially sell to a fine-tuning platform.