Branch-Aware Trajectory Sampler for Multi-Turn Agents
A training data generation pipeline that samples branching rollouts from an LLM agent, stores them as a tree, and exports step-level preference pairs for DPO/RLHF fine-tuning.
Difficulty: 1-week | Stack: Python, FastAPI, SQLite (via SQLModel), Hugging Face Transformers, vLLM or Ollama for local inference, NetworkX for tree storage
Who this is for
Practitioners who want to fine-tune open-source models on agentic tasks but lack structured training data — this generates branch-aware preference pairs rather than flat trajectory logs.
Build steps
- Define a task environment interface (e.g., file-system navigation or a mock API) with a deterministic success checker, so rollouts can be scored automatically.
- Implement a tree-structured rollout sampler: at each step, sample N candidate actions from the model, execute each in a forked environment state, and store the resulting branches as children in a NetworkX tree.
- Score leaf nodes using the environment’s success checker plus an optional LLM-as-judge call for partial credit; propagate scores up the tree to parent steps.
- Extract step-level preference pairs: for each branching point, pair the highest-scoring child action against the lowest-scoring one, yielding (prompt, chosen_step, rejected_step) triples aligned to the StepPO formulation.
- Build a FastAPI export endpoint that serializes the dataset to JSONL in both raw-step and DPO-ready formats, and a simple HTML tree visualizer so users can inspect branch quality before training.
- Validate the pipeline end-to-end by running a small fine-tuning job with Hugging Face TRL’s DPO trainer on the exported pairs and measuring task success rate improvement.
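The sampling, scoring, pair-extraction, and export steps above can be sketched in a few dozen lines. This is a minimal, dependency-free illustration under stated assumptions, not a reference implementation. All names (`CounterEnv`, `Node`, `both_actions`, `sample_tree`, `propagate_scores`, `extract_pairs`, `to_dpo_jsonl`) are hypothetical. A plain dict stands in for the NetworkX tree, a fixed stub policy stands in for the model, and a parent's score is taken as the mean of its children's scores, which is only one possible propagation rule (max is another).

```python
import json
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class CounterEnv:
    """Toy environment (hypothetical): reach `target` by adding 1 or 2 per step."""
    value: int = 0
    target: int = 3

    def step(self, action: str) -> "CounterEnv":
        # Forking is trivial here: each step returns a new immutable state.
        return CounterEnv(self.value + int(action), self.target)

    def success(self) -> float:
        # Deterministic success checker: 1.0 only when the target is hit exactly.
        return 1.0 if self.value == self.target else 0.0


@dataclass
class Node:
    env: CounterEnv
    action: Optional[str] = None  # action that produced this node (None at root)
    children: List[int] = field(default_factory=list)
    score: float = 0.0


Policy = Callable[[CounterEnv, int], List[str]]


def both_actions(env: CounterEnv, n: int) -> List[str]:
    # Stub policy standing in for N sampled model completions.
    return ["1", "2"][:n]


def sample_tree(root: CounterEnv, policy: Policy,
                n_branch: int, max_depth: int) -> Dict[int, Node]:
    """Expand every frontier node with up to n_branch actions, level by level."""
    nodes: Dict[int, Node] = {0: Node(root)}
    frontier = [0]
    for _ in range(max_depth):
        next_frontier = []
        for nid in frontier:
            for action in policy(nodes[nid].env, n_branch):
                cid = len(nodes)
                nodes[cid] = Node(nodes[nid].env.step(action), action)
                nodes[nid].children.append(cid)
                next_frontier.append(cid)
        frontier = next_frontier
    return nodes


def propagate_scores(nodes: Dict[int, Node], nid: int = 0) -> float:
    """Score leaves with the success checker; parents get the mean of their children."""
    node = nodes[nid]
    if not node.children:
        node.score = node.env.success()
    else:
        child_scores = [propagate_scores(nodes, c) for c in node.children]
        node.score = sum(child_scores) / len(child_scores)
    return node.score


def extract_pairs(nodes: Dict[int, Node]) -> List[dict]:
    """At each branching point, pair the best child action against the worst."""
    pairs = []
    for node in nodes.values():
        if len(node.children) < 2:
            continue
        ranked = sorted(node.children, key=lambda c: nodes[c].score)
        worst, best = nodes[ranked[0]], nodes[ranked[-1]]
        if best.score > worst.score:  # tied children carry no preference signal
            pairs.append({
                "prompt": f"value={node.env.value} target={node.env.target}",
                "chosen": best.action,
                "rejected": worst.action,
            })
    return pairs


def to_dpo_jsonl(pairs: List[dict]) -> str:
    # One JSON object per line with prompt / chosen / rejected keys.
    return "\n".join(json.dumps(p) for p in pairs)


if __name__ == "__main__":
    tree = sample_tree(CounterEnv(), both_actions, n_branch=2, max_depth=2)
    propagate_scores(tree)
    print(to_dpo_jsonl(extract_pairs(tree)))
```

In this toy run the two root children end up with the same propagated score, so the root yields no pair, while each second-level branching point yields one. The `prompt` / `chosen` / `rejected` keys match the column names TRL's DPO trainer expects for preference data. In a real pipeline the prompt would be the full conversation history up to the branching step rather than a state summary.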
Risks
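- Tree-branching inference is expensive: expanding N candidates at every node makes the number of sampled actions grow geometrically with depth, so on a local 7B model a branched run can be 10–50× slower than a single rollout; you may need to aggressively cap the branching factor or step depth.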
- Environment state forking is non-trivial if the task involves real side effects (file writes, API calls); mocking the environment convincingly enough to fork cleanly is the hardest engineering problem here.
- Step-level preference pairs can be noisy if the success checker is coarse (binary): without partial-credit scoring, the children at most branching points will have identical scores, so those points yield no usable preference signal for training.
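The inference-cost risk can be made concrete with a back-of-the-envelope count of sampled actions. The sketch below is an illustration only: `full_tree_calls` and `capped_tree_calls` are hypothetical helpers, and the `keep` cap (expand at most `keep` frontier nodes per level) is one assumed pruning strategy, not something the pipeline prescribes. It counts sampled actions rather than wall-clock time, so batched inference may narrow the gap in practice.

```python
def full_tree_calls(n_branch: int, depth: int) -> int:
    """Sampled actions when every node at every level is expanded."""
    return sum(n_branch ** d for d in range(1, depth + 1))


def capped_tree_calls(n_branch: int, depth: int, keep: int) -> int:
    """Sampled actions when at most `keep` frontier nodes are expanded per level."""
    calls, frontier = 0, 1
    for _ in range(depth):
        expanded = min(frontier, keep)
        calls += expanded * n_branch
        frontier = expanded * n_branch
    return calls


if __name__ == "__main__":
    depth = 4  # a single unbranched rollout samples `depth` actions
    print(full_tree_calls(3, depth) / depth)       # 120 / 4 = 30.0
    print(capped_tree_calls(3, depth, 2) / depth)  # 21 / 4 = 5.25
```

With a branching factor of 3 and depth 4, the full tree samples 120 actions against 4 for a single rollout, while expanding only 2 nodes per level brings that down to 21.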
Business angle
SaaS tool that auto-generates branch-aware DPO/RLHF training datasets from your agentic LLM workflows — no ML infrastructure team required.
Customer: Solo ML engineer or research-adjacent indie hacker who is fine-tuning an open-source model (Llama, Mistral, Qwen) for a specific agentic task — e.g., coding assistant, customer support bot, or tool-use agent — and currently hand-curating preference pairs in spreadsheets or flat JSONL files because they can't afford a full data flywheel setup.
Pricing: SaaS subscription, targeting roughly $800 MRR within 4 months (16 paying users at $49/mo, about $784, on a 'Researcher' tier with 10k rollouts/month)