Branch-Aware Trajectory Sampler for Multi-Turn Agents

A training data generation pipeline that samples branching rollouts from an LLM agent, stores them as a tree, and exports step-level preference pairs for DPO/RLHF fine-tuning.

Difficulty: 1-week | Stack: Python, FastAPI, SQLite (via SQLModel), Hugging Face Transformers, vLLM or Ollama for local inference, NetworkX for tree storage

Who this is for

Practitioners who want to fine-tune open-source models on agentic tasks but lack structured training data. The pipeline generates branch-aware preference pairs rather than flat trajectory logs.

Build steps

Define a task environment interface (e.g., file-system navigation or a mock API) with a deterministic success checker, so rollouts can be scored automatically.
Implement a tree-structured rollout sampler: at each step, sample N candidate actions from the model, execute each in a forked environment state, and store the resulting branches as children in a NetworkX tree.
Score leaf nodes using the environment’s success checker plus an optional LLM-as-judge call for partial credit; propagate scores up the tree to parent steps.
Extract step-level preference pairs: for each branching point, pair the highest-scoring child action against the lowest-scoring one, yielding (prompt, chosen_step, rejected_step) triples aligned to the StepPO formulation.
Build a FastAPI export endpoint that serializes the dataset to JSONL in both raw-step and DPO-ready formats, and a simple HTML tree visualizer so users can inspect branch quality before training.
Validate the pipeline end-to-end by running a small fine-tuning job with Hugging Face TRL’s DPO trainer on the exported pairs and measuring task success rate improvement.
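
The sampling, scoring, and pair-extraction steps above can be sketched in a few dozen lines. This is a minimal illustration under stated assumptions, not the project's implementation: CounterEnv, Node, expand, preference_pairs, and toy_policy are hypothetical names, the tree is kept in plain dataclasses as a stand-in for NetworkX, the policy is a fixed stub instead of a model call, and parent scores are taken as the mean of child scores (the steps do not specify a propagation rule).

```python
import copy
import json
from dataclasses import dataclass, field
from typing import List, Optional


class CounterEnv:
    """Hypothetical toy task: reach `target` by adding 1 or 2 per step."""

    def __init__(self, target: int = 3) -> None:
        self.target = target
        self.total = 0

    def observe(self) -> str:
        return f"total={self.total}, target={self.target}"

    def step(self, action: str) -> None:
        self.total += int(action)

    def fork(self) -> "CounterEnv":
        # A deep copy suffices only because this state has no external side effects.
        return copy.deepcopy(self)

    def success(self) -> bool:
        # Deterministic success checker used to score leaves.
        return self.total == self.target


@dataclass
class Node:
    prompt: str                # observation the action was sampled from
    action: Optional[str]      # None for the root
    score: float = 0.0
    children: List["Node"] = field(default_factory=list)


def toy_policy(obs: str, n: int) -> List[str]:
    """Stand-in for sampling n actions from a model; proposes "1" then "2"."""
    return ["1", "2"][:n]


def expand(env, policy, n, depth, action=None, prompt=""):
    """Sample n actions per step, run each in a forked env, recurse to `depth`."""
    node = Node(prompt=prompt, action=action)
    if depth == 0 or env.success():
        node.score = 1.0 if env.success() else 0.0
        return node
    obs = env.observe()
    for candidate in policy(obs, n):
        child_env = env.fork()
        child_env.step(candidate)
        node.children.append(expand(child_env, policy, n, depth - 1, candidate, obs))
    if node.children:
        # Assumed propagation rule: a step's score is the mean of its children.
        node.score = sum(c.score for c in node.children) / len(node.children)
    return node


def preference_pairs(node, out=None):
    """Pair the best and worst child at every branching point, skipping ties."""
    out = [] if out is None else out
    if node.children:
        best = max(node.children, key=lambda c: c.score)
        worst = min(node.children, key=lambda c: c.score)
        if best.score > worst.score:
            out.append({"prompt": best.prompt, "chosen": best.action, "rejected": worst.action})
        for child in node.children:
            preference_pairs(child, out)
    return out


tree = expand(CounterEnv(target=3), toy_policy, n=2, depth=2)
pairs = preference_pairs(tree)
jsonl = "\n".join(json.dumps(p) for p in pairs)  # one JSON object per line
print(jsonl)
```

In this toy run the two root actions tie at 0.5, so the root yields no pair, while each second-level branching point yields one. The prompt/chosen/rejected keys match the column names that TRL's DPO trainer reads for preference data. In a real multi-turn setting the prompt would likely need to carry the full interaction history, not just the latest observation.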

Risks

Tree-branching inference is expensive: forking N times per step on a local 7B model can be 10–50× slower than a single rollout, so you may need to cap the branching factor or step depth aggressively.
Environment state forking is non-trivial if the task involves real side effects (file writes, API calls); mocking the environment convincingly enough to fork cleanly is the hardest engineering problem here.
Step-level preference pairs can be uninformative if the success checker is coarse (binary): without partial-credit scoring, many sibling actions will receive identical scores, and those tied pairs carry no preference signal for training.
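
To make the inference-cost risk concrete, the sketch below counts how many actions a full tree generates for a given branching factor and depth, and finds the deepest tree that fits a budget. The function names and the budget parameter are hypothetical. The count covers generated actions only; actual wall-clock slowdown also depends on batching and prompt caching, which this does not model.

```python
def generated_steps(branching: int, depth: int) -> int:
    """Actions generated by a full tree: N + N^2 + ... + N^D."""
    if branching < 1:
        raise ValueError("branching must be at least 1")
    return sum(branching ** d for d in range(1, depth + 1))


def max_depth_within(branching: int, budget: int) -> int:
    """Largest depth whose full tree stays within `budget` generated actions."""
    depth = 0
    while generated_steps(branching, depth + 1) <= budget:
        depth += 1
    return depth


# A single rollout of depth D generates D actions, so the ratio below is the
# overhead of full branching relative to one rollout.
for n, d in [(2, 5), (3, 4)]:
    total = generated_steps(n, d)
    print(f"N={n}, depth={d}: {total} actions, {total / d:.1f}x a single rollout")
```

With N=2 and depth 5 the full tree generates 62 actions against 5 for a single rollout, and with N=3 and depth 4 it generates 120 against 4, so even small branching factors land in the tens-of-times range. Branching only at selected steps, rather than at every step, is one way to stay under a fixed budget.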

Business Angle