StepPO Visualizer: Agentic Credit Assignment Explorer
An interactive tool that runs a small LLM agent on multi-step tasks and visualizes how step-level vs token-level reward signals differ across a trajectory.
Difficulty: weekend | Stack: Python, LangGraph, Gradio, OpenAI API (or local Ollama), Matplotlib/Plotly
Who this is for
ML engineers and researchers who want an intuition for why token-level RL is a poor fit for agentic tasks — seeing the gradient signal difference visually is more compelling than reading the math.
Build steps
- Build a minimal tool-calling agent with LangGraph that solves short multi-step tasks (e.g., calculator + web-search mock), logging every step (tool chosen, input, output) as a discrete node.
- Implement two reward-attribution modes: token-spread (reward smeared across all tokens in the trajectory) vs step-aligned (reward assigned only to the action token at each step boundary).
- Generate a batch of trajectories (successful and failed) using the agent, storing step metadata and token-level log-probabilities from the model.
- Build a Gradio UI that replays a selected trajectory and overlays a heat-map of reward attribution per token, toggling between the two attribution modes.
- Add a simple comparison panel showing mean attribution variance and signal-to-noise ratio across modes for a set of trajectories, making the granularity mismatch concrete and measurable.
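The two attribution modes and the variance comparison from the build steps can be sketched in plain Python. This is a minimal sketch under assumptions that are not in the original: the `Step` dataclass and both function names are hypothetical, per-step rewards are taken as given, and the last token of each step is treated as the action token at the step boundary.

```python
from dataclasses import dataclass
from statistics import pvariance


@dataclass
class Step:
    """One agent step (hypothetical schema, not from any library)."""
    tokens: list[str]   # tokens the model emitted during this step
    reward: float       # per-step reward, assumed to be computed elsewhere


def token_spread(steps: list[Step]) -> list[float]:
    """Smear the total trajectory reward uniformly over every token."""
    n_tokens = sum(len(s.tokens) for s in steps)
    if n_tokens == 0:
        return []
    total = sum(s.reward for s in steps)
    return [total / n_tokens] * n_tokens


def step_aligned(steps: list[Step]) -> list[float]:
    """Give each step's reward only to that step's final (action) token.

    Assumption: the action token is the last token of the step.
    """
    attribution: list[float] = []
    for s in steps:
        row = [0.0] * len(s.tokens)
        if row:
            row[-1] = s.reward
        attribution.extend(row)
    return attribution


# Toy trajectory: a good calculator call, a wasted search, a correct answer.
trajectory = [
    Step(["call", "calc", "(", "2+2", ")"], reward=1.0),
    Step(["call", "search", "(", "x", ")"], reward=0.0),
    Step(["answer", "4"], reward=1.0),
]

spread = token_spread(trajectory)
aligned = step_aligned(trajectory)

# Per-token variance is one candidate number for the comparison panel.
print("token-spread variance:", pvariance(spread))
print("step-aligned variance:", pvariance(aligned))
```

Both lists have one entry per token and distribute the same total reward, so either can feed the same heat-map overlay. The signal-to-noise metric is left out here because the original does not define it.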
Risks
- Extracting per-token log-probabilities from API models (e.g., OpenAI) is rate-limited, and the API may expose only part of what you need; full logprob access may require switching to a local model like Llama via Ollama.
- Task length is a balancing act: tasks must be long enough to show a meaningful difference between attribution modes, yet short enough to run dozens of trajectories within a weekend budget.
- Reward function design is deceptively hard — a naive binary success/failure reward will make both modes look similar; you need partial-credit rewards per step to surface the difference.
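The reward-design risk can be made concrete with a small comparison of outcome-only and partial-credit rewards. This is only a sketch: the `StepResult` fields, the 0.5/0.5 weighting, and the rule that trajectory success means the final step produced a correct output are illustrative assumptions, not part of the original design.

```python
from dataclasses import dataclass


@dataclass
class StepResult:
    """Outcome of one agent step (hypothetical schema)."""
    tool_correct: bool     # did the agent pick the expected tool?
    output_correct: bool   # did the tool call produce the expected output?


def binary_rewards(steps: list[StepResult]) -> list[float]:
    """Outcome-only reward: every step inherits the trajectory's success.

    Assumption: the trajectory succeeds if its final step's output is correct.
    """
    if not steps:
        return []
    outcome = 1.0 if steps[-1].output_correct else 0.0
    return [outcome] * len(steps)


def partial_credit_rewards(steps: list[StepResult]) -> list[float]:
    """Per-step reward: half credit for the tool choice, half for the output."""
    return [
        (0.5 if s.tool_correct else 0.0) + (0.5 if s.output_correct else 0.0)
        for s in steps
    ]


# A failed trajectory: good first step, right tool but wrong output, then a bad step.
results = [
    StepResult(tool_correct=True, output_correct=True),
    StepResult(tool_correct=True, output_correct=False),
    StepResult(tool_correct=False, output_correct=False),
]

binary = binary_rewards(results)
partial = partial_credit_rewards(results)
print("binary:", binary)
print("partial credit:", partial)
```

With the binary reward, every step carries the same value, so step-aligned attribution has nothing to distinguish one step from another. The partial-credit version gives each step its own value, which is what lets the two modes diverge.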
Business Angle
A $49 interactive visualizer that helps ML engineers build intuition for credit assignment in agentic RL — before they waste weeks on the wrong training loop.
Customer: ML engineer at a 10-50 person AI startup or research lab who is building a first agentic RL pipeline and has read the StepPO/RLHF papers, but won't internalize *why* token-level reward is broken for multi-step tasks until they see their own agent's gradient signal fall apart visually.
Pricing: one-time. Roughly $800 in month 1 (16 sales at $49 comes to $784), breakeven by week 3 if launched on Hugging Face Spaces with one well-timed tweet thread.