Mini RoboTrustBench: Four-Scenario Robustness Suite for Pluggable World Models

A self-contained evaluation harness that runs any video world model through all four RoboTrustBench scenario types and produces a per-category robustness scorecard.

Difficulty: 1-month | Stack: Python, PyTorch, Hugging Face Datasets, OpenCV, Gradio, SQLite, pytest

Who this is for

Academic robotics labs and industry teams adopting video world models as planning substrates who need a standardized internal benchmark to compare model versions or fine-tuning strategies before committing to a deployment candidate.

Build steps

Define a model adapter interface (a Python ABC with predict(start_frame, instruction) -> predicted_frame) so any world model can be plugged in. Implement two concrete adapters: a random baseline and one real model (e.g., a Hugging Face video diffusion checkpoint).
Build a scenario generator for each of the four types: (1) Normal: sample valid DROID episodes directly; (2) Constraint-Sensitive: augment valid instructions with physical constraint clauses using an LLM; (3) Counterfactual: negate or alter the action axis of valid instructions; (4) Adversarial: use an LLM to generate instructions designed to maximally confuse the model (physically impossible or semantically contradictory).
Implement a multi-metric scorer: frame-level SSIM vs. ground truth for Normal, CLIP constraint-adherence score for Constraint-Sensitive, directional consistency score for Counterfactual, and prediction confidence entropy for Adversarial.
Persist all predictions, scores, and metadata to SQLite; write pytest-based regression tests so a new model checkpoint can be validated against a fixed baseline score per scenario.
Build a Gradio dashboard with four tabs (one per scenario) showing score distributions, worst-k failure cases as image grids, and a model comparison table when multiple adapter results are stored.
Package as a pip-installable CLI (robotrustbench evaluate --model my_adapter.py --scenarios all --episodes 200) with a YAML config for scenario mix ratios and scoring thresholds.
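
A minimal sketch of the adapter interface from the first build step, assuming frames are uint8 NumPy arrays of shape (H, W, C). The class names WorldModelAdapter and RandomBaselineAdapter are illustrative and not part of any existing package; a real-model adapter would wrap its checkpoint behind the same predict method.

```python
from abc import ABC, abstractmethod

import numpy as np


class WorldModelAdapter(ABC):
    """Interface every pluggable world model implements (illustrative name)."""

    @abstractmethod
    def predict(self, start_frame: np.ndarray, instruction: str) -> np.ndarray:
        """Return a predicted frame with the same shape as start_frame."""


class RandomBaselineAdapter(WorldModelAdapter):
    """Ignores the content of both inputs and emits uniform random pixels."""

    def __init__(self, seed: int = 0) -> None:
        # A seeded generator keeps the baseline reproducible across runs.
        self._rng = np.random.default_rng(seed)

    def predict(self, start_frame: np.ndarray, instruction: str) -> np.ndarray:
        # Assumption: frames are uint8 arrays, so valid pixel values are 0..255.
        return self._rng.integers(0, 256, size=start_frame.shape, dtype=np.uint8)
```

Because the harness only ever calls predict, comparing models comes down to swapping adapter classes; the scenario generators and scorers do not need to know which model is behind the interface.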

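One way to approach the persistence and regression steps, sketched with the standard-library sqlite3 module. The table layout, the function names, and the 0.02 tolerance are assumptions for illustration, and the check assumes higher scores are better, which may not hold for every metric (entropy on Adversarial, for example).

```python
import sqlite3

# Hypothetical schema: one row per (model, scenario, episode) score.
SCHEMA = """
CREATE TABLE IF NOT EXISTS results (
    model    TEXT    NOT NULL,
    scenario TEXT    NOT NULL,
    episode  INTEGER NOT NULL,
    score    REAL    NOT NULL,
    PRIMARY KEY (model, scenario, episode)
)
"""


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the results database and ensure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def record_score(conn: sqlite3.Connection, model: str, scenario: str,
                 episode: int, score: float) -> None:
    """Insert a score, overwriting any earlier value for the same episode."""
    conn.execute(
        "INSERT OR REPLACE INTO results VALUES (?, ?, ?, ?)",
        (model, scenario, episode, score),
    )
    conn.commit()


def mean_score(conn: sqlite3.Connection, model: str, scenario: str) -> float:
    """Average score for one model on one scenario."""
    row = conn.execute(
        "SELECT AVG(score) FROM results WHERE model = ? AND scenario = ?",
        (model, scenario),
    ).fetchone()
    if row[0] is None:
        raise LookupError(f"no results stored for {model!r} on {scenario!r}")
    return row[0]


def passes_baseline(conn: sqlite3.Connection, model: str, scenario: str,
                    baseline: float, tolerance: float = 0.02) -> bool:
    """True if the mean score is no more than `tolerance` below `baseline`."""
    return mean_score(conn, model, scenario) >= baseline - tolerance
```

A pytest regression test could then reduce to one assertion per scenario, such as assert passes_baseline(conn, "my_checkpoint", "normal", baseline=0.85), with the baseline values possibly read from the scoring thresholds in the YAML config.
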
Risks

LLM-generated adversarial and counterfactual instructions vary in quality from run to run, which makes benchmark results non-deterministic unless generated instructions are cached; design the pipeline to serialize and reuse instruction sets from the start.
DROID episode licensing and download logistics are non-trivial; scoping to the public ‘DROID-100’ mini-split is safer, but limits scenario diversity and may not surface domain-shift failures.
Building a genuinely useful Adversarial scenario generator requires iterative red-teaming to ensure instructions are actually hard and not just nonsensical. Underestimating this effort is the most likely cause of schedule slip.
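
The caching advice in the first risk can be sketched as a small on-disk store keyed by a hash of the scenario, source instruction, and prompt version. Everything here (InstructionCache, cache_key, the JSON file layout, the prompt_version argument) is hypothetical; generate stands in for whatever LLM call the pipeline uses.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def cache_key(scenario: str, source_instruction: str, prompt_version: str) -> str:
    """Stable key: identical inputs always map to the same cache entry."""
    payload = json.dumps([scenario, source_instruction, prompt_version])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class InstructionCache:
    """JSON-file cache so LLM-generated instructions are produced only once."""

    def __init__(self, path: Path) -> None:
        self.path = path
        if path.exists():
            self._entries = json.loads(path.read_text(encoding="utf-8"))
        else:
            self._entries = {}

    def get_or_generate(self, scenario: str, source_instruction: str,
                        prompt_version: str,
                        generate: Callable[[str], str]) -> str:
        key = cache_key(scenario, source_instruction, prompt_version)
        if key not in self._entries:
            # Cache miss: call the generator once and persist the result.
            self._entries[key] = generate(source_instruction)
            self.path.write_text(
                json.dumps(self._entries, indent=2, sort_keys=True),
                encoding="utf-8",
            )
        return self._entries[key]
```

Including a prompt version in the key means that changing the generation prompt produces fresh instructions, while reruns under the same prompt reuse the stored set and stay comparable.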