Constraint-Violation Detector for Robot Trajectory Descriptions

A CLI tool that feeds constraint-sensitive natural-language instructions into an open-source world model and flags predicted outcomes that violate stated physical constraints.

Difficulty: 1-week | Stack: Python, FastAPI, Hugging Face Transformers, OpenCV, SQLite, Rich (CLI)

Who this is for

Researchers evaluating a new video prediction model who want an automated signal for safety-relevant failure modes (e.g., the model predicts a robot confidently moving through a fragile object) without hand-labeling every prediction.

Build steps

Stand up a thin FastAPI wrapper around an open-source video world model (e.g., UniSim weights via Hugging Face, or a frame-prediction ViT fine-tuned on DROID) that accepts a start frame + instruction string and returns a predicted next frame.
Build a constraint template library: a JSON file mapping constraint types (fragile object, joint limit, workspace boundary) to natural-language instruction variants and expected constraint keywords.
For each (start frame, constraint instruction) pair, call the world model and extract the predicted frame.
Run a CLIP or GPT-4o Vision check: embed the predicted frame and compare it against embeddings of ‘constraint satisfied’ vs. ‘constraint violated’ reference images; log score and binary verdict to SQLite.
Build a Rich CLI report that prints per-constraint-type violation rates and saves flagged predicted frames as a side-by-side grid PNG for manual review.
Write a one-command test runner that cycles through all template pairs and exits non-zero if any constraint category exceeds a configurable violation threshold—making it CI-friendly.

Risks

Open-source video world models often require significant VRAM (16GB+); if local hardware is unavailable, you’ll need a GPU cloud instance which adds cost and setup time.
CLIP-based constraint violation scoring is a proxy—it may have high false-positive rates for visually subtle violations like slight over-extension of a joint; budget time for threshold tuning.
Constraint template coverage is narrow by design; the tool’s value depends entirely on the quality and diversity of the handcrafted JSON templates, which is an ongoing maintenance burden.