Constraint-Violation Detector for Robot Trajectory Descriptions
A CLI tool that feeds constraint-sensitive natural-language instructions into an open-source world model and flags predicted outcomes that violate stated physical constraints.
Difficulty: 1-week | Stack: Python, FastAPI, Hugging Face Transformers, OpenCV, SQLite, Rich (CLI)
Who this is for
Researchers evaluating a new video prediction model who want an automated signal for safety-relevant failure modes (e.g., the model predicts a robot confidently moving through a fragile object) without hand-labeling every prediction.
Build steps
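- Stand up a thin FastAPI wrapper around an open-source video world model (e.g., a frame-prediction ViT fine-tuned on DROID, or another open-weight checkpoint on Hugging Face) that accepts a start frame + instruction string and returns a predicted next frame. Confirm the weights are actually downloadable before committing to a model; UniSim, for example, does not appear to have an official public release.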
- Build a constraint template library: a JSON file mapping constraint types (fragile object, joint limit, workspace boundary) to natural-language instruction variants and expected constraint keywords.
- For each (start frame, constraint instruction) pair, call the world model and extract the predicted frame.
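- Score each predicted frame with CLIP or GPT-4o Vision. With CLIP, embed the predicted frame and compare it against embeddings of ‘constraint satisfied’ vs. ‘constraint violated’ reference images; with GPT-4o Vision, which returns text rather than image embeddings, prompt for a verdict directly. Log the score and binary verdict to SQLite.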
- Build a Rich CLI report that prints per-constraint-type violation rates and saves flagged predicted frames as a side-by-side grid PNG for manual review.
- Write a one-command test runner that cycles through all template pairs and exits non-zero if any constraint category exceeds a configurable violation threshold, which makes it CI-friendly.
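The template library, verdict logging, and CI gate from the steps above can be sketched with the standard library alone. Everything named here is an assumption for illustration: the JSON field names, the `verdicts` table schema, the `score_frame` stub (standing in for the world-model call plus the CLIP check), and the 0.5 and 0.2 cutoffs are hypothetical, not part of any existing tool.

```python
import json
import sqlite3

# Hypothetical template library; the field names are illustrative assumptions.
TEMPLATES_JSON = """
{
  "fragile_object": {
    "instructions": ["Move the arm to the shelf without touching the glass."],
    "keywords": ["glass", "without touching"]
  },
  "workspace_boundary": {
    "instructions": ["Slide the block left but keep it on the table."],
    "keywords": ["table", "keep"]
  }
}
"""


def score_frame(constraint_type: str, instruction: str) -> float:
    """Stub for world-model prediction plus violation scoring.

    Returns a made-up violation score in [0, 1]; a real run would score
    the predicted frame instead.
    """
    return 0.9 if constraint_type == "fragile_object" else 0.1


def run_suite(conn, templates, verdict_cutoff=0.5):
    """Score every (constraint type, instruction) pair and log it to SQLite."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS verdicts ("
        "constraint_type TEXT, instruction TEXT, score REAL, violated INTEGER)"
    )
    for ctype, spec in templates.items():
        for instruction in spec["instructions"]:
            score = score_frame(ctype, instruction)
            conn.execute(
                "INSERT INTO verdicts VALUES (?, ?, ?, ?)",
                (ctype, instruction, score, int(score >= verdict_cutoff)),
            )
    conn.commit()


def violation_rates(conn):
    """Return {constraint_type: fraction of predictions flagged as violations}."""
    rows = conn.execute(
        "SELECT constraint_type, AVG(violated) FROM verdicts GROUP BY constraint_type"
    )
    return dict(rows.fetchall())


def exit_code(rates, max_rate=0.2):
    """Return 1 if any category exceeds the allowed violation rate, else 0."""
    return 1 if any(rate > max_rate for rate in rates.values()) else 0


conn = sqlite3.connect(":memory:")
run_suite(conn, json.loads(TEMPLATES_JSON))
rates = violation_rates(conn)
code = exit_code(rates)
```

In a real runner the final step would pass `code` to `sys.exit` so that a CI job fails whenever a category crosses the threshold.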
Risks
- Open-source video world models often require significant VRAM (16 GB or more); if local hardware is unavailable, you’ll need a GPU cloud instance, which adds cost and setup time.
- CLIP-based constraint violation scoring is only a proxy: it is likely to misjudge visually subtle violations such as slight over-extension of a joint, both by missing real violations and by raising false alarms; budget time for threshold tuning.
- Constraint template coverage is narrow by design; the tool’s value depends entirely on the quality and diversity of the handcrafted JSON templates, which is an ongoing maintenance burden.
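Threshold tuning for the CLIP proxy can start from a small hand-labeled set of frames. The sketch below sweeps candidate cutoffs and reports the false-positive and false-negative rate at each one; the scores and labels are made-up placeholder values, and `sweep_thresholds` is a hypothetical helper rather than part of any library.

```python
def sweep_thresholds(scores, labels, thresholds):
    """Return (threshold, false_positive_rate, false_negative_rate) tuples.

    scores: violation scores in [0, 1]; labels: True where a human reviewer
    marked the frame as a real violation.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    results = []
    for t in thresholds:
        flagged = [s >= t for s in scores]
        fp = sum(f and not y for f, y in zip(flagged, labels))
        fn = sum(y and not f for f, y in zip(flagged, labels))
        fpr = fp / negatives if negatives else 0.0
        fnr = fn / positives if positives else 0.0
        results.append((t, fpr, fnr))
    return results


# Placeholder data: six hand-labeled frames (not real measurements).
scores = [0.15, 0.40, 0.55, 0.60, 0.65, 0.95]
labels = [False, False, True, False, True, True]

results = sweep_thresholds(scores, labels, [0.3, 0.5, 0.7])
# Pick the cutoff with the lowest combined error rate.
best = min(results, key=lambda r: r[1] + r[2])
```

Summing the two error rates treats a missed violation and a false alarm as equally costly; for safety-relevant checks it may be more appropriate to weight missed violations more heavily.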