Counterfactual Consistency Probe for Vision-Language Models

Automatically test whether a VLM used for robot planning produces physically consistent predictions under counterfactual instructions.

Difficulty: weekend | Stack: Python, LiteLLM, PIL, pandas, Gradio

Who this is for

Robotics engineers who use VLMs (GPT-4o, Gemini, LLaVA) as lightweight world model proxies and want to know before deployment whether the model reasons causally or just pattern-matches.

Build steps

Download 50–100 short robot manipulation clips from the public DROID dataset or a YouTube scrape; extract start-frame + ground-truth next-frame pairs.
Write a prompt template that presents the start frame and asks the VLM to predict the next state given (a) the actual action and (b) a counterfactual action (e.g., ‘lift the cup’ vs. ‘push the cup left’).
Call the VLM via LiteLLM for both conditions and capture the text description of predicted next state.
Score consistency with a second LLM call: given the counterfactual action and the model’s prediction, ask whether the described outcome is physically plausible and directionally opposite to the actual action.
Aggregate pass/fail rates per action category and render a Gradio dashboard showing worst-performing action types and example failure cases.

Risks

VLM text-based predictions are ambiguous—scoring with a second LLM call introduces noise; plan for human spot-checking of ~10% of outputs to calibrate.
DROID video files are large; downloading the full dataset is impractical for a weekend—scope to the publicly available ‘mini’ split or extract frames offline first.
Counterfactual instructions must be semantically distinct but visually plausible; generating them procedurally is tricky and low-quality pairs will make results meaningless.