Counterfactual Consistency Probe for Vision-Language Models
Automatically test whether a VLM used for robot planning produces physically consistent predictions under counterfactual instructions.
Difficulty: weekend | Stack: Python, LiteLLM, PIL, pandas, Gradio
Who this is for
Robotics engineers who use VLMs (GPT-4o, Gemini, LLaVA) as lightweight world model proxies and want to know before deployment whether the model reasons causally or just pattern-matches.
Build steps
- Download 50–100 short robot manipulation clips from the public DROID dataset or a YouTube scrape; extract start-frame + ground-truth next-frame pairs.
- Write a prompt template that presents the start frame and asks the VLM to predict the next state given (a) the actual action and (b) a counterfactual action (e.g., ‘lift the cup’ vs. ‘push the cup left’).
- Call the VLM via LiteLLM for both conditions and capture the text description of predicted next state.
- Score consistency with a second LLM call: given the counterfactual action and the model’s prediction, ask whether the described outcome is physically plausible and directionally opposite to the actual action.
- Aggregate pass/fail rates per action category and render a Gradio dashboard showing worst-performing action types and example failure cases.
Risks
- VLM text-based predictions are ambiguous—scoring with a second LLM call introduces noise; plan for human spot-checking of ~10% of outputs to calibrate.
- DROID video files are large; downloading the full dataset is impractical for a weekend—scope to the publicly available ‘mini’ split or extract frames offline first.
- Counterfactual instructions must be semantically distinct but visually plausible; generating them procedurally is tricky and low-quality pairs will make results meaningless.