Physical Plausibility Filter for Synthetic Video Datasets

A pipeline that ingests synthetic video clips, scores each clip’s temporal coherence and physical plausibility using a video foundation model, and culls low-quality samples before they enter a training set.

Difficulty: 1-week | Stack: Python, PyTorch, Hugging Face transformers (Video-LLaVA or Cosmos 3 API), ffmpeg-python, SQLite + SQLModel, FastAPI, Celery + Redis

Who this is for

ML engineers building perception models for robotics or autonomous vehicles (AV) who generate synthetic data at scale and need automated quality gating instead of manual spot-checks.

Build steps

Set up an ingestion endpoint (FastAPI) that accepts video file uploads or S3 URIs and enqueues scoring jobs via Celery; store job state in SQLite.
Implement a frame-sampling strategy (frames sampled at 1 fps, plus extra frames wherever optical flow indicates a large change) to keep token budgets manageable when calling the video model.
Write a scoring module that prompts the video model with a fixed rubric: object persistence (does a held object teleport?), gravity consistency (do unsupported objects fall, and at a believable rate?), and motion blur plausibility (does blur match the direction and speed of motion?). Parse the model’s response into a 0–1 score per dimension.
Aggregate per-dimension scores into a composite plausibility score; write thresholded PASS/FAIL labels and score metadata back to SQLite alongside each clip’s URI.
Build a minimal React or Streamlit dashboard that surfaces the score distribution, lets you preview borderline clips, and exports a filtered manifest CSV for downstream training jobs.
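
The parsing and aggregation steps above can be sketched as follows. This is a minimal illustration, not a prescribed design: it assumes the prompt asks the model to answer with a JSON object, and the dimension keys, equal weights, and 0.6 threshold are hypothetical placeholders to be tuned on real data.

```python
import json
import re
from dataclasses import dataclass

# Hypothetical key names for the three rubric dimensions.
DIMENSIONS = ("object_persistence", "gravity_consistency", "motion_blur")


@dataclass(frozen=True)
class ClipScore:
    per_dimension: dict
    composite: float
    label: str


def parse_scores(response_text: str) -> dict:
    """Pull a JSON object out of the model's reply and clamp scores to [0, 1].

    Assumes the reply contains exactly one JSON object, possibly surrounded
    by free text. The regex spans from the first '{' to the last '}'.
    """
    match = re.search(r"\{.*\}", response_text, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model response")
    raw = json.loads(match.group(0))
    scores = {}
    for name in DIMENSIONS:
        value = float(raw[name])  # raises KeyError if a dimension is missing
        scores[name] = min(1.0, max(0.0, value))
    return scores


def aggregate(scores: dict, weights: dict = None, threshold: float = 0.6) -> ClipScore:
    """Weighted mean of per-dimension scores, thresholded into PASS/FAIL."""
    if weights is None:
        weights = {name: 1.0 for name in DIMENSIONS}  # assumption: equal weights
    total_weight = sum(weights[name] for name in DIMENSIONS)
    composite = sum(scores[name] * weights[name] for name in DIMENSIONS) / total_weight
    label = "PASS" if composite >= threshold else "FAIL"
    return ClipScore(per_dimension=scores, composite=composite, label=label)
```

A weighted mean is only one option. If a single severe violation should reject a clip regardless of the other dimensions, taking the minimum across dimensions may be a better fit.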

Risks

Cosmos 3 API access may be gated or rate-limited at launch; design the scoring module behind an abstract interface so you can swap in Video-LLaVA-7B or LLaVA-NeXT-Video from Hugging Face as a fallback without rewriting the pipeline.
Ground-truth labels for ‘physically plausible’ are expensive to obtain, making it hard to objectively tune score thresholds; use a small hand-labeled validation set of obvious failures (clipping artifacts, teleporting objects) to calibrate before trusting the model on subtle cases.
Synthetic video generators often produce systematic artifacts (e.g., shadow pop-in) that the model may flag as physics violations when they are actually renderer bugs; score ‘rendering artifact’ and ‘physics error’ as separate categories, or you will filter out clips whose physics is valid but whose rendering is imperfect.
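
Two of the mitigations above lend themselves to a short sketch: a backend-agnostic scorer interface with a fallback, and threshold calibration against a small hand-labeled set. All names here (`PlausibilityScorer`, `FallbackScorer`, `pick_threshold`) are hypothetical, and balanced accuracy is just one reasonable selection criterion, chosen because obvious failures are likely to be a minority of clips.

```python
from __future__ import annotations

from typing import Protocol, Sequence


class PlausibilityScorer(Protocol):
    """Any backend that maps sampled frame paths to per-dimension scores."""

    def score(self, frame_paths: Sequence[str]) -> dict[str, float]: ...


class FallbackScorer:
    """Try the primary backend; on failure, use the fallback backend."""

    def __init__(self, primary: PlausibilityScorer, fallback: PlausibilityScorer):
        self.primary = primary
        self.fallback = fallback

    def score(self, frame_paths: Sequence[str]) -> dict[str, float]:
        try:
            return self.primary.score(frame_paths)
        except Exception:
            # Broad catch for illustration only; in practice, narrow this to
            # the backend's rate-limit and authentication errors.
            return self.fallback.score(frame_paths)


def pick_threshold(scores: Sequence[float], is_plausible: Sequence[bool]) -> float:
    """Choose the PASS threshold that maximizes balanced accuracy.

    A clip is predicted PASS when its composite score is >= the threshold.
    Candidate thresholds are the observed scores; ties go to the lowest one.
    """
    positives = sum(is_plausible)
    negatives = len(is_plausible) - positives
    if len(scores) != len(is_plausible) or positives == 0 or negatives == 0:
        raise ValueError("need equal-length inputs with both classes present")
    best_threshold, best_accuracy = None, -1.0
    for t in sorted(set(scores)):
        true_pos = sum(1 for s, y in zip(scores, is_plausible) if y and s >= t)
        true_neg = sum(1 for s, y in zip(scores, is_plausible) if not y and s < t)
        accuracy = 0.5 * (true_pos / positives + true_neg / negatives)
        if accuracy > best_accuracy:
            best_threshold, best_accuracy = t, accuracy
    return best_threshold
```

Because `PlausibilityScorer` is a structural `Protocol`, a backend only needs a matching `score` method and does not have to inherit from a shared base class, which keeps hosted-API and local-model wrappers independent of each other.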

Business Angle