Physical Plausibility Filter for Synthetic Video Datasets
A pipeline that ingests synthetic video clips, scores each clip’s temporal coherence and physical plausibility using a video foundation model, and culls low-quality samples before they enter a training set.
Difficulty: 1 week | Stack: Python, PyTorch, Hugging Face transformers (Video-LLaVA or Cosmos 3 API), ffmpeg-python, SQLite + SQLModel, FastAPI, Celery + Redis
Who this is for
ML engineers building perception models for robotics or AV who generate synthetic data at scale and need automated quality gating instead of manual spot-checks.
Build steps
- Set up an ingestion endpoint (FastAPI) that accepts video file uploads or S3 URIs and enqueues scoring jobs via Celery; store job state in SQLite.
- Implement a frame-sampling strategy (1 fps keyframes + optical-flow change-detection frames) to keep token budgets manageable when calling the video model.
- Write a scoring module that prompts the video model with a fixed rubric: object persistence (does a held object teleport?), gravity consistency (do objects fall correctly?), and motion blur plausibility. Parse the model’s response into a 0–1 score per dimension.
- Aggregate per-dimension scores into a composite plausibility score; write thresholded PASS/FAIL labels and score metadata back to SQLite alongside each clip’s URI.
- Build a minimal React or Streamlit dashboard that surfaces the score distribution, lets you preview borderline clips, and exports a filtered manifest CSV for downstream training jobs.
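The parsing, aggregation, and write-back steps above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: it assumes the video model is prompted to reply with a JSON object keyed by dimension name, and the dimension keys, weights, threshold, and `clip_scores` table are all hypothetical choices. It uses the standard-library `sqlite3` module instead of SQLModel to stay self-contained.

```python
import json
import sqlite3
from dataclasses import dataclass

# Hypothetical weights: the rubric names these dimensions but not their weighting.
WEIGHTS = {
    "object_persistence": 0.4,
    "gravity_consistency": 0.4,
    "motion_blur": 0.2,
}
PASS_THRESHOLD = 0.6  # assumed value; calibrate on hand-labeled clips


@dataclass
class ClipScore:
    uri: str
    scores: dict[str, float]
    composite: float
    label: str


def parse_scores(raw: str) -> dict[str, float]:
    """Extract per-dimension scores from a reply assumed to contain one JSON object."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model response")
    data = json.loads(raw[start:end + 1])
    # Clamp each score into [0, 1]; a missing dimension raises KeyError.
    return {dim: min(1.0, max(0.0, float(data[dim]))) for dim in WEIGHTS}


def score_clip(uri: str, raw: str) -> ClipScore:
    """Combine per-dimension scores into a weighted composite and a PASS/FAIL label."""
    scores = parse_scores(raw)
    composite = sum(WEIGHTS[dim] * value for dim, value in scores.items())
    label = "PASS" if composite >= PASS_THRESHOLD else "FAIL"
    return ClipScore(uri, scores, round(composite, 4), label)


def save_score(conn: sqlite3.Connection, clip: ClipScore) -> None:
    """Write the label and score metadata next to the clip URI."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS clip_scores ("
        "uri TEXT PRIMARY KEY, composite REAL, label TEXT, scores_json TEXT)"
    )
    conn.execute(
        "INSERT OR REPLACE INTO clip_scores VALUES (?, ?, ?, ?)",
        (clip.uri, clip.composite, clip.label, json.dumps(clip.scores)),
    )
    conn.commit()
```

Clamping out-of-range values and failing loudly on a missing dimension keeps a malformed model reply from silently turning into a PASS; whether to retry or mark such clips for manual review is a separate design decision.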
Risks
- Cosmos 3 API access may be gated or rate-limited at launch; design the scoring module behind an abstract interface so you can swap in Video-LLaVA-7B or LLaVA-NeXT-Video from Hugging Face as a fallback without rewriting the pipeline.
- Ground-truth labels for ‘physically plausible’ are expensive to obtain, making it hard to objectively tune score thresholds; use a small hand-labeled validation set of obvious failures (clipping artifacts, teleporting objects) to calibrate before trusting the model on subtle cases.
- Synthetic video generators often produce systematic artifacts (e.g., shadow pop-in) that the model may flag as physics violations when they are actually renderer bugs; distinguish ‘rendering artifact’ from ‘physics error’ categories or you will filter out valid-physics but visually imperfect clips.
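Two of the mitigations above, the swappable scoring interface and threshold calibration on a small hand-labeled set, might look something like the sketch below. Everything here is an assumption for illustration: the `VideoScorer` protocol, the `FallbackScorer` wrapper, and `calibrate_threshold` are hypothetical names, and real backends (Cosmos, Video-LLaVA, LLaVA-NeXT-Video) would each need their own adapter behind the same `score` method.

```python
from typing import Protocol, Sequence


class VideoScorer(Protocol):
    """Hypothetical interface that every scoring backend adapter implements."""

    def score(self, frames: Sequence[bytes], rubric: str) -> str:
        """Return the model's raw text response for the sampled frames."""
        ...


class FallbackScorer:
    """Tries each backend in order and returns the first successful response."""

    def __init__(self, backends: Sequence[VideoScorer]):
        if not backends:
            raise ValueError("at least one backend is required")
        self.backends = list(backends)

    def score(self, frames: Sequence[bytes], rubric: str) -> str:
        errors = []
        for backend in self.backends:
            try:
                return backend.score(frames, rubric)
            except Exception as exc:  # e.g. gated access or rate limiting
                errors.append(f"{type(backend).__name__}: {exc}")
        raise RuntimeError("all scoring backends failed: " + "; ".join(errors))


def calibrate_threshold(
    scores: Sequence[float], labels: Sequence[bool]
) -> tuple[float, float]:
    """Pick the PASS threshold with the best accuracy on hand-labeled clips.

    labels[i] is True when clip i was judged physically plausible by a human.
    Only observed scores are tried as candidate thresholds; ties keep the lowest.
    """
    if not scores or len(scores) != len(labels):
        raise ValueError("scores and labels must be non-empty and equal length")
    best_threshold, best_accuracy = 0.0, -1.0
    for threshold in sorted(set(scores)):
        correct = sum((s >= threshold) == y for s, y in zip(scores, labels))
        accuracy = correct / len(scores)
        if accuracy > best_accuracy:
            best_threshold, best_accuracy = threshold, accuracy
    return best_threshold, best_accuracy
```

Accuracy is used here only because it is the simplest criterion; if discarding a good clip is cheaper than training on a broken one, a precision- or recall-weighted criterion may fit better. Labeling renderer artifacts separately from physics errors in the validation set would let the same calibration run per category.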
Business Angle
Automated physical plausibility scoring for synthetic video datasets, so ML engineers stop wasting GPU hours training on broken sim data.
Customer: A solo ML engineer or small team (2–5 people) at a robotics or AV startup who owns the synthetic data pipeline — they run a sim (Isaac Sim, CARLA, BlenderProc) at scale, generate thousands of clips per week, and currently do QA by spot-checking 50 clips manually before a training run.
Pricing: SaaS subscription, targeting about $800 MRR within 4 months (8 customers at $99/mo, or 4 at $199/mo on a higher-throughput tier)