AI Pulse

Physical Plausibility Filter for Synthetic Video Datasets

A pipeline that ingests synthetic video clips, scores each clip’s temporal coherence and physical plausibility using a video foundation model, and culls low-quality samples before they enter a training set.

Difficulty: 1-week | Stack: Python, PyTorch, Hugging Face Transformers (Video-LLaVA or Cosmos 3 API), ffmpeg-python, SQLite + SQLModel, FastAPI, Celery + Redis

Who this is for

ML engineers building perception models for robotics or AV who generate synthetic data at scale and need automated quality gating instead of manual spot-checks.

Build steps

  1. Set up an ingestion endpoint (FastAPI) that accepts video file uploads or S3 URIs and enqueues scoring jobs via Celery; store job state in SQLite.
  2. Implement a frame-sampling strategy (one frame per second, plus extra frames wherever optical flow shows an abrupt change in motion) to keep token budgets manageable when calling the video model.
  3. Write a scoring module that prompts the video model with a fixed rubric: object persistence (does a held object teleport?), gravity consistency (do objects fall correctly?), and motion blur plausibility. Parse the model’s response into a 0–1 score per dimension.
  4. Aggregate per-dimension scores into a composite plausibility score; write thresholded PASS/FAIL labels and score metadata back to SQLite alongside each clip’s URI.
  5. Build a minimal React or Streamlit dashboard that surfaces the score distribution, lets you preview borderline clips, and exports a filtered manifest CSV for downstream training jobs.
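
Steps 3 and 4 can be sketched roughly as follows. This is a minimal illustration, not a reference implementation. It assumes the prompt asks the model to reply with a JSON object keyed by rubric dimension. The key names, the equal weighting, and the 0.6 threshold are all hypothetical placeholders to be tuned against your own data.

```python
import json
import re
from dataclasses import dataclass
from typing import Optional

# Rubric dimensions from step 3; these key names are illustrative assumptions.
DIMENSIONS = ("object_persistence", "gravity_consistency", "motion_blur")


@dataclass
class ClipScore:
    uri: str
    scores: dict[str, float]
    composite: float
    label: str  # "PASS" or "FAIL"


def parse_scores(response_text: str) -> dict[str, float]:
    """Parse the span from the first '{' to the last '}' as JSON and
    clamp each rubric score to the range [0, 1]."""
    match = re.search(r"\{.*\}", response_text, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model response")
    raw = json.loads(match.group(0))
    scores = {}
    for dim in DIMENSIONS:
        if dim not in raw:
            raise ValueError(f"missing dimension: {dim}")
        scores[dim] = min(1.0, max(0.0, float(raw[dim])))
    return scores


def aggregate(
    uri: str,
    scores: dict[str, float],
    weights: Optional[dict[str, float]] = None,
    threshold: float = 0.6,  # assumed default, not a recommended value
) -> ClipScore:
    """Combine per-dimension scores into a weighted mean and a PASS/FAIL label."""
    weights = weights or {dim: 1.0 for dim in DIMENSIONS}
    total = sum(weights[dim] for dim in DIMENSIONS)
    composite = sum(scores[dim] * weights[dim] for dim in DIMENSIONS) / total
    label = "PASS" if composite >= threshold else "FAIL"
    return ClipScore(uri=uri, scores=scores, composite=composite, label=label)
```

A weighted mean is only one possible aggregation. Taking the minimum across dimensions is a stricter alternative, since a single severe violation then fails the clip regardless of the other scores.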

Risks

  • Cosmos 3 API access may be gated or rate-limited at launch; design the scoring module behind an abstract interface so you can swap in Video-LLaVA-7B or LLaVA-NeXT-Video from Hugging Face as a fallback without rewriting the pipeline.
  • Ground-truth labels for ‘physically plausible’ are expensive to obtain, making it hard to objectively tune score thresholds; use a small hand-labeled validation set of obvious failures (clipping artifacts, teleporting objects) to calibrate before trusting the model on subtle cases.
  • Synthetic video generators often produce systematic artifacts (e.g., shadow pop-in) that the model may flag as physics violations when they are actually renderer bugs; score ‘rendering artifact’ and ‘physics error’ as separate categories, or you will filter out clips whose physics is valid but whose rendering is imperfect.
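
The first two mitigations above could look something like the sketch below. Every name here (VideoScorer, FallbackScorer, ScorerUnavailable, calibrate_threshold) is hypothetical. Real backends would wrap the Cosmos or Hugging Face model calls behind the same score method. The calibration picks the threshold by plain accuracy, which is an assumption; a precision-weighted criterion may suit a filter better.

```python
from typing import Protocol, Sequence


class ScorerUnavailable(Exception):
    """Raised by a backend that is gated, rate-limited, or otherwise down."""


class VideoScorer(Protocol):
    """Anything that turns sampled frames into per-dimension scores in [0, 1]."""

    def score(self, frame_paths: Sequence[str]) -> dict[str, float]: ...


class FallbackScorer:
    """Tries each backend in order and returns the first successful result."""

    def __init__(self, backends: Sequence[VideoScorer]) -> None:
        self.backends = list(backends)

    def score(self, frame_paths: Sequence[str]) -> dict[str, float]:
        last_error = None
        for backend in self.backends:
            try:
                return backend.score(frame_paths)
            except ScorerUnavailable as exc:
                last_error = exc
        raise ScorerUnavailable("no scoring backend available") from last_error


def calibrate_threshold(labeled: Sequence[tuple[float, bool]]) -> float:
    """Pick the PASS threshold with the highest accuracy on hand-labeled clips.

    `labeled` holds (composite_score, is_plausible) pairs. Candidate
    thresholds are the observed scores; ties go to the lowest candidate.
    """
    if not labeled:
        raise ValueError("need at least one labeled clip")
    best_threshold, best_correct = 0.0, -1
    for candidate in sorted({score for score, _ in labeled}):
        # A clip passes when its score is at or above the candidate threshold.
        correct = sum((score >= candidate) == ok for score, ok in labeled)
        if correct > best_correct:
            best_threshold, best_correct = candidate, correct
    return best_threshold
```

Because the pipeline depends only on the score method, swapping the primary model for a fallback is a change to the list passed to FallbackScorer rather than to the Celery task code.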

Business Angle

Automated physical plausibility scoring for synthetic video datasets, so ML engineers stop wasting GPU hours training on broken sim data.

Customer: A solo ML engineer or small team (2–5 people) at a robotics or AV startup who owns the synthetic data pipeline — they run a sim (Isaac Sim, CARLA, BlenderProc) at scale, generate thousands of clips per week, and currently do QA by spot-checking 50 clips manually before a training run.

Pricing: SaaS subscription (MRR). Target: $800 MRR within 4 months (about 8 customers at $99/mo, or 4 at $199/mo on a higher-throughput tier)
