AI Pulse
← Projects · weekend

Modality Gap Probe

A tool that stress-tests a VLM by varying font, resolution, and background of rendered text to find the rendering recipe that minimises the pixel-text vs token-text accuracy gap.

Difficulty: weekend | Stack: Python, Pillow, OpenAI GPT-4o or Claude 3.5 Sonnet API, Gradio

Who this is for

ML engineers and document-AI teams who feed scanned or styled text into a VLM pipeline and want to know which rendering parameters actually matter before they touch model weights.

Build steps

  1. Build a renderer that takes a text string and produces N image variants by sweeping a grid of (font family, font size, background colour, JPEG quality) combinations using Pillow.
  2. Write a QA harness: for each variant send the image to the VLM API alongside a fixed set of comprehension questions; also send the raw text string as a control condition.
  3. Score every variant against ground-truth answers and compute a ‘modality gap’ metric (token accuracy − image accuracy) per rendering parameter.
  4. Build a Gradio dashboard that shows a heatmap of the gap across the parameter grid and highlights the top-3 rendering configs.
  5. Export a best-config JSON artefact that downstream pipelines can consume as a preprocessing spec.

Risks

  • API rate limits blow the budget when sweeping a large parameter grid — cap N to ~30 variants and cache responses aggressively.
  • Results are benchmark-specific; a config optimal for English prose may degrade on code or tables, so communicate scope clearly in the UI.
  • VLM API providers may change vision encoders silently, making saved configs stale — add a version field and a re-run button.

Business Angle

A render-parameter optimizer that tells ML engineers exactly which font/resolution/background settings minimize their VLM's pixel-text accuracy gap—before they touch model weights.

Customer: Solo ML engineer or small document-AI team (2–5 people) at a Series A startup building invoice parsing, receipt OCR, or form extraction on top of GPT-4o or Claude—they're seeing 10–20% accuracy drops on styled or low-res scans vs clean text and don't know if it's the model or their preprocessing pipeline.

Pricing: one-time — $800 in first 60 days via 8–10 one-time sales at $79–$99 each

Full business breakdown →