Modality Gap Probe

A tool that stress-tests a vision-language model (VLM) by varying the font, size, background, and JPEG quality of rendered text to find the rendering recipe that minimises the accuracy gap between text supplied as pixels and the same text supplied as tokens.

Difficulty: weekend | Stack: Python, Pillow, OpenAI GPT-4o or Claude 3.5 Sonnet API, Gradio

Who this is for

ML engineers and document-AI teams who feed scanned or styled text into a VLM pipeline and want to know which rendering parameters actually matter before they touch model weights.

Build steps

Build a renderer that takes a text string and produces N image variants by sweeping a grid of (font family, font size, background colour, JPEG quality) combinations using Pillow.
Write a QA harness: for each variant, send the image to the VLM API alongside a fixed set of comprehension questions; also send the raw text string with the same questions as a control condition.
Score every variant against ground-truth answers and compute a ‘modality gap’ metric (token accuracy − image accuracy) per rendering parameter.
Build a Gradio dashboard that shows a heatmap of the gap across the parameter grid and highlights the top-3 rendering configs.
Export a best-config JSON artefact that downstream pipelines can consume as a preprocessing spec.
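
The scoring and export steps above can be sketched as follows. This is a minimal illustration under assumptions, not a reference implementation: the record layout (a "config" dict plus a list of 0/1 "correct" flags per variant), the function names, and the "model_version" label are all hypothetical. It ranks configs by the signed gap, so a config where the image condition beats the text control would rank first.

```python
import json
from collections import defaultdict


def accuracy(flags):
    """Fraction of correct answers; flags are 0/1 per question."""
    return sum(flags) / len(flags) if flags else 0.0


def modality_gap(text_correct, image_correct):
    """Token accuracy minus image accuracy (positive: image condition is worse)."""
    return accuracy(text_correct) - accuracy(image_correct)


def gap_by_parameter(variants, text_correct, param):
    """Pool all variants that share a value of one parameter, then compute the gap."""
    grouped = defaultdict(list)
    for v in variants:
        grouped[v["config"][param]].extend(v["correct"])
    return {value: modality_gap(text_correct, flags) for value, flags in grouped.items()}


def top_configs(variants, text_correct, k=3):
    """Return the k variants with the smallest signed gap."""
    ranked = sorted(variants, key=lambda v: modality_gap(text_correct, v["correct"]))
    return [
        {"config": v["config"], "gap": round(modality_gap(text_correct, v["correct"]), 4)}
        for v in ranked[:k]
    ]


def export_best_config(variants, text_correct, model_version):
    """Serialise the single best config as a JSON preprocessing spec (assumed schema)."""
    best = top_configs(variants, text_correct, k=1)[0]
    return json.dumps({"model_version": model_version, **best}, indent=2, sort_keys=True)


# Made-up illustration data: 4 questions, control answered 4/4 from raw text.
text_correct = [1, 1, 1, 1]
variants = [
    {"config": {"font_size": 12, "jpeg_quality": 40}, "correct": [1, 0, 0, 1]},
    {"config": {"font_size": 12, "jpeg_quality": 90}, "correct": [1, 1, 0, 1]},
    {"config": {"font_size": 24, "jpeg_quality": 40}, "correct": [1, 1, 1, 0]},
    {"config": {"font_size": 24, "jpeg_quality": 90}, "correct": [1, 1, 1, 1]},
]

if __name__ == "__main__":
    print(gap_by_parameter(variants, text_correct, "font_size"))
    print(export_best_config(variants, text_correct, "example-model-label"))
```

The per-parameter dict is the kind of table a heatmap could be drawn from, one row or column per parameter value. Exact-match 0/1 scoring is the simplest choice; a real harness might need a more lenient answer matcher.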

Risks

Sweeping a large parameter grid can hit API rate limits and blow the budget; cap N at about 30 variants and cache responses aggressively.
Results are benchmark-specific; a config optimal for English prose may degrade on code or tables, so communicate scope clearly in the UI.
VLM API providers may change vision encoders silently, making saved configs stale; add a version field to the exported config and a re-run button to the dashboard.
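
One way to handle the budget and staleness risks above is sketched below, assuming a hypothetical `ask_fn(image_bytes, question)` callable that wraps whichever VLM API is in use. The grid values, the `CachedClient` class, and the `model_version` label are illustrative assumptions. The cache here is an in-memory dict; a real tool would likely persist it to disk.

```python
import hashlib
import itertools
import random


def sample_grid(grid, max_variants=30, seed=0):
    """Expand the full parameter grid, then keep a reproducible random subset."""
    keys = sorted(grid)
    combos = [
        dict(zip(keys, values))
        for values in itertools.product(*(grid[k] for k in keys))
    ]
    if len(combos) <= max_variants:
        return combos
    return random.Random(seed).sample(combos, max_variants)


def cache_key(model_version, image_bytes, question):
    """Hash the model label, image bytes, and question into one cache key."""
    h = hashlib.sha256()
    for part in (model_version.encode(), image_bytes, question.encode()):
        h.update(len(part).to_bytes(8, "big"))  # length prefix keeps parts unambiguous
        h.update(part)
    return h.hexdigest()


class CachedClient:
    """Wraps a hypothetical ask_fn so repeated (image, question) pairs cost one call."""

    def __init__(self, ask_fn, model_version):
        self.ask_fn = ask_fn
        self.model_version = model_version
        self.cache = {}
        self.api_calls = 0

    def ask(self, image_bytes, question):
        key = cache_key(self.model_version, image_bytes, question)
        if key not in self.cache:
            self.cache[key] = self.ask_fn(image_bytes, question)
            self.api_calls += 1
        return self.cache[key]


if __name__ == "__main__":
    grid = {
        "font_family": ["serif", "sans", "mono"],
        "font_size": [12, 18, 24],
        "background": ["white", "grey"],
        "jpeg_quality": [40, 70, 90],
    }
    print(len(sample_grid(grid)), "variants sampled from 54 combinations")
```

Because the model label is part of the cache key, changing it (for example from a re-run button) forces fresh API calls instead of reusing answers recorded against an older encoder. Random sampling is only one way to cap the sweep; a coarse grid followed by refinement around the best cells would be an alternative.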

Business Angle