Rendering-Aware Document Preprocessor

A drop-in FastAPI microservice that receives a scanned-document image, tries a small set of pre-baked rendering transforms, picks the one that scores highest on a quick VLM confidence probe, and returns the enhanced image.

Difficulty: weekend | Stack: Python, FastAPI, Pillow, pdf2image, OpenAI GPT-4o-mini API (low-cost probe), Docker

Who this is for

Developers building document-ingestion pipelines (expense reports, invoices, receipts) who want higher OCR/VLM accuracy without swapping their base model.

Build steps

Define a small transform library built on Pillow: contrast normalisation, deskew, Otsu binarisation, upscaling to 300 dpi, and background whitening. The five transforms can be combined.
For each candidate transform, crop a 20% thumbnail of the document and ask GPT-4o-mini to transcribe three words; score the transcription by character-level edit distance against a naive baseline.
Select the highest-scoring transform, apply it to the full image, and return the result as a PNG alongside a metadata JSON object {"transform": …, "confidence_delta": …}.
Add a /batch endpoint that accepts a multi-page PDF (via pdf2image), applies per-page selection, and streams back a ZIP of enhanced pages.
Dockerise the service and write a one-page README showing how to plug it in front of any existing VLM extraction step.
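
The probe-and-select steps above can be sketched roughly as follows. This is a sketch under assumptions, not the service's actual code: `probe` is a hypothetical callable standing in for the GPT-4o-mini transcription request, `baseline` is assumed to be a reference string for the cropped region (how it is obtained is left open), and the score is an edit-distance similarity normalised to [0, 1] so that higher is better. All function names are illustrative.

```python
from typing import Any, Callable, Dict, Tuple


def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similarity(text: str, baseline: str) -> float:
    """Map edit distance to [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(text), len(baseline))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(text, baseline) / longest


def select_transform(
    crop: Any,
    transforms: Dict[str, Callable[[Any], Any]],
    probe: Callable[[Any], str],
    baseline: str,
) -> Tuple[str, float]:
    """Return (best transform name, score gain over the untransformed crop).

    `probe` is a hypothetical stand-in for the VLM transcription call.
    """
    identity_score = similarity(probe(crop), baseline)
    best_name, best_score = "identity", identity_score
    for name, transform in transforms.items():
        score = similarity(probe(transform(crop)), baseline)
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score - identity_score
```

Keeping `probe` as a plain callable means the VLM request can be swapped for a stub in tests or for a local heuristic later. The returned gain could populate the `confidence_delta` field, assuming that field is meant to hold the improvement over the untransformed image.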

Risks

The confidence probe itself uses a VLM, adding latency and cost — if the pipeline is already tight on budget, replace the probe with a heuristic (mean pixel entropy) as a zero-API fallback.
Aggressive binarisation destroys colour cues that may matter (e.g. red ‘PAID’ stamps) — add a flag to skip binarisation for colour-sensitive document types.
The ‘best’ transform chosen on the thumbnail may not generalise to the full page if content density varies — validate on a held-out set of real documents before deploying.
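
As an illustration of the zero-API fallback mentioned in the first risk, the following sketch computes Shannon entropy from a grayscale histogram. It assumes "mean pixel entropy" can be approximated by the entropy of the intensity distribution. With Pillow, `image.convert("L").histogram()` would supply the 256 counts. The source does not say whether higher or lower entropy should be preferred, so that decision is left to the caller.

```python
import math
from typing import Sequence


def histogram_entropy(counts: Sequence[int]) -> float:
    """Shannon entropy (in bits) of a pixel-intensity histogram.

    `counts` holds one non-negative count per intensity level, e.g. the
    256-entry list returned by Pillow's Image.histogram() on an "L" image.
    """
    total = sum(counts)
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in counts:
        if count:  # empty bins contribute nothing (0 * log 0 is taken as 0)
            p = count / total
            entropy -= p * math.log2(p)
    return entropy
```

A cleanly binarised page concentrates pixels in two bins and therefore scores at most 1 bit, while a noisy grey scan spreads across many bins and scores closer to 8 bits. This makes the measure cheap to compute but only a rough proxy for legibility.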

Business Angle