Rendering-Aware Document Preprocessor
A drop-in FastAPI microservice that receives a scanned-document image, tries a small set of pre-baked rendering transforms, picks the one that scores highest on a quick VLM confidence probe, and returns the enhanced image.
Difficulty: weekend | Stack: Python, FastAPI, Pillow, pdf2image, OpenAI GPT-4o-mini API (low-cost probe), Docker
Who this is for
Developers building document-ingestion pipelines (expense reports, invoices, receipts) who want higher OCR/VLM accuracy without swapping their base model.
Build steps
- Define a small library of five combinable transforms in Pillow: contrast normalisation, deskew, binarisation (Otsu), upscaling to 300 dpi, and background whitening.
- For each candidate transform, crop a 20% thumbnail of the document, ask GPT-4o-mini to transcribe three words from it, and score the transcription by character-level edit distance against a naive baseline.
- Select the highest-scoring transform, apply it to the full image, and return the result as a PNG alongside a metadata JSON {"transform": …, "confidence_delta": …}.
- Add a /batch endpoint that accepts a multi-page PDF (via pdf2image), applies per-page selection, and streams back a ZIP of enhanced pages.
- Dockerise the service and write a one-page README showing how to plug it in front of any existing VLM extraction step.
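
The probe-and-select steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the service's actual implementation: the brief does not say what string the transcriptions are compared against, so `reference` is a hypothetical parameter, as are the function names, the `"identity"` key for the untransformed thumbnail, and the choice to normalise edit distance into a 0-1 similarity. The VLM call itself is left out; the sketch starts from transcriptions that have already been collected.

```python
from typing import Dict


def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def similarity(candidate: str, reference: str) -> float:
    """Map edit distance to a 0-1 score (1.0 means identical strings)."""
    longest = max(len(candidate), len(reference))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(candidate, reference) / longest


def select_transform(transcriptions: Dict[str, str],
                     reference: str,
                     baseline_name: str = "identity") -> dict:
    """Pick the transform whose transcription is closest to the reference.

    ASSUMPTION: `transcriptions` maps a transform name to the probe's
    transcription of the thumbnail after that transform, and contains an
    entry for the untransformed image under `baseline_name`.
    """
    scores = {name: similarity(text, reference)
              for name, text in transcriptions.items()}
    best = max(scores, key=scores.get)
    return {"transform": best,
            "confidence_delta": scores[best] - scores[baseline_name]}
```

For example, `select_transform({"identity": "lnvoice t0tal due", "deskew": "invoice total due"}, "invoice total due")` selects `"deskew"`, and the reported `confidence_delta` is the gain in similarity over the untransformed thumbnail.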
Risks
- The confidence probe itself uses a VLM, adding latency and cost — if the pipeline is already tight on budget, replace the probe with a heuristic (mean pixel entropy) as a zero-API fallback.
- Aggressive binarisation destroys colour cues that may matter (e.g. red ‘PAID’ stamps) — add a flag to skip binarisation for colour-sensitive document types.
- The ‘best’ transform chosen on the thumbnail may not generalise to the full page if content density varies — validate on a held-out set of real documents before deploying.
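
The zero-API fallback mentioned in the first risk could look something like the sketch below. It reads "mean pixel entropy" as the Shannon entropy of the grey-level histogram, which is one plausible interpretation rather than a definition given in this brief. The function name is hypothetical, and the input is assumed to be a list of 256 bin counts such as the one Pillow's `Image.histogram()` returns for a mode `L` image.

```python
import math
from typing import Sequence


def histogram_entropy(hist: Sequence[int]) -> float:
    """Shannon entropy, in bits, of a grey-level histogram.

    ASSUMPTION: `hist` holds non-negative pixel counts per grey level.
    An empty or all-zero histogram is treated as having zero entropy.
    """
    total = sum(hist)
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in hist:
        if count:  # skip empty bins, since log2(0) is undefined
            p = count / total
            entropy -= p * math.log2(p)
    return entropy
```

Whether a higher or a lower entropy indicates the better transform is not stated here, so the direction of the comparison should be settled on the same held-out set used to validate the thumbnail probe.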
Business angle
A $29/mo FastAPI microservice that auto-enhances scanned documents before OCR, so developers stop babysitting image-quality issues in their ingestion pipelines.
Customer: A solo developer or small-team backend engineer at a 5-50 person SaaS company who has built a document-ingestion pipeline (expense reports, invoices, contracts) on GPT-4o or similar, and is seeing roughly 80-85% extraction accuracy because scanned inputs are skewed, low-contrast, or poorly lit, not because the prompt is wrong.
Pricing: SaaS subscription at $29/mo. Target: about $800 MRR within 4 months, which works out to roughly 28 paying customers (28 × $29 = $812).