AI Pulse

Rendering-Aware Document Preprocessor

A drop-in FastAPI microservice that receives a scanned-document image, tries a small set of pre-baked rendering transforms, picks the one that scores highest on a quick VLM confidence probe, and returns the enhanced image.

Difficulty: weekend | Stack: Python, FastAPI, Pillow, pdf2image, OpenAI GPT-4o-mini API (low-cost probe), Docker

Who this is for

Developers building document-ingestion pipelines (expense reports, invoices, receipts) who want higher OCR/VLM accuracy without swapping their base model.

Build steps

  1. Define a small transform library in Pillow: contrast normalisation, deskew, binarisation (Otsu), upscale-to-300dpi, and background whitening — five transforms, combinable.
  2. For each candidate transform, apply it to a thumbnail crop covering 20% of the document and ask GPT-4o-mini to transcribe three words; score each transcription by character-level edit distance against a naive baseline.
  3. Select the highest-scoring transform, apply it to the full image, and return the result as a PNG alongside a metadata JSON {"transform": …, "confidence_delta": …}.
  4. Add a /batch endpoint that accepts a multi-page PDF (via pdf2image), applies per-page selection, and streams back a ZIP of enhanced pages.
  5. Dockerise the service and write a one-page README showing how to plug it in front of any existing VLM extraction step.
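
The scoring and selection in steps 2 and 3 can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the project's actual implementation: the function names, the "identity" key for the untransformed crop, and the use of a reference string as the "naive baseline" are all hypothetical choices made for illustration. The VLM call that produces each transcription is left out.

```python
from typing import Dict, Tuple


def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance, computed row by row."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(text: str, reference: str) -> float:
    """Normalised score in [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(text), len(reference))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(text, reference) / longest


def select_transform(transcriptions: Dict[str, str],
                     reference: str,
                     baseline: str = "identity") -> Tuple[str, float]:
    """Return (best transform name, score gain over the baseline entry).

    Assumption: `transcriptions` maps each transform name to the probe's
    transcription of the transformed crop, and includes a `baseline` key
    holding the transcription of the untransformed crop.
    """
    scores = {name: similarity(text, reference)
              for name, text in transcriptions.items()}
    best = max(scores, key=scores.get)
    return best, scores[best] - scores[baseline]
```

A call might look like `select_transform({"identity": "lnvoice tota1 due", "deskew": "Invoice total due"}, "Invoice total due")`, whose two return values would fill the `transform` and `confidence_delta` fields of the metadata JSON.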

Risks

  • The confidence probe itself uses a VLM, adding latency and cost — if the pipeline is already tight on budget, replace the probe with a heuristic (mean pixel entropy) as a zero-API fallback.
  • Aggressive binarisation destroys colour cues that may matter (e.g. red ‘PAID’ stamps) — add a flag to skip binarisation for colour-sensitive document types.
  • The ‘best’ transform chosen on the thumbnail may not generalise to the full page if content density varies — validate on a held-out set of real documents before deploying.
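
The zero-API fallback in the first risk could look roughly like the sketch below, which computes the Shannon entropy of a grayscale histogram. The function names and the `prefer_low` parameter are hypothetical, and the source does not say whether lower or higher entropy indicates a better scan, so the direction is left as a parameter to be validated on real documents. With Pillow, `img.convert("L").histogram()` returns a 256-entry list of counts suitable as input.

```python
import math
from typing import Dict, Sequence


def histogram_entropy(counts: Sequence[int]) -> float:
    """Shannon entropy, in bits, of a pixel-intensity histogram."""
    total = sum(counts)
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in counts:
        if count > 0:
            p = count / total
            entropy -= p * math.log2(p)
    return entropy


def pick_by_entropy(histograms: Dict[str, Sequence[int]],
                    prefer_low: bool = True) -> str:
    """Pick a transform name by histogram entropy alone.

    Assumption: `histograms` maps each transform name to the grayscale
    histogram of the transformed thumbnail. Whether low or high entropy
    is preferable is not established here; `prefer_low` exposes the choice.
    """
    scores = {name: histogram_entropy(h) for name, h in histograms.items()}
    chooser = min if prefer_low else max
    return chooser(scores, key=scores.get)
```

Note that a binarised image has at most 1 bit of entropy, so preferring low entropy will tend to favour binarisation; this interacts with the colour-cue risk above and is worth checking before relying on the heuristic.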

Business Angle

A $29/mo FastAPI microservice that auto-enhances scanned documents before OCR, so developers stop babysitting image-quality issues in their ingestion pipelines.

Customer: A solo developer or small-team backend engineer at a 5-50 person SaaS company who built a document-ingestion pipeline (expense reports, invoices, contracts) on GPT-4o or a similar model. Extraction accuracy sits at roughly 80-85% because scanned inputs are skewed, low-contrast, or poorly lit, not because the prompt is wrong.

Pricing: SaaS subscription (MRR). Target: about $800 MRR within 4 months, which is roughly 28 paying customers at $29/mo.
