AI Pulse

CultureCaptions: Native-Sourced Image-Text Collector

A lightweight web tool that lets native speakers submit and annotate culturally specific image-caption pairs to build WAON-style adaptation datasets.

Difficulty: weekend | Stack: Python, FastAPI, SQLite, Sentence-Transformers (multilingual CLIP), Jinja2 HTML frontend

Who this is for

NLP researchers and ML engineers building vision-language models for non-English languages who need natively sourced ground-truth data rather than machine-translated captions.

Build steps

  1. Scaffold a FastAPI app with two routes: a submission form (image URL + caption in target language) and an admin review queue.
  2. Store submissions in SQLite with metadata: submitter locale, timestamp, image URL, raw caption, and review status.
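  3. Integrate a multilingual CLIP model via sentence-transformers to auto-score image-caption alignment and flag low-confidence pairs for human review. Note that clip-ViT-B-32-multilingual-v1 is a text encoder only, so captions go through it while images are encoded with the matching clip-ViT-B-32 model.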
  4. Build a simple admin page that shows the image, caption, CLIP score, and Approve/Reject buttons so a native-speaker moderator can curate the dataset.
  5. Export approved pairs to a JSON or Parquet file that the Hugging Face datasets library can load, using a one-command CLI export script.
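
The storage, scoring, and export steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a finished implementation: the table layout, the function names, and the 0.25 review threshold are all hypothetical, and JSON Lines is picked here only as one format the Hugging Face datasets library can load. The embed_pair helper needs sentence-transformers and Pillow installed plus a model download; everything else uses the standard library.

```python
import json
import math
import sqlite3
from datetime import datetime, timezone

# Hypothetical cutoff: pairs scoring below it are flagged for closer review.
# Tune it per language on a small hand-labelled sample.
REVIEW_THRESHOLD = 0.25

# Hypothetical schema covering the metadata listed in step 2.
SCHEMA = """
CREATE TABLE IF NOT EXISTS submissions (
    id             INTEGER PRIMARY KEY,
    locale         TEXT NOT NULL,
    created_at     TEXT NOT NULL,
    image_url      TEXT NOT NULL,
    caption        TEXT NOT NULL,
    clip_score     REAL,
    low_confidence INTEGER NOT NULL DEFAULT 0,
    status         TEXT NOT NULL DEFAULT 'pending'
                   CHECK (status IN ('pending', 'approved', 'rejected'))
);
"""


def cosine(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def embed_pair(image_path, caption):
    """Embed one image and one caption into the shared CLIP space.

    Loads both models on every call for brevity; a real app would load
    them once at startup.
    """
    from PIL import Image
    from sentence_transformers import SentenceTransformer

    image_model = SentenceTransformer("clip-ViT-B-32")
    text_model = SentenceTransformer("clip-ViT-B-32-multilingual-v1")
    image_vec = image_model.encode(Image.open(image_path))
    text_vec = text_model.encode(caption)
    return image_vec.tolist(), text_vec.tolist()


def add_submission(conn, locale, image_url, caption, score):
    """Insert a pending submission and flag it if the score is low."""
    conn.execute(
        "INSERT INTO submissions "
        "(locale, created_at, image_url, caption, clip_score, low_confidence) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (locale, datetime.now(timezone.utc).isoformat(), image_url,
         caption, score, int(score < REVIEW_THRESHOLD)),
    )
    conn.commit()


def set_status(conn, submission_id, status):
    """Record the moderator's Approve/Reject decision."""
    conn.execute("UPDATE submissions SET status = ? WHERE id = ?",
                 (status, submission_id))
    conn.commit()


def export_approved(conn, path):
    """Write approved pairs as JSON Lines; return the number written."""
    rows = conn.execute(
        "SELECT locale, image_url, caption, clip_score FROM submissions "
        "WHERE status = 'approved' ORDER BY id"
    ).fetchall()
    with open(path, "w", encoding="utf-8") as fh:
        for locale, image_url, caption, score in rows:
            record = {"locale": locale, "image_url": image_url,
                      "caption": caption, "clip_score": score}
            # ensure_ascii=False keeps non-Latin captions readable in the file.
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(rows)
```

In a FastAPI route, the submission handler would call embed_pair, pass the two vectors to cosine, and hand the result to add_submission; the admin page's buttons would map onto set_status. Keeping the score as a stored column rather than a hard filter leaves the final decision with the native-speaker moderator.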

Risks

  • Image URLs rot quickly — submissions that pass review may become 404s before the dataset is used; mitigate by downloading and storing images locally at submission time.
  • Multilingual CLIP scores are less reliable for very low-resource languages, so the auto-filter may let through mismatched pairs or flag valid ones.
  • Without a real native-speaker community to seed submissions, the dataset stays empty — you need at least a small closed beta group to validate the workflow.
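
The link-rot mitigation in the first risk could look something like the sketch below, which fetches the image once at submission time and stores it under its SHA-256 digest. The function names, directory layout, 10 MB cap, and timeout are assumptions made for illustration; a production version would also want to verify that the bytes are actually an image.

```python
import hashlib
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlopen

MAX_BYTES = 10 * 1024 * 1024  # hypothetical size cap per image


def store_image(data: bytes, root: Path) -> Path:
    """Write image bytes to a content-addressed path and return it.

    Identical images map to the same file, so duplicates cost nothing.
    """
    digest = hashlib.sha256(data).hexdigest()
    path = root / digest[:2] / digest
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)
    return path


def download_image(url: str, root: Path, timeout: float = 10.0) -> Path:
    """Fetch a submitted image URL once and keep a local copy."""
    # Only allow web URLs so a submitter cannot point at local files.
    if urlparse(url).scheme not in ("http", "https"):
        raise ValueError(f"unsupported URL scheme: {url!r}")
    with urlopen(url, timeout=timeout) as resp:
        # Read one byte past the cap to detect oversized responses.
        data = resp.read(MAX_BYTES + 1)
    if len(data) > MAX_BYTES:
        raise ValueError("image exceeds size cap")
    return store_image(data, root)
```

The returned path can be saved alongside the original URL in the submissions table, so the exported dataset points at the stored copy while the URL remains available as provenance.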

Business Angle

A hosted data-collection platform where NLP/ML teams pay per verified native-speaker submission to build culturally-grounded vision-language datasets.

Customer: A solo ML researcher or small academic lab (1–3 people) at a non-US university working on a low-resource language vision-language model — they have a modest compute grant, no annotation budget for Mechanical Turk at scale, and a paper deadline in 6 months.

Pricing: marketplace fee, targeting $600 MRR within 4 months (6 research teams × $100/month on average, each collecting ~500 verified pairs/month at $0.20/pair)
