CultureCaptions: Native-Sourced Image-Text Collector

A lightweight web tool that lets native speakers submit and annotate culturally specific image-caption pairs to build WAON-style adaptation datasets.

Difficulty: weekend | Stack: Python, FastAPI, SQLite, Sentence-Transformers (multilingual CLIP), Jinja2 HTML frontend

Who this is for

NLP researchers and ML engineers building vision-language models for non-English languages who need natively-sourced ground-truth data rather than machine-translated captions.

Build steps

Scaffold a FastAPI app with two routes: a submission form (image URL + caption in target language) and an admin review queue.
Store submissions in SQLite with metadata: submitter locale, timestamp, image URL, raw caption, and review status.
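Integrate a multilingual CLIP model via sentence-transformers (e.g., clip-ViT-B-32-multilingual-v1 for captions, paired with clip-ViT-B-32 for images, since the multilingual checkpoint is a text encoder only) to auto-score image-caption alignment and flag low-confidence pairs for human review.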
Build a simple admin page that shows the image, caption, CLIP score, and Approve/Reject buttons so a native-speaker moderator can curate the dataset.
Export approved pairs to a Hugging Face Dataset-compatible JSON/Parquet file with a one-command CLI export script.
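
The storage, flagging, and export steps above can be sketched with the standard library alone. This is a minimal sketch, not a definitive design: the table layout, the function names (add_submission, set_status, approved_jsonl), and the REVIEW_THRESHOLD value of 0.25 are all assumptions made for illustration, and the embedding vectors are assumed to come from whichever CLIP image and text encoders you load.

```python
import json
import math
import sqlite3
from datetime import datetime, timezone

# Hypothetical cutoff: pairs scoring below this are flagged for human review.
# Tune it per language; the value here is an assumption, not a recommendation.
REVIEW_THRESHOLD = 0.25

# Hypothetical schema covering the metadata listed in the build steps.
SCHEMA = """
CREATE TABLE IF NOT EXISTS submissions (
    id           INTEGER PRIMARY KEY AUTOINCREMENT,
    locale       TEXT NOT NULL,
    submitted_at TEXT NOT NULL,
    image_url    TEXT NOT NULL,
    caption      TEXT NOT NULL,
    clip_score   REAL,
    flagged      INTEGER NOT NULL DEFAULT 0,
    status       TEXT NOT NULL DEFAULT 'pending'
                 CHECK (status IN ('pending', 'approved', 'rejected'))
);
"""


def cosine(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def add_submission(conn, locale, image_url, caption, image_vec, text_vec):
    """Store one pair with its alignment score; low scores are flagged."""
    score = cosine(image_vec, text_vec)
    cur = conn.execute(
        "INSERT INTO submissions "
        "(locale, submitted_at, image_url, caption, clip_score, flagged) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (locale, datetime.now(timezone.utc).isoformat(), image_url, caption,
         score, int(score < REVIEW_THRESHOLD)),
    )
    conn.commit()
    return cur.lastrowid


def set_status(conn, submission_id, status):
    """Record the moderator's decision ('approved' or 'rejected')."""
    conn.execute("UPDATE submissions SET status = ? WHERE id = ?",
                 (status, submission_id))
    conn.commit()


def approved_jsonl(conn):
    """Yield one JSON string per approved pair, in submission order."""
    rows = conn.execute(
        "SELECT locale, image_url, caption, clip_score FROM submissions "
        "WHERE status = 'approved' ORDER BY id"
    )
    for locale, image_url, caption, score in rows:
        record = {"locale": locale, "image_url": image_url,
                  "caption": caption, "clip_score": score}
        # ensure_ascii=False keeps native-script captions readable in the file.
        yield json.dumps(record, ensure_ascii=False)
```

An export script would open the database, run conn.executescript(SCHEMA) once, and write each string from approved_jsonl(conn) to a UTF-8 file, one per line. A JSON Lines file of this shape can then be loaded or converted to Parquet by the dataset tooling of your choice.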

Risks

Image URLs rot quickly: submissions that pass review may return 404 before the dataset is used. Mitigate by downloading and storing images locally at submission time.
Multilingual CLIP scores are less reliable for very low-resource languages, so the auto-filter may let through mismatched pairs or block valid ones.
Without a real native-speaker community to seed submissions, the dataset stays empty; you need at least a small closed beta group to validate the workflow.
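
The link-rot mitigation in the first risk (downloading images at submission time) might look like the sketch below. The function names, the content-addressed directory layout, and the MAX_BYTES cap are assumptions for illustration; a real deployment would also want to validate the URL scheme and host before fetching user-supplied links.

```python
import hashlib
import urllib.request
from pathlib import Path

MAX_BYTES = 10 * 1024 * 1024  # hypothetical size cap per image (10 MiB)


def store_image(data: bytes, root: Path) -> Path:
    """Save image bytes under a content-addressed name and return the path.

    Identical images map to the same file, so resubmissions cost no extra disk.
    """
    digest = hashlib.sha256(data).hexdigest()
    target = root / digest[:2] / digest
    target.parent.mkdir(parents=True, exist_ok=True)
    if not target.exists():
        target.write_bytes(data)
    return target


def fetch_and_store(url: str, root: Path, timeout: float = 10.0) -> Path:
    """Download the image once at submission time and keep a local copy."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        # Read one byte past the cap so oversized responses can be detected.
        data = resp.read(MAX_BYTES + 1)
    if len(data) > MAX_BYTES:
        raise ValueError("image exceeds size cap")
    return store_image(data, root)
```

Storing the returned path alongside the original URL in the submission record lets the export step ship image files that still exist even after the source page disappears.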
