Beyond English: How Researchers Are Building AI That Works for the Rest of the World

Most large language and vision models are trained on data that skews heavily toward English, or at best toward a handful of high-resource languages. The working assumption has been that scale compensates for imbalance: a model trained on enough global data will implicitly absorb cultural and linguistic nuance. A cluster of recent research challenges that assumption directly, and the cumulative picture is useful for anyone building or deploying models outside the English-speaking world.

Culture-specific understanding requires culture-specific data

The WAON project is a direct test of the “global pretraining is enough” hypothesis for Japanese. Researchers built a dataset of 100,000 image-text pairs sourced and annotated by native Japanese speakers, then used it to adapt contrastive vision-language models. The key finding (WAON: A Large-Scale Japanese Image-Text Dataset for Cultural Adaptation in Contrastive Vision-Language Models) is that global pretraining alone is not sufficient for culture-specific understanding: natively sourced adaptation data provides measurable gains on top of it.

This matters because it reframes the problem. The question isn’t just whether a model has seen Japanese data; it’s whether that data reflects how Japanese speakers actually communicate, what visual contexts they reference, and what cultural knowledge is implicit in their language. Removing English-only caption filters during pretraining helps, but it doesn’t replicate the signal you get from data created by and for native speakers. WAON offers a template that other language communities can follow.
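
For readers who have not worked with contrastive vision-language models, the objective that this kind of adaptation typically optimizes can be sketched in a few lines. The example below is a generic CLIP-style symmetric loss written in plain Python, not WAON's actual training code; the function names, the toy two-dimensional embeddings, and the temperature of 0.07 are all assumptions made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy(logits, target):
    """Numerically stable -log softmax(logits)[target]."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[target]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric image-to-text and text-to-image loss over one batch.

    Pair k (image k, caption k) is the positive; every other pairing in
    the batch is treated as a negative.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    img_to_txt = sum(cross_entropy(logits[k], k) for k in range(n)) / n
    txt_to_img = sum(
        cross_entropy([logits[row][k] for row in range(n)], k) for k in range(n)
    ) / n
    return 0.5 * (img_to_txt + txt_to_img)


# Toy batch: two images and two captions (hypothetical embeddings).
images = [[1.0, 0.0], [0.0, 1.0]]
matched_captions = [[1.0, 0.0], [0.0, 1.0]]
swapped_captions = [[0.0, 1.0], [1.0, 0.0]]

print("matched:", contrastive_loss(images, matched_captions))
print("swapped:", contrastive_loss(images, swapped_captions))
```

Correctly matched pairs give a loss close to zero, while swapped captions are penalized heavily. Seen this way, adaptation data matters because it defines which image-text pairings the model is taught to treat as matches.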

Principled training budgets for low-resource languages

Adapting a model to a low-resource language involves a series of interlocking tradeoffs: how many epochs to train, how to allocate compute across language-specific versus multilingual data, and how to stage training phases. Until recently, these decisions were largely empirical, with practitioners tuning by trial and error. The M³ scaling law work addresses this directly, extending classical compute-optimal training analysis to cover multi-epoch, multi-lingual, and multi-stage settings. The goal is principled budget allocation rather than guesswork, which becomes especially valuable when compute resources for a given language community are limited.

The practical implication is that researchers working on, say, a Central Asian or Indigenous language model can now reason more systematically about where to spend their training budget, rather than inheriting defaults calibrated for English-scale data.
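
To see why English-scale defaults transfer poorly, it helps to run the classical arithmetic on a small corpus. The sketch below uses the widely cited approximation C ≈ 6ND (C is training compute in FLOPs, N the parameter count, D the number of training tokens) together with the rule of thumb of roughly 20 tokens per parameter from earlier compute-optimal work. It is not the M³ law itself, and both the compute budget and the 50-million-token corpus are hypothetical.

```python
import math


def classical_compute_optimal(compute_flops, tokens_per_param=20.0):
    """Split a FLOP budget into (parameters, tokens).

    Assumes C ~= 6 * N * D with D = tokens_per_param * N, i.e. the
    classical single-epoch rule of thumb, not the M^3 scaling law.
    """
    params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    tokens = tokens_per_param * params
    return params, tokens


def implied_epochs(target_tokens, unique_tokens):
    """How many passes over the corpus are needed to reach target_tokens."""
    return target_tokens / unique_tokens


budget_flops = 1e18          # hypothetical compute budget
corpus_tokens = 50_000_000   # hypothetical low-resource corpus size

params, tokens = classical_compute_optimal(budget_flops)
epochs = implied_epochs(tokens, corpus_tokens)

print(f"parameters: {params:.3e}")
print(f"training tokens: {tokens:.3e}")
print(f"passes over the corpus: {epochs:.1f}")
```

Under these assumptions the classical recipe asks for far more tokens than the corpus contains, so the data would have to be repeated for dozens of epochs. A single-epoch analysis does not describe that regime, which is presumably why the multi-epoch extension matters in low-resource settings.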

Fine-tuning on a shoestring: LoRA vs QLoRA for Bashkir

Parameter-efficient fine-tuning has become the standard approach for adapting large models to specific domains or languages without full retraining. But empirical comparisons across architectures and languages are still sparse, especially for agglutinative languages with rich morphology. A recent comparative study fills part of that gap (Adapting Large Language Models to a Low-Resource Agglutinative Language: A Comparative Study of LoRA and QLoRA for Bashkir).

Bashkir is a Turkic language spoken primarily in the Russian Federation, with an agglutinative morphology that creates tokenization challenges for models designed around Indo-European patterns. The study evaluates LoRA and QLoRA across several architectures, including DistilGPT2, GPT-2 variants, Phi-2, and Qwen2.5-7B, on a 46.9-million-token Bashkir corpus. The results provide concrete, architecture-specific guidance for practitioners working on similar low-resource adaptation tasks. Because Bashkir shares its agglutinative structure with other Turkic languages, it is a reasonable proxy for them, and the findings reach beyond a single language community.
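
The trade-off between the two methods is easier to see with rough numbers. The sketch below counts trainable parameters for a LoRA adapter on a single weight matrix and estimates the memory needed just to hold the frozen base weights at 16-bit versus 4-bit precision, which is the main saving QLoRA targets. The 4096 x 4096 layer, the rank of 8, and the nominal 7 billion parameter count are illustrative assumptions rather than figures from the study, and the estimate ignores activations, optimizer state, and quantization overhead.

```python
def full_finetune_params(d_in, d_out):
    """Trainable parameters when the whole weight matrix is updated."""
    return d_in * d_out


def lora_trainable_params(d_in, d_out, rank):
    """Trainable parameters for a LoRA adapter on one weight matrix.

    LoRA freezes W (d_out x d_in) and learns two small matrices,
    B (d_out x rank) and A (rank x d_in), whose product is added to W.
    """
    return rank * (d_in + d_out)


def base_weight_gb(num_params, bits_per_param):
    """Memory for the frozen base weights only, in gigabytes (1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Hypothetical layer shape and adapter rank.
d_in, d_out, rank = 4096, 4096, 8
full = full_finetune_params(d_in, d_out)
lora = lora_trainable_params(d_in, d_out, rank)
print(f"full fine-tuning: {full:,} trainable parameters in this layer")
print(f"LoRA (rank {rank}): {lora:,} trainable parameters ({100 * lora / full:.2f}%)")

# Nominal 7-billion-parameter base model (an assumption for illustration).
nominal_params = 7_000_000_000
print(f"16-bit base weights (LoRA):  {base_weight_gb(nominal_params, 16):.1f} GB")
print(f"4-bit base weights (QLoRA):  {base_weight_gb(nominal_params, 4):.1f} GB")
```

Both methods train the same small adapter; what changes is how cheaply the frozen model can be kept in memory. Whether the 4-bit base costs anything in output quality for a given architecture is an empirical question of the kind the Bashkir study examines.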

Infrastructure before models

Before fine-tuning or adaptation is even possible, a language needs basic NLP infrastructure: tokenizers, text normalization pipelines, and sentence segmentation. Tajik, written in Cyrillic script, has been almost entirely without publicly available tooling. TajikNLP, an open-source Python library, provides the first comprehensive processing pipeline for authentic Tajik text, preserving the original orthography rather than transliterating or simplifying it (TajikNLP: An Open-Source Toolkit for Comprehensive Text Processing of Tajik (Cyrillic Script)).

This kind of infrastructure work is unglamorous but cumulative. Every downstream model, dataset, and application for Tajik now has a foundation to build on that didn’t exist before. The pattern repeats across dozens of languages: the gap isn’t primarily about model architecture; it’s about the absence of the tooling and data that higher-resource languages take for granted.
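
To make the infrastructure layer concrete, the sketch below shows the kind of steps such a pipeline has to cover: Unicode normalization, sentence segmentation, and tokenization that keeps Tajik-specific Cyrillic letters such as ғ, қ, ҳ, ҷ, ӣ, and ӯ intact. It uses only the Python standard library and does not reproduce the TajikNLP API; every function name here is hypothetical, and the regex rules are deliberately naive.

```python
import re
import unicodedata


def normalize_text(text):
    """Compose combining sequences and collapse runs of whitespace.

    NFC turns a base letter plus combining macron (one way to type
    Tajik i-macron or u-macron) into the single precomposed code point.
    """
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()


def split_sentences(text):
    """Naive segmentation: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [part for part in parts if part]


def tokenize(sentence):
    """Split into word tokens and single punctuation marks.

    In Python 3, \\w is Unicode-aware, so Cyrillic letters (including the
    Tajik-specific ones) stay inside word tokens.
    """
    return re.findall(r"\w+|[^\w\s]", sentence)


def process(text):
    """Full toy pipeline: normalize, segment, then tokenize each sentence."""
    return [tokenize(sentence) for sentence in split_sentences(normalize_text(text))]


# The last word is typed with a decomposed i + combining macron (U+0304)
# to show what normalization repairs.
sample = "Ман китоб мехонам.  Ту куҷо мерави\u0304?"
for tokens in process(sample):
    print(tokens)
```

Normalizing first matters here: Python's `\w` does not match combining marks, so without the NFC step the decomposed letter would be split into two tokens. A real toolkit needs far more than this (abbreviations, numerals, quotation conventions), which is what makes a dedicated library valuable.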

The thread connecting all of this

These four efforts sit at different levels of the stack (dataset creation, training theory, fine-tuning methods, and NLP infrastructure), but they’re solving the same underlying problem. Global AI development has a systematic blind spot for languages and cultures that don’t generate large amounts of digitized, English-adjacent text. Closing that gap requires work at every level, and the research community is increasingly treating it as a tractable engineering problem rather than an intractable data problem.