You cannot get enough training data in low-resource languages — 95% of NLP data is in English, Chinese, or European languages

You want to build a customer support chatbot in Swahili for a Kenyan fintech company. You need conversational training data in Swahili. Common Crawl has 50TB of English text and 200MB of Swahili. Wikipedia has 6.7M English articles and 75K Swahili articles. There is no Swahili equivalent of Reddit, Stack Overflow, or the thousands of English forums that provide diverse conversational data. The best available Swahili LLM is a translated, fine-tuned version of an English model that still thinks in English patterns and makes grammatical errors a Swahili speaker notices immediately.

So what? Roughly 7,000 languages are spoken worldwide, and ML works well in about 20 of them. The remaining ~6,980 lack sufficient digital text for training language models. That means 3+ billion people who speak languages like Yoruba, Tagalog, Amharic, Quechua, or Bengali cannot use AI tools in their native language. The 'AI revolution' is an English-first revolution: products built on English LLMs and deployed globally produce outputs that range from awkward to offensive in non-English languages.

Why does this persist? The internet was built in English, and the platforms that generate the most text data (Reddit, Twitter, Wikipedia, GitHub) are English-dominated. Generating synthetic training data in low-resource languages via translation introduces errors and loses cultural context (a round-trip fidelity check is sketched below). Recording and transcribing spoken language, which is how most low-resource languages exist (orally, not digitally), is expensive: $50-100/hour for transcription. Building a usable Swahili dataset from scratch would cost $1-5M; rough arithmetic below shows how that figure falls out.
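A back-of-envelope version of that cost estimate. Every input here is an assumption chosen for illustration (target corpus size, conversational speaking rate), except the $50-100/hour transcription rate quoted above:

```python
# Back-of-envelope for the $1-5M figure. Inputs are illustrative assumptions,
# not measured values.
def dataset_cost(target_words=50_000_000,  # ~50M words; small by LLM standards
                 words_per_hour=7_000,     # rough rate for conversational speech
                 rate_low=50, rate_high=100):
    """Hours of audio needed, and transcription cost at $50 and $100/hour."""
    hours = target_words / words_per_hour
    return hours, hours * rate_low, hours * rate_high

hours, low, high = dataset_cost()
print(f"{hours:,.0f} hours of audio -> ${low:,.0f} to ${high:,.0f}")
# ~7,143 hours -> ~$357,143 to ~$714,286 for transcription alone.
# Recording sessions, speaker compensation, QA, and annotation multiply this,
# which is how you land in the $1-5M range.
```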
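On the translation route specifically, one way to make the error problem concrete is a round-trip fidelity check: translate English to Swahili and back, and discard pairs where too little survives. A minimal sketch, assuming the Helsinki-NLP OPUS-MT en-sw and sw-en checkpoints exist on the Hugging Face Hub under the names used below, and an illustrative (not validated) chrF cutoff of 55:

```python
from transformers import MarianMTModel, MarianTokenizer
from sacrebleu.metrics import CHRF

def load_pair(name):
    tok = MarianTokenizer.from_pretrained(name)
    model = MarianMTModel.from_pretrained(name)
    return tok, model

def translate(texts, tok, model):
    batch = tok(texts, return_tensors="pt", padding=True, truncation=True)
    out = model.generate(**batch, max_new_tokens=128)
    return tok.batch_decode(out, skip_special_tokens=True)

en_sw = load_pair("Helsinki-NLP/opus-mt-en-sw")  # assumed checkpoint name
sw_en = load_pair("Helsinki-NLP/opus-mt-sw-en")  # assumed checkpoint name
chrf = CHRF()

def filter_synthetic(english_sentences, threshold=55.0):
    """Keep (swahili, english) pairs whose round trip scores above threshold."""
    swahili = translate(english_sentences, *en_sw)
    back = translate(swahili, *sw_en)
    kept = []
    for src, sw, rt in zip(english_sentences, swahili, back):
        # sentence_score takes a hypothesis and a list of references
        if chrf.sentence_score(rt, [src]).score >= threshold:
            kept.append((sw, src))
    return kept
```

Note the limitation: a model pair can round-trip its own systematic errors, so this filters out noise but does not certify fluency, and it does nothing for the lost cultural context.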

Evidence

Common Crawl language distribution: English 45%, Chinese 5%, German/French/Japanese 3-4% each, Swahili <0.01%. Wikipedia: 6.7M English articles vs 75K Swahili. The FLORES-200 benchmark shows LLM translation quality dropping 30-50% on low-resource languages. The Masakhane project is building African NLP, but its resources are limited. Lacuna Fund provides grants for low-resource language datasets, which itself confirms the gap.
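The Common Crawl numbers are the kind of thing you can spot-check yourself: run fastText's published lid.176.bin language-ID model over a sample of WET (extracted-text) records and tally the predictions. A minimal sketch, assuming the fasttext and warcio packages are installed and the model file has been downloaded from fasttext.cc; the WET path is a placeholder to fill in from a crawl's wet.paths listing:

```python
from collections import Counter
import fasttext
from warcio.archiveiterator import ArchiveIterator

model = fasttext.load_model("lid.176.bin")  # fastText 176-language ID model
counts = Counter()

with open("CC-MAIN-....warc.wet.gz", "rb") as stream:  # placeholder path
    for record in ArchiveIterator(stream):
        if record.rec_type != "conversion":  # WET text records
            continue
        text = record.content_stream().read().decode("utf-8", "replace")
        text = text.replace("\n", " ")[:2000]  # fastText expects one line
        if not text.strip():
            continue
        (label,), _ = model.predict(text)
        counts[label.removeprefix("__label__")] += 1

total = sum(counts.values())
for lang, n in counts.most_common(10):
    print(f"{lang}: {100 * n / total:.2f}%")
```

A real measurement runs this across thousands of WET files per crawl, but a single file is enough to reproduce the shape of the distribution: English dominates, and Swahili barely registers.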
