The Multiplex

Checking in on multilingual large language models

Motivation

In 2025 I had a lot of success using tools like Claude, ChatGPT and Gemini to augment my language learning system. I wanted to look into how things have improved for language learners since mass LLM adoption began a few years ago.

GPT-3's training dataset consisted of roughly 93% English-language text, with the remaining 7% covering all other languages. Reading that, my first thought was that LLMs could not have been great language-learning aides when ChatGPT first launched on GPT-3.5 in late 2022. How has multilingual performance improved since then?

Glimpse of Then vs Now

GPT-3.5 to GPT-5

GPT-4 performed significantly better on 3-shot MMLU across nearly all tested languages than GPT-3.5 did in English alone. GPT-5's system card shows that both gpt-5-main and gpt-5-thinking improve significantly over GPT-4 on 0-shot MMLU.
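As a quick illustration of the "k-shot" terminology above (the function and example questions here are my own, not from any benchmark): a 0-shot prompt asks the question directly, while a 3-shot prompt prepends three solved examples before the real question.

```python
def k_shot_prompt(examples, question, k=3):
    """Build a k-shot prompt from (question, answer) example pairs.

    With k=0 this degenerates to a plain 0-shot question.
    """
    shots = [f"Q: {q}\nA: {a}" for q, a in examples[:k]]
    return "\n\n".join(shots + [f"Q: {question}\nA:"])

# Toy demonstration pairs, purely illustrative.
demos = [
    ("2 + 2 = ?", "4"),
    ("Capital of France?", "Paris"),
    ("H2O is commonly called?", "water"),
]

print(k_shot_prompt(demos, "3 * 3 = ?", k=3))  # three solved examples, then the question
print(k_shot_prompt(demos, "3 * 3 = ?", k=0))  # 0-shot: just "Q: 3 * 3 = ?\nA:"
```

The extra examples give the model a pattern to imitate, which is why few-shot scores for older models often ran well above their 0-shot scores.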

Global PIQA Performance

Global PIQA is a commonsense reasoning benchmark covering over 100 languages. It feels more rigorous for evaluating multilingual performance than machine-translating English MMLU into a target language and evaluating on the translation. Below are scores for a set of state-of-the-art models as reported by Google.

Model | Global PIQA
Gemini 3 Flash Thinking | 92.8%
Gemini 3 Pro Thinking | 93.4%
Gemini 2.5 Flash Thinking | 90.2%
Gemini 2.5 Pro Thinking | 91.5%
Claude Sonnet 4.5 Thinking | 90.1%
GPT-5.2 Extra-high | 91.2%
Grok 4.1 Fast Reasoning | 85.6%

Source: https://deepmind.google/models/gemini/
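For readers unfamiliar with how PIQA-style multiple-choice benchmarks are typically scored: harnesses usually ask the model for a log-likelihood of each candidate solution given the question, and count the item correct if the gold answer scores highest. The sketch below shows that mechanic; `toy_ll` is a hypothetical stand-in for a real model call (a real harness would sum the model's token log-probabilities), and the two items are invented examples, not Global PIQA data.

```python
def pick_answer(goal, solutions, ll):
    """Return the index of the solution the model scores highest."""
    scores = [ll(goal, s) for s in solutions]
    return max(range(len(scores)), key=scores.__getitem__)

def accuracy(items, ll):
    """Fraction of items where the top-scoring solution is the gold label."""
    correct = sum(
        pick_answer(it["goal"], it["solutions"], ll) == it["label"]
        for it in items
    )
    return correct / len(items)

def toy_ll(goal, continuation):
    # Placeholder scorer: prefers shorter continuations, purely so the
    # sketch runs without a model. Not a real likelihood.
    return -len(continuation)

# Invented PIQA-style items, for illustration only.
items = [
    {"goal": "To open a jar, ",
     "solutions": ["twist the lid", "twist the lid with a hammer"],
     "label": 0},
    {"goal": "To dry wet shoes, ",
     "solutions": ["stuff them with newspaper", "soak them"],
     "label": 0},
]

print(accuracy(items, toy_ll))  # → 0.5
```

Random guessing on two-choice items yields 50%, which is why the scores in the table above should be read against that floor rather than against 0%.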

Conclusion

It's safe to say that a language learner leveraging LLMs today is better positioned to achieve their goals than the same learner in the ancient history of November 2022. Of course, the same is true of anybody using LLMs for any task. Perhaps the better takeaway is this: if you tried a language-learning app in 2022 and weren't impressed with its AI features, check back in and see how things have improved.

Note: Low-resource languages are, unsurprisingly, the main open problem in model multilingualism today. Indeed, the data I've encountered suggests that the performance gap between two languages tracks the gap in how well each is represented in training data.

#ai #language-learning #llms