Wikilangs
NLP models for 340+ languages
Wikilangs provides NLP bedrock models (word embeddings, tokenizers, and base models) for 340+ languages, all derived from Wikipedia. It enables NLP research and applications for languages that lack resources from major providers.
Related posts
Why I stopped trusting the official Wikipedia dataset, and what I did about it
It all started with a DM from a friend, a member of and contributor to the Moroccan Wikipedia community: "Are you using the current version of Wikipedia? The official dataset is severely outdated. We added so many cool articles nowhere on huggingface." He was right. I was running a 2023 snapshot in 2025.
A Wordle for the Worldle
I built a word game for more than 300 languages, each drawing on its own Wikipedia as the source. Here's the thing nobody tells you: building a simple word game for most of these languages meant building things that didn't exist.