Dialectal Language Models: Challenges and Opportunities

In recent years, however, there has been a surge of new bilingual and language-specific large language models (LLMs), such as CroissantLLM for French, Jais for Arabic, and Japanese Stable LM for Japanese. These specialized models tend to outperform general-purpose multilingual models on their target languages. This trend raises an intriguing question: will future models master regional dialects as well?

In 2020, Muller et al. demonstrated that mBERT (multilingual BERT, pretrained on Wikipedia in 104 languages) successfully transferred to NArabizi, a North African Arabic dialect written in Latin script and used widely online, even though it was not included in the pretraining corpus. But what if we want to build a language model actually pretrained on dialectal text?
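As a rough illustration of what such an experiment could look like, here is a minimal sketch of continued masked-language-model pretraining of mBERT on a dialectal corpus using the Hugging Face `transformers` and `datasets` libraries. The corpus file (`dialect_corpus.txt`), output directory, and all hyperparameters below are illustrative placeholders, not values from any of the papers mentioned above.

```python
# Minimal sketch: continued MLM pretraining of mBERT on dialectal text.
# The corpus path and hyperparameters are placeholders for illustration.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "bert-base-multilingual-cased"  # the mBERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Hypothetical corpus: one dialectal sentence per line.
dataset = load_dataset("text", data_files={"train": "dialect_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

# Randomly mask 15% of tokens: the standard BERT pretraining objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="mbert-dialect",
    per_device_train_batch_size=16,
    num_train_epochs=3,
    learning_rate=5e-5,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```

Starting from the multilingual checkpoint rather than training from scratch lets the model reuse whatever cross-lingual representations it already has, which matters when dialectal corpora are small.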