Unifying Arabic topolects through AI
Meet Habibi – the Chinese AI uniting 20 Arabic dialects in a Middle East first
Lead author says there are many differences between Arabic dialects and Modern Standard Arabic, which is used in official circumstances
Zhao Ziwen, SCMP, 28 Feb 2026
The paper that presents this new model is called “Habibi: Laying the Open-Source Foundation of Unified-Dialectal Arabic Speech Synthesis”. It was posted last month on arXiv, an open-access repository whose submissions are not peer-reviewed. I will be interested to hear what Language Log readers think of its prospects.
Led by Shanghai Jiao Tong University’s X-LANCE Lab – one of China’s top audiovisual and language processing research entities – the model is named Habibi, meaning “my dear” in Arabic.
In presenting their findings, the research team spearheaded by Chen Yushen described the project in a paper as “the first open-source framework for unified-dialectal Arabic speech synthesis”.
They introduce a concept that is new to me: "zero-shot".
Habibi has a “zero-shot” ability, meaning the model can clone a voice from just a short reference audio clip, without prior explicit or extensive training. This enables highly efficient, on-the-fly applications.
According to Wikipedia,
Zero-shot learning (ZSL) is a problem setup in deep learning where, at test time, a learner observes samples from classes which were not observed during training, and needs to predict the class that they belong to. The name is a play on words based on the earlier concept of one-shot learning, in which classification can be learned from only one, or a few, examples.
Zero-shot methods generally work by associating observed and non-observed classes through some form of auxiliary information, which encodes observable distinguishing properties of objects. For example, given a set of images of animals to be classified, along with auxiliary textual descriptions of what animals look like, an artificial intelligence model which has been trained to recognize horses, but has never been given a zebra, can still recognize a zebra when it also knows that zebras look like striped horses. This problem is widely studied in computer vision, natural language processing, and machine perception.
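The Wikipedia passage can be sketched in miniature. The toy example below is my own illustration, not anything from the Habibi paper: the class names and attribute vectors are invented, and a real system would get the observed attributes from a trained encoder rather than by hand. The point is only that auxiliary descriptions ("a zebra is a striped horse") let a classifier label a class it never saw in training.

```python
import math

# Hypothetical attribute descriptions: (striped, four_legged, has_mane).
# These are the "auxiliary information": zebra was never seen in
# training, but it is described as a striped horse.
class_attributes = {
    "horse": [0.0, 1.0, 1.0],   # seen during training
    "fish":  [0.0, 0.0, 0.0],   # seen during training
    "zebra": [1.0, 1.0, 1.0],   # never seen; "a striped horse"
}

def cosine(a, b):
    """Cosine similarity between two attribute vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def zero_shot_classify(observed):
    """Pick the class whose description best matches the observation."""
    return max(class_attributes,
               key=lambda c: cosine(class_attributes[c], observed))

# An image encoder (not shown here) might score a zebra photo as
# strongly striped, four-legged, and maned:
print(zero_shot_classify([0.9, 1.0, 0.8]))  # → zebra
```

The same associate-through-descriptions idea is what lets a multilingual or multidialectal TTS model generalize to a voice, or a dialect, it was never explicitly trained to reproduce.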
Selected readings
- "LLMs and tree-structuring" (9/18/25)
- "Radial dendrograms" (7/26/23)
- "Language trees and script trees" (12/27/21)
- "AMI not AGI?" (8/2/25)
Addendum
In case you're interested, "Habibi" itself is an Arabic word worth learning in one of its 20-plus topolects: Syrian, Egyptian, Jordanian, Levantine…. Because of its wide range of meanings, nuances, and usages, be careful about how, when, and to whom you use it.
Listen here.
[Thanks to Mark Metcalf]

David Marjanović said,
February 28, 2026 @ 4:20 pm
So… it can read aloud in any of 20 Arabic topolects plus the standard, as desired? Is that what "unite" means?
Jarek Weckwerth said,
February 28, 2026 @ 5:18 pm
My question exactly. I haven't read the paper because the link leads to a 404 error page. But the only interpretation I can think of (from the mention of using just a short sample for "cloning") is that it will be multidialectal: the same "voice" will be used for multiple dialects. This is what is done in the newest TTS systems which are increasingly multilingual.
In principle it's not possible to speak multiple dialects at the same time if the dialectal forms differ. Just like it's not possible to speak multiple languages at the same time.
Victor Mair said,
February 28, 2026 @ 7:13 pm
Try this: https://www.scmp.com/news/china/article/3344955/meet-habibi-chinese-ai-uniting-20-arabic-dialects-middle-east-first
Works for me.
Peter Cyrus said,
March 1, 2026 @ 6:19 am
Trying to imagine where this is heading…
The world is losing languages, and dialects are homogenizing, despite a much larger population. The force driving this is clearly the need to communicate with speakers of other languages/dialects, more of them and further away. Catalans study English not only to speak to Brits, but also to speak to Germans and Japanese.
But what if we lived in a world in which anything you said or wrote could be translated instantly into any listener's/reader's native tongue, maybe even in his mother's voice? Would this centripetal force reverse itself? Would we end up with 8 billion idiolects?