Mandarin is weirder than Cantonese

So says idibon.

Beijing Cream took the hint and ran with it: "Cantonese, Which Sounds Like A Jackhammer Mating With A Chainsaw, Is Apparently Less 'Weird' Than Mandarin".

When I first read these sensationalistic claims, I stood back, took a deep breath, and said to myself, "Wait a minute! There are lots of people (mostly Mandarin speakers!) who swear that Mandarin is the most pleasant sounding of all the Sinitic languages." Just what is it that has led idibon to declare Cantonese to be less weird than Mandarin?

To assure myself that idibon was not a bunch of crackpots, I went to their homepage and found that they are actually a serious NLP outfit.  Their management and advisors include recognized linguists who are associated with Stanford University, others who have worked for Google and Yahoo, and leaders in crowdsourcing, a word I didn't know until I wrote this post.  Chris Walker worked at LDC (here at Penn) for several years.  Ben Zimmer is familiar with the work of Tyler Schnoebelen and first mentioned him on Language Log in Oct. 2011, referring to his NWAV paper about affective patterns on Twitter, and Ben also covered his 2012 NWAV paper, presenting his research with David Bamman and Jacob Eisenstein on gender and Twitter.

Let's take a look at the research report (or, rather, blog post) that spawned all the wild headlines.  Mark Liberman blogged about this "weirdest languages" piece, both explaining and critiquing it.  Rather than repeat what Mark has said, I encourage readers to go to his post, and also to look at the lively discussion which followed it in the comments.

For a full list of the 21 weirdness features and all of the languages that had values for at least one of them, click on the link at the bottom of the idibon post.

For those who are in a hurry, I'll mention just a couple of things:

1. In this context, "weird" means roughly "has linguistic features that are unlike those of most other languages".

2. On a scale that measures 21 distinctive features, Mandarin came out among the 25 weirdest languages, whereas Cantonese was among the 10 least weird.

Put THAT in your pipe and smoke it!

But an even bigger surprise than that for me was finding Hungarian near the bottom of the weirdness scale, since I have recently returned from Budapest where I found the language to be singularly opaque, quite unlike my experience when travelling almost anywhere else on the globe.

[Thanks to Kaiser Kuo and Anne Henochowicz]



  1. Tyler Schnoebelen said,

    August 14, 2013 @ 9:41 pm

    The first folks in the Chinese press to contact us were from the South China Morning Post in HK. This got me to look more carefully at the Chinese data. Here I'll excerpt what I wrote to the journalist (since I think it was all a bit technical, it didn't really make it in the article):

    First an overview of the "weird" features of Mandarin.
    An example of a "uvular continuant" in Mandarin would be something like 和. Mandarin is one of only 12/567 languages that have a uvular sound but only as a continuant (basically a continuant has continuing airflow). Cantonese doesn't have any uvular consonants at all (in that way it's like 468 of the 567 languages that WALS has data for about this feature).

    [My understanding since writing this is that it is sometimes (often?) velar.]

    Like English, there is "no initial velar nasal in Mandarin". 88 out of 469 languages are like this–where they have a velar nasal, but you can't put it at the beginning of a word. Cantonese is like the majority of the world's languages (468/567) in that it is happy having velar nasals anywhere in a word. What is an initial velar nasal? A good example is the Cantonese pronunciation of 我.

    So these are really the features that make the biggest difference in distinguishing Mandarin and Cantonese on the weirdness scale. Remember that I limited myself to the data that was in WALS. But because of you and your readers: Cantonese has just got weirder.

    Each researcher in WALS took some particular area–like "sound patterns"–and tried to get as many different languages coded as possible. So the person who coded Mandarin and Cantonese for uvular continuants is not going to be the same person who is coding stuff about syntax.

    Looking at Mandarin and Cantonese more closely, I see that Cantonese is missing values for a couple features that make Mandarin particularly weird:
    - In Mandarin, you can put a pronoun in the same place you'd put a full noun phrase, but you can also drop it. That's rare in the world's languages–only 61 of 711 languages surveyed do this. The value for Cantonese was missing. I believe Cantonese lets you drop pronouns. So let's fill in the blank as "the same as Mandarin".

    - If you want to make a "causative construction" (like "I made her read the book") in Mandarin, you use compounding. (Lots of languages, like Japanese, have you change the verbs themselves). Only 9 of 310 languages surveyed do compounding like Mandarin. But as far as I know, Cantonese also does this. So it should be 10 out of 311 languages.

    Mandarin had a weirdness index of 0.79; the updates now drop it just a tiny bit, to 0.78. And Cantonese USED to have a weirdness index of only 0.14. But now, because we've added the stuff about pronouns and causatives, Cantonese has a score of about 0.66. You weirded Cantonese!

    This is of course a major limitation of the work–there are lots of holes in the data. There's lots of work to be done in linguistics. Some of the opportunities are to build better language technologies and that's relevant for all sizes of languages. There's also a lot of work to be done preserving languages that are much smaller than Cantonese. In addition to the personal aspects of what it means to not be able to speak the language of your ancestors, you can also think of the data loss: if we care about understanding the human brain, understanding all the various configurations of human languages is important.

    (Very special thanks to Rob Voigt who is one of our interns here this summer and speaks several Chinese languages and helped me fill in the blanks and provide examples!)

  2. JS said,

    August 14, 2013 @ 10:37 pm

    So apparently the uvular fricative /χ/ was key to Mandarin's "weirdness"… but what is really weird is this analysis; I can't fathom why /χ/ should be preferred to /x/, which matches /k/, /k'/ in addition simply to being more accurate.

  3. Simon P said,

    August 14, 2013 @ 11:59 pm

    Mandarin phonetics sure are weirder than Cantonese, but the grammars are so similar that I was very surprised to read this. Tyler's comment above explains some of the discrepancy. It was surprising to me that Cantonese loses weirdness by allowing initial "ng". Who knew that's so common?

    The Beijing Cream piece is of course tongue in cheek, but I must say I find Cantonese much more pleasant to the ears than Mandarin, which sounds a bit "whiny" to me.

  4. Daniel said,

    August 15, 2013 @ 12:03 am

    Isn't it now common for initial velar nasals to be dropped in casual spoken Cantonese? According to Wikipedia, the beginnings of this change were documented in Hong Kong as early as 1856, and it has also been spreading to other Cantonese-speaking regions. Initial velar nasals are still retained in careful speech, but perhaps it is only a matter of time before Cantonese speakers end up speaking a language just as "weird" as Mandarin.

  5. NC said,

    August 15, 2013 @ 1:35 am

    The 'Weirdness' could be due to China trying to standardize 'Mandarin.' There are many dialects of Mandarin from Guizhou to Shanghai to Beijing, with Beijing dialect being the official language. Cantonese, though, is another language on its own, with its own dialects, from Toishanese to Hakka to Singapore slang-la. :-)

  6. Victor Mair said,

    August 15, 2013 @ 6:49 am


    You have some good points there, but Hakka is another language, and Singapore slang is "something else".

  7. Sybil said,

    August 15, 2013 @ 7:04 am

    You're just now encountering "crowdsource"? You've got to get out more! (Said as an alleged Early Adopter who can't bring herself to create a Facebook account.)

  8. Victor Mair said,

    August 15, 2013 @ 8:08 am


    At least I know what "cloud computing" is — sort of.

  9. Ran Ari-Gur said,

    August 15, 2013 @ 10:37 am

    > Cantonese is like the majority of the world's languages (468/567) in that it is happy having velar nasals anywhere in a word.

    I think you made a mistake here. In WALS' sample, only 146/469 allowed initial velar nasals: to be sure, this is more than the 88/469 that have velar nasals but don't allow them in initial position, but it's far from a majority. And the commentary explicitly indicates that there are languages that allow initial velar nasals but not final ones (even though they permit final consonants in general), so not even 146/469 are "happy having velar nasals anywhere in a word".


  10. Ran Ari-Gur said,

    August 15, 2013 @ 10:37 am

    (Sorry, I should have specified: my previous comment is @Tyler Schnoebelen.)

  11. Neil Dolinger said,

    August 15, 2013 @ 12:08 pm

    "- If you want to make a "causative construction" (like "I made her read the book") in Mandarin, you use compounding. (Lots of languages, like Japanese, have you change the verbs themselves). Only 9 of 310 languages surveyed do compounding like Mandarin. But as far as I know, Cantonese also does this. So it should be 10 out of 311 languages."

    Maybe I am misunderstanding your point, but wouldn't this be the case for all analytic languages? And since analytics are the norm throughout most of east Asia, this would seem to lower the "weird" score for both Mandarin and Cantonese quite a bit, no?

  12. JS said,

    August 15, 2013 @ 1:28 pm

    ^ Plus Mandarin doesn't use compounding to say things like "I made her read the book" (Talmy's "caused agency") anyway. The "compounding" Tyler is referring to must be the Mand. V1-V2 constructions that encode caused events in the form agent action + caused event (da3-sui4 'hit-shatter', etc.)? (Without checking, my sense is that these have developed from forms in which the semantic patient sat between the two verbs, where now it tends to be preposed with ba3.) This (1) is a very particular type of "causative"; (2) may not be "compounding"; (3) doesn't seem to me to be at all rare in SEA.

  13. leoboiko said,

    August 15, 2013 @ 1:55 pm

    > finding Hungarian near the bottom of the weirdness scale, since I have recently returned from Budapest where I found the language to be singularly opaque, quite unlike my experience when travelling almost anywhere else on the globe.

    I wonder why is that?

  14. Teddybeer said,

    August 15, 2013 @ 6:23 pm

    I must say Cantonese isn't the most pleasant language to my ears, although I'm quite okay with Mandarin. I still don't understand why Cantonese isn't weird.

  15. Deck Zech said,

    August 15, 2013 @ 6:39 pm

    According to "The World Atlas of Language Structures", 235 out of 469 languages have no velar nasal at all, 146 out of 469 have an initial velar nasal, and 88 out of 469 have no initial velar nasal. It also says "the other striking aspect of phonemic ŋ is its phonotactic distribution: in many languages possessing this sound, it may not appear in all positions in the word, but rather is restricted to initial, medial, or final position, or some combination thereof. In the case of restriction of ŋ to non-initial position, this, too, has a relatively pronounced areal skewing among the world’s languages."

    So is the velar nasal that only appears in the final position more weird than otherwise?
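
The three WALS counts quoted in this comment can be checked with a few lines of Python. This is only a sketch: the category labels and the helper name `share` are made up for illustration, and the numbers are the ones cited in the comment, not a fresh pull from WALS.

```python
from fractions import Fraction

# Counts as quoted in the comment above (WALS velar nasal feature).
# The category labels are illustrative, not WALS's own value names.
counts = {
    "no velar nasal": 235,
    "initial velar nasal": 146,
    "velar nasal, but not initial": 88,
}

total = sum(counts.values())


def share(category: str) -> Fraction:
    """Exact share of the sample that falls in one category (hypothetical helper)."""
    return Fraction(counts[category], total)


for name, n in counts.items():
    print(f"{name}: {n}/{total} = {float(share(name)):.1%}")

# A category is a majority only if it covers more than half of the sample.
majority = [name for name in counts if share(name) > Fraction(1, 2)]
print("majority categories:", majority)
```

On these figures the three categories add up to the full sample of 469, languages with an initial velar nasal make up a bit under a third of it, and the only category that clears one half (barely) is the one with no velar nasal at all.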

  16. Simon P said,

    August 15, 2013 @ 11:40 pm

    @Daniel: Yeah, initial 'ng' is commonly dropped in speech in Hong Kong. Initial 'n' is also usually replaced with 'l'. This is how most people who have not been trained to "speak properly" (newsreaders and actors) speak, including people who swear they'd never use "lazy sounds". But the "standard" is still to use the initial 'ng' and 'n', even though very few people do it (and most who do hypercorrect). There are, however, lots of dialects outside of HK where the 'ng' and 'n' initials are still used.

    I guess if we measure HK "street" Canto, which doesn't allow initial 'ng' or 'n', but allows them at the end of syllables, its weirdness score goes up quite a bit? Mandarin allows initial 'n', after all, so maybe Canto would win out?

  17. michael farris said,

    August 16, 2013 @ 1:51 am

    "finding Hungarian near the bottom of the weirdness scale, since I have recently returned from Budapest where I found the language to be singularly opaque, quite unlike my experience when travelling almost anywhere else on the globe"

    "I wonder why is that?"

    Not an expert on Hungarian, but a few years ago at my Hungarian peak I could read genre literature and follow the main points of the plot (missing small details but getting the big picture). And I can still speak survival Hungarian and I visit the country as often as I can.

    Basically there's no one thing that's especially weird in global terms but there are a lot of things that you wouldn't expect in a language geographically completely surrounded by Germanic, Romance and Slavic.

    Also, add to that the fact that even if you're reasonably fluent in a Romance, a non-English Germanic and a Slavic language (as I was when I first visited Hungary), there's not a lot of vocabulary similarity in public signs (unlike Romania, where learning a few simple Romance sound conversion rules makes many public signs understandable). The etymological connections are there, especially with Slavic, but they're not immediately obvious.

    Add to that a lackadaisical attitude towards foreign language learning beyond the most basic communicative level (officially English is the most commonly taught foreign language, but German is maybe more widely understood away from the touristy parts).

    Finally, add big doses of Central European abruptness (which can be startling and intimidating to most North Americans) and some lingering communist-era apathy towards 'the public' in the public sector, and Hungary can be a hard nut to crack communicatively.

  18. Deck Zech said,

    August 16, 2013 @ 2:59 am

    @Simon P

    Although Hong Kong people refer to this phenomenon as "lazy sound" (懶音), I really think these changes should be called "sound changes" instead.

  19. Neil Kubler said,

    August 16, 2013 @ 6:32 am

    Though I don't particularly like comparing the relative "weirdness" of different languages, I will point out that Cantonese — like English, German, French, Spanish, Italian (and also like several other Southern Chinese languages and dialects) — can employ the verb "have" as an auxiliary to indicate completed aspect/perfect tense, as in Keuih yauh heui "She has gone" (in questions, negatives, and statements). Mandarin can do this for negatives (Ta meiyou qu) and sometimes for questions (Ta you meiyou qu? — for which older Northerners would prefer Ta qule meiyou? or Ta qule mei qu?), but traditionally standard Mandarin does not allow "have" as an auxiliary in statements. So Cantonese is in this respect closer to English and other European languages, but Mandarin is "weird"! (Addendum: In the last 5 years, I've noticed that some speakers in their teens or early twenties in Beijing [including junior Chinese language teachers!] will utter sentences like "Wo you qu" and "Tamen you chi", which very much grate on their elders' ears. This is probably at least in part due to the perceived "coolness" of Hong Kong and Taiwanese culture, especially songs [in Taiwanese and Taiwan Mandarin use of you as auxiliary is extremely common]. Whether this usage, which is certainly not yet fully accepted in Beijing, will eventually take root and become part of standard Mandarin is still unclear…)

  20. Simon P said,

    August 16, 2013 @ 6:34 am

    I agree. I called them "lazy sounds" because I was paraphrasing people I've talked to. I myself don't call them that. It's one of those psychological things. Everyone does it, but nobody thinks they do it.

  21. Victor Mair said,

    August 16, 2013 @ 11:34 am

    @michael farris

    You nailed it! Said it better than I could have.

  22. 康邁克 said,

    August 17, 2013 @ 3:06 pm

    I commented on this study with friends a couple of weeks ago, and the conclusion was that it was a weird study. I did notice the linguists involved, but I feel the variables tested were lacking. Just wondering where Southern Min came in, as it should beat Mandarin, and wondering if languages like Atayal or Cou were considered. I find Cou extremely weird for pronunciation and its particles infuriating at times; then there are the clusters of Atayal in words like tmbqlit, pqnqihan, rrgyax, lmnnglung. I find Atayal easier to pronounce than Cou despite the clusters, but I find it even more clustering than Kartuli.
