Prosody and "elastic words" in Chinese

During my recent visit to Michigan, San Duanmu told me about some really neat work that he published last year as "Word-length preferences in Chinese: a corpus study", Journal of East Asian Linguistics 21.1: 89-114, 2012.

San starts from the observation that many Chinese words have two forms, a two-syllable form and a one-syllable form, whose meanings are more or less the same. For example:

煤炭 (méi tàn) 煤 (méi) ‘coal’
學習 (xué xí) 學 (xué) ‘to learn; study’
工人 (gōng rén) 工 (gōng) ‘worker’
商店 (shāng diàn) 店 (diàn) ‘store’
老虎 (lǎo hǔ) 虎 (hǔ) ‘tiger’
印度 (Yìn dù) 印 (Yìn) ‘India’

In "How many Chinese words have elastic length?", in Feng Shi and Gang Peng, Eds., Festschrift in honor of Prof. William S-Y. Wang's 80th birthday, 2011, San found that

80%-90% of all Chinese words have elastic length. In addition, the percentage for verbs is higher than the average, and the percentage for nouns is higher still.

And he observes that

The long form may look like a compound, but it is not. For example, the long form of hu ‘tiger’ is laohu, which literally means ‘old tiger’. However, laohu simply means ‘tiger’, not ‘old tiger’, because even a baby tiger can be called laohu. Similarly, the long form of mei ‘coal’ is meitan, which literally means ‘coal-charcoal’ but actually means ‘coal’, not ‘coal and charcoal’.

But the one-syllable and two-syllable versions are not equally likely to be used in all contexts. Specifically, in a closely-associated two-word sequence, there are four logical possibilities:

2+2  2+1  1+2  1+1

Based on a quantitative analysis of the Lancaster Corpus of Mandarin Chinese, San's study found that

1+2 is overwhelmingly disfavored in [N N] and 2+1 is overwhelmingly disfavored in [V O]. In addition, it is found that apparent exceptions, ranging between 1% and 2%, are limited to certain specific structures, and when these are factored out, both 1+2 [N N] and 2+1 [V O] are well below 1% in either token count or type count.

Others have noted these regularities, though San's paper is the first to support the intuitions with systematic corpus-based counts.

A non-corpus-based illustration: in combining méi (tàn) "coal" with (shāng) diàn "store" to make a noun+noun combination meaning "coal store", we could have méi tàn + shāng diàn = 2+2, or méi tàn + diàn = 2+1, or méi + diàn = 1+1, but NOT méi + shāng diàn = 1+2.

And in combining zhòng (zhí) "to plant" with (dà) suàn "garlic", to make a verb+object combination meaning "to plant garlic", we could have zhòng zhí + dà suàn = 2+2, or zhòng + dà suàn = 1+2, or zhòng + suàn = 1+1, but NOT zhòng zhí + suàn = 2+1.

So to repeat, [N N] can be anything except 1+2, while [V O] can be anything but 2+1.
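
The two restrictions can be stated compactly in code. What follows is a minimal Python sketch, not anything taken from San's paper: the names `DISFAVORED`, `allowed_patterns`, and `is_disfavored` are invented here for illustration, and each word is represented simply as a tuple of pinyin syllables.

```python
from itertools import product

# Disfavored (first word, second word) syllable counts, as summarized above.
# The keys "NN" and "VO" are labels chosen for this sketch.
DISFAVORED = {"NN": (1, 2), "VO": (2, 1)}


def allowed_patterns(structure):
    """Return the length patterns that are not disfavored for a structure."""
    banned = DISFAVORED[structure]
    return [p for p in product((2, 1), repeat=2) if p != banned]


def is_disfavored(structure, first, second):
    """Check a two-word sequence, each word given as a tuple of syllables."""
    return (len(first), len(second)) == DISFAVORED[structure]


print(allowed_patterns("NN"))  # [(2, 2), (2, 1), (1, 1)]
print(allowed_patterns("VO"))  # [(2, 2), (1, 2), (1, 1)]

# The two starred combinations from the illustrations above.
print(is_disfavored("NN", ("méi",), ("shāng", "diàn")))   # True
print(is_disfavored("VO", ("zhòng", "zhí"), ("suàn",)))  # True
```

This only encodes the generalization; it says nothing about the exceptions or about why the asymmetry holds.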

Why?

Some people have argued that this is a prosodic restriction: basically, in an [N N] structure, the first N should be stronger than (and hence not smaller than) the second one; but in a [V O] structure, the object should be stronger than the verb. Others have argued for a syntactic or semantic treatment.

San observes that there are several reasons why a careful study of usage is helpful in answering these questions. For example, there are quite a few exceptions, such as 皮手套 pí shǒutào "leather glove" or 喜歡錢 xǐhuan qián "love money". Then there's the question of the relative frequency of the preferred patterns (2+2, 2+1, 1+1 for [N N]; 2+2, 1+2, 1+1 for [V N]). And there are cases where different patterns exist for the same phrase, with slightly different meanings. And so on.

Here are a couple of graphs from the paper:

For San's explanation of the exceptions as well as the rules, you should read his paper, which is quite accessible for outsiders.

Corpus-based studies of this kind are becoming more frequent, for theoretical as well as practical reasons. I have a modest suggestion for making them more useful, which is to publish the details of the underlying annotations as well as the summary statistics and the resulting conclusions.

San does something like this in part, by adding an appendix that lists all of the 45 exceptional 1+2 [N N] examples and all of the 56 exceptional 2+1 [V O] examples from the Lancaster corpus.  But he doesn't provide their detailed locations in the corpus, nor the identity and locations of the (much larger number of) regular cases.

The annotation of the regular cases is non-trivial, since the Lancaster corpus provides no annotation of syntactic bracketing. San had to find all the instances of noun-noun and verb-noun sequences not separated by punctuation, and then screen (a 10% sample of) these manually to determine the "error" rates for different subcases. His final numbers were determined by applying the resulting proportions to the total string-based counts.
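
As I read it, the extrapolation step amounts to simple scaling: for each subcase, the share of manually screened candidates that turn out to be genuine is applied to the full string-based count. Here is a minimal Python sketch of that arithmetic. The function name `estimate_genuine` is invented for this illustration, and the numbers are made up purely to show the calculation; they are not figures from the paper.

```python
def estimate_genuine(total_candidates, sample_size, genuine_in_sample):
    """Scale a string-based count by the proportion judged genuine in a sample.

    total_candidates:  all string matches for one subcase (e.g. 2+1 noun-noun)
    sample_size:       how many of those matches were screened by hand
    genuine_in_sample: how many screened matches were real [N N] or [V O]
    """
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    if not 0 <= genuine_in_sample <= sample_size:
        raise ValueError("genuine_in_sample must lie between 0 and sample_size")
    return total_candidates * genuine_in_sample / sample_size


# Invented numbers: 1000 string matches, 100 screened, 90 judged genuine.
print(estimate_genuine(1000, 100, 90))  # 900.0
```

The estimate is only as good as the sample, which is one more reason to want the screened items themselves to be published.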

Other researchers may well want to check or extend this annotation, or to look at other aspects of the differentiation among length patterns, in this or other collections of Chinese text. In doing so, they should be able to start from an exact documentation of San's annotations.

Obviously, it wouldn't make sense to offer such data as a traditional "printed" appendix (even if no one ever actually printed it out). But it would be easy to define a form of stand-off annotation that would encode exactly which cases were classified in which way; and the results could be published as "supplementary materials".

If such documentation became common, then it could be interpreted by interactive programs for browsing, annotating, or analysing corpora, just as people now use Praat TextGrids (or other common annotation formats) in the case of audio collections.
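
To make the suggestion concrete, here is one way such stand-off records might look, sketched in Python and serialized as one JSON object per line. Everything about the format is an assumption made for illustration: the class `LengthAnnotation`, its field names, and the file identifier in the example are hypothetical, and none of it comes from San's paper or from the Lancaster corpus documentation.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class LengthAnnotation:
    file_id: str      # hypothetical identifier of a corpus file
    start_token: int  # index of the first token of the two-word sequence
    end_token: int    # index just past the last token (exclusive)
    structure: str    # "NN" or "VO"
    pattern: str      # length pattern, e.g. "2+1"
    genuine: bool     # annotator's judgment: real structure or false hit


def to_jsonl(records):
    """Serialize annotations as one JSON object per line."""
    return "\n".join(json.dumps(asdict(r), ensure_ascii=False) for r in records)


def from_jsonl(text):
    """Read annotations back from the one-object-per-line form."""
    return [LengthAnnotation(**json.loads(line))
            for line in text.splitlines() if line.strip()]


# An invented record pointing at tokens 17-18 of a hypothetical file.
example = [LengthAnnotation("file_001", 17, 19, "NN", "1+2", True)]
print(to_jsonl(example))
```

Because the records point into the corpus rather than copying text from it, they could be distributed separately from the corpus itself and merged with it by anyone who has access.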



30 Comments

  1. Bob Ladd said,

    May 13, 2013 @ 7:45 am

    Is the preference for 1+1 related to the very common practice of creating 1+1 abbreviations for 2+2 compounds? Example: Bei+Jing Da+Xue 'north-capital big-school' (i.e. Beijing University) is normally referred to as Bei+Da [I don't know how to get tone diacritics in basic HTML markup]. This is not like the cases discussed by San Duanmu, because you can't just use Bei to mean 'Beijing' or Da to mean 'University'. I've always assumed this was related to the writing system, because the same technique is used to create abbreviated names in Japanese (e.g. Too+Kyoo Dai+Gaku 'east-capital big-school', i.e. Tokyo University, is normally called Too+Dai). But perhaps instead (or in addition), there's an interesting relation to whatever it is that explains the more general preference for 1+1 and the more general avoidance of 1+2 or 2+1.

  2. J.W. Brewer said,

    May 13, 2013 @ 8:05 am

    To Bob Ladd's point, that sort of "syllabic abbreviation" was historically not uncommon in German and also I believe Russian (the standard Godwin's-Law-invoking example is Geheime Staats-Polizei -> Gestapo, but there are many others; a Soviet example would be Gosudarstvenniy Komitet po Planirovaniyu -> Gosplan). Whatever was going on in German/Russian was presumably not a side effect of a kanji-based writing system . . .

    [(myl) Similar things are also encountered in English, e.g. in referring to academic programs or courses (SocSci, CogSci, BioChem, …). I've always assumed that this is just the consequence of verbalizing convenient written abbreviations.]

  3. J.W. Brewer said,

    May 13, 2013 @ 8:12 am

    That sort of thing is much less common in English but one can find it as one of the common patterns for short-forms for institutions of higher education in the U.S., e.g. UConn, UMass, or Wash U (the one in St. Louis). Also Cal Tech. But since it's MIT not Mass Tech, I'm not sure if there's a single rule that explains which of the available abbreviation patterns is used when.

  4. Brett said,

    May 13, 2013 @ 9:02 am

    @J.W. Brewer: I think Mark is right that the academic examples (whether for courses of study or whole academic institutions) come from "verbalizing convenient written abbreviations." The one prominent example of American English usage of this type that springs to mind is the naming of certain New York neighborhoods. However, that kind of naming seems to be intentionally affected and has produced some pop cultural parodies (e.g. in How I Met Your Mother, where there is a DOwnWInd from the SEwage TREatment PLAnt neighborhood).

  5. Michael Watts said,

    May 13, 2013 @ 10:15 am

    This phenomenon doesn't make me think of the one the other commenters are noting (SoHo for SOuth of HOuston). It makes me think of the (in my gut) much more common English phenomenon of producing a word by combining parts of two other words. Given a week or so, I could probably come up with an example I liked more, but a reference to people wearing Google Glass as "glassholes" seems more typical. It feels more natural-to-the-spoken-language and less this-is-how-we-write-it-so-I'm-going-to-say-it-this-way-too.

  6. Michael Watts said,

    May 13, 2013 @ 11:06 am

    I want to add some substance to my prior comment. I feel two things distinguish the two English phenomena I mention: (a) the one that doesn't seem parallel to what's discussed in the post produces names, while the one that does produces words (this is a weak distinction); (b) SoHo is not transparent to the native-but-unfamiliar speaker as to what the source words were, whereas in what I'm trying to describe, both source words should be transparent to someone who hasn't heard the spliced word before.

  7. J.W. Brewer said,

    May 13, 2013 @ 11:07 am

    Looking at Bob Ladd's point from a different angle, maybe there is a writing system aspect just because hanzi doesn't (at least to my naive outsider's eyes) really lend itself to initials-as-such or initialisms. In English there's a choice of approaches between e.g. MIT and Mass Tech, or CIT and Cal Tech, which gets resolved differently with respect to those two institutions, but in hanzi the former option I suppose isn't really available. Now, in Japanese, you can always transliterate kanji (which often have polysyllabic readings, unlike what I take to be the usual case in Mandarin) into kana and abbreviate down to the first/"initial" kana, but IIRC the kana do not have standard "names" you could use for initialisms that are distinct from their pronunciations, so the sort of ambiguity where the "U" in "UMass" could either be the letter U pronounced as it would be in an initialism or simply the standard pronunciation of "university" clipped down to its initial syllable becomes universal.

  8. JS said,

    May 13, 2013 @ 11:45 am

    The thing is that with such elasticity (one of a pair of alternating word forms seeming to be selected based on syntactic considerations), what looks like "abbreviation" appears no more often than what looks like just the opposite — e.g., in the case of Yìn 印 ~ 印度 Yìndù 'India', the former could easily be considered an abbreviation, whereas with hǔ 虎 ~ lǎohǔ 老虎 ‘tiger’, it is rather the latter that feels like an "expansion" to fulfill particular prosodic needs.

    At a quick read-through I don't see that Duanmu (surname, incidentally) mentions it, but one could certainly think about certain kinds of "suffixation" in Mandarin, as of zi 子 to monosyllabic nouns, as a means of producing just such disyllabic alternates: so we have bān zhuō​zi 搬桌子 'move [a] table' but xuéxí zhuō 学习桌 'desk' (lit. "study table"), for instance…

    A question I had was what sorts of things are being counted as 1+1 N-N combinations, because while the graphs and stats create the impression of a strong preference for this shape (as mentioned by Bob Ladd), the interesting phenomenon and that focused on by the author is really the striking differential tendency towards 2+1 vs. 1+2 in N-N vs. V-N collocations.

  9. JS said,

    May 13, 2013 @ 11:57 am

    It just struck me that San Duanmu's interest in this topic is surely rooted in his possessing, precisely opposite to the norm, a disyllabic surname and a monosyllabic given name…

  10. David Eddyshaw said,

    May 13, 2013 @ 1:19 pm

    Classical Tibetan does the Gestapo/Kolkhoz clip thing extensively, for yet another example of a language where the motivation is obviously not traditional Chinese writing.

    A relatively familiar example is pan-tshen "Great Scholar" for pandita tshenpo (sorry, can't do diacritics on this machine.)

    Lots of others, eg byan-sems for byan-tshub sems-dpa "bodhisattva"

  11. Bob Ladd said,

    May 13, 2013 @ 3:57 pm

    @JWBrewer and MYL: I don't know about Russian, but in German this method of creating abbreviations of long compounds is indeed productive. It's not just a case of "convenient written abbreviations", though: it's almost invariably based on the onset + vowel of the first syllable of the words to be abbreviated. The 1999 solar eclipse (Sonnenfinsternis, lit. 'sun darkness', Sonnen + Finsternis) was informally abbreviated Sofi or SoFi. This kind of abbreviation is common enough to have its own name (Silbenwort, lit. 'syllable word') in descriptions of German grammar (brief discussion if you read German). This is not like English, where the relation between source word and abbreviation is not regular (e.g. economics can give either ec or econ) and where the pronunciation can be based on the source or on the abbreviation (e.g. sociology can give either [soS] or [sak]). It's also not like Chinese, where almost invariably we're dealing with a 2+2 source and a 1+1 abbreviation based on the first characters of the two 2-character components. (E.g. Di-Xia Tie-Dao 'ground-under iron-road', i.e. 'metro' or 'subway' (Tie-Dao is the regular term for 'railway'), which is normally referred to even in writing as Di-Tie.)

    @JS: You're absolutely correct that Duanmu's paper is primarily about the prohibition on 1+2 or 2+1. My comment was simply to wonder whether the prevalence of 1+1 – which emerges from Duanmu's data even though it is not his main point – is related to this common method of abbreviating longer compounds.

  12. Bob Ladd said,

    May 13, 2013 @ 4:04 pm

    @David Eddyshaw: The Classical Tibetan examples you cite are much more like the Chinese/Japanese pattern than the German/Russian examples. This suggests that the point in my original comment was correct: it's not (as I had always thought) just a matter of the writing system, but something specific (and prosodic?) about taking the pattern (s1 s2)(s3 s4) and shortening it to (s1 s3). Chinese influence on Tibet and Japan makes the pattern spread, even if the writing system doesn't spread with it (as it didn't in Tibet). Does any reader know if this method of creating abbreviations is also found in Korean and/or Vietnamese, which also show extensive Chinese influence?

  13. leoboiko said,

    May 13, 2013 @ 4:13 pm

    Is the elasticity related to Classical/Literary Chinese (wényán) at all? I mean, are the one-syllable words often the same as a Classical word, or are these different one-syllable words?

  14. David Eddyshaw said,

    May 13, 2013 @ 5:26 pm

    @Bob Ladd: although there certainly is Chinese influence in Classical Tibetan, I wonder if the shared frequency of the pattern might not be at least partly due to structural similarity, both languages having words made up overwhelmingly of monosyllabic morphemes combining much more transparently and freely with one another than morphemes do in most languages; the same is of course true of the Chinese component in Japanese, unsurprisingly.

    To test whether this is predominantly a cultural phenomenon, you'd want to look at a language structurally like these, but outside the cultural Sinosphere. None immediately comes to mind …

    That may very well just reflect the fact that I'm only familiar with a very narrow subset of languages … on the other hand, this sort of structure does seem to be an areal phenomenon, so the structural similarity in itself could *be* the Chinese influence. At this point my brain is beginning to hurt …

  15. David Eddyshaw said,

    May 13, 2013 @ 5:44 pm

    I may have found a counterexample, though my complete ignorance of the language in question means this is, to say the least, uncertain:

    Judging by "A Reference Grammar of Thai", Iwasaki and Ingkaphirom, the language seems to have lots of monosyllabic morphemes happily combining away transparently, but the section on "clipping" seems to consist entirely of what are really just abbreviations, like khoom for khoomphiwtee "computer" (once again, sorry no IPA or diacritics) and nothing even like "Gestapo."

    If this really is enough to show that Chinese-style clips aren't just a function of the relatively easy segmentability of Chinese words, then I suppose that does lend support to your thesis that this is actually a matter of direct Chinese influence.

  16. JS said,

    May 13, 2013 @ 6:03 pm

    @Bob Ladd
    My (relatively uninformed) impression is that this method of abbreviation is extremely prevalent in Vietnamese, Việt cộng 'Vietnamese communist [party]' ( s1-s3 is indifferent to the word classes which are Duanmu's focus). As far as examples that could, I can't seem to do any better than nán​cè 男厕 'men's bathroom' at the moment, if that even qualifies…

  17. JS said,

    May 13, 2013 @ 6:05 pm

    ^woops; mark-up issues

    @Bob Ladd
    My (relatively uninformed) impression is that this method of abbreviation is extremely prevalent in Vietnamese, Việt cộng 'Vietnamese communist [party]' (from Việt Nam cộng sản) being one of the more salient examples.

    … I don't know; just from casual consideration it doesn't seem that such abbreviation would much contribute to the class of 1+1 N-Ns. Běi​ Dà 北大, at least, certainly wouldn't be so analyzed (illustrating that s1s2-s3s4 to s1-s3 is indifferent to the word classes which are Duanmu's focus). As far as examples that could, I can't seem to do any better than nán​cè 男厕 'men's bathroom' at the moment, if that even qualifies…

  18. mollymooly said,

    May 14, 2013 @ 8:34 am

    "syllabic abbreviations" in English used to be (and sometimes still are) included under "acronyms". This makes less sense if initialisms/alphabetisms are included under "acronyms". NATO can pattern with either NaBisCo or NBC but not both.

  19. GeorgeW said,

    May 14, 2013 @ 8:39 am

    In the small set of two and one syllable words above, in some instances, the first syllable occurs in the one-syllable form, in others, it is the second syllable.

    What is the rule? Semantic?

    [(myl) According to Yan Dong, it's "informativity". See the link for what this means.]

  20. Jerry Friedman said,

    May 14, 2013 @ 4:27 pm

    Another source of "syllabic abbreviations" in English is the U. S. government—for example, "BUPERS" (Bureau of Naval Personnel).

  21. julie lee said,

    May 14, 2013 @ 5:35 pm

    My favorite U.S. govt. abbreviation is "Sosh" for "Social Security number". Whenever I call a U.S. govt. office about some personal business, the person on the phone will always ask "What's your sosh?"

  22. Bathrobe said,

    May 14, 2013 @ 7:30 pm

    An exception not listed in the paper is 煤码头 méi-mǎtóu 'coal terminal', although 煤炭码头 méitàn-mǎtóu seems to be more common.

  23. Bathrobe said,

    May 14, 2013 @ 7:37 pm

    @ Bob Ladd

    You don't need HTML markup for diacritics, you just need to be able to input them on your computer. On a Mac it is fairly easy (except for ǚ, which requires looking up the Characters panel); I have no idea about Windows.

  24. zythophile said,

    May 15, 2013 @ 5:57 am

    The lack of such "syllabic abbreviations" in English, such that they sound distinctly "unEnglish", is presumably exactly why Orwell parodied them in 1984's Newspeak, eg IngSoc, thinkpol.

  25. Bathrobe said,

    May 15, 2013 @ 8:06 pm

    Are they totally unEnglish?

    For instance I remember calling our PE classes Phys Ed at school. It's the running together that makes those words look unEnglish.

  26. JS said,

    May 15, 2013 @ 9:24 pm

    Reading through comments and the perplexing "informativity" link I still feel perversely driven to emphasize that, except in the rather exceptional cases of abbreviation proper pointed to by Bob Ladd, single Mandarin syllables aren't being "selected" from within disyllables to stand proxy for them elsewhere (that is, crisis still ≠ danger + opportunity); instead, "selection" is between mono- and disyllabic variants enjoying equal status in the synchronic lexicon (with the monosyllable very often having historical precedence)… apologies from a paranoid Sinologist if this is all obvious.

  27. leoboiko said,

    May 16, 2013 @ 8:35 am

    @JS: Is it the case that the monosyllables are very often earlier? At least in the case of the Classical lexicon (and I really don't know to what extent, if any, it is related to the present monosyllabic words), Mair has pointed to the possibility that polysyllabic words could be prior ("Where we can test specific instances in the early stages of the formation of Literary Sinitic, it seems to be the result of drastic truncation of Vernacular Sinitic" […])

  28. JS said,

    May 16, 2013 @ 2:56 pm

    @leoboiko: There is a range of viewpoints regarding to what extent, say, Warring States-era texts represent a contemporary vernacular; Prof. Mair would see the literary-vernacular divide as relatively stark throughout the historical period. Nevertheless, it is generally (in my view!) difficult to escape the conclusion that characters such as 學 and 習 above at first represented independent (but not necessarily monosyllabic or morphologically simplex!) words — to an approximation, 'learn, imitate' [v.] and 'put into practice' [v.]; both appear in the well-known Analects passage — with serial usages ('learn and practice') only later leading to lexicalization as a compound word written 學習. So, it now happens that the Modern Mandarin (close) synonyms xué 學 'learn, study', all along very much a word in its own right, and xuéxí 學習 'study' enter into some of the interesting alternating behaviors Prof. Duanmu draws attention to… but the former is in no sense being "selected" from the latter to "stand for" it based on relative "informativity" … or anything else. (Disclaimer: My two cents!)

  29. Victor Mair said,

    May 20, 2013 @ 8:56 am

    @JS

    There were a lot of disyllabic words in Old Sinitic, but I've never thought that xuéxí 學習 ("study") was one of them.

    As for how far back the divide between Literary / Classical and Vernacular (not necessarily what we now call "Mandarin") goes, I'll be making a whole series of Language Log posts on this subject in the coming weeks.

  30. Bathrobe said,

    May 21, 2013 @ 1:43 am

    "As for how far back the divide between Literary / Classical and Vernacular (not necessarily what we now call "Mandarin") goes, I'll be making a whole series of Language Log posts on this subject in the coming weeks."

    Looking forward to it!
