Pinyin in subtitles


The Chinese says:

忧劳夙夜 时用遘疾

"worrying and toiling day and night while having fallen ill"

The supposedly obscure character, 遘, is pronounced gòu and means "meet; encounter". It is indeed fairly rare: no. 7177 on a list of the 9,933 most frequent characters. (The total number of existing characters amounts to roughly 100,000, around three quarters of which are useless; few people know more than three thousand characters; basic literacy requires approximately one thousand characters.)

If it happened once, it must have happened many times, and — because of the nature of the writing system — I'm sure it will happen again.  If you have no clue how to pronounce a character or what it means, you need a phonetic and / or semantic gloss.




[h.t. Geoff Wade]


  1. AntC said,

    March 30, 2022 @ 8:56 pm

    遘, is pronounced gòu and means "meet; encounter" … indeed fairly rare … few people know more than three thousand characters;

    GT has "meet unexpectedly". Is its appearance here a kind of fixed phrase? GT has 遘疾 = "sickness". So 'suddenly/unexpectedly fallen ill'?

    I would have guessed you fairly often would want to say "meet; encounter", and "fall ill". So is there a more common way to say that amongst the 'standard issue' three thousand characters?

  2. Richard Warmington said,

    March 30, 2022 @ 9:07 pm

    The subtitles are written using simplified characters. 遘 is included in the simplified character set. It is number 7475 in China's 通用规范汉字表 (Table of Standard Characters), but it is unchanged from its traditional form.

    As a traditional character, 遘 is part of a phonological series (冓鞲溝篝耩遘覯媾搆購構講). This series consists of characters that share a phonetic component, 冓. These characters are mostly pronounced gōu or gòu, the exceptions being 耩 and 講, both pronounced jiǎng. So if the subtitles were written in traditional characters, it's likely that the pinyin would be unnecessary. Viewers could guess the pronunciation of 遘 because of their familiarity with other characters in the same series. Although 講 (jiǎng) is the most common character in the phonological series, there are several other characters in the series that are almost as common (溝, 購 and 構), and these are pronounced gōu (溝) or gòu (購 and 構).

    The reason the pronunciation of 遘 is hard to guess for viewers who know only simplified characters is that these three traditional characters (溝, 購 and 構) had their original phonetic component (冓) replaced with a different one (勾) in the simplification process. After simplification, 溝, 購 and 構 became 沟, 购 and 构. Thus, the phonetic connection between 遘 and the other three (溝, 購 and 構) became obscured.

  3. Richard Warmington said,

    March 30, 2022 @ 9:27 pm

    @ AntC: Yes, of course there are ways of saying "meet" or "encounter" that don't use a rare character like 遘. For example, 遇 (yù) is a pretty common character with a similar meaning (encounter). It's used in words like 遇險 yùxiǎn (to be in danger, literally encounter danger) and 遇救 yùjiù (to be rescued, literally encounter rescue).

  4. Jonathan Smith said,

    March 30, 2022 @ 10:44 pm

    The pinyin thing seems odd to me as hearing audience members at least are literally hearing gou4 as the line is read…

  5. Ben said,

    March 30, 2022 @ 11:07 pm

    Before the picture loaded, I thought the tweet meant to say the pinyin was used *in place of* the character. I see this happen countless times every day, in captions accompanying semi-legal online films as well as douyin/tiktok videos. Since most people type in pinyin (and in haste, in the contexts described), it is quite common for people to fail to select the character, leaving behind only the typed letters.

  6. Victor Mair said,

    March 31, 2022 @ 8:56 am

    Jonathan Smith "further wonders, can typical audience members even understand wenyanwen expressions like gou4 ji2 'meet with affliction' without the aid of subtitles?"

    I think in most cases not. As I've mentioned many times on Language Log and elsewhere, Literary Sinitic / Classical Chinese is fundamentally not sayable.

  7. David said,

    March 31, 2022 @ 9:41 am

    For the background of those who do not regularly work with these languages, one might note that Japanese has 3(+) writing systems in simultaneous daily use, an approach that easily allows Japanese texts to gloss the pronunciation of a potentially unfamiliar glyph for readers, whereas in Chinese the use of pinyin (romanization) is perhaps the only practical solution (most Chinese, at least in urban areas, have basic familiarity with pinyin).

  8. Chris Button said,

    March 31, 2022 @ 12:05 pm

    @ Richard Warmington

    The 冓 and 講 relationship is indeed an interesting one. The -ŋʷʔ ending of 講 in OC is already a phonotactic challenge by combining a velar nasal with a glottal (albeit one with quite a few violations). An alternation of -ɣ with -ŋ is attested elsewhere, but the -w of 冓 seems to extend it to -w and -ŋʷ. In pre-OC, -w and -ŋʷ came from w_ɣ and w_ŋ combinations (thereby explaining why velar codas weren’t subject to disruption like the counterparts at other places of articulation when all those rounded vowel hypotheses came along—I still haven’t figured out why adherents to the rounded and front vowel hypotheses don’t pay attention to that distributional issue), but that seems to be taking the alternation back too far beyond OC. I imagine that the relationship of 講 is ultimately with something like 共 and related words, but the semantic and graphic (in terms of the concept depicted) overlap allowed for the use of 冓 as phonetic since -w/-ŋʷ correlates with -ɣ/-ŋ around the presence or lack of a labial feature even if the labial alternation is not a (somewhat) regular alternation.

  9. Terpomo said,

    March 31, 2022 @ 12:32 pm

    Victor, I'd question your assumption that it's fundamentally not sayable. It's elliptical, sure, to the point that if read in Mandarin there may be issues with homophony. But in more conservative readings, there's enough information for intelligibility; I've given texts in Middle Chinese transliteration to a friend who's familiar with Classical Chinese and Middle Chinese, and he's generally reconstructed the text near-perfectly but for variants and the occasional expression he's not familiar with. I also have a friend from Hong Kong who reports that classmates from his school will sometimes converse in Classical Chinese for a lark. There's also the fact that many Classical texts are originally oral in nature; for instance the Airs of the States, from the book of Poetry, are originally folk songs, though perhaps polished a bit in the process of setting them down on paper.
    "few people know more than three thousand characters; basic literacy requires approximately one thousand characters"- I'd also question this claim. Most literate native speakers I know know significantly more than that, though perhaps I run with an unusually erudite crowd. I know a girl who's tested as knowing around nine thousand, at least passively.

  10. Terpomo said,

    March 31, 2022 @ 12:35 pm

    Oh, and I forgot to mention, a user on reddit has claimed:
    "If you just want anecdotes, I can speak to Mandarin at least. A lot of people educated in CC can understand plain, modern-leaning texts read aloud. Even some simple passages from something like Shiji are understandable with practice and training. Some people are just naturally better at this sort of task than others. I have a friend who can listen to Six dynasties poetry read aloud and usually transcribe it near-perfectly (minus some obscure variant characters). She's particularly gifted in this sort of thing, though, and insanely well-read."
    If they're not lying- and I don't see why they would be- it seems it is possible. That said, they also say:
    "Okay, but reading comprehension only goes so far if you don't have practice listening. You don't have time to pause, consider, and/or look at commentary if the pace is natural speech. There are/were some learning environments where recitation of texts and "blind" listening are things that people do, but it is not the norm. If you are tying to understand something like Zuozhuan or Shangshu read aloud, I think ambiguities in pronunciation are only going to be about half the battle no matter if it is Mandarin or a more conservative Sinophone language."
    But I think that applies to difficult or context-heavy texts even in a modern spoken language.

  11. Victor Mair said,

    March 31, 2022 @ 3:34 pm

    Terpomo's remarks are couched in such anecdotal, contingent and speculative, hedging language ("I'd question" [twice], "generally… but for", "a friend… who reports that classmates… sometimes converse in Classical Chinese for a lark", "a user on reddit has claimed", "I think", "it seems", etc., etc.) that they do not carry much weight and are not at all convincing in his attempt to prove that Classical Chinese / Literary Sinitic is sayable.

    He admits that "It's elliptical, sure, to the point that if read in Mandarin there may be issues with homophony." Indeed! I.e., not sayable.

    He argues that some ancient texts were "originally oral in nature", but what have they become now after thousands of years of redaction and transmission? I've written dozens of posts, papers, and books in which I demonstrate that what may have started out as vernacular was either edited out or revised into literary / classical language. ("perhaps polished a bit in the process of setting them down on paper" — to say the least!)

    "Most literate native speakers I know know significantly more than that [VHM: three thousand characters], though perhaps I run with an unusually erudite crowd." Speaks for itself. In my half a century and more of Chinese language studies, I've known thousands upon thousands of individuals from all walks of life. Only a few intellectuals actively command more than three thousand characters. I've written papers and posts showing, surprisingly, that most of the classical canon and premodern literary texts (e.g., the corpora of Tang poets) were written with a lexicon of between roughly two and three thousand characters.

    You also have to take into account the reality of character amnesia, a serious problem that we have addressed dozens of times on Language Log. It is all too common nowadays for people to forget how to write even such simple characters as those for "egg" and "shrimp".

    Then there's the frequent mixing in of pinyin and English with characters when one forgets how to write the latter. The amount of such admixture is growing by leaps and bounds as people more and more rely on electronic devices to write — using Pinyin for inputting.

    I don't know a single soul who has mastered all 9,000 or so characters of the Xīnhuá Zìdiǎn 新华字典 (Xinhua Dictionary) (more than 13,000 if you count traditional forms and variants), and I have been to conferences of lexicographers in China where linguists admit that even mastery of six thousand or so characters in that dictionary is well-nigh unattainable for most mortals. Those are specialists saying this, mind you.

    C. C. Cheng, emeritus professor of computational linguistics at the University of Illinois, estimates that the human lexicon has a de facto storage limit of 8,000 lexical items (referred to in n. 12 on p. 301 of Jerry Packard's The Morphology of Chinese: A Linguistic and Cognitive Approach [Cambridge University Press, 2000]). Here we're talking about words, not characters. Cheng also holds that the human cognitive capacity for Chinese characters is around 3,000 to 3,500 characters. QED at the beginning of the o.p.

    Since this has already gone on too long, I'll just close by putting it the way I always do: Classical Chinese / Literary Sinitic cannot be used for spontaneous, unrehearsed conversation. A tiny number of highly learned scholars can "converse" in snippets memorized from famous texts of the past, but that's only a tour de force game for a handful of academics, not a means for practical communication in daily life.

    For the population as a whole, Classical Chinese / Literary Sinitic is not sayable.

  12. Jerry Packard said,

    April 1, 2022 @ 10:48 am


    " …"few people know more than three thousand characters; basic literacy requires approximately one thousand characters"- I'd also question this claim. Most literate native speakers I know know significantly more than that…"

    Children in China by the time they have finished elementary school (grades 1-6) have already been taught about 2,570 characters.

    Source: Shu, H., Chen, X., Anderson, R.C., Wu, N. and Xuan, Y. (2003). Properties of School Chinese: Implications for Learning to Read. Child Development, 74, 27-47.

  13. Jonathan Smith said,

    April 3, 2022 @ 1:18 pm

    My own thought above was specifically about pseudo-classical in modern TV etc. — perhaps much is formulaic enough for subtitle-free understanding, but probably some is not…

    re: terpomo's anecdotes — all are entirely believable, though re: the question of vocalizing earlier texts with Mandarin "readings," they make the point even clearer that such performances approach "comprehensibility" only under the most constrained of circumstances…

    re: verbalizing earlier texts in some more conservative system like reconstructed "MC" — of course this becomes far more reasonable; there is further no reason to doubt that early texts (esp. "OC"-era, but even say highly classicized medieval literary language) write language which was in the general case understandable when vocalized…

    re: Cheng's "storage limit of 8,000 lexical items," don't know context here but it's hard to credit: "Reasonably conservative estimates from studies that have attempted to use a sound methodology (Goulden, Nation, & Read, 1990; Zechmeister, Chronis, Cull, D’Anna, & Healy, 1995) indicate that well-educated native speakers know around 20,000 word-families (excluding proper names and transparently derived forms)." (Nation 2006 "How large a vocabulary is needed for reading and listening?")

    — maybe the problem is comparing knowledge of "characters" to actual (passive or active) vocabulary size… absolutely classic apples and oranges, though assumption of some equivalence is disturbingly common even in academic discussions when it comes to Chinese…

  14. John Swindle said,

    April 3, 2022 @ 3:31 pm

    Knowledge of characters vs actual (passive or active) vocabulary size, sure, but knowledge of characters isn't unambiguous either. As Professor Mair's occasional posts about character amnesia remind us, there are degrees of knowing characters.
