Pinyin to Hanzi Two Way Conversions


Apollo Wu, who was a long-term translator at United Nations headquarters, sent me the following note:

Dear Victor,

I wish to acquire a language tool for two way conversions between Pinyin and Hanzi texts. Do you know if any do exist?  I sometimes write Pinyin texts and want to convert them to characters for some Chinese readers who are not familiar with Pinyin.

Best!

Apollo

Here's how I replied to Apollo:

Dear Apollo,

I don't know of any such tools, nor would I expect that there would be any such tools at this stage in the development of Pinyin as a functional orthography.  Indeed, I doubt that there ever will be such a reliable tool for conversion back and forth between Hanzi and Pinyin texts.  Here are some reasons why:

1. The conventions for representation of tones (or not) are not fixed.

2. Although we do have official rules for Pinyin orthography, they are not all-inclusive / comprehensive, nor are they agreed upon by all who write in Pinyin, and they are still in flux.  As a matter of fact, even for a language like English which has been written with an alphabet for hundreds of years, there are still plenty of areas for disagreement concerning word separation, hyphenation, and so forth.

3. Although it is gradually growing, a community of individuals who regularly write in Pinyin for a variety of purposes barely exists.  Writing in Pinyin is still basically an ad hoc enterprise.  So there are as yet no widely agreed upon conventions for writing Pinyin texts that are shared by a sizable group of people who are committed to them.

4. Above all, you'd need yī duì yī 一对一 ("one-to-one") conversion capability between Hanzi and Pinyin, and that is something I don't think we can ever expect.  I have discussed this with you and others many times.  Fundamentally, there are too many homophones for all the tens of thousands of morphosyllabic Sinographs out there.  There is no way in heaven or hell that we could devise enough special spellings to account for all the proper nouns in contemporary Mandarin, not to mention all the special usages that seep into Mandarin from the topolects, much less the vast number of rare occurrences of characters in historical texts that are still occasionally cited today.  It would just be an absolutely unworkable nightmare to try to devise special spellings for all such low frequency Sinographs.  On the other hand, necessity will require that, as with "Shanxi" and "Shaanxi", very high frequency homophones be provided with distinctive forms in Pinyin (cf. "night" and "knight" in English), but the number of such special spellings will be strictly limited and be governed by pragmatics.

5. Eventually, if Pinyin does become widely used as the preferred medium for writing Mandarin among certain groups, the texts produced in it will be very different from texts produced in Sinographs.  With Sinographic texts, you can expect a considerable amount of visual-semantic input from the script itself, whereas with Pinyin texts, the author must pay much greater attention to oral intelligibility.  People will be able to write beautiful prose and exquisite poetry in Pinyin, but they will have to be conscious of writing in a style that will be phonetically apprehensible.  They would definitely fail as effective authors of Pinyin texts if they consciously or unconsciously relied on the visual-semantic capabilities of the Sinographic script.

6. Once Pinyin does become established as a viable alternative for the writing of Mandarin texts, then it's conceivable that software will become available for the translation of Pinyin texts to character texts, and vice versa, but that's quite a ways in the future.  And I stress that this will be translation, not conversion.

In closing, I'm glad to hear that you still — after several decades of supporting Pinyin — find it useful for direct communication with colleagues who are familiar with it.  I believe this is one powerful, viable avenue for the enlargement of Pinyin as a functional orthography in what I have long referred to as part of an emerging digraphia.  It is happening before our very eyes, and you have played a key role in promoting Pinyin for a variety of purposes during the nearly half a century since I have known you.

best wishes,

Victor

I asked several colleagues if they basically agree with what I have written and whether they had anything to add to or to modify it.

I have devoted so much attention to this now because it is something that Apollo keeps bringing up.  I do have serious misgivings about the possibility of automatic, one-to-one conversion between Pinyin and characters.

David Moser wrote back as follows:

I basically agree with what you say, but I think you are too pessimistic about a tool that will convert Pinyin to characters.  And I don't fully agree with your point no. 6 that the result, if the attempt succeeds, would amount to translation rather than conversion.  Let me explain.   I use all the speech-to-text apps, and while there is still a homophone problem in English (not as bad as Chinese), and occasionally my spoken "to" or "won" or "Moor" etc. get rendered as "two", or "one" or "more", the programs are improving in accuracy by leaps and bounds, mostly because of the vast amounts of "big data" out there now, which can statistically predict which of the homonym pair is more likely, given the lexical context.  Now, part of the reason Google, Baidu, or WeChat fail to render accurate text strings for the spoken input is because the spoken input is flawed.  For example, often when I say "Peking University" the program renders it as "picking university", but that's because I lazily didn't exaggerate the "ee" sound, and when I try again, stressing the "PEEking" syllable, it gets it right.  The difference here is that there is no phonetic ambiguity with Pinyin input.  So I think you can see that this is more than a mere analogy.  I always use speech-to-text also for my WeChat messages in Chinese.  And the accuracy rate of the conversion is already astoundingly good, and getting better every month.  These StoT programs are already pretty darn good, even with slurring, rapid articulation, mistakes and different accents.  And ask yourself: How is speech-to-text in Chinese fundamentally any different from Pinyin-to-text?  If everyone were using Pinyin all the time in texts, short messages and emails, then the predictive text algorithms would start getting better and better, and in a decade or so, the apps could produce Pinyin-to-Hanzi transformations with more than 95% accuracy, just as is the case with English.   
We're definitely not there yet, but I think if there were billions of sentences in Pinyin being typed in to apps every day on the planet, the Googles and Baidus and WeChats would get better at converting the scripts, even with all the dialect and pronunciation messiness.  And although there would always be a higher level of semantic ambiguity in Pinyin texts than Hanzi texts, the result would not be translation exactly, any more than current Hanzi speech-to-text is translation.

In a sense, we already have Pinyin-to-Hanzi conversion.  It's called the Hanzi speech-to-text function.

I understand what David is saying. Indeed, I can partially corroborate it by an experience I had last night on the commuter train heading back home in the evening.

There aren't many people on the train at that hour, so I get to know who the regular riders are. I don't talk to many of them, but I cannot help but observe their habits and activities. One Chinese woman who accompanies her husband (who falls asleep immediately when he gets on the train) spends the whole ride reading her equivalent of Facebook and interacting with friends on WeChat. For years she always did her chats with Pinyin plus occasional fingertip writing on the glass pad of her phone. Sometimes she would get frustrated when she couldn't get the desired character to come up (cf. my description of a professor from Taiwan flailing away at her glass plate in Prague several years ago: "Swype and Voice Recognition for mobile device inputting" [1/22/14]).

Last night, however, I saw the woman on the train talking to her phone. She talked softly, and the phone responded accurately most of the time (even though competing with the din of the wheels on the tracks and other noises on the train). I could see the microphone reception bar in the middle of the glass panel on her phone jumping up and down as she spoke, and the little balloons full of short texts with what she was saying — interspersed green and white — popping up at a fairly good pace. From time to time, she would have to repeat or reword what she was saying to get it to come out right, but overall it kept her going and happy for half an hour.

Now here's the nub and the rub: what she was saying was all very simple, mundane stuff, and I doubt that the total number of characters she called up amounted to more than a few hundred — just daily conversation. Of course, this is all very impressive, and I am in awe of what StoT programs have already achieved. But will they ever be able to express our deepest thoughts and feelings? Will they be able to cope with highly technical prose or historical records? Poetry and fiction? And here when I say StoT programs I mean StoHanzi, which I believe is much, much harder than going from phonetic S to phonetic T in English or Spanish, etc.

Anyway, all of this will work itself out in due course. It's exciting to watch it happening, especially since many of the things involving Romanized text that I predicted more than half a century ago are already becoming a reality.



30 Comments

  1. ycx said,

    December 17, 2019 @ 7:37 am

    I think that one reason why StoT works so well in Chinese is that the algorithms can pick up on the tone of the word being spoken, whereas encoding the tone in writing requires either diacritics or a tone number appended to the word. The vast majority of the pinyin text that I've seen, both online and offline, doesn't include the tone information (and so requires that it be translated, as you've mentioned in point 6).

    I mean, how would a conversion program handle something like https://en.wikipedia.org/wiki/Lion-Eating_Poet_in_the_Stone_Den or https://www.reddit.com/r/ChineseLanguage/comments/dc9tp2/the_story_of_lady_ji_attacking_a_chicken/ ?
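    The two written encodings of tone mentioned above (a tone number appended to the syllable versus a diacritic) can be converted into one another mechanically. The following is only a minimal sketch, assuming lowercase, space-separated syllables, "v" typed for "ü", and 0 or 5 for the neutral tone; the function names `mark_syllable` and `mark_text` are made up for illustration.

```python
# Tone-marked forms of each vowel, indexed by tone number 1-4.
TONE_MARKS = {
    "a": "āáǎà",
    "e": "ēéěè",
    "i": "īíǐì",
    "o": "ōóǒò",
    "u": "ūúǔù",
    "ü": "ǖǘǚǜ",
}


def mark_syllable(syllable):
    """Convert one numbered syllable such as 'hao3' into 'hǎo'."""
    s = syllable.lower().replace("v", "ü")  # assumption: 'v' stands in for 'ü'
    if not s or not s[-1].isdigit():
        return s  # no tone number present: leave the syllable untouched
    body, tone = s[:-1], int(s[-1])
    if tone not in (1, 2, 3, 4):
        return body  # neutral tone (written 0 or 5) carries no mark
    # Standard placement rule: 'a' or 'e' takes the mark; in 'ou' the 'o'
    # takes it; otherwise the mark goes on the last vowel of the syllable.
    if "a" in body:
        idx = body.index("a")
    elif "e" in body:
        idx = body.index("e")
    elif "ou" in body:
        idx = body.index("o")
    else:
        vowels = [i for i, ch in enumerate(body) if ch in TONE_MARKS]
        if not vowels:
            return body  # nothing to put a mark on
        idx = vowels[-1]
    return body[:idx] + TONE_MARKS[body[idx]][tone - 1] + body[idx + 1:]


def mark_text(text):
    """Convert a whole string of space-separated numbered syllables."""
    return " ".join(mark_syllable(tok) for tok in text.split())


print(mark_text("yi1 dui4 yi1"))
```

    Going the other way (stripping the marks) is just as mechanical; what cannot be recovered mechanically is tone information that was never written down in the first place.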

  2. Michael Watts said,

    December 17, 2019 @ 10:14 am

    I mean, how would a conversion program handle something like https://en.wikipedia.org/wiki/Lion-Eating_Poet_in_the_Stone_Den or https://www.reddit.com/r/ChineseLanguage/comments/dc9tp2/the_story_of_lady_ji_attacking_a_chicken/ ?

    It would fail, for the same reason humans cannot understand those texts when they are read aloud. Why would we expect software to perform a task that can't be done?

    And why would the fact that we can't do an impossible task prevent us from doing a possible one?

  3. Antonio L. Banderas said,

    December 17, 2019 @ 10:38 am

    After reading this great post, I started to wonder how technology is able to recognize tones, but ended up realizing I do not really know how tone contours are orally/phonetically distributed along Mandarin rhymes, and specifically along Mandarin diphthongs –especially as opposed to the simpler syllabic stress(es) in Spanish or English.

    Could somebody elaborate on it, with reference to such variables as time, volume of speech, pitch levels, etc.

  4. Doctor Science said,

    December 17, 2019 @ 11:49 am

    I've been reading Language Log for many years, and at last I'm starting to learn Chinese. I just put up a post about my Chinese adventures, talking (among other things) about how LLOG has given me background for what I'm trying to learn now.

    I ask some questions, too:

    a. I heard about the Heisig method for character learning in LLOG comments. Are there any caveats or downsides to this method?

    As an aside: does Heisig have any effect on the character amnesia curve? Has anyone tried translating Heisig into Chinese and using it to teach native speaker children?

    b. I've been supplementing my class (which is only once a week) with the ChineseSkill app, but I can't find a way to go back and review lessons I've already completed. What app(s) do you-all favor?

    c. What pushed me to take the class (besides being retired and at last having the time) was watching Chinese dramas with their terrible English subtitles. What kind of shows would have the slowest, clearest speech?

  5. Calvin said,

    December 17, 2019 @ 2:02 pm

    Re: Two way conversions between Pinyin and Hanzi texts

    For short text snippets up to 5000 characters, Google Translate (translate.google.com) can accept both Pinyin and Hanzi (auto-detected) and display the transcribed version below in the same input box.

    I captured a couple of examples here: https://imgur.com/a/jnuunjQ.

  6. Doctor Science said,

    December 17, 2019 @ 2:17 pm

    My first comment was a bit OT, so:

    I agree with those who think pinyin-to-Hanzi is not substantially different from speech-to-Hanzi, except it's easier (no issues with accents, background noises, etc.). Compared to English or French, pinyin (with tone marks) is really close to WYSIWYG.

    I was very surprised to learn in my first Chinese lesson that tā can mean "he" (他), "she" (她), or "it" (它). I immediately wondered if illiterate Chinese-speakers–which of course were the vast majority until quite recently–really thought of these 3 tā as being distinct, or if the gendering of the pronoun was "pasted on" from the writing system. Do young native speakers often make mistakes in choosing which tā to write? Do native Chinese speakers have problems keeping gendered pronouns straight when they learn languages with very distinct he/she/it?

    Which tā pronoun humans write has to be determined from context, so machine speech-to-text (and pinyin-to-Hanzi) for Chinese will also have to be context-sensitive. But that doesn't mean it's an intractably difficult problem.

  7. Alyssa said,

    December 17, 2019 @ 3:15 pm

    Does one-way conversion from Hanzi to Pinyin already exist? My limited understanding is that this should be straightforward, shouldn't it?

  8. Bathrobe said,

    December 17, 2019 @ 3:55 pm

    tā can mean "he" (他), "she" (她), or "it" (它)

    I think this is a modern innovation under the influence of Western languages…

  9. Chris Button said,

    December 17, 2019 @ 11:19 pm

    @ Antonio L. Banderas

    The tones are affected by a combination of intonation and stress. So while specific tonemes (i.e., phonemic tones) can be identified, their surface tonetics (i.e., phonetic tones) can vary significantly.

  10. Daan said,

    December 18, 2019 @ 2:57 am

    @Doctor Science and @Bathrobe:

    You may be interested in this review of Huang Xingtao's "她"字的文化史 (Cultural History of the Character "她/tā") https://harvard-yenching.org/cultural-history-of-the-Chinese-character-ta

    The review (and presumably the book itself) describes how the need for a specific female pronoun arose in the late 19th century in the context of translating Western literature and became the subject of some debate (including over whether the alternative 伊 yī should be used in preference to 她 tā).

    I have always thought the identical pronunciation of 他/她/它 is an interesting example of VHM's occasional warning not to identify Chinese (a language people speak) with Chinese characters (a system used to record it). I have had several very intelligent and highly educated Chinese colleagues at the universities I have worked at who in spite of their otherwise excellent English occasionally mix up "he" and "she", one of whom once suggested it was because "it's all the same word in Chinese". Clearly for some Chinese speakers at least, tā is fundamentally one word regardless of the different ways in which it may be "spelled".

    I understand that in Cantonese 佢 keoi5 is used for he/she/it and there are no distinct characters depending on the gender. If 他/她/它 is a recent invention, I would be interested to know whether other topolects have gendered pronouns (written or spoken). Does anyone have any information on this?

  11. Michael Watts said,

    December 18, 2019 @ 6:30 am

    I immediately wondered if illiterate Chinese-speakers–which of course were the vast majority until quite recently–really thought of these 3 tā as being distinct, or if the gendering of the pronoun was "pasted on" from the writing system. Do young native speakers often make mistakes in choosing which tā to write? Do native Chinese speakers have problems keeping gendered pronouns straight when they learn languages with very distinct he/she/it?

    Yes, this is one of the most prominent and characteristic problems displayed by Chinese speakers speaking English. They always choose "he".

    It's clear that the Chinese "mental model" only includes the one 3rd-person pronoun, which may be spelled differently according to context. But spelling is a slower process than speaking — the difference is not accessible in real time as they speak.

  12. Victor Mair said,

    December 18, 2019 @ 8:21 am

    "They always choose 'he'."

    Not always.

  13. Daniel said,

    December 18, 2019 @ 2:50 pm

    I'll add that "he" and "she" are not phonetically distinct in mandarin. Neither can be pronounced accurately, and the closest pronunciation is the same for both, pinyin "xi", the consonant being a palatal fricative.

    Chinese speakers will sometimes choose "he" and "she" at random, and not even realize which one they are saying.

  14. Ellen K. said,

    December 18, 2019 @ 4:13 pm

    @Daniel

    Do you mean that the English words "he" and "she", when said with the closest Mandarin sounds, sound the same?

    If so, I can see how that would make an added level of difficulty for Mandarin speakers speaking English as a second language differentiating "he" and "she" in speech.

  15. Michael Watts said,

    December 18, 2019 @ 7:04 pm

    Ellen K.:

    The closest Mandarin sounds to the English /h/ and /ʃ/ are quite distinct. Neither can be followed by the vowel /i/, so there's more room for interpretation in the question "which Mandarin syllable is closest to 'he'?". But training a Mandarin speaker to produce good facsimiles of "he" and "she" isn't difficult.

    What ambiguity there is vanishes when the speaker produces "him" or "her". Pronoun case is another difference that Mandarin doesn't bother with, but it seems to be easier for Chinese speakers than pronoun gender.

  16. Victor Mair said,

    December 18, 2019 @ 9:07 pm

    "The closest Mandarin sounds to the English /h/ and /ʃ/ are quite distinct."

    What are they, together with their following vowels?

  17. Philip Taylor said,

    December 19, 2019 @ 4:59 am

    I would suggest that they are the initials of 喝 (hē) and 谁 (shéi) respectively, together with the corresponding finals. Other finals are, of course, possible.

  18. Jeffrey said,

    December 19, 2019 @ 7:57 am

    On a humorous note.

    Yesterday a Chinese friend of mine sent me a clip of a video that had a phrase that she just couldn't understand and asked me to give it a listen. The best she could do was: "before our committees."

    I clicked on the video.

    It was a lecture in English about GEOMETRY.

    Can you guess?

    Yep, you guessed it: "before Archimedes."

    I had to smile.

  19. Chris Button said,

    December 19, 2019 @ 6:27 pm

    I'll add that "he" and "she" are not phonetically distinct in mandarin. Neither can be pronounced accurately, and the closest pronunciation is the same for both, pinyin "xi", the consonant being a palatal fricative.

    What a great observation! That explains why people might continue to confuse them in pronunciation even when they are aware of the actual distinction.

    I would suggest that they are the initials of 喝 (hē) and 谁 (shéi) respectively, together with the corresponding finals.

    But neither can occur before /i/, and if one is restricting oneself to the sounds that one is comfortable making in one's native language, then it's going to be "x" /ɕ/ for both rather than "h" /h~x/ or "sh" /ʂ/ (in spite of the "h" /h/ correspondence for some speakers).

    "before our committees."

    "before Archimedes."

    Definitely some intervocalic flapping of the "t" going on there

  20. Philip Taylor said,

    December 19, 2019 @ 8:13 pm

    Chris, is it true to say that "neither can occur before /i/" ? I would accept without hesitation the statement that "neither does occur before /i/" (in MSM), but can one generalise from that to say that neither can occur ?

    If so, then could one not equally argue that /kw/ cannot occur before /ʌ/ in British English ? It is true that it does not so occur (as far as I can tell) but I do not think that any native speaker of <Br.E> would have any difficulty in pronouncing "kwuk" as /kwʌk/ (tho' "quuk" might cause some hesitation …).

  21. Chris Button said,

    December 19, 2019 @ 11:54 pm

    @ Philip Taylor

    You could compare the palatalization in Japanese behind "sa, shi, su, se, so" rather than "si". So it comes down to what's the most comfortable default option, rather than the learned one.

  22. Jeffrey said,

    December 20, 2019 @ 12:13 am

    @Chris Button

    Definitely some intervocalic flapping of the "t" going on there

    Yes, there was.

    When teaching in the US, I start working on the flap/tap "t" from the very beginning. It's essential for the quick farewell "later." Even absolute beginners want to sound like a local and fit in, and the flap/tap "t" really helps.

  23. Philip Taylor said,

    December 20, 2019 @ 7:20 am

    Chris, yes, I take your point, but would it really be more difficult for a native speaker of MSM to synthesise a new sound from the initial of 喝 (hē) or 谁 (shéi) and the final of 里 (lǐ) than it would be for a native speaker of <Br.E> to synthesise a new sound from the "initial" of "quack" and the "final" of "duck" ? I agree that it would not come naturally and would initially require conscious effort, but just as we westerners can get to grips with the phonology of MSM and produce sounds that are not in the phonology of our native languages (given a little effort), so it seems to me that a native speaker of MSM should have all of the tools necessary to produce /hiː/ or /ʃiː/.

  24. Philip Taylor said,

    December 20, 2019 @ 7:28 am

    And a supplementary question, if I may ? Would it be true to say that a native (monolingual) speaker of MSM has the necessary phones to produce /hiː/ or /ʃiː/ but lacks the necessary phonemes in his/her native phonology ?

  25. David Moser said,

    December 20, 2019 @ 8:43 am

    @Jeffrey
    Very funny example. I'm reminded of a similar experience. A few years ago, a Chinese student of mine was trying to transcribe some George Bush press conferences, and she called on me for help transcribing some difficult passages. At one point she said to me "He keeps saying this one phrase time and time again, and it sounds to me like 'Lame tay sum.' What in the world is he saying?" I asked her to put on the tape recording to listen, and I heard him say the phrase in question, which was "Let me tell you something", which, in George Bush dialect, comes out as "Lemme tellya sumpin'"

  26. Rodger C said,

    December 20, 2019 @ 9:29 am

    Nome sane?

  27. Jeffrey said,

    December 20, 2019 @ 10:50 am

    @David Moser

    Lemme tellya sumpin / Lame tay sum

    Nice.

    That Lemme tellya sumpin is chock full of connected speech goodies: deletion, vowel weakening, and assimilation for starters.

    By the way, I'm a beginning student of Putonghua (but a veteran teacher of ESL/EFL) and I'm wondering if there are any books or websites that cover the connected speech processes in Chinese. Any suggestions?

  28. Chris Button said,

    December 20, 2019 @ 9:19 pm

    I agree that it would not come naturally and would initially require conscious effort

    I think that's really the point, and hence the relative ease of confusion due to an innate phonological association.

  29. Mingfei Lau said,

    December 27, 2019 @ 12:41 am

    For Sinograph to Pinyin conversion, there are a lot of tools and programs out there and many of them are open sourced on GitHub, like this one
    https://github.com/hotoo/pinyin
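    At its very simplest, a Sinograph to Pinyin tool of this kind amounts to a table look-up. Here is a rough sketch in Python, not the API of the project linked above: the `READINGS` table and the `hanzi_to_pinyin` function are hypothetical, the table holds only a handful of entries, and taking the first listed reading for a polyphonic character is a simplification that real tools refine with word-level dictionaries.

```python
# Toy character-to-reading table (hypothetical); a real tool ships with
# tens of thousands of entries plus word-level overrides.
READINGS = {
    "一": ["yī"],
    "对": ["duì"],
    "他": ["tā"],
    "她": ["tā"],
    "它": ["tā"],
    "山": ["shān"],
    "陕": ["shǎn"],
    "西": ["xī"],
    "行": ["xíng", "háng"],  # polyphonic: the reading depends on the word
}


def hanzi_to_pinyin(text):
    """Look up each character; pass through anything not in the table."""
    out = []
    for ch in text:
        readings = READINGS.get(ch)
        # Simplification: always take the first listed reading.
        out.append(readings[0] if readings else ch)
    return " ".join(out)


print(hanzi_to_pinyin("一对一"))
print(hanzi_to_pinyin("山西"), "/", hanzi_to_pinyin("陕西"))
print(hanzi_to_pinyin("他"), hanzi_to_pinyin("她"), hanzi_to_pinyin("它"))
```

    Note that 他, 她 and 它 all come out as the same Pinyin string, and that 山西 and 陕西 differ only in a tone mark, so the mapping loses information that the reverse direction then has to guess.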

    As for Pinyin to Sinograph conversion, it is a much more complicated task. Sinograph to Pinyin conversion is essentially a dictionary look-up task in which every output is deterministic. But for Pinyin to Sinograph, there are just too many homophones in modern Mandarin, so the output is non-deterministic. It is often impossible to determine which Sinograph the Pinyin corresponds to without context. And in some extreme situations, like the 陝西 vs 山西 example, one just never knows, even with context. However, that doesn't mean this task is unsolvable, and nowadays we have machine learning algorithms to "guess" the most likely and sensible sentences from a string of Pinyin:
    https://github.com/letiantian/Pinyin2Hanzi

    These tools are much more complicated than those for Sinograph to Pinyin conversion. Basically they are a set of probabilistic models (this one uses the Viterbi algorithm and dynamic programming), which are trained on huge corpora of language data to learn the patterns of the written language. So when you give a string of Pinyin as input to these models, they process the string and compute the most likely corresponding Sinograph string as the output.

    This technology is also widely used in Chinese Input Methods (拼音輸入法). So when you type a string of Pinyin, the software displays the most probable characters on the screen, that is, the string of characters that you are most likely to be typing.
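    To make the probabilistic approach a little more concrete, here is a toy Viterbi decoder in Python. It is only a sketch under stated assumptions and not the code of the project linked above: the candidate lists, the bigram probabilities and the `FLOOR` value are all invented for illustration, the input is toneless Pinyin already split into syllables, and every candidate is treated as equally compatible with its syllable (so the usual emission term drops out).

```python
import math

# Invented toy data: candidate characters for each toneless syllable.
CANDIDATES = {
    "wo": ["我", "握"],
    "ta": ["他", "她", "它"],
    "shi": ["是", "师", "十", "事"],
    "lao": ["老", "劳"],
}

# Invented bigram probabilities P(next | previous); "<s>" marks sentence start.
BIGRAMS = {
    ("<s>", "我"): 0.5,
    ("<s>", "他"): 0.3,
    ("<s>", "她"): 0.2,
    ("我", "是"): 0.4,
    ("他", "是"): 0.4,
    ("她", "是"): 0.4,
    ("是", "老"): 0.2,
    ("老", "师"): 0.6,
    ("老", "是"): 0.05,
}
FLOOR = 1e-4  # assumed probability for any character pair not listed above


def log_p(prev, cur):
    return math.log(BIGRAMS.get((prev, cur), FLOOR))


def viterbi(syllables):
    """Return the most probable character string for a list of syllables."""
    # best maps each possible last character to (log probability, path so far).
    best = {"<s>": (0.0, [])}
    for syl in syllables:
        step = {}
        for cand in CANDIDATES.get(syl, [syl]):  # unknown syllables pass through
            score, path = max(
                ((p_score + log_p(prev, cand), p_path)
                 for prev, (p_score, p_path) in best.items()),
                key=lambda item: item[0],
            )
            step[cand] = (score, path + [cand])
        best = step  # dynamic programming: keep one best path per character
    _, path = max(best.values(), key=lambda item: item[0])
    return "".join(path)


print(viterbi("wo shi lao shi".split()))
print(viterbi("ta shi lao shi".split()))
```

    With these made-up numbers the second sentence comes out with 他 rather than 她 simply because 他 was given the higher sentence-initial probability, which is the tā problem discussed earlier in this thread: without more context the model can only fall back on frequency.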

  30. Jarek Weckwerth said,

    December 28, 2019 @ 7:01 am

    Thank you, Mingfei Lau, for this informed answer. I suspected this would be the case.
