Automatic Pinyin annotation — state of the art
[This is a guest post by Gábor Ugray]
Back in 2018, your post "Pinyin for phonetic annotation" planted an idea in my head that I’ve been gradually expanding ever since. I am now at a stage where I routinely create annotated Chinese text for myself; this (pdf) is what one such document looks like.
The tricky part is obtaining the automatic Pinyin annotations, and that is a very neglected area of Chinese language processing. I am using a mix of Google MT and Key5. (I only learned about Key5 from your post; it is one of the internet’s best-hidden treasures.) Both of these tools produce their own, different set of errors.
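To give a feel for why this is harder than a character-by-character lookup, here is a minimal sketch using the open-source pypinyin library (purely an illustration; it is not one of the tools discussed in this post). Heteronymous characters such as 行 (xíng/háng) are exactly where converters diverge and produce their different errors.

    # Minimal sketch of naive Hanzi-to-Pinyin conversion. The pypinyin
    # library is an assumption made for illustration only; it is not one
    # of the tools evaluated in this post.
    # pip install pypinyin
    from pypinyin import pinyin, Style

    text = "他在银行工作"  # "He works at a bank"; 行 is a heteronym (xíng / háng)

    # One best-guess reading per character, with tone marks.
    print(pinyin(text, style=Style.TONE))
    # e.g. [['tā'], ['zài'], ['yín'], ['háng'], ['gōng'], ['zuò']]

    # List all candidate readings to see where the ambiguity lies.
    print(pinyin(text, heteronym=True))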
I decided it was time to properly and methodically evaluate these two tools. I wrote up my findings in a fairly dense text here: Is automatic Pinyin transcription feasible? I tested Google MT and Key5. For a shorter-than-short TL;DR you can directly look at the visual evaluations of Google and Key5.
Mark Liberman said,
January 20, 2020 @ 1:01 pm
There are literally hundreds — maybe thousands — of publications about Hanzi-to-pinyin mapping.
Two examples from 25 years ago:
https://www.aclweb.org/anthology/O94-1002/
https://www.isca-speech.org/archive/icslp_1994/i94_1771.html
but basically any Chinese text-to-speech system has to solve this problem.
Given this literature, it's indeed interesting that there isn't a larger number of task-specific tools, nor (as far as I know) any standard collections and metrics for evaluating their performance.
Antonio L. Banderas said,
January 20, 2020 @ 1:25 pm
I am still fascinated by elasticity/flexibility, a lexical property of Chinese terms: the long and short forms of a word are two sides of the same coin, and this must be reflected in the very same entry for a given lemma.
Accordingly, the fifth edition of the prestigious XDHYCD (Xiandai Hanyu Cidian), for example, applies mutual annotations in the respective entries, so that the entry for 煤 mei ‘coal’ reads "noun, … also called 煤炭 mei-tan ‘coal-charcoal’", and the entry for 煤炭 meitan ‘coal-charcoal’ is annotated as "noun, 煤 mei ‘coal’".
Unfortunately, Wiktionary currently reflects this incorrectly: in the broadly termed 'compounds' section, as a synonym, or after 'see also', and only for the monosyllabic form.
http://www-personal.umich.edu/~duanmu/2014Elastic.pdf
Elasticity data from the Xiandai Hanyu Cidian 2005 has been tabulated in the following open-access thesis; however, the tabulation seems to be rife with inconsistencies:
deepblue.lib.umich.edu/bitstream/2027.42/116629/1/yandong_1.pdf
[(myl) Does the CEDICT dictionary entry for 煤 offer what you want? ]
Antonio L. Banderas said,
January 20, 2020 @ 1:29 pm
I've found the following critical issue, which affects the functionality of the Collins English-Chinese online dictionaries: on the Chinese-English side, Pinyin romanization should be added to:
a) the synonyms of the different meanings of an entry, which appear in parentheses preceded by an equals sign;
b) the "All related terms" page that appears after clicking 'View more related words' at the bottom of the page.
Furthermore, on the English-Chinese side, traditional characters currently appear in parentheses right next to the simplified ones; they should either be deleted, since they do not appear in the Chinese-English dictionary, or else set apart in a new sentence below the simplified ones.
reader_not_academe said,
January 22, 2020 @ 11:25 am
@Mark Liberman I checked both articles, but they don't seem to produce Pinyin output as such. I hadn't considered TTS, but your suggestion makes sense: TTS and Hanzi-to-Pinyin are overlapping problems. I guess TTS solves both more and less. More, because Pinyin, even though phonetic, is still a written abstraction and doesn't represent a lot of detail like prosody, bracketing of constituents, intonation, etc. And less, because Pinyin has its own idiosyncrasies, related mostly to orthography (what to capitalize, what to write together, where to use a dash, etc.), that are not relevant for TTS.
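To illustrate the "what to write together" point: standard Pinyin orthography joins the syllables of a word, so a converter needs word segmentation on top of per-character lookup. Here is a minimal sketch, assuming the jieba and pypinyin libraries (neither is mentioned above; they merely stand in for whatever segmenter and transcriber one happens to use):

    # Sketch of why "what to write together" needs word segmentation.
    # jieba and pypinyin are assumptions chosen for illustration; they are
    # not the tools evaluated in the post.
    # pip install jieba pypinyin
    import jieba
    from pypinyin import lazy_pinyin, Style

    text = "我们在北京学习中文"

    words = jieba.lcut(text)  # e.g. ['我们', '在', '北京', '学习', '中文']
    pinyin_words = [
        "".join(lazy_pinyin(w, style=Style.TONE))  # join syllables within each word
        for w in words
    ]
    print(" ".join(pinyin_words))
    # e.g. wǒmen zài běijīng xuéxí zhōngwén
    # (proper-name capitalization, like Běijīng, still needs a separate rule)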
It would appear that TTS is evaluated on its "end result," the audio produced, and there hasn't been much motivation for practitioners to share datasets for interim steps, which may differ from system to system. And more recent ML approaches tend to build end-to-end solutions with everything in between as a black box (aka a neural net), so I wonder whether even the interim representations found in earlier systems are here to stay.
Hanzi-to-Pinyin is a pretty specific task. I hope whoever else is faced with it in the future will find the evaluation approach I took, and the data/tools that I share, useful.
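The details of that approach are in the write-up linked above; as a rough sketch of the general shape such a metric can take (an illustration, not necessarily the write-up's actual procedure), a syllable error rate can be computed as edit distance over Pinyin syllables, by analogy with word error rate in speech recognition:

    # Sketch of one possible task-specific metric (an assumption for
    # illustration): syllable error rate, i.e. edit distance between the
    # reference and hypothesis Pinyin syllables, normalized by reference
    # length, analogous to word error rate in speech recognition.
    def syllable_error_rate(ref, hyp):
        """ref, hyp: lists of Pinyin syllables, e.g. ['zhōng', 'wén']."""
        # Classic dynamic-programming edit distance over syllable tokens.
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,          # deletion
                              d[i][j - 1] + 1,          # insertion
                              d[i - 1][j - 1] + cost)   # substitution
        return d[len(ref)][len(hyp)] / max(len(ref), 1)

    print(syllable_error_rate(['yín', 'háng'], ['yín', 'xíng']))  # 0.5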