rime-cantonese, a Cantonese lexicon for building keyboards and more
The following is a guest post by Mingfei Lau. A short intro about the author:
My name is Mingfei Lau, a member of The Linguistic Society of Hong Kong Jyutping Workgroup. I am a language engineer at Amazon and I work on different projects on Cantonese resource development in my spare time.
Today, Pinyin is undoubtedly the most popular way to type Mandarin. But what about Cantonese? Typing it wasn’t easy until rime-cantonese, the normalized Cantonese Jyutping[1] lexicon, appeared. Lo and behold, you can now type Cantonese in Jyutping just like typing Mandarin in Pinyin.
rime-cantonese is a multi-purpose Cantonese lexicon built by CanCLID, the Cantonese Computational Linguistics Infrastructure Development Workgroup (粵語計算語言學基礎設施建設組), a group of folk linguists and software engineers who are passionate about developing language resources for Cantonese. As one of the team's co-founders, I presented this project at the School of Cantonese 2021, where I talked about the motivations behind this normalized lexicon and the details of its development. CanCLID prescribes certain Chinese character variants for Cantonese words based on the de-facto orthography of Cantonese, that is, the writing conventions most widely accepted in the Cantonese-speaking community[2]. With this approach, we hope to promote the standardization of written Cantonese, enabling and encouraging more people to type Cantonese with Jyutping, as we already do for Mandarin with Pinyin.
As a result, several Cantonese keyboards have been developed for use with this lexicon. The CanCLID team built a page for easy downloads and installations: https://jyutping.net/. In addition, Sogou (搜狗輸入法), the largest virtual keyboard company in China, collaborated with CanCLID and released their Cantonese keyboard powered by rime-cantonese [3], which has gained nearly 200k users since its launch.
rime-cantonese is not limited to making keyboards. You can build a wide range of apps and tools with its data. As an example, CanCLID built a browser extension[4] which automatically annotates Chinese characters with their Cantonese pronunciations, turning any web page into a Cantonese textbook. Just click the switch and voilà: read any Chinese characters in Cantonese and never go back and forth to look things up in dictionaries any more.
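To give a rough idea of how lexicon data can drive this kind of annotation, here is a minimal Python sketch. It is not the extension's actual code. It assumes a Rime-style layout in which each entry is a word and its Jyutping separated by a tab; the five entries, and the names TOY_DICT, load_lexicon and annotate, are made up for illustration. The text is annotated by greedy longest match, so a word entry such as 廣東話 takes precedence over its individual characters.

```python
# Toy entries in an assumed "word<TAB>jyutping" layout (illustrative only,
# not copied from the real rime-cantonese files).
TOY_DICT = """\
我\tngo5
講\tgong2
話\twaa6
廣東話\tgwong2 dung1 waa2
香港\thoeng1 gong2
"""


def load_lexicon(text):
    """Parse tab-separated lines into a {word: jyutping} dictionary."""
    lexicon = {}
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) < 2 or line.startswith("#"):
            continue  # skip blank, malformed, or comment lines
        lexicon[parts[0]] = parts[1]
    return lexicon


def annotate(text, lexicon):
    """Greedy longest-match annotation.

    Returns a list of (chunk, jyutping) pairs; jyutping is None for
    characters that are not covered by the lexicon.
    """
    max_len = max(map(len, lexicon), default=1)
    result = []
    i = 0
    while i < len(text):
        # Try the longest possible chunk first, then shrink.
        for size in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in lexicon:
                result.append((chunk, lexicon[chunk]))
                i += size
                break
        else:
            # No entry starts here: pass the character through unannotated.
            result.append((text[i], None))
            i += 1
    return result


if __name__ == "__main__":
    lexicon = load_lexicon(TOY_DICT)
    for chunk, jyutping in annotate("我講廣東話。", lexicon):
        print(chunk, jyutping)
```

In this toy data the word-level match matters: 話 on its own is listed as waa6, but inside 廣東話 it is read waa2, which a purely character-by-character lookup would miss.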
The data of rime-cantonese can also be used in Natural Language Processing (NLP) applications. PyCantonese, the state-of-the-art Cantonese NLP package, uses rime-cantonese in its word-segmentation module. rime-cantonese is released under a CC BY 4.0 license, which means it is free and open for both commercial and non-commercial use.
CanCLID is currently working on multiple projects, mainly Cantonese dialogue corpora and lexicons. We are also the maintainers of the Cantonese locale of Mozilla’s Common Voice project. We welcome contributions, feedback and collaboration from anyone. You can reach us at support@jyutping.org; we are happy to discuss anything about Cantonese and potential collaborations.
Footnotes
[1] Jyutping is the de-facto standard romanization scheme for Cantonese. The official reference page of Jyutping: https://jyutping.org/en/jyutping/
[2] As an example, here is an incomplete list of orthographical choices made by the team: https://jyutping.org/en/blog/typo/
[3] The download page of Sogou’s Cantonese keyboard, available on iOS, iPadOS and Android: https://shouji.sogou.com/interface/multilingual.php?language=3
[4] inject-jyutping is available on both Chrome and Firefox.
[Thanks to Diana Shuheng Zhang]
Dwight Williams said,
October 31, 2021 @ 4:32 pm
Good news indeed!
Reini Jiken said,
October 31, 2021 @ 10:34 pm
When I found out that we could type Cantonese, and even other Chinese languages, with Jyutping, the standard Cantonese spelling, I realized how important and convenient it is for Cantonese typing. I'm so glad to see the CanCLID workgroup offering more tools and apps to educate the public about the normal way to type Cantonese.
David Marjanović said,
November 1, 2021 @ 7:17 am
That's interesting! Which others?
Mingfei Lau said,
November 1, 2021 @ 8:49 am
Jyutping is an input method schema for typing Chinese characters, so technically you can type any language written in Chinese characters. You can type Mandarin in Jyutping or Cantonese in Pinyin. It's just that you have to type character by character, not word by word, which makes it very impractical.
But if you are asking about Romanization schemes, other than Jyutping and Pinyin, for other Chinese languages, I have a project for this:
https://github.com/laubonghaudoi/Chinese_Rime
It collects 113 Romanization schemas for various Chinese languages.
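To illustrate the earlier point about typing character by character versus word by word, here is a minimal Python sketch. It does not reflect how any real input method engine works internally. The entries and the names build_index and type_text are hypothetical. Each table maps a typed Jyutping code to candidate text, and the function reports how many separate selections the user would need.

```python
# Hypothetical entries: single characters, plus one multi-character word.
CHAR_ENTRIES = [
    ("廣", "gwong2"),
    ("東", "dung1"),
    ("話", "waa2"),
]
WORD_ENTRIES = CHAR_ENTRIES + [("廣東話", "gwong2 dung1 waa2")]


def build_index(entries):
    """Map each typed code to the list of candidates it can produce."""
    index = {}
    for text, code in entries:
        index.setdefault(code, []).append(text)
    return index


def type_text(code, index):
    """Return (text, selections) for a space-separated Jyutping code.

    If the whole code matches a word entry, one selection is enough.
    Otherwise fall back to one selection per syllable. An unknown
    syllable raises KeyError in this sketch.
    """
    if code in index:
        return index[code][0], 1
    pieces = [index[syllable][0] for syllable in code.split()]
    return "".join(pieces), len(pieces)


if __name__ == "__main__":
    code = "gwong2 dung1 waa2"
    print(type_text(code, build_index(WORD_ENTRIES)))  # word entry available
    print(type_text(code, build_index(CHAR_ENTRIES)))  # characters only
```

With word entries the whole code resolves in a single selection; with only single-character entries the same text takes one selection per syllable, which is the situation described above for a language whose words are not in the lexicon.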