Cantonese word list and parser

This morning I received an announcement from The Linguistic Society of Hong Kong (LSHK) that its long-awaited Jyutping word list is now online.  Access to the word list is available here.

The LSHK word list was compiled by The Jyutping Work Group.  The words in the list come from various sources: Andy Chin’s Mid-20th Century Hong Kong Cantonese Corpus, Shin Kataoka and Cream Lee’s teaching materials, and the Education Bureau’s word list for elementary schools.

Members of the Jyutping Word List Group:

Cheung Kwan Hin (Convenor)

Andy Chin

Peter Chung

Cream Lee

Caesar Lun

Vicky Man

Tang Sze Wing

The LSHK word list contains a generous 15,740 entries that may be searched by various means.  I looked up one of my favorite Cantonese expressions, 哎吔, and found four pronunciations, all from the LSHK database:  ai1jaa1, ai3jaa4, ai1jaa3, ai2jaa5.  In Sheik’s CantoDict, 哎吔 has the Jyutping romanization ai1jaa1 and is defined as meaning “Aiya!; a sort of [as in ai1jaa1 lou5dau6 哎吔老豆 (“some kind of daddy”)].”

I particularly like the fact that the syllables of the words in the LSHK list are joined together (ai1jaa1) and not divided (ai1 jaa1) or separated by a hyphen (ai1-jaa1), as is the custom on most sites.
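For readers who want to move between the three conventions, here is a minimal Python sketch. It assumes only that every Jyutping syllable is a run of lowercase letters ending in a tone digit from 1 to 6; the function names `split_jyutping` and `format_jyutping` are my own illustrations and have nothing to do with the LSHK site.

```python
import re

# Assumption: one Jyutping syllable = lowercase letters + a tone digit 1-6.
SYLLABLE = re.compile(r"[a-z]+[1-6]")

def split_jyutping(word: str) -> list[str]:
    """Split a joined, spaced, or hyphenated Jyutping word into syllables."""
    return SYLLABLE.findall(word.lower())

def format_jyutping(word: str, sep: str = "") -> str:
    """Re-join the syllables with the chosen separator ('' = joined form)."""
    return sep.join(split_jyutping(word))

if __name__ == "__main__":
    print(split_jyutping("ai1jaa1"))                # joined form -> syllables
    print(format_jyutping("ai1-jaa1"))              # hyphenated -> joined
    print(format_jyutping("ai1jaa1lou5dau6", " "))  # joined -> spaced
```

Because the tone digit closes every syllable, the joined form loses no information: the boundaries can always be recovered mechanically.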

As I was preparing this post, I made the serendipitous discovery that the Sheik site has a very valuable parser.  You can enter a Chinese text of up to 250 characters into the parser, which “attempts to split it up into component Chinese words. You can then see the meanings of each character and compound word by moving your mouse over the parsed text (we recommend using the Firefox web browser).”  Fortunately, I have always and only used Firefox, so the parser worked very well for me.
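I do not know how the Sheik parser works internally. To give a rough idea of what splitting a text into component words can involve, here is a sketch of one classic dictionary-based technique, forward maximum matching: at each position, take the longest string found in the word list, and fall back to a single character otherwise. The function `segment`, the `max_len` parameter, and the tiny lexicon are all hypothetical, made up for illustration.

```python
def segment(text: str, lexicon: set[str], max_len: int = 4) -> list[str]:
    """Forward maximum matching: greedily take the longest known word."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words

if __name__ == "__main__":
    # Toy lexicon (an assumption; a real parser would use a full dictionary).
    toy_lexicon = {"英國", "英國人", "佔領", "香港", "香港島"}
    print(segment("英國人佔領香港島", toy_lexicon))
```

Greedy matching of this kind is simple but can mis-segment ambiguous strings, which is presumably why the site describes its parser as one that "attempts" the split.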

I put the following two passages, chosen at random from the Mandarin and Cantonese versions of the Wikipedia article on Hong Kong, into the Sheik parser, and I must say that the results were very helpful.  Give it a whirl yourself!

Mandarin:

公元前214年,香港被秦朝納入中原王朝版圖,此後長期為中國領土,也曾短暫納入廣東越南地區版圖[c],亦曾建立過本土政權[d]。1842年,中國清朝英國簽訂《南京條約》,永久割讓香港島,此後再簽訂《北京條約》和《展拓香港界址專條》,分別割讓九龍和租借新界,這些由英國統治的地區構成現今香港的治理範圍。二戰期間,香港曾被日本佔領約三年零八個月。

Cantonese:

香港史可追至新石器時代。但轉變成大城,要由大清國講起。英國同大清打鴉片戰爭期間,一八四一年,英國人佔領香港島,響島東北岸起域多利亞城。一八四二年,清廷簽《江寧條約》,正式割香港島畀英國。一八六零年,清廷簽《北京條約》,再割埋九龍半島



5 Comments

  1. Jenny Chu said,

    August 26, 2016 @ 9:10 am

    I think this is great news… Right?

  2. Randy Ballard said,

    August 28, 2016 @ 6:44 am

    Nice. I look forward to this new Jyutping word list. I found my Cantonese tutor on Skype at http://www.mandarintutor.com/cantonese and we started out practicing with Jyutping. I admit Jyutping isn’t easy to learn but like anything, it takes persistence to get good at it. I look forward to practicing this new Jyutping word list with my Cantonese tutor during our Skype lessons. Thanks for the post!

  3. Simon P said,

    August 28, 2016 @ 11:50 am

    This is fantastic news. I haven’t been doing much with my Cantonese lately, but this new resource might make me dust it off again. The people at LSHK are doing us learners a great service with their fantastic work on the Cantonese language.

  5. Chris Godwin said,

    August 28, 2016 @ 4:08 pm

    My Internet provider tells me the web link uce.zapto.org… can’t be found. Have others managed to access the word list?
