Archive for Writing systems

The horror of ideograms

Well, I'm as recovered from my cold as I was able to get, and it is time to go. I am setting off for a trip to what everyone (following Europe) calls the Far East. (For Californians it is clearly the far west.) I head first to Hong Kong, for a few days during which I will be giving at least four lectures, taking part in a panel session, and attending various other meetings (this really is not leisure time). And there is just one thing that really, really scares me about it. Perhaps you can guess.

The boat that ain't sayin' nothin'

Speeding east out of the Amsterdam area along dead straight train tracks beside a broad canal, I saw a huge cargo barge loaded up with giant shipping containers. It had several of the crew's automobiles parked on an upper deck. As the train whizzed past it and I could see the name on the bow, I saw that it was called the Omerta. Omertà? The brutal Sicilian mafia's fiercely enforced code of silence? I really wanted to hop off the train and ask the captain what on earth had led to the boat being thus named. But perhaps he would have turned out to be a Sicilian with an illicit cargo and would have refused to talk to me about it…

Who Has the Biggest Dictionary?

The East Asians have an ongoing contest propelled by dictionary size envy. Everybody wants to see who can produce a dictionary with the most entries. The Koreans at Dankook University have just pulled off the amazing feat of compiling a dictionary that has outstripped anything yet generated by the Japanese or the Chinese themselves. After 30 years of labor and investing more than 31,000,000,000 KRW (equal to more than 25 million USD), the South Koreans have just published the Chinese-Korean Unabridged Dictionary in 16 volumes. This humongous lexicon contains nearly half a million entries composed of 55,000 different characters. You can read more about the Dankook dictionary and its bested competitors here and here.

The Opacity and Difficulty of the Chinese Script

My class on the Chinese script has around 36 students in it. About half of them are native speakers from Taiwan, the Mainland, Singapore, and Hong Kong (most of these are graduate students who already have M.A.'s from overseas universities or are finishing up their Ph.D.'s). About a quarter are native speakers of Japanese, Korean, and Vietnamese. The remaining quarter or so are Americans who have studied Mandarin anywhere from two to twelve years.

Today, I made the students close their computers, electronic dictionaries, and all their books and papers, then asked them to write down on a piece of paper the simplified and traditional characters for Taiwan and beneath that what the meaning or origin of the name is. In the top right corner they indicated whether they were native speakers or how many years they had studied Chinese (I also should have asked them to indicate where they were from, but neglected to do so). The results:

  • only 2 students could write both forms correctly
  • only 4 students could write both forms partially correctly
  • only 10 students could write one form correctly
  • about 10 students could write one form partially correctly
  • the remainder of the students could not write either form correctly, including a couple of the native speakers
  • most students who had taken up to 6 years of Chinese couldn't write either form correctly

[If you want to give yourself the same quiz, before reading further, the answer is here.]

Sino-Russian Transcription and Transliteration

It has often been my duty to translate Russian archeological and Sinological works into English, or to edit such translations. Two things plague such work more than anything else, and both have to do with transliteration.

How Michael spent his summer vacation

Well, part of it, anyhow… Michael Y. Chen wrote:

I went to Beijing and studied Chinese in July, and while I was over there I came across an interesting phenomenon.

In English, we talk about shapes that correspond to letters, like an S-curve or a T junction. While asking for directions, I found that there's a similar thing for shapes that correspond to Chinese characters. For example, 十字路口 (shi2zi4lu4kou3), a "十 intersection", refers to a four-way intersection (or just any intersection). The phrase is based entirely on the shape of the character, and not the meaning (十 means ten in Chinese). There's also a 丁字路口 (ding1zi4lu4kou3), a "丁 intersection", which would correspond to our T intersections.

Identifying written Cantonese

A query by a commenter on Victor's post raises an issue that seems worthy of discussion here on the main page. The question is whether it is possible to distinguish written Mandarin from written Cantonese. A widely believed myth is that even forms of Chinese that are mutually incomprehensible in their spoken forms are identical in writing. This is not true. Victor's post itself points out small differences between written Taiwanese Mandarin and Mainland Mandarin. Written Cantonese can in fact be distinguished from written Mandarin.

US Mint Announces Coin With Braille

The United States Mint has announced the release of its first coin with readable Braille on it, a commemorative silver dollar in honor of Louis Braille, the creator of Braille, to be released next year. The Braille is on the reverse.


The reverse of the Braille commemorative silver dollar

A Tale of a Pot

A third century C.E. toddy pot from Tamil Nadu with an inscription in Tamil Brahmi

A few days ago an unusual article appeared in The Hindu. It is about the fragment of a pot shown above, a pot used for collecting toddy (palm sap, modern Tamil கள்ளு) made about 1800 years ago. The writing on the pot is in Tamil Brahmi, a writing system that only fairly recently has come to be well understood. It says: n̪a:kan uɾal, Old Tamil for "Naakan's (pot with) toddy-sap". In modern Tamil writing this would be: நாகன் உறல். As the article points out, the fact that a poor toddy-tapper would write his name on a pot is indicative of mass literacy at the time.

Ask Language Log: Linguistic fact checking at the New Yorker

Stephen Smith writes:

There's a New Yorker article about a Moldovan woman working for an organization that tries to track down victims of sex trafficking and bring them home, but it includes this weird bit:

"She talks on the phone and knocks out memos and documents and e-mails in four languages and three alphabets—Russian, Romanian, Swedish, and English."

Russian is written in Cyrillic, Romanian is almost always written in Latin characters (though in Moldova, Cyrillic letters were officially used – but that was twenty years ago), and Swedish and English are always in Latin characters. Romanian and Swedish have some non-standard characters, but even if you count each language as having its own alphabet, that should make four alphabets, not three. And of course if you're going to count each language as having its own alphabet, what's the point in writing them both down? The New Yorker is usually such a fastidious publication – am I missing something here?

Stephen's question really ought to be addressed to the Columbia Journalism Review, I guess — the general problem of fact-checking at the New Yorker is not one that I'm professionally competent to investigate. But this is not the first case where we've noted carelessness and confusion about linguistic matters in New Yorker stories.
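
Stephen's count is easy to check mechanically. The sketch below is only an illustration: the four sample words are my own stand-ins, not anything from the New Yorker piece, and taking the first word of each Unicode character name as a script label is a rough heuristic rather than the official Unicode Script property.

```python
import unicodedata

# Hypothetical sample words, one per language named in the quoted sentence.
SAMPLES = {
    "Russian": "привет",
    "Romanian": "mulțumesc",
    "Swedish": "hallå",
    "English": "hello",
}


def scripts_used(text: str) -> set[str]:
    """Return rough script labels for the letters in `text`.

    Heuristic (an assumption of this sketch): the first word of a letter's
    Unicode name, e.g. "LATIN" or "CYRILLIC", stands in for its script.
    """
    text = unicodedata.normalize("NFC", text)
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in text
        if ch.isalpha()
    }


# Collect the scripts seen across all four sample words.
all_scripts = set().union(*(scripts_used(t) for t in SAMPLES.values()))

for language, word in SAMPLES.items():
    print(language, sorted(scripts_used(word)))
print("languages:", len(SAMPLES), "scripts:", len(all_scripts))
```

Counting this way treats the extra letters of Romanian and Swedish as Latin letters with diacritics, which is the reading under which "three alphabets" is hard to defend.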

Is English more efficient than Chinese after all?

[Executive summary: Who knows?]

This follows up on a series of earlier posts about the comparative efficiency — in terms of text size — of different languages ("One world, how many bytes?", 8/5/2005; "Comparing communication efficiency across languages", 4/4/2008; "Mailbag: comparative communication efficiency", 4/5/2008). Hinrich Schütze wrote:

I'm not sure we have interacted since you taught your class at the 1991 linguistics institute in Santa Cruz — I fondly remember that class, which got me started in StatNLP.

I'm writing because I was intrigued by your posts on compression ratios of different languages.

As somebody else remarked, gzip can't really be used to judge the informativeness of a piece of text. I did the following simple experiment.

I read the first 10^9 or so characters from the xml Wikipedia dump and wrote them to a file (which I called wiki). I wrote the same characters to a second file (wikispace), but inserted a space after each character. Then I compressed the two files. Here is what I got:

1012930723 wiki
2025861446 wikispace
314377664 wiki.gz
385264415 wikispace.gz
385264415/314377664 approx 1.225

The two files contain the same information, but gzip's model does not handle this type of encoding well.

In this example we know what the generating process of the data was. In the case of Chinese and English we don't. So I think that until there is a more persuasive argument we should stick with the null hypothesis: the two texts of a Chinese-English bitext are equally informative, but the processes transforming the information into text are different in that the output of one can be more efficiently compressed by gzip than the other. I don't see how we can conclude anything about deep cultural differences.

Note that a word-based language model also would produce very different numbers for the two files.

Does this make sense or is there a flaw in this argument?
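
For readers who want to try the experiment at a small scale, here is a minimal sketch. It is not Hinrich's script: the vocabulary and the `make_sample_text` helper are hypothetical stand-ins for the Wikipedia dump, the text is plain ASCII so that one character is one byte, and Python's `gzip.compress` is used in place of the command-line tool. The numbers it prints will not match the ones above.

```python
import gzip
import random

# Hypothetical stand-in vocabulary; the original experiment used roughly
# 10^9 characters of the XML Wikipedia dump, which is not reproduced here.
VOCAB = [
    "the", "of", "and", "language", "text", "character", "writing",
    "system", "compress", "information", "model", "word", "script",
    "english", "chinese", "file", "space", "entropy", "code", "data",
]


def make_sample_text(n_words: int = 20000, seed: int = 0) -> bytes:
    """Build a reproducible pseudo-text by sampling words from VOCAB."""
    rng = random.Random(seed)
    words = [rng.choice(VOCAB) for _ in range(n_words)]
    return " ".join(words).encode("ascii")


def space_out(data: bytes) -> bytes:
    """Insert one space after every byte, doubling the length."""
    out = bytearray(2 * len(data))
    out[0::2] = data                # original bytes at even positions
    out[1::2] = b" " * len(data)    # a space at every odd position
    return bytes(out)


def gzip_size(data: bytes) -> int:
    """Length in bytes of the gzip-compressed form of `data`."""
    return len(gzip.compress(data))


plain = make_sample_text()
spaced = space_out(plain)

print(len(plain), "plain")
print(len(spaced), "spaced")
print(gzip_size(plain), "plain, gzipped")
print(gzip_size(spaced), "spaced, gzipped")
print("ratio:", gzip_size(spaced) / gzip_size(plain))
```

The check that matters is that `spaced[0::2] == plain`: the spaced file carries exactly the same information, so any gap between the two compressed sizes comes from gzip's model rather than from the content.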

Two Dots Too Many

The Turkish newspaper Hürriyet reports a tragic consequence of the failure to localize cell phones.

Awkward Sneeze

I've often commented upon the deleterious effect of computers on the ability of Chinese to write characters, and the curiously named Jennifer 8. Lee, as far back as the February 1, 2001 issue of The New York Times, wrote a convincing article entitled "Where the PC is mightier than the pen". More recently, I addressed this topic in a January 4, 2007 post on Pinyin News entitled "Chinese Characters as a High-Maintenance Script and the Consequences Thereof". And my friend, David Moser, told me in a personal communication some years ago that it is nearly impossible to find a Chinese person who can write *both* the second and third characters of the common term DA3 PEN1TI4 ("sneeze") without using pinyin ("spelling") to type them into a computer or looking them up in a dictionary, again usually via pinyin. (I'm intentionally omitting the characters for this and the next term I shall discuss so that individuals who are literate in Chinese and wish to test themselves can do so.)

Now comes further evidence that, whether due to the effect of computers or simply because they would never have known anyway, persons whose main written language is Chinese are unable to write another common expression.
