Archive for Writing systems

Sino-Russian Transcription and Transliteration

It has often been my duty to translate or edit Russian archeological and Sinological works in English. Two things plague such work more than anything else, and both have to do with transliteration.

How Michael spent his summer vacation

Well, part of it, anyhow… Michael Y. Chen wrote:

I went to Beijing and studied Chinese in July, and while I was over there I came across an interesting phenomenon.

In English, we talk about shapes that correspond to letters, like an S-curve or a T junction. While asking for directions, I found that there's a similar thing for shapes that correspond to Chinese characters. For example, 十字路口 (shi2zi4lu4kou3), a "十 intersection", refers to a four-way intersection (or just any intersection). The phrase is based entirely on the shape of the character, and not the meaning (十 means ten in Chinese). There's also a 丁字路口 (ding1zi4lu4kou3), a "丁 intersection", which would correspond to our T intersections.

Identifying written Cantonese

A query by a commenter on Victor's post raises an issue that seems worthy of discussion here on the main page. The question is whether it is possible to distinguish written Mandarin from written Cantonese. A widely believed myth is that even forms of Chinese that are mutually incomprehensible in their spoken forms are identical in writing. This is not true. Victor's post itself points out small differences between written Taiwanese Mandarin and Mainland Mandarin. Written Cantonese can in fact be distinguished from written Mandarin.

US Mint Announces Coin With Braille

The United States Mint has announced the release of its first coin with readable Braille on it, a commemorative silver dollar in honor of Louis Braille, the creator of Braille, to be released next year. The Braille is on the reverse.


The reverse of the Braille commemorative silver dollar

A Tale of a Pot

A third century C.E. toddy pot from Tamil Nadu with an inscription in Tamil Brahmi

A few days ago an unusual article appeared in The Hindu. It is about the fragment of a pot shown above, a pot used for collecting toddy (palm sap, modern Tamil கள்ளு) made about 1800 years ago. The writing on the pot is in Tamil Brahmi, a writing system that only fairly recently has come to be well understood. It says: n̪a:kan uɾal, Old Tamil for "Naakan's (pot with) toddy-sap". In modern Tamil writing this would be: நாகன் உறல். As the article points out, the fact that a poor toddy-tapper would write his name on a pot is indicative of mass literacy at the time.

Ask Language Log: Linguistic fact checking at the New Yorker

Stephen Smith writes:

There's a New Yorker article about a Moldovan woman working for an organization that tries to track down victims of sex trafficking and bring them home, but it includes this weird bit:

"She talks on the phone and knocks out memos and documents and e-mails in four languages and three alphabets—Russian, Romanian, Swedish, and English."

Russian is written in Cyrillic, Romanian is almost always written in Latin characters (though in Moldova, Cyrillic letters were officially used – but that was twenty years ago), and Swedish and English are always in Latin characters. Romanian and Swedish have some non-standard characters, but even if you count each language as having its own alphabet, that should make four alphabets, not three. And of course if you're going to count each language as having its own alphabet, what's the point in writing them both down? The New Yorker is usually such a fastidious publication – am I missing something here?

Stephen's question really ought to be addressed to the Columbia Journalism Review, I guess — the general problem of fact-checking at the New Yorker is not one that I'm professionally competent to investigate. But this is not the first case where we've noted carelessness and confusion about linguistic matters in New Yorker stories.

Is English more efficient than Chinese after all?

[Executive summary: Who knows?]

This follows up on a series of earlier posts about the comparative efficiency — in terms of text size — of different languages ("One world, how many bytes?", 8/5/2005; "Comparing communication efficiency across languages", 4/4/2008; "Mailbag: comparative communication efficiency", 4/5/2008). Hinrich Schütze wrote:

I'm not sure we have interacted since you taught your class at the 1991 linguistics institute in Santa Cruz — I fondly remember that class, which got me started in StatNLP.

I'm writing because I was intrigued by your posts on compression ratios of different languages.

As somebody else remarked, gzip can't really be used to judge the informativeness of a piece of text. I did the following simple experiment.

I read the first 10^9 or so characters from the xml Wikipedia dump and wrote them to a file (which I called wiki). I wrote the same characters to a second file (wikispace), but inserted a space after each character. Then I compressed the two files. Here is what I got:

1012930723 wiki
2025861446 wikispace
314377664 wiki.gz
385264415 wikispace.gz
385264415/314377664 approx 1.225

The two files contain the same information, but gzip's model does not handle this type of encoding well.

In this example we know what the generating process of the data was. In the case of Chinese and English we don't. So I think that until there is a more persuasive argument we should stick with the null hypothesis: the two texts of a Chinese-English bitext are equally informative, but the processes transforming the information into text are different in that the output of one can be more efficiently compressed by gzip than the other. I don't see how we can conclude anything about deep cultural differences.

Note that a word-based language model also would produce very different numbers for the two files.

Does this make sense or is there a flaw in this argument?
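
For readers who want to try the experiment on a small scale, here is a minimal sketch in Python. It is not Hinrich's actual procedure: it assumes the standard-library gzip module at its default settings, and it uses a small synthetic sample in place of the Wikipedia dump. The names insert_spaces, gzip_size, make_sample and compare are hypothetical helpers made up for this illustration.

```python
import gzip
import random


def insert_spaces(text: str) -> str:
    """Return text with a single space inserted after every character."""
    return "".join(ch + " " for ch in text)


def gzip_size(text: str) -> int:
    """Size in bytes of the gzip-compressed UTF-8 encoding of text."""
    return len(gzip.compress(text.encode("utf-8")))


def make_sample(n_words: int = 20000, seed: int = 0) -> str:
    """Build a synthetic stand-in for the Wikipedia dump (an assumption).

    Real text has very different statistics, so substitute your own
    corpus here for anything beyond a quick check.
    """
    vocab = ["the", "of", "language", "writing", "system", "character",
             "alphabet", "text", "compress", "information", "and", "in"]
    rng = random.Random(seed)
    return " ".join(rng.choice(vocab) for _ in range(n_words))


def compare(text: str) -> dict:
    """Compress text with and without inserted spaces and report sizes."""
    spaced = insert_spaces(text)
    plain_gz = gzip_size(text)
    spaced_gz = gzip_size(spaced)
    return {
        "plain_chars": len(text),
        "spaced_chars": len(spaced),
        "plain_gz": plain_gz,
        "spaced_gz": spaced_gz,
        "ratio": spaced_gz / plain_gz,
    }


if __name__ == "__main__":
    for name, value in compare(make_sample()).items():
        print(name, value)
```

The ratio printed for a toy sample need not match the 1.225 reported above, since it depends on the text and on the compressor's settings. The point is only that the two inputs carry the same information, so any difference in compressed size reflects the compressor's model and not the content.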

Two Dots Too Many

The Turkish newspaper Hürriyet reports a tragic consequence of the failure to localize cell phones.

Awkward Sneeze

I've often commented upon the deleterious effect of computers on the ability of Chinese to write characters, and the curiously named Jennifer 8. Lee already back in the February 1, 2001 issue of The New York Times wrote a convincing article entitled "Where the PC is mightier than the pen". More recently, I addressed this topic in a January 4, 2007 post on Pinyin News entitled "Chinese Characters as a High-Maintenance Script and the Consequences Thereof". And my friend, David Moser, described to me in a personal communication some years ago that it is nearly impossible to find a Chinese person who can write *both* the second and third characters of the common term DA3 PEN1TI4 ("sneeze") without using pinyin ("spelling") to type them into a computer or looking them up in a dictionary, again usually via pinyin. (I'm intentionally omitting the characters for this and the next term I shall discuss so that individuals who are literate in Chinese and wish to test themselves can do so.)

Now comes further evidence that, whether due to the effect of computers or simply because they would never have known anyway, persons whose main written language is Chinese are unable to write another common expression.
