Archive for Language and computers

Cantonese input methods

Despite the efforts of the central government to clamp down on and diminish the role of Cantonese in education and in public life generally, the language has been experiencing a heady resurgence, especially in connection with the prolonged Umbrella Movement last fall.

"Cantonese resurgent" (12/11/12)

"Here’s why the name of Hong Kong’s 'Umbrella Movement' is so subversive" (10/23/14)

"Translating the Umbrella Revolution" (10/3/14)

"Cantonese protest slogans" (10/26/14), etc.

Read the rest of this entry »

Comments (9)

Zhou Youguang, 109 and going strong

A year ago, I wrote "Zhou Youguang, Father of Pinyin" (1/14/14) to celebrate Zhou xiansheng's 108th birthday and his many accomplishments in language reform and applied linguistics.  Included in that post were a portrait of ZYG in his study and numerous links concerning the man and his works.

Read the rest of this entry »

Comments (3)

Stylometric analysis of the Sony Hacking

The question of who was behind the hacking of Sony peaked a couple of weeks ago, but it is still a live issue.  The United States government insists that it was the North Koreans who did it:

"Chief Says FBI Has No Doubt That North Korea Attacked Sony" (New York Times — January 8, 2015)

James B. Comey, director of the Federal Bureau of Investigation, said on Wednesday that no one should doubt that the North Korean government was behind the destructive attack on Sony’s computer network last fall.

Read the rest of this entry »

Comments (13)

Kazakh

Google Translate just keeps getting bigger and bigger and better and better.  As of today, it now includes Kazakh.  And here's the first word that I typed in Google Translate + Kazakh:

Қазақ

Read the rest of this entry »

Comments (25)

Tim Cook, Bent Man

Last week, China was gaga over Facebook chairman Mark Zuckerberg for gamely, if somewhat lamely, speaking Mandarin before an audience of Tsinghua University students:

"Zuckerberg's Mandarin" (10/23/14)

In the days following his sensational performance at Tsinghua, while not universally showered with adulation (and Facebook is still blocked in China), Zuckerberg was generally acclaimed for his gutsy, good-natured effort to speak to Chinese people in their own language.

In stark contrast, poor Tim Cook (Apple CEO) was mocked by the Chinese netizenry for his declaration in Bloomberg Businessweek:  "So let me be clear: I’m proud to be gay…."

"Tim Cook Speaks Up" (10/30/14)

The resultant hullabaloo on the Chinese internet was instantaneous:

"Tim Cook Coming Out Has Turned China Into a Nation of 5th-Graders:  Despite the Apple CEO's good intentions, Chinese netizens can't seem to stop mocking iPhones for being gay." (10/30/2014)

Read the rest of this entry »

Comments (18)

The paucity of two-letter words

The number of possible two-letter lower-case strings over the English alphabet (not including the apostrophe) is 26² = 676. This morning I ran a script to test which two-letter sequences show up as words included in the standard 25,143-word list of words supplied with many Unix-derived systems (usually at /usr/share/dict/words). I found the proportion of two-letter sequences that are 2-letter words is roughly 9 percent (59/676 ≈ 0.09). That is, more than 90 percent of the logically possible two-letter combinations from aa to zz do not occur as spellings of common English words. You might think a lot of the explanation lies in phonetics: vowelless combinations like pq or bn are unpronounceable. But I then did the same thing for two-letter standard Unix commands: bc (basic calculator), cp (copy files), ls (list files), mv (move or rename files), etc. These arbitrarily adopted program names do not have to be pronounceable, and usually aren't. And I found that the proportion of two-letter sequences that are Unix commands (more precisely, two-letter commands that have manual entries on Apple OS X version 10.6.8) is almost exactly the same (62/676 ≈ 0.09). Why? Could it be that some kind of natural law discourages packing too many meanings into character strings (or phoneme sequences) of a given length, because it is likely to give rise to confusion or mnemonic problems? Does every language waste (as it were) at least 90 percent of the space available in the length-N sequences of letters or sounds that it uses, possibly for every N > 1?
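For readers who want to try the count themselves, here is a minimal sketch in Python. It is not the script used for the figures above; the function names `two_letter_words` and `proportion` are made up for illustration, and it assumes the word list is a plain text file with one entry per line. Entries containing capitals or apostrophes are skipped, matching the "lower-case, no apostrophe" restriction.

```python
from pathlib import Path
from string import ascii_lowercase

# 26 letters in each of two positions: 26^2 = 676 possible strings.
TOTAL = len(ascii_lowercase) ** 2


def two_letter_words(words):
    """Return the set of entries made of exactly two lower-case ASCII letters."""
    return {
        w
        for w in (line.strip() for line in words)
        if len(w) == 2 and all(c in ascii_lowercase for c in w)
    }


def proportion(count, total=TOTAL):
    """Fraction of the possible two-letter strings that are attested."""
    return count / total


if __name__ == "__main__":
    # Assumed location; the path and the contents of the list vary by system.
    path = Path("/usr/share/dict/words")
    if path.exists():
        with path.open(encoding="utf-8", errors="replace") as f:
            found = two_letter_words(f)
        print(len(found), "of", TOTAL, "=", round(proportion(len(found)), 3))
```

The count you get will depend on which word list your system ships, so it need not match the 59 reported here. The same `two_letter_words` function could be fed a list of command names instead of dictionary entries to repeat the Unix-command comparison.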

Read the rest of this entry »

Comments (44)

Stray Chinese characters in English language documents

Lawrence Evalyn wrote to me saying that he received the official communication below about a new student card that is being issued by his university.  He was perplexed by all the Chinese characters that got inserted in the text.  They seem to appear consistently in certain places and for certain letters.  [N.B.:  The communication has been anonymized for posting on Language Log.]

Read the rest of this entry »

Comments (10)

Is the Urdu script on the verge of dying?

Hindi-Urdu, also referred to as Hindustani, is the classic case of a digraphia, so much so that there has been a long-standing controversy over whether they are one language or two.  Their colloquial spoken forms are nearly identical, but when written down, the one in the Devanāgarī script, the other in the Nastaʿlīq script, they have a very different look and "feel".

Read the rest of this entry »

Comments (56)

Language notes from Macao and Hong Kong

From June 13 until the 18th, I was at a conference on Buddhist culture and society held at the University of Macao.  There were about thirty participants, all except me from East Asia, and the East Asians were about evenly divided among scholars from Taiwan, China, Macao, and Hong Kong, plus one each from Japan and Korea.

Read the rest of this entry »

Comments (25)

Machine translation of Literary Sinitic

Here on Language Log, we've often talked about the great difference between Modern Standard Mandarin (MSM) and the various other Sinitic languages (e.g., Cantonese, Taiwanese, Shanghainese, etc.).  The gap between Classical Chinese and all modern Sinitic languages is even greater than that between MSM and the other modern forms of Sinitic.  It is like the difference between Sanskrit and Hindi, between Latin and Italian, between Classical and modern Greek.

Read the rest of this entry »

Comments (15)

The sparseness of linguistic data

Gary Marcus and Ernest Davis say in a New York Times piece on why we shouldn't buy all the hype about the Big Data revolution in science:

Big data is at its best when analyzing things that are extremely common, but often falls short when analyzing things that are less common. For instance, programs that use big data to deal with text, such as search engines and translation programs, often rely heavily on something called trigrams: sequences of three words in a row (like "in a row"). Reliable statistical information can be compiled about common trigrams, precisely because they appear frequently. But no existing body of data will ever be large enough to include all the trigrams that people might use, because of the continuing inventiveness of language.

To select an example more or less at random, a book review that the actor Rob Lowe recently wrote for this newspaper contained nine trigrams such as "dumbed-down escapist fare" that had never before appeared anywhere in all the petabytes of text indexed by Google. To witness the limitations that big data can have with novelty, Google-translate "dumbed-down escapist fare" into German and then back into English: out comes the incoherent "scaled-flight fare." That is a long way from what Mr. Lowe intended — and from big data's aspirations for translation.
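The notion of a trigram in the passage above can be made concrete with a short Python sketch. This is only an illustration under simplifying assumptions: tokens are split on whitespace and lower-cased, and the helper names `trigrams` and `unseen_trigrams` are hypothetical. Real search and translation systems tokenize far more carefully and work with vastly larger corpora.

```python
from collections import Counter


def trigrams(text):
    """Return every sequence of three consecutive words in the text."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)]


def unseen_trigrams(corpus_text, new_text):
    """Return the trigrams of new_text that never occur in corpus_text."""
    seen = Counter(trigrams(corpus_text))
    return [t for t in trigrams(new_text) if t not in seen]
```

Given a corpus and a new text, `unseen_trigrams` returns the word triples for which the corpus offers no statistics at all, which is the situation the quoted passage describes for "dumbed-down escapist fare".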

Read the rest of this entry »

Comments off

The future of Chinese language learning is now

When I began learning Mandarin nearly half a century ago, I knew exactly how I wanted to acquire proficiency in the language.  Nobody had to tell me how to do this; I knew it instinctively.  The main features of my desired regimen would be to:

1. pay little or no attention to memorizing characters (I would have been content with actively mastering 25 or so very high frequency characters and passively recognizing at most a hundred or so high frequency characters during the first year)

2. focus on pronunciation, vocabulary, grammar, particles, morphology, syntax, idioms, patterns, constructions, sentence structure, rhythm, prosody, and so forth — real language, not the script

3. read massive amounts of texts in Romanization and, if possible later on (after about half a year when I had the basics of the language nailed down), in character texts that would be phonetically annotated

Read the rest of this entry »

Comments (40)

Transcriptional and hybrid words in Mandarin

Like all languages, Mandarin and other Sinitic tongues have borrowed and coined words throughout their history.  But it would seem that the pace and nature of the current changes in Chinese usage are of such extraordinary amplitude that an unprecedented transformation is occurring, one that may be marked not merely by differences in quantity and quality, but of order and kind.

Read the rest of this entry »

Comments (5)