Archive for Language and computers


Google Translate just keeps getting bigger and bigger and better and better.  As of today, it includes Kazakh.  And here's the first word that I typed in Google Translate + Kazakh:


Tim Cook, Bent Man

Last week, China was gaga over Facebook chairman Mark Zuckerberg for gamely, if somewhat lamely, speaking Mandarin before an audience of Tsinghua University students:

"Zuckerberg's Mandarin" (10/23/14)

In the days following his sensational performance at Tsinghua, while not universally showered with adulation (and Facebook is still blocked in China), Zuckerberg was generally acclaimed for his gutsy, good-natured effort to speak to Chinese people in their own language.

In stark contrast, poor Tim Cook (Apple CEO) was mocked by the Chinese netizenry for his declaration in Bloomberg Businessweek:  "So let me be clear: I’m proud to be gay…."

"Tim Cook Speaks Up" (10/30/14)

The resultant hullabaloo on the Chinese internet was instantaneous:

"Tim Cook Coming Out Has Turned China Into a Nation of 5th-Graders:  Despite the Apple CEO's good intentions, Chinese netizens can't seem to stop mocking iPhones for being gay. " (10/30/2014)

The paucity of two-letter words

The number of possible two-letter lower-case strings over the English alphabet (not including the apostrophe) is 26² = 676. This morning I ran a script to test which two-letter sequences show up as words included in the standard 25,143-word list of words supplied with many Unix-derived systems (usually at /usr/share/dict/words). I found the proportion of two-letter sequences that are two-letter words is roughly 9 percent (59/676 ≈ 0.09). That is, more than 90 percent of the logically possible two-letter combinations from aa to zz do not occur as spellings of common English words. You might think a lot of the explanation lies in phonetics: vowelless combinations like pq or bn are unpronounceable. But I then did the same thing for two-letter standard Unix commands: bc (basic calculator), cp (copy files), ls (list files), mv (move or rename files), etc. These arbitrarily adopted program names do not have to be pronounceable, and usually aren't. And I found that the proportion of two-letter sequences that are Unix commands (more precisely, two-letter commands that have manual entries on Apple OS X version 10.6.8) is almost exactly the same (62/676 ≈ 0.09). Why? Could it be that some kind of natural law discourages packing too many meanings into character strings (or phoneme sequences) of a given length, because it is likely to give rise to confusion or mnemonic problems? Does every language waste (as it were) at least 90 percent of the space available in the length-N sequences of letters or sounds that it uses, possibly for every N > 1?
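The count described above could be reproduced along the following lines. This is only a sketch, not the script actually used for the post: the function names (`all_two_letter_strings`, `attested`, `proportion`, `load_lines`) are invented for illustration, and the result will vary with the word list installed on a given system.

```python
from itertools import product
from string import ascii_lowercase


def all_two_letter_strings():
    """Return every lower-case two-letter string, aa through zz (26 * 26 = 676)."""
    return {a + b for a, b in product(ascii_lowercase, repeat=2)}


def attested(candidates, vocabulary):
    """Return the candidates that occur verbatim in the vocabulary.

    Surrounding whitespace and newlines are stripped from each vocabulary
    entry; case is left alone, so capitalized entries never match.
    """
    vocab = {entry.strip() for entry in vocabulary}
    return candidates & vocab


def proportion(hits, candidates):
    """Fraction of the candidate space that is actually used."""
    return len(hits) / len(candidates)


def load_lines(path):
    """Read a one-entry-per-line file such as /usr/share/dict/words.

    The path is an assumption about the local system; the file may be
    missing or may differ in size from the list mentioned in the post.
    """
    with open(path, encoding="utf-8", errors="replace") as handle:
        return [line.strip() for line in handle]
```

Passing `load_lines("/usr/share/dict/words")` as the vocabulary gives the word-list figure. Passing a list of command names gathered from the local manual pages (however one chooses to collect them) gives the corresponding figure for Unix commands.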

Stray Chinese characters in English language documents

Lawrence Evalyn wrote to me saying that he received the official communication below about a new student card that is being issued by his university.  He was perplexed by all the Chinese characters that got inserted in the text.  They seem to appear consistently in certain places and for certain letters.  [N.B.:  The communication has been anonymized for posting on Language Log.]

Is the Urdu script on the verge of dying?

Hindi-Urdu, also referred to as Hindustani, is the classic case of digraphia, so much so that there has been a long-standing controversy over whether Hindi and Urdu are one language or two.  Their colloquial spoken forms are nearly identical, but when written down, the one in the Devanāgarī script, the other in the Nastaʿlīq script, they have a very different look and "feel".

Language notes from Macao and Hong Kong

From June 13 until the 18th, I was at a conference on Buddhist culture and society held at the University of Macao.  There were about thirty participants, all except me from East Asia, and the East Asians were about evenly divided among scholars from Taiwan, China, Macao, and Hong Kong, plus one each from Japan and Korea.

Machine translation of Literary Sinitic

Here on Language Log, we've often talked about the great difference between Modern Standard Mandarin (MSM) and the various other Sinitic languages (e.g., Cantonese, Taiwanese, Shanghainese, etc.).  The gap between Classical Chinese and all modern Sinitic languages is even greater than that between MSM and the other modern forms of Sinitic.  It is like the difference between Sanskrit and Hindi, between Latin and Italian, between Classical and modern Greek.

The sparseness of linguistic data

Gary Marcus and Ernest Davis say in a New York Times piece on why we shouldn't buy all the hype about the Big Data revolution in science:

Big data is at its best when analyzing things that are extremely common, but often falls short when analyzing things that are less common. For instance, programs that use big data to deal with text, such as search engines and translation programs, often rely heavily on something called trigrams: sequences of three words in a row (like "in a row"). Reliable statistical information can be compiled about common trigrams, precisely because they appear frequently. But no existing body of data will ever be large enough to include all the trigrams that people might use, because of the continuing inventiveness of language.

To select an example more or less at random, a book review that the actor Rob Lowe recently wrote for this newspaper contained nine trigrams such as "dumbed-down escapist fare" that had never before appeared anywhere in all the petabytes of text indexed by Google. To witness the limitations that big data can have with novelty, Google-translate "dumbed-down escapist fare" into German and then back into English: out comes the incoherent "scaled-flight fare." That is a long way from what Mr. Lowe intended — and from big data's aspirations for translation.
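To make the notion of a trigram concrete, here is a minimal sketch of how three-word sequences might be extracted from text and checked against counts gathered from a reference corpus. The tokenizer is deliberately crude (lower-cased runs of letters, digits, apostrophes, and hyphens), and the function names are hypothetical; real search and translation systems are far more elaborate.

```python
import re
from collections import Counter

# Crude tokenizer: runs of letters, digits, apostrophes, and hyphens.
TOKEN = re.compile(r"[A-Za-z0-9'-]+")


def tokenize(text):
    """Split text into lower-cased word tokens."""
    return [token.lower() for token in TOKEN.findall(text)]


def trigrams(text):
    """Return every sequence of three consecutive tokens in the text."""
    tokens = tokenize(text)
    return list(zip(tokens, tokens[1:], tokens[2:]))


def count_trigrams(texts):
    """Tally trigram frequencies over an iterable of texts (the 'corpus')."""
    counts = Counter()
    for text in texts:
        counts.update(trigrams(text))
    return counts


def unseen(text, counts):
    """Return the trigrams of text that never occurred in the corpus."""
    return [gram for gram in trigrams(text) if counts[gram] == 0]
```

With counts built from any finite corpus, `unseen` will keep turning up novel trigrams in new text, which is the sparseness problem the quoted passage describes.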

The future of Chinese language learning is now

When I began learning Mandarin nearly half a century ago, I knew exactly how I wanted to acquire proficiency in the language.  Nobody had to tell me how to do this; I knew it instinctively.  The main features of my desired regimen would be to:

1. pay little or no attention to memorizing characters (I would have been content with actively mastering 25 or so very high frequency characters and passively recognizing at most a hundred or so high frequency characters during the first year)

2. focus on pronunciation, vocabulary, grammar, particles, morphology, syntax, idioms, patterns, constructions, sentence structure, rhythm, prosody, and so forth — real language, not the script

3. read massive amounts of texts in Romanization and, if possible later on (after about half a year when I had the basics of the language nailed down), in character texts that would be phonetically annotated

Transcriptional and hybrid words in Mandarin

Like all languages, Mandarin and other Sinitic tongues have borrowed and coined words throughout their history.  But it would seem that the pace and nature of the current changes in Chinese usage are of such extraordinary amplitude that an unprecedented transformation is occurring, one that may be marked not merely by differences in quantity and quality, but by differences of order and kind.

Swype and Voice Recognition for mobile device inputting

In late 2012, while visiting my son Tom in Dallas, I noticed that he was doing something very odd with his cell phone.  Most people enter text into their cell phone by pressing their thumbs (or their fingertip) on the letters of a small keyboard, whether virtual or actual.  But Tom was doing something altogether different:  he was sliding his finger over the glass surface of his phone and somehow, by so doing, he was able to enter text.  I was dumbfounded!  What amazed me most of all was how casual he was about it.  He'd be talking to me about something, then glance down at his cell phone, move his fingertip around on the glass, and — presto digito! — he'd have typed a message to someone and sent it off.

"People mountain, people sea" and "let's play"

Stephan Stiller says that my post on "Good good study; day day up" reminds him of "people mountain, people sea" (rénshānrénhǎi 人山人海), i.e., "crowded; packed; a sea of people".  This is another fairly complex Chinglishism that has entered the vocabulary of many English speakers who know no Chinese.  It was popularized by a Hong Kong music production company that took this expression as its name, and there was also a Hong Kong film that used this expression as its title.

Sneeze, hiccup, cough

Exceedingly few people (almost none) can write the Chinese characters for the Mandarin word for "sneeze" (dǎ pēntì).  I suspect that most people would also get one or both of the characters for "cough" (késou) wrong, though it's not as hard as dǎ pēntì.

I mentioned this surmise to several colleagues and encouraged them to test themselves, their friends, and their students to see whether they could write késou correctly, or even at all.  I cautioned them that those being tested should not be permitted to use any electronic device or reference material (dictionaries, etc.) as a reminder of how to write the two characters for késou.  The characters must simply be written out directly on paper by hand.

