Archive for Language and computers

Machine translation of Literary Sinitic

Here on Language Log, we've often talked about the great difference between Modern Standard Mandarin (MSM) and the various other Sinitic languages (e.g., Cantonese, Taiwanese, Shanghainese, etc.).  The gap between Classical Chinese and all modern Sinitic languages is even greater than that between MSM and the other modern forms of Sinitic.  It is like the difference between Sanskrit and Hindi, between Latin and Italian, between Classical and modern Greek.

Comments (15)

The sparseness of linguistic data

Gary Marcus and Ernest Davis say in a New York Times piece on why we shouldn't buy all the hype about the Big Data revolution in science:

Big data is at its best when analyzing things that are extremely common, but often falls short when analyzing things that are less common. For instance, programs that use big data to deal with text, such as search engines and translation programs, often rely heavily on something called trigrams: sequences of three words in a row (like "in a row"). Reliable statistical information can be compiled about common trigrams, precisely because they appear frequently. But no existing body of data will ever be large enough to include all the trigrams that people might use, because of the continuing inventiveness of language.

To select an example more or less at random, a book review that the actor Rob Lowe recently wrote for this newspaper contained nine trigrams such as "dumbed-down escapist fare" that had never before appeared anywhere in all the petabytes of text indexed by Google. To witness the limitations that big data can have with novelty, Google-translate "dumbed-down escapist fare" into German and then back into English: out comes the incoherent "scaled-flight fare." That is a long way from what Mr. Lowe intended — and from big data's aspirations for translation.
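
The sparseness that Marcus and Davis describe can be illustrated with a small sketch. The Python below is not how any search engine or translation system actually works. It assumes a toy corpus invented for illustration, naive whitespace tokenization, and two hypothetical helper functions, trigrams and unseen_trigrams. It simply counts three-word sequences in the corpus and reports which trigrams in a new piece of text were never observed.

```python
from collections import Counter


def trigrams(text):
    """Return every run of three consecutive words in text (lowercased)."""
    words = text.lower().split()  # naive whitespace tokenization (an assumption)
    return [tuple(words[i:i + 3]) for i in range(len(words) - 2)]


def unseen_trigrams(corpus, new_text):
    """Return the trigrams of new_text that never occur in corpus."""
    counts = Counter(trigrams(corpus))
    # A Counter returns 0 for keys it has never seen.
    return [t for t in trigrams(new_text) if counts[t] == 0]


# Toy corpus, made up for this example only.
corpus = "three words in a row are a trigram and three words in a row are common"
new_text = "dumbed-down escapist fare in a row"

for t in unseen_trigrams(corpus, new_text):
    print(" ".join(t))
```

With this toy corpus, "in a row" has been seen before, while every trigram that overlaps "dumbed-down escapist fare" has a count of zero. A purely count-based model has no statistics at all for such sequences, which is the limitation the quoted passage points to.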

Comments off

The future of Chinese language learning is now

When I began learning Mandarin nearly half a century ago, I knew exactly how I wanted to acquire proficiency in the language.  Nobody had to tell me how to do this; I knew it instinctively.  The main features of my desired regimen would be to:

1. pay little or no attention to memorizing characters (I would have been content with actively mastering 25 or so very high frequency characters and passively recognizing at most a hundred or so high frequency characters during the first year)

2. focus on pronunciation, vocabulary, grammar, particles, morphology, syntax, idioms, patterns, constructions, sentence structure, rhythm, prosody, and so forth — real language, not the script

3. read massive amounts of texts in Romanization and, if possible later on (after about half a year when I had the basics of the language nailed down), in character texts that would be phonetically annotated

Comments (40)

Transcriptional and hybrid words in Mandarin

Like all languages, Mandarin and other Sinitic tongues have borrowed and coined words throughout their history.  But it would seem that the pace and nature of the current changes in Chinese usage are of such extraordinary amplitude that an unprecedented transformation is occurring, one that may be marked not merely by differences in quantity and quality, but by differences of order and kind.

Comments (5)

Swype and Voice Recognition for mobile device inputting

In late 2012, while visiting my son Tom in Dallas, I noticed that he was doing something very odd with his cell phone.  Most people enter text into their cell phone by pressing their thumbs (or their fingertip) on the letters of a small keyboard, whether virtual or actual.  But Tom was doing something altogether different:  he was sliding his finger over the glass surface of his phone and somehow, by so doing, he was able to enter text.  I was dumbfounded!  What amazed me most of all was how casual he was about it.  He'd be talking to me about something, then glance down at his cell phone, move his fingertip around on the glass, and — presto digito! — he'd have typed a message to someone and sent it off.

Comments (42)

"People mountain, people sea" and "let's play"

Stephan Stiller says that my post on "Good good study; day day up" reminds him of "people mountain, people sea" (rénshānrénhǎi 人山人海), i.e., "crowded; packed; a sea of people".  This is another fairly complex Chinglishism that has entered the vocabulary of many English speakers who know no Chinese.  It was popularized by a Hong Kong music production company that took this expression as its name, and there was also a Hong Kong film that used this expression as its title.

Comments (31)

Sneeze, hiccup, cough

Exceedingly few people (almost none) can write the Chinese characters for the Mandarin word for "sneeze" (dǎ pēntì).  I suspect that most people would also get one or both of the characters for "cough" (késou) wrong, though it's not as hard as dǎ pēntì.

I mentioned this surmise to several colleagues and encouraged them to test themselves, their friends, and their students to see whether they could write késou correctly, or even at all.  I cautioned them that those being tested must not be permitted to use any electronic device or reference material (dictionaries, etc.) to remind themselves how to write the two characters for késou.  The characters must simply be written out directly on paper by hand.

Comments (13)

Substituting Pinyin for unknown Chinese characters

On September 25, I posted on "Character amnesia and the emergence of digraphia", which occasioned a vigorous debate. A few of the commenters thought the essay in question wasn't actually written by a student. Be that as it may, this habit of replacing characters by Pinyin is becoming more and more common, especially among young students. Let us look at this scene from the Chinese documentary "Qǐng tóu wǒ yī piào" 请投我一票 (Please vote for me) at (34:29).

Comments (30)

A fair-use victory for Google in these United States

US Circuit Judge Denny Chin has ruled in favor of Google in its long-running copyright litigation with the Authors Guild over the scanning and digitization of books. Chin ruled that the Google Books project constitutes fair use because it is "highly transformative" and "provides significant public benefits." In explaining those public benefits, Chin cited the use of Google Books data for Ngram queries, and pointed to a research example that we've discussed several times on Language Log.

Comments (29)

CD tilde home

Thomas Pynchon's recent novel Bleeding Edge is set in New York City, after the dot.com bust of 3/10/2000 and shortly before the World Trade Center attack of 9/11/2001.

The central figure is Maxine Tarnow, who runs a small fraud-investigation outfit called Tail 'Em and Nail 'Em, and many of her clients and her friends are associated with the failed, failing, or somehow-surviving start-ups of Silicon Alley. As a result, a lot of the local linguistic color in this novel is geekish in nature.

Comments (33)

Of toads, modernization, and simplified characters

Considering the fact that we've had a lot of traffic on spelling bees, character amnesia, simplified characters, and whatnot on Language Log recently, it's not surprising that the following article by Dan Kedmey would appear in Time yesterday (Aug. 15, 2013), though without any mention of Language Log:  "What the Word 'Toad' Can Tell You About China’s Modernization".

At first I was going to just write a short note about this article and add it as a comment to this post from a week ago.  But the more I read through the article, the more annoyed I became by how riddled with errors it is.  So I've decided to write this post listing some of the more egregious mistakes, lest innocent readers be led astray.  After all, Time still commands a substantial readership, so the magazine needs to be held accountable for the accuracy of its statements, even when writing about something so supposedly quaint as Chinese — which, by now, certainly should no longer be viewed as exotic at all, since China has become very much a part of the global economy.

Comments (26)

Noodle devils

Nathan Vedal wrote to tell me about an interesting mistranslation into Chinese that he recently came across.

Having purchased some not particularly healthy, but quite delicious, instant noodles produced by a Korean company, he was perusing the Chinese instructions, which included the following sentence:

Comments (10)

Copying characters

From a collection of photographs of Chinese school children in rural areas:

Comments (80)