Language Log

Archive for Language and computers

The paucity of two-letter words

September 3, 2014 @ 10:25 am · Filed by Geoffrey K. Pullum under Language and computers

The number of possible two-letter lower-case strings over the English alphabet (not including the apostrophe) is 26² = 676. This morning I ran a script to test which two-letter sequences show up as words in the standard 25,143-word list supplied with many Unix-derived systems (usually at /usr/share/dict/words). I found that the proportion of two-letter sequences that are two-letter words is roughly 9 percent (59/676 ≈ 0.09). That is, more than 90 percent of the logically possible two-letter combinations from aa to zz do not occur as spellings of common English words. You might think a lot of the explanation lies in phonetics: vowelless combinations like pq or bn are unpronounceable. But I then did the same thing for two-letter standard Unix commands: bc (basic calculator), cp (copy files), ls (list files), mv (move or rename files), etc. These arbitrarily adopted program names do not have to be pronounceable, and usually aren't. And I found that the proportion of two-letter sequences that are Unix commands (more precisely, two-letter commands that have manual entries on Apple OS X version 10.6.8) is almost exactly the same (62/676 ≈ 0.09). Why? Could it be that some kind of natural law discourages packing too many meanings into character strings (or phoneme sequences) of a given length, because it is likely to give rise to confusion or mnemonic problems? Does every language waste (as it were) at least 90 percent of the space available in the length-N sequences of letters or sounds that it uses, possibly for every N > 1?
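
The count described above is easy to reproduce in outline. What follows is only a minimal sketch, not the script used for the post: the function names are made up for illustration, and it assumes a word list with one entry per line and no case folding (so a capitalized entry such as "Hi" is not counted as a lower-case string).

```python
from itertools import product
from string import ascii_lowercase


def two_letter_coverage(words):
    """Return (attested, possible, fraction) for two-letter lower-case strings.

    `words` is any iterable of strings, e.g. the lines of a word list.
    Assumption: entries are matched exactly as written, with no case folding.
    """
    # All 26 * 26 = 676 logically possible strings from "aa" to "zz".
    possible = {a + b for a, b in product(ascii_lowercase, repeat=2)}
    # Keep only those that actually occur in the list.
    attested = possible & {w.strip() for w in words}
    return len(attested), len(possible), len(attested) / len(possible)


def load_words(path="/usr/share/dict/words"):
    """Read a one-word-per-line list; the default path is the usual location."""
    with open(path, encoding="utf-8") as fh:
        return [line.strip() for line in fh]


# Tiny illustrative sample, not real data from any word list.
sample = ["an", "at", "be", "cat", "Hi", "to"]
attested, possible, fraction = two_letter_coverage(sample)
print(attested, possible, round(fraction, 4))
```

On a machine that has the file, `two_letter_coverage(load_words())` would give the corresponding figure for the local word list, which may well differ from the 59 reported here, since word lists vary between systems.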

Stray Chinese characters in English language documents

August 22, 2014 @ 9:33 pm · Filed by Victor Mair under Language and computers, Writing systems

Lawrence Evalyn wrote to me saying that he received the official communication below about a new student card that is being issued by his university. He was perplexed by all the Chinese characters that got inserted in the text. They seem to appear consistently in certain places and for certain letters. [N.B.: The communication has been anonymized for posting on Language Log.]

Is the Urdu script on the verge of dying?

June 29, 2014 @ 3:20 am · Filed by Victor Mair under Diglossia and digraphia, Language and computers, Language on the internets, Writing

Hindi-Urdu, also referred to as Hindustani, is the classic case of digraphia, so much so that there has been a long-standing controversy over whether Hindi and Urdu are one language or two. Their colloquial spoken forms are nearly identical, but when written down, the one in the Devanāgarī script, the other in the Nastaʿlīq script, they have a very different look and "feel".

Language notes from Macao and Hong Kong

June 22, 2014 @ 1:31 am · Filed by Victor Mair under Language and computers, Lost in translation, Multilingualism, Topolects

From June 13 until the 18th, I was at a conference on Buddhist culture and society held at the University of Macao. There were about thirty participants, all except me from East Asia, and the East Asians were about evenly divided among scholars from Taiwan, China, Macao, and Hong Kong, plus one each from Japan and Korea.

Machine translation of Literary Sinitic

June 8, 2014 @ 10:18 am · Filed by Victor Mair under Language and computers, Translation

Here on Language Log, we've often talked about the great difference between Modern Standard Mandarin (MSM) and the various other Sinitic languages (e.g., Cantonese, Taiwanese, Shanghainese, etc.). The gap between Classical Chinese and all modern Sinitic languages is even greater than that between MSM and the other modern forms of Sinitic. It is like the difference between Sanskrit and Hindi, between Latin and Italian, between Classical and modern Greek.

The sparseness of linguistic data

April 7, 2014 @ 4:42 am · Filed by Geoffrey K. Pullum under Changing times, Grammar, Information technology, Language and computers, Lost in translation, Research tools, Resources

Gary Marcus and Ernest Davis say in a New York Times piece on why we shouldn't buy all the hype about the Big Data revolution in science:

Big data is at its best when analyzing things that are extremely common, but often falls short when analyzing things that are less common. For instance, programs that use big data to deal with text, such as search engines and translation programs, often rely heavily on something called trigrams: sequences of three words in a row (like "in a row"). Reliable statistical information can be compiled about common trigrams, precisely because they appear frequently. But no existing body of data will ever be large enough to include all the trigrams that people might use, because of the continuing inventiveness of language.

To select an example more or less at random, a book review that the actor Rob Lowe recently wrote for this newspaper contained nine trigrams such as "dumbed-down escapist fare" that had never before appeared anywhere in all the petabytes of text indexed by Google. To witness the limitations that big data can have with novelty, Google-translate "dumbed-down escapist fare" into German and then back into English: out comes the incoherent "scaled-flight fare." That is a long way from what Mr. Lowe intended — and from big data's aspirations for translation.
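
The trigram idea in the quoted passage can be made concrete with a short sketch. This is an illustration only: the tokenizer is a deliberately crude assumption (lower-cased runs of letters and digits, with internal hyphens and apostrophes kept), the function names are hypothetical, and nothing here reflects how any search engine or translation system actually counts trigrams.

```python
import re


def trigrams(text):
    """Return every sequence of three consecutive word tokens in `text`."""
    # Crude tokenizer (an assumption): keeps "dumbed-down" and "it's" whole.
    tokens = re.findall(r"[a-z0-9]+(?:[-'][a-z0-9]+)*", text.lower())
    return list(zip(tokens, tokens[1:], tokens[2:]))


def unseen_trigrams(corpus_text, new_text):
    """Return the trigrams of `new_text` that never occur in `corpus_text`."""
    seen = set(trigrams(corpus_text))
    return [t for t in trigrams(new_text) if t not in seen]


print(trigrams("in a row"))
print(unseen_trigrams("three words in a row", "words in a row today"))
```

However large the corpus passed as `corpus_text`, a new text can always contain trigrams that `unseen_trigrams` reports as novel, which is the sparseness the quoted passage describes.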

The future of Chinese language learning is now

April 5, 2014 @ 3:14 pm · Filed by Victor Mair under Dictionaries, Information technology, Language acquisition, Language and computers, Language and education, Language and technology, Language teaching and learning, Pedagogy

When I began learning Mandarin nearly half a century ago, I knew exactly how I wanted to acquire proficiency in the language. Nobody had to tell me how to do this; I knew it instinctively. The main features of my desired regimen would be to:

1. pay little or no attention to memorizing characters (I would have been content with actively mastering 25 or so very high frequency characters and passively recognizing at most a hundred or so high frequency characters during the first year)

2. focus on pronunciation, vocabulary, grammar, particles, morphology, syntax, idioms, patterns, constructions, sentence structure, rhythm, prosody, and so forth — real language, not the script

3. read massive amounts of texts in Romanization and, if possible later on (after about half a year when I had the basics of the language nailed down), in character texts that would be phonetically annotated

Transcriptional and hybrid words in Mandarin

March 6, 2014 @ 11:05 am · Filed by Victor Mair under Borrowing, Errors, Language and computers, Transcription

Like all languages, Mandarin and other Sinitic tongues have borrowed and coined words throughout their history. But it would seem that the pace and nature of the current changes in Chinese usage are of such extraordinary amplitude that an unprecedented transformation is occurring, one that may be marked not merely by differences in quantity and quality, but by differences of order and kind.

Swype and Voice Recognition for mobile device inputting

January 22, 2014 @ 2:14 pm · Filed by Victor Mair under Information technology, Language and computers, Language and technology, Speech technology, Writing systems

In late 2012, while visiting my son Tom in Dallas, I noticed that he was doing something very odd with his cell phone. Most people enter text into their cell phone by pressing their thumbs (or their fingertip) on the letters of a small keyboard, whether virtual or actual. But Tom was doing something altogether different: he was sliding his finger over the glass surface of his phone and somehow, by so doing, he was able to enter text. I was dumbfounded! What amazed me most of all was how casual he was about it. He'd be talking to me about something, then glance down at his cell phone, move his fingertip around on the glass, and — presto digito! — he'd have typed a message to someone and sent it off.

"People mountain, people sea" and "let's play"

January 19, 2014 @ 7:12 am · Filed by Victor Mair under Found in translation, Language and computers, Language and culture, Language on the internets, Lost in translation, Translatese

Stephan Stiller says that my post on "Good good study; day day up" reminds him of "people mountain, people sea" (rénshānrénhǎi 人山人海), i.e., "crowded; packed; a sea of people". This is another fairly complex Chinglishism that has entered the vocabulary of many English speakers who know no Chinese. It was popularized by a Hong Kong music production company that took this expression as its name, and there was also a Hong Kong film that used this expression as its title.

Sneeze, hiccup, cough

December 19, 2013 @ 9:36 pm · Filed by Victor Mair under Language and computers, Words words words, Writing, Writing systems

Exceedingly few people (almost none) can write the Chinese characters for the Mandarin word for "sneeze" (dǎ pēntì). I suspect that most people would also get one or both of the characters for "cough" (késou) wrong, though it's not as hard as dǎ pēntì.

I mentioned this surmise to several colleagues and encouraged them to test themselves, their friends, and their students to see whether they could write késou correctly, or even at all. I cautioned them that those being tested should not be permitted to use any electronic device or reference material (dictionaries, etc.) to remind themselves how to write the two characters for késou. The characters must simply be written out directly on paper by hand.

Substituting Pinyin for unknown Chinese characters

December 3, 2013 @ 2:28 am · Filed by Victor Mair under Alphabets, Diglossia and digraphia, Language and computers, Language and the movies, Writing systems

On September 25, I posted on "Character amnesia and the emergence of digraphia", which occasioned a vigorous debate. A few of the commenters thought the essay in question wasn't actually written by a student. Be that as it may, this habit of replacing characters with Pinyin is becoming more and more common, especially among young students. Let us look at this scene from the Chinese documentary "Qǐng tóu wǒ yī piào" 请投我一票 (Please vote for me) at 34:29.

A fair-use victory for Google in these United States

November 14, 2013 @ 11:56 am · Filed by Ben Zimmer under Language and computers, Language and technology, Language and the law

US Circuit Judge Denny Chin has ruled in favor of Google in its long-running copyright litigation with the Authors Guild over the scanning and digitization of books. Chin ruled that the Google Books project constitutes fair use because it is "highly transformative" and "provides significant public benefits." In explaining those public benefits, Chin cited the use of Google Books data for Ngram queries, and pointed to a research example that we've discussed several times on Language Log.
