Archive for Language and technology
A few million monkeys (yawn)
Language Log readers may be wondering why there has been no coverage of the achievement of Jesse Anderson, who has managed to get millions of monkeys, as computationally simulated on Amazon servers, to reproduce 99.9 percent of the works of Shakespeare (his own account is here on his blog, and various journalistic sheep have obediently reproduced his account in the newspapers). I'll tell you why.
Sequoyah's syllabary, from parchment to iPad
In a great use of comic art, Roy Boney Jr. has created a graphic feature for the magazine Indian Country Today about the history of the Cherokee syllabary developed by Sequoyah in the early 19th century. Boney begins with the syllabary's inception and early use, and continues all the way through technological developments like the Selectric typewriter and Unicode standardization. Check it out here.
The economics of Chinese character usage
Under the above rubric, my friend Apollo Wu sent around a note (copied below) about the economic impact of the use of Chinese characters in the operation of his business. Since Apollo was for many years (from 1973 to 1998) a top translator in the Chinese Translation Service at United Nations headquarters in New York, he knows whereof he speaks. Among other interesting tidbits that I heard from Apollo over the decades was that, of the official languages of the United Nations (Arabic, Mandarin Chinese, English, French, Russian, and Castilian Spanish), Chinese was by far the least efficient and most expensive to process.
Password strength
We neglected to mention this while the relevant cartoon was the current one at xkcd, but a couple of days ago there was a nice analysis of why, through 20 years of effort, we've successfully trained everyone to use passwords that are hard for humans to remember but easy for computers to guess. Check it out. The observation seems correct: if you try it out on one of the web interfaces that assess the strength of your password as you choose it, you'll find that a word with a few letters replaced by miscellaneous digits and so on, like Ne8r@$k@, gets high marks but grizzle snip grunt mackerel doesn't (and probably won't be accepted beyond the first 8 to 12 characters). Yet if you mutter "grizzle snip grunt mackerel" under your breath once, you'll find you remember it all day, even without using it. And length is your main security. The example the cartoon gives contrasts a 3-day brute-force cracking time (for about 28 bits of entropy) with a 550-year time (for about 44).
[Comments are closed unless you have a password. If you have forgotten your password, click here.]
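For anyone who wants to check the cartoon's arithmetic, here is a rough sketch in Python. Two figures in it are not stated in the post and are assumptions taken from the cartoon as I recall it: an attacker making 1000 guesses per second, and a passphrase of four words drawn at random from a list of 2048 common words (11 bits each). The function names are made up for illustration.

```python
import math

# Assumption (from the cartoon, not from the post): 1000 guesses per second.
GUESSES_PER_SECOND = 1000

SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_YEAR = 365.25 * SECONDS_PER_DAY


def brute_force_seconds(entropy_bits, guesses_per_second=GUESSES_PER_SECOND):
    """Time to try every one of the 2**entropy_bits possibilities."""
    return 2 ** entropy_bits / guesses_per_second


def passphrase_entropy_bits(word_count, wordlist_size):
    """Entropy of word_count words chosen uniformly at random from a list."""
    return word_count * math.log2(wordlist_size)


# The mangled single word: about 28 bits of entropy.
days = brute_force_seconds(28) / SECONDS_PER_DAY

# Four random common words; the list size of 2048 is an assumption.
phrase_bits = passphrase_entropy_bits(4, 2048)
years = brute_force_seconds(phrase_bits) / SECONDS_PER_YEAR

print(f"28 bits: about {days:.1f} days")
print(f"{phrase_bits:.0f} bits: about {years:.0f} years")
```

Each extra bit doubles the search time, so the 16-bit gap between the two passwords multiplies the cracking time by 65,536, which is why a few days becomes several centuries.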
Microsoft tech writing noun pile blog post madness!
Fans of noun piles will enjoy the recent blog post by Mike Pope, a technical editor at Microsoft, "Fun (or not) with noun stacks." Mike shares a few of the lovely compound noun pileups he's encountered on the job:
- data bound control table row action links
- failed password security question answer attempts limit
- reduced minimum OS partition space available requirement
Mike goes on to explain why he thinks these problematic constructions continue to crop up in technical writing, driven by imperatives of terseness and concision at the expense of comprehensibility. He also gives helpful advice for untangling technical noun piles into something more user-friendly. That's all well and good, but you have to wonder just how deeply enmeshed in nerdview a writer must be to produce a whopper like "failed password security question answer attempts limit."
Translationese
Looking at Geoff's post on machine-translated phishing scam messages, the message certainly does come across as very similar to the English output we in the biz frequently see coming out of statistical machine translation of Chinese. This includes Chinese-specific issues like recovering correct determiners from a language that does not express them overtly (I hope that the [not this] letter meets you in good spirits), as well as the ubiquitous phenomenon of sentences that are locally coherent — thanks to phrase-level translations and good statistical language-models for English — but globally nonsensical. I don't claim to know what makes a text poetic, but it seems to me that this combination of local coherence and larger-scale disconnectedness must be at least partly responsible for what Geoff describes as the "strange poetry" of machine translationese.
The barley is their goal
You know what I think is happening? This is just too insane not to be true. I believe Hong Kong script kiddies wanting to try Nigerian-style thieving of bank account details are actually using Google Translate to translate their phishing messages from Chinese into English. Below the fold I quote in full (obscuring my address with x's to outwit the spam robots) a wildly, asyntactically unintelligible phishing spam which I received today. It's unintendedly hilarious — you could try reading it aloud at parties. And it's so garbled and implausible that I can't believe even poor naive Aunt Mildred will be suckered. Interestingly, it shows clear signs of being the output of very bad corpus-based translation, unsupervised and unchecked. My suspicion of Chinese provenance was based not just on the .hk (Hong Kong) address, but also on the fact that the spammer thinks an English-speaking PhD named Dr. Roller Key would refer to himself as Dr. Roller — that is, the Chinese syntax for personal names is being assumed.
Spam for sale
I guess I had not really foreseen how fast the advent of ebooks would lead to a gigantic, unstoppable tsunami of what can only be described as bookspam, available for sale at Amazon.com. Have a look at this article by John Naughton, about the results of Amazon making available an easy conversion to Kindle format and easy uploading for sale.
Chinese typewriter, part 2
On June 30, 2009, I wrote a post entitled "Chinese Typewriter". It's time now to do an update, because on March 9, 2011, I travelled to the University of Kansas to deliver the Wallace Johnson Memorial Lecture. So what do Wallace Johnson and the University of Kansas have to do with Chinese typewriters? It's simply that Wallace Johnson is the only Westerner I know who became proficient in the use of the kind of Chinese typewriter I wrote about in my 2009 post, and he happened to teach Chinese history at the University of Kansas from 1965 to 2007. I knew Wally Johnson because of his interest in Tang period law and because he received his Ph.D. from the University of Pennsylvania under Derk Bodde, who was a good friend of mine.
Not sacrificing anything to prevent anything…not
From a Livescience.com article (about a police chief who recommends keystroke-logging your kids to obtain their passwords so you can find out where they go online) comes this disastrous tangle of a sentence, which will take hours of police time to clear up:
"When it comes down to safety and welfare of your child, I don’t think any parent would sacrifice anything to make sure nothing happens to their children," said Batelli, the father of a teenage daughter.
Could Watson parse a snowclone?
Today on The Atlantic I break down Watson's big win over the humans in the Jeopardy!/IBM challenge. (See previous Language Log coverage here and here.) I was particularly struck by the snowclone that Ken Jennings left on his Final Jeopardy response card last night: "I, for one, welcome our new computer overlords." I use that offhand comment as a jumping-off point to dismantle some of the hype about Watson's purported ability to "understand" natural language.
New search service for language resources
It has just become a whole lot easier to search the world's language archives. The new OLAC Language Resource Catalog contains descriptions of over 100,000 language resources from over 40 language archives worldwide.
This catalog, developed by the Open Language Archives Community (OLAC), provides access to a wealth of information about thousands of languages, including details of text collections, audio recordings, dictionaries, and software, sourced from dozens of digital and traditional archives.
OLAC is an international partnership of institutions and individuals who are creating a worldwide virtual library of language resources by: (i) developing consensus on best current practice for the digital archiving of language resources, and (ii) developing a network of interoperating repositories and services for housing and accessing such resources. The OLAC Language Resource Catalog was developed by staff at the Linguistic Data Consortium, the University of Pennsylvania Libraries, the Graduate Institute of Applied Linguistics, and the University of Melbourne. The primary sponsor is the National Science Foundation.
