Archive for Language and technology

The barley is their goal

You know what I think is happening? This is just too insane not to be true. I believe Hong Kong script kiddies wanting to try Nigerian-style thieving of bank account details are actually using Google Translate to translate their phishing messages from Chinese into English. Below the fold I quote in full (obscuring my address with x's to outwit the spam robots) a wildly, asyntactically unintelligible phishing spam which I received today. It's unintendedly hilarious — you could try reading it aloud at parties. And it's so garbled and implausible that I can't believe even poor naive Aunt Mildred will be suckered. Interestingly, it shows clear signs of being the output of very bad corpus-based translation, unsupervised and unchecked. My suspicion of Chinese provenance was based not just on the .hk (Hong Kong) address, but also on the fact that the spammer thinks an English-speaking PhD named Dr. Roller Key would refer to himself as Dr. Roller — that is, the Chinese syntax for personal names is being assumed.

Spam for sale

I guess I had not really foreseen how fast the advent of ebooks would lead to a gigantic, unstoppable tsunami of what can only be described as bookspam, available for sale at Amazon.com. Have a look at this article by John Naughton, about the results of Amazon making available an easy conversion to Kindle format and easy uploading for sale.

Chinese typewriter, part 2

On June 30, 2009, I wrote a post entitled "Chinese Typewriter". It's time now to do an update, because on March 9, 2011, I travelled to the University of Kansas to deliver the Wallace Johnson Memorial Lecture. So what do Wallace Johnson and the University of Kansas have to do with Chinese typewriters? It's simply that Wallace Johnson is the only Westerner I know who became proficient in the use of the kind of Chinese typewriter I wrote about in my 2009 post, and he happened to teach Chinese history at the University of Kansas from 1965 to 2007. I knew Wally Johnson because of his interest in Tang period law and because he received his Ph.D. from the University of Pennsylvania under Derk Bodde, who was a good friend of mine.

Not sacrificing anything to prevent anything…not

From a Livescience.com article (about a police chief who recommends keystroke-logging your kids to obtain their passwords so you can find out where they go online) comes this disastrous tangle of a sentence, which will take hours of police time to clear up:

"When it comes down to safety and welfare of your child, I don’t think any parent would sacrifice anything to make sure nothing happens to their children," said Batelli, the father of a teenage daughter.

Could Watson parse a snowclone?

Today on The Atlantic I break down Watson's big win over the humans in the Jeopardy!/IBM challenge. (See previous Language Log coverage here and here.) I was particularly struck by the snowclone that Ken Jennings left on his Final Jeopardy response card last night: "I, for one, welcome our new computer overlords." I use that offhand comment as a jumping-off point to dismantle some of the hype about Watson's purported ability to "understand" natural language.

New search service for language resources

It has just become a whole lot easier to search the world's language archives.  The new OLAC Language Resource Catalog contains descriptions of over 100,000 language resources from over 40 language archives worldwide.

This catalog, developed by the Open Language Archives Community (OLAC), provides access to a wealth of information about thousands of languages, including details of text collections, audio recordings, dictionaries, and software, sourced from dozens of digital and traditional archives.

OLAC is an international partnership of institutions and individuals who are creating a worldwide virtual library of language resources by: (i) developing consensus on best current practice for the digital archiving of language resources, and (ii) developing a network of interoperating repositories and services for housing and accessing such resources.  The OLAC Language Resource Catalog was developed by staff at the Linguistic Data Consortium, the University of Pennsylvania Libraries, the Graduate Institute of Applied Linguistics, and the University of Melbourne.  The primary sponsor is the National Science Foundation.

On "culturomics" and "ngrams"

I'm still mulling over the blockbuster "culturomics" paper published in Science last week and ably addressed here by Geoff Nunberg and Mark Liberman. I'll have more to say about aspects of the paper having to do with the size of the English lexicon, but in the meantime let me direct you to my latest Word Routes column on the Visual Thesaurus, which takes up the more superficial question of nomenclature: both culturomics and ngram (as in the Ngram Viewer) are less than transparent to non-specialists (and even trouble some specialists). An excerpt follows below.

Comprehend this!

Perhaps the most illiterate phishing spam yet: ignoring the incompetence of having Velez Restrepo as the sender, jg_van88 (at a Chinese address) as the reply-to, and Mr(.) John Galvan as the alleged sender, with the X-Accept-Language set to Spanish, this message has at least 20 linguistic errors in the text, which is roughly one for each four words.

From gvelez@une.net.co
Wed Dec 15 11:11:57 2010
Date: Wed, 15 Dec 2010 03:11:43 -0800
From: velez restrepo guillermo <gvelez@une.net.co>
Subject: Comprehend This Proposal
Bcc:
Reply-to: jg_van88@w.cn
X-Mailer: Sun Java(tm) System Messenger Express 7.3-11.01 64bit (built Sep 1 2009)
X-Accept-Language: es
Priority: normal

Good day,

I am Mr John Galvan a staff of a private offshore AIG Private bank united kingdom.

I have a great proposal that we interest and benefit you, this proposal of mine is worth of £15,500,000.00 Million Pounds.I intend to give Four thy Percent of the total funds as compensation for your assistance. I will notify you on the full transaction on receipt of your response if interested, and I shall send you the details.

Kind Regards,
Mr. John Galvan

Enforced francophony from Microsoft

Microsoft Word has really done it to me this time. I need some expert help, Language Log readers. I have a perfectly ordinary file (a simple letter template showing my home address), created in Word on an American Macintosh Powerbook using an American-purchased copy of Word, and when I open it as a copy on my UK-purchased MacBook Pro (though not when I open it as the original) almost everything works except that the file is deranged, and thinks it is supposed to be in French.

Editing the file provokes enforcement of French spacing conventions (colons and semicolons are preceded by an extra inserted space that I do not type); the double quotation symbols (“like this”) appear as those funny French marks that look a bit like pairs of less-thans and greater-thans (sort of <<like this>>); and, weirdest of all, the spelling and checking of "grammaire et style" turn into French. Word works through the file checking every significant English word and rejecting it for insufficient francophonicity (with no suggestions for respelling), underlining them all in red, though most French words are accepted. The grammar check not only assumes that French is being checked but also reports its results and queries in French. Saving the file preserves the pseudo-Frenchness.

The protective bloom of ignorance

I have often stressed the point to my students: it is not your ignorance that interferes with your education in this subject; it's the very opposite. It's the fact that you are a highly intelligent human being and you know many things deeply and thoroughly that can prevent your learning. Of the things I teach, it is in phonetics that this comes out most vividly: the reason you can't learn to hear and produce the difference between Hindi dental [t] and retroflex [ʈ], I tell them, is not that you are no good at this practical phonetics stuff, but that you have had twenty years of training in ignoring this contrast (so as to become an expert speaker of English or some other language), and you have done brilliantly at it. Well, there was an echo of the same line that popped up today in some news about the phishing industry. Dr Emily Finch, a University of Surrey criminologist, said:

The general public is more internet security-aware than it was five years ago. Malicious anti-virus scams are an indication that criminals are now tapping into this.

Rather than exploiting our ignorance – the basic premise of common scams such as phishing – they are actively using our knowledge and fear of online threats to their advantage.

Is "Character Amnesia" Here to Stay?

A little over a month ago, I wrote a blog post about what I called "Character Amnesia." Today, half a dozen readers have called my attention to an Aug. 25th article by Judith Evans for Agence France-Presse entitled "Wired youth forget how to write in China and Japan" (and other titles) that refers to "character amnesia" and quotes from an interview with me on August 9. The article is also being sent around on Facebook and other sharing services, so it is getting a lot of coverage. I cannot guarantee that I coined the expression "character amnesia," but it does seem to be meeting a need.

وزارة-الأتصالات.مصر leads the non-Latin charge

The first Internet domain names using non-Latin characters are being rolled out, a plan put into motion after approval from the Internet Corporation for Assigned Names and Numbers (ICANN). Arabic-speaking nations are the first to reap the orthographic benefits, with new country codes available for Egypt (مصر), Saudi Arabia (السعودية), and the United Arab Emirates (امارات). The Egyptian Ministry of Communications and Information Technology, previously online at <http://www.mcit.gov.eg/>, is blazing the trail with its new URL:

<وزارة-الأتصالات.مصر>

Not everything is fully worked out with the new system, though. Browsers that aren't caught up to speed on the non-Latin domain names will see the addresses rendered as Latinized gobbledygook. The Egyptian Communication Ministry's Arabic-script URL, for instance, currently resolves to <http://xn----rmckbbajlc6dj7bxne2c.xn--wgbh1c/>. That's not very communicative.
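
The Latinized form is the Punycode-based ASCII encoding that the IDNA standard defines for non-Latin domain labels. As a minimal sketch, assuming Python's built-in `idna` codec (which follows the older IDNA 2003 rules, so it may not match what every registry or browser does), the conversion can be reproduced roughly like this. The helper names `to_ascii` and `to_unicode` are made up for illustration.

```python
def to_ascii(domain: str) -> str:
    """Convert a Unicode domain name to its ASCII (Punycode) form."""
    # Assumption: the built-in "idna" codec is good enough here. It splits
    # the name on dots and prefixes each non-ASCII label with "xn--".
    return domain.encode("idna").decode("ascii")


def to_unicode(ascii_domain: str) -> str:
    """Reverse the conversion, recovering the Unicode labels."""
    return ascii_domain.encode("ascii").decode("idna")


arabic = "وزارة-الأتصالات.مصر"
encoded = to_ascii(arabic)

print(encoded)                        # the ASCII form a non-IDN-aware browser shows
print(to_unicode(encoded) == arabic)  # the encoding is reversible
```

The first label comes out with a run of four hyphens because the Arabic label itself contains a hyphen: the `xn--` prefix is followed by that literal hyphen and then by Punycode's own delimiter hyphen.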

[Update: See the very helpful comments below for an explanation of the Latinized encoding.]

Beowulf Burlington forever

Six of us (three philosophers, two linguists, and a mathematician) were having dinner at the Café Noir in Providence last Thursday night, and when three of us decided on the excellent boeuf bourguignon, someone at the table told a story of a colleague who tried to include the phrase boeuf bourguignon in a word-processed file and found that the spell-checker recommended correcting the spelling to Beowulf Burlington.
