Archive for Language and technology

Don't send me passwords

Keith Allan has bravely outed himself as editor of the journal from which I recently received a thoroughly discourteous message sequence. I thank him for responding to the discussion, and for confirming that it was not about him pressing the buttons in the wrong order. The reason his fine journal (the Australian Journal of Linguistics) sent me a message sequence I found annoying and presumptuous is the design of the stupid ScholarOne Manuscript software. Let me explain a little more about the nature of my life (perhaps my experiences will find an echo in yours), the part that involves those arbitrary strings of letters and digits we are all supposed to carry around in our heads like mental sets of keys.

Read the rest of this entry »


Stupid message sequencing discourtesy

Picture this: you receive two unexpected emails from me in quick succession. The first is a boilerplate pre-packaged message informing you that I have entered your address on my website as my temporary address for two or three days later this month, and I have let my employers know that people can call me or fax me at your house. I'm a complete stranger to you, except that you know my name from Language Log; I have obtained your email address from public sources, and pre-emptively set up arrangements that assume I'll be staying with you.

The second of the two emails is personally addressed, and says that I'll be in your area later this month to give a lecture, and since I'm on a tight budget, would it be all right if I came to stay for two nights?

I take it you'd be somewhere between insulted and shocked, despite the fact that it is sort of flattering that a famous Language Log writer has singled you out as a person he would like to stay with. Well, the equivalent not only happened to me today; it happens to me every couple of months.

Read the rest of this entry »


The sliced raw fish shoes it wishes

The crash-blossom-y headline that Geoff Pullum just posted about, "Google's Computer Might Betters Translation Tool," has been changed in the online edition of The New York Times to something more sensible: "Google’s Computing Power Refines Translation Tool." The headline in the print edition, says LexisNexis, is "Google Can Now Say No to 'Raw Fish Shoes,' in 52 Languages." This is a typical example of the gap between oblique print headlines and their more straightforward online equivalents designed with search engines in mind. (See the April 2006 Times article, "This Boring Headline Is Written for Google.")

Read the rest of this entry »


So many languages, so much technology…

Suppose you had 100 digital recorders and 800 small languages, all in a country the size of California, but in one of the remotest parts of the planet.  What would you do?  What would it take to identify and train a small army of language workers?  How could the recordings they collect be accessible to people who don't speak the language?  My answer to this question is linked below – but spend a moment thinking how you might do this before looking.  One inspiration for this work was Mark Liberman's talk The problems of scale in language documentation at the Texas Linguistics Society meeting in 2006, in a workshop on Computational Linguistics for Less-Studied Languages.  Another inspiration was observing the enthusiasm of the remaining speakers of the Usarufa language to maintain their language (see this earlier post).  About 9 months ago, I decided to ask Olympus if they would give me 100 of their latest model digital voice recorders.  They did, and the BOLD:PNG Project starts next week.  Please sign the guestbook on that site, or post a comment here, if you'd like to encourage the speakers of these languages who are getting involved in this new project.


Sarcasm punctuation mark sure to succeed:-!

Via John Gruber at Daring Fireball, I've learned that a company called Sarcasm, Inc., is marketing a "Sarcasm punctuation mark" called SarcMark, which people are supposed to use to "emphasize a sarcastic phrase, sentence or message". John Gruber's pitch-perfect assessment:

What a great idea. I'm sure it'll be a huge hit.

Read the rest of this entry »


Jingle bells, pedophile

Top story of the morning in the UK for the serious language scientist must surely be the report in The Sun concerning a children's toy mouse that is supposed to sing "Jingle bells, jingle bells" but instead sings "Pedophile, pedophile". Said one appalled mother who squeezed the mouse, "Luckily my children are too young to understand." The distributors, a company called Humatt, of Ferndown in Dorset, claim that the man in China who recorded the voice for the toy "could not pronounce certain sounds." And the singing that he recorded "was then speeded up to make it higher-pitched — distorting the result further." (A good MP3 of the result can be found here.) They have recalled the toy.

Shocked listeners to BBC Radio 4 this morning heard the presenters read this story out while collapsing with laughter. Language Log is not amused. If there was ever a more serious confluence of issues in speech technology, the Chinese language, freedom of speech, taboo language, and the protection of children, I don't know when.

Read the rest of this entry »


Happy Web Day!

In my latest Word Routes column on the Visual Thesaurus, I consider the enormous linguistic impact of an internal memorandum published at the European Organization for Nuclear Research (CERN) on November 12, 1990. The memo, by Tim Berners-Lee and Robert Cailliau, was entitled "WorldWideWeb: Proposal for a HyperText Project," and needless to say, we've all been webified ever since. Read all about it here.


Google Demotes Literary Stars

My post about Google's metadata problems, along with a similar piece in the Chronicle of Higher Education, got a lot of people talking about the problem in the press and the blogs. (I even ran into an allusion to it in a La Repubblica piece on the Google Book Settlement when I arrived in Rome yesterday morning.) A number of people passed along their own experiences with flaky metadata. Others criticized me on grounds that could be broadly summed up as "Don't look a gift horse in the server," "It's better than nothing," "Who needs metadata anyway?," "Just give them time," and "Why concentrate on trivialities like metadata while ignoring the real perils of corporate monopoly" (as in "serving as a consultant for monitoring the proper temperatures of the pitchforks in hell").

This is all to the good, if it helps move up the metadata issues in Google's queue. I do think this will get a lot better as Google puts its considerable mind to it. But there was one other aspect of the metadata problem which I hadn't noticed or even thought about, but which in its own small way was the unkindest cut of all. It was noticed by the children's book author Ace Bauer, who was prompted by my account of the metadata problems to check his Google Books listing:

Turns out my review rating ranked only one star out of 5. That's dim. But see, the review upon which they based this ranking was Kirkus's. Kirkus loved the book. They gave it a star. One star. That's all they give folks. It's considered a major honor.

Indeed it is, and actually the falling-star glitch affects a number of writers, for example Roy Blount, Jr., the president of the Authors Guild, who has been an enthusiastic backer of the settlement. Google Books assigns a one-out-of-five star rating to at least two of Blount's books, Crackers and First Hubby, on the basis of their starred Kirkus reviews, and visits similar review rating downgrades on books by Guild vice-president Judy Blume and Guild board members Nick Lemann, James Gleick, and Oscar Hijuelos, among others.

I don't know exactly what the Google people will say when they cotton to this one, but it's a good guess the first sentence will begin with "oy."

Read the rest of this entry »


NLTK Book on Sale Now

The NLTK book, Natural Language Processing with Python, went on sale yesterday.

"This book is here to help you get your job done." I love that line (from the preface). It captures the spirit of the book. Right from the start, readers/users get to do advanced things with large corpora, including information-rich visualizations and sophisticated theory implementation. If you've started to see that your research would benefit from some computational power, but you have limited (or no) programming experience, don't despair — install NLTK and its data sets (it's a snap), then work through this book.

Read the rest of this entry »


Chinese Typewriter

This (the machine invented by the famous Chinese author, Lin Yutang, and described on the first page [first four paragraphs] of the Wikipedia article here) is probably the closest the Chinese ever got to decomposing their script into an "alphabet" consisting of "letters" (recurrent graphemic elements that can be combined in a principled way to form all of the characters / morphemes in their writing system).  You'll note that it didn't really work during their presentation to the Remington Typewriter Company executives.  The press conference demonstration they had the next day was probably of the carefully rehearsed, staged, orchestrated sort designers of Chinese information processing / technology software and hardware often present (the kind documented by Li-ching Chang in her film made at a vocational high school in Beijing), not one prepared to respond spontaneously to tasks posed by the audience.  Judging from my own experience with Chinese software and information processing / technology developers over more than a quarter of a century, this may have been what went wrong when Lin presented his typewriter to the Remington executives:  they asked him (or his operator) to type something impromptu.  Incidentally, the development of this fatally flawed typing machine left Lin — whose books were bestsellers in America — bankrupt.

Read the rest of this entry »


Experiencing language death

Usarufa is a language of Papua New Guinea with just 1200 speakers (ISO-639 code "usa").  There are no fluent speakers under the age of 25, so the language must be considered moribund.  Before posting recordings of this language online, I needed to get informed consent, so I introduced some speakers to the World Wide Web.  We poked around for a while, finding useful sites about insecticides for dealing with the taro beetle.  Then we turned our attention to audio.

I played them a recording of the "last words" of the Jiwarli language of Western Australia.  After some questioning looks I explained that this language is now dead, and we were listening to its last speaker before he died.  As one they all looked down, shaking their heads in disbelief and saying sorry, sorry, sorry….  It was as if I had told them a mutual friend had died.  They urged me to put that recording on a cassette tape so they could take it back to their village.  That way, everyone would surely understand what will happen to the Usarufa language unless there are serious attempts to revitalize it.

I wasn't prepared for the intensity of their response.  Now I'm wondering if a collection of such recordings might be a useful tool in promoting language revitalization, and also in explaining the concept of language archiving.  (Thanks to Ima'o Ta'asata, James Warebu, Sivini Ikilele, and Waks Mark for their dedication to the preservation of Usarufa oral culture, and to Aaron Willems and SIL-PNG for facilitating this work.)


Rhymes with "black" and sounds like "Alabama"

You'd think it was the end of the world. Apparently, the Nuance Communications-powered text-to-speech system on the new Amazon Kindle mispronounces Barack Obama's name, saying something like "buh-RACK oh-BAM-uh" instead of "buh-ROCK oh-BAH-muh". Why is this little tidbit worth a piece in the business/media section of The New York Times? The answer is, it's not. It could have been an OK lead-in to a technology piece about how text-to-speech systems work, and how they can fail — often spectacularly — on unknown words, especially names. Granted, adding the (pronunciation of the) name of a political figure such as Barack Obama to the system's dictionary is a simple enough thing to do (which is how Nuance will in fact fix the problem, if it hasn't already), and it was clearly an oversight worth pointing out to the company. But then again, the version of Firefox I'm using right now (3.0.4 for the Mac) has been underlining both of the President's names in what I have been typing thus far, incorrectly guessing that I'm misspelling something, and I'll bet you won't see some NYT reporter wasting their time on such a triviality.
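For the curious, the dictionary fix mentioned above can be sketched in a few lines of Python. This is only a toy illustration, not how Nuance's system actually works: the function names, the respellings, and the "?" marker for guessed pronunciations are all made up for the example, and the letter-to-sound step is a stand-in that does no real phonetic work.

```python
# Toy model of a text-to-speech front end: look a word up in an
# exception lexicon first, and fall back to a guess only when it is missing.
# All names and respellings here are hypothetical, for illustration only.

def letter_to_sound_guess(word):
    """Stand-in for real letter-to-sound rules: returns the lowercased
    spelling with a '?' appended to flag that this is only a guess."""
    return word.lower() + "?"

def add_entry(lexicon, word, respelling):
    """Register a hand-written pronunciation for a word (case-insensitive)."""
    lexicon[word.lower()] = respelling

def pronounce(word, lexicon):
    """Prefer the lexicon entry; otherwise fall back to the guess."""
    return lexicon.get(word.lower(), letter_to_sound_guess(word))

lexicon = {}
# Before the fix, the name is unknown and the system can only guess.
print(pronounce("Barack", lexicon))

# The fix: add the name to the dictionary.
add_entry(lexicon, "Barack", "buh-ROCK")
add_entry(lexicon, "Obama", "oh-BAH-muh")
print(" ".join(pronounce(w, lexicon) for w in ["Barack", "Obama"]))
```

The point of the sketch is simply that an unknown name falls through to the guessing step, where things can go wrong, while a one-line dictionary entry bypasses the guess entirely.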

Read the rest of this entry »


A Limitation on Names in the PRC

Anyone who looked at the front page of the New York Times today probably noticed the article by Sharon LaFraniere entitled "Your Name's Not on Our List?  Change It, Beijing Officials Say." Featured in the article is a young woman named Ma Cheng, whose surname Ma is written with the character for "horse" and whose given name Cheng is written with a very rare character composed of three horses lined up closely in a row:  馬馬馬 (the latter character is exceedingly difficult to write in a small square exactly the same size as the space allotted to one horse [and to all other characters, even if they have as many as 64 strokes]!).  The article states that this character pronounced Cheng is not to be found among the 32,252 characters in the Chinese government's computer systems, so Ms. Ma has been told peremptorily that she must change her name.

Read the rest of this entry »
