Archive for Computational linguistics

You can help improve ASR

If you're a native speaker of English, and you have about an hour to spare, and the title of this post (or a small promised gift) convinces you to devote your spare hour to helping researchers improve automatic speech recognition, just pick one of these four links at random and follow the instructions: 1, 2, 3, 4.

[Update — the problem with the tests has been fixed — but more than 1,000 people have participated, and the server is saturated, so unless you've already started the experiment, please hold off for now!]

If you'd like a fuller explanation, read on.

Read the rest of this entry »

Comments (28)

Jeopardizing Valentine's Day

I've stolen the title of this post from the subject line of a message from Hal Daumé, who has invited folks at University of Maryland to a huge Jeopardy-watching party he's organizing tonight. Today is February 14, so for at least some of the audience, Jeopardy might indeed jeopardize Valentine's Day, substituting geeky fun (I use the term fondly) for candle-lit dinners.

In case you hadn't heard, the reason for the excitement, pizza parties, and so forth is that tonight's episode will, for the first time, feature a computer competing against human players — and not just any human players, but the two best known Jeopardy champions. This is stirring up a new round of popular discussion about artificial intelligence, as Mark noted a few days ago. Many in the media — not to mention IBM, whose computer is doing the playing — are happy to play up the "smartest machine on earth", dawn-of-a-new-age angle. Though, to be fair, David Ferrucci, the IBMer who came up with the idea of building a Jeopardy-playing computer and led the project, does point out quite responsibly that this is only one step on the way to true natural language understanding by machine (e.g. at one point in this promotional video).

Regardless of how the game turns out, it's true that tonight will be a great achievement for language technology. Though I would also argue that the achievement is as much in the choice of problem as in the technology itself.

Read the rest of this entry »

Comments (36)

Language and intelligence

Two interesting articles on linguistic aspects of artificial intelligence have recently appeared in the popular press.

The first one is by Richard Powers ("What is Artificial Intelligence?", NYT 2/6/2011):

IN the category “What Do You Know?”, for $1 million: This four-year-old upstart the size of a small R.V. has digested 200 million pages of data about everything in existence and it means to give a couple of the world’s quickest humans a run for their money at their own game.

The question: What is Watson?

I.B.M.’s groundbreaking question-answering system, running on roughly 2,500 parallel processor cores, each able to perform up to 33 billion operations a second, is playing a pair of “Jeopardy!” matches against the show’s top two living players, to be aired on Feb. 14, 15 and 16.

Read the rest of this entry »

Comments (18)

Four revolutions

This started out to be a short report on some cool, socially relevant crowdsourcing for Egyptian Arabic. Somehow it morphed into a set of musings about the (near-) future of natural language processing…

A statistical revolution in natural language processing (henceforth NLP) took place in the late 1980s up to the mid 90s or so. Knowledge based methods of the previous several decades were overtaken by data-driven statistical techniques, thanks to increases in computing power, better availability of data, and, perhaps most of all, the (largely DARPA-imposed) re-introduction of the natural language processing community to their colleagues doing speech recognition and machine learning.

There was another revolution that took place around the same time, though. When I started out in NLP, the big dream for language technology was centered on human-computer interaction: we'd be able to speak to our machines, in order to ask them questions and tell them what we wanted them to do. (My first job out of college involved a project where the goal was to take natural language queries, turn them into SQL, and pull the answers out of databases.) This idea has retained its appeal for some people, e.g., Bill Gates, but in the mid 1990s something truly changed the landscape, pushing that particular dream into the background: the Web made text important again. If the statistical revolution was about the methods, the Internet revolution was about the needs. All of a sudden there was a world of information out there, and we needed ways to locate relevant Web pages, to summarize, to translate, to ask questions and pinpoint the answers.
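That early NL-query-to-SQL dream can be illustrated with a toy sketch — the schema, data, and single hand-written pattern below are invented for illustration, not taken from the project mentioned above:

```python
import re
import sqlite3

# A toy in-memory database standing in for the kind of backend such
# systems queried. (Schema and rows are invented.)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, dept TEXT)")
conn.executemany("INSERT INTO employees VALUES (?, ?)",
                 [("Ada", "sales"), ("Ben", "sales"), ("Cy", "research")])

# One hand-written pattern mapping a natural-language question
# onto a SQL template.
PATTERN = re.compile(r"how many employees (?:are |work )?in (\w+)\??", re.I)

def answer(question):
    """Translate a matching question into SQL and return the count."""
    m = PATTERN.match(question.strip())
    if m is None:
        return None
    sql = "SELECT COUNT(*) FROM employees WHERE dept = ?"
    return conn.execute(sql, (m.group(1).lower(),)).fetchone()[0]

print(answer("How many employees are in sales?"))  # -> 2
```

Real systems of that era used full parsers and semantic grammars rather than one regex, but the pipeline — parse the question, instantiate a query, execute it — is the same shape.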

Fifteen years or so later, the next revolution is already well underway.

Read the rest of this entry »

Comments (9)

The case of the missing spamularity

A recent diary post by Charlie Stross ("It's made out of meat", 12/22/2010) poses a striking paradox. Or rather, he makes a prediction about a process whose trajectory, as so far observable, seems paradoxical to me.

Read the rest of this entry »

Comments (35)

Word lens

Competing with Culturomics for meme room today is Word Lens, which has a great YouTube ad:

Read the rest of this entry »

Comments (26)

Humanities research with the Google Books corpus

In Science yesterday, there was an article called "Quantitative analysis of culture using millions of digitized books" [subscription required] by at least twelve authors (eleven individuals, plus "the Google Books team"), which reports on some exercises in quantitative research performed on what is by far the largest corpus ever assembled for humanities and social science research. Culled from the Google Books collection, it contains more than 5 million books published between 1800 and 2000 — at a rough estimate, 4 percent of all the books ever published — of which two-thirds are in English and the others distributed among French, German, Spanish, Chinese, Russian, and Hebrew. (The English corpus alone contains some 360 billion words, dwarfing better-structured data collections like the corpora of historical and contemporary American English at BYU, which top out at a paltry 400 million words each.)
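The core computation behind this kind of "culturomics" — tracking a word's relative frequency over time in a dated corpus — can be sketched in miniature. The three-line corpus below is invented; the real analyses run over Google's released n-gram counts:

```python
from collections import defaultdict

# A tiny invented stand-in for a dated corpus: (year, text) pairs.
corpus = [
    (1900, "the telegraph and the telephone"),
    (1950, "the television and the telephone and the radio"),
    (2000, "the internet and the telephone"),
]

def relative_frequency(word):
    """Per-year frequency of `word` as a fraction of all tokens that year."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for year, text in corpus:
        tokens = text.lower().split()
        totals[year] += len(tokens)
        hits[year] += tokens.count(word)
    return {year: hits[year] / totals[year] for year in sorted(totals)}

freqs = relative_frequency("telephone")
```

Normalizing by the yearly token total, rather than plotting raw counts, is what makes trajectories comparable across years in which very different numbers of books were published.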

I have an article on the project appearing in today's Chronicle of Higher Education, which I'll link to here, and in later posts Ben or Mark will probably be addressing some of the particular studies, like the estimates of English vocabulary size, as well as the wider implications of the enterprise. For now, some highlights:

Read the rest of this entry »

Comments (58)

"Utterly noxious retail" as Search Engine Optimization

David Segal, "A bully finds a pulpit on the web", NYT 11/26/2010:

Today, when reading the dozens of comments [at getsatisfaction.com] about DecorMyEyes, it is hard to decide which one conveys the most outrage. It is easy, though, to choose the most outrageous. It was written by Mr. Russo/Bolds/Borker himself.

“Hello, My name is Stanley with DecorMyEyes.com,” the post began. “I just wanted to let you guys know that the more replies you people post, the more business and the more hits and sales I get. My goal is NEGATIVE advertisement.”

Read the rest of this entry »

Comments (8)

The robot army

Randall Stross, "When the Software Is the Sportswriter", NYT 11/27/2010:

ONLY human writers can distill a heap of sports statistics into a compelling story. Or so we human writers like to think.

StatSheet, a Durham, N.C., company that serves up sports statistics in monster-size portions, thinks otherwise. The company, with nine employees, is working to endow software with the ability to turn game statistics into articles about college basketball games.

Read the rest of this entry »

Comments (18)

Speech-based quantification of Parkinson's Disease

Earlier this year, I discussed an interesting paper from a poster session at ICASSP 2010 ("Clinical applications of speech technology", 3/18/2010), which used an automated evaluation of dysphonia measures in short speech samples to match clinicians' evaluations of Parkinson's Disease severity.

That work, extended and improved, has been published as Athanasios Tsanas et al., "Nonlinear speech analysis algorithms mapped to a standard metric achieve clinically useful quantification of average Parkinson's disease symptom severity", J. Roy. Soc. Interface, 11/17/2010.
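The underlying idea — mapping acoustic dysphonia measures onto a clinical severity scale — can be sketched as a simple regression. The (feature, score) pairs and the one-feature linear model below are invented for illustration; the paper uses nonlinear methods over many dysphonia measures:

```python
# Invented (jitter_measure, clinician_severity_score) pairs standing in
# for real dysphonia data paired with clinical ratings.
data = [(0.2, 10.0), (0.4, 18.0), (0.6, 26.0), (0.8, 34.0)]

def fit_line(pairs):
    """Ordinary least squares for y = a*x + b on (x, y) pairs."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    var = sum((x - mx) ** 2 for x, _ in pairs)
    a = cov / var
    return a, my - a * mx

a, b = fit_line(data)
# Predicted severity for a new speech sample with jitter 0.5.
predicted = a * 0.5 + b
```

The point of the exercise is that once such a mapping is fitted against clinicians' ratings, a severity estimate can be produced from a short speech recording alone — which is what makes remote, frequent monitoring plausible.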

Read the rest of this entry »

Comments (6)

ASR Elevator

This is funny, though unfair:

Read the rest of this entry »

Comments (39)

Statistical MT – with meter and rhyme

I promised in an earlier post to report on some of the many interesting presentations here at InterSpeech 2010. But various other obligations and opportunities have cut into my blogging time, and so for now, I'll just point you to the slides for my own presentation here: Jiahong Yuan and Mark Liberman, "F0 Declination in English and Mandarin Broadcast News Speech".

I still hope to blog about some of the other interesting things I've learned here, but it's already time for me to head out on the next leg of my journey. Worse, I've already got a list of things to blog about from the next conference where I'm co-author on a presentation, EMNLP 2010 — which hasn't even started yet. At the top of that list is Dmitriy Genzel, Jakob Uszkoreit and Franz Och, "'Poetic' Statistical Machine Translation: Rhyme and Meter".

Read the rest of this entry »

Comments (6)

Sproat asks the question

The "Last Words" segment of the latest issue of Computational Linguistics is by Richard Sproat: "Ancient Symbols, Computational Linguistics, and the Reviewing Practices of the General Science Journals". Richard reviews and extends the analysis (partly contributed by him and by Cosma Shalizi) in "Conditional entropy and the Indus Script" (4/26/2009) and "Pictish writing?" (4/2/2010), and poses the question that I was too polite to ask:

How is it that papers that are so trivially and demonstrably wrong get published in journals such as Science or the Proceedings of the Royal Society?

Read the rest of this entry »

Comments (24)