Language Log

Jeopardizing Valentine's Day

February 14, 2011 @ 12:09 pm · Filed by Philip Resnik under Computational linguistics, Language and culture

I've stolen the title of this post from the subject line of a message from Hal Daumé, who has invited folks at University of Maryland to a huge Jeopardy-watching party he's organizing tonight. Today is February 14, so for at least some of the audience, Jeopardy might indeed jeopardize Valentine's Day, substituting geeky fun (I use the term fondly) for candle-lit dinners.

In case you hadn't heard, the reason for the excitement, pizza parties, and so forth is that tonight's episode will, for the first time, feature a computer competing against human players — and not just any human players, but the two best known Jeopardy champions. This is stirring up a new round of popular discussion about artificial intelligence, as Mark noted a few days ago. Many in the media — not to mention IBM, whose computer is doing the playing — are happy to play up the "smartest machine on earth", dawn-of-a-new-age angle. Though, to be fair, David Ferrucci, the IBMer who came up with the idea of building a Jeopardy-playing computer and led the project, does point out quite responsibly that this is only one step on the way to true natural language understanding by machine (e.g. at one point in this promotional video).

Regardless of how the game turns out, it's true that tonight will be a great achievement for language technology. Though I would also argue that the achievement is as much in the choice of problem as in the technology itself.

First, a little background. Watson, named after IBM's founder, is a question answering system. That may seem obvious, since it's a system that answers questions, but question answering (QA) is also a quasi-technical term that refers to a decades-old research area. No historical discussion of natural language processing is complete without Bill Woods's LUNAR, a 1960s system that answered questions about moon rocks brought back on the Apollo missions. More recently, over the last ten or fifteen years or so, QA systems have been a central focus of research for many in the natural language processing community, thanks in part to government research funding initiatives in the U.S.

As is common for such initiatives, research teams have typically participated in regular "bake-offs" — that is, community-wide shared task evaluations — in which systems are given the same test data and evaluated using some agreed upon method for judging answer quality. Interestingly, figuring out how to evaluate the systems is often half (or more!) of the problem you're trying to solve. Machine translation is famous (notorious?) for giving rise to as much research into evaluation as into the problem itself, and question answering is similar in some important ways. How do you measure if the output of a system constitutes a valid translation of the input? How do you decide if an assertion constitutes a correct answer to a question? (Yes, Jeopardy switches things around: it gives you "clues" in the form of answers, and players have to respond in the form of a question. I'm ignoring that cute little gimmick for purposes of this discussion.)

Like the MT community, the QA community has utilized both automatic evaluations (comparing system output against "ground truth" provided by human experts) and evaluations in which human assessors judge the quality of system outputs. The formal details of those evaluations can get pretty complex (try explaining the brevity penalty in the widely used BLEU MT evaluation metric to a non-specialist), to the point where understanding them can require as much expertise as understanding the research itself. That does not make for a compelling narrative about the advance of the technology.
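For readers curious about the brevity penalty mentioned above, here is a minimal sketch of that one piece of BLEU, not a full BLEU implementation. It follows the standard definition: with candidate length c and reference length r, the penalty is 1 when c > r and exp(1 - r/c) otherwise. The function name and the zero-length guard are choices made for this illustration.

```python
import math


def brevity_penalty(candidate_len: int, reference_len: int) -> float:
    """Return the BLEU brevity penalty for a candidate of the given length.

    A candidate longer than the reference is not penalized (1.0);
    a shorter or equal-length one is scaled by exp(1 - r/c).
    """
    if candidate_len <= 0:
        # Guard added for this sketch: an empty candidate scores zero.
        return 0.0
    if candidate_len > reference_len:
        return 1.0
    return math.exp(1.0 - reference_len / candidate_len)


if __name__ == "__main__":
    # A translation half as long as its reference is penalized heavily.
    print(round(brevity_penalty(5, 10), 3))  # 0.368
    print(brevity_penalty(12, 10))           # 1.0
```

The point of the penalty is that n-gram precision alone would reward a system for emitting only the few words it is sure of; the penalty takes that incentive away.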

Viewing Watson in this context, I would have to say that, despite its genuine technical advances (of which more below), I think the true stroke of genius behind the technology is the idea of playing Jeopardy in the first place. In 1996-1997, IBM's Deep Blue challenged and ultimately beat the reigning world chess champion. The evaluation was clear: you didn't need to understand chess to understand what it meant for a machine to beat the world's best human chess player. Now, once again, IBM has found a way to demonstrate technological progress in an easily comprehended way that captures the popular imagination.

It was a great choice in terms of technological foundations, too. Jeopardy's clues are similar to the questions asked of widely studied "factoid" question answering systems: generally a single who, what, where, or when, not a why or how, and not a complex multi-part query. Most Jeopardy clues provide you with a relatively fine-grained semantic category for the sought-after answer; e.g. World's largest lake, nearly 5 times as big as Superior. (That question happens to come from the premiere episode of the show's current incarnation, on Monday, September 10, 1984. Who knew you could find a comprehensive archive of previous Jeopardy questions and answers?) Finally, the game's discourse consists of a regimented protocol, not an interactive dialogue, so although natural language processing is certainly required, there is no need for Watson to launch itself down the slippery slope of natural language interaction.

That's not to say I don't think the IBM team hasn't made some very impressive technical advances. For an understandable overview of how the system works, watch this video of David Ferrucci giving a brief introduction to the project. Although many of the high level steps are what you would expect in a QA system (identify the semantic type of the desired answer, e.g. lake in the example above; blast out queries in parallel to find a large number of candidate answers; filter those potential answers down to a manageable number in order to analyze them more deeply), three things seem particularly worth noting.
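The generic QA steps listed in the parenthesis above (identify the answer type, generate candidates, filter them down) can be sketched in a few lines. This is a toy illustration only and says nothing about how Watson is actually built: the three-entry knowledge base, the type lexicon, the word-overlap scoring, and all function names are assumptions invented for the example.

```python
import re

# Hypothetical toy knowledge base: (entity, semantic type, supporting text).
KNOWLEDGE = [
    ("Caspian Sea", "lake", "world's largest lake by area, far bigger than Superior"),
    ("Lake Superior", "lake", "largest of the Great Lakes of North America"),
    ("Nile", "river", "long river flowing north through Egypt"),
]
# Hypothetical lexicon of answer types the toy system recognizes.
ANSWER_TYPES = {"lake", "river"}


def tokens(text):
    """Lowercase the text and return its set of alphabetic word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def answer_type(clue):
    """Step 1: pick the semantic type named in the clue, if any."""
    found = tokens(clue) & ANSWER_TYPES
    return next(iter(sorted(found)), None)


def candidates(clue):
    """Step 2: gather every entry whose type matches (all entries if no type)."""
    wanted = answer_type(clue)
    return [(e, t, s) for e, t, s in KNOWLEDGE if wanted is None or t == wanted]


def rank(clue, top_k=2):
    """Step 3: score candidates by word overlap with the clue, keep the top k."""
    clue_tokens = tokens(clue)
    scored = [(len(clue_tokens & tokens(s)), e) for e, _, s in candidates(clue)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [entity for _, entity in scored[:top_k]]


if __name__ == "__main__":
    clue = "World's largest lake, nearly 5 times as big as Superior"
    print(answer_type(clue))  # lake
    print(rank(clue))         # ['Caspian Sea', 'Lake Superior']
```

A real system would replace each step with something far heavier (parsing, many parallel searches, learned scoring), but the shape of the pipeline is the same.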

First, the system omnivorously combines multiple forms of knowledge, including structured (like the WordNet lexical database), semi-structured (like Wikipedia infoboxes), and unstructured (lots and lots of text on a zillion topics), and it uses a whole panoply of techniques inspired by everything from traditional knowledge representation and reasoning (formal symbolic rules of inference) to the latest in statistical machine learning methods. Dare I say they've achieved their success by finding the right balancing act among myriad ways of doing things? :)

Second, thanks to the nature of the task, the system has been forced to do a good job assessing confidence in its own results. This is no small matter for language technology: most of the systems we encounter on a day to day basis simply come up with the best answer they can and hand it to you, and you either like the results or you don't. (Think about what comes back when you do a search engine query, or use automatic translation, or dictate a letter into a speech recognition system.) There are certainly exceptions — for example, voice menu systems are often smart enough to ask you to repeat yourself if they couldn't recognize what you said with high enough confidence — but when the stakes are high, the systems will fall back to relying on a human in the loop. (Please hold while I transfer your call to the first available agent…) For Watson, in contrast, the stakes are high and there's no human fallback, so it's crucial for the system to have an effective assessment of whether to buzz in or not.
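The buzz-or-stay-silent decision described above amounts to thresholding the confidence of the best candidate. The sketch below shows only that idea; the function name, the 0.5 threshold, and the example confidences are assumptions for illustration, not Watson's actual values or logic.

```python
def decide_to_buzz(scored_answers, threshold=0.5):
    """Return the best answer if its confidence clears the threshold, else None.

    scored_answers is a list of (answer, confidence) pairs, with confidence
    in [0, 1]. Returning None means "do not buzz in".
    """
    if not scored_answers:
        return None
    answer, confidence = max(scored_answers, key=lambda pair: pair[1])
    return answer if confidence >= threshold else None


if __name__ == "__main__":
    # Hypothetical confidences for two clues.
    print(decide_to_buzz([("Caspian Sea", 0.92), ("Lake Superior", 0.05)]))  # Caspian Sea
    print(decide_to_buzz([("Porcupine", 0.20), ("Keratin", 0.31)]))          # None
```

The hard part, of course, is producing confidences that mean something; the threshold itself is trivial once they do.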

Third, there's the question of processing speed. You might imagine that in any task involving quick judgments, the computer is at a huge speed advantage. But when you consider how much processing it needs to do for every question, it's actually quite remarkable that Watson comes up with answers in a second or two rather than a minute or two. (Apparently in the early days of the project it was an hour or two.) Watson's ability to win is going to depend crucially on its ability to delve into larger quantities of data (not the Internet, mind you; it's not connected) much more quickly than most of us language technology researchers usually think about doing things. Moreover, simply being fast on large data is not the issue. There are lots of well known ways to scale up a system to quickly deal with lots of data, if what you're doing involves processing keywords. What's impressive about Watson is that it's doing this scaling up while also going deeper in its analysis than the words on the surface — not full scale syntax and semantics, ok, but a healthy step closer.

I think that by the time the match is over, Watson will definitely have made its mark as a leap forward for the enterprise of language technology in particular and artificial intelligence in general. This kind of attention, and the ensuing discussion, are good things. Will Watson turn out also to have been a great leap forward in terms of the technology itself? Time will tell. Ask me again when I can chat with it after the game, and quiz it about what to get my wife as a Valentine's Day present in order to make up for taking her to a pizza party instead of a candle-lit dinner.

36 Comments

chris said,

February 14, 2011 @ 3:35 pm

I wonder whether Watson will be able to handle Jeopardy's sometimes-odd categories, such as the ones where all correct questions include a specific sequence of letters. Or whether the show runners agreed to use only ordinary categories. Ditto for the occasional video and audio clues; processing those usefully would surely be far more difficult for Watson than listening to Alex Trebek.

And if it's difficult to estimate your confidence in your question *after* you have heard the answer, how much more difficult is it to decide on wagers for Daily Doubles and (assuming a close game at that point) Final Jeopardy?

I'm also curious about Watson's answer-selection strategy; I'd be tempted to program it to deliberately avoid staying in the same category the way most human contestants do, since humans seem to derive some benefit from that and Watson (presumably) does not. Similarly, Watson might or might not be able to "figure out" categories by hearing the lower-value answers in them, so it might benefit from going directly to the big questions, without giving the humans an opportunity to get into the right frame of mind first.
Charles Gaulke said,

February 14, 2011 @ 3:36 pm

Whether it's a leap forward technologically or not it's a nice party piece for a field that most people associate with bank phone systems that constantly misunderstand you.

My only issue with this kind of thing is that, like Deep Blue, the general public are likely to primarily perceive this as a demonstration of "AI" rather than progress in a particular field of research. Sadly these kinds of demonstrations aren't really "easily comprehended" – people see it in terms of competition rather than progress, and think it's about how "smart" the machines are relative to human beings, not how much better we understand how certain things work than we used to.
Sili said,

February 14, 2011 @ 4:03 pm

(not the Internet, mind you; it's not connected)

Any artificial intelligence worth its salt can go online on its own.
Paul Kay said,

February 14, 2011 @ 4:57 pm

I hope it's made clear on the broadcast that Watson does not do speech recognition. It gets its answers at the same time as the human contestants get them orally on a private channel, in writing (so to speak). This can put it at somewhat of a disadvantage, in that it cannot hear the wrong answers of other contestants. At least, that's what I heard on a NOVA program about this. It's kinda interesting that IBM wasn't willing to throw its speech recognition mojo into the pot, given the comparative success of speech recognition in NLP.
Charles said,

February 14, 2011 @ 7:33 pm

That's not to say I don't think the IBM team hasn't made some very impressive technical advances.

Really?
Garrett Wollman said,

February 14, 2011 @ 9:47 pm

One thing is absolutely clear: the value of free media IBM is getting for this stunt vastly exceeds the investment they made in its development.
Sniffnoy said,

February 14, 2011 @ 9:55 pm

This can put it at somewhat of a disadvantage, in that it cannot hear the wrong answers of other contestants. At least, that's what I heard on a NOVA program about this.

Indeed, on the program there was one case where Watson answered "the 1920s" after Ken Jennings had already answered "the 20s" – though admittedly it's questionable if these would have been recognized as the same in the first place, had learning from others' wrong answers been implemented.
Dan Lufkin said,

February 14, 2011 @ 10:19 pm

I heard a discussion of Watson on NPR this morning that gives you an idea of the amount of random information that has been stuffed into its memory. During development (not operating in full Jeopardy mode, I guess) Watson was asked, "What do grasshoppers eat?". Answer: "Kosher".
Hermann Burchard said,

February 14, 2011 @ 10:27 pm

Didn't watch (forgot), but googled the results just now. One missed answer of the machine was mentioned: Watson seemed to miss a Boolean "or" completely. Fixing Boolean expression handling could slow up the machine quite a bit, presumably requiring more parallelism. Brain architecture is multi-processor & mostly parallel, I am guessing.
James Kabala said,

February 14, 2011 @ 10:42 pm

Grasshoppers are indeed kosher:

http://www.biblegateway.com/passage/?search=Leviticus+11%3A22-23&version=KJV
Daniel Barkalow said,

February 14, 2011 @ 11:52 pm

One of the articles had a couple of Ken Jennings's favorite wrong answers that Watson gave in practice rounds, including, to the clue "What grasshoppers eat": "What is kosher?" I hope the actual games have instances that are like that, because I'd really like to see Alex Trebek's face as he tries to deal with completely unexpected responses which are technically accurate but not at all what they're looking for.
Spell Me Jeff said,

February 15, 2011 @ 8:53 am

@Hermann Burchard
I doubt the programmers neglected Booleans in the processing, as programmers eat, sleep, and drink Booleans.

My impression, after watching other questions, is that Watson has several competing imperatives. One of these, obviously, is digging up the best response. But another is speed. Like a good human performer, Watson probably begins formulating and reformulating answers as soon as it digests a semantic unit. When it has a high level of confidence, it answers.

We know that speed to the button is important in this match, despite Watson's high processing speeds, because several times a human did in fact beat Watson to the button even though Watson was prepared to give the correct answer.

So I don't think in this case that Watson failed to interpret a Boolean in the Apex question. Rather, I suspect it had formulated an answer in response to the first part of the question, and its confidence was so high that it hit the button before it began to process the sentence beginning with "or."

Negotiating competing imperatives is a very intelligent thing to do, and failing to do so correctly is characteristic of true intelligence. If my analysis is right, this may well have been one of Watson's more sophisticated moments. Its "desire" to answer quickly led it to jump the gun.
Hermann Burchard said,

February 15, 2011 @ 1:40 pm

@Spell Me Jeff

Sure, Watson's programmers deal with Booleans, but incorrectly. If the machine gets a Boolean "or", the imperative is to investigate each factor proposition in parallel immediately. Dealing sequentially in "competing imperatives" simply is not going to be good enough. Your explanation confirms my earlier conclusions, unwittingly.
Philip said,

February 15, 2011 @ 1:42 pm

My layman's view of last night's Jeopardy show is that Watson didn't pass the Turing test. It seemed apparent to me that s/he was not a human being. What do you all think?
Stephen Nicholson said,

February 15, 2011 @ 2:09 pm

It's hard to pass a Turing test when we can see a monolith responding rather than a human.

That said, I'm interested in what the creation of Watson means for examining how humans learn. I was talking to my fiancée, a teacher, about machine learning and she mentioned how that corresponds to the idea of learning as a social activity. You can't just write rules for humans either; you have to give them examples they can work with. Also, programming Watson to learn from previous answers (mentioned on Nova, but I didn't notice it on Jeopardy when Watson gave the same incorrect answer that Ken gave) is a good example of how humans learn.

I'm recording all three programs. This interests me a lot more than Deep Blue did because of the differences between chess and Jeopardy.
chris said,

February 15, 2011 @ 2:16 pm

So I don't think in this case that Watson failed to interpret a Boolean in the Apex question. Rather, I suspect it had formulated an answer in response to the first part of the question, and its confidence was so high that it hit the button before it began to process the sentence beginning with "or."

That doesn't seem to fit with Paul Kay's claim that Watson gets the questions in writing at the same time Alex reads them to the humans — surely that would mean Watson gets the whole question and can start working on any part of it? Or do they actually reveal the question to Watson word-by-word at the same rate as Alex's speech?

(Being able to look at the whole question instantly, rather than waiting for Alex to say it, is an advantage that home viewers also enjoy compared to the contestants — at least, as far as I know.)
Rhodent said,

February 15, 2011 @ 2:16 pm

Philip: Watson was not trying to pass the Turing test, so I don't see much importance in the fact that it didn't.

That being said, my wife and I had great fun imagining Alex trying to do his typical post-first-commercial-break banter with Watson. We figured its hobbies must include long walks along the beach and spelunking.
Alexandra said,

February 15, 2011 @ 2:39 pm

Just out of curiosity, from someone who knows absolutely nothing about this topic, would a crossword-puzzle-solving computer be easier or harder to make than a Jeopardy!-playing one?
Philip Resnik said,

February 15, 2011 @ 4:05 pm

@Alexandra: Michael Littman did this: see http://www.oneacross.com/proverb/. His approach to the problem was similar in spirit to Watson, actually, involving lots of individual components providing constraints/suggestions and a combiner that put everything together to suggest the best hypotheses. It actually did pretty well, especially if you consider that it emerged from a class project, not a huge multi-year effort by an industry monolith. But I'd judge crosswords as an easier problem. Yes it's open domain, i.e. it could involve any topic, but in solving a crossword puzzle you get to take advantage of all the clues at one time constraining all the possible solutions.
Philip said,

February 15, 2011 @ 4:15 pm

I know Watson wasn't trying to pass the Turing test. But Watson is a remarkable advance in the computer processing of natural language. All's I'm saying is it doesn't appear, to me, at least, that we're that close–yet.
Philip said,

February 15, 2011 @ 4:18 pm

Ooops. What I was also asking is whether it appeared to others out there Watson's responses seemed non-human. The fact that visually he's an icon or avatar isn't part of my question.
Kyle said,

February 15, 2011 @ 5:38 pm

chris – as a contestant, you can read the question about as well as the audience at home. It really does come up on the screens on the big game board. It's a little far away, though (further than the distance between my couch and TV, anyhow). You've basically got to read it – it is very difficult to process the question quickly enough if you're only going by the sound of Alex's voice (though not impossible – there's been at least one Jeopardy champion who was completely blind).
The Ridger said,

February 15, 2011 @ 10:05 pm

Also, programming Watson to learn from previous answers (mentioned on Nova, but I didn't notice it on Jeopardy when Watson gave the same incorrect answer that Ken gave) is a good example of how humans learn.

Actually, Watson doesn't know what the other contestants have answered, so he didn't know Ken Jennings had already tried that answer.

What I find fascinating is Watson's second and third choices. For instance, "Porcupine" for what stiffens a hedgehog's quills (Keratin), or "Gardner Museum" for Rembrandt's Storm on the Sea of ____ (Galilee). He may get to the right answer, but he doesn't get there the way a person would.
Bob Kennedy said,

February 16, 2011 @ 1:55 am

I think IBM's achievement in creating Watson is pretty remarkable, but the results so far are misleading because of the way that Jeopardy games are scored. Only the person who rings in first gets credit for a question, so winning a game is partly a function of who has the best reflexes. Hypothetically, you can know the answer to every question but still lose. It seems like Watson has an advantage ringing in.

Ringing in is challenging because there is a period of time in which nobody is allowed to ring in – basically, while the question is read aloud. But contestants (including Watson) as well as home viewers can read ahead in the clue and know the answer quite quickly, well before ringing in is allowed.

Contestants must wait until a bank of lights are illuminated to ring in – if they jump the gun, they are locked out for 0.25 sec, during which time the lights may come on and someone else can ring in. Watson seems to be programmed to ring in as soon as is allowed, while the human contestants are error-prone. They either ring in early and get locked out, or they wait for the lights and are beaten by Watson. The only way to beat Watson is in the scenario where the lights have come on (so ringing in is allowed) but the computer's confidence has not yet reached its ringing-in threshold.

A fairer comparison would have each contestant answer all the same questions, with the option of passing (for zero penalty), and with a time limit on each question. You would probably see the human contestants answering a lot of the questions that Watson got. Thus there would be much less variance across their scores – but this would not fit the Jeopardy model of competition.

Watson has several other advantages that aren't obvious in this exercise. It doesn't get tired or nervous, and it doesn't get knocked off its game. Human contestants can get fazed by a serious wrong-answer penalty or by aggressive play on the part of their opponents. Also, its knowledge presumably does not have a recency bias. But I think the ringing-in advantage is most responsible for its relative success.
Neil said,

February 16, 2011 @ 5:30 am

Bob Kennedy makes a great point – one I can back up having been on the receiving end of a hiding on UK quiz show University Challenge. The other team didn't necessarily know THAT much more than us but they were a hell of a lot quicker on the buzzer.

The other thing I've learnt from this is just how trite the 'answer/question' format is. In what world do you ask 'What is Chicago?' and receive the answer 'Its largest airport is named for a World War II hero; its second largest, for a World War II battle'?!
Philip Resnik said,

February 16, 2011 @ 6:44 am

I agree with Bob Kennedy's comment; indeed, I've been schooled on this by a colleague who is quite a rabid Jeopardy! expert. However, I think it's missing the real point. Fine, what we have here is a player with championship-quality ability to answer Jeopardy questions, who has an advantage with respect to response time in the competition against the other players. The point is, wow, somebody built a computer with championship-quality ability to answer Jeopardy questions. If there were a Jeopardy special presentation on TV where Watson simply was given Jeopardy questions one after the other and either answered them or said "pass", it would make the same point, but nobody would watch it. As Dr. Johnson famously put it, of a dog's walking on his hind legs, "It is not done well; but you are surprised to find it done at all."
C. Jason said,

February 16, 2011 @ 8:01 am

I can't quite remember from the show, but watching it at the time it seemed to me Watson was mixing up his 'who's and 'what's, particularly with the Beatles' song questions. Did anyone else notice that? My understanding of the rules was that such mistakes invalidate the contestant's answer. Am I wrong in this, or was Alex being lenient?

Regardless, it was an impressive display of programming.
Trey Jones said,

February 16, 2011 @ 9:13 am

On the topic of Boolean operators and NLP: Watson can't just assume an "or" means OR, or that an "and" means AND. Back in my search engine days we struggled with users who conflated NL "and" and "or" with Boolean AND and OR when trying to construct queries. They can map crosswise:

• I need a flea collar that would work for a cat or a dog. ("or" means AND/set intersection)

• I am interested in soccer scores and hockey scores. ("and" means OR/set union)

So, Watson can't just jump on the word "and" or "or" and immediately know what to do. The scope of the conjunction and the meaning of the words affect the outcome.
Trey Jones said,

February 16, 2011 @ 9:20 am

Oh.. a couple of other points. Watson probably does have access to the internet.. in the sense that it has a snapshot of the internet on its hard drives. Google "IBM WebFountain" for more.

Also, C. Jason, I don't think you are required to get the particular form of the question correct. I recall quite vividly a snarky teen player who, when given a clue about a particular male model who has graced the cover of many romance novels, replied for full credit, "What is Fabio?"
un malpaso said,

February 16, 2011 @ 10:34 am

Good point about the Internet being loaded on Watson. I know it is entirely possible for it to contain an entire copy of Wikipedia, which can be downloaded and put on a memory stick no larger than 1.5 gigabytes; presumably Watson has much, much more space than that available. :)

I am just waiting for someone to ask it the question, "Are you self-aware?" The classic AI answer is "Yes. Are you?"
Bob Kennedy said,

February 16, 2011 @ 1:15 pm

As Dr. Johnson famously put it, of a dog's walking on his hind legs, "It is not done well; but you are surprised to find it done at all."

Absolutely … I repeat I think it's remarkable that Watson can do as well as it does, it's just that the scores make it seem like he's 6 times as good as Ken Jennings or Brad Rutter, the two unequivocally best players ever. I think in a different format (where every player gets a crack at every question) Watson would probably still outperform the humans, but just not by as wide a margin.
chris said,

February 16, 2011 @ 4:53 pm

I repeat I think it's remarkable that Watson can do as well as it does, it's just that the scores make it seem like he's 6 times as good as Ken Jennings or Brad Rutter, the two unequivocally best players ever.

ISTM that if the outcome is being determined to any substantial extent by ringing-in speed, then the questions are too easy for the caliber of players that are playing, and therefore the outcome doesn't reliably distinguish the best player.

But the definition of what constitutes skill as a Jeopardy player may be a little slippery — it could be argued that speed *is* part of the skill, and a part Watson just happens to be very, very good at, which would mean that potentially it (it's interesting that you wrote "he") could really be 6 times as good as the best human. It doesn't take an exceptionally fast car to be 6 times as fast as the fastest human.
Rhodent said,

February 16, 2011 @ 5:48 pm

C. Jason: As long as it's in the form of a question, it's accepted even if the phrasing does appear odd. Many contestants simply attach "What is" to the beginning of every single response they give, and this is fine.

On one occasion a contestant was clearly guessing and his response was phrased "Is it ______?" The answer was accepted since his response was in the form of a question.
Between Bread › “Delete key is where the heart is” said,

February 17, 2011 @ 12:25 am

[…] Jeopardizing Valentine's Day (Language Log) "Who knew you could find a comprehensive archive of previous Jeopardy questions and answers?" […]
C. Jason said,

February 17, 2011 @ 1:05 am

@Trey Jones
@Rhodent

Thank you both for the clarification. Watching this evening's episode I paid more attention and did notice the human players doing the same thing.
Baylink said,

February 21, 2011 @ 1:36 am

> Also, C. Jason, I don't think you are required to get the particular form of the question correct. I recall quite vividly a snarky teen player who, when given a clue about a particular male model who has graced the cover of many romance novels, replied for full credit, "What is Fabio?"

Ken has been noted — in his book _Brainiac_, I think — to wonder what would happen if he ever answered "This Major League Baseball team holds the record for longest time since an appearance in the World Series." with "Why the hell would anyone ever want to go see the Cubs?"
