I've stolen the title of this post from the subject line of a message from Hal Daumé, who has invited folks at University of Maryland to a huge Jeopardy-watching party he's organizing tonight. Today is February 14, so for at least some of the audience, Jeopardy might indeed jeopardize Valentine's Day, substituting geeky fun (I use the term fondly) for candle-lit dinners.
In case you hadn't heard, the reason for the excitement, pizza parties, and so forth is that tonight's episode will, for the first time, feature a computer competing against human players — and not just any human players, but the two best known Jeopardy champions. This is stirring up a new round of popular discussion about artificial intelligence, as Mark noted a few days ago. Many in the media — not to mention IBM, whose computer is doing the playing — are happy to play up the "smartest machine on earth", dawn-of-a-new-age angle. Though, to be fair, David Ferrucci, the IBMer who came up with the idea of building a Jeopardy-playing computer and led the project, does point out quite responsibly that this is only one step on the way to true natural language understanding by machine (e.g. at one point in this promotional video).
Regardless of how the game turns out, it's true that tonight will be a great achievement for language technology. I would also argue, though, that the achievement lies as much in the choice of problem as in the technology itself.
First, a little background. Watson, named after IBM's founder, is a question answering system. That may seem obvious, since it's a system that answers questions, but question answering (QA) is also a quasi-technical term that refers to a decades-old research area. No historical discussion of natural language processing is complete without Bill Woods's LUNAR, an early-1970s system that answered questions about moon rocks brought back on the Apollo missions. More recently, over the last ten or fifteen years, QA systems have been a central focus of research for many in the natural language processing community, thanks in part to government research funding initiatives in the U.S.
As is common for such initiatives, research teams have typically participated in regular "bake-offs" — that is, community-wide shared task evaluations — in which systems are given the same test data and evaluated using some agreed-upon method for judging answer quality. Interestingly, figuring out how to evaluate the systems is often half (or more!) of the problem you're trying to solve. Machine translation is famous (notorious?) for giving rise to as much research into evaluation as into the problem itself, and question answering is similar in some important ways. How do you measure whether the output of a system constitutes a valid translation of the input? How do you decide whether an assertion constitutes a correct answer to a question? (Yes, Jeopardy switches things around: it gives you "clues" in the form of answers, and players have to respond in the form of a question. I'm ignoring that cute little gimmick for purposes of this discussion.)
Like the MT community, the QA community has utilized both automatic evaluations (comparing system output against "ground truth" provided by human experts) and evaluations in which human assessors judge the quality of system outputs. The formal details of those evaluations can get pretty complex (try explaining the brevity penalty in the widely used BLEU MT evaluation metric to a non-specialist), to the point where understanding them can require as much expertise as understanding the research itself. That does not make for a compelling narrative about the advance of the technology.
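To give a sense of the intricacy, here is that brevity penalty itself, which keeps MT systems from gaming n-gram precision by emitting very short outputs. This is the standard formula from the BLEU metric, sketched in a few lines:

```python
import math

def brevity_penalty(candidate_len, reference_len):
    """BLEU's brevity penalty: 1.0 for outputs at least as long as the
    reference, decaying exponentially for shorter ones."""
    if candidate_len >= reference_len:
        return 1.0
    return math.exp(1.0 - reference_len / candidate_len)

# A candidate two-thirds the reference length gets penalized:
# brevity_penalty(10, 15) == exp(-0.5), roughly 0.61
```

Explaining to a non-specialist why that exponential is the right shape of penalty is exactly the kind of conversation that doesn't make for a compelling popular narrative.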
Viewing Watson in this context, I would have to say that, despite its genuine technical advances (of which more below), the true stroke of genius behind the technology is the idea of playing Jeopardy in the first place. In 1996-1997, IBM's Deep Blue challenged and ultimately beat the reigning world chess champion. The evaluation was clear: you didn't need to understand chess to understand what it meant for a machine to beat the world's best human chess player. Now, once again, IBM has found a way to demonstrate technological progress in an easily comprehended way that captures the popular imagination.
It was a great choice in terms of technological foundations, too. Jeopardy's clues are similar to the questions asked of widely studied "factoid" question answering systems — generally a single who, what, where, or when, not a why or how, and not a complex multi-part query. Most Jeopardy clues provide you with a relatively fine-grained semantic category for the sought-after answer; e.g. "World's largest lake, nearly 5 times as big as Superior". (That question happens to come from the premiere episode of the show's current incarnation, on Monday, September 10, 1984. Who knew you could find a comprehensive archive of previous Jeopardy questions and answers?) Finally, the game's discourse consists of a regimented protocol, not an interactive dialogue, so although natural language processing is certainly required, there is no need for Watson to launch itself down the slippery slope of natural language interaction.
That's not to say the IBM team hasn't made some very impressive technical advances. For an understandable overview of how the system works, watch this video of David Ferrucci giving a brief introduction to the project. Although many of the high level steps are what you would expect in a QA system (identify the semantic type of the desired answer, e.g. lake in the example above; blast out queries in parallel to find a large number of candidate answers; filter those potential answers down to a manageable number in order to analyze them more deeply), three things seem particularly worth noting.
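Those high level steps can be caricatured in a few lines of code. To be clear, this toy is my own illustration, with invented data and heuristics; it resembles Watson's actual components only in the shape of the pipeline:

```python
import re

# Toy stand-in for a knowledge source: (candidate, semantic type, keywords).
FACTS = [
    ("Caspian Sea",   "lake",  {"largest", "superior", "salt"}),
    ("Lake Superior", "lake",  {"superior", "great"}),
    ("Mississippi",   "river", {"longest", "delta"}),
]

def detect_answer_type(clue):
    """Step 1: guess the semantic category the clue is asking for."""
    for type_word in ("lake", "river", "president"):
        if type_word in clue.lower():
            return type_word
    return "thing"

def answer(clue):
    answer_type = detect_answer_type(clue)
    words = set(re.findall(r"[a-z']+", clue.lower()))
    # Step 2: cast a wide net -- any entry sharing a keyword is a candidate.
    candidates = [f for f in FACTS if f[2] & words]
    # Step 3: filter candidates down by the expected semantic type.
    typed = [f for f in candidates if f[1] == answer_type]
    # Step 4: "deeper analysis" -- here, just keyword-overlap scoring.
    if not typed:
        return None
    return max(typed, key=lambda f: len(f[2] & words))[0]
```

On the lake clue from earlier, this toy correctly prefers the Caspian Sea over Lake Superior, because the type filter and the overlap score both pull in that direction. The real system's versions of each step are, of course, vastly more sophisticated.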
First, the system omnivorously combines multiple forms of knowledge, including structured (like the WordNet lexical database), semi-structured (like Wikipedia infoboxes), and unstructured (lots and lots of text on a zillion topics), and it uses a whole panoply of techniques inspired by everything from traditional knowledge representation and reasoning (formal symbolic rules of inference) to the latest in statistical machine learning methods. Dare I say they've achieved their success by finding the right balancing act among myriad ways of doing things? :)
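One common machine-learning way to merge that kind of heterogeneous evidence is a weighted combination of per-source scores, squashed into a probability-like number. The feature names and weights below are entirely made up for illustration; I'm not claiming this is Watson's scoring model:

```python
import math

def combined_score(features, weights, bias=0.0):
    """Merge per-source evidence scores for one candidate answer via a
    logistic (sigmoid) combination of weighted features."""
    z = bias + sum(weights[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

# Invented evidence for one candidate, one score per kind of knowledge:
evidence = {"wordnet_type_match": 1.0,   # structured
            "infobox_match": 1.0,        # semi-structured
            "text_support": 0.7}         # unstructured
weights = {"wordnet_type_match": 1.2, "infobox_match": 0.8, "text_support": 2.0}
```

The interesting part in practice is not the sigmoid, but learning those weights from data so that no single source of knowledge dominates when the others disagree.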
Second, thanks to the nature of the task, the system has been forced to do a good job of assessing confidence in its own results. This is no small matter for language technology: most of the systems we encounter on a day-to-day basis simply come up with the best answer they can and hand it to you, and you either like the results or you don't. (Think about what comes back when you do a search engine query, or use automatic translation, or dictate a letter into a speech recognition system.) There are certainly exceptions — for example, voice menu systems are often smart enough to ask you to repeat yourself if they couldn't recognize what you said with high enough confidence — but when the stakes are high, the systems will fall back to relying on a human in the loop. (Please hold while I transfer your call to the first available agent…) For Watson, in contrast, the stakes are high and there's no human fallback, so it's crucial for the system to have an effective assessment of whether to buzz in or not.
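One way to picture that buzz-or-pass decision is as a simple expected-value rule over the confidence estimate. The rule and numbers here are my own illustration, not Watson's actual betting strategy:

```python
def should_buzz(confidence, clue_value):
    """Buzz only when the expected dollar gain is positive: a correct
    response earns the clue's value, an incorrect one deducts it."""
    expected_gain = confidence * clue_value - (1.0 - confidence) * clue_value
    return expected_gain > 0

# With symmetric stakes this reduces to "buzz when confidence > 1/2":
# should_buzz(0.6, 800) -> True, should_buzz(0.4, 800) -> False
```

What makes the real problem hard is not this arithmetic but producing a confidence number that is actually calibrated, so that "60% sure" means being right about 60% of the time.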
Third, there's the question of processing speed. You might imagine that in any task involving quick judgments, the computer is at a huge speed advantage. But when you consider how much processing it needs to do for every question, it's actually quite remarkable that Watson comes up with answers in a second or two rather than a minute or two. (Apparently in the early days of the project it was an hour or two.) Watson's ability to win is going to depend crucially on its ability to delve into larger quantities of data (not the Internet, mind you; it's not connected) much more quickly than most of us language technology researchers usually think about doing things. Moreover, simply being fast on large data is not the whole story. There are lots of well-known ways to scale up a system to deal quickly with lots of data, if what you're doing involves processing keywords. What's impressive about Watson is that it's doing this scaling up while also going deeper in its analysis than the words on the surface — not full-scale syntax and semantics, ok, but a healthy step closer.
I think that by the time the match is over, Watson will definitely have made its mark as a leap forward for the enterprise of language technology in particular and artificial intelligence in general. This kind of attention, and the ensuing discussion, are good things. Will Watson turn out also to have been a great leap forward in terms of the technology itself? Time will tell. Ask me again when I can chat with it after the game, and quiz it about what to get my wife as a Valentine's Day present in order to make up for taking her to a pizza party instead of a candle-lit dinner.