The sparseness of linguistic data
Gary Marcus and Ernest Davis say in a New York Times piece on why we shouldn't buy all the hype about the Big Data revolution in science:
Big data is at its best when analyzing things that are extremely common, but often falls short when analyzing things that are less common. For instance, programs that use big data to deal with text, such as search engines and translation programs, often rely heavily on something called trigrams: sequences of three words in a row (like "in a row"). Reliable statistical information can be compiled about common trigrams, precisely because they appear frequently. But no existing body of data will ever be large enough to include all the trigrams that people might use, because of the continuing inventiveness of language.
To select an example more or less at random, a book review that the actor Rob Lowe recently wrote for this newspaper contained nine trigrams such as "dumbed-down escapist fare" that had never before appeared anywhere in all the petabytes of text indexed by Google. To witness the limitations that big data can have with novelty, Google-translate "dumbed-down escapist fare" into German and then back into English: out comes the incoherent "scaled-flight fare." That is a long way from what Mr. Lowe intended — and from big data's aspirations for translation.
They're right, of course. I have noticed myself that it is extraordinarily easy to take, say, the emails you received this morning and verify that particular three-word sequences in them seem never to have occurred before in the history of the web. I remember that when I was reflecting on how the sequences Kaavya Viswanathan had lifted from Megan McCafferty showed that her novel had been plagiarized (see this post and this one from 2006), I took the first three words of the fairly routine email my accountant had sent me that morning and did a Google search for the phrase. There was not a single hit.
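For anyone who wants to replicate that little experiment, here is a minimal sketch in Python; the crude tokenization and the invented example sentence are purely illustrative assumptions, not a description of how any search engine actually works. It simply pulls the word trigrams out of a piece of text so that each one can be pasted into a search engine as an exact-phrase query.

```python
import re

def word_trigrams(text):
    """Return the sequences of three consecutive words in `text`."""
    # Lowercase and keep only runs of word characters, apostrophes, and
    # hyphens, so that punctuation does not split or pollute the trigrams.
    words = re.findall(r"[\w'-]+", text.lower())
    return [" ".join(words[i:i + 3]) for i in range(len(words) - 2)]

# An invented sentence standing in for a "fairly routine email";
# it is not a quotation from anyone's actual correspondence.
sentence = "Please find the amended depreciation schedule attached herewith."
for trigram in word_trigrams(sentence):
    print(f'"{trigram}"')  # quote each trigram for an exact-phrase search
```

Paste a few of the quoted trigrams into a search engine and you will find, more often than you might expect, that the exact-phrase count is zero.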
Linguistics is in the familiar position of having to steer a course between two methodological extremes. On the one hand, the availability of about a trillion words of naturally occurring text on the web is an incredibly valuable resource, if used with skill, for figuring out what's possible and what's not in the language, and for testing conjectures about how the grammar should be formulated. Posts here on Language Log (especially those by Mark Liberman) have shown that over and over again, as any regular reader will know. 21st-century linguists would be deeply foolish to stick to typical 20th-century methodology: largely ignoring what occurs, and basing everything on personal intuitions of what sounds acceptable.
But on the other hand, it really is true that for most grammatical sequences of words the probability of their actually having turned up on the web is approximately zero, so grammaticality cannot possibly be reduced to probability of having actually occurred. Not even for word trigrams is that a reasonable equation. The data will always be too sparse to allow us to use methods (such as that of Google Translate) that rely in effect on the proposition that if the language allows some word sequence then it will have occurred somewhere before. The truth is that billions of word sequences that are grammatical have occurred (plus millions that aren't grammatical), but most grammatical sequences have never occurred at all.
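Some back-of-the-envelope arithmetic shows why the sparseness is unavoidable; the two figures plugged in below are round assumptions for illustration, not measurements. Even a modest vocabulary yields vastly more possible trigrams than a trillion-word corpus could ever contain.

```python
# Back-of-the-envelope illustration of trigram sparseness.
# Both input figures are round assumptions, not measured values.
vocabulary_size = 100_000          # distinct word forms in active use
corpus_tokens = 1_000_000_000_000  # roughly a trillion words of web text

possible_trigrams = vocabulary_size ** 3
# A corpus of N running words contains at most N - 2 trigram tokens,
# so it can contain at most that many distinct trigrams.
max_attested_trigrams = corpus_tokens - 2

print(f"possible trigrams:  {possible_trigrams:.1e}")                 # 1.0e+15
print(f"attestable trigrams: under {max_attested_trigrams:.1e}")      # about 1.0e+12
print(f"share attestable: at most "
      f"{max_attested_trigrams / possible_trigrams:.1%}")             # about 0.1%
```

On those assumed figures, at most about a tenth of one percent of the possible trigrams over such a vocabulary could ever have been attested, and that is before we even ask which of them are grammatical; the gap only widens for sequences longer than three words.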
So we have to show some taste and discernment. To identify the language with the body of what has been written and said in it so far would be to veer way too far in the direction of dumb subservience to observables. But to ignore the huge amount of information that can be extracted from corpora of genuinely attested utterances would be to throw away a precious resource and needlessly mire the study of syntax in a sort of navel-contemplation, a cataloguing of possibly inaccurate personal reactions to invented sentences. We have to walk the narrow margin of sensible methodology that lies between the two.