Sindya N. Bhanoo, "Real Words or Gibberish? Just Ask a Baboon", NYT 4/16/2012:
While baboons can’t read, they can tell the difference between real English words and nonsensical ones, a new study reports.
“They are using information about letters and the relation between letters to perform the task without any kind of linguistic training,” said Jonathan Grainger, a psychologist at the French Center for National Research and at Aix-Marseille University in France who was the study’s first author.
Some other media coverage: Seth Borenstein, "If you're reading this, you might be a baboon", AP (reprinted in Christian Science Monitor, 4/13/2012); Sharon Begley, "This is Dan. Dan is a Baboon. Read, Dan, Read", Reuters (reprinted in the Chicago Tribune, 4/12/2012); "See Dan read: Baboons can learn to spot real words", Fox News, 4/12/2012; "Baboons leave scientists spell-bound", ABC Science; "Baboons 'trained to read English'", ITN, 4/13/2012; "Baboon Word Skills Cause Linguistics Rethink"; "Reading time at the zoo: the baboons that excel at English"; "Baboons have a way with words"; "Baboons can recognize written words, study finds"; "Baboons touch on evolution of reading"; etc., etc.
Nature covered this story in its news section under the headline "Baboons can learn to recognize words", illustrated with a picture of a baboon holding up an issue of their journal, over the caption "Baboons can learn how to work out when a four-letter word is real English, and when it's nonsense".
The paper behind all the buzz is Jonathan Grainger, Stéphane Dufau, Marie Montant, Johannes C. Ziegler, and Joël Fagot, "Orthographic Processing in Baboons (Papio papio)", Science 336 (6078) pp. 245-248, 4/13/2012. The press release was published in Science Daily as "Baboons Display 'Reading' Skills, Study Suggests; Monkeys Identify Specific Combinations of Letters in Words", 4/16/2012.
There are two sad facts about this situation. The first sad fact — no less sad because it's completely predictable — is that the study itself is far more circumspect than the press coverage:
Over a period of a month and a half, baboons learned to discriminate dozens of words (the counts ranged from 81 words for baboon VIO to 308 words for baboon DAN) from among a total of 7832 nonwords at nearly 75% accuracy (Fig. 2 and table S1). This in itself is a remarkable result, given the level of orthographic similarity between the word and nonword stimuli. More detailed analyses revealed that baboons were not simply memorizing the word stimuli but had learned to discriminate words from nonwords on the basis of differences in the frequency of letter combinations in the two categories of stimuli (i.e., statistical learning). Indeed, there was a significant correlation between mean bigram frequency and word accuracy [correlation coefficients (r) ranged from 0.51 for baboon VIO to 0.80 for baboon DAN, all P values < 0.05; see supplementary materials]. More importantly, words that were seen for the first time triggered significantly fewer “nonword” responses than did the nonword stimuli (Fig. 3). This implies that the baboons had extracted knowledge about what statistical properties characterize words and nonwords and used this information to make their word versus nonword decision without having seen the specific examples before. In the absence of such knowledge, words seen for the first time should have been processed like nonwords.
That is, the study's key claim is not that baboons can learn to read or to spell or to distinguish English words from non-words in a general sort of way, or even that they necessarily can memorize the spelling of 70-300 specific English words, but rather that the baboons in this study learned something like differences in bigram (letter-pair) frequencies, or perhaps other differences in "the frequency of letter combinations", and used this knowledge to distinguish a smallish set of English words from a larger set of non-words, where "distinguish" means forced-choice discrimination at about 75% correct, against a chance level of 50%.
The second sad fact — and this one is also, alas, predictable for a paper related to language published in Science — is that the study's (relatively circumspect) conclusions are in fact not supported by the evidence provided. The problem should be obvious to anyone with general scientific abilities who reads the paper somewhat carefully — and it takes just a few minutes to establish this point quantitatively.
In particular, when we discussed this paper yesterday in a small class on mathematical modeling of linguistic phenomena, the students immediately asked whether bigrams were really needed. Perhaps only the first letter would be enough? Or maybe the words and non-words could be distinguished purely on the basis of their unigram frequencies (that is, the frequencies of single letters, regardless of context)? Such questions were obvious ones given the paper's description of how the two sets were chosen:
A set of 2,235 English four-letter words and their printed frequencies was extracted from the CELEX word-frequency corpus (23). Bigram frequencies were calculated for each of the three contiguous bigrams of a word (letters 1&2, letters 2&3, letters 3&4) counting the number of times these bigrams appeared in the corpus at the same position. Mean bigram frequency was calculated for each word by averaging the three positional bigram frequencies. The 500 words with the highest mean bigram frequency were selected as “word” stimuli. A set of 10,091 four-letter nonwords was created using all the bigrams that appeared at a particular position (initial, medial, terminal) in the CELEX English four-letter word list. All nonwords were composed of one vowel letter and three consonants, and the vowel could be at any position. Each nonword was associated with a mean bigram frequency that was calculated using position-specific bigram frequencies as with the word stimuli. We then selected 7,832 nonwords that had a mean bigram frequency that was less than the lowest mean bigram frequency of the word stimuli (mean bigram frequency for words, 3.60×10^-4; mean bigram frequency for nonwords, 5.96×10^-5).
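The selection measure described in that passage can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy lexicon is hypothetical, each word counts once (the excerpt does not say whether the paper weighted bigram counts by printed frequency), and the function names are mine, not the authors'.

```python
from collections import Counter

def positional_bigram_counts(words):
    """Count each bigram separately at positions 0, 1 and 2 of four-letter words."""
    counts = [Counter(), Counter(), Counter()]
    for w in words:
        for pos in range(3):
            counts[pos][w[pos:pos + 2]] += 1
    return counts

def mean_bigram_frequency(item, counts, total):
    """Average the three position-specific relative bigram frequencies of an item."""
    return sum(counts[pos][item[pos:pos + 2]] / total for pos in range(3)) / 3

# Hypothetical toy lexicon; every word counts once (an assumption).
lexicon = ["THAT", "THIS", "THEN", "WHAT", "WHEN", "SHIP"]
counts = positional_bigram_counts(lexicon)
total = len(lexicon)

# THIP is a nonword assembled from bigrams that occur at the right positions.
for item in ["THAT", "SHIP", "THIP"]:
    print(item, round(mean_bigram_frequency(item, counts, total), 3))
```

Note that nothing in this measure looks at single-letter frequencies, so selecting stimuli by it leaves the unigram distributions of the two sets free to differ.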
According to the list of words and non-words in the paper's Table S2, these were the overall single-letter percentages in the data presented to Dan (the best-performing baboon):
|Letter||In word list||In non-word list|
These unigram percentages are certainly different — but are they different enough?
In the context of the Enigma cryptanalysis during WWII, Alan Turing suggested a way to combine multiple pieces of individually weak evidence in order to decide between two hypotheses: for each piece of evidence, take the log of the ratio of its estimated probability given hypothesis X to its estimated probability given hypothesis Y, and then add up these log likelihood ratios for all the available evidence. A positive sum indicates that the evidence favors hypothesis X; a negative sum favors hypothesis Y.
This idea is now a commonplace one — it's a feature of undergraduate computer science or statistics courses. And Joshua Gold and Michael Shadlen ("Banburismus and the Brain", Neuron 36:299-308, 2002) have presented neurological evidence that monkeys use a similar method in performing a direction-discrimination task (see also "Bletchley Park in the Lateral Interparietal Cortex", 1/9/2004).
We can apply the same method to Dan's overall performance record. The log likelihood ratios implied by the table above are
(In order to avoid infinities, I've substituted 1/1000000 for the zeros in the probability tables, corresponding in this case to q for words and y for nonwords.)
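For readers who want to see how such a table of per-letter log likelihood ratios might be computed, here is a minimal sketch. It assumes natural logarithms and a floor of 1/1000000 for unseen letters, as described above. The tiny word and non-word lists are just a handful of items from Dan's data, far too few to reproduce the real table, and the function names are hypothetical.

```python
import math
from collections import Counter

FLOOR = 1e-6  # stand-in for zero probabilities, as in the post

def letter_probs(items):
    """Relative frequency of each letter, pooled over all positions."""
    counts = Counter("".join(items))
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}

def unigram_llrs(words, nonwords, alphabet="ABCDEFGHIJKLMNOPQRSTUVWXYZ"):
    """log(P(letter | word) / P(letter | nonword)) for every letter."""
    pw, pn = letter_probs(words), letter_probs(nonwords)
    return {
        ch: math.log((pw.get(ch) or FLOOR) / (pn.get(ch) or FLOOR))
        for ch in alphabet
    }

# A few items from Dan's lists, for illustration only.
llr = unigram_llrs(["ACME", "HALL"], ["SHOC", "FETT", "PLUD"])
print({ch: round(v, 3) for ch, v in llr.items() if v != 0.0})
```

Letters that occur in only one of the two lists get a very large positive or negative value, which is the practical effect of the floor.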
The first word on the list supplied from Dan's data was ACME: the log likelihood ratios of its letters are
0.24397 -0.56538 0.23143 1.07683
with a sum of 0.98684, indicating a (correct) guess of "word" with estimated odds of exp(0.98684) = 2.7 to 1. (This is the "first" word in collating order — we aren't given the actual order of presentation.)
The first non-word on Dan's list is ABBS, with unigram log likelihood ratios of
0.24397 -0.31128 -0.31128 -0.25893
and a sum of -0.63753, indicating a (correct) guess of "not a word" with estimated odds of 1.9 to 1.
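The two worked examples can be checked with a short script. The per-letter values below are the ones quoted above for ACME and ABBS; only those six letters are available here, so this sketch cannot score other items. The `classify` helper is my own naming, not anything from the paper.

```python
import math

# Per-letter log likelihood ratios quoted in the post for Dan's data.
# Only the letters needed for ACME and ABBS are given there.
LLR = {"A": 0.24397, "C": -0.56538, "M": 0.23143, "E": 1.07683,
       "B": -0.31128, "S": -0.25893}

def classify(item, llr=LLR):
    """Sum the letter log likelihood ratios; a positive sum means 'word'."""
    score = sum(llr[ch] for ch in item)
    label = "word" if score > 0 else "nonword"
    odds = math.exp(abs(score))  # odds in favor of the chosen label
    return score, label, odds

for item in ("ACME", "ABBS"):
    score, label, odds = classify(item)
    print(f"{item}: sum={score:.5f} guess={label} odds={odds:.1f} to 1")
```

The sums agree with the figures above up to rounding in the last digit, since the quoted per-letter values are themselves rounded.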
If we apply this method to all of Dan's data, the resulting guesses would be correct 75% of the time for words, and 76% of the time for non-words, which is roughly how the baboons performed.
Of course, this is not a very plausible model of Dan the baboon's learning process, since at the start he doesn't know anything about the properties of words vs. non-words, including their relative unigram frequencies. But in fact, his actual task was an easier one, not a harder one. He's not being asked to decide between arbitrary words and arbitrary non-words; he only needs to distinguish between the specific list of words that he's learned up to a given point in the experiment, and everything that's not on this (initially small) list.
Thus in the first phase, all he has to do is to distinguish between ACME and a bunch of things that aren't ACME (SHOC, FETT, PLUD, NURT, KNIG, HULP, NAMB, WOTS, JARF, ONTT, etc.). Then he has to distinguish e.g. ACME and HALL from a similar list of non-words; and so on.
As a result, in the earlier stages of the experiment, estimated unigram frequencies are a much better cue than they are later on — and Dan would be able to do much better than 75% if unigram frequencies were the only cue he was using.
That was the point we arrived at in a few minutes of class discussion yesterday afternoon. Last night, I got email from Fernando Pereira, pointing me to some notes by Yoav Goldberg ("Do Baboons really care about letter-pairs?") describing some more elaborate experiments on imitating the incremental performance of one of the other baboons using a simple on-line learning algorithm. Yoav starts his note this way:
… the claim is that baboons learn patterns based on the frequency of a letter in a certain position within a word, and maybe about the frequency of different pairs of adjacent letters (bigrams).
Is that really the case? maybe. But in my view it is very hard to tell: while frequency probably had something to do with it, I am not sure the positional information is used. Why do I think so? because you can learn to predict with the same performance of the monkeys, without looking at position information.
The algorithm that he used is known as the "linear perceptron". It was invented by Frank Rosenblatt in 1957, and it's a simple on-line method for finding the weights of a linear classifier. It's trivial to implement, and most undergraduate computer science majors (I think) know about it. Yoav provides his Python code, if you want to replicate his experiments or try some other ones.
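To give a sense of how little machinery is involved, here is a minimal perceptron sketch. It is not Yoav's code: the bias feature, the number of passes, the bag-of-letters featurizer and the six training items (taken from Dan's lists) are all my own assumptions for illustration.

```python
from collections import defaultdict

def train_perceptron(examples, featurize, epochs=10):
    """Online perceptron: adjust weights only when a prediction is wrong.

    examples: list of (item, label) pairs, label +1 for word, -1 for nonword.
    featurize: function mapping an item to a list of feature strings.
    """
    weights = defaultdict(float)
    mistakes = 0
    for _ in range(epochs):
        for item, label in examples:
            feats = featurize(item) + ["<bias>"]
            score = sum(weights[f] for f in feats)
            guess = 1 if score > 0 else -1
            if guess != label:
                mistakes += 1
                for f in feats:
                    weights[f] += label  # move weights toward the correct label
    return weights, mistakes

def predict(weights, featurize, item):
    feats = featurize(item) + ["<bias>"]
    return 1 if sum(weights.get(f, 0.0) for f in feats) > 0 else -1

def letters(item):
    """Bag-of-letters features: which letters occur, ignoring order."""
    return sorted(set(item))

data = [("ACME", 1), ("HALL", 1), ("SHOC", -1),
        ("FETT", -1), ("PLUD", -1), ("NURT", -1)]
weights, mistakes = train_perceptron(data, letters)
print(mistakes, [predict(weights, letters, w) for w, _ in data])
```

The running count of mistakes is what makes this an on-line learner: it can be compared, trial by trial, with an animal's error record on the same sequence of stimuli.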
Yoav started with a more complex feature set:
If we assume that the linear-model and the perceptron learning method are simple enough to be a lower bound on what monkeys can do (meaning that the monkeys’ representation and learning process are more elaborate than those used by the perceptron algorithm — not an unreasonable assumption in my view considering what we know about brain structure), we can see what kind of information is needed to perform the task at a certain level by looking at what the perceptron algorithm is capable of under different word representations. Maybe it’s sufficient to look at the first letter of each word and ignore the rest? Or maybe we need something much more elaborate?
Just the first letter was not quite good enough:
Maybe we could get good prediction by looking only at the first letter of each word? Under this representation, the word “TALK” is represented as a single property “T” while the word “ARCK” is represented as “A”. This is clearly a dumb representation. Still, the algorithm can get an accuracy of 60% using this representation, and of almost 62% by looking only at the third letter. The baboons probably looked at more than one letter.
However, the first two letters were too much information:
What if we look at the pair of the first two letters? Under this representation, the word “TALK” will be represented as a single property “TA” and the word “ARCK” as “AR”. The algorithm learns to predict with 80% accuracy. Better than the apes.
The middle two letters were even better (and thus even worse as a model of the baboons):
How about the middle letter-pair? Under this representation, the word “TALK” will be represented as “AL” and the word “ARCK” as “RC”. Now the algorithm learns to predict with 85% accuracy — much better than the apes.
Full bigram information is way too good:
What if we look at all the consecutive letter pairs? Here “TALK” is represented as the 3 properties “TA, AL, LK”. The algorithm learns to predict with a striking 97% of accuracy.
But the decontextualized unigram distributions are just about right:
Another representation would be the set of letters in a word, without the position information. Under this representation, both “TALK” and “KTAL” would be represented as the four properties “T, A, K, L”. We look at the letters that compose each word, but not at the relative order between them. Accuracy is 76%, very similar to what the monkeys did.
(As Yoav observed, this is a kind of lower bound on performance, since a linear perceptron applied to this problem is not an especially fast-learning algorithm. Other choices would do better for a given feature set.)
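The representations compared in these excerpts are easy to state as code. The following sketch is my own reading of the descriptions quoted above, with hypothetical function names, and is not taken from Yoav's implementation.

```python
def first_letter(w):
    """'TALK' -> ['T']"""
    return [w[0]]

def third_letter(w):
    """'TALK' -> ['L']"""
    return [w[2]]

def first_pair(w):
    """'TALK' -> ['TA']"""
    return [w[:2]]

def middle_pair(w):
    """'TALK' -> ['AL']"""
    return [w[1:3]]

def all_bigrams(w):
    """'TALK' -> ['TA', 'AL', 'LK']"""
    return [w[i:i + 2] for i in range(len(w) - 1)]

def letter_set(w):
    """'TALK' and 'KTAL' -> ['A', 'K', 'L', 'T']: letters present, order ignored."""
    return sorted(set(w))

for featurize in (first_letter, third_letter, first_pair,
                  middle_pair, all_bigrams, letter_set):
    print(featurize.__name__, featurize("TALK"), featurize("ARCK"))
```

Any of these functions can be handed to a linear learner as its view of the stimulus; the only thing that changes between the reported accuracies is which of them is used.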
I have shown that a simple algorithm can learn to do just as good as the baboon by looking only at the different letters in a word, without regarding their order. Can we conclusively say that baboons don’t consider letter position when “reading”? Not really. Perhaps the baboons are not as smart as our algorithm. Maybe they do look at letter-pairs and letter positions. But what we can conclusively say is that in the specific data presented to the monkeys, the letter frequencies are sufficiently informative to perform on their level, using a simple learning method. This means that while the baboons certainly learned to “read”, it is not clear what kinds of information they used to do so. Every dataset contains patterns to be discovered. Some patterns are informative, others are less so. Some make sense, some less so. These patterns can be picked up by either monkeys or machines, and it is very difficult, by just looking at the performance, to tell which kinds of patterns are being learned.
The main claim of the paper is interesting and valid: one does not need to know a language in order to distinguish real words in that language from some other letter sequences. It can be done by statistical algorithms, and it can be done by monkeys. But the secondary claim, that this is achieved by the monkeys looking at letter-positions and letter-combinations, is far less convincing.
I have a different impression about the relative importance of the paper's claims. I don't think there would have been nearly as much buzz if the only result had been the suggestion that baboons can learn to recognize letters, and the demonstration that they can distinguish letter textures with one unigram distribution from letter textures with a rather different unigram distribution. And in terms of the conclusions advanced by the authors, their assertion that this tells us something interesting about the evolutionary substrate for reading would be much less plausible:
Our findings have two important theoretical implications. First, they suggest that statistical learning is a powerful universal (i.e., cross-species) mechanism that might well be the basis for learning higher-order (linguistic) categories that facilitate the evolution of natural language. Second, our results suggest that orthographic processing may, at least partly, be constrained by general principles of visual object processing shared by monkeys and humans. One such principle most likely concerns the use of feature combinations to identify visual objects, which would be analogous to the use of letter combinations in recent accounts of orthographic processing. Given the evidence that baboons process individual features or their combinations in order to discriminate visual objects, we suggest that similar mechanisms were used to distinguish words from nonwords in the current study. Our study may therefore help explain the success of the human cultural choice of visually representing words using combinations of aligned, spatially compact, ordered sequences of symbols. The primate brain might therefore be better prepared than previously thought to process printed words, hence facilitating the initial steps toward mastering one of the most complex of human skills: reading.