A few million monkeys (yawn)
Language Log readers may be wondering why there has been no coverage of the achievement of Jesse Anderson, who has managed to get millions of monkeys, as computationally simulated on Amazon servers, to reproduce 99.9 percent of the works of Shakespeare (his own account is here on his blog, and various journalistic sheep have obediently reproduced his account in the newspapers). I'll tell you why.
The reason is that this seems to be a bit of self-promoting nonsense, comparable to Paul JJ Payack's silly claims to have an algorithm that has found a million words in English (no links; you can look him up). What Anderson has done is to generate 9-byte sequences randomly and check them against the works of Shakespeare. If he gets a match, he counts that bit of Shakespeare as done and marks it off. When all the text of all the plays and poems has been marked off, he has succeeded in his quest.
To see how dumb this project is, consider doing it with 1-letter sequences. You basically get a match every single time. So the task would be over in a few microseconds at modern server speeds. Now imagine doing it on bigrams (2-letter sequences): it would take longer, but would still be trivial and guaranteed to succeed fairly swiftly. For any k > 0, the task would be quite straightforward, but as you choose larger k it takes longer and longer. So he's wasting CPU cycles doing it for k = 9.
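For anyone who wants to watch the effect of k on a toy scale, here is a minimal sketch of the mark-off procedure as described in this post. It is not Anderson's actual code: the function name `draws_to_cover`, the sample sentence, and the choice to strip everything except the letters a to z are all assumptions made for illustration.

```python
import random
import string

ALPHABET = string.ascii_lowercase  # assumption: lowercase a-z only


def draws_to_cover(text, k, rng):
    """Count random k-gram draws until every distinct k-gram in text is matched.

    Hypothetical helper: it illustrates the mark-off idea, nothing more.
    """
    # Keep only the letters a-z; spaces and punctuation are dropped.
    letters = "".join(c for c in text.lower() if c in ALPHABET)
    # Every distinct k-gram that still has to be "marked off".
    remaining = {letters[i:i + k] for i in range(len(letters) - k + 1)}
    draws = 0
    while remaining:
        guess = "".join(rng.choice(ALPHABET) for _ in range(k))
        draws += 1
        remaining.discard(guess)  # no effect if the guess matches nothing
    return draws


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs agree
    sample = "to be or not to be that is the question"
    for k in (1, 2, 3):
        print(k, draws_to_cover(sample, k, rng))
```

The loop always terminates, because every k-gram over the alphabet has a nonzero chance of being drawn; what changes with k is only how many draws it takes, which is the whole point.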
The number of 9-letter sequences over the alphabetic characters a to z is 5,429,503,678,976 (and as that figure of 5.5 trillion is being mentioned in the press stories, it looks like he's ignoring spaces, punctuation, case, fonts, paragraph breaks, etc., but what the heck, let's pretend Shakespeare's work is a bunch of strings over {a b c d e f g h i j k l m n o p q r s t u v w x y z}). There are a few scientific curlicues in the way Anderson does things, but basically he just takes random 9-grams and does a fixed-string search over the Shakespearean corpus to see if he has 9 more letters he can mark off as done.
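The figure above is just 26 raised to the 9th power, and it is easy to check. The snippet below is a small sketch, with `count_kgrams` a hypothetical helper name, that assumes the 26-letter alphabet with no spaces, punctuation, or case distinctions.

```python
ALPHABET_SIZE = 26  # assumption: letters a-z only


def count_kgrams(k, alphabet_size=ALPHABET_SIZE):
    """Number of distinct length-k strings over the alphabet."""
    return alphabet_size ** k


for k in (1, 2, 9):
    print(f"k={k}: {count_kgrams(k):,}")
# k=1: 26
# k=2: 676
# k=9: 5,429,503,678,976
```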
Language Log has spent some time trying to care about this, honestly, and it's not working. If we were Stochastic Combinatorics Log, we might discuss the expected number of random 9-gram selections needed to match all the 9-grams in Shakespeare's works, and so on; but we aren't, so we won't. And even if we were, this would be a suitable topic for a homework problem in an introductory course, not for articles in major media outlets around the world.
[I ran a special algorithm to decide whether comments would be open on this post, and it came up No.]