The Sunday (UK) Times recently revealed that J.K. Rowling wrote the detective novel The Cuckoo's Calling under the pen name Robert Galbraith. The newspaper explained that, as part of their investigation, they sought the assistance of two scholars who have developed software to help with authorship attribution: Peter Millican of Oxford University and Patrick Juola of Duquesne University. Given the public interest in the Rowling revelation, I asked Patrick to write a guest post describing the authorial analysis that he conducted. (For more on the story, see my post on the Wall Street Journal's Speakeasy blog.)
[Guest post by Patrick Juola]
With the recent announcement by London's Sunday Times that J.K. Rowling had written the recently published novel The Cuckoo's Calling, several people have asked about the process that led up to this. I'm grateful to Ben Zimmer for giving me a chance to write a bit about it.
I don't know how much background most linguists have in "forensic stylometry." The basic theory is pretty simple: language is a set of choices, and speakers and writers tend to fall into habitual, or at least common, choices. Some choices come from dialect (the reason an Englishman drives a lorry but an American a truck), some from social pressure (if I need to impress someone with my vocabulary, I can utilize a polysyllabic lexicon instead of just using big words), and some just seem to come. An example of the latter category is in the use of many function words. If you ask yourself where the salad fork is relative to the plate, you quickly realize that it's usually to the left of the plate. Or is it? It's just as likely to be "on" the left of the plate, "at" the left of the plate, or perhaps "to" the left SIDE of the plate. Same fork, same position, and at least four different choices for how to describe it, none of which correspond to any sociolinguistic or cognitive variable with which I'm familiar.
But what we do know is that much of this apparently free variation is actually rather stable, at least at the individual level. So by studying examples of documents a person has written, we can build a model of the kind of choices that person makes. The idea that we can use quantifiable models of this kind of linguistic choice is hardly new. It dates back at least to the logician Augustus de Morgan (yes, de Morgan's rule), who proposed in the mid-19th century that average word length could be used to settle questions of disputed authorship. Mosteller and Wallace studied the writing styles of The Federalist Papers in the mid-1960s and showed, for example, that Alexander Hamilton never used the word "whilst" but that James Madison never used the word "while." More interestingly, they both used the word "by," but Madison consistently used it about twice as often.
I was approached by a reporter, Cal Flyn, from the Sunday Times, to assess this kind of variation in the writings of "Robert Galbraith," a first-time novelist and author of The Cuckoo's Calling. (I learned later from press coverage that the paper had received an anonymous tip via Twitter that Galbraith was the pen name of J.K. Rowling. And in retrospect there were a lot of other clues as well. For example, Galbraith was surprisingly good at describing women's clothing, possibly suggesting a female author.) Would I be willing to look into this? I said yes, of course, but with a couple of conditions. First, I needed clean (machine-readable) copies of Cuckoo, and clean samples of something comparable undisputedly by Rowling herself. Second, I needed comparable samples from other writers (distractor authors, to use the common term) to assess the degree of variation.
For the past ten years or so, I've been working on a software project to assess stylistic similarity automatically, and at the same time to test different stylistic features to see how well they distinguish authors. De Morgan's idea of average word length, for example, works — sort of. If you actually get a group of documents together and compare how they differ in average word length, you quickly learn two things. First, most people are average in word length, just as most people are average in height. Very few people actually write using loads of very long words, and few write with very short words, either. Second, you learn that average word length isn't necessarily stable for a given author. A letter to your cousin will use a different vocabulary than a professional article to be published in Nature. So it works, but not necessarily well. A better approach is not to use average word length, but to look at the overall distribution of word lengths. Still better is to use other measures, such as the frequency of specific words or word stems (e.g., how often did Madison use "by"?), and better yet is to use a combination of features and analyses, essentially analyzing the same data with different methods and seeing what the most consistent findings are. That's the approach I took.
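To make the contrast concrete, here is a minimal Python sketch (illustrative only — the actual analysis used the JGAAP software, written in Java) showing why the full distribution of word lengths carries more information than the single average: two texts can share an average while differing sharply in the shape of the distribution.

```python
from collections import Counter
import re

def word_length_profile(text):
    """Return (average word length, word-length distribution as fractions)."""
    words = re.findall(r"[a-z]+", text.lower())
    total = len(words)
    avg = sum(len(w) for w in words) / total
    # dist[k] = fraction of words in the text that have exactly k letters
    dist = {k: n / total for k, n in Counter(len(w) for w in words).items()}
    return avg, dist

avg, dist = word_length_profile("The quick brown fox jumps over the lazy dog")
# avg is a single number; dist records, e.g., what share of words are 3 letters long
```

The average collapses everything to one number; the distribution yields a whole feature vector per document, which is what the distance comparisons below operate on.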
MATERIALS, METHODS, & MATHS
I was given e-text copies of Cuckoo to compare against Rowling's own The Casual Vacancy, Ruth Rendell's The St. Zita Society, P.D. James' The Private Patient and Val McDermid's The Wire in the Blood. Fortunately, these were relatively clean copies and required little attention: deleting front and back matter, plus fixing a bit of non-standard punctuation, mostly quotation marks. The JGAAP program handles issues like normalizing whitespace and stripping punctuation in a straightforward manner. I broke Cuckoo into chunks of 1000 lines (the last chunk was incomplete) and compared each chunk individually against the baseline model built from each of the four candidate novels.
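The chunking step is simple enough to sketch in a few lines of Python (the line count below is invented for illustration; the post does not give the novel's actual length):

```python
def split_into_chunks(lines, size=1000):
    """Split a list of text lines into consecutive chunks of `size` lines;
    the final chunk may come up short, as the last section of Cuckoo did."""
    return [lines[i:i + size] for i in range(0, len(lines), size)]

chunks = split_into_chunks(["a line of text"] * 10500)
# ten full chunks of 1000 lines plus one final, incomplete chunk of 500
```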
The heart of this analysis, of course, is in the details of the word "compared." Compared what, specifically, and how, specifically? I actually ran four separate analyses, each focusing on a different linguistic variable. While anything can in theory be an informative variable, my work focuses on variables that are easy to compute and that generate a lot of data from a given passage of language. One variable that I used, for example, is the distribution of word lengths. Each novel has a lot of words, each word has a length, and so one can get a robust vector of values of the form "X% of the words in this document have exactly Y letters." Using a distance formula (for the mathematically minded, I used the normalized cosine distance instead of the more traditional Euclidean distance you remember from high school), I was able to get a measurement of similarity, with 0.0 being identity and progressively higher numbers indicating greater dissimilarity.
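A minimal sketch of that distance measure, assuming the common definition of cosine distance as one minus the cosine similarity (the post does not spell out the exact normalization JGAAP uses):

```python
import math

def cosine_distance(p, q):
    """Cosine distance between two feature vectors: 0.0 means the vectors
    point in the same direction; larger values mean greater dissimilarity."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return 1.0 - dot / (norm_p * norm_q)

# Identical word-length profiles give distance 0.0; each chunk of Cuckoo is
# then attributed to whichever candidate's profile lies closest.
```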
Of the 11 sections of Cuckoo, six were closest (in distribution of word lengths) to Rowling, five to James. No one else got a mention.
Another feature I used was the set of the 100 most common words: what percentage of the document was "the," what percentage was "of," and so on. Again, a rich data set that is easy to extract by computer. Using an otherwise similar analysis (again with cosine distance), four of the sections were Rowling-like, four were McDermid-like, and the other three split between James and Rendell.
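Extracting that feature vector is straightforward; a hedged sketch (the vocabulary here is a toy stand-in for the actual 100-most-common-words list):

```python
from collections import Counter
import re

def common_word_profile(text, vocab):
    """Fraction of the text's tokens accounted for by each word in `vocab`
    (in the real analysis, the 100 most common words)."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    return [counts[w] / len(words) for w in vocab]

profile = common_word_profile("the cat sat on the mat", ["the", "of", "on"])
# "the" is 2 of 6 tokens, "of" 0 of 6, "on" 1 of 6
```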
I ran two further tests based on authorial vocabulary. The first was on the distribution of character 4-grams, groups of four adjacent characters. These could be whole words, parts of words (like the four letters "nsid" inside the word "inside"), or even parts of two words (like the four letters "n th" in the phrase "in the"). This particular unit of analysis has been shown to be very accurate at determining authorship, and there's a very good article by Efstathios Stamatatos, just out in the Journal of Law and Policy, describing why. I also ran a test on word bigrams, pairs of adjacent words, again a feature with a good track record.
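Both feature types are easy to extract; a minimal Python sketch (again illustrative, not JGAAP's actual implementation):

```python
from collections import Counter

def char_ngrams(text, n=4):
    """Counts of overlapping character n-grams (spaces included), so a
    phrase like 'in the' yields 4-grams such as 'n th' spanning two words."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def word_bigrams(text):
    """Counts of adjacent word pairs."""
    words = text.lower().split()
    return Counter(zip(words, words[1:]))

grams = char_ngrams("hidden inside")
pairs = word_bigrams("in the hall in the dark")
# char_ngrams picks up 'nsid' from "inside"; word_bigrams counts ("in", "the") twice
```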
The character 4-grams showed a preference for McDermid, with 8 sections close to her. Three were Rowling-like, and no one else was mentioned. The word pairs, on the other hand, were clearly Rowling-like (9 sections, against 2 by McDermid, no one else mentioned).
So, the final score? The results look "mixed," but they point strongly to Rowling. There were certainly a couple of likely losers: nothing at all pointed to Rendell as a possible author, and only one test, and an unreliable one at that, suggested James. McDermid could be a reasonable candidate author, but the word-length distribution seemed almost entirely uncharacteristic of her. The only person consistently suggested by every analysis was Rowling, who showed up as the winner or the runner-up in each instance.
Does this prove that Rowling wrote Cuckoo? Of course not. All it really "proves" — suggests, rather — is that of the four authors studied, the most likely candidate is Rowling. But the book could easily also be by someone who, by accident or design, wrote like Rowling. (Certainly one could do worse than imitate the style of one of the most successful writers of this generation.) It was fair to say that there was a lot of evidence pointing to Rowling as the author and nothing specifically suggesting that she wasn't. It was certainly enough for the Sunday Times to use as part of their package when they approached Rowling's agent and asked, directly, "Did J.K. Rowling write The Cuckoo's Calling?" Less than a day later, Rowling confirmed through a spokesman that she had indeed written the novel, and the story launched.
It's always nice when a mystery story closes with a confession by the person responsible, and this one is no different. That satisfying conclusion came from several factors, most notably the cooperation of Rowling herself. Nothing in the analysis constituted "proof" of Rowling's authorship; it was at best "suggestive," or perhaps "indicative." Had we been studying a long-dead author, this is the kind of finding that could and would be argued about in the journals for decades.
At the same time, we can do some crude statistics on the likelihood that a randomly chosen author would have a style this similar to Rowling's, and by extension on how strong this suggestion really is. Out of four candidates, Rowling was consistently #1 or #2 (i.e., in the more similar half) on each feature chosen; a randomly chosen author has only a 50/50 chance of landing in the more similar half on any one feature. With four tests (handwaving away independence assumptions), there's therefore only one chance in 16 that a random author would "pass" all of them as being similar to Rowling. If we needed a stronger suggestion, we could easily gather more data, more distractor authors, or simply run more experiments on different variables.
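The arithmetic behind that back-of-the-envelope figure, under the post's own stated assumption that the four tests are independent:

```python
# If a random author has a 50/50 chance of landing in the more-similar half
# on any single feature, then (assuming independence across the four tests)
# the chance of doing so on all four is:
p_single = 0.5
p_all_four = p_single ** 4  # 0.0625, i.e. one chance in 16
```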
Another key aspect of this study was the high quality of data. Many historical and literary studies start from scanned or retyped texts, often of quite poor quality. Starting from e-books minimized the noise and made the study much more practical (as well as faster). Finally, the good detective work done by the Sunday Times team in identifying a reasonable candidate author as well as some good distractors who were similar in many important respects (gender, dialect, genre, time period) made the comparisons easy and straightforward and therefore more reliable.
The proof of any technology, of course, is in the results. Forensic stylometry has a long history of successes mixed with the occasional failure. Ultimately, empirical testing like this project will be key to determining which specific techniques can minimize the chance of failure and hence maximize the usefulness of this kind of analysis.