The 49th Annual Meeting of the Association for Computational Linguistics took place last week in Portland OR, and one of the papers presented there has gotten some (well deserved) press coverage: Moshe Koppel, Navot Akiva, Idan Dershowitz and Nachum Dershowitz, "Unsupervised Decomposition of a Document into Authorial Components", ACL2011.
Well, at least the AP covered it: Matti Friedman, "An Israeli algorithm sheds light on the Bible", AP 6/29/2011 (as usual published under different headlines in various publications, e.g. "Algorithm developed by Israeli scholars sheds light on the Bible’s authorship" (WaPo), "Software deciphers authorship of the Bible" (CathNews), etc.).
Unfortunately, the AP story focuses on an aspect of the work that the authors mention only briefly in one paragraph of the "Conclusions and Future Work" section of their 9-page paper. Here's how the AP describes it:
For millions of Jews and Christians, it's a tenet of their faith that God is the author of the core text of the Hebrew Bible—the Torah, also known as the Pentateuch or the Five Books of Moses. But since the advent of modern biblical scholarship, academic researchers have believed the text was written by a number of different authors whose work could be identified by seemingly different ideological agendas and linguistic styles and the different names they used for God.
Today, scholars generally split the text into two main strands. One is believed to have been written by a figure or group known as the "priestly" author, because of apparent connections to the temple priests in Jerusalem. The rest is "non-priestly." Scholars have meticulously gone over the text to ascertain which parts belong to which strand.
When the new software was run on the Pentateuch, it found the same division, separating the "priestly" and "non-priestly." It matched up with the traditional academic division at a rate of 90 percent—effectively recreating years of work by multiple scholars in minutes, said Moshe Koppel of Bar Ilan University near Tel Aviv, the computer science professor who headed the research team.
"We have thus been able to largely recapitulate several centuries of painstaking manual labor with our automated method," the Israeli team announced in a paper presented last week in Portland, Oregon, at the annual conference of the Association for Computational Linguistics.
Here's how Koppel et al. put it in their "Future Work" section:
Our success on munged biblical books suggests that our method can be fruitfully applied to the Pentateuch, since the broad consensus in the field is that the Pentateuch can be divided into two main threads, known as Priestly (P) and non-Priestly (Driver 1909). (Both categories are often divided further, but these subdivisions are more controversial.) We find that our split corresponds to the expert consensus regarding P and non-P for over 90% of the verses in the Pentateuch for which such consensus exists. We have thus been able to largely recapitulate several centuries of painstaking manual labor with our automated method. We offer those instances in which we disagree with the consensus for the consideration of scholars in the field.
What are those "munged biblical books" that they succeeded on? Well, what the paper in fact documented was a (convincing and appropriate) methodological experiment. First the authors created an artificial problem by randomly mixing fragments of Jeremiah with fragments of Ezekiel (the "munged biblical books"), and then they found a way to successfully de-munge the mixture.
Thus the AP story is accurate, but it leaves out essentially all of the work that Koppel et al. actually reported on at the ACL meeting. Since I think that work was interesting and worthy of note, I'll summarize and explain it here on Language Log. But consider yourself warned: you're going to see why Matti Friedman focused the AP story on that stuff mentioned, in passing, in the next-to-last paragraph of the paper. The body of the work would be a good deal harder to sell to newspaper readers: "Computer program can tell the difference between Ezekiel and Jeremiah"?
And my explanation will be a little longer and more technical than most of our posts. In fact, I'm going to break it up into two or three parts, for my own sake as well as yours; but if your eyes are already glazing over you might want to move along.
Let's start with the standard current approach to statistical authorship analysis, which is basically the same as the standard statistical approach to everything. If we're trying to decide whether document D was written by author X or author Y, we need to start by finding a set of document "features" to use. These should be things which we can measure in an objective or at least reproducible way, and which we think are likely to help us distinguish one author's style from another's. Then we replace the set of documents with a matrix of numbers, where the rows are the documents and the columns are the features, and the number in row i and column j is the value of the jth feature for the ith document.
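To make the documents-by-features matrix concrete, here is a minimal sketch in Python. The two feature functions (`mean_word_length` and `comma_rate`) and the two toy documents are my own illustrative assumptions; they are not features used by Koppel et al. or by Mosteller and Wallace.

```python
def build_feature_matrix(documents, feature_functions):
    """Rows are documents, columns are features.

    Entry [i][j] is the value of feature j measured on document i.
    """
    return [[f(doc) for f in feature_functions] for doc in documents]


# Two hypothetical, easily reproducible features (illustration only).
def mean_word_length(text):
    words = text.split()
    return sum(len(w) for w in words) / len(words) if words else 0.0


def comma_rate(text):
    # Commas per whitespace-separated token.
    words = text.split()
    return text.count(",") / len(words) if words else 0.0


# Toy documents, invented for this example.
documents = ["The owl, at last, spoke.", "Pensions must rise"]
matrix = build_feature_matrix(documents, [mean_word_length, comma_rate])

for row in matrix:
    print(row)
```

Once the documents have been reduced to rows of numbers like this, any standard classifier can be applied to them, which is why the choice of features matters so much.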
One easy and obvious feature is plain old word frequency — how often a given word is used in a given document. However, most words are not very good features for authorship classification, for exactly the same reasons that they ARE good features for document retrieval. A random word (like owl or pensions or levitate) doesn't occur in most documents, and whether it occurs or not tells us at least as much about the topic as about the author.
But function words (like to or must or into) are mostly common and mostly not very topic-specific. And such words, though not helpful for finding relevant documents, do sometimes turn out to be useful features for authorship attribution. Frederick Mosteller and David Wallace were the first to show this, in their seminal 1964 work Inference and Disputed Authorship: The Federalist. (See also Frederick Mosteller and David Wallace, "Inference in an Authorship Problem", Journal of the American Statistical Association 1963.)
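As a rough sketch of what measuring function-word frequency involves, the Python below counts a few function words and normalizes by document length. The three-word list, the crude regular-expression tokenizer, the per-1,000-word scaling, and the sample sentence are all assumptions made for illustration; this is not Mosteller and Wallace's actual procedure.

```python
import re
from collections import Counter

# Illustrative subset only; a real study would use a much longer list.
FUNCTION_WORDS = ["to", "must", "into"]


def rates_per_thousand(text, words=FUNCTION_WORDS):
    """Return occurrences of each listed word per 1,000 tokens of text."""
    # Crude tokenizer: lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    if total == 0:
        return {w: 0.0 for w in words}
    return {w: 1000.0 * counts[w] / total for w in words}


# Invented sample sentence (10 tokens, one occurrence of each listed word).
sample = "We must look into this before we decide to go."
rates = rates_per_thousand(sample)
print(rates)
```

Normalizing by length is what makes a short essay comparable with a long one; a raw count would mostly tell us how long the document is.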
The Federalist, as the Wikipedia article explains, is
… a series of 85 articles or essays promoting the ratification of the United States Constitution. Seventy-seven of the essays were published serially in The Independent Journal and The New York Packet between October 1787 and August 1788. A compilation of these and eight others, called The Federalist; or, The New Constitution, was published in two volumes in 1788 by J. and A. McLean.
The scholarly consensus has been that Hamilton wrote 51 of these 85 essays, that Madison wrote 14, that Hamilton and Madison wrote 3 of them together, and that John Jay wrote 5 of them. This leaves 12 essays that are commonly known as the "disputed papers". In order to give themselves a better foundation for inference, Mosteller and Wallace added 36 documents confidently attributed to Madison and 5 known to have been written by Hamilton. This redefined the problem as using 51+5=56 papers by Hamilton and 14+36=50 papers by Madison to decide which of these men wrote the 12 "disputed papers".
As features, Mosteller and Wallace decided to use the frequencies of 70 function words. Here's the table from their 1963 paper:
Mosteller and Wallace (and the many researchers who have followed in their footsteps) consider a variety of clever statistical modeling strategies, full of phrases like "a measure of the non-Poissonness of a negative binomial" and "approximate confidence limits for the likelihood ratio". But the reasons that their approach is a sensible one can be seen by plotting the Madisonian, Hamiltonian, and disputed-paper frequencies of a few of their words.
Here are the frequencies per 1,000 words of to, upon, would in the 106 documents whose authorship is undisputed. We can see that Hamilton is much more likely to use upon, and slightly more likely to use would and to:
If we add the twelve disputed points, we see that they seem to fall more in Madison's part of this space than in Hamilton's:
And in fact, if we rotate the coordinates a bit, we can see that there's a plane in this three-dimensional space such that all of Hamilton's points fall on one side of it, while all of Madison's points and all of the disputed points fall on the other side of it. Here's a plot of that plane (to be precise, the plane -0.5368 × to - 24.6634 × upon - 2.9532 × would = -66.6159) from Glenn Fung, "The Disputed Federalist Papers: SVM Feature Selection via Concave Minimization", Proc. 2003 Conf. on Diversity in Computing:
But in fact, nearly all of the work here is being done by Hamilton's inordinate fondness for upon. The function-word-frequency approach doesn't work for every authorship problem, and Koppel et al. found that it didn't work very well for distinguishing among authors in the Hebrew Bible.
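To see how the plane quoted above acts as a classifier, and why the upon term carries so much of the weight, here is a small Python sketch. The coefficients and offset are the ones given in the text. The two sets of rates are made-up values for illustration, not measurements from any essay, and labeling the negative side as Hamilton's is my inference from the remark that Hamilton uses upon much more often.

```python
# Coefficients and offset of the plane quoted in the text.
WEIGHTS = {"to": -0.5368, "upon": -24.6634, "would": -2.9532}
OFFSET = -66.6159


def contributions(rates):
    """Per-word terms of the weighted sum (rates are per 1,000 words)."""
    return {w: WEIGHTS[w] * rates[w] for w in WEIGHTS}


def plane_score(rates):
    """Signed position relative to the plane: weighted sum minus the offset."""
    return sum(contributions(rates).values()) - OFFSET


def side(rates):
    # Assumption: heavy use of "upon" pushes the score negative, so the
    # negative side is labeled Hamilton's. The source does not state the sign.
    return "Hamilton side" if plane_score(rates) < 0 else "Madison side"


# Hypothetical rates per 1,000 words, invented for illustration.
heavy_upon = {"to": 40.0, "upon": 3.0, "would": 8.0}
light_upon = {"to": 35.0, "upon": 0.2, "would": 5.0}

for name, rates in [("heavy_upon", heavy_upon), ("light_upon", light_upon)]:
    print(name, side(rates), contributions(rates))
```

With these invented numbers, the gap between the two upon terms is far larger than the gaps between the to and would terms, which is the sense in which a single word can do nearly all of the work.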
For the details of how the function-word method failed, and how the King James Version, WordNet and Strong's Numbers succeeded, you're going to have to wait for the next installment. Or you could read their paper.