Over the past few months, there have been several developments in the legal battle between Paul Ceglia and Mark Zuckerberg over Ceglia's claim to part ownership of Facebook. As Ben Zimmer explains ("Decoding Your E-Mail Personality", NYT Sunday Review, 7/23/2011):
Mr. Ceglia says that a work-for-hire contract he arranged with Mr. Zuckerberg, then an 18-year-old Harvard freshman, entitles him to half of the Facebook fortune. He has backed up his claim with e-mails purported to be from Mr. Zuckerberg, but Facebook’s lawyers argue that the e-mail exchanges are fabrications. […]
The law firm representing Mr. Zuckerberg called upon Gerald McMenamin, emeritus professor of linguistics at California State University, Fresno, to study the alleged Zuckerberg e-mails. (Normally, other data like message headers and server logs could be used to pin down the e-mails’ provenance, but Mr. Ceglia claims to have saved the messages in Microsoft Word files.) Mr. McMenamin determined, in a report filed with the court last month, that “it is probable that Mr. Zuckerberg is not the author of the questioned writings.” Using “forensic stylistics,” he reached his conclusion through a cross-textual comparison of 11 different “style markers,” including variant forms of punctuation, spelling and grammar.
For some background, you can read Joe Mullin, "What’s In Facebook’s Pile Of Evidence Against Paul Ceglia?", Paid Content 6/2/2011; Chris Gayomali, "How to Write like Mark Zuckerberg", Time Techland 6/3/2011; and Joe Mullin, "Paul Ceglia Insists He's No Fraud — He Really Owns Part of Facebook", 6/17/2011.
Prof. McMenamin's report (filed 6/2/2011) is here. The 11 contested emails that McMenamin analyzed are in Paul Ceglia's First Amended Complaint (dated 4/11/2011); as far as I know, the 35 "Known-Zuckerberg" emails that he used (from Harvard's server back-ups) are not (yet?) available on line.
Ben's Sunday Review article focuses on a debate within the field:
But Mr. McMenamin’s report has raised eyebrows in the forensic linguistics community. Earlier this month, the outgoing president of the International Association of Forensic Linguists, Ronald R. Butters, publicly questioned whether Mr. McMenamin could actually establish that Mr. Zuckerberg likely did not write the e-mails based on such slender evidence. For example, the would-be Zuckerberg e-mails had one instance of uncapitalized “internet,” while a sample of e-mails known to be sent by Mr. Zuckerberg had two capitalized instances of “Internet.” “Are we really doing ‘scientific’ and ‘linguistic’ analysis at all when we simply note instances or absences of this or that superficial textual feature?” Mr. Butters asked.
Ben quotes Carole Chaski in favor of a more optimistic outlook, and also cites some work by Fung and others using the Enron email database in an authorship-determination study (see e.g. Farkhund Iqbal, Rachid Hadjidj, Benjamin Fung, and Mourad Debbabi, "A novel approach of mining write-prints for authorship attribution in e-mail forensics", Digital Investigation 5(1) 2008; Farkhund Iqbal, Liaquat Khan, Benjamin Fung, and Mourad Debbabi, "E-mail Authorship Verification for Forensic Investigation", SAC 2010).
As I understand Ron Butters' point, it's not mainly about whether email authorship analysis is possible in principle, but rather about some particular characteristics of the Ceglia-Zuckerberg case: Was there enough evidence? Was the evidence of a suitable kind? And was the evidence evaluated in an appropriate and convincing way?
If you read Gerald McMenamin's report, you'll see that he looked at 11 "style-markers":
1. Punctuation: APOSTROPHES
2. Punctuation: SUSPENSION POINTS
3. Spelling: BACKEND
4. Spelling: INTERNET
5. Spelling: CANNOT
6. Syntax: RUN-ON SENTENCES
7. Syntax: SINGLE-WORD SENTENCE OPENERS
8. Syntax: SENTENCE-INITIAL "SORRY" [similarity]
9. Syntax: DISTANT OR AMBIGUOUS PRONOUN-REFERENT
10. Syntax: NO COMMA AFTER IF-CLAUSE
11. Discourse: MESSAGE-FINAL "THANKS!" [similarity]
He found that
There are two similarities (Nos. 8 and 11) and nine differences between the QUESTIONED writings and KNOWN-Zuckerberg writings, the differences demonstrating a compelling aggregate-array of distinct markers in the respective sets of writings.
He offers an evaluation that invokes the language of probability:
It is important to note that no single marker of these nine differing features is idiosyncratic to these writers. However, these nine contrasting markers constitute a unique set of markers. It would be improbable to find a single writer who simultaneously demonstrates both the QUESTIONED set and the KNOWN set.
The background assumption here is that different writers will exhibit such "style-markers" to different, but individually consistent, extents. If the pattern of style-markers in the contested documents is very different from the pattern observed in documents known to be written by Mark Zuckerberg, then Mark Zuckerberg probably didn't write the contested documents.
Ron Butters' point (though I shouldn't put words in his mouth) seems to be that the event-counts in this case are too small for credible estimates of probability. There are several interesting practical and theoretical aspects of this problem, but I've run out of time this morning, so I'll put them off to another day.
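To make the small-count worry concrete, here is a minimal sketch that applies a two-sided Fisher exact test to the one marker whose counts are quoted above: one uncapitalized "internet" in the questioned e-mails versus two capitalized instances in the known ones. This is only an illustration. It is not the method used in McMenamin's report, nor a calculation that Butters presents, and the function name `fisher_exact_two_sided` is a hypothetical helper written for this example. The second table, with ten and twenty instances, is invented purely for contrast.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows are the two sets of writings; columns are the two spelling variants.
    (Hypothetical helper for illustration only.)
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell,
        # given the fixed row and column totals.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    p_obs = prob(a)
    # Sum the probabilities of all tables no more likely than the observed one.
    # The small tolerance guards against floating-point ties.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))


# Counts quoted in the article:
#                 "internet"  "Internet"
# QUESTIONED          1           0
# KNOWN               0           2
p_observed = fisher_exact_two_sided(1, 0, 0, 2)

# Hypothetical contrast (made-up counts): the same all-or-nothing pattern,
# but with 10 questioned and 20 known instances.
p_hypothetical = fisher_exact_two_sided(10, 0, 0, 20)

print(f"observed counts:     p = {p_observed:.3f}")
print(f"hypothetical counts: p = {p_hypothetical:.2e}")
```

With only three instances in total, even a perfect split between the two sets gives p = 1/3, which is nowhere near conventional thresholds for significance. The same pattern would be compelling only with far more instances, as in the invented second table. A single-marker test like this says nothing about how the nine differing markers should be combined, which is a separate question.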
I'll close by recommending another article, which is about forensic speaker recognition, but lays out the technical and legal issues (and some of the conflicts between them) in an especially clear way: Phil Rose, "Technical forensic speaker recognition: Evaluation, types and testing of evidence", Computer Speech & Language 20(2-3) 2006.
Update — more here.