High-stakes forensic linguistics

Over the past few months, there have been several developments in the legal battle between Paul Ceglia and Mark Zuckerberg over Ceglia's claim to part ownership of Facebook. As Ben Zimmer explains ("Decoding Your E-Mail Personality", NYT Sunday Review, 7/23/2011):

Mr. Ceglia says that a work-for-hire contract he arranged with Mr. Zuckerberg, then an 18-year-old Harvard freshman, entitles him to half of the Facebook fortune. He has backed up his claim with e-mails purported to be from Mr. Zuckerberg, but Facebook’s lawyers argue that the e-mail exchanges are fabrications. […]

The law firm representing Mr. Zuckerberg called upon Gerald McMenamin, emeritus professor of linguistics at California State University, Fresno, to study the alleged Zuckerberg e-mails. (Normally, other data like message headers and server logs could be used to pin down the e-mails’ provenance, but Mr. Ceglia claims to have saved the messages in Microsoft Word files.) Mr. McMenamin determined, in a report filed with the court last month, that “it is probable that Mr. Zuckerberg is not the author of the questioned writings.” Using “forensic stylistics,” he reached his conclusion through a cross-textual comparison of 11 different “style markers,” including variant forms of punctuation, spelling and grammar.

For some background, you can read Joe Mullin, "What’s In Facebook’s Pile Of Evidence Against Paul Ceglia?", Paid Content 6/2/2011;  Chris Gayomali, "How to Write like Mark Zuckerberg", Time Techland 6/3/2011; and Joe Mullin, "Paul Ceglia Insists He's No Fraud — He Really Owns Part of Facebook", 6/17/2011.

Prof. McMenamin's report (filed 6/2/2011) is here. The 11 contested emails that McMenamin analyzed are in Paul Ceglia's First Amended Complaint (dated 4/11/2011); as far as I know, the 35 "Known-Zuckerberg" emails that he used (from Harvard's server back-ups) are not (yet?) available on line.

Ben's Sunday Review article focuses on a debate within the field:

But Mr. McMenamin’s report has raised eyebrows in the forensic linguistics community. Earlier this month, the outgoing president of the International Association of Forensic Linguists, Ronald R. Butters, publicly questioned whether Mr. McMenamin could actually establish that Mr. Zuckerberg likely did not write the e-mails based on such slender evidence. For example, the would-be Zuckerberg e-mails had one instance of uncapitalized “internet,” while a sample of e-mails known to be sent by Mr. Zuckerberg had two capitalized instances of “Internet.” “Are we really doing ‘scientific’ and ‘linguistic’ analysis at all when we simply note instances or absences of this or that superficial textual feature?” Mr. Butters asked.

Ben quotes Carole Chaski in favor of a more optimistic outlook, and also cites some work by Fung and others using the Enron email database in an authorship-determination study (see e.g. Farkhund Iqbal, Rachid Hadjidj, Benjamin Fung, and Mourad Debbabi, "A novel approach of mining write-prints for authorship attribution in e-mail forensics", Digital Investigation 5(1) 2008; Farkhund Iqbal, Liaquat Khan, Benjamin Fung, & Mourad Debbabi, "E-mail Authorship Verification for Forensic Investigation", SAC2010.)

As I understand Ron Butters' point, it's not mainly about whether email authorship analysis is possible in principle, but rather about some particular characteristics of the Ceglia-Zuckerberg case: Was there enough evidence? Was the evidence of a suitable kind? And was the evidence evaluated in an appropriate and convincing way?

If you read Gerald McMenamin's report, you'll see that he looked at 11 "style-markers":

1. Punctuation: APOSTROPHES
2. Punctuation: SUSPENSION POINTS
3. Spelling: BACKEND
4. Spelling: INTERNET
5. Spelling: CANNOT
6. Syntax: RUN-ON SENTENCES
7. Syntax: SINGLE-WORD SENTENCE OPENERS
8. Syntax: SENTENCE-INITIAL "SORRY" [similarity]
9. Syntax: DISTANT OR AMBIGUOUS PRONOUN-REFERENT
10. Syntax: NO COMMA AFTER IF-CLAUSE
11. Discourse: MESSAGE-FINAL "THANKS!" [similarity]

He found that

There are two similarities (Nos. 8 and 11) and nine differences between the QUESTIONED writings and KNOWN-Zuckerberg writings, the differences demonstrating a compelling aggregate-array of distinct markers in the respective sets of writings.

He offers an evaluation that invokes the language of probability:

It is important to note that no single marker of these nine differing features is idiosyncratic to these writers. However, these nine contrasting markers constitute a unique set of markers. It would be improbable to find a single writer who simultaneously demonstrates both the QUESTIONED set and the KNOWN set.

The background assumption here is that different writers will exhibit such "style-markers" to different, but individually consistent, extents. If the pattern of style-markers in the contested documents is very different from the pattern observed in documents known to be written by Mark Zuckerberg, then Mark Zuckerberg probably didn't write the contested documents.
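
To make that assumption concrete, here is a minimal sketch of the comparison's structure. The marker names follow McMenamin's list, but apart from the "internet"/"Internet" figures (which come from Ben's article) the counts are invented for illustration:

```python
# Toy sketch of a style-marker comparison; not McMenamin's actual procedure.
# 11 QUESTIONED emails vs. 35 KNOWN-Zuckerberg emails.
# Only the internet/Internet counts come from the case reporting;
# the others are invented placeholders.

N_Q, N_K = 11, 35

# marker -> (occurrences in QUESTIONED, occurrences in KNOWN)
markers = {
    "lowercase 'internet'":   (1, 0),  # KNOWN has 2 capitalized "Internet"
    "suspension points":      (3, 0),  # invented
    "'backend' spelling":     (2, 0),  # invented
    "'can not' as two words": (2, 0),  # invented
}

differences = 0
for name, (q, k) in markers.items():
    differs = (q > 0) != (k > 0)
    differences += differs
    print(f"{name:24s} Q: {q / N_Q:.2f}/msg  K: {k / N_K:.2f}/msg  "
          f"{'DIFFERS' if differs else 'similar'}")

print(f"{differences} of {len(markers)} markers differ")
```

The argument then runs: nine such differences, taken together, would be improbable if a single writer had produced both sets.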

Ron Butters' point (though I shouldn't put words in his mouth) seems to be that the event-counts in this case are too small for credible estimates of probability. There are several interesting practical and theoretical aspects of this problem, but I've run out of time this morning, so I'll put them off to another day.
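
One quick illustration of the smallness problem, though, before closing: take the "internet" marker from Ben's article, one lowercase "internet" in the questioned emails against two capitalized "Internet"s in the known ones. A standard significance test (not something McMenamin's report attempts) finds nothing there:

```python
from scipy.stats import fisher_exact

# internet/Internet counts reported in Zimmer's article:
#   QUESTIONED: 1 lowercase, 0 capitalized
#   KNOWN:      0 lowercase, 2 capitalized
table = [[1, 0],
         [0, 2]]

oddsratio, p = fisher_exact(table)
print(f"two-sided p = {p:.3f}")  # p = 0.333: fully consistent with chance
```

With three observations in total, even a perfectly clean split between the two sets cannot reach any conventional threshold of significance.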

I'll close by recommending another article, which is about forensic speaker recognition, but lays out the technical and legal issues (and some of the conflicts between them) in an especially clear way:  Phil Rose, "Technical forensic speaker recognition: Evaluation, types and testing of evidence", Computer Speech & Language 20(2-3) 2006.

Update — more here.



31 Comments

  1. JFM said,

    July 25, 2011 @ 8:35 am

    This may be a bit OT, but I find myself using both "can not" and "cannot" in slightly different ways. Spontaneously, and without thinking too long about it, I guess I apply a difference in emphasis, as in "I CAN NOT do that" as opposed to "I cannot DO that". I'm admittedly a second-language user of English. The point being, I use both.

    (I have no idea how the above-mentioned email analysis treated "cannot" as a parameter, so this may be totally irrelevant, in which case sorry for this derailment.)

    [(myl) Larry Horn has some relevant comments about this very point on ADS-L.]

  2. Emily Viehland said,

    July 25, 2011 @ 9:22 am

    Are the known writing samples contemporaneous with the contested ones, or more recent? I know _my_ email writing style changed over the course of college, and has certainly continued to evolve in the 14 years since graduation!

    [(myl) As I understand your question and the facts, the answer is "yes".]

  3. Nick Lamb said,

    July 25, 2011 @ 10:22 am

    My understanding is that Zuckerberg was actually a fairly capable CS student. So he would probably have been conscious that "internet" and "Internet" are different, just as we'd expect a British student of Politics to be careful in choosing to write "Conservative" vs "conservative" while perhaps an English speaker from elsewhere wouldn't worry about it much. These are "superficial features" of a text only if you aren't knowledgeable about the field.

  4. GeorgeW said,

    July 25, 2011 @ 10:24 am

    I am inclined not to capitalize 'Internet' but my spell checkers insist that I do. So, on this factor, forensic analysis would discover not my style, but my spell checker.

  5. Neil Coffey said,

    July 25, 2011 @ 11:03 am

    I guess the devil is in the detail, but this sounds like pseudoscience and it scares me that it is being used as evidence in a case involving (presumably) billions of dollars. I suspect that Ron Butters is probably correct that the numbers involved sound too tiny to get statistically meaningful results. And if I understand rightly, the author has "peeked" at the documents to draw up a list of features, and subsequently used those features for the analysis. That sounds methodologically dubious. Far more convincing would be to have a list of features that, for *any* set of random documents, identify or deny same authorship with a given error rate, and then apply that well-defined test to the documents in question.

    Or am I misunderstanding something…?

  6. Kathryn McCary said,

    July 25, 2011 @ 11:18 am

    Entirely a side-note but, as a lawyer, I found myself musing–while reading Professor McMenamin's report–on the likelihood that its language actually reflects his own habitual linguistic choices.

  7. Ron Butters said,

    July 25, 2011 @ 11:32 am

    Insofar as Mark is able to make out my position from Ben Zimmer's TIMES piece, he has pretty much got it right, except that he misunderstood the scope of what I believe was Zimmer's intent about terming Chaski "more optimistic" than I am. I don't think, from what Zimmer said, that Chaski would be any more optimistic about the scientific validity of McM's report than I am. She may also be "more optimistic" about Authorship Attribution (AuAt) in general than I am, but I'm not even sure that that is true, and in any case my paper at the IAFL explicitly acknowledged the potential validity of a truly scientifically constructed AuAt methodology (which Chaski and others have been working diligently towards). I simply questioned the validity of McMenamin's report, which I focused on in my IAFL conference talk earlier this month not because I think that AuAt is cargo-cult science but only to suggest that it might be possible for forensic linguists to set some sort of standards for reliability and methodology, standards that, I would argue, McMenamin's report does not meet. Even if McMenamin (and others) would not agree with me about the scientific merits of his report, presumably everyone would agree that some sort of standards are a good idea, and the question of whether linguists can meaningfully employ McMenamin-style methodology is a matter for scientific debate and empirical determination. In this I am doing little more than repeating what Larry Solan and Peter Tiersma called for in their book, SPEAKING OF CRIME (2005), which to my mind is the best general survey of AuAt research yet written.

  8. Rob P. said,

    July 25, 2011 @ 11:37 am

    Interesting to me as a lawyer is that according to his CV, the majority of the times McMenamin has testified (19/28 of the listed cases), he has been the only linguist involved in the case. In my world of patents, we often have technical and economics experts involved for the infringement/validity and damages issues respectively. It is pretty much inconceivable one side would allow the other to present un-rebutted expert testimony.

  9. Ginger Yellow said,

    July 25, 2011 @ 11:41 am

    The obvious question is whether the "KNOWN" emails are fully consistent on these points. And a second question would be whether the points were chosen because of the differences between KNOWN and QUESTIONED, or independently.

  10. diana said,

    July 25, 2011 @ 11:57 am

    Ron Butters is correct. The evidence is minimal and it is also invalid. The emails were saved in MS Word, which compromises their reliability. Why were the emails not saved on his email account? The segmental features of the emails are not substantial. We need more data!

  11. Roman said,

    July 25, 2011 @ 1:02 pm

    Don't we all have multiple styles? One for a friend with ttyl, or a business letter with sincerely yours….

  12. Jonathan said,

    July 25, 2011 @ 1:35 pm

    It seems to me that there are fairly simple scientific statements that can be placed on each of these features, and that the features can be combined to give a probabilistic statement and its variance. What would remain is the problem cited by Neil Coffey above: how were the 11 characteristics chosen?

    In short, for any characteristic, we have p = probability that they are in sample 1 and q = probability they are in sample 2. We can obviously test (probabilistically) for p=q, and we can quantify our uncertainties about the estimation of p and q, which would of course include notions of constancies of style.

    I agree with those who say that the samples here are probably too small to draw a firm conclusion, but what is more distressing is that the expert witness himself does not make the calculations. As an expert witness myself (albeit in another field) I have long heeded the maxim that scientific opinions which lack quantification of the uncertainty around them are meaningless. And, to the extent that Daubert-style rules require rigor, the failure to quantify the uncertainty is itself reason to ignore the testimony.
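
    [(myl) To put rough numbers on this: with event-counts as small as the ones in this case, the uncertainty around any estimated p or q is enormous. A minimal sketch, using an exact binomial confidence interval (the counts here are invented for illustration):

    ```python
    from scipy.stats import beta

    def clopper_pearson(k, n, alpha=0.05):
        """Exact two-sided confidence interval for a binomial proportion."""
        lo = beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0.0
        hi = beta.ppf(1 - alpha / 2, k + 1, n - k) if k < n else 1.0
        return lo, hi

    # A marker observed in 2 of 2 relevant messages, vs. 20 of 20:
    print(clopper_pearson(2, 2))    # (0.158, 1.0): p could be almost anything
    print(clopper_pearson(20, 20))  # (0.832, 1.0): now genuinely constraining
    ```

    The first interval says that seeing a feature in 2 out of 2 messages is consistent with a writer who uses it anywhere from 16% to 100% of the time, which is no basis for calling a contrast "improbable".]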

  13. Ron Butters said,

    July 25, 2011 @ 1:39 pm

    Sorry, make that "… whether linguists CAN MEANINGFULLY employ … " (5 lines from bottom)

  14. Svafa said,

    July 25, 2011 @ 2:07 pm

    I, like many here, agree that the sample size is too small to yield meaningful results, and would also question some of the choices for comparison. The spelling of words like "internet" and "cannot" can vary depending on intent or meaning. Similar to the first two markers, I would think it more useful to compare "cannot" with "can't" rather than "can not". Diction might be useful, but I wouldn't think spelling all that useful unless it had a genuinely alternate spelling without a separate meaning (for instance, "colour" vs "color").

    Like Roman, I'm also wondering about the possibility of multiple styles. Perhaps both samples were of similar types of communication, but my own style of writing often depends on the audience. If it's an email to a friend it is often terse and direct, while if it's an email to a work colleague it's often longer with an introductory paragraph, fewer contractions, and a great deal more explanation and description.

  15. Larry Solan said,

    July 25, 2011 @ 2:09 pm

    I am both a lawyer and linguist.

    Let us assume that McMenamin is right: The same person did not write both the known and questioned emails. The problem at this point is that the analysis does not appear from the report to be based on methodology developed through research in which ground truth concerning authorship is known in advance and the likelihood of the method yielding the correct result is determined. Rather, it appears to be based on a common sense intuition, which may indeed be right. But without the research backing up the method, we cannot be sure. It may well be that even with small samples, 9 stylistic differences given this amount of text is indicative of different authors. This discussion illustrates forcefully why research is essential to make authorship attribution more credible. Such research is ongoing by people in computational linguistics (including Carole Chaski referenced in other comments) and in computer science. There is obviously a lot of room for more to be done.

  16. David Walker said,

    July 25, 2011 @ 5:26 pm

    The entire question seems to hinge on how much writing was available to compare. In the case of the e-mails, probably not as much as you would like for this kind of analysis.

    The author of the political novel "Primary Colors", as I think someone has covered in this column, was outed as Joe Klein based on Mr. Klein's previous (large set of) known writings, and the 366 pages of "Primary Colors".

  17. Jeremy Wheeler said,

    July 25, 2011 @ 5:32 pm

    Maybe I've missed it, but wouldn't the clever thing to do be to compare the alleged Zuckerberg emails to the Ceglia authored ones?

  18. mgwa said,

    July 25, 2011 @ 7:51 pm

    I make my living as a statistician and agree that it's problematic to base this assessment on such limited data. In addition, for this to be useful, you'd have to know the frequency of use of the various alternatives among the general population (i.e., if both sets of emails use something that's fairly rare, it's more likely that they're from the same person than if both use something that 90% of people use), and also how consistent people tend to be (i.e., do people tend to be consistent in their use of apostrophes?)

    It might conceivably be possible to tell if it's likely that Zuckerberg wrote these emails, but it would take a much more sophisticated analysis than was done here.
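
    [(myl) Exactly so. The base-rate point can be put in likelihood-ratio terms; here is a toy calculation with invented population frequencies:

    ```python
    # If a fraction r of the candidate-author population uses a feature,
    # and both sets of texts display it, a deliberately simple model gives:
    #   P(both display it | same author)       ~ r
    #   P(both display it | different authors) ~ r * r
    # so the likelihood ratio favoring common authorship is 1/r.

    for r in (0.90, 0.10, 0.01):
        print(f"population rate {r:.2f}: LR = {1 / r:.1f}")
    # 0.90 -> 1.1 (nearly worthless), 0.10 -> 10.0, 0.01 -> 100.0
    ```

    A shared feature that 90% of writers use tells us almost nothing; a shared 1-in-100 feature is real evidence. Without population frequencies, there is no way to tell which situation we are in.]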

  19. Jethro said,

    July 25, 2011 @ 8:05 pm

    This is expert evidence in a legal claim, where the onus of proof lies with the plaintiff. If the sample and/or methodology is insufficient to conclusively prove that Zuckerberg wasn't the author of the questioned emails, then it is equally unable to prove that he was the author. So even if McMenamin's conclusion is rejected, it's still a good result for Zuckerberg.

  20. the other Mark P said,

    July 25, 2011 @ 11:46 pm

    I'm glad Jethro spotted the important point: Zuckerberg's counsel only has to show that it is likely that he did not write those e-mails. He does not have to prove anything. For him not to cite linguistic evidence would be foolish in the extreme.

    If I were the defence I would have been careful to keep a couple of examples up my sleeve to put to any counter-expert. Thus the counsel for McMenamin would think carefully before employing one.

    The question of sample size depends on the nature of the writer. Some features are distinctive and require only a small sample. If Zuckerberg was careful to use the Internet/internet distinction at the time, then two or three wrong uses would be powerful evidence.

  21. Carole Chaski said,

    July 26, 2011 @ 6:33 am

    I am making two points: the first about my optimism and the second about forensic stylistics.
    Point 1: Optimism.
    Uber-editing at the NYT turned "Chaski is facing these problems head-on" into "Chaski is optimistic" and somehow the empirical basis for this dropped off the radar of Mark Liberman's post. Oh well. What I told Ben Zimmer is that a quantitative (statistical) computational (software) approach to author id IS possible, based on my own empirical results as well as others'. Some of my articles on this are listed at http://www.aliastechnology.com (there's a bad link on one I'll have to fix). The NYT article only referenced one group of researchers instead of others I also mentioned to Ben. My point is that my early work in computational forensic author id (not literary, none of which works for forensic data) is in line with a growing body of research from computer scientists and other computational linguists: we are all getting in the high 80's to mid 90's in terms of cross-validated accuracy rates, when we do pairwise author testing. I have also tested for minimal data requirements, and have found that 2000 words and/or 100 sentences per author affords the most robust results. In two different data sets, my software (ALIAS SynAID) has obtained 95% and 94% cross-validated accuracy. When I use this method in casework, I report the error rate for the methods in general, independent of any litigation, as well as the error rate for the particular data of the case. In 23 of 27 cases, ALIAS SynAID has obtained from 93% to 100% cross-validated accuracy at differentiating different authors in the case, so that classification of the questioned document was possible. (I don't move forward with classification if the accuracy rate isn't high).

    Point 2: Forensic Stylistics. While I was a Visiting Research Fellow at the US DOJ National Institute of Justice, I started empirically testing any language-based methods for determining authorship. What I found was that forensic stylistics is NOT empirically reliable; I first published this in 1997, then again in 2001, and so forth. Again, I am not the only researcher who has found this: Koppel in Israel and some computer science students at Swarthmore have also shown that if you seriously use forensic stylistics "style-markers" in a repeatable and objective way, the method is far worse than 50% accurate, and will most likely MISLEAD you into a wrong identification. Statistically, the intra-writer variation is greater than the inter-writer variation, for the kinds of features forensic stylistics uses, which is why real linguists don't analyze at these levels of linguistic structure anyway.

    Thanks for your interest; sorry my website isn't up to date — it's a lot harder to run a PR machine when research is the main priority.
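
    [(myl) For readers curious what cross-validated pairwise authorship classification looks like mechanically, here is a generic sketch. This is emphatically not Dr. Chaski's ALIAS SynAID; the eight toy messages are invented stand-ins, far below the 2000-word/100-sentence threshold she describes, so the printed accuracy means nothing beyond illustrating the procedure:

    ```python
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import StratifiedKFold, cross_val_score
    from sklearn.pipeline import make_pipeline

    # Invented stand-in messages for two authors, A and B.
    docs = [
        "hey, can you send me the files? thanks!",
        "sorry about the delay. i'll get to it tonight.",
        "thanks! let me know if anything else comes up.",
        "running late again, sorry. will call you.",
        "Please find the requested documents attached herewith.",
        "I shall respond to your inquiry at my earliest convenience.",
        "Kindly confirm receipt of the attached agreement.",
        "I remain available should you require further information.",
    ]
    labels = ["A"] * 4 + ["B"] * 4

    # Character n-grams pick up punctuation and spelling habits, not just words.
    model = make_pipeline(
        TfidfVectorizer(analyzer="char", ngram_range=(2, 4)),
        LogisticRegression(max_iter=1000),
    )
    scores = cross_val_score(model, docs, labels, cv=StratifiedKFold(n_splits=4))
    print(f"cross-validated accuracy: {scores.mean():.2f}")
    ```

    The key methodological point is the cross-validation: the classifier is always scored on messages it was not trained on, which is what yields an error rate that can be reported independently of any particular case.]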

  22. David said,

    July 26, 2011 @ 6:54 am

    I'd be curious to know how close the 11 emails came to the 2000 word/100 sentence standard.

  23. Carole Chaski said,

    July 26, 2011 @ 11:09 am

    Dear David,
    In order for an analyst in the Facebook case to assert a certain level of accuracy for his/her method, not only would you need the right amount of data in the 11 emails, but also the right features, counting mechanisms, and statistical classification procedure. A method (as in programming) is a whole package. In fact, speaking empirically, you can have far more data than 2000 words/100 sentences, and the forensic stylistics method can even get more erroneous as the intra-writer variation increases.
    Best,
    Carole

  24. mgwa said,

    July 26, 2011 @ 9:06 pm

    Carole – Thank you very much for your post. It sounds like your work is very methodologically sound and I will try to read more about it, since I'm a statistician with an interest in linguistics (I was fortunate enough to take a few linguistics courses when I was in grad school).

  25. Carole Chaski said,

    July 27, 2011 @ 6:08 pm

    Dear mgwa,
    please contact me at cchaski at linguisticevidence dot com!
    Best regards,
    Carole

  26. Nathan Myers said,

    July 28, 2011 @ 3:43 pm

    It's horse's mouth testimony like the above that makes LL worth every penny of the subscription price.

  27. Carole Chaski said,

    July 29, 2011 @ 8:54 am

    Dear mgwa,

    ooops, my mistake: cchaski@linguisticevidence.org (org not com).
    ILE is a non-profit research organization, always interested in talking to statisticians about collaborative projects.

    Best regards,
    Carole

  28. MeFi: Does digital writing leave fingerprints? | Pine Bookshelves said,

    August 7, 2011 @ 11:15 am

    […] in-depth discussion of the Facebook case, including a plethora of links, Language Log's post High-stakes forensic linguistics has got you […]

  29. Neg said,

    August 8, 2011 @ 3:27 pm

    So what is Gerald McMenamin's response to this? In The Routledge Handbook of Forensic Linguistics (Routledge Handbooks in Applied Linguistics), edited by Malcolm Coulthard and Alison Johnson, he claims stylistics meets the Daubert standard of admissibility.

    In the Danielle Jones murder case, John Olsson performed a stylistic analysis of text messages, noting capitalization and punctuation, and concluded that Danielle Jones did not author the texts; the suspect did. The sample there consisted of cell-phone texts, so it was even smaller than in the Facebook case.

    Here, at http://www.languagehat.com/archives/000931.php

    Alan Perlman, a linguist at the University of Chicago, and John Olsson write: "you correctly point out that a lot of this started with Jan Svartvik, and you correctly point out that people like Gerald McMenamin and Roger Shuy are very impressive in what they do, as is Kniffka."

    btw, has there been any credible forensic authorship identification on the JonBenét ransom note and the Ramseys?

  30. Marcel B. Matley said,

    September 28, 2011 @ 3:29 pm

    Several years ago I was consulted on a case in Oregon. Defendant was convicted of murdering his wife solely on stylistic evidence, what I call "guilt by grammar." Dr. McMenamin and another stylistic expert persuaded the judge that defendant had typed a photocopied document sent to Oregon State Police describing the woman's death. They concluded that every fact in the letter exonerating defendant was a lie, that every statement offering an investigative lead was a clever ruse, and that everything it said about how she was killed incriminated him. Dr. McMenamin relied on 69 markers. In a murder case at the same time in San Diego County, CA, Dr. McMenamin used a different set of markers to prove defendant wrote an incriminating note. If the two sets of markers had been switched between the two cases, the stylistic evidence would have proven each defendant had not written the note in his own case but had written the note in the other case.

    In his two books Dr. McMenamin says markers are decided on a case-by-case basis. He gives no objective, or even subjective, guidelines for deciding what is specific to one case vis-à-vis another. Surveying thirteen cases reported in transcripts of his testimony, in his presentations, or in his books, I found each case had some significant difference in theory, method and observations when compared to any other of the thirteen cases.

    I gave testimony in the Oregon case at a post-conviction review hearing explaining the faults in the stylistic evidence used to convict. Dr. McMenamin submitted an affidavit criticizing my testimony. He admitted every key point on method, theory or markers that I made was correct, unless it was something he had also relied on. For example, he relied on frequency of markers when frequency supported identification (one marker), but did not when frequency supported non-identity (all the other markers), yet he said I was wrong to apply it in the many instances he had not.

    Thus, I suspect in the Facebook case that his markers were selected specifically because his computer program turned up those and no other similarities, while he did not chart all the other words and punctuation that showed differences. It is only a suspicion, not an assertion. The opposing attorneys in any of his cases should look into this issue in preparation for cross-examination. It is necessary that they demand production of all his computer-generated data. In the Oregon case the defense was never provided with them.

    Regards,

    Marcel B. Matley
    San Francisco

  31. The Case of Mark Zuckerberg | Project ARCHIMAEDES said,

    May 12, 2013 @ 12:35 pm

    […] – Introduction Over the past few months, news has emerged[ref][ref] that yet another person is trying to sue Mark Zuckerberg for a share of Facebook. This time, […]
