Linguistic Deception Detection: Part 1
In "Reputable linguistic "lie detection"?", 12/5/2011, I promised to scrutinize some of the research on linguistic deception detection, focusing especially on the work cited in Anne Eisenberg's 12/3/2011 NYT article "Software that listens for lies". This post is a first installment, looking at the work of David Larcker and Anastasia Zakolyukina ("Detecting Deceptive Discussions in Corporate Conference Calls", Rock Center for Corporate Governance, Working Paper No. 83, July 2010).
[Update: as of 6/5/2019, the working papers version no longer exists, but a version under the same title was published in the Journal of Accounting Research in 2012.]
Let me start by saying that I think this is a terrific piece of work. The authors identified a large, meaningful, and important body of material to explore; they processed it intelligently in order to lay it out for analysis; they took thoughtful account of the previous literature and added their own relevant hypotheses; their experiments are carefully designed and very clearly reported, without the slightest hint of hype or exaggeration. Furthermore, the details of their experiment and its results provide many useful lessons for anyone interested in linguistic deception detection.
And they did succeed in detecting deception!
They looked at the transcripts of a large number of quarterly-earnings conference calls: 29,663 transcripts, comprising everything available from FactSet Research Systems Inc. for U.S. companies between 2003 and 2007. They automatically analysed these transcripts to eliminate prepared statements and operator instructions, resulting in "a database where we can track the question posed by a speaker and the answer from a corporate representative that follows after a particular question."
Their unit of analysis, which they call an "instance", was "all answers of a corporate representative (e.g. CEO, CFO, COO, etc.) at a particular conference call". They threw away "instances" that were too short (less than 150 words), and created separate files for CEOs and CFOs "[s]ince CEOs and CFOs are the most likely to know about financial statement manipulation, and these executives are the most common participants on the conference call". The result was a CEO sample with 16,577 "instances" (median length 1,526 words) and a CFO sample with 14,462 "instances" (median length 708 words).
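As a concrete (if oversimplified) illustration of that preprocessing step, here's a minimal Python sketch of how one might build such "instances" from utterance-level transcript rows. The file name and column names are invented for the example; L & Z describe their pipeline in prose rather than code.

```python
# Sketch only: build per-call, per-executive "instances" from utterance-level
# transcript rows. The input file and column names (call_id, speaker_role,
# text) are hypothetical.
import pandas as pd

utterances = pd.read_csv("qa_utterances.csv")

# Concatenate all answers given by the same executive on the same call.
instances = (utterances
             .groupby(["call_id", "speaker_role"])["text"]
             .apply(" ".join)
             .reset_index())

# Drop instances shorter than 150 words, as in the paper.
instances["word_count"] = instances["text"].str.split().str.len()
instances = instances[instances["word_count"] >= 150]

# Separate CEO and CFO samples.
ceo_sample = instances[instances["speaker_role"] == "CEO"]
cfo_sample = instances[instances["speaker_role"] == "CFO"]
```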
In each case, they checked to see if the company in question had later revised its earnings estimates, using data from Glass, Lewis & Co. through 2009:
In order to identify serious restatements, as opposed to "trivial" restatements, we require [the earnings statement associated with] deceptive conference calls to exhibit a material weakness, a late filing, an auditor change, or a disclosure using Form 8-K. A material weakness implies that there is a deficiency in the internal controls over financial reporting that can make it easier for executives to manipulate. An auditor change can be a signal about deficiency in monitoring. A late filing implies that it takes time for a firm to correct the accounting, which suggests that the manipulation is complex and possibly intentional. Finally, Plumlee and Yohn [2008] show that a Form 8-K filing is related to more serious restatements.
Note that the whole Quarterly Earnings Report (and thus the whole conference call Q&A) is flagged as "deceptive" or "not deceptive", rather than any particular phrases, sentences or answers within the conversation. Similarly, the contents of the whole "instance" (one particular executive's contributions to the call) are going to be used in determining whether the Q&A was "deceptive". This is strikingly different from many "lie detection" or "stress detection" approaches, where the unit of classification is a particular word, phrase, or answer, typically no more than 10 or 20 words long.
They then made a list of linguistic features that might signal deception, among those visible in a transcript, based on the literature and on their own ideas.
Similar to traditional classification research, we estimate a simple binomial logistic model. The outcome variable is coded as one if a conference call is labeled as deceptive and zero otherwise. To estimate the prediction error of a classifier, it is necessary to estimate the out-of-sample prediction error, because the in-sample prediction error is a very optimistic estimate of the prediction error on a new data set. One approach is to randomly split the sample into two parts, and use one part to estimate the model and the other part to obtain the out-of-sample prediction error using the estimated model. However, deceptive outcomes are rare events, and a single split may not provide enough variation to fit the model and to consistently estimate the out-of-sample prediction error.
So they used cross-validation, dividing the data into 5 random parts, and then using each of the 5 parts as the test set, with the other four used in training the model parameters. They then repeated this procedure 20 times with different random splits, and took the mean of all 100 estimated prediction rates.
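Here's a rough sketch of that evaluation scheme in Python with scikit-learn, using synthetic placeholder features and labels rather than L & Z's actual word-category measures, and assuming stratified folds (my simplification; the paper just describes random splits):

```python
# Sketch of the evaluation scheme: a binomial logistic classifier scored by
# 5-fold cross-validation repeated 20 times with different random splits,
# averaging the 100 per-fold estimates. Features and labels are synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1533, 20))    # stand-ins for 20 word-category rates

# Labels with a ~10% base rate and a weak dependence on the first feature.
logits = -2.15 + 0.3 * X[:, 0]
y = (rng.random(1533) < 1 / (1 + np.exp(-logits))).astype(int)

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=20, random_state=0)
aucs = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                       cv=cv, scoring="roc_auc")

print(f"mean AUC over {len(aucs)} folds: {aucs.mean():.3f}")
```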
The first deception-detection result that they report is for a subset of their data involving companies in the Financial Services sector. Here's their Table 5:
Here's what they say about this table:
Classification models based on the word categories for CEOs (CFOs) perform significantly better than a random classifier that does not use any information (Table 5). In particular, the AUC for CEOs is 58.93%, which is significantly better than 50% (the AUC for a random classifier). The AUC for CFOs is less impressive, 53.56%; however, it is still significantly better than 50%. Overall accuracy (the percentage of calls classified correctly) for both CEOs and CFOs is higher than 60%. At the same time, a model that classifies all conference calls as non-deceptive would achieve an accuracy of around 90%, but would not perform significantly better than any other random classifier.
The terminology in this table requires some explanation. The background is the traditional contingency table:
The precision is the proportion of instances [linguistically] classified as "deceptive" where the classification was in fact correct: true positives / (true positives + false positives). Here the precision was 13.56% for CEOs and 11.57% for CFOs, so that between 86% and 88% of the instances flagged as deceptive were false positives. This is still an indication that the classification is adding some information, since the underlying rate of deceptive filings was only around 10%. In the case of the CEOs' answers, the classifier's precision was about 13.56/10.44 ≈ 30% better than the precision of random guessing.
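A quick check of that arithmetic, using the CEO figures quoted above:

```python
# Check of the precision arithmetic, using the CEO figures from Table 5
# as quoted above.
precision = 0.1356        # fraction of flagged calls that were truly deceptive
base_rate = 0.1044        # fraction of all calls that were deceptive

false_positive_share = 1 - precision        # among calls flagged "deceptive"
lift_over_guessing = precision / base_rate  # improvement over guessing at random

print(f"false positives among flagged calls: {false_positive_share:.1%}")   # ~86.4%
print(f"precision relative to random guessing: {lift_over_guessing:.2f}x")  # ~1.30x
```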
The cited "AUC" measure refers to the "Area Under the [ROC] Curve". You can read the Wikipedia article on the "Receiver Operating Characteristic" in order to learn about this concept in detail, but the basic notion is that random guessing should have an AUC of 50%, so their classifier's AUC of 58.93% (for CEOs) and 53.56% (for CFOs) is a good bit better than chance. A usefully intuitive way to think about the AUC is that it represents the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance. In a gambling game like financial services, you might be quite grateful for a 58.93% chance of perceiving a lying CEO as more deceptive than a truthful one.
Still, this is not what most people think of when they hear a term like "lie detector". To illustrate this, I've added something to their Figure 1:
The reddish blob in the upper left-hand corner represents the performance region associated with [a meaningful interpretation of] the "95% accurate" or "95% success" results repeatedly claimed for Nemesysco's Layered Voice Analysis, which is said to give such results for short stretches of speech, even single words and phrases. That's where a system would be in ROC space if it had (say) a true positive rate above 0.95 and a false positive rate below 0.05.
But maybe there's another interpretation of a system being "95% accurate" or having "a 95% success rate". Table 5 from Larcker & Zakolyukina tells us that their system achieved 64.27% "accuracy" on financial firm CEOs — that is, it classified 64.27% of 1533 Q&A instances correctly, in the mean of 100 cross-validation runs. But as L & Z explain in the text, since only 10.44% of the instances were in fact deceptive, a model that simply called all conference calls "non-deceptive" would achieve almost 90% "accuracy".
So in terms of the "accuracy" measure, their model is doing worse than the trivial strategy of always guessing "non-deceptive". Given the cost function that they've imposed, this is the right thing for it to do; but this observation underlines a problem with the "accuracy" measure, made more pointedly in "Determining whether a lottery ticket will win, 99.999992849% of the time", 8/29/2004. It's quite possible and even easy for a completely worthless deception-detection system to achieve "95% accuracy" — all you need to do is to set it to guess "deceptive" very rarely, on a test set where only a couple of percent of the exemplars are in fact deceptive. (Or you could just make up a number…)
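A toy calculation using the CEO sample size and deception rate quoted above makes the point:

```python
# Why raw "accuracy" misleads on imbalanced data: a "classifier" that never
# says "deceptive" scores close to 90%, while detecting nothing. Figures are
# the CEO sample size and deception rate quoted above.
n_calls = 1533
n_deceptive = round(0.1044 * n_calls)   # about 160 deceptive calls

correct = n_calls - n_deceptive         # always guessing "non-deceptive"
print(f"accuracy of always guessing non-deceptive: {correct / n_calls:.1%}")  # ~89.6%
print("deceptive calls detected: 0")
```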
Returning to Larcker & Zakolyukina, I'll mention just one more of the things that they did. Here's their Table 6, in which they report the results for CEOs and CFOs of all U.S. companies in their data set (not just those in the finance sector), and add a dimension for the seriousness of the detected deception:
In order to provide insight into whether linguistic features of deception vary with the size of a restatement, we separate instances into several categories […] These criteria are the no-threshold (NT) criterion that ignores the magnitude of the bias, [and] the absolute value of bias criteria for the bias that is greater than 25th (AS25) and 50th (AS50) percentiles of the non-zero absolute value of bias distribution …
It's worth quoting this from their discussion:
Surprisingly, the signs of association for some word categories differ for CEOs and CFOs. These are third person plural pronouns and certainty words. Whereas the prior research finds both positive and negative relationship between third person plural pronouns and the likelihood of deception, the sign for certainty words is expected to be negative. However, deceiving CEOs use fewer certainty words, which is consistent with the theory, but deceiving CFOs use more certainty words, which contradicts the theoretical predictions. […]
Deceiving CEOs use fewer self-references, more impersonal pronouns, more extreme positive emotions words, fewer extreme negative emotions words, and fewer hesitations. The use of fewer hesitations by CEOs in deceptive calls can be the consequence of CEOs having more prepared answers or answering planted questions. Similar to the results for extreme negative emotions words, CEOs use fewer swear words, which is inconsistent with our theoretical prediction. However, for CFOs the use of extreme positive emotions and extreme negative emotions words is not significantly associated with deception. In contrast to CEOs, CFOs use significantly more tentative words.
You should not allow the generic plurals here to mask the fact that these sub-effects are tiny contributions to a rather small overall information gain contributed by L & Z's classifier. And some of this differentiation between CEOs and CFOs may be over-fitting, despite the assiduous cross-validation. But still, I think it's worth noting that even in this rather homogeneous situation, two groups of people with somewhat different roles are apparently performing deception in different linguistic ways.
Taken as a whole, in my opinion, this work is a model for how to approach the problem of linguistic deception detection, and how to report what you find. It could not be more different from what we have seen from several of the players in the "voice stress" industry.
To come: discussions of the other research mentioned in Anne Eisenberg's 12/3/2011 article "Software that listens for lies".
Dougal said,
December 6, 2011 @ 1:48 pm
I've only barely glanced at the paper (the very wordy style and double-spacing don't help — I guess it's because they're business school/econ types), but because I was curious, the feature functions for their logistic regression models seem to look like this:
CEOs: 0.678 – 0.004*I – 0.004*we + 0.025*genknlref – 0.005*posemone + 0.013*posemoextr – 0.156*swear – 0.017*negemoextr – 0.290*value
CFOs: 0.0001*wc – 0.019*they + 0.047*genknlref + 0.015*assent – 0.013*posemone – 0.058*anx + 0.011*certain – 0.909*shvalu
(where a positive coefficient means the model thinks it's more likely that the speaker is being deceptive: Pr(deceptive | X) = 1 / (1 + exp(- f(X))), with f(X) as above). I think the sets of features are different because they're doing some kind of sparsity regularization, or only including statistically significant coefficients, or something.
"genknlref" means "referring to general knowledge": phrases like "you know".
I'm not sure what values these variables take on — are they portions of the overall statement that fall into those categories?
I'd also be interested to see the effect of using different kinds of classifiers (maybe SVMs), and maybe using different, less-manually-constructed features. Are there syntactic features that might be relevant? And if we had access to audio or a closer transcription (including pauses, false starts, etc), what could help us there?
Very cool, in any case.
Charles Gaulke said,
December 6, 2011 @ 1:55 pm
It's also worth noting that this is hardly likely to be a very heterogeneous sample in terms of cultural background, education levels, motives for lying, etc. The likelihood that this same system could ever catch me lying to a girlfriend about where I was last night, for example, is pretty much nil.
Mr Punch said,
December 6, 2011 @ 3:00 pm
I wonder if the sample contains enough CEOs who are former CFOs (or at least finance types) to support analysis that would help distinguish between role and training. Someone who's come up through marketing may speak differently from a CPA.
Money Talks: The Power of Voice — Comment on William J. Mayew and Mohan Venkatachalam’s reply | Fonetik at Stockholm University Blog said,
April 7, 2012 @ 1:15 pm
[…] Scientifically sound emotion analysis can be carried out, though not with LVA. I insist that you should correct your work by carrying out the analyses using scientific technology. Since, as you wrote, you are interested in assessing "The power of voice", not "The power of LVA", there is no other way of assessing the Power of voice than starting by using relevant scientific technology. I am sure phoneticians will be happy to assist you in the process and that you have all to gain from discontinuing your use of Nemesysco's LVA-technology. Meanwhile, you may wish to take a look at Mark Liberman's Language Blog where the aspects that you are interested in studying are also discussed (Linguistic Deception Detection: Part 1, http://languagelog.ldc.upenn.edu/nll/?p=3608). […]