Over the past few days, we've discussed the possible relevance of corpus evidence in legal evaluations of ordinary-language meaning. Another (and socio-economically more important) legal application of computational linguistics is featured today in John Markoff's article, "Armies of Expensive Lawyers, Replaced by Cheaper Software", NYT 3/4/2011:
When five television studios became entangled in a Justice Department antitrust lawsuit against CBS, the cost was immense. As part of the obscure task of “discovery” — providing documents relevant to a lawsuit — the studios examined six million documents at a cost of more than $2.2 million, much of it to pay for a platoon of lawyers and paralegals who worked for months at high hourly rates.
But that was in 1978. Now, thanks to advances in artificial intelligence, “e-discovery” software can analyze documents in a fraction of the time for a fraction of the cost. In January, for example, Blackstone Discovery of Palo Alto, Calif., helped analyze 1.5 million documents for less than $100,000.
Markoff's article features a quote from Tom Mitchell:
We’re at the beginning of a 10-year period where we’re going to transition from computers that can’t understand language to a point where computers can understand quite a bit about language.
This is probably true as a statement about systems widely deployed in practice; but in fact the ideas and algorithms behind this transition have been developed and demonstrated in research projects over the past 25 years or so. There are some recent new ideas, and there will no doubt be a regular progression of other new ideas in the future. But there isn't any recent development that deserves to be called a breakthrough. Rather, there are three mutually-reinforcing processes that have been under way for decades, and are now starting to make a practical impact in this as well as other applications of speech and language engineering:
- A gradual accumulation of new techniques and (especially) refinement of older ones, which yield cumulative improvements in performance;
- Constant cost-performance improvements in computers, networks, and storage, which make it possible to apply (new and old) ideas on larger and larger scales, more and more cheaply;
- Increasing digitization of communication and record-keeping, which makes larger stores of data available for training, and also makes deployment of automated systems easier and cheaper.
Markoff uses the term "e-discovery" for (semi-) automated techniques for pulling legally-relevant information out of piles of digital documents, and suggests a basic division:
E-discovery technologies generally fall into two broad categories that can be described as “linguistic” and “sociological.”
The most basic linguistic approach uses specific search words to find and sort relevant documents. More advanced programs filter documents through a large web of word and phrase definitions. A user who types “dog” will also find documents that mention “man’s best friend” and even the notion of a “walk.”
The sociological approach adds an inferential layer of analysis, mimicking the deductive powers of a human Sherlock Holmes. Engineers and linguists at Cataphora, an information-sifting company based in Silicon Valley, have their software mine documents for the activities and interactions of people — who did what when, and who talks to whom. The software seeks to visualize chains of events. It identifies discussions that might have taken place across e-mail, instant messages and telephone calls.
Then the computer pounces, so to speak, capturing “digital anomalies” that white-collar criminals often create in trying to hide their activities.
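The "linguistic" approach described in the quoted passage — expanding a search term like "dog" to related words and phrases — can be sketched in a few lines. This is a minimal illustration, assuming a tiny hand-built synonym map; real e-discovery systems would draw on much larger lexical resources and statistical models of word relatedness:

```python
# A minimal sketch of keyword search with query expansion.
# The EXPANSIONS map is a hypothetical stand-in for the "large web
# of word and phrase definitions" the article describes.
EXPANSIONS = {
    "dog": ["dog", "man's best friend", "walk"],
}

def expand_query(term):
    """Return the term plus any related words and phrases."""
    return EXPANSIONS.get(term, [term])

def search(documents, term):
    """Return documents mentioning the term or any of its expansions."""
    phrases = [p.lower() for p in expand_query(term)]
    return [doc for doc in documents
            if any(p in doc.lower() for p in phrases)]

docs = [
    "The defendant took his dog for a walk.",
    "Rex is man's best friend, after all.",
    "Quarterly earnings exceeded projections.",
]
print(search(docs, "dog"))  # matches the first two documents only
```

The point of the expansion step is exactly the one Markoff makes: a reviewer who searches only for the literal string "dog" would miss the second document entirely.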
I'm not convinced that the linguistic/sociological division is really appropriate other than as an expository device. Rather, there seem to be a number of different application types (e.g. document retrieval vs. information extraction), a number of different goals (e.g. finding the history of a particular interaction vs. finding instances of a certain kind of irregular behavior), and a number of different capabilities (e.g. cross-document reference resolution vs. sentiment analysis) that might be deployed to solve a particular problem.
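The "who talks to whom" capability mentioned above — reconstructing the history of interactions from message metadata — can likewise be sketched simply. This is a toy illustration, assuming email metadata is available as (sender, recipient, date) tuples; the names and data are hypothetical, and a real system would work over messy multi-channel records:

```python
# A minimal sketch of mining "who talks to whom" from message metadata.
from collections import Counter

messages = [
    ("alice", "bob",   "2011-01-03"),
    ("bob",   "alice", "2011-01-04"),
    ("alice", "bob",   "2011-01-05"),
    ("carol", "dave",  "2011-01-05"),
]

def interaction_counts(msgs):
    """Count messages per unordered sender/recipient pair."""
    pairs = Counter()
    for sender, recipient, _date in msgs:
        pairs[frozenset((sender, recipient))] += 1
    return pairs

counts = interaction_counts(messages)
print(counts[frozenset(("alice", "bob"))])  # prints 3
```

Tallies like these are the raw material for the anomaly detection the article describes: an unusually intense burst of traffic between two people, or a conversation that abruptly moves off email, is the kind of "digital anomaly" such software flags for human review.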
However, to get a better sense of why Markoff might have used the term "sociological" here, take a look at Cataphora's explanation of why "Context is everything".
Markoff's article focuses on the economic consequences of "e-discovery" and similar kinds of automation:
These new forms of automation have renewed the debate over the economic consequences of technological progress.
David H. Autor, an economics professor at the Massachusetts Institute of Technology, says the United States economy is being “hollowed out.” New jobs, he says, are coming at the bottom of the economic pyramid, jobs in the middle are being lost to automation and outsourcing, and now job growth at the top is slowing because of automation.
“There is no reason to think that technology creates unemployment,” Professor Autor said. “Over the long run we find things for people to do. The harder question is, does changing technology always lead to better jobs? The answer is no.”
I believe that "e-discovery" plays at best a minor role in the pointed questions about the value of law school now being asked by legions of currently-under-employed J.D.'s (see David Segal, "Is Law School a Losing Game?", NYT 1/8/2011). It's the recession that's mainly responsible for today's legal supply-demand imbalance, with law-school over-recruiting and law-student naiveté in second place. However, to the extent that Tom Mitchell is right about the practical deployment of computational linguistics over the next decade, the job market for young lawyers may get even worse.
Or — and this is more frightening for the rest of us — perhaps e-discovery (not to speak of e-brief-writing and so on) will radically improve the cost-effectiveness of lawyering and thus radically increase the volume of lawsuits.
[Full disclosure: Dick Oehrle, Cataphora's Chief Linguist, has been a friend since we were undergraduates together during the paleosilicic era.]
Update — there are now 204 comments on the NYT article, many of them interesting.