## Legal automation

Over the past few days, we've discussed the possible relevance of corpus evidence in legal evaluations of ordinary-language meaning. Another (and socio-economically more important) legal application of computational linguistics is featured today in John Markoff's article, "Armies of Expensive Lawyers, Replaced by Cheaper Software", NYT 3/4/2011:

When five television studios became entangled in a Justice Department antitrust lawsuit against CBS, the cost was immense. As part of the obscure task of “discovery” — providing documents relevant to a lawsuit — the studios examined six million documents at a cost of more than $2.2 million, much of it to pay for a platoon of lawyers and paralegals who worked for months at high hourly rates. But that was in 1978. Now, thanks to advances in artificial intelligence, “e-discovery” software can analyze documents in a fraction of the time for a fraction of the cost. In January, for example, Blackstone Discovery of Palo Alto, Calif., helped analyze 1.5 million documents for less than $100,000.

Markoff's article features a quote from Tom Mitchell:

We’re at the beginning of a 10-year period where we’re going to transition from computers that can’t understand language to a point where computers can understand quite a bit about language.

This is probably true as a statement about systems widely deployed in practice; but in fact the ideas and algorithms behind this transition have been developed and demonstrated in research projects over the past 25 years or so. There are some recent new ideas, and there will no doubt be a regular progression of other new ideas in the future. But there isn't any recent development that deserves to be called a breakthrough. Rather, there are three mutually-reinforcing processes that have been under way for decades, and are now starting to make a practical impact in this as well as other applications of speech and language engineering:

• A gradual accumulation of new techniques and (especially) refinement of older ones, which yield cumulative improvements in performance;
• Constant cost-performance improvements in computers, networks, and storage, which make it possible to apply (new and old) ideas on larger and larger scales, more and more cheaply;
• Increasing digitization of communication and record-keeping, which makes larger stores of data available for training, and also makes deployment of automated systems easier and cheaper.

Markoff uses the term "e-discovery" for (semi-) automated techniques for pulling legally-relevant information out of piles of digital documents, and suggests a basic division:

E-discovery technologies generally fall into two broad categories that can be described as “linguistic” and “sociological.”

The most basic linguistic approach uses specific search words to find and sort relevant documents. More advanced programs filter documents through a large web of word and phrase definitions. A user who types “dog” will also find documents that mention “man’s best friend” and even the notion of a “walk.”

The sociological approach adds an inferential layer of analysis, mimicking the deductive powers of a human Sherlock Holmes. Engineers and linguists at Cataphora, an information-sifting company based in Silicon Valley, have their software mine documents for the activities and interactions of people — who did what when, and who talks to whom. The software seeks to visualize chains of events. It identifies discussions that might have taken place across e-mail, instant messages and telephone calls.

Then the computer pounces, so to speak, capturing “digital anomalies” that white-collar criminals often create in trying to hide their activities.
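To make the "linguistic" category a bit more concrete, here is a minimal sketch of keyword search with query expansion, along the lines of the "dog" example in the passage above. The expansion table, the function names, and the sample documents are all invented for illustration; real e-discovery products presumably use much larger lexical resources and more sophisticated matching.

```python
import re

# Hypothetical expansion table: each query term maps to related words and phrases.
EXPANSIONS = {
    "dog": ["dog", "man's best friend", "walk"],
}

def expand_query(term):
    """Return the term's related phrases, or just the term if none are listed."""
    key = term.lower()
    return EXPANSIONS.get(key, [key])

def search(term, documents):
    """Return the indices of documents mentioning the term or any expansion."""
    patterns = [
        re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
        for phrase in expand_query(term)
    ]
    return [
        i for i, text in enumerate(documents)
        if any(p.search(text) for p in patterns)
    ]

docs = [
    "I took the dog to the vet.",
    "Time for a walk with man's best friend.",
    "Quarterly revenue figures attached.",
]
hits = search("dog", docs)  # the first two documents match
```

Even this toy version shows the obvious trade-off: adding "walk" to the expansion of "dog" raises recall, but it will also pull in documents about walks that have nothing to do with dogs.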

I'm not convinced that the linguistic/sociological division is really appropriate other than as an expository device. Rather, there seem to be a number of different application types (e.g. document retrieval vs. information extraction), a number of different goals (e.g. finding the history of a particular interaction vs. finding instances of a certain kind of irregular behavior), and a number of different capabilities (e.g. cross-document reference resolution vs. sentiment analysis) that might be deployed to solve a particular problem.

However, to get a better sense of why Markoff might have used the term "sociological" here, take a look at Cataphora's explanation of why "Context is everything".
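As a rough illustration of the "who talks to whom" side of things, the simplest possible starting point is a count of messages per sender-recipient pair. This sketch is not a description of Cataphora's methods; the message log, the names in it, and the `contact_counts` function are all hypothetical.

```python
from collections import Counter

# Hypothetical message log: (sender, recipient, ISO date).
messages = [
    ("alice", "bob", "2011-01-03"),
    ("alice", "bob", "2011-01-04"),
    ("bob", "carol", "2011-01-04"),
    ("alice", "bob", "2011-01-05"),
]

def contact_counts(log):
    """Count messages for each (sender, recipient) pair."""
    return Counter((sender, recipient) for sender, recipient, _date in log)

counts = contact_counts(messages)
# The most frequent pair and its message count.
top_pair, top_count = counts.most_common(1)[0]
```

Anything resembling the "digital anomalies" in Markoff's description would need a good deal more than this, for instance comparing such counts across time periods or across communication channels.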

Markoff's article focuses on the economic consequences of "e-discovery" and similar kinds of automation:

These new forms of automation have renewed the debate over the economic consequences of technological progress.

David H. Autor, an economics professor at the Massachusetts Institute of Technology, says the United States economy is being “hollowed out.” New jobs, he says, are coming at the bottom of the economic pyramid, jobs in the middle are being lost to automation and outsourcing, and now job growth at the top is slowing because of automation.

“There is no reason to think that technology creates unemployment,” Professor Autor said. “Over the long run we find things for people to do. The harder question is, does changing technology always lead to better jobs? The answer is no.”

I believe that "e-discovery" plays at best a minor role in the pointed questions about the value of law school now being asked by legions of currently-under-employed J.D.'s (see David Segal, "Is Law School a Losing Game?", NYT 1/8/2011). It's the recession that's mainly responsible for today's legal supply-demand imbalance, with law-school over-recruiting and law-student naiveté in second place. However, to the extent that Tom Mitchell is right about the practical deployment of computational linguistics over the next decade, the job market for young lawyers may get even worse.

Or — and this is more frightening for the rest of us — perhaps e-discovery (not to speak of e-brief-writing and so on) will radically improve the cost-effectiveness of lawyering and thus radically increase the volume of lawsuits.

[Full disclosure: Dick Oehrle, Cataphora's Chief Linguist, has been a friend since we were undergraduates together during the paleosilicic era.]

Update — there are now 204 comments on the NYT article, many of them interesting.

1. ### Josh Bowles said,

March 5, 2011 @ 10:04 am

Don't forget internet marketing. I put off my PhD (in which I wanted to focus on computational semantics and pragmatics) to work for an internet company that went from 10 to 50 million dollars in revenue in under two years and is now slated to break 100 million. Internet marketing is now taught in CS classes (see, for example, chapter 8 of http://infolab.stanford.edu/~ullman/mmds.html), and relates to everything from data mining, Bayesian classification, algorithmic game theory, machine learning, etc. It is a fertile testing ground for academic ideas.
My first 3 months were spent doing research on the next 18 months of infrastructure build-out and hardware purchases that will support the expanded need of marketers to target consumers based on online behavior. There are numerous "marketing" companies employing various types of academic technology (Autonomy (http://www.autonomy.com/), Attensity (http://www.attensity.com/home/), RecordedFuture (https://www.recordedfuture.com/), and more). The simple connection between natural language search engines and internet behavior drives the need for the linguistics side of things.
I agree with

"This is probably true as a statement about systems widely deployed in practice; but in fact the ideas and algorithms behind this transition have been developed and demonstrated in research projects over the past 25 years or so. There are some recent new ideas, and there will no doubt be a regular progression of other new ideas in the future. But there isn't any recent development that deserves to be called a breakthrough"

Most of the techniques in use now are "quantity" driven: meaning they rely heavily on statistical inference. A real breakthrough will need to come on the qualitative side of things. That is, leveraging the numerical calculating power of a computer is not an innovation—it's common sense because computers are effectively good at numerical calculation. But getting an algorithm (or system) to distinguish qualitative differences between "fire engine red" and "rose red" is a task that doesn't make sense in the current state of the art. And this is where a breakthrough would be made (assuming we can make it).

2. ### Dan Lufkin said,

March 5, 2011 @ 12:53 pm

There's a niche in the legal industry in Washington (and likely elsewhere) that's occupied by an army of underemployed law-school graduates who get $50 an hour (i.e. peanuts) as no-bennies temps to review thousands of documents in discovery to determine whether they (the documents) are subject to attorney-client privilege. This is probably a decision that would be very difficult to automate, but that can be reduced to the sweatshop level with a keyboard.

[(myl) I have no experience with this particular task. But in similar cases, the level of inter-subjective agreement among well-trained annotators can be remarkably low, and as a result, it's often much easier than you might think to create a program that agrees with human annotators as often as they agree with one another. If I were asked to take on a project of this kind, and if I cared about the quality of the results, I'd start from the premise that a judicious combination of document-classification technology and human judgment would be by far the most cost-effective and reliable approach.]

3. ### rkillings said,

March 5, 2011 @ 12:55 pm

Congratulations! Link farms aside, Language Log is now the number 1 hit on Google for "paleosilicic", edging aside the Deva Victrix – Chester wiki:

"The Roman invasion of Britain marks the formal end of the British Iron Age, although some believe that the Iron Age still continues (unless recently superseded by the age of the "Bakelite People", or the "Paleosilicic")."

4. ### slobone said,

March 5, 2011 @ 1:39 pm

@rkillings, Yeah, but it's not a valid Scrabble word, so what good is it?

5. ### Watson goes to law school | LAWnLinguistics said,

March 5, 2011 @ 2:44 pm

[…] Language Log, Mark Liberman discusses the article and explains that the new e-discovery applications that have proliferated over the past few years […]

6. ### Ran Ari-Gur said,

March 5, 2011 @ 6:23 pm

@rkillings: Wordnik is not exactly a link farm. In my experience it has a low signal-to-noise ratio (so far), but it was founded by a reputable lexicographer, and the people running it are trying to make it useful.

7. ### Brett R said,

March 6, 2011 @ 7:16 am

Can anyone confirm whether a change in the rate of split infinitives is indeed an indicator of malfeasance as claimed by the chief technology officer in the article?

8. ### John Cowan said,

March 6, 2011 @ 3:59 pm

Brett R.: Jane Austen lived a blameless life.

9. ### Dave Lewis said,

March 6, 2011 @ 11:05 pm

The Markoff article gets the standard terminology a bit wrong. E-Discovery refers simply to doing discovery on documents (or other data) in electronic format, not to any particular set of smart (or dumb) technologies for doing so.

Combinations of document classification technology and human judgment (judicious or not) are rapidly becoming the norm in e-discovery, with pretty much all the major vendors and service providers now including both supervised and unsupervised machine learning technologies. The technology is way ahead of the legal frameworks, however, and everyone is somewhat nervously waiting for judicial guidance.

For those interested in reading more, the overview papers for the TREC Legal Track are a good start. It's been running evaluations of information retrieval technologies for e-discovery since 2006. Also, the current issue (Volume 18, Number 4, December 2010) of Artificial Intelligence and Law is a special issue on e-discovery.

There will be workshops on e-discovery this summer at ICAIL 2011 in Pittsburgh, and at SIGIR 2011 in Beijing. We'd love participation from the computational linguistics community.

10. ### Tom said,

March 7, 2011 @ 11:04 am

Last year at a legal conference I heard a futurist (whose name I don't remember) speak about how the practice of law will change. Quoting another futurist (I don't remember that name, either), he said in the future a law office will consist of a lawyer, a dog, and a computer. The computer will be there to answer legal questions. The dog will be there to occupy the lawyer so he won't mess with the computer. And the lawyer will be there to feed the dog.

[(myl) There are versions for other professions as well. Here's one about pilots (though it's on a page about aviation law):

What will the cockpit crew be like in the commercial aircraft of the future? Pessimists in professional circles already know the answer: A pilot and a dog. Yes, you heard me correctly: The pilot’s only job is to feed the dog and keep him awake; the dog is supposed to make sure that the pilot doesn’t touch anything.

A version of the same joke from 2000 is here. Here's a version about "the data center of the future". This document starts with a version about "the factory of the future", sourced to Warren Bennis, to whom the quote is widely attributed.]

11. ### Keith Schon said,

March 8, 2011 @ 8:11 pm

In response to Brett R's question, I work for Cataphora, and I was in the room when the "split infinitives" quote was said. It wasn't meant as a literal statement. The discussion was about how the level of formality in written communications often changes when the writer thinks the communication may be read by people other than the intended parties. We have seen this in real world data, both because people watch what they write to a greater degree, and because they go back and "clean up" their existing data by deleting embarrassing or incriminating outbursts. With that said, we don't have data specifically about split-infinitives. That was just an off-the-cuff quip, which unfortunately may not come through in the article.

Keith Schon,
Manager, Core Technology Group,
Cataphora

12. ### Just another Peter said,

March 9, 2011 @ 11:28 pm

"the studios examined six million documents at a cost of more than $2.2 million, much of it to pay for a platoon of lawyers and paralegals who worked for months at high hourly rates."

So they paid the costly lawyers and paralegals… about 36c per document they examined. That doesn't seem like much.
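
The per-document figure is easy to check, and the same arithmetic can be applied to the Blackstone Discovery example quoted from the article. This is only a sketch: the article gives the totals as "more than" $2.2 million and "less than" $100,000, so the first result is a lower bound and the second an upper bound, and the 1978 dollars are not adjusted for inflation.

```python
def cost_per_document(total_cost, documents):
    """Average cost per document, in dollars."""
    return total_cost / documents

# 1978 case: more than $2.2 million for six million documents.
manual_1978 = cost_per_document(2_200_000, 6_000_000)

# 2011 example: less than $100,000 for 1.5 million documents.
software_2011 = cost_per_document(100_000, 1_500_000)

# How many times cheaper the 2011 figure is, per document.
ratio = manual_1978 / software_2011
```

That works out to roughly 37 cents per document in 1978 against at most about 7 cents in 2011, a nominal ratio of about 5.5 before any inflation adjustment.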