NLLP: bag-of-words semantics?


The First Workshop on Natural Legal Language Processing (NLLP) will be co-located with NAACL 2019. The phrase "natural legal language processing" in the title strikes me as oddly constructed, from a syntactic and semantic point of view, though I'm sure that NAACL attendees will interpret it easily as intended.

Let me explain.

The OED's entry for natural language, dating to 2003, gives one antique usage glossed as

1. A person's native language, a mother tongue. Also in extended use.

with citations going back to the 16th century:

1570 Queen Elizabeth I in J. Strype Ann. Reformation (1709) I. lvi. 615 So near a neighbour by situation, blood, natural language, and other conjunctions.
1589 R. Hakluyt Princ. Navigations iii. 813 Certaine wordes of the naturall language of Iaua… Sagu, bread of the Countrey.
1697 tr. Countess D'Aunoy's Trav. (1706) 131 She now mixes Italian, English, and Spanish with her own natural Language.
1713 Boston News-let. 31 Aug. 2/2 (advt.) A German Servant Man, named John Copler..speaks very broken English, broken French also, Dutch is his natural Language.

A slightly more modern sense, with citations back to the 18th century, is

2.a. A language that has evolved naturally, as distinguished from an artificial language devised for international communications or for formal logical or mathematical purposes.

1774 Ld. Monboddo Of Origin & Progress of Lang. II. iii. xiii. 445 It may also be said to be a natural language..since it follows the order of the human mind in forming the ideas of which language is the expression.
1864 F. M. Müller Lect. Sci. Lang. 2nd Ser. ii. 58 A grammatical wanted before the problem of an artificial language can be..solved. In natural languages the grammatical articulation consists either in separate particles or in modifications in the body of a word.
1933 L. Bloomfield Language xxviii. 506 The political difficulty of getting any considerable number of people all over the world to study, say, Esperanto, will probably prove so great that some natural language will outstrip it.

And the current usage, dating only to 1960 or so, is:

2.b. Computing. Human language, esp. when contrasted with languages designed to be used by computers. Usually attributive.

1960 D. R. Swanson in Science 132 1099 (title) Searching natural language text by computer.
1963 E. A. Feigenbaum & J. Feldman Computers & Thought v. 205 Question-answering machines are computer programs that can be interrogated in natural language (with some constraints) for the answers to questions about a universe of discourse.

That last sense led naturally to the phrase "natural language processing", glossed as

a form of computational linguistics in which natural-language texts are processed by computer (for automatic machine translation, literary text analysis, etc.); abbreviated NLP.

and cited from 1965:

1965 R. F. Simmons Nat. Lang. Processing & Time-Shared Computer 3 Work on mechanical translation, on syntactic analysis systems, and on information retrieval, question-answering, and document retrieval systems have all characterized our efforts in natural language processing so far.

So it's clear that NLP is the processing of natural (i.e. human, as opposed to computer) language, not the natural processing of language.

But what about "natural legal language processing"? Well, the clearly intended meaning is NLP applied to legal language. As the NLLP home page explains,

As electronic information becomes increasingly available around the world, automated tools for processing that information have grown apace. These tools can be especially effective and time-saving on text where information can be distilled in interesting ways including auto-summarization, named-entity extraction, machine translation, sentiment analysis, topic classification and others. As a result, natural language processing (NLP) applications are popular in important commercial contexts such as finance and healthcare.

The Legal domain however is still largely underrepresented in the NLP literature despite its enormous potential for generating interesting research problems on a par with other important commercial areas. In fact the US Legal Services market alone is valued at $211 billion according to US government price indices.

The accessibility of legal texts in the US in particular was an issue in the past preventing some researchers from working on legal NLP problems. Over the last few years however, more legal corpora have come online at low- or no-cost including the BYU Corpus, the Free Law Project and the expansion of resources published by the Library of Congress through A variety of growing electronic legal resources already exist free of charge for countries in Europe and Asia. Thus we feel that the timing is excellent to bring together researchers from around the world to focus on NLP problems in this area.

So to put it another way, it's "legal NLP".

Except that we might take "legal NLP" to be in contrast with "illegal NLP". And "legal language" is certainly a thing, so if we take processing in the sense used in the phrase "natural language processing", then the simple three-word phrase "legal language processing" would mean, following the OED's gloss,

a form of computational linguistics in which legal-language texts are processed by computer.

OK, then why add natural in the beginning of that phrase, to yield the four-word version "natural legal language processing"? Well, we're talking about a form of NLP, right? So apparently we want to include all four words. And there are only two syntactically plausible places to stick natural in:

  1. legal natural language processing
  2. natural legal language processing

Option 1 retains the subphrase "natural language processing", but may suggest a contrast with illegal NLP, as noted earlier.

Option 2 retains the fact that we're processing "legal language", although for anyone who tries to work the meaning out compositionally, it raises the question of what forms of "unnatural (or artificial) legal language" we're avoiding. My guess is that the authors of the phrase (and most members of their audience) are practicing a sort of bag-of-words semantics, where it matters what words you include, and perhaps what pairs of words you include, but not so much how phrasal meaning is recursively computed from the contents of the bag.
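
For readers who like to see such things spelled out, here is a minimal Python sketch of what a bag-of-words view amounts to for the two options above. It is only an illustration under simple assumptions (lowercasing, whitespace tokenization), and the function names are invented for the purpose; nothing here comes from the workshop organizers or from any particular NLP toolkit.

```python
from collections import Counter
from itertools import combinations


def bag_of_words(phrase):
    """Multiset of words, ignoring order (hypothetical helper)."""
    return Counter(phrase.lower().split())


def bag_of_pairs(phrase):
    """Multiset of unordered word pairs, ignoring position (hypothetical helper)."""
    words = phrase.lower().split()
    return Counter(frozenset(pair) for pair in combinations(words, 2))


def adjacent_bigrams(phrase):
    """Ordered list of adjacent word pairs, which is sensitive to word order."""
    words = phrase.lower().split()
    return list(zip(words, words[1:]))


option_1 = "legal natural language processing"
option_2 = "natural legal language processing"

# Same words and same unordered pairs of words, but different adjacency.
same_bag = bag_of_words(option_1) == bag_of_words(option_2)
same_pairs = bag_of_pairs(option_1) == bag_of_pairs(option_2)
same_bigrams = adjacent_bigrams(option_1) == adjacent_bigrams(option_2)

print(same_bag, same_pairs, same_bigrams)  # prints: True True False
```

On this view the two options are indistinguishable as bags of words and as bags of word pairs; they differ only once order or bracketing is taken into account, which is exactly the information a compositional reading relies on.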

Full disclosure: I'm on the program committee for that NLLP workshop, and fully support its goals, under whatever name.





  1. Jerry Friedman said,

    December 17, 2018 @ 10:29 pm

    Option 2 retains the fact that we're processing "legal language", although for anyone who tries to work the meaning out compositionally, it raises the question of what forms of "unnatural legal language" we're avoiding.

    Along those lines, wouldn't "legal-language processing" be fine? I think your suggestion of "bag of words semantics" is right on target here and in many other places.

    I usually keep my hyphen obsession under control around here, but I'm going to say that "natural language processing" could also mean the kind of language processing that's natural, done by humans, the exact opposite of NLP. "Natural-language processing" would be unambiguous. But causes don't get any loster.

  2. Viseguy said,

    December 17, 2018 @ 11:38 pm

    As a user of the Lexis online legal-research service since 1975, I, um, naturally understand natural legal-language processing as being in contradistinction to Boolean legal-language processing — Boolean searches (using keywords joined by logical connectors such as OR, AND, and NOT) being the default search method in Lexis. In fact, I'm not sure that natural-language searches are still an option in Lexis, or if they ever really were. I've always suspected that these so-called natural-language searches were simply passed through an algorithm that made a stab at translating them into Boolean searches, and were then processed as such — which I suspect is not what is meant today by natural-language processing. (Hyphens galore in this post!)

  3. rosie said,

    December 18, 2018 @ 1:39 am

    Only this morning I heard the phrases "immigration white paper" and "Iran nuclear deal". The first seemed fine to me but the second jarred. How come? The latter phrase is a bag of words which doesn't work, but the former works because "white paper" is a lexeme, a term of art in [initially] UK politics. So's NLP. If Jerry Friedman's "legal-language processing" is inadequate, what's needed is "legal language NLP".

  4. unekdoud said,

    December 18, 2018 @ 4:02 am

    Natural AND Legal Language Processing might do the job, as long as it's parsed as a logical conjunction rather than the usual set union. This can be achieved by using the ∧ symbol, but now that would be really unnatural.

  5. Jerry Friedman said,

    December 18, 2018 @ 10:29 am

    rosie: It's not my "legal-language processing" (except for the hyphen), since it's in Prof. Liberman's post. I can see that I made it look as if it were mine, though.

  6. Ralph Hickok said,

    December 18, 2018 @ 10:48 am

    Oddly enough, I have exactly opposite reactions to those phrases. "Iran nuclear deal" seems quite fine to me but "immigration white paper" is jarring :)

  7. Gregory Kusnick said,

    December 18, 2018 @ 12:02 pm

    Does legal language even count as "natural" language? Not in the first sense; it's nobody's L1. Nor even, arguably, in the second sense: it's a formal system devised for the purpose of legal argumentation.

    I suppose it's a human language ("natural" in the third sense) in that human beings speak it to each other. But that's true of mathematics as well, and I don't think anyone would count math as natural language.

  8. Bob Ladd said,

    December 18, 2018 @ 12:51 pm

    @Gregory Kusnick: Actually, in Bloomfield's classic book Language, mathematics is described as "the ideal use of language" and "merely the best that language can do". Bloomfield's point is a bit obscure (why the emphasis on language in the second quote, for example?) but he definitely considered mathematics to be essentially linguistic behaviour.

  9. J.W. Brewer said,

    December 18, 2018 @ 2:14 pm

    Argumentative legal language isn't particularly formal (in the technical sense) at all. It's meant to be persuasive to human beings, and normal human beings (including judges and jurors) are unlikely to be most effectively persuaded by the sort of rigorously formalized-and-notated reasoning that appeals to philosophy grad students with little exposure to the wider world and/or people with a computer-programming background who naively assume that natural language is like a programming language except inexplicably messy and imprecise. The most jargony and opaque legal language appears in loan documents and tax regulations and things like that not intended as advocacy. I'm skeptical that the same NLP approaches that would be useful for analyzing briefs and judicial decisions would be useful for analyzing that separate genre (or set of genres), but maybe the conference can cover both.

  10. David Marjanović said,

    December 18, 2018 @ 7:24 pm

    It's transparent and unambiguous to me: natural legal-language processing (note the all-clarifying hyphen) is the natural processing of legal language. Legal-language processing is a compound noun.

    {natural {{legal-language} processing}}

  11. David Marjanović said,

    December 18, 2018 @ 7:26 pm

    …while natural-legal-language processing would of course be the processing of natural(ly occurring?) legal language…

  12. Rebecca said,

    December 18, 2018 @ 11:23 pm

    My fascination with the elegant little animation of the hamburger / X button on the NAACL site has made it impossible to notice anything else about these conferences. Maybe it's common, but it's new to me.

  13. MJ said,

    December 19, 2018 @ 12:42 am

    I'd be inclined to just add another "language", as in "Legal Language NLP" or "Legal Language Natural Language Processing". Since they're both phrases, the repetition of "language" doesn't bother me.

  14. Milan said,

    December 19, 2018 @ 4:36 pm

    Another solution would be to choose a (slightly less common) synonym for "legal", e.g. "Juridical NLP" (or "juridic, juristic(al), jural")
