Annals of parsing


Two of the hardest problems in English-language parsing are prepositional phrase attachment and scope of conjunction. For PP attachment, the problem is to figure out how a phrase-final prepositional phrase relates to the rest of the sentence; the classic example is "I saw a man in the park with a telescope". For conjunction scope, the problem is to figure out just what phrases an instance of "and" is being used to combine.
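
To get a feel for how quickly PP attachment ambiguity multiplies, here is a minimal Python sketch that counts the analyses of the telescope sentence with a CKY-style chart. The six-rule grammar, the lexicon, and the function name count_parses are assumptions made up for illustration; they are not taken from any real parser.

```python
from collections import defaultdict

# Toy grammar in Chomsky normal form (an illustrative assumption, not a real parser's grammar).
BINARY = [
    ("S", "NP", "VP"),
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),   # the PP attaches to the verb phrase
    ("NP", "NP", "PP"),   # the PP attaches to the noun phrase
    ("NP", "Det", "N"),
    ("PP", "P", "NP"),
]
# Toy lexicon: one category per word.
LEXICON = {
    "I": "NP", "saw": "V", "a": "Det", "the": "Det",
    "man": "N", "park": "N", "telescope": "N",
    "in": "P", "with": "P",
}


def count_parses(words, root="S"):
    """Count the distinct trees the toy grammar assigns to a list of words."""
    n = len(words)
    # chart[i][j][X] = number of trees rooted in X that span words[i:j]
    chart = [[defaultdict(int) for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        chart[i][i + 1][LEXICON[word]] = 1
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):          # every split point of the span
                for parent, left, right in BINARY:
                    chart[i][j][parent] += chart[i][k][left] * chart[k][j][right]
    return chart[0][n][root]


if __name__ == "__main__":
    for sentence in (
        "I saw a man",
        "I saw a man in the park",
        "I saw a man in the park with a telescope",
    ):
        print(count_parses(sentence.split()), sentence)
```

Under this toy grammar the sentence with two trailing PPs gets five analyses, and adding a third PP raises the count to fourteen. A real grammar, with coordination and many more categories, has far more room to go wrong.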

The title of a recent article offers some lovely examples of the problems that these ambiguities can cause: Suresh Naidu and Noam Yuchtman, "Back to the future? Lessons on inequality, labour markets, and conflict from the Gilded Age, for the present", VOX 8/23/2016. The second phrase includes three ambiguous prepositions ("on", "from", and "for") and one conjunction ("and"), and has more syntactically valid interpretations than you're likely to be able to imagine unless you're familiar with the problems of automatic parsing.

On the basis of general linguistic and real-world knowledge, I interpret this title as promising lessons for the present time, asserting that these lessons are derived from a comparison with various phenomena in the period known as the Gilded Age, and specifying that those phenomena are inequality, labor markets, and conflict. Reading the article confirms this interpretation.

The relevant linguistic knowledge includes the fact that "lessons for X" and "lessons from Y" are both common idioms. It's reasonable to expect a modern stochastic parser to know this, but of course "for" and "from" are promiscuous modifiers, willing to hook up with almost any noun, verb, or adjective at all.

The relevant real-world knowledge includes the fact that socio-economic inequality is a frequently discussed feature of the Gilded Age and also of the present time, that such inequality is related to labour markets, and that it may lead to conflict.  Modern parsers don't think about things at this level — current directions of progress, like distributional semantics, don't really help much in cases like this one.

To illustrate some of the ways that parsing can go wrong, I'll give the (variously wrong) results returned by three available on-line parsers. And let me note in passing that all of these parsers report their results in the general framework adopted by the Penn Treebank project, which flourished due to a sort of treaty that emerged from a summit meeting of computational linguists held 25 years ago (Ezra Black et al., "A Procedure for Quantitatively Comparing the Syntactic Coverage of English Grammars", HLT 1991). This approach has been so widely used in "treebank" projects, across many languages and types of text, that it makes sense in my opinion to teach it more widely, at least to linguistics students if not to a broader audience.

The (rather old-fashioned) Link Grammar parser messes up the worst:

(S (NP (NP Lessons)
       (PP on
           (NP inequality)))
   ,
   (VP labour
       (NP (NP markets)
           , and
           (NP conflict))
       (PP from
           (NP the Gilded Age ,
               (PP for
                   (NP the present)
                   .)))))

You can visualize the induced structure more clearly in a tree diagram:

In other words, the article is about "Lessons on X, labour Y"; X is "inequality"; and Y is "markets and conflict from the Gilded Age", more specifically the "Gilded Age for the present".

The Berkeley parser is a bit better:

(ROOT
  (NP
    (NP (NNS Lessons))
    (PP (IN on)
      (NP
        (NP (NN inequality))
        (, ,)
        (NP (JJ labour) (NNS markets))
        (, ,)
        (CC and)
        (NP
          (NP (NN conflict))
          (PP (IN from)
            (NP (DT the) (NNP Gilded) (NNP Age)))
          (, ,)
          (PP (IN for)
            (NP (DT the) (NN present))))))
    (. .)))

Note that this version adds the "part of speech" tags immediately above the individual words:

Now we're talking about lessons on X, where X="inequality, labour markets, and Y", and Y="conflict from the Gilded Age for the present". This is coherent but certainly not correct.

The Stanford parser almost gets it right:

(ROOT
  (NP
    (NP (NNS Lessons))
    (PP (IN on)
      (NP
        (NP (NN inequality))
        (, ,)
        (NP (NN labour) (NNS markets))
        (, ,)
        (CC and)
        (NP
          (NP (NN conflict))
          (PP (IN from)
            (NP (DT the) (NNP Gilded) (NNP Age))))
        (, ,)))
    (PP (IN for)
      (NP (DT the) (NN present)))
    (. .)))

Here "from the Gilded Age" modifies only "conflict", and has no direct relationship to "lessons".

I think the correct analysis has the three PPs as all parallel dependents of "lessons"

  • "on inequality, labour markets, and conflict"
  • "from the Gilded Age"
  • "for the present"

with the conjunction joining the three NPs

  • "inequality"
  • "labour markets"
  • "conflict"

(ROOT
  (NP
    (NP (NNS Lessons))
    (PP (IN on)
      (NP
        (NP (NN inequality))
        (, ,)
        (NP (NN labour) (NNS markets))
        (, ,)
        (CC and)
        (NP (NN conflict))))
    (PP (IN from)
      (NP (DT the) (NNP Gilded) (NNP Age)))
    (, ,)
    (PP (IN for)
      (NP (DT the) (NN present))))
  (. .))
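
The difference between this analysis and the Stanford parser's output can be checked mechanically. The following Python sketch reads the bracketed notation into nested tuples and reports, for each PP, the words of the first daughter of the PP's parent. Treating that first daughter as the phrase being modified is a simplifying assumption that happens to fit these trees, and read_tree, words, and pp_sites are hypothetical helper names, not part of any parser's API.

```python
import re

# The two bracketings from the post, written on one line each.
STANFORD = (
    "(ROOT (NP (NP (NNS Lessons)) (PP (IN on) (NP (NP (NN inequality)) (, ,) "
    "(NP (NN labour) (NNS markets)) (, ,) (CC and) (NP (NP (NN conflict)) "
    "(PP (IN from) (NP (DT the) (NNP Gilded) (NNP Age)))) (, ,))) "
    "(PP (IN for) (NP (DT the) (NN present))) (. .)))"
)
PROPOSED = (
    "(ROOT (NP (NP (NNS Lessons)) (PP (IN on) (NP (NP (NN inequality)) (, ,) "
    "(NP (NN labour) (NNS markets)) (, ,) (CC and) (NP (NN conflict)))) "
    "(PP (IN from) (NP (DT the) (NNP Gilded) (NNP Age))) (, ,) "
    "(PP (IN for) (NP (DT the) (NN present)))) (. .))"
)


def read_tree(text):
    """Parse one bracketed string into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # consume "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:                     # a bare token is a word
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # consume ")"
        return (label, children)

    return node()


def words(tree):
    """Return the words (leaves) of a tree, left to right."""
    out = []
    for child in tree[1]:
        out.extend([child] if isinstance(child, str) else words(child))
    return out


def pp_sites(tree):
    """Yield (preposition, words of the parent's first daughter) for each PP."""
    children = tree[1]
    for child in children:
        if isinstance(child, str):
            continue
        if child[0] == "PP":
            # Assumption: the first daughter of the parent is what the PP modifies.
            yield words(child)[0], " ".join(words(children[0]))
        yield from pp_sites(child)


if __name__ == "__main__":
    for name, text in (("Stanford", STANFORD), ("proposed", PROPOSED)):
        print(name, list(pp_sites(read_tree(text))))
```

In the proposed tree all three PPs come out as sisters of "Lessons", while in the Stanford output "from" lands next to "conflict".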

Some people might prefer to add additional structure, say by binding "on inequality, labour markets, and conflict" more closely to "lessons", or even stacking up the PP modifiers recursively:

I'll leave this question to the syntacticians.

But whatever exactly the right answer is, it's not what the three online parsers came up with. There may be some systems out there that can do better on this particular example, but PP attachment and conjunction scope in English remain hard problems for computational linguistics.



7 Comments

  1. Bob Ladd said,

    August 24, 2016 @ 8:42 am

    Surely the editor of the journal could have made the title a lot more reader- (and presumably parser-) friendly before the piece was published, e.g. "Lessons for the present from the Gilded Age, on labour markets, inequality, and conflict". The presence of the comma before "for the present" is a real giveaway that the expression needs to be rearranged.

    Also, even in the actually published version, what effect does the presence of the "Oxford comma" have on the parsers' behaviour? That is, would it have made a difference if the title had been "Lessons on labour markets, inequality and conflict from the Gilded Age, for the present"?

    [(myl) You can try it at the links provided in the post. Omitting that comma makes the Stanford parser and the Berkeley parser do the same thing, and not the right thing.]

  2. Daniel Barkalow said,

    August 24, 2016 @ 11:45 am

    It looks like the Link Grammar parser was misled by being given an NP when it thought it should get an S, and was pretty much doomed by the fact that "labour" is the only remotely plausible verb. You might get something more representative out of "There are lessons on…"

    The Stanford parser's interpretation is probably the best guess without a bunch of pragmatics. (The present probably doesn't need lessons on [conflict from the Gilded Age], but could use lessons on [conflict from the Cold War Era].) I think [[…[NP PP]], PP] is more likely than [[… NP] PP, PP], all things being equal.

  3. Gregory Kusnick said,

    August 24, 2016 @ 12:03 pm

    My first reading had "for the present" attached to the conjunction. That is, today we'll talk about lessons on inequality, labor markets, and conflict; additional lessons will be covered in our next installment.

  4. D.O. said,

    August 24, 2016 @ 8:50 pm

    I would even rephrase it as "Labor markets, inequality, and conflict: lessons for the present from the Gilded Age", but maybe I was far too long around academic writing.

  5. Bob Ladd said,

    August 25, 2016 @ 12:47 am

    @ D.O.: Yes, of course, much better (or at least much more academic)!

  6. Idran said,

    August 25, 2016 @ 9:30 am

    Obviously there are ways that the title could have been made clearer, but I don't think that's the point of this article? Even unclear, it's still a valid English construction. The article isn't about improving the title, it's about automated efforts at parsing valid, if complicated, constructions.

  7. Yuval said,

    August 25, 2016 @ 2:05 pm

    I read this at first like Stanford did.
    Also, I think you misrepresented Link's parse: not markets and conflict are from the past, but labor.
