Coordination parsing challenge


Dan Bilefsky, "Hungarian Right, Center and Far, Make Gains", New York Times 4/11/2010:

Hungary’s center-right opposition party won first-round parliamentary elections here on Sunday, while a far-right party, whose black-clad paramilitary extremists evoke the Nazi era, made significant gains.

It's true that "center right" and "far right" are common collocations — but I wonder how many parsers can get that headline right.

Parsing is not the only way to fail in this case. Google Translate (which is phrase-based, as far as I know, and doesn't try to parse the input) renders the headline as "derecho de Hungría, el centro y la fecha, obtengan beneficios".

At first I was puzzled about how far could be translated as "la fecha" (= "the date"), but then I realized that it's probably from associating the English phrase "so far" with the typical Spanish translation "hasta la fecha".
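
To make that guess concrete, here is a toy sketch of how a purely co-occurrence-based aligner could come to associate "far" with "la fecha". Everything in it is an assumption made for illustration: the four-sentence parallel corpus is invented, and the Dice coefficient is just one simple association score, not a claim about what Google Translate actually computes.

```python
from collections import Counter
from itertools import product

# Hypothetical toy parallel corpus, invented for this illustration.
PAIRS = [
    ("no news so far", "sin noticias hasta la fecha"),
    ("so far nothing", "hasta la fecha nada"),
    ("the results so far", "los resultados hasta la fecha"),
    ("a far country", "un país lejano"),
]


def dice_table(pairs):
    """Score every (English word, Spanish word) pair by the Dice
    coefficient of their sentence-level co-occurrence counts."""
    en_count, es_count, joint = Counter(), Counter(), Counter()
    for en, es in pairs:
        en_words, es_words = set(en.split()), set(es.split())
        en_count.update(en_words)
        es_count.update(es_words)
        joint.update(product(en_words, es_words))
    return {
        (e, s): 2 * n / (en_count[e] + es_count[s])
        for (e, s), n in joint.items()
    }


SCORES = dice_table(PAIRS)


def ranked(english_word):
    """Spanish candidates for one English word, best score first
    (ties broken alphabetically)."""
    cands = [(s, sc) for (e, s), sc in SCORES.items() if e == english_word]
    return sorted(cands, key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    for word, score in ranked("far"):
        print(f"far -> {word}: {score:.2f}")
```

In this toy corpus "far" mostly shows up inside "so far", so the words of "hasta la fecha" outscore "lejano" as its partner. A real system uses far more data and better models, but the same kind of skew is at least a plausible source of the error.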

[Hat tip to Evan Harper.]



24 Comments

  1. Peter Taylor said,

    April 12, 2010 @ 8:18 am

    "Obtengan" is interesting too. I would guess that it's effectively the ustedes form of the imperative (as opposed to third person plural subjunctive).

  2. Damien Hall said,

    April 12, 2010 @ 8:33 am

    There's also possible lexical confusion (it got me, anyway) arising from the frequent use in European politics of 'the centre' to refer to parties who self-identify and are identified with neither the right-wing nor the left-wing parties of their country. As I read, I understood from this headline that the Hungarian Right and the (Hungarian) Centre (parties) had made gains; I suppose I assumed that the 'Hungarian Far' might be another name for a party, though an odd one; and then I thought that the comma after 'Far' was misplaced, and that the NYT should be ashamed at not catching it. All this in a split-second; then, what was in fact the intended parsing occurred to me.

    In passing, I also wonder why the Spanish version from Google Translate puts the verb of this headline in the subjunctive. Can't explain that one. Interestingly, when I put the phrase through Google Translate into Spanish and French on my computer, I get

    Derecho húngaro, Centro y Lejano, obtener ganancias

    and

    Droite hongroise, Centre et de l'Extrême, faire des gains

    – so the 'so far' > 'hasta la fecha' error is eliminated, but in neither language is the verb inflected. Apart from that, the translations into both languages are clumsy, but only as clumsy as the English original, IMHO. (There are also other small matters of idiomatic usage for the French; I can't judge as well for the Spanish, but I imagine the same is true there. However, that's not the topic of this post.)

  3. Damien Hall said,

    April 12, 2010 @ 8:34 am

    On obtengan, Peter Taylor (above) and I were commenting at the same time. Fair enough; thanks! There's still the question, though: why not just a present indicative?

  4. a soulless automaton said,

    April 12, 2010 @ 10:10 am

    I also doubt Google Translate has any real concept of language structure. Google seems to have a philosophy of never having people do something when it can be automated with the aid of large amounts of computing power, and is known to have some people with a background in AI, Machine Learning, &c.–including their Director of Research being an author of well-regarded AI textbooks.

    My guess is that the core of Google Translate runs some sort of context-based, statistical inference AI techniques, and is largely oblivious to any structure in the language other than what it "learns". The result would be that how it translates a single word would depend a great deal on the surrounding context, not in terms of what we would recognize as phrase or sentence structure, but just what other words appear in the document in what relative placement. The same principles, in much more limited form, are at work in Bayesian spam filters.

    The end result, much like statistical text generators given a large corpus, tends to be large amounts of text in some sort of linguistic uncanny valley, interspersed with moments of frightening lucidity and hilarious mistakes.

    (This might be what myl meant about it not parsing input, but thought others might find an elaboration interesting.)

  5. David Cantor said,

    April 12, 2010 @ 10:22 am

    In my experience, Google Translate has the annoying habit, in translations from English into German, of simply omitting the verb altogether. Easy to notice the error, but what if the point of my query was to get a suggestion on verb inflection?

  6. Marion Crane said,

    April 12, 2010 @ 10:34 am

    I experienced the same lexical confusion as Damien, resulting in further confusion as to how three factions in one political system could all make gains. Only when I read the first sentence in the following quote was I able to parse the headline correctly.

  7. Damien Hall said,

    April 12, 2010 @ 10:57 am

    It's also just struck me now that, as the sentence is constructed in the headline, the verb should be in the singular, as it (awkwardly, to be sure) sets up 'Centre' and 'Far' as types of 'Right'. One more reason for the confusion that Marion and I experienced, thinking at first that 'Centre' and 'Far' referred to different parties.

  8. J. W. Brewer said,

    April 12, 2010 @ 12:27 pm

    What Damien Hall said, with further speculation that maybe "Right" in this context is a mass noun rather than count noun? Thinking of what the two parties in question represent as variations or different levels of intensity on a common theme is not necessarily helpful, but that's just a reflection of the limitations of the metaphor of a single left-right axis like the number line that all political positions can be situated on, the far-from-novel shortcomings of which metaphor are certainly not the specific fault of whoever writes either the headlines or the copy for this particular newspaper. But I do think a more helpful headline might be something like "Professedly Non-Socialist Parties, Both Savory and Un-, Make Gains."

  9. Peter Taylor said,

    April 12, 2010 @ 1:34 pm

    a soulless automaton wrote:

    Google seems to have a philosophy of never having people do something when it can be automated with the aid of large amounts of computing power

    They seem to apply this philosophy to translations of their own pages too. I've just been poking around some stuff in the "Google webmasters' tools" and was puzzled to see a note in the site performance section telling me: "Información: guardar un máximo de 5,92 KB". My translation: "Information: store no more than 5.92kB".

    Then I realised that it was a bad translation of "Information: save up to 5.92kB". Clearly never been checked by a competent Spanish-speaker, let alone a professional translator.

  10. Theo Vosse said,

    April 12, 2010 @ 1:37 pm

    If the dictionary has an entry for "far" as a noun, and permits it without any kind of determiner, then I don't really see the problem. I agree that coordination is problematic in parsing, but in this case even the simplest standard "Chomskyan" approach (any XP can also be of the form XP+, "and" | "or", XP) would yield a valid parse.
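
The coordination rule described in this comment can be sketched as a tiny recursive-descent parser. This is only an illustration under stated assumptions: the lexicon (including "far" listed as a noun), the rule NP -> ADJ* N, and the function names are all hypothetical, and XP is restricted to NP to keep the sketch short.

```python
# Hypothetical lexicon; listing "far" as a noun is the assumption under test.
LEXICON = {
    "hungarian": "ADJ", "right": "N", "center": "N",
    "far": "N", "make": "V", "gains": "N",
}


def parse_np(tokens, i):
    """NP -> ADJ* N. Return (tree, next_index), or None on failure."""
    start = i
    while i < len(tokens) and LEXICON.get(tokens[i]) == "ADJ":
        i += 1
    if i < len(tokens) and LEXICON.get(tokens[i]) == "N":
        return ("NP", tokens[start:i + 1]), i + 1
    return None


def parse_coord(tokens, i):
    """NP (',' NP)* ('and' | 'or') NP, falling back to a single NP."""
    first = parse_np(tokens, i)
    if first is None:
        return None
    conjuncts, j = [first[0]], first[1]
    while j < len(tokens) and tokens[j] == ",":
        nxt = parse_np(tokens, j + 1)
        if nxt is None:
            break
        conjuncts.append(nxt[0])
        j = nxt[1]
    if j < len(tokens) and tokens[j] in ("and", "or"):
        last = parse_np(tokens, j + 1)
        if last is not None:
            return ("COORD", tokens[j], conjuncts + [last[0]]), last[1]
    return first  # no conjunction found: treat the subject as a plain NP


def parse_sentence(text):
    """S -> (COORD | NP) [','] V NP. Return a tree, or None."""
    tokens = text.lower().replace(",", " , ").split()
    subject = parse_coord(tokens, 0)
    if subject is None:
        return None
    tree, i = subject
    if i < len(tokens) and tokens[i] == ",":
        i += 1  # optional comma before the verb
    if i < len(tokens) and LEXICON.get(tokens[i]) == "V":
        obj = parse_np(tokens, i + 1)
        if obj is not None and obj[1] == len(tokens):
            return ("S", tree, ("VP", tokens[i], obj[0]))
    return None


if __name__ == "__main__":
    print(parse_sentence("Hungarian Right, Center and Far, Make Gains"))
```

With "far" in the lexicon as a noun, the headline does get a parse, but it is the three-conjunct reading (Hungarian Right, Center, and Far as separate subjects), which is exactly the misreading several commenters report. A valid parse is not the same thing as the intended one.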

  11. Rubrick said,

    April 12, 2010 @ 3:30 pm

    The original headline could easily be fixed by replacing the commas with dashes. Of course, dashes take more space.

    Of course, how many people follow events in Hungaria anyway?

  12. Fernando Colina said,

    April 12, 2010 @ 3:32 pm

    I know that linguists are weird, but do you always translate NYT headlines into Spanish?
    In the spirit of your madness: the Galician translation seems more understandable: "Dereita húngara, Centro e Extremo, Make Gañou". At least it translates "right" in a directional/political sense instead of the legal sense. "Make" almost got by me since I read more English than any other language these days. The sentence also has some problems with gender concordance, but overall it is faithful to the original, and possibly less flawed: the concept that center and far are subsets of the right is clearer, at least to me.

    [(myl) Three cheers for Galician! This might be a case where less training data for statistical MT is actually better… Anyhow, I picked Spanish just because I figured most LL readers would be able to evaluate the results.]

  13. Mark F said,

    April 12, 2010 @ 4:33 pm

    That headline has a very NYT feel to it. I think I've seen something like "NP, both X and Y, VP" before. Or maybe it's just "NP, [discursive content], VP". Does anyone else find that headline particularly Timesish?

  14. Andrew (not the same one) said,

    April 12, 2010 @ 6:52 pm

    Marion Crane: It's perfectly possible for three factions to make gains if there are, say, nine factions altogether – which is not unknown in European politics.

  15. Fernando Colina said,

    April 12, 2010 @ 10:02 pm

    myl: I have noticed that about Google Translate's algorithm: More data sometimes produces worse results. What is more annoying, though, is that results may vary from one try to the next.

    By the way, I propose "to googlelate" as a portmanteau for "to Google translate." I will not google it, since if I do I'll find out that somebody else has thought of it before me.

  16. Alen Mathewson said,

    April 13, 2010 @ 8:03 am

    Perhaps it's not totally relevant, but it tickled me so may I draw your attention to a headline in The Times (of London) yesterday which stated; "Government ducks vote on South African power station". The Netherlands has a Partij voor de Dieren, but as far as I'm aware, even they don't suggest that animals should have a vote on any given issue of the day.

  17. bulbul said,

    April 13, 2010 @ 8:52 am

    a soulless automaton,

    My guess is that the core of Google Translate runs some sort of context-based, statistical inference AI techniques …
    Well, duh. Ever heard of statistical as opposed to rule-based machine translation? The latter (a by-product of the naive optimism of the generative linguistics age) has been largely abandoned in favor of the former.
    What strikes me here is the total obliviousness to the context. Surely if I encounter words like "party" and "elections" in company of "center-right", I will know immediately what's what.

    [(myl) There are two different sorts of "context" at play here — there's the phrasal collocation (word sequence) context, and then there's the bag-of-words (lexical histogram) context. Unlike document-retrieval algorithms, modern statistical MT systems are generally more sensitive to the local word-sequence context than to the larger-scale word-histogram context. And in this case, a crucial stretch of the word sequence ( "center and far make gains") is quite unlikely to have occurred in any training material.]
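
The two kinds of context distinguished in the reply above can be shown side by side in a few lines. This is a minimal sketch, not a description of any real MT system; the tokenizer, the window size of two, and the function names are assumptions chosen for illustration.

```python
import re
from collections import Counter

HEADLINE = "Hungarian Right, Center and Far, Make Gains"
LEDE = (
    "Hungary's center-right opposition party won first-round parliamentary "
    "elections here on Sunday, while a far-right party, whose black-clad "
    "paramilitary extremists evoke the Nazi era, made significant gains."
)


def tokenize(text):
    """Lowercase and keep alphabetic runs only (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def local_window(tokens, target, k=2):
    """Word-sequence context: the target plus k words on each side."""
    i = tokens.index(target)
    return tokens[max(0, i - k): i + k + 1]


def bag_of_words(tokens):
    """Lexical-histogram context: word counts with all order discarded."""
    return Counter(tokens)


if __name__ == "__main__":
    # What a phrase-based system mostly sees around "far" in the headline.
    print(local_window(tokenize(HEADLINE), "far"))
    # What a document-level histogram would offer instead.
    print(bag_of_words(tokenize(LEDE)).most_common(5))
```

The window around "far" in the headline is the stretch "center and far make gains", which carries no political vocabulary at all. The histogram of the lede, by contrast, contains "party" and "elections", the cues a human reader uses to disambiguate.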

  18. bulbul said,

    April 13, 2010 @ 2:31 pm

    Mark,

    thanks. But how is "center and far make gains" the crucial phrase? It seems we're dealing with the old problem of what is actually a word.

    [(myl) It's "far" and the two words on each side — if we're focusing on why "far" might have been rendered as "la fecha".]

    So just for laughs, I tried to translate the passage into Slovak using both Google and Bing (full disclosure: I have done some work for the people who develop Bing Translator):

    Bing:

    Strana maďarského centrum-právo námietky získal prvé kolo parlamentných volieb tu v nedeľu, zatiaľ čo ďaleko právo strany, ktorých čierna-Plátované polovojenského extrémistov evokovať nacistickej éry, sa značné zisky.

    Big fail: "center-right" = "centrum-právo" which back-translates as "center (noun)" and "right (noun, as in 'a right to')", while "far-right" = "ďaleko-právo" which back-translates to "far (adverb)" and again, "right (noun, as in 'a right to')". The translation totally omits "won"; "make significant gains" they got half-way right (words in bold), but the reflexive pronoun "sa" and the comma made a mess of it (I've seen this before, I guess I should report it…).

    Google:

    Maďarsko je stredo-pravá opozičná strana vyhrala prvé kolo parlamentných volieb-tú v nedeľu, zatiaľ čo ďaleko-pravá strana, ktorej black-odetý polovojenskej extrémistov evokujú nacistickej éry, významný nárast.

    "stredo-pravá (adj) opozičná (adj) strana (n)" is the perfect Slovak equivalent of "center-right opposition party", so points for Google. Even "ďaleko-pravá (adj) strana" is close enough, although normally one would speak of "krajne pravicová strana" in journospeak. "Won" they got right and "významný (adj) nárast (n)" is good enough for "significant gains", but the verb ("zaznamenala", most likely) is missing. Deduct points for interpreting "Hungary's" as "Hungary is" (Bing got that one halfway right by using the adjective, but in the wrong case).

    A few more observations on "center-right" and "far-right":
    Finnish – Google got both right (keskusta-oikeistolainen (adj.N) / äärioikeistolaisen (adj.GEN)), Bing struck out;
    Hungarian – Google got both right (jobbközép / szélsőjobboldali), Bing only got the first one right;
    Dutch – Google both (centrum-rechtse / extreem-rechtse) Bing only the second;
    Croatian – Google both, though the syntax is not right (they have "desnog centra stranka" which is technically ok, but weird and should be "stranka desnog centra", same for "krajnje desnice stranke"), Bing doesn't have Croatian;
    Czech is essentially the same story as Slovak;
    Maltese – half/half for Google, since for the first one, they got the words right, but either the syntax is not right ("parti l-Ungerija oppożizzjoni ċentru-lemin" should be something like "Il-partit taċ-ċentru-lemin (n) tal-oppożizzjoni") or they should have gone with the adjective "ċentru-lemini". "Parti ferm tal-lemin" sounds kosher, but I'm not sure it's the right way of putting it in Maltese;
    Swedish – both for Google with some definiteness/declination issues (borgerliga (adj) / högerextrema(adj)), Bing got far-right right (extremhögern (n, as far as I can tell));
    Polish – both for Google (centroprawicowej / skrajnie prawicowa), none for Bing;
    Russian – both for Google (правоцентристская / крайне правые), Bing only got the second one right (крайне правых). That's interesting, because Russian doesn't count as one of the languages with less training data.

  19. bulbul said,

    April 13, 2010 @ 7:36 pm

    Mark,
    It's "far" and the two words on each side — if we're focusing on why "far" might have been rendered as "la fecha".
    Oh right, sorry, I totally forgot about that.
    In any case, they appear to have fixed it. The Google translation into Spanish now reads "de centro-derecha de Hungría partido de la oposición ganó las elecciones parlamentarias …" Or did someone just propose a better translation?
    And FYI, for Spanish, Bing Translator comes on top:

    Partido de la oposición de centro-derecha de Hungría ganó la primera ronda elecciones parlamentarias aquí el domingo, mientras que un partido de extrema derecha, cuyos extremistas paramilitares vestidos de negro, evocan la era Nazi, hizo importantes avances.

  20. Marion Crane said,

    April 14, 2010 @ 2:28 pm

    Andrew (not the same one): That is true (I live in a country that could do with fewer political parties myself), but the headline made me think that the three parties mentioned were the only parties involved (I am not familiar with Hungarian politics). The weird construction just set me completely on the wrong foot, not just about its immediate meaning but the consequences as well. Once I figured out how it was to be interpreted, I lost my confusion about political gains as well.

  21. Marion Crane said,

    April 14, 2010 @ 2:36 pm

    Also, using Google translate to turn this headline into Dutch has the same problem as the English version, and adds further confusion:

    Hongaarse rechts, midden en Verre, Zorg Winsten

    I see bulbul already tried out 'centre-right' and 'far-right', but I'm more concerned about the phantom 'zorg' in there. Unless it's trying to go for 'zorgen voor winst' but fails hopelessly in rendering it correctly, not to mention that that is not what the headline means.

  22. Elek Mathe said,

    April 17, 2010 @ 3:33 am

    I'm Hungarian and I was able to parse the headline, but only because I'm familiar with the events described. So if the target audience of the New York Times is Hungarians who read English language newspapers, then they did a good job.

  23. Ray Dillinger said,

    May 14, 2010 @ 2:36 pm

    I'm a computer linguist. I've worked on a lot of projects at several different companies that do interesting things with language. I do not work at Google. It's hard for me to say things that are both true and useful about the current state of computational linguistics in limited space, but I'll try.

    Every time somebody proposes a new better translation in Google translate, Google gets a correspondence pair for those two languages. Using such correspondence pairs (in thousands or hundreds of thousands) they can train their learning statistical method, whatever it is, to try to produce instances of one from corresponding instances of the other.

    They do …. okay, I guess. It's startling how well they're doing translation considering their avowed policy of solely using statistical methods on massive datasets rather than the time and effort of human beings drawing paychecks to craft explicit translation rules. The so-called fourth paradigm is showing some fruit.

    A purely grammar based approach is fairly fragile. If you keep adding rules to cover special cases that you encounter in interactively written language, it will bloat to the point where it produces millions of possible interpretations of some inputs. If you produce a grammar that allows you to parse with less ambiguity (say, dozens rather than millions) it will reject up to half your inputs as unparseable.

    Virtually all systems which attempt parsing therefore use some statistical methods. Even strictly grammar-based parsers have to annotate each rule with a frequency or preferability or transition cost or whatever they call it, which they can then use to produce only the few most likely translations using the A* algorithm and then stop. But Google translate, according to what I've read, is nearly unique in relying exclusively on statistical methods. Every other serious translation effort I'm aware of, no matter how statistics-oriented, even if there's no formal grammar at all, at least has someone sit down and identify the structure words and annotate some corpuses with parts of speech.

    At Google, they come down very hard on the side of the line that believes even experts are often wrong about their subject matter and that statistical methods alone can find what's there rather than what someone expects to find. As linguists, I suppose you could consider them strictly observationist rather than prescriptionist. Which is roughly the same as counting them as linguists who actively reject all models of and theories about language beyond statistics.

    I don't know if it will keep getting better. I don't know if the problems they're encountering with large example sets are merely technical or if they represent fundamental limitations of the current approach. I do know that it is interesting to watch.

  24. Baltazar said,

    October 7, 2010 @ 4:07 am

    I "googled" the word "observationist" by entering a search for: defintion of observationist. I understand the implicit meaning of the word, but it is considered a slang term. Personally it makes no sense to me why the word has not been added to the English lexicon.
