Amy was found dead in his apartment


I'm spending three days in Tampa at the kick-off meeting for DARPA's new BOLT program. Today was Language Sciences Day, and among many other events, there was a "Semantics Panel", in which half a dozen luminaries discussed ways that the analysis of meaning might play a role again in machine translation. The "again" part comes up because, as Kevin Knight observed in starting the panel off, natural language processing and artificial intelligence went through a bitter divorce 20 years ago. ("And", Gene Charniak added, "I haven't spoken to myself since.")

The various panelists had somewhat different ideas about what to do, and the question period uncovered a substantially larger range of opinions represented in the audience. But it occurred to me that there's a simple and fairly superficial kind of semantic analysis that is not used in any of the MT systems that I'm familiar with, to their considerable detriment — despite the fact that algorithms with decent performance on this task have been around for many years.

The kind of analysis that I have in mind is determining pronoun co-reference. To illustrate the point, I took a look at one of the major human-interest stories of the day, the release of the coroner's report on the death of Amy Winehouse. I found and read several stories on this topic in French, Spanish, and Italian, and ran them through Google Translate.

In general, GT did a terrific job — the stories are generally understandable and in some cases even readable. But the translation of pronouns is lousy.

The French nominative pronoun elle is translated correctly as "she" in all the cases that I saw. But the possessive pronouns sa and son, which agree in gender with their head noun rather than with their referent, are almost always translated wrong.

It might not be surprising that this is true when the possessive pronouns happen to agree with masculine heads (though really, this shouldn't make any difference). Thus "Amy Winehouse est morte d'un abus d'alcool, selon les conclusions du coroner", Le nouvel Observateur 10/26/2011:

La chanteuse britannique Amy Winehouse est morte des suites d'un empoisonnement à l'alcool, selon les conclusions de l'officier de police judiciaire chargé d'enquêter sur les causes de son décès en juillet dernier.

British singer Amy Winehouse is dead following a heavy drinking, according to the findings of the judicial police officer charged with investigating the causes of his death last July.

L'interprète de "Rehab", qui connaissait des problèmes de dépendance à l'alcool et à la drogue depuis des années, a été retrouvée morte dans son lit, à son domicile londonien, le 23 juillet. Elle avait 27 ans.

The interpreter of "Rehab", who knew the problems of addiction to alcohol and drugs for years, was found dead in his bed at his home in London, July 23. She was 27 years old.

But it's also true when the possessive pronoun's agreement is feminine — presumably because news articles are more often about males than females. Thus "Amy Winehouse a succombé à un abus d'alcool", Le Figaro 10/26/2011:

L'enquête sur le décès en juillet de la chanteuse britannique attribue sa disparition à l'absorption de doses d'alcool élevées après des semaines d'abstinence. Elle présentait un taux d'alcoolémie plus de cinq fois supérieur à la limite légale.

The inquest into the death in July of the British singer attributed his death to the absorption of high doses of alcohol after weeks of abstinence. She had a blood alcohol level more than five times the legal limit.

Trois mois après sa disparition, les circonstances du décès d'Amy Winehouse sont enfin connue.

Three months after his death , the circumstances of the death of Amy Winehouse is finally known.

Elsewhere in the same story, the gender-neutral plural possessive ses is inappropriately translated as "its":

Pour mémoire, à 200 mg d'alcool, un individu commence à perdre ses reflexes, le seuil d'une consommation potentiellement fatale étant 350 mg.

For the record, 200 mg of alcohol, a person begins to lose its reflexes, the threshold of a potentially fatal consumption is 350 mg.
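The mismatch in the French examples above can be made concrete with a minimal sketch. It assumes a toy model in which the French third-person possessive is chosen from the gender and number of the possessed noun, while the English one is chosen from the gender of the possessor. The function names and gender codes are hypothetical, invented for illustration, and have nothing to do with how Google Translate works internally.

```python
# Toy illustration only: hypothetical helper names, simplified grammar.

def french_possessive(head_gender: str, head_number: str,
                      vowel_initial: bool = False) -> str:
    """French 3rd-person-singular possessive: depends only on the head noun."""
    if head_number == "pl":
        return "ses"
    if head_gender == "f" and not vowel_initial:
        return "sa"
    # Masculine heads, and feminine heads that begin with a vowel sound.
    return "son"


def english_possessive(referent_gender: str) -> str:
    """English possessive: depends only on the referent (the possessor)."""
    return {"f": "her", "m": "his", "n": "its"}.get(referent_gender, "their")


# Head nouns from the quoted articles: (noun, gender, number, English gloss).
heads = [
    ("décès", "m", "sg", "death"),
    ("disparition", "f", "sg", "death"),
    ("lit", "m", "sg", "bed"),
]

for noun, gender, number, gloss in heads:
    fr = french_possessive(gender, number)
    en = english_possessive("f")  # the possessor is Amy Winehouse throughout
    print(f"{fr} {noun} -> {en} {gloss}")
```

The point of the sketch is that nothing in the French form carries the information English needs: "son décès" and "sa disparition" both come out as "her death" once the possessor is known, and neither can be translated correctly from the phrase alone.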

The situation in the translations of Spanish articles is similar, except that the overall translation seems to be somewhat worse, and su/sus is sometimes translated as "his", and sometimes as "their". Thus "Amy Winehouse murió por exceso de alcohol", Univision 10/26/2011:

Amy fue encontrada sin vida en su departamento de Londres al los 27 años y, enseguida, su partida se volvió noticia y conmocionó tanto a sus fans, como a sus colegas.

Amy was found dead in his apartment in London at the age of 27 and soon after, he left and became news so shocked his fans, and their colleagues.

And similarly in Italian —  "L'errore di Amy Winehouse `E' stata morte accidentale'", La Repubblica 10/26/2011:

Nel suo appartamento sono state trovate tre bottiglie di vodka, due grandi e una piccola. Ma i test tossicologici hanno confermato l'assenza di sostanze illegale. I risultati dell'inchiesta danno dunque ragione alla famiglia della cantante, che sosteva avesse smesso di drogarsi e che ad ucciderla fosse stato l'alcol.

In his apartment were found three bottles of vodka, two large and one small. But the tests have confirmed the absence of illegal substances. The survey results, therefore, give reason to the family of the singer, who was standing had stopped using drugs to kill her and that he had been drinking.

Note that in the last example, the last English "he" corresponds to a phrase in which there is no visible pronoun at all, a circumstance that would be even more common in translating Chinese.

I don't in any way mean to suggest that determining pronoun co-reference is a completely solved problem, or that reference resolution alone would be enough to determine the correct translation, or that it's obvious how to integrate such analysis into the architecture of today's typical phrase-based statistical MT engines. But it seems to me like a relatively accessible set of problems that more people should be working on.
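To give a rough sense of the kind of analysis involved, here is a minimal sketch of a recency-based antecedent chooser. Everything in it is an assumption made for illustration: the `Mention` class, the gender and number labels (which would have to come from some upstream lexicon or tagger), and the "most recent singular person" heuristic. It is far cruder than the published co-reference algorithms, and it is not a proposal for how to wire such a component into a phrase-based MT engine.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Mention:
    text: str
    gender: str  # "f", "m", or "n" (assumed to be supplied by an upstream step)
    number: str  # "sg" or "pl"


def resolve_possessor(previous: List[Mention]) -> Optional[Mention]:
    """Return the most recent singular mention with a human gender, if any."""
    for mention in reversed(previous):
        if mention.number == "sg" and mention.gender in ("f", "m"):
            return mention
    return None


def choose_english_possessive(previous: List[Mention]) -> str:
    """Pick "her" or "his" from the resolved antecedent, else fall back to "their"."""
    antecedent = resolve_possessor(previous)
    if antecedent is None:
        return "their"
    return "her" if antecedent.gender == "f" else "his"


# Spanish example: "Amy fue encontrada sin vida en su departamento ..."
print(choose_english_possessive([Mention("Amy", "f", "sg")]))  # her

# French example: the officer is mentioned after Amy Winehouse and before "son décès".
french_context = [
    Mention("Amy Winehouse", "f", "sg"),
    Mention("l'officier de police judiciaire", "m", "sg"),
]
print(choose_english_possessive(french_context))  # his
```

The second call shows why recency alone is not enough: in the Nouvel Observateur sentence it picks the officer and reproduces exactly the "his death" error, so a usable resolver would also need syntactic and discourse cues.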



15 Comments

  1. Chris said,

    October 26, 2011 @ 6:50 pm

    I'm glad to see DARPA sponsoring this, though I'm generally disappointed with the measures of MT performance. Are they going with the BLEU/NIST metrics? I know Doug Jones at MIT has done work on measuring MT performance using the Interagency Language Roundtable skill level assessments, but imho, those are incoherent, vague, wishy-washy, redundant, um, and a whole host of other things a set of measures should never ever be. How much discussion of assessment measures has there been so far?

    [(myl) I don't think that anyone could fairly accuse the MT community of inadequate discussion of metrics. See e.g. here, here, here, here, etc. ]

  2. John said,

    October 26, 2011 @ 7:53 pm

    And the translation of that last bit of Italian also seems to have had a mess of trouble (with the VS word order among other things):

    "[the family] that insisted that she had stopped using drugs and that it had been the alcohol to kill her."

    (Being a bit "literal" or archaic, as you like, with that "to kill her".)

    By my reckoning, it:

    – misses sosteva introducing a conjunctive clause with gapped "che"
    – misses that "e che" introduces a parallel clause and puts the following "ad ucciderla" with drogarsi (inexcusable really)
    – totally misconstrues "fosse stato l'alcol", which doesn't need a pronoun because the subject is there (l'alcol), as is the object (attached to the complementary infinitive, ucciderLA).

  3. marie-lucie said,

    October 26, 2011 @ 8:46 pm

    More about the translations:

    French:
    – La chanteuse britannique Amy Winehouse est morte des suites d'un empoisonnement à l'alcool, …

    – British singer Amy Winehouse is dead following a heavy drinking, …

    rather: … as a consequence of alcohol poisoning

    Spanish : The last part of the Spanish translation is gibberish:

    – Amy fue encontrada sin vida en su departamento de Londres al los 27 años y, enseguida, su partida se volvió noticia y conmocionó tanto a sus fans, como a sus colegas.

    – Amy was found dead in his apartment in London at the age of 27 and soon after, he left and became news so shocked his fans, and their colleagues.

    rather: … right away, her departure (= her death) became news and shocked her fans as well as her colleagues.

  4. Chris Callison-Burch said,

    October 26, 2011 @ 10:19 pm

    More MT metrics evaluations here, here, here, here, and here. We are nothing if not thorough.

  5. corey said,

    October 27, 2011 @ 8:41 am

    re: "… according to the findings of the judicial police officer charged with investigating the causes of his death last July."

    It seems this poor dead officer is overworked… having to investigate Amy Winehouse's death in addition to his own death…. :P

  6. Boris said,

    October 27, 2011 @ 9:08 am

    Maybe slightly off topic, I was amused that my mother's new GPS device, when set to Russian, seems to think that the word for thousand in Russian is masculine instead of feminine. It's funny because everything else is rendered perfectly in Russian (I learned a few highway related terms I didn't know when I left Russia at the age of 11). Surely that must be a much simpler task than translation of pronouns.

  7. bianca steele said,

    October 27, 2011 @ 9:28 am

    The Baldwin paper you linked seems to assume the number of possible antecedents has already been reduced by a kind of structural analysis (i.e., non-stochastically, determining the grammar of the text). Wouldn't integrating that technique into statistically based methods result in an enormous number of nouns to choose from?

  8. Ross Presser said,

    October 27, 2011 @ 10:53 am

    There are times when the pronoun's referent isn't even in the sentence:

    "Bill came home from New York last night. Alice was found dead in his apartment."

    where the apartment belongs to Bill. How can machine translation ever cope with this, unless it uses larger logical blocks than sentences?

  9. Dan T. said,

    October 27, 2011 @ 11:02 am

    To determine the proper pronoun, the machine would have to not only analyze the complete writing (across multiple sentences), but also be able to know or determine the sex of the referent, which can itself be tricky. "Amy" or "Alice" are usually female, but what about Alice Cooper? "Pat" or "Alex" might be either sex. "Kim" or "Kelly" are usually female in current-day America, but were common male names in other times and places.

  10. Q. Pheevr said,

    October 27, 2011 @ 11:20 am

    @Dan T. – That seems to me like the sort of thing where it would be really useful to combine reference resolution with the kind of statistical information Google Translate and its ilk already use. Rather than guessing that someone named Amy is likely to be female, the algorithm could pick up on the fact that references to Amy Winehouse in English text tend to go along with instances of she, her, hers, and herself more than they tend to go along with instances of he, him, his, and himself.

  11. The Ridger said,

    October 27, 2011 @ 5:30 pm

    GT had a real fit with this line from Esenin in which he's illustrating just how long ago a memory of his grandmother and a kitten was:

    Все прошло. Потерял я бабку,
    А еще через несколько лет
    Из кота того сделали шапку,
    А ее износил наш дед.

    (It's all gone. I lost nana, / And a few years later / That cat was made into a hat / And our grandfather wore it out.)

    The salient bit is the very last line, "A yeyo iznosil nash ded" which is literally "and it(obj) wore-out our grandfather(subj)". In this case "it" is feminine, because it refers back to the hat (shapka). Also of interest is the emotive ordering of "kota togo" which is "cat that" (in genitive) rather than the neutral "togo kota". GT gave this:

    A few years later
    Of the cat that made a hat,
    And our worn out her grandfather.

    It's striking how much the little things trip up the machines.

  12. David said,

    October 27, 2011 @ 6:11 pm

    I ran the story in GT for Swedish and it did fine with the personal pronouns. I don't use GT for Swedish a lot, but I was very impressed with some of the things it came up with. The long sentence "No cause of death was determined by an initial autopsy, and though she had struggled with addictions to drugs and alcohol throughout her life, her family said over the summer that a toxicology report found no illegal substances in her system." was translated as "Ingen dödsorsak fastställdes genom en första obduktion, och om hon hade kämpat med missbruk av droger och alkohol under hela hennes liv, sa hennes familj under sommaren att en toxikologisk rapport funnit några olagliga substanser i hennes system."

    GT manages the s-passive ("fastställde-s" = "was determined", rather than a word-for-word "blev fastställd" or an erroneous "var fastställd"), V2 word order when fronting a subordinate clause ("sa hennes familj" = lit. "said her family") and rendering the English past "found" as a bare supine in a subordinate clause ("sa… att en… rapport funnit" = "said… that a… report found", not "sa att en rapport fann"). However, GT goes on to render "no illegal substances" as "några olagliga substanser" ("SOME illegal substances"): somewhere, a negation is missing. The translation wasn't free of errors, but they seemed more often to be lexical than grammatical.

    Anyway, GT did fine with the personal pronouns. So to test it I typed in the sentence "Amy was found dead in her apartment" which was correctly rendered as "Amy hittades död i SIN lägenhet" = the reflexive possessive. (Not "hennes", "her", since that implies that the apartment belonged to some woman other than Amy.) However, when you change "her" to "his", the translation stays the same. Even the sentence "She was found dead in his apartment" is rendered as "Hon hittades död i sin lägenhet". Clearly, the treatment of possessive pronouns is vexing GT. (Though at least it's "sin", the common-gender singular, which agrees with "lägenhet", and not neuter "sitt" or plural "sina".)

  13. marie-lucie said,

    October 27, 2011 @ 10:19 pm

    French:

    …selon les conclusions de l'officier de police judiciaire chargé d'enquêter sur les causes de son décès …

    … according to the findings of the judicial police officer charged with investigating the causes of his death…

    This officer has to be the coroner, a position which does not have an equivalent in many other justice systems. The French article does use coroner in the title, and defines it in the body of the text, where the circumlocution is not recognized by the translating software.

  14. YankeeTranslator said,

    October 28, 2011 @ 11:27 am

    Well, in French-Canadian culture (and I'm assuming French culture as well), Amy is a unisex name, so perhaps that affected the ability of the MT to determine gender?

  15. marie-lucie said,

    October 28, 2011 @ 7:08 pm

    YT: in French-Canadian culture (and I'm assuming French culture as well), Amy is a unisex name

    ??????

    Perhaps you are confusing Amy with Aimé (masc) and Aimée (fem), which are pronounced the same?
