I'm spending three days in Tampa at the kick-off meeting for DARPA's new BOLT program. Today was Language Sciences Day, and among many other events, there was a "Semantics Panel", in which half a dozen luminaries discussed ways that the analysis of meaning might again play a role in machine translation. The "again" part comes up because, as Kevin Knight observed in starting the panel off, natural language processing and artificial intelligence went through a bitter divorce 20 years ago. ("And", Gene Charniak added, "I haven't spoken to myself since.")
The various panelists had somewhat different ideas about what to do, and the question period uncovered a substantially larger range of opinions represented in the audience. But it occurred to me that there's a simple and fairly superficial kind of semantic analysis that is not used in any of the MT systems that I'm familiar with, to their considerable detriment — despite the fact that algorithms with decent performance on this task have been around for many years.
The kind of analysis that I have in mind is determining pronoun co-reference. To illustrate the point, I took a look at one of the major human-interest stories of the day, the release of the coroner's report on the death of Amy Winehouse. I found and read several stories on this topic in French, Spanish, and Italian, and ran them through Google Translate.
Overall, Google Translate (GT) did a terrific job: the stories are generally understandable and in some cases even readable. But the translation of pronouns is lousy.
The French nominative pronoun elle is translated correctly as "she" in all the cases that I saw. But the possessive pronouns sa and son, which agree in gender with their head noun rather than with their referent, are almost always translated wrong.
It might not be surprising that this is true when the possessive pronouns happen to agree with masculine heads (though really, this shouldn't make any difference). Thus "Amy Winehouse est morte d'un abus d'alcool, selon les conclusions du coroner", Le nouvel Observateur 10/26/2011:
La chanteuse britannique Amy Winehouse est morte des suites d'un empoisonnement à l'alcool, selon les conclusions de l'officier de police judiciaire chargé d'enquêter sur les causes de son décès en juillet dernier.
British singer Amy Winehouse is dead following a heavy drinking, according to the findings of the judicial police officer charged with investigating the causes of his death last July.
L'interprète de "Rehab", qui connaissait des problèmes de dépendance à l'alcool et à la drogue depuis des années, a été retrouvée morte dans son lit, à son domicile londonien, le 23 juillet. Elle avait 27 ans.
The interpreter of "Rehab", who knew the problems of addiction to alcohol and drugs for years, was found dead in his bed at his home in London, July 23. She was 27 years old.
But it's also true when the possessive pronoun's agreement is feminine — presumably because news articles are more often about males than females. Thus "Amy Winehouse a succombé à un abus d'alcool", Le Figaro 10/26/2011:
L'enquête sur le décès en juillet de la chanteuse britannique attribue sa disparition à l'absorption de doses d'alcool élevées après des semaines d'abstinence. Elle présentait un taux d'alcoolémie plus de cinq fois supérieur à la limite légale.
The inquest into the death in July of the British singer attributed his death to the absorption of high doses of alcohol after weeks of abstinence. She had a blood alcohol level more than five times the legal limit.
Trois mois après sa disparition, les circonstances du décès d'Amy Winehouse sont enfin connue.
Three months after his death , the circumstances of the death of Amy Winehouse is finally known.
Elsewhere in the same story, the gender-neutral plural possessive ses is inappropriately translated as "its":
Pour mémoire, à 200 mg d'alcool, un individu commence à perdre ses reflexes, le seuil d'une consommation potentiellement fatale étant 350 mg.
For the record, 200 mg of alcohol, a person begins to lose its reflexes, the threshold of a potentially fatal consumption is 350 mg.
The situation in the translations of Spanish articles is similar, except that the overall translation seems to be somewhat worse, and su/sus is sometimes translated as "his", and sometimes as "their". Thus "Amy Winehouse murió por exceso de alcohol", Univision 10/26/2011:
Amy fue encontrada sin vida en su departamento de Londres al los 27 años y, enseguida, su partida se volvió noticia y conmocionó tanto a sus fans, como a sus colegas.
Amy was found dead in his apartment in London at the age of 27 and soon after, he left and became news so shocked his fans, and their colleagues.
And similarly in Italian — "L'errore di Amy Winehouse `E' stata morte accidentale'", La Repubblica 10/26/2011:
Nel suo appartamento sono state trovate tre bottiglie di vodka, due grandi e una piccola. Ma i test tossicologici hanno confermato l'assenza di sostanze illegale. I risultati dell'inchiesta danno dunque ragione alla famiglia della cantante, che sosteva avesse smesso di drogarsi e che ad ucciderla fosse stato l'alcol.
In his apartment were found three bottles of vodka, two large and one small. But the tests have confirmed the absence of illegal substances. The survey results, therefore, give reason to the family of the singer, who was standing had stopped using drugs to kill her and that he had been drinking.
Note that in the last example, the last English "he" corresponds to a phrase in which there is no visible pronoun at all, a circumstance that would be even more common in translating Chinese.
I don't in any way mean to suggest that determining pronoun co-reference is a completely solved problem, or that reference resolution alone would be enough to determine the correct translation, or that it's obvious how to integrate such analysis into the architecture of today's typical phrase-based statistical MT engines. But it seems to me like a relatively accessible set of problems that more people should be working on.
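To make the French case concrete, here is a minimal sketch of the last step only: choosing the English possessive once the antecedent is known. It assumes that some coreference resolver has already linked the pronoun to its referent, which is the hard part and is not shown. The names `Antecedent` and `english_possessive` are hypothetical, invented for this illustration, and nothing here reflects how Google Translate or any other MT system actually works.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Antecedent:
    """A resolved referent (hypothetical structure; the resolver is assumed)."""
    text: str
    gender: Optional[str]   # "f", "m", or None when the referent's gender is unknown
    animate: bool = True

# French third-person possessives. son/sa/ses agree with the noun they modify,
# so their form carries no information about the possessor's gender.
SINGULAR_POSSESSOR = {"son", "sa", "ses"}
PLURAL_POSSESSOR = {"leur", "leurs"}

def english_possessive(fr_pronoun: str, antecedent: Antecedent) -> str:
    """Pick the English possessive from the referent, not from the French form."""
    form = fr_pronoun.lower()
    if form in PLURAL_POSSESSOR:
        return "their"
    if form not in SINGULAR_POSSESSOR:
        raise ValueError(f"not a third-person possessive: {fr_pronoun!r}")
    if not antecedent.animate:
        return "its"
    # Fall back to singular "their" when the referent's gender is unknown.
    return {"f": "her", "m": "his"}.get(antecedent.gender, "their")

winehouse = Antecedent("Amy Winehouse", gender="f")
person = Antecedent("un individu", gender=None)

print(english_possessive("son", winehouse))  # "son lit"        -> her
print(english_possessive("sa", winehouse))   # "sa disparition" -> her
print(english_possessive("ses", person))     # "ses reflexes"   -> their
```

The point of the sketch is that the lookup ignores whether the French form is son, sa, or ses; only the referent matters. Spanish su/sus would need an extra step, since it is also ambiguous between a singular and a plural possessor.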