Language Log

Language-related efforts to help out in Haiti

January 22, 2010 @ 6:30 pm · Filed by Chris Potts under Computational linguistics, Linguistics in the news

Posting on behalf of Phil Resnik:

This post brings together a bunch of news about language-related efforts to help out in Haiti:

First, Jeff Allen, an expert in language technology for Haitian Creole and in language resources more generally, has arranged with CMU's Language Technologies Institute to release, under unusually friendly terms of use, relevant language resources collected over a number of years. These include a set of speech data and, probably more immediately relevant, parallel data (translations of phrases and sentences in the medical domain) in English and Haitian Creole.

Second, a group called Crisis Commons is organizing "CrisisCamp" events in a variety of cities, where volunteers get involved in "activities such as crisis mapping, data and RSS feed aggregation. In addition, people with specialized skills such as translation, computer programing and literacy advocates are encouraged to participate." Of particular interest to Language Log readers, they've got a Language and Translation team involved in a variety of efforts, one of which is trying to get machine translation capabilities up and running. Results so far are pretty rudimentary, e.g. the demo:

Source: Tanpri nou bezwen manje dlo ak tant pou nou ka kouche tanpri voye je nou gade kafou soutou nan z?n b?ten ak titus tanpri nou bezwen manje paske kay nou

Target: please our need meal river and aunt in our quart lay please broadcast eye our regard intersection unknown in unknown unknown and unknown please our need meal because dwelling our

Human translation: Please we need food and water and tents so that we can sleep. Please go look at Carrefour particularly in the area of Titus and Betem (or Boten). We need food because – Incomplete

Information collected by Crisis Commons on language and translation can be found here. Christopher Taylor has been collecting parallel text, aiming to get a statistical MT system up and running quickly, which might, one hopes, do better than a purely dictionary-based approach. In either case, of course, nonstandard orthography, spelling errors, etc., are going to be a challenge. Still, maybe automatic translation can help in some ways.
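
To make concrete why a dictionary-only approach yields output like the Target line in the demo, here is a minimal sketch of word-by-word glossing. It assumes the demo does a one-to-one lookup, which is only my reading of how the Source and Target lines align; the tiny lexicon is read off that alignment and is not the actual Crisis Commons dictionary, and the function name is made up for illustration.

```python
# Toy lexicon reconstructed from the demo's Source/Target alignment.
# ASSUMPTION: the demo glosses one word at a time; this is not the real dictionary.
LEXICON = {
    "tanpri": "please",
    "nou": "our",
    "bezwen": "need",
    "manje": "meal",
    "dlo": "river",
    "ak": "and",
    "tant": "aunt",
    "kafou": "intersection",
    "paske": "because",
    "kay": "dwelling",
}


def gloss(sentence, lexicon=LEXICON, unknown="unknown"):
    """Replace each whitespace-separated word with its single dictionary gloss.

    No reordering, no context, no spelling normalization: a word missing from
    the lexicon (or misspelled, or mangled by an encoding error) comes out as
    the `unknown` placeholder.
    """
    return " ".join(lexicon.get(word.lower(), unknown) for word in sentence.split())


print(gloss("Tanpri nou bezwen manje dlo ak tant"))
print(gloss("kafou soutou nan z?n"))
```

Each word gets exactly one gloss regardless of context, so "tant" surfaces as "aunt" rather than "tent", and anything outside the lexicon collapses to "unknown". A statistical system trained on parallel text can at least choose among alternatives in context.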

Where did the human translation above come from? The specific answer is here. The general answer is that Rob Munro of Stanford has been coordinating volunteer translators as part of an impressive broader effort involving the use of text messaging for crisis-related communication. In this Wired article, he is quoted as saying, "The total number of texts is in the thousands, and they arrive every five seconds in busy times, to every 10 minutes overnight". Munro says the average turnaround time for translation is around 10 minutes, which is really striking. In fact, it has me wondering about use cases that would make automatic MT's increase in scalability/speed worthwhile given the presumed decrease in quality. (I believe we need to spend more time on ways to combine automatic methods with human effort in order to achieve both higher quality and better scalability/lower cost, but that's a topic for another day.) For more details on Munro's efforts.

Third, the "Tweak the Tweet" effort at University of Colorado, conceived of by graduate student Kate Starbird, involves getting people to annotate the natural language in tweets using a small vocabulary of emergency-related hashtags, in order to make them more amenable to automatic processing. One of their examples:

TWEET-BEFORE: Altagrace Pierre needs help at Delmas 14 House no. 14.

TWEET-AFTER: #haiti #name Altagrace Pierre #need help #loc Delmas 14 House no. 14.

Adding structured information to unstructured tweets, particularly in tandem with geolocation, could enable a whole variety of useful applications. (Typing "Delmas 14 House no. 14, port au prince, haiti" into this function gets you the latitude and longitude for the address. Call me old fashioned, but I find this pretty astonishing, assuming of course that the info it returned is correct.) Colorado's Prof. Leysia Palen, quoted in Discovery News, concedes that it is not realistic to expect volunteers and organizations originating tweets to include these annotations. The Tweak the Tweet site has an alternative suggestion, though, which is that volunteers translate tweets into the hashtag syntax (see "How to Help"). Perhaps someone ought to contribute some money to get Mechanical Turkers doing this?
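
As a rough illustration of what "more amenable to automatic processing" could mean, here is a minimal sketch that splits a tweet in the hashtag syntax into fields. The set of field tags is an assumption based only on the single example above (#name, #need, #loc), not the full Tweak the Tweet vocabulary, and the function name is hypothetical.

```python
import re

# ASSUMPTION: only the tags seen in the example above carry a value.
# The real Tweak the Tweet vocabulary is larger.
FIELD_TAGS = {"name", "need", "loc"}


def parse_tweet(text):
    """Split a hashtag-annotated tweet into ({tag: value}, [bare topic tags])."""
    fields, topics = {}, []
    # With a capturing group, re.split returns
    # [prefix, tag1, text1, tag2, text2, ...]
    parts = re.split(r"#(\w+)", text)
    for tag, value in zip(parts[1::2], parts[2::2]):
        tag = tag.lower()
        value = value.strip()
        if tag in FIELD_TAGS and value:
            fields[tag] = value
        else:
            # Tags outside the field set (e.g. #haiti) are kept as topics;
            # any text following them is ignored in this sketch.
            topics.append(tag)
    return fields, topics


fields, topics = parse_tweet(
    "#haiti #name Altagrace Pierre #need help #loc Delmas 14 House no. 14."
)
print(fields)
print(topics)
```

Once the location sits in its own field, it can be handed to a geocoder directly, with no need to guess where the address starts and ends in free text.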

These are just a few things I've become aware of during the last day or so that might interest language folks. The level of energy, innovation, and willingness to help is a great thing to see, and I'm sure there's a lot more out there…

6 Comments

Shannon said,

January 24, 2010 @ 3:12 pm

You might also be interested in knowing that one of the team members at CU Boulder helping with Tweak the Tweet, Sarah Vieweg, has an MA in Linguistics. http://www.cs.colorado.edu/~palen/connectivIT/people_2/photo.html

William Lewis said,

January 24, 2010 @ 10:07 pm

MSR just released a translation engine for Haitian Creole very early this morning (http://microsofttranslator.com). Our focus is primarily on translating Creole into and out of English, but translations can be made to and from other languages as well. The engines were trained on much of the data at CMU (thank you, Jeff!), but other sources were used too. Much of the content at http://haiti.ushahidi.com/, which I assume is the source of the Source sentence above, is problematic for our engine due to noise (e.g., lack of punctuation, misspellings, encoding problems such as accented characters showing up as "?", etc.), but we're working on improvements.

Philip Resnik said,

January 25, 2010 @ 10:00 am

Neat to see MSR's engine up and running. Trying the example I gave above (which is particularly challenging, mind you), the MSR translator's current output is "Do we need food, water and the tent of meeting, so that we may rest your eyes be o intersection soutou in z? b? ten and titus do we need food for your". Using the Bible as one readily available source of translated training data is obviously creating some artifacts at this point ("tent of meeting"); but at least that's a step closer to the intended meaning ("tent", rather than "aunt"). I expect that this will improve as more training data become available.

The MSR translation team's blog says an API will be available. I'd like to encourage the group to make the API as rich as possible in order for others to build on the work, via MT system combination, mash-ups, etc. For example, if an output lattice were available, it would be possible to visualize imperfect translator output using something like Philipp Koehn's CAITRA tool (a descendant of Chris Callison-Burch's Linear B concept), which might make it easier for people to make sense of rough and imperfect translations.

William Lewis said,

January 30, 2010 @ 4:04 pm

Thanks for your helpful comments. We have been working on improvements to our API. Most changes have been focused on what would be most relevant to users who wish to add translation to their websites and products, but we have also been planning changes that would benefit researchers specifically. I'll let you know when we have some updates.

BTW, we have released a couple of updates to our translator since this post. The example sentence above now translates as: "Please we need food, water and tents connection please note carrfour soutou in experiencing? b? thyme and titus please we need food for our House". Still not perfect, but better (with the exception of one regression). Yes, more training data and other fixes have helped.

Philip Resnik said,

January 30, 2010 @ 5:26 pm

Thanks, William. For people's information, it looks like Google Translate's Haitian Creole alpha is out. Not to place too much emphasis on one sentence, but since we've been talking about it, here's the Google output on that sentence:

Please we need water and food tent so we can lay our eyes to please look at the intersection soutou z? Nb? Ten Titus and please eat because we need our homes

And the CrisisCommons.org efforts are continuing; more on that soon.

Kevin Brubeck Unhammer said,

January 31, 2010 @ 6:28 am

The Apertium ht-en translator gives this:

$ echo 'Tanpri nou bezwen manje dlo ak tant pou nou ka kouche tanpri voye je nou gade kafou soutou nan z?n b?ten ak titus tanpri nou bezwen manje paske kay nou' | apertium -d . ht-en
Please We need ate water with tent for us can sleep please sent our eye looked *kafou above all the *z?We *b?*ten And *titus please we need food because our house

(Unknown words are marked with a star here.)

Apertium is a rule-based (shallow-transfer) MT system, and all code and data are completely free and open source (I believe the licence of the CMU data was incompatible with the GPL, although I hope someone can prove me wrong).
