Using the wisdom of crowds to translate language

Today's Morning Edition on National Public Radio had a piece by Joel Rose on linguists' contributions to efforts to translate the Haitian language Kreyol, using the knowledge of Haitians dispersed around the world: transcript here, with a link to the audio version. This is an update on work reported on by Phil Resnik here on Language Log back in January, and in fact Phil is one of the three linguists quoted in the piece; the other two are Rob Munro (at Stanford) and Judy Klavans (at Maryland).

A couple quibbles. First, Rose refers to Kreyol as "the local Creole dialect", as if "Creole" were a language (with Kreyol as one of its dialects), while "creole" is just a name for a type of language (in particular, a language with a specific sort of history). Kreyol is often referred to as Haitian Creole (compare Hawaiian Creole), sometimes as Haitian Creole French (with a bow to its history), and sometimes (very misleadingly) as a dialect of French. Creole languages have mostly picked up their own names, unmoored from the names of their basis languages (Tok Pisin, Bislama, Gullah), though sometimes these names are derived from European words for 'creole (language)': Krio (Sierra Leone, English-based), Kreyol (Liberia, English- and French-based), and Kreyol in Haiti.

Then, Rose refers to "relatively obscure languages, such as Urdu, Pashto and Farsi", a description that will come as a surprise to the many millions of speakers of these politically important national languages.


  1. Zwicky Arnold said,

    June 22, 2010 @ 11:24 pm

    Ack, I didn't mean to leave comments set to off. Now fixed. And here's one from Jens Fiederer:

    OK, I had no trouble recognizing the names, and was even fairly successful at associating Urdu and Pashto with Pakistan and Afghanistan (without 100% certainty – Farsi I hadn't the slightest doubt about). But I asked my very bright son (home from college for the summer), and the best he could come up with was "I think they might be languages".

    But while I'm no linguist, I've had some courses and I am very well read. I'm pretty sure if you asked 50 randomly sampled Americans, you'd be lucky to get ONE who recognized them all.

  2. Dan M. said,

    June 22, 2010 @ 11:51 pm

    I'd have thought Pashto was the most obscure of those a few years ago, but now it's been in the news, what with us making a business getting speakers of it killed.

  3. Stuart said,

    June 23, 2010 @ 12:35 am

    I'm biased since Urdu was the language my father learned at school, and his childhood home was Quetta, so Pashto and Farsi were also familiar languages, but surely after several years fighting in that part of the world already this century these languages and their 150 million or more speakers can't be THAT obscure, even in the US?

  4. Mark P said,

    June 23, 2010 @ 8:42 am

    I heard the NPR piece and was surprised at the reference to those "relatively obscure" languages. Reporters are always looking for some kind of simple label, but given the current US involvement in the area, surely he could have found a better hook.

  5. Mr Punch said,

    June 23, 2010 @ 10:12 am

    Urdu is not, of course, obscure; but like Kreyol, its existence is sometimes denied.

  6. J. W. Brewer said,

    June 23, 2010 @ 11:26 am

    Whether a given language is obscure (or "relatively obscure," which is not so strong a claim) depends on whose point of view is being considered. Here, presumably the relevant POV would be that of the median NPR listener, who no doubt fancies himself more sophisticated and cosmopolitan than his median fellow citizen. (Whether the sort of enthusiastic people who read LL are aware of the languages is irrelevant to their relative obscurity in the NPR context; it's possible that 50%+ of us are aware of Dyirbal.) So you could test the accuracy of the claim by asking a bunch of median NPR listeners to start naming as many foreign languages as they could, in whatever order they happened to come to mind, w/o googling or consulting any written references. How many would name one or more of these three within the first 20 or 30? How many would name one or more before they just ran out of steam? I don't know the answers, but I wouldn't be surprised if they were consistent with the claim of relative obscurity to the speaker's intended audience. "Persian" (as a name for the majority language in Iran) might also prove less obscure than "Farsi," since it's possible that lots of non-specialists out there have never had occasion to become aware of the Burma->Myanmar switch among the learned with respect to the English word for that tongue.

  7. Chris Callison-Burch said,

    June 23, 2010 @ 11:26 am

    In the context of the workshop, relatively obscure languages referred to low-resource languages for which we (MT researchers) don't have much training data. Although statistical machine translation didn't make the radio piece, one of the applications of crowdsourced translations that we talked about at the workshop was the creation of bilingual parallel corpora.

    My interest in crowdsourced translation is for creating data for low-resource languages. So far I have experimented with using Mechanical Turk to try to solicit translations for Armenian, Azerbaijani, Basque, Bengali, Cebuano, Gujarati, Kurdish, Nepali, Pashto, Sindhi, Tamil, Telugu, Urdu, Uzbek, and Yoruba. My initial experiments were simply to determine whether there were any speakers of those languages on MTurk by asking them to translate a list of words, and asking them to self-report on what languages they speak.

    Here is a (probably badly formatted) table of results that shows the number of speakers that I had for each language (who self-reported that they spoke the language and who translated my control words correctly), the total number of words that they translated, and their location (determined by Geolocation on their IP address):

    Language     # speakers  # words translated  Where from
    Armenian     3           80                  United States (0.67), United Kingdom (0.33)
    Azerbaijani  11          1610                India (0.45), Turkey (0.18), United States (0.18), Undetermined (0.18)
    Basque       5           1740                United States (0.40), Slovakia (0.20), India (0.20), Undetermined (0.20)
    Bengali      35          4910                India (0.51), Bangladesh (0.29), Undetermined (0.09), United States (0.06)
    Cebuano      19          4620                Philippines (0.47), Canada (0.11), United States (0.11), Netherlands (0.11), United Kingdom (0.11), Korea, South (0.05), Spain (0.05)
    Gujarati     58          6160                India (0.81), Undetermined (0.09), United States (0.07)
    Kurdish      0           0
    Nepali       12          780                 India (0.50), United States (0.25), Nepal (0.17), Undetermined (0.08)
    Pashto       4           130                 India (1.00)
    Sindhi       4           110                 India (0.75), Pakistan (0.25)
    Tamil        229         56300               India (0.88), Undetermined (0.05)
    Telugu       130         30930               India (0.75), United States (0.12), Undetermined (0.12)
    Urdu         32          3630                India (0.47), Pakistan (0.34), Undetermined (0.06), United States (0.06)
    Uzbek        0           0
    Yoruba       1           20                  Nigeria (1.00)

  8. Michou said,

    June 23, 2010 @ 7:21 pm

    Computer translation using freely available data must cost much less than funding the graduate students and native linguists who want to document Haitian, for example by compiling dictionaries and writing grammatical descriptions.

    [(myl) Translation systems are not built out of "grammatical descriptions" and the like, anyhow; but in the case of Haitian, what "freely available data" do you have in mind? The total amount of parallel text involving Kreyol is small, as far as I know, and there's not even a great deal of monolingual text. This is a challenge for many languages, including quite a few with millions or tens of millions of speakers.

    The only way to change this is to enlist the communities of speakers — see here for one idea of how to start the process even in small and endangered languages without any current literacy — but it's not a problem with an obvious or easy answer.]

  9. Jason F. Siegel said,

    June 23, 2010 @ 8:48 pm

    At the Indiana University Creole Institute, we have a corpus coming out relatively soon of the Creole dialect spoken in Northern Haiti, around Cape Haitian. It won't be completely free of charge, however.

    As far as monolingual texts go, there are a great deal of them, but they have not yet been digitized. I imagine that will change with the great work being done by the Digital Library of the Caribbean.

    And speaking as a graduate student whose support has come largely for studying Haitian Creole (typing this message from Cayenne, home to many Creole speakers of Haitian and French Guianese varieties, among others), there is support out there, but we need more.

  10. Stephen Jones said,

    June 24, 2010 @ 12:53 am

    Urdu is not, of course, obscure; but like Kreyol, its existence is sometimes denied.

    The claim is that both Urdu and Hindi are what used to be called Hindustani, and that the differences between the two are basically politically motivated.

  11. Zwicky Arnold said,

    June 24, 2010 @ 9:44 am

    Embroidering on Stephen Jones's comment: The name Hindustani has an old-fashioned, whiff-of-the-Raj quality to it, and many object to it on the grounds that it's built on Hindustan (land of the Hindus) as the name of (roughly) South Asia, assigning priority to the Hindus and the language name Hindi.

    For some time, the widely accepted name of the language was the copulative compound Hindi-Urdu, roughly parallel to Serbo-Croatian. This at least gives equal treatment to two ranges of varieties within a larger linguistic domain.

    In any case, the objection (for some people) to Urdu is not just that the differences between Urdu and Hindi are "politically motivated" — meaning that the two ranges of varieties belong linguistically to a single language, with the varieties distinguished as separate languages only for political purposes — but that (in the eyes of these people) this language is Hindi. That is, these people claim that Urdu is "just a dialect of Hindi" and so doesn't exist as a separate language. This view is quite parallel to the view of Kreyol as "just a dialect of French", with no existence as a separate language.

  12. J. W. Brewer said,

    June 24, 2010 @ 1:47 pm

    One noteworthy fact about the three "relatively obscure" languages listed is that their English names are opaque: i.e., they do not signal to an Anglophone the (original) ethnic group or geographical location with which the language is associated (Farsi->Persia(n)->Iran(ian) and Pashto->Pathan->Afghan/Pakistani borderlands both being, I am assuming, opaque to most Anglophones). Even if they never had previously stopped to consider its existence, most Anglophones could infer that "Armenian" is probably a language spoken in Armenia and/or by people of Armenian ethnicity wherever located. But Urdu as language-spoken-by-some-Pakistanis is something you need to learn as arbitrarily as the-capital-of-Montana-is-Helena. And if you aren't a language buff, why would you learn it? I would hypothesize that even transparent names that refer to subnational groups/locations might have an advantage (e.g. Bengali and Javanese could be less baffling-sounding than Telugu or Cebuano). Possible counterexample might be Swahili as the possibly single-best-known language indigenous to Sub-Saharan Africa among Americans, which I would attribute to the rather puzzling/comical way in which it became a vogue language for multi-culturalist purposes among more progressive-minded Americans (including schoolteachers) from the 1960's forward. And of course it's not like Africa has many languages with non-opaque names (there's no "Nigerian" or "Senegalese," for example, and you have to know Bantu morphological patterns for "Kinyarwanda" to be transparent).

  13. John Cowan said,

    June 25, 2010 @ 5:29 pm

    Well, J.W., we could fix that problem by resuming the use of the name Persian for that language, as has historically been the case, and as the Academy of Persian Language and Literature recommends.

  14. Army1987 said,

    June 25, 2010 @ 8:10 pm

    Yeah, what the hell is wrong with calling it Persian?

  15. Stephen Jones said,

    June 25, 2010 @ 9:51 pm

    Adding to what Arnold has said, I enclose this quote from Wikipedia.
    Standard Hindi and Urdu are nearly identical in grammar and share a basic common vocabulary but differ in literary conventions and specialised vocabulary with Urdu retaining strong Persian, Arabic and Turkic influences, and Hindi relying heavily on Sanskrit. Before the Partition of British India, the terms Hindustani, Urdu and Hindi were synonymous; all covered what would be called Urdu and Hindi today

    The word 'Hindustani' is redolent of the Raj because it is mainly post-partition that attempts have been made to further distance the two dialects. Hindi-Urdu is a reasonable compromise as far as the name goes. Both Hindi and Urdu are considered separate national languages in India, though as they have different scripts that is probably a bureaucratic necessity and doesn't affect the mutual intelligibility of most variants.

    Another example where the dialect/language debate is still open, despite official political pronouncements, is Scots and English.

  16. Stephen Jones said,

    June 25, 2010 @ 9:53 pm

    And I don't think anybody is claiming French and Haitian Creole are mutually intelligible, Arnold.

  17. Max Pinton said,

    June 26, 2010 @ 3:59 pm

    I'd always been under the impression that calling Persian "Farsi" in English was sort of like calling French "français."

  18. Stephen Jones said,

    June 26, 2010 @ 4:32 pm

    The trouble is Persian is not just spoken in Persia (it's the second language of Afghanistan) and a large number of Iranians don't speak Persian.

  19. John Cowan said,

    June 26, 2010 @ 10:58 pm

    Stephen Jones: And English is not just spoken in England; indeed, only a small minority of even native speakers live in England. I'm sure there are a fair number of people in England who don't speak English either.

    Names are but names. Persian has been the name used for the language in English since 1556, and the OED's most recent quotation is from 2002, so it cannot be called obsolete.

  20. Stephen Jones said,

    June 26, 2010 @ 11:19 pm

    Point taken, John, but what about political sensitivities? In Afghanistan they insist the language be called Dari; surely calling it Persian is only going to exacerbate things.

  21. SSK said,

    June 29, 2010 @ 2:59 pm

    @Stephen Jones:
    Wikipedia says:

    In Afghanistan Dari Persian ("Fārsi e Dari") is also simply called Persian ("Fārsi"). It is not to be confused with Dari or Gabri of Iran, a language of the Central Iranian sub-group, spoken in some Zoroastrian communities.

  22. Stephen Jones said,

    June 29, 2010 @ 11:55 pm

    Here's Article 16 of the Afghan Constitution:
    Article 16 of the constitution states that "from amongst Pashto, Dari, Uzbeki, Turkmani, Baluchi, Pashai, Nuristani and other current languages in the country, Pashto and Dari shall be the official languages of the state." In addition, other languages are considered "the third official language" in areas where they are spoken by a majority

  23. Azad Pashtun said,

    June 11, 2013 @ 6:16 am

    I am going to make a case against the use of the word 'Pathan' for 'Pashtun' here. The term 'Pathan' is now outdated, at least in formal contexts, and we are hoping people will abandon it, as it is so often used in derogatory ways. However, it is not as offensive as the N word for African-Americans.
    Happily, the world is adopting the original Pashto words 'Pashtun' and 'Pakhtun' and realizing that we do not receive the word 'Pathan' that kindly; it comes from Indian/Pakistani Punjab and was later adopted by the British. It doesn't feel so great when the most famous British book on Pashtun history is called 'The Pathans'.
    Although very common, 'Pathan' has failed to become part of Pashto. No Pashtun will refer to him/herself as 'Pathan' especially if they are speaking Pashto. The pronunciation of 'Pathan' contains a 'ha' sound in the middle. Ironically, Pashto speakers find it hard to pronounce 'ta' and 'ha' combined. We often drop the 'ha' from the middle of words. Even 'Muhammad' is pronounced as 'Mamad' by many speakers. Abdul Samad Khan Achakzai, political leader/linguist, recommended we formally adopt 'Mamad'. ممد instead of Muhammad. So we can't even properly say 'pa-tha-an' like an Urdu/Hindi/Punjabi speaker can. The original words 'Pashtun' and 'Pakhtun' have 'sh' ش or 'kh' خ sounds only.
