
Looking at Geoff's post on machine-translated phishing scam messages, the message certainly does come across as very similar to the English output we in the biz frequently see coming out of statistical machine translation of Chinese. This includes Chinese-specific issues like recovering correct determiners from a language that does not express them overtly (I hope that the [not this] letter meets you in good spirits), as well as the ubiquitous phenomenon of sentences that are locally coherent — thanks to phrase-level translations and good statistical language-models for English — but globally nonsensical. I don't claim to know what makes a text poetic, but it seems to me that this combination of local coherence and larger-scale disconnectedness must be at least partly responsible for what Geoff describes as the "strange poetry" of machine translationese.

For some very recent work on human translationese, see Moshe Koppel and Noam Ordan's nice discussion of Translationese and its Dialects at the recent Association for Computational Linguistics Conference in Portland. They used text categorization methods to tease apart interference from the specific source language from more general effects of the translation process itself. Koppel is applying text categorization ideas in a lot of interesting areas: he's also the lead author of the paper Mark wrote about the other day in Biblical scholarship at the ACL.


  1. Wm Annis said,

    July 8, 2011 @ 10:49 am

    And then there's what I call "Old High Translationese," for that tortured English Classicists produce when turning Homer, say, into English.

    Some philologists get so used to this sort of language that it becomes their native tongue. Tolkien readily comes to mind.

  2. Emily said,

    July 8, 2011 @ 1:02 pm

    Is Old High Translationese what Dwight Macdonald called "that fantastic nineteenth-century translator's prose"? ("Yon man . . . Ay me! And once again, Ay me!" "Why weepest thou? Thus stands the matter, be well assured." "In fear of what woe foreshown?")

  3. Emily said,

    July 8, 2011 @ 1:03 pm

    Sorry, broken link. It should have been:

  4. David J. Littleboy said,

    July 9, 2011 @ 3:14 am

    In Japanese history class at Yale, they made us read Max Weber (for the historiography) in translation. It was horrifically heavy Germanic plodding. Yuck. Later I found a better translation of one of his books, and it turns out he's a seriously sharp bloke very much worth reading. But, yes. Machine translation from Japanese to English also generates some wonderfully poetic language. As I've probably mentioned before, both Chinese and Japanese do not indicate the singular/plural distinction (unless context requires it, which it turns out it seldom does), so unless you actually do the work to figure out what's being talked about, the probability of getting any given sentence right is (1/2^n), where n is the number of noun groups in the sentence at hand. And that's without worrying about the indefinite/definite distinction you mentioned.
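
A small sketch of the arithmetic in the comment above, assuming each noun group is an independent coin flip between singular and plural. That independence is the commenter's simplification, not a measured error rate, and the name `chance_all_correct` and its `p_each` parameter are made up for illustration.

```python
def chance_all_correct(noun_groups: int, p_each: float = 0.5) -> float:
    """Chance that every noun group gets the right number.

    Assumes each group is guessed independently with probability
    p_each of being right (0.5 = blind guess between singular and plural).
    """
    if noun_groups < 0:
        raise ValueError("noun_groups must be non-negative")
    return p_each ** noun_groups


if __name__ == "__main__":
    # Show how quickly the chance of a fully correct sentence shrinks.
    for n in range(1, 6):
        print(f"{n} noun groups: {chance_all_correct(n):.5f}")
```

Under these assumptions a sentence with five noun groups comes out right only about 3% of the time, which is why a system that does not work out what is being talked about so rarely produces a clean sentence.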

  5. maidhc said,

    July 9, 2011 @ 4:06 am

    I have a few comments on the previous thread.

    Aren't people from Hong Kong more likely to have better English than people elsewhere in China, because of being a British colony until a few years ago? I know several people from Hong Kong and they all have excellent English (admittedly this is a small sample).

    "Nigerians speak good English". I suppose the argument is that Nigeria was a British colony, but then so was HK. Nigerian Spam English is still readily distinguishable from ordinary idiomatic English. But it is comprehensible.

    However, this comprehensibility is the product of a long evolution. I received a letter from Nigeria in the post some time around 1996. It was obviously some sort of scam, but the English was so bad we couldn't figure out what it was he was trying to get me to do.

    The entry cost was much higher then, since he would have had to print out the letter, put it in an envelope and buy a stamp. But from such small beginnings a mighty industry has grown.

    I'm not arguing with the analysis of the particular spam in the previous thread, which sounds pretty convincing. I did want to comment on these couple of off-hand remarks, though.

  6. maidhc said,

    July 9, 2011 @ 4:20 am

    Maybe I should add that I have met a number of Nigerians who speak perfectly good English (with a bit of an African lilt, perhaps). But these are people who are well educated and are respectably employed. The spammers are a totally different class of people.

  7. The Ridger said,

    July 9, 2011 @ 7:31 am

    Re the spammers: Never discount the utility of sounding undereducated, even if you're supposed to be a bank manager. It assists you in projecting that image of a guy who's not only crooked but dumb (as well as foreign) which will be of great use to you in selecting marks who think they're going to be smarter than you and walk away with all your money instead of the mere 30% you're offering.

    It will also help when the scam is the poor Christian woman with the dead husband and no kids trying to keep her money out of the hands of her Muslim in-laws, too, in proving how likely it is she doesn't have a lawyer and needs your help.

  8. Ellen K. said,

    July 9, 2011 @ 7:43 am

    @The Ridger

    But we aren't talking about an email that makes the sender come across as undereducated. We're talking about email that is incomprehensible and makes errors that no actual human would make.

  9. Emily said,

    July 9, 2011 @ 11:10 am

    @David J. Littleboy: Machine translation from Japanese to English also generates some wonderfully poetic language.

    I'm not sure if this is machine-translated (the misspellings imply that there were some transcription errors/typos as well) but it's a classic of the "Engrish" genre:

  10. Glenn Bingham said,

    July 9, 2011 @ 8:36 pm

    @Emily. So as not to identify the guilty, I will omit full names, but two of my dorm mates in college had a contest to see who could graduate with the lowest grade-point average. The 2.000x went to four places to decide the winner.

    When you say "makes errors that no actual human would make," I know you are blaming the affair on the machine that translated the message, but the human perpetrators still had to OK the final product before launching the e-mail…and humans most likely wrote the program. When I hear "computer error," I always think of missing a nail and clobbering my thumb and then attributing it to "hammer error."

    "We're talking about email that is incomprehensible…" Well, except that Pullum wrote a whole article on what it meant and apparently immediately identified it as a solicitation for personal ID to get money. It takes a dedicated linguist to figure out what barley has to do with it, but although bruised a bit, the message arrived.

    Just as spelling doesn't matter in a contextual presentation as long as the first and last letters for each word are in place (for many of us), I wonder to what degree the syntax and semantics need to be intact to still get the idea across? How low can a message score on the grammar scale and still pass? I actually rate the phishing scam higher on the Hinman-Shaw scale than some non-machine-translated writing I have saved from native language students. They didn't make the 2.0000.

  11. Glenn Bingham said,

    July 9, 2011 @ 8:45 pm

    Oops! Too many E's. The last was a response to Ellen K. Thanks for making my wheels spin.

    Of course, I could read samples like Emily's until the cow comes home!

  12. Ellen K. said,

    July 9, 2011 @ 10:54 pm

    @Glenn Bingham

    Do you have anything to say about my actual point? All you've done is nitpick over how I worded the description of my point. You haven't said anything at all about my actual point. Namely, that the sender does not sound uneducated. Or, at least, not simply uneducated. Lack of education alone does not cause writing like that.

    And don't read too much into a brief description. I didn't say it was totally completely incomprehensible. I left the word unmodified. Further detail was unnecessary, since readers can see for themselves to what degree it's incomprehensible.

  13. John Cowan said,

    July 10, 2011 @ 12:17 am

    William Annis: Tolkien could certainly write archaically, but he also had full control of modern English. As he wrote in Letter #171 to Hugh Brogan:

    The proper use of 'tushery' is to apply it to the kind of bogus 'mediaeval' stuff which attempts (without knowledge) to give a supposed temporal colour with expletives, such as tush, pish, zounds, marry, and the like. But a real archaic English is far more terse than modern; also many of the things said could not be said in our slack and often frivolous idiom. Of course, not being specially well read in modern English, and far more familiar with the works in the ancient and ’middle’ idioms, my own ear is to some extent affected; so that though I could easily recollect how a modern would put this or that, what comes easiest to mind or pen is not quite that. But take an example from the chapter that you specially singled out (and called terrible): Book iii, ’The King of the Golden Hall’. "'Nay, Gandalf,' said the King, 'You do not know your own skill in healing. It shall not be so. I myself will go to war, to fall in the front of the battle, if it must be. Thus shall I sleep better.'"

    This is a fair sample – moderated or watered archaism. Using only words that are still used or known to the educated, the King would really have said: 'Nay, thou (n')wost not thine own skill in healing. It shall not be so. I myself will go to war, to fall…' etc. I know well enough what a modern would say. 'Not at all, my dear G. You don't know your own skill as a doctor. Things aren't going to be like that. I shall go to the war in person, even if I have to be one of the first casualties' – and then what? Theoden would certainly think, and probably say 'thus shall I sleep better'! But people who think like that just do not talk a modern idiom. You can have 'I shall lie easier in my grave', or 'I should sleep sounder in my grave like that rather than if I stayed at home' – if you like. But there would be an insincerity of thought, a disunion of word and meaning. For a King who spoke in a modern style would not really think in such terms at all, and any reference to sleeping quietly in the grave would be a deliberate archaism of expression on his part (however worded) far more bogus than the actual 'archaic' English that I have used.

  14. Ben Hemmens said,

    July 11, 2011 @ 7:25 am

    The paper on Translationese has essentially one result, and that is that the words therefore, thus, consequently, hence, however, nevertheless, also, furthermore and moreover are overrepresented in translations into English compared to natively English texts.

    This is very familiar to me as a translator from German to English and editor of texts written by German speakers. I'm not a linguist, but I would attempt to describe the phenomenon thusly: in "good" English prose we generally try to achieve an organic and rather linear development of the thoughts from one sentence to another. The content of sentences is relatively self-contained. If we can get away without using these overt cohesive markers to link one sentence to the preceding one, we do it. German prose works differently; my personal explanation is that English paragraphs try to work like a story, but German paragraphs often work like an oil painting: ideas are not completed in single sentences, but rather, each sentence touches on different aspects of the ideas being expressed in the paragraph, just as any one location in an oil painting is a product of several translucent layers of paint.

    (My wife, who has learned English as a foreign language, talks about key sentences and I suppose that's also true: German paragraphs often don't have one.)

    Lacking what native readers of English would recognize as a good logical progression of ideas, German has lots of words that can be used as cohesive markers and uses them liberally to give some semblance of structure to things: weiters, des Weiteren, folglich, in weiterer Folge, damit, somit, daher, dafür, wo hingegen, also, allerdings, trotzdem, nichtsdestotrotz, and a good few more. I'm not saying that German prose is worse than English, it just works differently.

    What needs to be done to make proper English prose out of German prose is very often a large-scale recasting of paragraphs and even of longer passages. It requires leaving out some information and adding other information and having different numbers of sentences. But of course there are many categories of translation in which adding or leaving out information is in strong conflict with the purpose of the translation, and the first source of corpus material in the paper is a classic example: transactions of the European Parliament. Another reason occurs to me why these words would be overrepresented in these translations and that is that they are transcripts of interpretation (which in the EP is mostly simultaneous). Simultaneous interpreting is a hellish sport if you ask me, and I believe its practitioners survive by accepting its imperfections and concentrating on the essentials. The essentials are not how wittily the speaker leads from one point to another, but first and foremost to miss no substantive piece of information that is spoken. So the cohesive markers allow the interpreters to tell the listeners: "here comes another point supporting what the speaker just said" or "here comes a contrasting point", etc. – while also gaining a second or two to listen to the speaker themselves.

    In the newspaper texts the reason for over-representation is probably less noble and simply a function of too little time and too little money to do the rewriting involved.

    But there are other types of translation in which the ideal is certainly to produce a text that will not appear in any way odd to native speakers of the target language, and which allow the freedom to rewrite on a wider scale; and if the translators are paid well enough to take the extra time needed for this work, I'm sure they achieve results harder to distinguish from native English. I get high-value marketing texts edited by people whom I know cannot speak German. But you still have the issue of subject matter that simply doesn't exist in the same way in the English-speaking world. I just did some work on engineered timber construction elements: that happens to be an area where the German-speaking countries lead the world, and so inevitably English speakers will find the texts a bit odd: they haven't seen products quite like this before.

    PS the underrepresentation of "actually" doesn't surprise me; it's a terrible false friend for speakers of several other languages.

    PPS I'd be curious to know why the authors think English is more similar to romance languages than to German.
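
One rough way to see the overrepresentation described in this comment is to count how often the nine listed markers occur per 1,000 tokens. This is only a counting sketch, not the text categorization method the paper used. The function name `marker_rate` and both sample sentences are invented for illustration.

```python
import re

# The nine cohesive markers named in the comment above.
MARKERS = frozenset({
    "therefore", "thus", "consequently", "hence", "however",
    "nevertheless", "also", "furthermore", "moreover",
})


def marker_rate(text: str, markers: frozenset = MARKERS) -> float:
    """Return cohesive-marker occurrences per 1,000 word tokens."""
    # Crude tokenizer: lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in markers)
    return 1000 * hits / len(tokens)


if __name__ == "__main__":
    # Invented toy sentences, not drawn from any corpus.
    translated = ("However, the committee agreed. Therefore, the vote "
                  "was held. Moreover, it passed.")
    native = "The committee agreed, so the vote was held, and it passed."
    print(marker_rate(translated), marker_rate(native))
```

On real data the comparison would need large, comparable samples of translated and natively written English; with a handful of sentences the rates mean very little.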

  15. Glenn Bingham said,

    July 13, 2011 @ 10:43 pm

    @Ellen K
    You are absolutely totally completely correct. The "bad" translation does not produce a tone of undereducation, but challenges comprehensibility.

    What I didn't express effectively was that, accepting your observation, how do we measure such degrees of success in communication? How many disruptions in grammar (in the broadest sense) does it take to produce a sense of undereducation? If any? And further, to what extent can the grammar be mangled before the level of comprehension reaches zero? At what point does the machine translation fail completely?

  16. Ben Hemmens said,

    July 14, 2011 @ 3:48 am

    "how do we measure such degrees of success in communication?"

    That's difficult, because it is highly situation-dependent. The only real answer is that if the target person or group did what you wanted them to do, then your communication attempt succeeded and therefore whatever language you used was good enough.

    Every so often a public personage in Austria gets themselves publicly ridiculed for their dodgy English (the latest was (ex-)MEP Ernst Strasser, who got stung by British journalists, with highly amusing video evidence). But if that level of English fluency was a big hindrance to everyday communication, the Austrian economy would have collapsed long ago.

    In this case, recognizing what the spammer wants you to do is not sufficient, for most recipients, to get them to do what he/she wants. The reasons have a lot more to do with the facts that we are all sceptical of "money for nothing" and we all know that spammers exist, than with the language. If someone came up with a much more convincing scenario and managed not to look like any of the scams we already know, success would be a lot more tolerant of a few language errors.
