Wordy Bengal

Headline in bdnews24.com:

"Bangladesh adds 700,000 words to Google Translate in a day" (3/28/15)

An official announcement from Google Inc is yet to be made but State Minister for ICT Zunaid Ahmed Palak broke the news in his Facebook post on Friday.

He congratulated all involved in the post.

“We’ve done it! Our target was 400,000. Setting a record by adding so many words in a day has taken Bangladesh to a new high,” he said.

“Bengalis across the globe could add over 700,000 words and phrases to Google Translate since yesterday, thanks to all who responded to our call. The credit goes to all, the pride is of Bangladesh,” he added.

Over 4,000 volunteers keyed in at least 400,000 words to the translation app in a day from 81 places in the country.

The initiative, co-organised by the ICT Division, Google Developers Group and Bangladesh Computer Council (BCC), kicked off on Thursday morning at Agargaon.

This huge number of Bengali words added to Google Translate raises a lot of questions.  If 700,000 words were added in a single day, one wonders what the total number of words in Bengali is.  (English has a particularly large vocabulary, but it is only somewhere between a quarter of a million and three quarters of a million words.)

What were the criteria for selection?

How did they avoid duplicates?

What is the total vocabulary of Bengali?  (Volunteers had already added over 65,000 words on February 21.)

Will Google Translate accept all 800,000 or so entries keyed in by volunteers?  Wikipedia has stringent procedures for changes that are made to its articles.  How will Google Translate maintain quality?

[h/t Geoff Wade]


  1. Tom S. Fox said,

    March 29, 2015 @ 9:51 pm

    Please tell me you cited the Global Language Monitor’s figure of one million as a joke.

  2. Yerushalmi said,

    March 29, 2015 @ 11:46 pm

    Are you sure it's 700,000 Bangladeshi words? It's probably 700,000 foreign-Bangladeshi word pairings. So it could be 20,000 words submitted by an English-Bangladeshi translator, 20,000 submitted by a French-Bangladeshi translator, etc. Not to mention that 700,000 submissions by geographically scattered people will necessarily contain many duplicates.

  3. Chris McG said,

    March 30, 2015 @ 5:06 am

    Just because they're geographically scattered doesn't mean the effort was uncoordinated, given they were all definitely online for the entire time. Careful planning, delegation and ongoing discussion would reduce the number of duplicates – I mean, I assume it wasn't just someone shouting "3, 2, 1, Go!" and everyone typing in the first words they thought of, but rather different people being given different and specific areas to work on.

  4. Chris McG said,

    March 30, 2015 @ 5:08 am

    Sorry for typing your name wrong before.

    Also I forgot to mention that I agree with your idea that it was probably counting each target language of translation separately to get to 700,000.

  5. Lane said,

    March 30, 2015 @ 5:18 am

    It doesn't say they were distinct words, and the direct quote says "over 700,000 words and phrases." It sounds like a call went out to volunteers: "If you speak Bengali and English, please go and add words and phrases at http://translate.google.com/community." And the volunteer response was bigger than expected, but with lots of duplication. (I wonder how they quality-control this? People do like to punk this kind of thing.)

  6. Victor Mair said,

    March 30, 2015 @ 8:43 am

    @Tom S. Fox

    No, that was not a joke; it was my bad. So bad, in fact, that I've removed it from the original post.

    I knew about this, of course:

    "The 'million word' hoax rolls along"

    And it even passed through my mind as I was writing the post, but various distractions and the figure of three quarters of a million cited by the Oxford dictionary website caused it to slip through.

    Mea culpa! And thanks for catching that right away.

  7. Circeus said,

    March 30, 2015 @ 9:21 am

    From the point of view of machine translation, all variants of a word count as strings, which may also affect the count: verb conjugations and case declensions are both present in Bengali. For comparison, the official French Scrabble dictionary has ca. 60,000+ entries, translating to over 360,000 possible valid strings, not counting diacriticals.
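
    The gap between dictionary entries and valid strings can be illustrated with a minimal sketch. The lexicon below is a toy English example invented for illustration (it is not data from the event or from any dictionary); it only shows why a count of strings can run several times higher than a count of entries.

```python
# Toy lexicon: each dictionary entry (lemma) maps to its inflected surface forms.
# These words are hypothetical illustrations, not data from the translation event.
LEXICON = {
    "go": ["go", "goes", "going", "went", "gone"],
    "word": ["word", "words"],
    "set": ["set", "sets", "setting"],
}


def count_entries_and_strings(lexicon):
    """Return (number of entries, number of distinct surface strings)."""
    entries = len(lexicon)
    # A set removes any form that happens to be shared between entries.
    strings = {form for forms in lexicon.values() for form in forms}
    return entries, len(strings)


entries, strings = count_entries_and_strings(LEXICON)
print(f"{entries} entries -> {strings} distinct strings")
```

    With the Scrabble figures quoted above, the same ratio works out to roughly six strings per entry (360,000 / 60,000).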

  8. Vicki said,

    March 30, 2015 @ 10:13 am

    Adding to what Circeus said: variants, conjugations, and such are relevant in both directions of the translation. If you're doing Bengali-English, you need to link the appropriate forms of the Bengali words for "go" to "gone" and "went."

  9. Leo E said,

    March 30, 2015 @ 11:41 am

    Here's a video (http://tinyurl.com/pw9q9po) teaching volunteers how to contribute entries for the independence day event, at the end of which the presenter says that it's "desher jonne, bhashar jonne" – for the country, for the language. The article that accompanies it goes into some of the ideological factors behind the movement but quotes a few of the same sources as the first linked article in the original post here, as well as a (Chinese?) Google representative from the US লিনে হা Line Ha (?) who is quoted as saying আর বাংলা অবশ্যই পৃথিবীর অন্যতম বড় ও সমৃদ্ধ ভাষা। "And Bengali is without a doubt one of the world's greatest and most bountiful languages." (At first I thought it was anuttama and not anyatama, which would have meant *the* greatest language.) One of the examples in the video is "ponero bar biye korte cay" – "he wants to get married 15 times," which apparently counts as one entry. Other example entries he gives are also phrases and sentences.

    The only wordcount for a dictionary I could find from a quick search is Jnanendramohan Das's 1937 Bangla Bhasar Abhidhan (Dictionary of the Bengali Language) at 150,000, though I don't know about the Samsad Bangla Abhidhan, most recent edition 2004.

  10. Dave Cragin said,

    March 30, 2015 @ 8:52 pm

    Because of the way Google Translate works, many entries for the same word/phrase/sentence are actually desirable. Single submissions would be problematic because they are solely dependent on that one individual’s ability and perspective.

    Google Translate works by finding patterns in data using algorithms that IBM developed in the 1980s. It statistically picks the most common “pattern” to translate a word or phrase. As people vary in their ability to translate, this is essential.

    Some people lack an ability to see language in a broader scope and despite being fluent in 2 languages, they can’t contribute effectively with translations (but they are unaware of this).

    For example, early in my study of Chinese when I could only speak, not write, I mentioned to a Chinese colleague that I hadn’t realized that the jiu 酒 in 葡萄酒(putaojiu – wine) and 啤酒 (pijiu – beer) both referred to alcohol.

    He said “that’s wrong, jiu only means wine.” I mentioned the CDs taught it as alcohol and showed him 2 different dictionaries that included the alcohol definition. He said “the CDs are wrong, the dictionaries are wrong, you’re wrong, jiu just means wine.” (He was a PhD scientist, i.e., not an uneducated individual, and quite fluent in English).

    I expect every country has individuals like him who focus on what they consider "accurate literal translations" and can't let themselves see beyond this. However, when it is not using official documents as its source, Google needs more broad-based input to be successful. Hence, the wisdom of crowds.
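
    The idea of picking the most common "pattern" can be sketched as a simple majority vote over submissions. This is a deliberately simplified illustration, not Google's actual pipeline: the submission list and the function name `pick_most_common` are hypothetical, and a real statistical system weighs far more context than a raw count.

```python
from collections import Counter, defaultdict

# Hypothetical volunteer submissions: (source term, proposed translation).
SUBMISSIONS = [
    ("jiu", "alcohol"),
    ("jiu", "wine"),
    ("jiu", "alcohol"),
    ("putaojiu", "wine"),
    ("pijiu", "beer"),
    ("jiu", "alcohol"),
]


def pick_most_common(submissions):
    """For each source term, keep the translation submitted most often."""
    votes = defaultdict(Counter)
    for source, translation in submissions:
        votes[source][translation] += 1
    # most_common(1) returns a one-element list of (translation, count).
    return {source: counts.most_common(1)[0][0] for source, counts in votes.items()}


print(pick_most_common(SUBMISSIONS))
```

    In this toy data a single dissenting submission for "jiu" is outvoted, which is the sense in which duplicate entries from many contributors are useful rather than wasteful.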

  11. Leo E said,

    March 30, 2015 @ 9:15 pm

    In retrospect, I'm wondering why "he wants to get married 15 times" would be an entry by itself in Google Translate. I don't know about machine translation, but I would guess the program benefits from having as much raw data as possible to compare, so that any accurate translation is valuable. I shudder to think of the project of measuring the number of words in Bengali (or other Indian languages) since the tendency to use and adapt English is so extreme that those English words might have to be counted as Bengali because of the phonetic changes they undergo. Like the phrase "machine intelligence" which appears here as মেশিন ইন্টেলিজেন্স (meśin inṭelijens) – in French it would be something like intelligence des machines, which seems to have a lot more phonologically in common with English, although of course Bengali preserves the overt syntactic relationship between the words without having to have an "of" or a subordinating marker.
    The above comment by Dave Cragin helps me understand the value of getting input like this, though.

  12. Akito said,

    March 31, 2015 @ 1:52 am

    @Dave Cragin

    Isn't alcohol in a general sense called 酒精 or 木精? The word 酒 used alone refers, I think, only to alcoholic beverages. I sympathize with your Chinese colleague who insisted on "wine". Perhaps he used the word generically, brewed or distilled, emphasizing the "beverage" part of the meaning.

  13. Victor Mair said,

    March 31, 2015 @ 11:22 am

    @Dave Cragin

    Jiǔ 酒 ≠ "wine"

    You were so right that JIU3 means "alcohol" and your Chinese colleague was so wrong to insist that it only means "wine". In fact, by itself it doesn't mean "wine" at all.

    By chance, yesterday afternoon, we heard a talk on the following topic in our department: "Wine Road before the Silk Road: Hypotheses on the Origins of Chinese and Eurasian Drinking Culture". It was delivered by Peter Kupfer, Professor, Johannes Gutenberg University, Mainz, Germany. Peter was accompanied by my colleague, Patrick McGovern, author of Uncorking the Past: The Quest for Wine, Beer, and Other Alcoholic Beverages (2009).

    Peter and Pat had just come from a conference on "Understanding Jiu: The History and Culture of Alcoholic Beverages in China" that was held on March 26, 2015 at UC Davis, which has one of the world's outstanding centers for (o)enology and viticulture.


    Also present at the Penn seminar yesterday was Christoph Harbsmeier, a Sinologist from the University of Oslo. The discussion during and after Peter's presentation was vigorous and productive.

    The consensus of the participants at the Penn seminar is in agreement with the definition for jiǔ 酒 in Paul Kroll's new A Student's Dictionary of Classical and Medieval Chinese:


    MC tsjuwX
    1 gen. term for alcoholic beverages produced through fermentation, incl. those with infusions or spices that sometimes lend various colors such as rose-pink or amber. Although most drinks designated by this word are made from cereals and are thus akin to beer, from Western Han times it also ref. grape-wine (first brought from Central Asia) and “burnt-wine” (brandy), the former becoming esp. popular during Tang times; use “wine” as preferred rendering for its inclusiveness; to use “ale” is misleading as it ref. only to a specific type of beer which is actually most similar to → 醴 lǐ.

    s.v. 醴, Kroll has:

    MC lejX
    1 sweet liquor, made with malt (nie 糵) and glutinous millet (shu 黍); often translated as “mead,” which is a serviceable rendering but technically inaccurate since honey is not an ingredient of this beverage; the more correct translation is “ale.”
    2 day-old wine.


    Note on the phonetic notation: The -X is part of Baxter's tonal spelling system, in which pingsheng syllables are unmarked; shangsheng syllables are indicated by a final -X; qusheng syllables are marked with a final -H, and rusheng syllables can be identified by the final obstruent. See William H. Baxter and Laurent Sagart, Old Chinese: A New Reconstruction (2014). Their Old Chinese reconstruction for jiǔ 酒 is tsuʔ. Axel Schuessler's Old Chinese reconstruction of jiǔ 酒 is tsjəuB, where B is a superscript and indicates a tonal category (Minimal Old Chinese and Later Han Chinese: A Companion to Grammata Serica Recensa [2009]).

    We have had several posts about jiǔ 酒 on Language Log:

    "Let the Beer-Divider Be Chief!" (8/5/09)


    "Don't Drive in the What, er?" (8/4/09)


    "Ethanol tampons" (12/5/14)


    Tom Standage has a great chapter on "Beer in Mesopotamia and Egypt" in his A History of the World in 6 Glasses (2006), pp. 14-36. It is available on this blog:


    About two-thirds of the way down the page, at the beginning of the section titled "The Origins of Writing", we find an illustration with this caption: "The evolution of the written symbol for beer in cuneiform. Over the years the depiction of the beer jar gradually became more abstract". (from 3200-1000 BC)

    Here are the early forms of the Chinese character jiǔ 酒 for comparison:


    There is a clear resemblance between the Sumerian and the Chinese symbols for "beer", both of which depict a jug. It's interesting that the oracle bone forms (second half of second millennium BC) for 酒 all have the three drops of water as a semantophore, whereas the bronze inscriptional forms (first millennium BC) and even some of the seal forms (latter part of the first millennium BC) lack the three dots for liquid, making the character for jiǔ 酒 identical to that for yǒu 酉 ("an ancient vase used in making and storing fermented millet liquors").


    Bottom lines:

    The Chinese word for "wine" is pútáojiǔ 葡萄酒 ("grape jiǔ"), where pútáo 葡萄 (there are many different ways to write this in Chinese characters) is a term for grape borrowed from an Iranian language.

    If it's made from grain, which jiǔ 酒 traditionally was, it's not wine.

    The Japanese alcoholic beverage called "sake" and made from fermented rice (N.B.: a grain) is written with the kanji 酒 (also has the Sino-Japanese pronunciation "shu").

    Conventionally, loosely, and poetically, some folks may prefer to translate jiǔ 酒 as "wine", but sensu stricto it is technically not "wine".

    [Thanks to Brendan O'Kane]

  14. Akito said,

    March 31, 2015 @ 11:08 pm

    The word "alcohol" can refer to ethanol or methanol. 酒 can refer to any beverage (wine or otherwise) that contains the former. The latter is not meant to be drunk. That's why the Chinese colleague finds it (as I do) difficult to equate 酒 with alcohol.

  15. Akito said,

    April 1, 2015 @ 2:14 am

    Sorry for the interruption. Of course, not all liquids containing ethanol are beverages, so the word 酒 applies to a much smaller domain than "alcohol".

  16. Victor Mair said,

    April 1, 2015 @ 6:50 am


    I'm glad you made your own correction. What you say in your third comment is exactly what I was going to write this morning.

    Not all jiǔ 酒 is wine; not all alcohol is drinkable.

  17. Victor Mair said,

    April 1, 2015 @ 2:56 pm

    From Paul Kroll:

    As my definition says, jiu is the general term for beverages produced through fermentation and usually "more akin to beer" but it may also include what we normally call "wine," so I prefer to retain the latter translation for its general inclusiveness. Some colleagues now prefer rendering jiu as "ale," apparently using "ale" as an elegant synonym of "beer" (though ale is actually just one type of beer [like porter, stout, lager, etc.] and, as I point out, is technically nearer to what is called in Chinese "li"). But I have trouble with the image of traditional Chinese scholars tossing down "ale" as though they are Anglo-Saxon or Viking warriors, or frequenting "ale-houses" which has a decidedly lower-class connotation in English.

  18. julie lee said,

    April 1, 2015 @ 8:48 pm

    What about translating 酒 jiu as "liquor"? Has someone already suggested it in the comments? In present-day American-English usage doesn't "liquor" include wine, beer, ale, sake, and other alcoholic beverages? As far as I know, in present-day common Chinese usage 酒 jiu can refer to any alcoholic beverage, be it wine, beer, ale, sake, etc.

  19. Dave Cragin said,

    April 2, 2015 @ 9:28 pm

    Thanks for all of the interesting comments about 酒 jiu.

    I should mention my sense was that part of my colleague’s over-reaction related to the older Chinese education tradition in which the student never should question the teacher. Even though I wasn’t his student, he may have viewed the situation this way and this may have added to his dogmatic view.

    This said, the students I teach at Peking University often speak up in class and some offer opinions that differ from mine, so things are changing.
