Language Log

MuRIL

December 19, 2020 @ 8:51 am · Filed by Mark Liberman under Computational linguistics

[Note that the "To view or add a comment" message is from LinkedIn, not LLOG…]

Some more details, from Manish Singh, "Google expands languages push to serve non-English speakers in India", TechCrunch 12/16/2020:

Google executives also detailed a new language AI model, which they are calling Multilingual Representations for Indian Languages (MuRIL), that delivers more efficiency and accuracy in handling transliteration, spelling variations and mixed languages and other nuances of languages. MuRIL provides support for transliterated text when writing Hindi using Roman script, which was something missing from previous models of its kind, said Partha Talukdar, research scientist at Google Research India, at a virtual event Thursday. […]

Talukdar noted that the previous model Google relied on proved unscalable as the company had to build models for each language separately. “Building such language-specific modeling for each and every task is not resource efficient as we often don’t have training data for tasks like this,” he said. MuRIL significantly outperforms the earlier model — by 10% on native text and 27% on transliterated text. MuRIL, which was developed by Google executives in India and has been in use for about a year, is now open-source.

The MuRIL models are available here, along with some additional discussion and explanations.

7 Comments

Victor Mair said,

December 19, 2020 @ 10:00 am

The linguists, AI researchers, and executives at Google Research India are to be warmly congratulated for this remarkable achievement.

Meanwhile, comparative linguists, specialists in Sinitic languages and scripts, and AI scientists should ponder why we are still light years away from having a similar tool for China — a MuRSL (Multilingual Representations for Sinitic Languages), as it were.

John from Cincinnati said,

December 19, 2020 @ 1:07 pm

I was waiting for anyone else to complain, but … this post displays for me as nothing but a large blank frame followed by the note about how the comment message is from LinkedIn, not LLOG.

Examining the HTML source I can see the desired LinkedIn URL, and I would plaintext it here except that LLOG sometimes discards messages with embedded URLs. So I'll put that in a second comment, which if it doesn't show up you'll know why.

[(myl) The embed code is

<iframe title="Embedded post" src="https://www.linkedin.com/embed/feed/update/urn:li:share:6745368806815875072" width="504" height="854" frameborder="0" allowfullscreen="allowfullscreen"></iframe>

What browser and OS are you using? This works for me in four browsers under two operating systems…]

John from Cincinnati said,

December 19, 2020 @ 1:08 pm

The desired link is here.

Philip Taylor said,

December 19, 2020 @ 2:06 pm

The "allowfullscreen" attribute is not permitted in the DTD — whether this might cause the problem I do not know.

Line 393, Column 177: there is no attribute "allowfullscreen"

…4" height="854" frameborder="0" allowfullscreen="allowfullscreen">

AntC said,

December 19, 2020 @ 3:44 pm

I'm not seeing anything specific to Indian languages here: couldn't there be just any random collection of languages in the balloons on the LHS of the diagram?

[(myl) Of course — it's a question of how you train the system, and what mix of languages you expect the users to hit it with.

Aside: This comment is a good example of "snarky um"…]

And indeed there's a mix of IE and Dravidian languages, so every chance somebody could make something similar for other disparate bunches of languages, such as @Victor's suggestion for Sinitic. (Perhaps for Sinitic there are special difficulties obtaining transliterations for non-Mandarin.)

It all comes down to the training data and the BERT settings.

On which topic: "Translated Data : We obtain translations of the above monolingual corpora using the Google NMT pipeline." Unless Google Translate amongst Indian languages is orders-of-magnitude better than the frequently hilarious offerings that appear on LLog for other languages, that looks like a case of Garbage In, Garbage Out.

[(myl) The linked site gives performance results for more than a dozen tasks.]

Victor Mair said,

December 19, 2020 @ 4:50 pm

From H. Krishnapriyan:

>> MuRIL provides support for transliterated text when writing Hindi using Roman script,

A pioneering individual effort in doing this is that of Seshadrivasu Chandrashekharan, who developed Baraha, which initially did this for Kannada and later for a number of other Indian languages. His efforts were available to me for use from about 20 years back.

https://en.wikipedia.org/wiki/Baraha

https://baraha.com/v10/index.php

Victor Mair said,

December 21, 2020 @ 9:13 am

From a Chinese colleague:

A pity, but not a surprise, for most Chinese, even the well-educated, believe that Cantonese, for example, is a dialect, not a language. It is unconventional and trivial for them to record an oral "dialect" while the universal written Chinese prevails.
