"Lost" languages?


The use of the word lost in this recent story caught my attention — Pankaj Doval, "Google set to revive lost Indian languages", The Times of India 10/3/2024:

As it gets deeper into India with generative AI platform Gemini and other suite of digital offerings, Google has taken up a new task in hand – reviving some of the lost Indian languages and creating digital records and online footprint for them.

I'll say more later about Google's important and interesting contribution to an important and interesting problem. But first, what does the article mean by "lost Indian languages"? I started from the idea that languages that are "lost" are extinct, i.e. no longer spoken — and a web search for the phrase "lost languages" confirms that others have the same interpretation.

However, the Times of India article makes it clear that this is not what they mean:

The idea is to enable people to easily carry out voice or text searches in their local dialects and languages.

As the work moves towards completion, people in the hinterland and various regions can easily do voice search in their own languages to gain accurate and valuable information from, say, Google's Gemini AI platform or carry out live translations, harness YouTube better to target their communities.

The project has so far reached 59 Indian languages, including 15 that currently do not have any kind of a digital footprint and were rather declining in usage.

And the article includes this graphic, listing 8 of those 59 languages:

Looking these languages up on Ethnologue and Wikipedia tells us that some of them have as many as 10 to 20 million speakers (details below), so they're far from extinct. And it's misleading to say that they "have been recorded digitally by Google for first time" — for example, the Wikipedia article for Bajjika says that "Lakshmi Elthin Hammar Angna (2009) was the first formal feature film in Bajjika. Sajan Aiha Doli le ke came after that". And YouTube has quite a few items partly or entirely in Bajjika, including a Bajjika Channel.

Of course, some of the cited languages are smaller: Wikipedia says that Duruwa has 18,151 speakers, while Ethnologue gives the number as 12,000. And the motivation for the Google project is that all of these languages are (or were) "under documented" or "under resourced", in the sense that they lack the digital resources needed for robust modern language technologies such as speech-to-text, text-to-speech, text understanding, and so on. And there's a general concern that this situation makes a language potentially "endangered" and thus at risk of being lost.

It's possible that Indian English generally uses "lost language" in this sense, though I'm guessing that the author of the article (or someone else in the editorial chain) made the choice.

Anyhow, it's worth spending a few minutes on a widely-used attempt to clarify the relevant terminology: the Expanded Graded Intergenerational Disruption Scale (EGIDS), originally proposed in M. Paul Lewis and Gary Simons, "Assessing endangerment: expanding Fishman’s GIDS" (2010; http://ipv6.lingv.ro/RRL%202%202010%20art01Lewis.pdf):

ABSTRACT: Fishman’s 8-level Graded Intergenerational Disruption Scale (GIDS) has served as the seminal and best-known evaluative framework of language endangerment for nearly two decades. It has provided the theoretical underpinnings for most practitioners of language revitalization. More recently, UNESCO has developed a 6-level scale of endangerment. Ethnologue uses yet another set of five categories to characterize language vitality. In this paper, these three evaluative systems are aligned to form an amplified and elaborated evaluative scale of 13 levels, the E(xpanded) GIDS. Any known language, including those languages for which there are no longer speakers, can be categorized by using the resulting scale (unlike the GIDS). A language can be evaluated in terms of the EGIDS by answering five key questions regarding the identity function, vehicularity, state of intergenerational language transmission, literacy acquisition status, and a societal profile of generational language use. With only minor modification the EGIDS can also be applied to languages which are being revitalized.

Here's the EGIDS table from the Wikipedia article:

The current edition of Ethnologue offers a "Language Cloud" for each language, consisting of a scatter plot whose x-axis is the EGIDS level, and whose y-axis is the estimated number of speakers. Here's the Language Cloud for Bajjika:

Ethnologue's explanation of their dot colors:

  • Purple = Institutional (EGIDS 0-4) — The language has been developed to the point that it is used and sustained by institutions beyond the home and community.
  • Blue = Developing (EGIDS 5) — The language is in vigorous use, with literature in a standardized form being used by some though this is not yet widespread or sustainable.
  • Green = Vigorous (EGIDS 6a) — The language is unstandardized and in vigorous use among all generations.
  • Yellow = In trouble (EGIDS 6b-7) — Intergenerational transmission is in the process of being broken, but the child-bearing generation can still use the language so it is possible that revitalization efforts could restore transmission of the language in the home.
  • Red = Dying (EGIDS 8a-9) — The only fluent users (if any) are older than child-bearing age, so it is too late to restore natural intergenerational transmission through the home; a mechanism outside the home would need to be developed.
  • Black = Extinct (EGIDS 10) — The language has fallen completely out of use and no one retains a sense of ethnic identity associated with the language.
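
For readers who want to apply that color scheme mechanically, here is a minimal Python sketch of the mapping from an EGIDS level to Ethnologue's color and label. The bands are copied from the list above. The ordering of the 13 levels (including 8b, which the list covers only implicitly in the range "8a-9") is my reading of the scale, and the names `EGIDS_LEVELS`, `BANDS` and `egids_band` are made up for illustration; none of this is Ethnologue's own code.

```python
# EGIDS levels in order of increasing disruption (13 in all).
# Assumption: 8b sits between 8a and 9, as the range "8a-9" implies.
EGIDS_LEVELS = ["0", "1", "2", "3", "4", "5", "6a", "6b", "7", "8a", "8b", "9", "10"]

# (color, label, first level, last level), taken from Ethnologue's explanation.
BANDS = [
    ("Purple", "Institutional", "0", "4"),
    ("Blue", "Developing", "5", "5"),
    ("Green", "Vigorous", "6a", "6a"),
    ("Yellow", "In trouble", "6b", "7"),
    ("Red", "Dying", "8a", "9"),
    ("Black", "Extinct", "10", "10"),
]


def egids_band(level: str) -> tuple[str, str]:
    """Return (color, label) for an EGIDS level such as '6b'."""
    key = level.strip().lower()
    if key not in EGIDS_LEVELS:
        raise ValueError(f"unknown EGIDS level: {level!r}")
    position = EGIDS_LEVELS.index(key)
    for color, label, first, last in BANDS:
        # A level belongs to a band if it falls between the band's endpoints.
        if EGIDS_LEVELS.index(first) <= position <= EGIDS_LEVELS.index(last):
            return color, label
    # Not reachable while BANDS covers every level, but kept as a safeguard.
    raise ValueError(f"no band covers EGIDS level {level!r}")
```

Calling `egids_band("6b")` or `egids_band("7")` gives the "In trouble" band, which is the range a yellow dot indicates; the dot color alone does not say which of the two levels applies.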

Ethnologue's Language Cloud for Duruwa uses a yellow dot:

I believe that Google's work on under-documented languages of India has been led by Partha Talukdar, whose LinkedIn page says "I lead the Languages group at Google DeepMind, India focusing on making LLMs work well for speakers of more number of languages. The goal is to make sure benefits of AI are available to a broader population where language is not a barrier anymore."

One of his group's relevant contributions is "IndicGenBench: A Multilingual Benchmark to Evaluate Generation Capabilities of LLMs on Indic Languages":

As large language models (LLMs) see increasing adoption across the globe, it is imperative for LLMs to be representative of the linguistic diversity of the world. India is a linguistically diverse country of 1.4 Billion people. To facilitate research on multilingual LLM evaluation, we release IndicGenBench – the largest benchmark for evaluating LLMs on user-facing generation tasks across a diverse set 29 of Indic languages covering 13 scripts and 4 language families. IndicGenBench is composed of diverse generation tasks like cross-lingual summarization, machine translation, and cross-lingual question answering. IndicGenBench extends existing benchmarks to many Indic languages through human curation providing multi-way parallel evaluation data for many under-represented Indic languages for the first time. We evaluate a wide range of proprietary and open-source LLMs including GPT-3.5, GPT-4, PaLM-2, mT5, Gemma, BLOOM and LLaMA on IndicGenBench in a variety of settings. The largest PaLM-2 models performs the best on most tasks, however, there is a significant performance gap in all languages compared to English showing that further research is needed for the development of more inclusive multilingual language models.

 


