Toward the decipherment of Harappan

As documented here (2009), here (2010), here (2013), and here (2017), it's controversial whether the Indus Valley (IV) inscriptions are really a "script" or something more like a set of logos.  Many people have tried, but it hasn't been definitively cracked.  Now computer scientists are making new attempts to unlock its secrets.

"An ancient language has defied decryption for 100 years. Can AI crack the code?

Scholars have spent a century trying to decipher ancient Indus script. Machine learning may finally help make sense of it all."

By Alizeh Kohari, Rest of World (2/8/22)

This is a long article.  Since it is on a subject that has intrigued me for half a century, plus I personally know some of the key players in the drama and, moreover, I believe that it is innately of great interest and importance, I will provide generous quotations from this substantial piece.

The article begins on a hopeful note:

Jiaming Luo grew up in mainland China thinking about neglected languages. When he was younger, he wondered why the different languages his mother and father spoke were often lumped together as Chinese “dialects.”

Right away you know that Jiaming, a computer scientist, is a smart cookie, much smarter than your average language dilettante or typical pedant who doesn't know the difference between a language and a dialect.

When he became a computer science doctoral student at MIT in 2015, his interest collided with his advisor’s long-standing fascination with ancient scripts. After all, what could be more neglected — or, to use Luo’s more academic term, “lower resourced” — than a long-lost language, left to us as enigmatic symbols on scattered fragments? “I think of these languages as mysteries,” Luo told Rest of World over Zoom. “That’s definitely what attracts me to them.”

In 2019, Luo made headlines when, working with a team of fellow MIT researchers, he brought his machine-learning expertise to the decipherment of ancient scripts. He and his colleagues developed an algorithm informed by patterns in how languages change over time. They fed their algorithm words in a lost language and in a known related language; its job was to align words from the lost language with their counterparts in the known language. Crucially, the same algorithm could be applied to different language pairs.
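To make the idea of "aligning" words concrete, here is a deliberately crude sketch. It is not the MIT team's model, which learns the correspondences itself. Instead it assumes we already have a guessed sign-to-sound table (the hypothetical `sign_map`), transliterates each lost word with it, and picks the closest known word by plain edit distance. The signs, words, and function names are all invented for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]


def align(lost_words, known_words, sign_map):
    """Pair each lost word with the nearest known word under a guessed sign map."""
    pairs = {}
    for word in lost_words:
        # Signs missing from the guessed table become "?" and simply cost an edit.
        guess = "".join(sign_map.get(sign, "?") for sign in word)
        pairs[word] = min(known_words, key=lambda k: edit_distance(guess, k))
    return pairs


# Toy data: made-up signs X, Y, Z, W and made-up "known" words.
sign_map = {"X": "k", "Y": "a", "Z": "t"}
print(align(["XYZ", "ZYX", "XYW"], ["kat", "tak", "dog"], sign_map))
```

The hard part of real decipherment is that nobody hands you `sign_map`; a sketch like this only shows what the matching step looks like once a candidate mapping is on the table.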

Luo and his colleagues tested their model on two ancient scripts that had already been deciphered: Ugaritic, which is related to Hebrew, and Linear B, which was first discovered among Bronze Age–era ruins on the Greek island of Crete. It took professional and amateur epigraphists — people who study ancient written matter — nearly six decades of mental wrangling to decode Linear B….  Linear B is now recognized as the earliest form of Greek.  

Luo and his team wanted to see if their machine-learning model could get to the same answer, but faster. The algorithm yielded what was called “remarkable accuracy”: it was able to correctly translate 67.3% of Linear B’s words into their modern-day Greek equivalents. According to Luo, it took between two and three hours to run the algorithm once it had been built, cutting out the days or weeks — or months or years — that it might take to manually test out a theory by translating symbols one by one. The results for Ugaritic showed an improvement on previous attempts at automatic decipherment.

Their success with Linear B and Ugaritic emboldened the researchers to try their hand at one of the world's greatest remaining undeciphered scripts, that of the Indus Valley or Harappan.

British India, 1872-1873. Alexander Cunningham, an English army engineer turned archeological surveyor, clomped about the ruins of a town in Punjab province that locals called Harappa. On the face of it, there wasn’t much to survey: about two decades earlier, engineers working to link the cities of Lahore and Multan had stumbled across the site and used many of the bricks they found — perfectly preserved, fire kilned — as ballast for nearly 100 miles of railway track, blithely unaware they were remnants of one of the world’s oldest civilizations.

Cunningham didn’t know this either — the Indus Valley civilization wouldn’t be formally “discovered” until the 1920s — but he knew the site had some historical value. Burrowing through the ruins, he and his team chanced upon stone implements they surmised were used for scraping wood or leather. They gathered shards of ancient pottery and what appeared to be a clay ladle. The most striking discovery, though, was a tiny stone tablet, roughly 1.5 inch by 1.5 inch. “On it is engraved very deeply a bull, without a hump, looking to the right, with two stars under the neck,” Cunningham wrote in his report. “Above the bull there is an inscription in six characters, which are quite unknown to me. They are certainly not Indian letters; and as the bull which accompanies them is without a hump, I conclude that the seal is foreign to India.”

Scholars often point out that the Indus script, as the collection of some 4,000 excavated inscriptions, comprising between 400 and roughly 700 unique symbols, is known, might be one of the most deciphered scripts in history. More than a hundred attempts have been published since the 1920s. One theory links it to the Rongorongo script of Easter Island, also still undeciphered; another, offered by a German tantric guru claiming to have achieved his solution through meditation, links it to the cuneiform script used to write the Sumerian language.

For some groups in South Asia, the quest to decode the Indus script is almost existential. India and Pakistan, increasingly riven by their respective strains of religious nationalism, have markedly different relationships to their shared ancient past. The Pakistani state, deeply wedded to the idea of itself as a Muslim homeland, largely ignores its pre-Islamic heritage; its Indian counterpart, on the other hand, has taken to scouring history to find justification for the claim that India has always been a Hindu nation.

Up until the discovery of Harappa, the earliest Indians were believed to be people who lived between 1500 and 500 B.C. and composed the Vedas, the Sanskrit texts that form the basis of modern-day Hinduism. The discovery of a civilization of people who lived before the Vedic people upended the story of India. Given that it undermines their claims of indigeneity, proponents of Hindutva — the most mainstream strain of Hindu nationalism — balk at the theory of a pre-Vedic civilization, even as evidence for it accumulates across disciplines, including archaeology, genetics, and linguistics.

[I]t is remarkable how little we know about the original people of the Indus Valley, who at one point constituted nearly 10% of the world’s inhabitants. It is especially galling given how much more we know about their contemporaries, such as the people of the Egyptian and Mesopotamian civilizations. Part of the reason for this is the continued elusiveness of the Indus script.

Putting machines to work on the Indus script is trickier than using them to reverse-engineer Linear B. We don’t have a great deal of information about the Indus script: most crucially, we don’t know what other language it may be related to. As a result, a model like Luo’s wouldn’t work for the Indus script. That’s not to say technology can’t help, though. In some ways, computer modeling has already played a crucial role: by showing that the Indus script is a language at all. 

For most of the 20th century, the Indus inscriptions were widely accepted as representations of an undeciphered language. Then, in 2004, a group of Harvard researchers — cultural neurobiologist and comparative historian Steve Farmer, computational theorist Richard Sproat, and philologist Michael Witzel — published a paper essentially rubbishing nearly all existing research on the matter. The Indus seals, they claimed, were nothing more than a collection of religious or political symbols — similar to, say, highway signs — and all attempts to decipher them as a language were a waste of time. To underscore their point, Farmer offered a $10,000 reward to anyone who could find an Indus inscription containing at least 50 symbols.

Most Indologists and other Indus script researchers dismissed these arguments. One group of mathematicians, however, turned to computers to investigate the claims. Ronojoy Adhikari, a professor of statistical physics at the University of Cambridge, was one of them. 

Before Cambridge, Adhikari worked at the Institute of Mathematical Sciences, in Chennai. In 2009, he attended a talk by Iravatham Mahadevan, an Indian civil servant turned epigraphist. Mahadevan, who died in 2018, had already cracked Tamil-Brahmi, another undeciphered script, then turned his attention to the Indus script.

Adhikari remembers being fascinated. “I’m a person from the sciences; I don’t have a humanities background,” he said. “But what I found very attractive in Mahadevan’s way of looking at the problem was that he had a very quantitative, almost scientific, approach. He was asking, how many times does a particular symbol occur? What does it occur against? What is the context in which it is occurring? And it appeared to me that because it had already been so quantified, it would be easy to translate this into a formal mathematical analysis.”

A few other data scientists in attendance joined forces with Adhikari. They knew they couldn’t decipher the script. “So the question we asked was: Can we at least tell whether it’s conveying any sort of linguistic information?”

Led by computer scientist Rajesh Rao, the researchers devised a computer program to see if they could answer this question: Was the Indus script a language? “You can give me any sequence of symbols, I don’t care what they are — hieroglyphics, written language, sheet music, computer code — and I will look at them from the point of view of a mathematician,” explained Adhikari. “Meaning, I will simply count how many times one sign occurs next to another.”

Their program drew on the work of Claude E. Shannon, a mid-century American mathematician, engineer, and decoder of wartime codes, who formulated the notion of information entropy — essentially a mathematical measure of disorder. In linguistic systems, symbols occur with somewhat fixed frequencies. “For instance, I just can’t pick up a letter from the alphabet, string it with another letter from the alphabet, and expect to get an English word,” explained Adhikari. In common English, for instance, the letter “q” is nearly always followed by “u.” This semiflexibility is a marker of all linguistic systems. Computer code, on the other hand, is completely rigid: the slightest deviation, and it falls apart.
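The counting that Adhikari describes can be sketched in a few lines. What follows is only an illustration in the spirit of the study, not the researchers' program: it counts how often one sign follows another and computes the conditional entropy of the next sign given the current one, in bits. The function name and the toy sequences are assumptions made up for this example.

```python
import math
from collections import Counter


def conditional_entropy(sequences):
    """Estimate H(next sign | current sign) in bits from adjacent-pair counts."""
    pair_counts = Counter()   # how often sign b directly follows sign a
    first_counts = Counter()  # how often sign a is followed by anything
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            pair_counts[(a, b)] += 1
            first_counts[a] += 1
    total = sum(pair_counts.values())
    h = 0.0
    for (a, b), n in pair_counts.items():
        p_pair = n / total               # P(a, b)
        p_b_given_a = n / first_counts[a]  # P(b | a)
        h -= p_pair * math.log2(p_b_given_a)
    return h


# A completely rigid sequence: each sign has exactly one possible successor.
rigid = ["ABCABCABCABC"]
# A looser one: each sign is followed by either sign equally often.
loose = ["AABBAABBA"]

print(conditional_entropy(rigid))
print(conditional_entropy(loose))
```

On these toy inputs the rigid sequence comes out at 0 bits and the looser one at exactly 1 bit, the maximum possible for a choice between two signs. A measure of this kind is what allows very different symbol systems to be placed on one scale between total rigidity and total randomness.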

The researchers fed their program the 4,000 inscriptions that form the entirety of the Indus script. For good measure, they also ran the program on other linguistic samples (English characters and words, Sanskrit, Tamil, Sumer, and Tagalog) and some nonlinguistic scripts (DNA, protein, Beethoven’s Sonata no. 32, and a computer code called Fortran). The program took about 45 minutes.

“I remember the first time that plot was generated,” recalled Adhikari. On the graph, the curves depicting music, protein, and DNA sequences hovered high, close to the maximum level of entropy, indicating a high level of randomness. Lower down, the known languages are all in a tight cluster. Fortran appears further below.

As for the Indus script, it appears with the other languages, just under Sanskrit and mapping almost cleanly onto Tamil. “It felt fantastic. It really felt very good. It’s nice to have a hunch, but to be able to prove it — I remember thinking, Yes, we’ve really got something here.” 

There is a big difference, of course, between showing that a script encodes a language and decoding what it says.

Bahata Ansumali Mukhopadhyay met Adhikari over a decade ago. At the time, she was a disenchanted software developer looking for an escape route. When Adhikari, who had begun exploring deep learning approaches to work on the script, was in the market for an assistant, she eagerly volunteered.

Deep learning is the dominant technique in artificial intelligence today. It is primarily a form of pattern recognition: the more data you feed a machine, the better it becomes at interpreting future data. But the large-dataset approach isn’t particularly useful when it comes to low-resource (to use Luo’s term) subjects, such as the Indus script, where data is limited. Mukhopadhyay was quick to realize this. 

“I was supposed to be coding,” she said sheepishly. “But, I spent most of my time reading.”

Mukhopadhyay went down one rabbit hole after another. She parsed Mesopotamian, Akkadian, Sumerian, and Old Persian dictionaries. She taught herself how to read Egyptian hieroglyphics. “I realized just how subtle symbolism can be,” she said. “Like the god Horus, his eye was torn into fragments. Each part is imagined as a fraction — and then from there, the ancient Egyptians created their symbols for fractions.”

Even as she helped build software to aid research on the Indus script, her doubts about the approach were building. “See, if the Indus script were an alpha syllabary [a writing system split into units of consonants and vowels, as in Urdu/Hindi], then machine learning and artificial intelligence would have been very suitable,” she explained. But because the inscriptions appear to be pictorial in nature, they posed a greater challenge. “Here you have to understand the historical symbolism used in India. How will artificial intelligence tackle that? How would AI know these symbols represent the fragments of Horus’ eye?”

For the past few years, Mukhopadhyay has been independently researching the Indus inscriptions, focusing on individual symbols. This involves coming up with a particular theory and then testing it — something computers aren’t very good at.

Mukhopadhyay’s theory, for which she made a case in a peer-reviewed paper in Nature, is that the Indus seals were used for taxation and trade control — a collector might carry one around, for instance, as a sort of license. In a subsequent paper, by examining words used for “elephant” — piri, piru, pilu — and “ivory” — pirus — in Near Eastern languages at the time of the Indus civilization, she has argued that the Indus people spoke an earlier form of Dravidian, the linguistic ancestor of current languages like Telugu, Tamil, and Kannada. If researchers can successfully identify a contemporary linguistic relation to the Indus script, it could hold the key to deciphering it. As Mukhopadhyay explains her work, her earrings jiggle. They are artsy depictions of elephant heads. “Pilu,” she said, smiling.

Current iterations of AI aren’t designed to deploy the sort of approach adopted by Mukhopadhyay. Adhikari, who is now also less bullish about the prospect of machine decipherment, is skeptical it ever will be. “I think there are many aspects of cognition we cannot encode in a convenient framework,” he said. “I wouldn’t hazard a guess, but I don’t see it happening in my lifetime. I think we need to understand our brains much better.” Moreover, he added, not all information is quantifiable in a way that computers can understand. “A machine understands one, two, three very well. Two plus two equals four, yes. But …” His gaze drifted beyond his computer screen. “But that this sunset here looks like a beautiful flame — well, it is this sort of abstraction that holds the key to this script.”

Regardless of the approach used, AI is dependent on high-quality data being available in a machine-readable format. This remains a key challenge when it comes to ancient texts, given that they often come to us chipped, eroded, or incomplete in some other form. Scholars can spend decades debating the uniqueness of symbols: Is that a scratch next to a known character, for instance, or a new character altogether? Given how little there is to work with when it comes to long-lost languages, noisy or incomplete data can seriously curtail decipherment efforts.

For the past two decades, Vancouver-based Bryan K. Wells and Berlin-based Andreas Fuls have been quietly digitizing all known Indus seals and symbols. They append contextual information — such as where they were excavated, when, and alongside what artifacts — and add new ones as they are excavated. The Interactive Corpus of Indus Texts (ICIT) currently contains information about 4,537 inscribed artifacts, 5,509 texts, and 19,616 sign occurrences, featuring a total of 707 unique Indus symbols — a much higher number than the 417 previously identified.

The earlier corpora were compiled by hand. As a result, Wells argues, they were so limited that they risked undermining script research. “You know the old computer saying,” he said recently over Skype, “Garbage in, garbage out.” Nearly 50 researchers around the world currently use the database.

For now, the mysteries of the Indus script continue to elude decipherment. Last year, in a follow-up paper to their work automating the decoding of Ugaritic and Linear B, Luo and his team made a small but crucial advance: an algorithm aimed at identifying possible related languages of undeciphered writing systems. Potentially, this could help address the problem of deciphering scripts that don’t yet have a known language they can be compared against. When Luo and his team tested their model on the Iberian language, which has historically been linked to Basque, their findings suggested the two languages were not in fact close enough to be related — a conclusion that corroborated recent scholarship on the matter.

But while Iberian, said Luo, has at least 80 unique symbols, the Indus script has at least 400, making it exponentially more challenging. Still, theoretically speaking, modern machines can handle this level of computation. Could it be possible simply to “brute force” a problem like the Indus script — to analyze it against all contemporary South Asian languages and see which emerges as its closest linguistic relation? “That’s a good thought,” Luo said, after pausing to think. “If I had time, I would definitely try that.”

Luo is quick to point out that he doesn’t expect any decipherment of lost languages to be fully automated. “My thinking is: Let the system propose a list of candidates and let the experts see, Okay, maybe this theory is more correct than the other,” he said. “It definitely reduces the effort and the number of hours that experts have to expend.”

Not everyone is willing to entertain help from machines. Before settling on Iberian, Luo and his colleagues had considered tackling Etruscan, an undeciphered script from pre-Roman Italy. “One of our co-authors emailed a bunch of professors working in this field,” recalled Luo, chuckling. One of them wrote back, shooing them away. “He replied in quite angry tones, ‘machines can never compete with humans.’”

In the history of studies on the Indus Valley script, here are some other individuals who should not be overlooked:

Gregory Possehl of the University of Pennsylvania, who held that Vedic civilization did not destroy the Indus Valley civilization, but that it succeeded it.

Asko Parpola, of the University of Helsinki, who maintains that he did indeed decipher the IV script (see his Deciphering the Indus Script [1994]) and that the language was Dravidian.

Yuri Knorozov, a graduate of Moscow State University, who played a decisive role in the decipherment of the Mayan script.  Here is a conceptual description of his key contribution:

In 1952, the then 30-year-old Knorozov published a paper which was later to prove to be a seminal work in the field (Drevnyaya pis’mennost’ Tsentral’noy Ameriki, or "Ancient Writing of Central America".) The general thesis of this paper put forward the observation that early scripts such as ancient Egyptian and Cuneiform which were generally or formerly thought to be predominantly logographic or even purely ideographic in nature, in fact contained a significant phonetic component. That is to say, rather than the symbols representing only or mainly whole words or concepts, many symbols in fact represented the sound elements of the language in which they were written, and had alphabetic or syllabic elements as well, which if understood could further their decipherment.


Knorozov also applied these insights to the investigation of the Indus Valley script, and he was already making extensive use of computers in his research before the mid-sixties.

Regular readers of Language Log will understand why I have such high regard for Knorozov.  Like Peter Stephen Du Ponceau, he downplayed the pictographic, ideographic, and logographic aspects of written symbols in writing systems and emphasized their phonetic properties in the formation of words.  This is why Knorozov and Du Ponceau operate at a higher level of analysis and insight than middling scholars and are able to make major breakthroughs instead of simply repeating what has been taken for granted for centuries, when it is often quite wrong.

David W. McAlpin, who was an assistant professor of South Asian linguistics when I came to Penn in 1979.  He was denied tenure, but I think he was investigating something of enormous potential, namely that Elamite and Dravidian were cognate languages, and he was working toward the description of Proto-Elamo-Dravidian.  Since Harappan was sandwiched between Dravidian and Elamite, his hypothesis had the potential for a tremendous breakthrough in South Asian linguistics.

Even though my friends were on both sides of the IV script decipherability debate, I have remained agnostic.  After the recent work described in this probing Rest of World article, and being a believer in the awesome power of AI, I'm now tilting toward the possibility that we will one day understand those short IV texts.


[h.t. Don Wyatt]


  1. Scott P. said,

    February 14, 2022 @ 8:46 am

    The Interactive Corpus of Indus Texts (ICIT) currently contains information about 4,537 inscribed artifacts, 5,509 texts, and 19,616 sign occurrences, featuring a total of 707 unique Indus symbols — a much higher number than the 417 previously identified.

    My understanding is that this is one of the outstanding issues with the Indus script: determining whether signs that vary slightly are different signs, or the same sign with regional or scribal variations.

  2. Philip Anderson said,

    February 14, 2022 @ 6:15 pm

    “Etruscan, an undeciphered script from pre-Roman Italy.”
    The script is well understood, even the meaning of many words, but not whether it’s related to any other language.

  3. Peter Grubtal said,

    February 16, 2022 @ 11:20 am

    A large part of Knosorov's achievement was recognising the significance and correct interpretation of Bishop Landa's alphabet, vid. "Breaking the Maya Code", Michael D. Coe.

    It's amazing to think that there must have been still Mayans around who could read the script when the Spanish arrived in the Yucatan. Mayan history is often presented as having terminated well before the Spanish came on the scene.

    On computer analysis indicating that the Harappan script looks like natural language, I am sceptical. Not so long ago some academics claimed that an analysis of that kind showed that the Voynich MS had characteristics of a genuine language. With what is known of the origins of Voynich, and a perusal of it, I have no trouble dismissing it as a catchpenny hoax (albeit probably a 16th Cent. hoax).

    Phaistos disk anyone?

  4. SusanC said,

    February 17, 2022 @ 2:59 am

    @Peter Grubtal:

    The Voynich Manuscript is, indeed, a target for this kind of approach.

    There's a strong suspicion that it's a meaningless forgery, either by some contemporaries of Athanasius Kircher or even by the book dealer, Voynich, who "discovered" it.

    The question is, can you tell, statistically? Are there statistical properties that real languages have, but naive forgeries don't, such that you can tell which is which without knowing the language?

    It's a legitimate and interesting research question.

    Even if the Voynich manuscript is a hoax, it might in principle be a real language – an existing language enciphered somehow, or a conlang like Esperanto.

    The differences between the conjectured "A" and "B" authors of the Voynich manuscript are one route in. So, ok, we have two authors whose patterns of symbol usage are statistically different. Are these close enough that they could be from two speakers of the same language, or different enough that they must be two people making up two different lots of nonsense?

  5. SusanC said,

    February 17, 2022 @ 3:19 am

    And then there's the Mayan Primary Standard Sequence:

    If the makers of a collection of archeological artefacts have been so kind as to write on them the name of the type of pottery vessel and the name of its contents (e.g. chocolate), there's a good prospect for deciphering the script by statistically correlating what we know about each object (e.g. what kind of vessel is it) with the symbols written on it.

  6. Peter Grubtal said,

    February 17, 2022 @ 4:52 am

    You're right, there are possibilities for Voynich other than a naive forgery, i.e. a more sophisticated forgery, or perhaps even, perish the thought, my (and many others') hunch is wrong, and its meaning will one day be revealed to the world. The illustrations in the book are for me almost more suspicious than the text: they just look too "New Agey": dangled worms for someone with more money than sense like the credulous Holy Roman Emperor, Rudolf II. He's one of the possible targets for the book. Although the expression is modern, New Ageism has a long history.

    It's interesting you mention Athanasius Kircher. He put about the idea that a language could be expressed by purely ideographic writing, hence would be Victor Mair's bête noire if he were still around.

    The link to the Mayan vessels is very interesting. I love Coe's book, and there was even a very good tv documentary on the deciphering of Mayan a couple of years ago, complete with animations of the glyphs.

    I mangled Knorozov's name in my first note.
