Language Log

Archive for Computational linguistics

Computational phylogeny of Indo-European

June 27, 2025 @ 6:24 am · Filed by Victor Mair under Classification, Computational linguistics, Evolution of language

Alexei S. Kassian and George Starostin, "Do 'language trees with sampled ancestors' really support a 'hybrid model' for the origin of Indo-European? Thoughts on the most recent attempt at yet another IE phylogeny". Humanities and Social Sciences Communications, 12, no. 682 (May 16, 2025).

Abstract

In this paper, we present a brief critical analysis of the data, methodology, and results of the most recent publication on the computational phylogeny of the Indo-European family (Heggarty et al. 2023), comparing them to previous efforts in this area carried out by (roughly) the same team of scholars (informally designated as the “New Zealand school”), as well as concurrent research by scholars belonging to the “Moscow school” of historical linguistics. We show that the general quality of the lexical data used as the basis for classification has significantly improved from earlier studies, reflecting a more careful curation process on the part of qualified historical linguists involved in the project; however, there remain serious issues when it comes to marking cognation between different characters, such as failure (in many cases) to distinguish between true cognacy and areal diffusion and the inability to take into account the influence of the so-called derivational drift (independent morphological formations from the same root in languages belonging to different branches). Considering that both the topological features of the resulting consensus tree and the established datings contradict historical evidence in several major aspects, these shortcomings may partially be responsible for the results. Our principal conclusion is that the correlation between the number of included languages and the size of the list may simply be insufficient for a guaranteed robust topology; either the list should be drastically expanded (not a realistic option for various practical reasons) or the number of compared taxa be reduced, possibly by means of using intermediate reconstructions for ancestral stages instead of multiple languages (the principle advocated by the Moscow school).


Unicode CJK Unified Ideographs Extension J and the nature of the sinographic writing system

June 16, 2025 @ 8:06 am · Filed by Victor Mair under Computational linguistics, Writing systems

Submitted by Charles Belov:

I've been browsing through the proposed Unicode 17 changes, currently undergoing a comment period, with interest. While I don't have the knowledge to intelligently comment on the proposals, it's good to see that they are actively improving language access.

I'm puzzled that some new characters have been added to the existing Unicode CJK Unified Ideographs Extension C (6 characters) and Unicode CJK Unified Ideographs Extension E (12 characters) rather than added to a new extension. But the most interesting is the apparently brand-new Unicode CJK Unified Ideographs Extension J, with over 4,000 added characters.


A new voice morphing application

March 4, 2025 @ 9:20 am · Filed by Mark Liberman under Computational linguistics

Over the years, we've documented various applications of voice morphing technology besides the malicious creation of "deep fake" audio clips. Here's a new one: Amrit Dhillon, "AI erases call centre staff’s Indian accents", The Times 3/2/2025:

A French company which operates the largest number of call centres in the world is using artificial intelligence to soften Indian accents in real time to make customer conversations easier and shorter.

Teleperformance said that it was sometimes difficult for customers calling call centres in India — and the Philippines — to understand workers’ accents, leading to frustration and longer than necessary calls.

“When you have an Indian agent on the line, sometimes it’s hard to hear, to understand,” Thomas Mackenbrock, the company’s deputy chief executive, told Bloomberg News. “The technology can neutralise the accent of the Indian speaker with zero latency. This creates more intimacy, increases customer satisfaction, and reduces the average handling time. It is a win-win for both parties.”

The software, called “accent translation”, has been developed by Sanas, a start-up based in Palo Alto, California.


Remaining problems with TTS

November 2, 2024 @ 9:38 am · Filed by Mark Liberman under Computational linguistics, Orthography

(…and with the New York Department of Environmental Conservation…)

Like many other online text sites, the New York Times now offers synthetic text-to-speech readings for (most of) its stories. TTS quality has improved enormously since the 1980s, when I worked with Bill Dunn from Dow Jones Information Services on (the idea of) a pre-internet version of digital news delivery, including synthesized audio versions. (See "Thanks, Bill Dunn!", 8/6/2009, for a bit more of the story.)

And this morning, while doing some brainless form checking, I listened to the audio version of Victor Mather and Jesus Jiménez, "After 7 Years, P’Nut the Squirrel Is Taken Away and Then Put Down", NYT 11/1/2024, which starts this way:

P’Nut, a pet squirrel with a popular Instagram page, was seized by state government officials on Wednesday in Pine City, N.Y., and later euthanized to test for rabies.


Psychotic Whisper

October 27, 2024 @ 5:31 am · Filed by Mark Liberman under Artificial intelligence, Computational linguistics

Whisper is a widely used speech-to-text system from OpenAI — and it turns out that generative AI's hallucination problem afflicts Whisper to a surprisingly serious extent, as documented by Allison Koenecke, Anna Seo Gyeong Choi, Katelyn X. Mei, Hilke Schellmann, and Mona Sloane, "Careless Whisper: Speech-to-Text Hallucination Harms", in The 2024 ACM Conference on Fairness, Accountability, and Transparency, 2024:

Abstract: Speech-to-text services aim to transcribe input audio as accurately as possible. They increasingly play a role in everyday life, for example in personal voice assistants or in customer-company interactions. We evaluate Open AI’s Whisper, a state-of-the-art automated speech recognition service outperforming industry competitors, as of 2023. While many of Whisper’s transcriptions were highly accurate, we find that roughly 1% of audio transcriptions contained entire hallucinated phrases or sentences which did not exist in any form in the underlying audio. We thematically analyze the Whisper-hallucinated content, finding that 38% of hallucinations include explicit harms such as perpetuating violence, making up inaccurate associations, or implying false authority. We then study why hallucinations occur by observing the disparities in hallucination rates between speakers with aphasia (who have a lowered ability to express themselves using speech and voice) and a control group. We find that hallucinations disproportionately occur for individuals who speak with longer shares of non-vocal durations—a common symptom of aphasia. We call on industry practitioners to ameliorate these language-model-based hallucinations in Whisper, and to raise awareness of potential biases amplified by hallucinations in downstream applications of speech-to-text models.


"Lost" languages?

October 6, 2024 @ 10:22 am · Filed by Mark Liberman under Computational linguistics, Endangered languages

The use of the word lost in this recent story caught my attention — Pankaj Doval, "Google set to revive lost Indian languages", The Times of India 10/3/2024:

As it gets deeper into India with generative AI platform Gemini and other suite of digital offerings, Google has taken up a new task in hand – reviving some of the lost Indian languages and creating digital records and online footprint for them.

I'll say more later about Google's important and interesting contribution to an important and interesting problem. But first, what does the article mean by "lost Indian languages"? I started from the idea that languages that are "lost" are extinct, i.e. no longer spoken — and a web search for the phrase "lost languages" confirms that others have the same interpretation.

However, the Times of India article makes it clear that this is not what they mean:

The idea is to enable people to easily carry out voice or text searches in their local dialects and languages.

As the work moves towards completion, people in the hinterland and various regions can easily do voice search in their own languages to gain accurate and valuable information from, say, Google's Gemini AI platform or carry out live translations, harness YouTube better to target their communities.

The project has so far reached 59 Indian languages, including 15 that currently do not have any kind of a digital footprint and were rather declining in usage.


Putin: "pollutant"? "pooch and"?

September 1, 2024 @ 8:31 am · Filed by Mark Liberman under Artificial intelligence, Computational linguistics

The transcriptions on YouTube are generally pretty good these days, but sometimes the results are weird.

A notable recent example is the transcription of Donald Trump's 8/31/2024 Fox interview with Mark Levin, where the system renders "Putin" first as "pollutant" and then as "pooch and".


Skinning a bear with Roseanne Barr

August 5, 2024 @ 10:14 am · Filed by Mark Liberman under Computational linguistics, Psychology of language, Syntax

…vs. having a video conversation with her…

Attachment ambiguity of the week: "RFK Jr. says he dumped dead bear in Central Park after ditching plan to skin it in bizarre video with Roseanne Barr", NY Post 8/4/2024.


The evolving PubMed landscape

July 9, 2024 @ 7:08 am · Filed by Mark Liberman under Computational linguistics

Following up on "Are LLMs writing PubMed articles?", 7/7/2024, Cervantes suggested a factor, besides LLM availability, that has been influencing the distribution of word frequencies in PubMed's index:

As an investigator whose own papers are indexed in PubMed, and who has been watching the trends in scientific fashion for some decades, I can come up with other explanations. For one thing, it's easier to get exploratory and qualitative research published nowadays than it once was. Reviewers and editors are less inclined to insist that only hypothesis driven research is worthy of their journal — and, with open access, there are a lot more journals, including some with low standards and others that do insist on decent quality but will accept a wide range of papers. It's even possible now to publish protocols for work that hasn't been done yet. So it doesn't surprise me at all that words like "explore" and "delve" (which is a near synonym, BTW) are more likely to show up in abstracts, because that's more likely to be what the paper is doing.

I agree, although it remains unclear whether those changes have been strong enough to explain the effects documented in Dmitry Kobak et al., "Delving into ChatGPT usage in academic writing through excess vocabulary", arXiv.org 7/3/2024.


Stochastic parrots extended

June 27, 2024 @ 8:48 am · Filed by Mark Liberman under Artificial intelligence, Computational linguistics

Philip Resnik, "Large Language Models are Biased Because They Are Large Language Models", arXiv.org 6/19/2024:

This paper's primary goal is to provoke thoughtful discussion about the relationship between bias and fundamental properties of large language models. We do this by seeking to convince the reader that harmful biases are an inevitable consequence arising from the design of any large language model as LLMs are currently formulated. To the extent that this is true, it suggests that the problem of harmful bias cannot be properly addressed without a serious reconsideration of AI driven by LLMs, going back to the foundational assumptions underlying their design.


AI voice-over?

May 11, 2024 @ 2:05 pm · Filed by Mark Liberman under Computational linguistics

On 5/8/2024, the Defense Visual Information Distribution Service (DVIDS) offered a "Graphical representation of how the precision cutting charges will be used on key bridge section":

Several bits in the voice-over suggest that it was generated by a text-to-speech program — I'll note a couple of them below. And the failure to capitalize "Key Bridge" in the page's title might also be a symptom of AI-generation?


Yay Newfriend again

April 22, 2024 @ 6:41 am · Filed by Mark Liberman under Artificial intelligence, Computational linguistics

I got an echo of Saturday's post about chatbot pals, from an article yesterday in Intelligencer — John Herrman, "Meta’s AI Needs to Speak With You" ("The company is putting chatbots everywhere so you don’t go anywhere"):

Meta has an idea: Instead of ever leaving its apps, why not stay and chat with a bot? This past week, Mark Zuckerberg announced an update to Meta’s AI models, claiming that, in some respects, they were now among the most capable in the industry. He outlined his company’s plans to pursue AGI, or Artificial General Intelligence, and made some more specific predictions: “By the end of the decade, I think lots of people will talk to AIs frequently throughout the day, using smart glasses like what we’re building with Ray-Ban Meta.”
