Archive for Computational linguistics

Remaining problems with TTS

(…and with the New York Department of Environmental Conservation…)

Like many other online text sites, the New York Times now offers synthetic text-to-speech readings for (most of) its stories. TTS quality has improved enormously since the 1980s, when I worked with Bill Dunn from Dow Jones Information Services on (the idea of) a pre-internet version of digital news delivery, including synthesized audio versions. (See "Thanks, Bill Dunn!", 8/6/2009, for a bit more of the story.)

And this morning, while doing some brainless form checking, I listened to the audio version of Victor Mather and Jesus Jiménez, "After 7 Years, P’Nut the Squirrel Is Taken Away and Then Put Down", NYT 11/1/2024, which starts this way:

P’Nut, a pet squirrel with a popular Instagram page, was seized by state government officials on Wednesday in Pine City, N.Y., and later euthanized to test for rabies.

Read the rest of this entry »

Comments (10)

Psychotic Whisper

Whisper is a widely-used speech-to-text system from OpenAI — and it turns out that generative AI's hallucination problem afflicts Whisper to a surprisingly serious extent, as documented by Allison Koenecke, Anna Seo Gyeong Choi, Katelyn X. Mei, Hilke Schellmann, and Mona Sloane,"Careless Whisper: Speech-to-Text Hallucination Harms", In The 2024 ACM Conference on Fairness, Accountability, and Transparency,  2024:

Abstract: Speech-to-text services aim to transcribe input audio as accurately as possible. They increasingly play a role in everyday life, for example in personal voice assistants or in customer-company interactions. We evaluate Open AI’s Whisper, a state-of-the-art automated speech recognition service outperforming industry competitors, as of 2023. While many of Whisper’s transcriptions were highly accurate, we find that roughly 1% of audio transcriptions contained entire hallucinated phrases or sentences which did not exist in any form in the underlying audio. We thematically analyze the Whisper-hallucinated content, finding that 38% of hallucinations include explicit harms such as perpetuating violence, making up inaccurate associations, or implying false authority. We then study why hallucinations occur by observing the disparities in hallucination rates between speakers with aphasia (who have a lowered ability to express themselves using speech and voice) and a control group. We find that hallucinations disproportionately occur for individuals who speak with longer shares of non-vocal durations—a common symptom of aphasia. We call on industry practitioners to ameliorate these language-model-based hallucinations in Whisper, and to raise awareness of potential biases amplified by hallucinations in downstream applications of speech-to-text models.

Read the rest of this entry »

Comments (12)

"Lost" languages?

The use of the word lost in this recent story caught my attention — Pankaj Doval, "Google set to revive lost Indian languages", The Times of India 10/3/2024:

As it gets deeper into India with generative AI platform Gemini and other suite of digital offerings, Google has taken up a new task in hand – reviving some of the lost Indian languages and creating digital records and online footprint for them.

I'll say more later about Google's important and interesting contribution to an important and interesting problem. But first, what does the article mean by "lost Indian languages"? I started from the idea that languages that are "lost" are extinct, i.e. no longer spoken — and a web search for the phrase "lost languages" confirms that others have the same interpretation.

However, the Times of India article makes it clear that this is not what they mean:

The idea is to enable people to easily carry out voice or text searches in their local dialects and languages.

As the work moves towards completion, people in the hinterland and various regions can easily do voice search in their own languages to gain accurate and valuable information from, say, Google's Gemini AI platform or carry out live translations, harness YouTube better to target their communities.

The project has so far reached 59 Indian languages, including 15 that currently do not have any kind of a digital footprint and were rather declining in usage.

The project has so far reached 59 Indian languages, including 15 that currently do not have any kind of a digital footprint and were rather declining in usage.

Read the rest of this entry »

Comments (10)

Putin: "pollutant"? "pooch and"?

The transcriptions on YouTube are generally pretty good these days, but sometimes the results are weird.

A notable recent example is the transcription of Donald Trump's 8/31/2024 Fox interview with Mark Levin, where the system renders "Putin" first as "pollutant" and then as "pooch and".

Read the rest of this entry »

Comments (12)

Skinning a bear with Rosanne Barr

…vs. having a video conversation with her…

Attachment ambiguity of the week: "RFK Jr. says he dumped dead bear in Central Park after ditching plan to skin it in bizarre video with Roseanne Barr", NY Post 8/4/2024.

Read the rest of this entry »

Comments (6)

The evolving PubMed landscape

Following up on "Are LLMs writing PubMed articles?", 7/7/2024, Cervantes suggested a factor, besides LLM availability, that has been influencing the distribution of word frequencies in PubMed's index:

As an investigator whose own papers are indexed in PubMed, and who has been watching the trends in scientific fashion for some decades, I can come up with other explanations. For one thing, it's easier to get exploratory and qualitative research published nowadays than it once was. Reviewers and editors are less inclined to insist that only hypothesis driven research is worthy of their journal — and, with open access, there are a lot more journals, including some with low standards and others that do insist on decent quality but will accept a wide range of papers. It's even possible now to publish protocols for work that hasn't been done yet. So it doesn't surprise me at all that words like "explore" and "delve" (which is a near synonym, BTW) are more likely to show up in abstracts, because that's more likely to be what the paper is doing.

I agree, although it remains unclear whether those changes have been strong enough to explain the effects documented in Dmitry Kobak et al., "Delving into ChatGPT usage in academic writing through excess vocabulary", arXiv.org 7/3/2024.

Read the rest of this entry »

Comments (3)

Stochastic parrots extended

Philip Resnik, "Large Language Models are Biased Because They Are Large Language Models", arXiv.org 6/19/2024:

This paper's primary goal is to provoke thoughtful discussion about the relationship between bias and fundamental properties of large language models. We do this by seeking to convince the reader that harmful biases are an inevitable consequence arising from the design of any large language model as LLMs are currently formulated. To the extent that this is true, it suggests that the problem of harmful bias cannot be properly addressed without a serious reconsideration of AI driven by LLMs, going back to the foundational assumptions underlying their design.

Read the rest of this entry »

Comments (33)

AI voice-over?

On 5/8/2024, the Defense Visual Information Distribution Service (DVIDS) offered a "Graphical representation of how the precision cutting charges will be used on key bridge section":

Several bits in the voice-over suggest that it was generated by a text-to-speech program — I'll note a couple of them below. And the failure to capitalize "Key Bridge" in the page's title might also be a symptom of AI-generation?

Read the rest of this entry »

Comments (11)

Yay Newfriend again

I got an echo of Saturday's post about chatbot pals, from an article yesterday in Intelligencer — John Herrman, "Meta’s AI Needs to Speak With You" ("The company is putting chatbots everywhere so you don’t go anywhere"):

Meta has an idea: Instead of ever leaving its apps, why not stay and chat with a bot? This past week, Mark Zuckerberg announced an update to Meta’s AI models, claiming that, in some respects, they were now among the most capable in the industry. He outlined his company’s plans to pursue AGI, or Artificial General Intelligence, and made some more specific predictions: “By the end of the decade, I think lots of people will talk to AIs frequently throughout the day, using smart glasses like what we’re building with Ray-Ban Meta.”

Read the rest of this entry »

Comments (9)

Ask Dalí

A new feature at the Dalí Museum in St. Petersburg FL:

Read the rest of this entry »

Comments (2)

Yay Newfriend

Worries about future applications of AI technology focus on many things, including new forms of automation replacing human workers, realistic deepfake media spreading disinformation, and mass killing by autonomous military machines. But there's something happening already that hasn't gotten as much commentary: chatbots designed to be pals or romantic connections.

Read the rest of this entry »

Comments (8)

"Our digital god is a CSV file?"

Barry Collins, "The 5 Weirdest Things Elon Musk Told Britain’s Prime Minister About AI", Forbes 11/3/2023:

5. Our New Digital Gods Are Giant Spreadsheets

Musk and Sunak spent some time discussing the difficulties of regulating AI and how it differs from other branches of technology. And that led to a rather strange discussion about the nature of large language models and what they actually are.

Musk described AI models as a “gigantic data file” with “billions of weights and parameters.”

“You can’t just read it and see what it’s going to do. It’s a gigantic file of inscrutable numbers,” he said.

“It sort of ends up being a giant comma-separated value file,” Musk added, describing the kind of file you might open with Microsoft Excel. “Our digital god is a CSV file? Really? OK.”

Read the rest of this entry »

Comments (20)

"Emote Portrait Alive"

EMO, by Linrui Tian, Qi Wang, Bang Zhang, and Liefeng Bo from Alibaba's Institute for Intelligent Computing, is "an expressive audio-driven portrait-video generation framework. Input a single reference image and the vocal audio, e.g. talking and singing, our method can generate vocal avatar videos with expressive facial expressions, and various head poses".

As far as I know, there's no interactive demo so far, much less code — just a github demo page and an arXiv.org paper.

Their demo clips are very impressive — a series of X posts from yesterday has gotten 1.1M views already. Here's Leonardo DiCaprio artificially lip-syncing Eminem:


Read the rest of this entry »

Comments (5)