Language Log

Against physics

September 10, 2022 @ 6:51 pm · Filed by Mark Liberman under Computational linguistics, Phonetics and phonology

Or rather: Against the simplistic interpretation of physics-based abstractions as equal to more complex properties of the physical universe. And narrowing the focus further, it's a big mistake to analyze signals in terms of such abstractions, while pretending that we're analyzing the processes creating those signals, or our perceptions of those signals and processes. This happens in many ways in many disciplines, but it's especially problematic in speech research.

The subject of today's post is one particular example, namely the use of "Harmonic to Noise Ratio" (HNR) as a measure of hoarseness and such-like aspects of voice quality. Very similar issues arise with all other acoustic measures of speech signals.

I'm not opposed to the use of such measures. I use them myself in research all the time. But there can be serious problems, and it's easy for things to go badly off the rails. For example, HNR can be strongly affected by background noise, room acoustics, microphone frequency response, microphone placement, and so on. This might just add noise to your data. But if different subject groups are recorded in different places or different ways, you might get serious artefacts.
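
To make the background-noise point concrete, here is a minimal sketch. It assumes a crude autocorrelation-based HNR estimate (a simplified textbook-style calculation, not the algorithm of any particular analysis tool) and a synthetic five-harmonic signal standing in for a sustained vowel. The function names, the 100 Hz fundamental, and the two noise levels are all illustrative assumptions.

```python
import math
import random

def hnr_db(frame, sr, f0_min=75.0, f0_max=500.0):
    """Crude HNR estimate in dB from the peak of the normalized autocorrelation."""
    n = len(frame)
    mean = sum(frame) / n
    x = [v - mean for v in frame]

    def autocorr(lag):
        # Unbiased estimate: average over the n - lag overlapping samples.
        return sum(x[i] * x[i + lag] for i in range(n - lag)) / (n - lag)

    power = autocorr(0)
    lo, hi = int(sr / f0_max), int(sr / f0_min)  # plausible pitch-period lags
    r = max(autocorr(lag) for lag in range(lo, hi + 1)) / power
    r = min(max(r, 1e-6), 1.0 - 1e-6)  # keep the logarithm finite
    return 10.0 * math.log10(r / (1.0 - r))

def synthetic_voice(sr=16000, f0=100.0, dur=0.1):
    """A periodic signal with five harmonics (hypothetical stand-in for a vowel)."""
    n = round(sr * dur)
    return [sum(math.sin(2 * math.pi * f0 * h * t / sr) / h for h in range(1, 6))
            for t in range(n)]

def add_noise(signal, snr_db, seed=0):
    """Add white Gaussian noise at the requested signal-to-noise ratio."""
    rng = random.Random(seed)
    power = sum(v * v for v in signal) / len(signal)
    sigma = math.sqrt(power / (10 ** (snr_db / 10)))
    return [v + rng.gauss(0.0, sigma) for v in signal]

sr = 16000
voice = synthetic_voice(sr)
hnr_clean = hnr_db(voice, sr)
hnr_quiet_room = hnr_db(add_noise(voice, 20.0), sr)  # assumed quiet recording
hnr_noisy_room = hnr_db(add_noise(voice, 0.0), sr)   # assumed noisy recording
print(f"clean: {hnr_clean:.1f} dB")
print(f"20 dB SNR: {hnr_quiet_room:.1f} dB")
print(f"0 dB SNR: {hnr_noisy_room:.1f} dB")
```

The "voice" is identical in all three cases, and only the added noise changes. If one subject group were recorded under the quieter condition and another under the noisier one, a difference in HNR between the groups would reflect the recording setup rather than the speakers.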

Our Lady of the Highway: A linguistic mystery

September 8, 2022 @ 12:19 pm · Filed by Mark Liberman under Computational linguistics, Speech technology

Current text-to-speech systems are pretty good. Their output is almost always comprehensible, and often pretty natural-sounding. But there are still glitches.

This morning, Dick Margulis sent an example of one common problem: inconsistent (and often wrong) stressing of complex nominals:

We have a winding road that we drive with our Google Maps navigator on, to keep us from taking a wrong turn in the woods. We have noticed that "West Woods Road" is rendered with a few different stress patterns as we go from turn to turn, and we can't come up with a hypothesis explaining the variation. Attached is a recording. It's a few minutes long because that's how long the trip takes. The background hum is the car.

I've extracted and concatenated the 11 Google Maps instructions from the four minutes and five seconds of the attached recording:

Micro- Nano-Stylistic Variation

September 6, 2022 @ 9:17 am · Filed by Mark Liberman under Computational linguistics, Usage

"Don't miss the most loved conference by Delphists like you!"

Philip Taylor wrote to complain about that phrase, which apparently arrived in an email advertisement:

"The most loved conference …" ? I would have written "The conference most loved …".

But his preference apparently disagrees, not only with the author of that flyer, but also with most other writers of English. And it's wonderful how easily we can now check such things. As Yogi Berra (may have) said, "Sometimes you can see a lot just by looking".

When more data makes things worse…

August 29, 2022 @ 7:44 am · Filed by Mark Liberman under Computational linguistics

The mantra of machine learning, as Fred Jelinek used to say, is "The best data is more data" — because in many areas, there's a Long Tail of relevant cases that are hard to classify or predict without either a valid theory or enough examples.

But a recent meta-analysis of machine-learning work in digital medicine shows, convincingly, that more data can lead to poorer reported performance. The paper is Visar Berisha et al., "Digital medicine and the curse of dimensionality", npj Digital Medicine, 2021, and one of the pieces of evidence they present is shown in the figure reproduced below:

This analysis considers two types of models: (1) speech-based models for classifying between a control group and patients with a diagnosis of Alzheimer’s disease (Con vs. AD; blue plot) and (2) speech-based models for classifying between a control group and patients with other forms of cognitive impairment (Con vs. CI; red plot).

Word frequency variation: elicit vs. illicit

August 20, 2022 @ 8:16 am · Filed by Mark Liberman under Computational linguistics

In the comments on yesterday's post about a slip of the fingers or brain ("Elicit → illicit"), there was some discussion about which of the two words is more common.

Obviously, the answer to such questions depends on where you look.

So I looked in a bunch of places. Overall, illicit tends to be more common than elicit — but the relative frequency varies widely, and sometimes it's the other way round.
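
A comparison of this kind can be sketched as follows. The two text samples are invented stand-ins for different kinds of sources, and the helper name `per_million` is hypothetical. Real checks would use large corpora, so the numbers here illustrate the method only.

```python
import re
from collections import Counter

def per_million(text, words):
    """Occurrences per million tokens for each word in `words` (case-insensitive)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    return {w: 1e6 * counts[w] / len(tokens) for w in words}

# Invented samples standing in for two very different text sources.
sources = {
    "news-like": "Police seized illicit drugs. The illicit trade grew. "
                 "Questions elicit answers.",
    "science-like": "Stimuli elicit responses. We elicit judgments. "
                    "Prompts elicit speech. No illicit use.",
}

ratios = {}
for name, text in sources.items():
    freq = per_million(text, ["elicit", "illicit"])
    # Ratio above 1 means "illicit" is the more common word in this source.
    ratios[name] = freq["illicit"] / freq["elicit"]
    print(f"{name}: illicit/elicit = {ratios[name]:.2f}")
```

Normalizing to a per-million rate before taking the ratio keeps the per-source frequencies comparable even when the sources differ greatly in size.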

COURTHOUHAING TOGET T ROCESS.WHE

July 25, 2022 @ 6:59 am · Filed by Mark Liberman under Computational linguistics, Elephant semifics

HE HAS ALL THE SOU OF COURSE
0:05 AND LOADED, READTOO.K
0:11 TING
0:16 A TVERY CONFIDENT.CONWAY
0:21 COURTHOUHAING TOGET T ROCESS.WHE
0:28 COIDATE'
0:30 TTACUTION'S CATHATE'
0:36 SE.
0:36 CHCEN'T KNHA
0:37 TAER OFURDI

That's the start of the automatically-generated transcript on YouTube for "See George Conway's reaction to Trump's reported plan if he wins again", CNN 7/24/2022.

Conversations with GPT-3

June 25, 2022 @ 7:50 am · Filed by Mark Liberman under Computational linguistics, Humor

In a recent presentation, I noted that generic statements can be misleading, though it's not easy to avoid the problem:

The limitations and complexities of ordinary language in this area pose difficult problems for scientists, journalists, teachers, and everyone else.

But the problems are especially hard to avoid for AI researchers aiming to turn large text collections into an understanding of the world that the texts discuss.

And to illustrate the point, I used a couple of conversations with GPT-3.

Sentient AI

June 14, 2022 @ 8:29 am · Filed by Mark Liberman under Computational linguistics, Humor

Google Researcher discovers sentient AI pic.twitter.com/3D0vcQBjEm

— Julian Togelius (@togelius) June 13, 2022

Comparing phrase lengths in French and English

May 28, 2022 @ 6:34 am · Filed by Mark Liberman under Computational linguistics, Orthography

In a comment on "Trends in French sentence length" (5/26/2022), AntC raised the issue of cross-language differences in word counts: "I was under the impression French needed ~20% more words to express the same idea as an English text." And in response, I promised to "check letter-count and word-count relationships in some English/French parallel text corpora, when I have a few minutes".

I found a few minutes yesterday, and ran (a crude version of) this check on the data in Alex Franz, Shankar Kumar & Thorsten Brants, "1993-2007 United Nations Parallel Text", LDC2013T06.
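
A crude check of this sort might look like the sketch below. It assumes sentence-aligned English/French pairs and whitespace tokenization, so elided forms like "l'idée" count as one word. The two example pairs and the function name `length_ratios` are made up for illustration and are not drawn from the UN corpus.

```python
def length_ratios(pairs):
    """Return (word ratio, letter ratio) of French to English over aligned pairs."""
    en_words = sum(len(en.split()) for en, fr in pairs)
    fr_words = sum(len(fr.split()) for en, fr in pairs)
    # Count alphabetic characters only, ignoring spaces and punctuation.
    en_letters = sum(ch.isalpha() for en, fr in pairs for ch in en)
    fr_letters = sum(ch.isalpha() for en, fr in pairs for ch in fr)
    return fr_words / en_words, fr_letters / en_letters

# Invented aligned pairs (English, French); a real check would read a corpus.
pairs = [
    ("The committee adopted the report.", "Le comité a adopté le rapport."),
    ("The meeting was adjourned.", "La séance est levée."),
]

word_ratio, letter_ratio = length_ratios(pairs)
print(f"French/English word ratio: {word_ratio:.2f}")
print(f"French/English letter ratio: {letter_ratio:.2f}")
```

Summing counts over the whole collection before dividing, rather than averaging per-sentence ratios, keeps very short sentences from dominating the result. Two toy pairs say nothing about the languages themselves.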

Trends in French sentence length?

May 26, 2022 @ 8:13 pm · Filed by Mark Liberman under Computational linguistics

"Memoirs of a Woman of Long Sentences" (5/21/2022) reproduced a plot from my 5/20/2022 talk at SHEL 12:

In the talk's slides, I used that plot (without the outlier-marking arrow) as a way of illustrating the obvious point that "Older texts in English tend to have longer sentences".

And in my final slide, I suggested that "French seems different". That (imprudent) suggestion was based on my subjective impression of a few 18th-century works, where it seemed to me that sentence (and especially paragraph) lengths were much shorter in French-language works than in English-language ones from the same period.

Memoirs of a Woman of Long Sentences

May 21, 2022 @ 9:36 am · Filed by Mark Liberman under Computational linguistics, Style and register

In the question period after my virtual talk yesterday at SHEL 12, an alert audience member asked about the outlier in a graph that I showed of average sentence length over the centuries. The outlier is marked with an arrow in the plot below, though no such arrow singled it out in the presentation:

I had been struck by the same point when I made the graph, and identified the work and author as John Cleland's 1748 epistolary novel, "Memoirs of a Woman of Pleasure", commonly known as Fanny Hill.

Praise for clinical applications of linguistic analysis

May 10, 2022 @ 7:26 am · Filed by Mark Liberman under Computational linguistics, Language and medicine

From the abstract of Sunghye Cho et al., "Lexical and Acoustic Speech Features Relating to Alzheimer Disease Pathology", published in Neurology on 4/29/2022:

Background and Objectives: We compared digital speech and language features of patients with amnestic Alzheimer’s disease (aAD) or logopenic variant primary progressive aphasia (lvPPA) in a biologically confirmed cohort and related these features to neuropsychiatric test scores and CSF analyses.

[…]

Discussion: Our measures captured language and speech differences between the two phenotypes that traditional language-based clinical assessments failed to identify.

From an editorial by Federica Agosta and Massimo Filippi, "Natural Speech Analysis: A Window Into Alzheimer Disease Phenotypes", published in Neurology on 5/4/2022:

Compared to a standard language assessment, the automated analysis of natural speech is more complex and requires a larger amount of time to be post-processed. On the other hand, as is well demonstrated by this study, analysis of natural speech provides information at several levels of language production. Even though data are extracted from only one recorded minute of speech, the tool is able to detect subtle differences among groups, reflecting the patient’s daily experience in a more realistic way than the standard speech and language assessment. Its use has already produced important achievements in distinguishing different language phenotypes. Furthermore, differently from other studies, the work of Cho et al proposed an automated and reproducible method that highly reduces the time of speech analysis and increases the inter-rater reliability.

The covert pandemic

May 2, 2022 @ 6:03 am · Filed by Mark Liberman under Computational linguistics, Humor

Trevor Noah's speech at the White House Correspondents' Dinner has gotten a lot of well-deserved praise. But what impressed me most about it was the quality of the "auto-generated" transcript associated with the YouTube version.

Assuming that "auto-generated" means "the output of automatic speech-to-text", the results are overall excellent — with a few odd glitches. For example, the transcript consistently renders "Covid" as "covert". The first one, at around 1:40 —

and uh covert risk aside can i just say
how happy i am that this event is
happening again for the first time in
three years

Archive for Computational linguistics
