Archive for Computational linguistics


0:11 TING
0:36 SE.

That's the start of the automatically-generated transcript on YouTube for "See George Conway's reaction to Trump's reported plan if he wins again", CNN 7/24/2022.

Read the rest of this entry »

Comments (3)

Conversations with GPT-3

In a recent presentation, I noted that generic statements can be misleading, though it's not easy to avoid the problem:

The limitations and complexities of ordinary language in this area pose difficult problems for scientists, journalists, teachers, and everyone else.

But the problems are especially hard to avoid for AI researchers aiming to turn large text collections into an understanding of the world that the texts discuss.

And to illustrate the point, I used a couple of conversations with GPT-3.

Read the rest of this entry »

Comments (14)

Sentient AI

Read the rest of this entry »

Comments (7)

Comparing phrase lengths in French and English

In a comment on "Trends in French sentence length" (5/26/2022), AntC raised the issue of cross-language differences in word counts: "I was under the impression French needed ~20% more words to express the same idea as an English text." And in response, I promised to "check letter-count and word-count relationships in some English/French parallel text corpora, when I have a few minutes".

I found a few minutes yesterday, and ran (a crude version of) this check on the data in Alex Franz, Shankar Kumar & Thorsten Brants, "1993-2007 United Nations Parallel Text", LDC2013T06.
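For readers who want to try a similarly crude check on their own aligned data, here is a minimal Python sketch. It is not the script used for the post. The function name `length_ratios` is hypothetical, the two sentence pairs are invented toy examples rather than lines from LDC2013T06, and "words" here just means whitespace-separated tokens, which is an assumption that ignores elision and punctuation.

```python
def length_ratios(pairs):
    """Return (French/English word ratio, French/English letter ratio).

    `pairs` is an iterable of (english, french) aligned text segments.
    Words are whitespace-separated tokens; letters are alphabetic
    characters only. Both are deliberately crude approximations.
    """
    en_words = fr_words = en_letters = fr_letters = 0
    for en, fr in pairs:
        en_words += len(en.split())
        fr_words += len(fr.split())
        en_letters += sum(ch.isalpha() for ch in en)
        fr_letters += sum(ch.isalpha() for ch in fr)
    if en_words == 0 or en_letters == 0:
        raise ValueError("no English text to compare against")
    return fr_words / en_words, fr_letters / en_letters


# Invented toy pairs, for illustration only (not corpus data).
toy_pairs = [
    ("The meeting was called to order.", "La séance est ouverte."),
    ("The agenda was adopted.", "L'ordre du jour est adopté."),
]

word_ratio, letter_ratio = length_ratios(toy_pairs)
print(f"word ratio (fr/en): {word_ratio:.3f}")
print(f"letter ratio (fr/en): {letter_ratio:.3f}")
```

Two toy pairs say nothing about French and English in general. The sketch only shows the shape of the computation, which would need to be run over a large aligned corpus to mean anything.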

Read the rest of this entry »

Comments (10)

Trends in French sentence length?

"Memoirs of a Woman of Long Sentences" (5/21/2022) reproduced a plot from my 5/20/2022 talk at SHEL 12:

In the talk's slides, I used that plot (without the outlier-marking arrow) as a way of illustrating the obvious point that "Older texts in English tend to have longer sentences".

And in my final slide, I suggested that "French seems different". That (imprudent) suggestion was based on my subjective impression of a few 18th-century works, where it seemed to me that sentence (and especially paragraph) lengths were much shorter in French-language works than in English-language ones from the same period.

Read the rest of this entry »

Comments (7)

Memoirs of a Woman of Long Sentences

In the question period after my virtual talk yesterday at SHEL 12, an alert audience member asked about the outlier in a graph that I showed of average sentence length over the centuries. The outlier is marked with an arrow in the plot below, though no such arrow singled it out in the presentation:

I had been struck by the same point when I made the graph, and had identified the work as John Cleland's 1748 epistolary novel "Memoirs of a Woman of Pleasure", commonly known as "Fanny Hill".
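As a rough illustration of what "average sentence length" means operationally, here is a minimal Python sketch. It is not the method behind the graph. The function name `mean_sentence_length` is hypothetical, and the sentence splitter (a break after ".", "!" or "?" followed by whitespace) is an assumption that will mis-split abbreviations and quoted speech.

```python
import re


def mean_sentence_length(text: str) -> float:
    """Mean number of whitespace-separated words per sentence.

    Sentences are split crudely after '.', '!' or '?' when followed
    by whitespace. Returns 0.0 for empty input.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if not sentences:
        return 0.0
    return sum(len(s.split()) for s in sentences) / len(sentences)


# Invented sample text, for illustration only.
sample = "One two three. Four five! Six?"
print(mean_sentence_length(sample))
```

Computing this value per work and plotting it against publication date would give a graph of the same general kind, and a work with unusually long sentences would stand out as an outlier.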

Read the rest of this entry »

Comments (8)

Praise for clinical applications of linguistic analysis

From the abstract of Sunghye Cho et al., "Lexical and Acoustic Speech Features Relating to Alzheimer Disease Pathology", published in Neurology on 4/29/2022:

Background and Objectives: We compared digital speech and language features of patients with amnestic Alzheimer’s disease (aAD) or logopenic variant primary progressive aphasia (lvPPA) in a biologically confirmed cohort and related these features to neuropsychiatric test scores and CSF analyses.


Discussion: Our measures captured language and speech differences between the two phenotypes that traditional language-based clinical assessments failed to identify. 

From an editorial by Federica Agosta and Massimo Filippi, "Natural Speech Analysis: A Window Into Alzheimer Disease Phenotypes", published in Neurology on 5/4/2022:

Compared to a standard language assessment, the automated analysis of natural speech is more complex and requires a larger amount of time to be post-processed. On the other hand, as is well demonstrated by this study, analysis of natural speech provides information at several levels of language production. Even though data are extracted from only one recorded minute of speech, the tool is able to detect subtle differences among groups, reflecting the patient’s daily experience in a more realistic way than the standard speech and language assessment. Its use has already produced important achievements in distinguishing different language phenotypes. Furthermore, differently from other studies, the work of Cho et al proposed an automated and reproducible method that highly reduces the time of speech analysis and increases the inter-rater reliability.

Read the rest of this entry »

Comments (1)

The covert pandemic

Trevor Noah's speech at the White House Correspondents' Dinner has gotten a lot of well-deserved praise. But what impressed me most about it was the quality of the "auto-generated" transcript associated with the YouTube version.

Assuming that "auto-generated" means "the output of automatic speech-to-text", the results are overall excellent — with a few odd glitches. For example, the transcript consistently renders "Covid" as "covert". The first one, at around 1:40 —

and uh covert risk aside can i just say
how happy i am that this event is
happening again for the first time in
three years

Read the rest of this entry »

Comments (15)

Parsing puzzle of the week

"Short Wave: A Physics Legend", NPR Up First 4/3/2022 [emphasis added]:

In the 1950's, a particle physicist made a landmark discovery that changed what we thought we knew about how our universe operates. Chien-Shiung Wu did it while raising a family and an ocean away from her relatives in China. In this episode from NPR's daily science podcast Short Wave, we delve into the life and impact of Chien-Shiung Wu, widely considered the "queen of nuclear physics."

Read the rest of this entry »

Comments (17)

The weirdness of typing errors

In this age of typing on computers and other digital devices, when we daily input thousands upon thousands of words, we are often amazed at the number and types of mistakes we make. Many of them are simple and straightforward, as when our fingers stumblingly hit the wrong keys by sheer accident. People who type on phones often warn their correspondents that their messages are likely to contain such errors by including a notice like this at the bottom:

Please forgive spelling / grammatical errors; typed on glass // sent from my phone.

Read the rest of this entry »

Comments (37)

Postdocs on ancient scripts: Chinese and Aegean

Since these are on subjects that are of interest to many of us, I'm calling them to your attention.

From Mattia Cartolano:

The INSCRIBE project is hiring!

Two post-doc positions are now available:

  1. Evolution of Graphic Codes: The Origins of the Chinese Script
  2. Undeciphered Aegean Scripts: New perspectives in Computational Linguistics

Deadline for applications: Sunday 27 March 2022
If you want to find out more, write to

Read the rest of this entry »

Comments off

Turing Complete

Today's xkcd:

The mouseover title: "Thanks to the ForcedEntry exploit, your company's entire tech stack can now be hosted out of a PDF you texted to someone."

Read the rest of this entry »

Comments (13)

Who is brian?

Email from a colleague in computer science, listing some of the mistranscriptions in the Zoom captions of his office hours:

timing problem -> tiling problem
bulletin annually -> boolean formulae
satisfy your ability -> satisfiability
fire patterns -> tile patterns
inquisition -> position
valuables -> variables
double fines -> double prime
double poison -> ?
amen -> m, n
wine is in the continent of age -> ???
I do not want a diet climb to brian -> ???

I will stop here and I hope that you can all satisfy your ability with no double fines and avoid inquisition.
Who is brian?

Read the rest of this entry »

Comments (8)