Archive for Computational linguistics

The Scunthorpe effect rides again

Alex Hern, "Anti-porn filters stop Dominic Cummings trending on Twitter", The Guardian 5/27/2020:

Twitter’s anti-porn filters have blocked Dominic Cummings’ name from its list of trending topics despite Boris Johnson’s chief adviser dominating British political news for almost a week, the Guardian can reveal.

As a result of the filtering, trending topics over the past five days have instead included a variety of misspellings of his name, including #cummnings, #dominiccummigs and #sackcummimgs, as well as his first name on its own, the hashtag #sackdom, and the place names Durham, County Durham and Barnard Castle.

The filter also affects suggested hashtags, meaning users who tried to type #dominiccummings were instead presented with one of the misspelled variations to auto-complete, helping them trend instead.

This sort of accidental filtering has gained a name in computer science: the Scunthorpe problem, so-called because of the Lincolnshire town’s regular issues with such censorship.
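The mechanism behind the Scunthorpe problem can be sketched in a few lines of Python. This is only an illustration: the blocklist entry is a mild stand-in term, the function names are hypothetical, and nothing here describes how Twitter's filter is actually implemented.

```python
import re

# Hypothetical blocklist with a mild stand-in term; real filters use other lists.
BLOCKLIST = {"hell"}

def substring_filter(text):
    """Flag text if a blocked term occurs anywhere, even inside a longer word."""
    lowered = text.lower()
    return any(term in lowered for term in BLOCKLIST)

def whole_word_filter(text):
    """Flag text only if a blocked term appears as a complete word."""
    words = re.findall(r"[a-z]+", text.lower())
    return any(word in BLOCKLIST for word in words)

for sample in ["Shell Beach", "hello world", "what the hell"]:
    print(sample, substring_filter(sample), whole_word_filter(sample))
```

The substring version flags the innocent "Shell Beach" and "hello world", while the whole-word version does not. Whole-word matching is no cure for hashtags, though, since they run words together without separators.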


Mama Drama

So-called "verbal fluency" is one of the tasks we're using in the first iteration of the SpeechBiomarkers project (and please participate if you haven't done so!). Despite the test's name, it doesn't really measure verbal fluency, but rather asks subjects to name as many words as possible from some category in 60 seconds, like "animals" or "words starting with the letter F".

Here's the first ten seconds of one participant's "animals" response:

As you can hear, the audio quality is quite good, although it was recorded remotely using the participant's browser and their system's standard microphone. These days, standard hardware usually has pretty good built-in facilities for voice recording.

In order to automate the analysis, we need a speech-to-text system that can do a good enough job on data of this kind. As I've noted in earlier posts, we're not there yet for picture descriptions ("Shelties On Alki Story Forest", 11/26/2019; "The right boot of the warner of the baron", 12/6/2019). For "fluency" recordings, the error rate is worse — but maybe we're actually closer to a solution, as I'll explain below.
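The post does not say how the error rate is computed. A common choice for speech-to-text evaluation is word error rate (WER): the word-level edit distance between a reference transcript and the system output, divided by the number of reference words. A minimal sketch under that assumption, with invented example strings:

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

# Invented example: one substitution and one deletion against four reference words.
print(word_error_rate("dog cat horse cow", "dog cap cow"))
```

For a word-list task like "animals", a single misrecognized word costs a full error with no surrounding sentence context to help the recognizer, which is one reason such recordings can score badly.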


Chatbot comedy

Unfortunately, most customer service chatbots are not nearly this good:


Zoom time

I'm involved with several projects that analyze recordings from e-interviews conducted using systems like Zoom, Bluejeans, and WebEx. Some of our analysis techniques rely on timing information, and so it's natural to wonder how much distortion might be introduced by those systems' encoding, transmission, and decoding processes.

Why might timing in particular be distorted? Because any internet-based audio or video streaming system encodes the signal at the source into a series of fairly small packets, sends them individually by diverse routes to the destination, and then assembles them again at the end.

If the transmission is one-way, then the system can introduce a delay that's long enough to ensure that all the diversely-routed packets get to the destination in time to be reassembled in the proper order — maybe a couple of seconds of buffering. But for a conversational system, that kind of latency disrupts communication, and so the buffering delays used by broadcasters and streaming services are not possible. As a result, there may be missing packets at decoding time, and the system has to deal with that by skipping, repeating, or interpolating (the signal derived from) packets, or by just freezing up for a while.
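The trade-off between buffering delay and missing packets can be illustrated with a toy simulation. Everything in it is an assumption made for illustration: the 20 ms packet length, the delay model (30 ms base latency plus exponentially distributed jitter), and the function name. It does not describe how Zoom, Bluejeans, or WebEx work.

```python
import random

def late_packet_fraction(buffer_ms, n_packets=1000, packet_ms=20, seed=0):
    """Fraction of packets that arrive after their playout deadline.

    Packet k is sent at k * packet_ms and must be played at
    k * packet_ms + buffer_ms; anything arriving later counts as missing.
    """
    rng = random.Random(seed)
    late = 0
    for k in range(n_packets):
        send_time = k * packet_ms
        # Assumed delay model: 30 ms base latency plus exponential jitter (mean 40 ms).
        arrival = send_time + 30 + rng.expovariate(1 / 40)
        if arrival > send_time + buffer_ms:
            late += 1
    return late / n_packets

for buffer_ms in (50, 150, 2000):
    print(buffer_ms, late_packet_fraction(buffer_ms))
```

With a two-second buffer no packet misses its deadline in this model, whereas a conversational buffer of 50 ms leaves a large share of packets to be skipped, repeated, or interpolated.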

It's not clear (at least to me) how much of this happens when, or how to monitor it. (Though it's easy to see that the video signal in such conversations is often coarsely sampled or even "frozen", and obvious audio glitches sometimes occur as well.) But the results of a simple test suggest that more subtle time distortion is sometimes a problem for the audio channel as well.


Speaker change detection

A couple of years ago ("Hearing interactions", 2/28/2018), I posted some anecdotal evidence that human perception of speaker change is accurate and usually also pretty fast. I noted that the performance of automatic systems at analogous tasks was distinctly underwhelming in comparison.

A recent paper measures human performance more systematically, and compares it with that of a state-of-the-art program. From Neeraj Sharma et al., "On the impact of language familiarity in talker change detection", ICASSP 2020:

The ability to detect talker changes when listening to conversational speech is fundamental to perception and understanding of multitalker speech. In this paper, we propose an experimental paradigm to provide insights on the impact of language familiarity on talker change detection. Two multi-talker speech stimulus sets, one in a language familiar to the listeners (English) and the other unfamiliar (Chinese), are created. A listening test is performed in which listeners indicate the number of talkers in the presented stimuli. Analysis of human performance shows statistically significant results for: (a) lower miss (and a higher false alarm) rate in familiar versus unfamiliar language, and (b) longer response time in familiar versus unfamiliar language. These results signify a link between perception of talker attributes and language proficiency. Subsequently, a machine system is designed to perform the same task. The system makes use of the current state-of-the-art diarization approach with x-vector embeddings. A performance comparison on the same stimulus set indicates that the machine system falls short of human performance by a huge margin, for both languages.




Lexical display rates in novels

In some on-going research on linguistic features relating to clinical diagnosis and tracking, we've been looking at "lexical diversity". It's easy to measure the rate of vocabulary display: you can just use a type-token graph, which shows the count of distinct words ("types") against the count of total words ("tokens"). It's less obvious how to turn such a curve into a single number that can be compared across sources. For a survey of some alternative measures, see e.g. Scott Jarvis, "Short texts, best-fitting curves and new measures of lexical diversity", Language Testing 2002; and for the measure that we've settled on, see Michael Covington and Joe McFall, "Cutting the Gordian knot: The moving-average type–token ratio (MATTR)", Journal of Quantitative Linguistics 2010. More on that later.
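As a rough illustration of both ideas, here is a minimal Python sketch of a type-token curve and of a moving-average type-token ratio. The whitespace tokenization, the default window length, and the fallback for texts shorter than the window are assumptions of this sketch, not details taken from the cited papers or from the project's code.

```python
def type_token_curve(tokens):
    """Return the number of distinct types seen after each token."""
    seen = set()
    curve = []
    for token in tokens:
        seen.add(token)
        curve.append(len(seen))
    return curve

def mattr(tokens, window=50):
    """Moving-average type-token ratio: mean TTR over all windows of a fixed length."""
    if window <= 0:
        raise ValueError("window must be positive")
    if len(tokens) < window:
        # Assumption of this sketch: fall back to plain TTR for short texts.
        return len(set(tokens)) / len(tokens) if tokens else 0.0
    ratios = [len(set(tokens[i:i + window])) / window
              for i in range(len(tokens) - window + 1)]
    return sum(ratios) / len(ratios)

tokens = "the cat saw the dog and the dog saw the cat".split()
print(type_token_curve(tokens))
print(mattr(tokens, window=4))
```

Because every window has the same length, the averaged ratio does not shrink simply because a text is longer, which is what makes it comparable across sources. This version recomputes each window from scratch; an incremental update would be faster on long texts.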

For now, I want to make a point that depends only on type-token graphs. Over time, I've accumulated a small private digital corpus of more than 100 English-language fiction titles, from Tristram Shandy forward to 2019. It's clear that different authors have different characteristic rates of vocabulary display, and for today's post, I want to present the authors in my collection with the highest and lowest characteristic rates.


New approaches to Alzheimer's Disease

This post is another pitch for our on-going effort to develop simple, easy, and effective ways to track neurocognitive health through short interactions with a web app. Why do we want this? Two reasons: first, early detection of neurodegenerative disorders through near-universal tracking; and second, easy large-scale evaluation of interventions, whether those are drugs or lifestyle changes. You can participate by enrolling, and by suggesting it to your friends and acquaintances as well.

Today, diagnosis generally depends on scoring below a certain value on cognitive tests such as the MMSE, which usually won't even be given until you've started experiencing life-changing symptoms — and at that point, the degenerative process has probably been at work for a decade or more. This may well be too late for interventions to make a difference, which may help explain the failure of dozens of Alzheimer's disease drug trials. And it's difficult and expensive to evaluate an intervention, in part because it requires a series of clinic visits, making it hard to fund support for trials that don't involve a patented drug.

If people could accurately track their neurocognitive health with a few minutes a week on a web app, they could be alerted to potential problems by the rate of change in their scores, even if they're many years away from a diagnosis by today's methods. Of course, this will be genuinely useful only when we have ways to slow or reverse the process — but the same approach can be used to evaluate such interventions inexpensively on a large scale.
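One simple way to quantify a "rate of change" in repeated scores is the least-squares slope of score against time. The sketch below is only an illustration with invented numbers; the project's actual scoring and any alerting rule are not described in this post, and the function name is hypothetical.

```python
def score_trend(weeks, scores):
    """Ordinary least-squares slope of score against time (score units per week)."""
    if len(weeks) != len(scores) or len(weeks) < 2:
        raise ValueError("need at least two paired observations")
    mean_t = sum(weeks) / len(weeks)
    mean_s = sum(scores) / len(scores)
    covariance = sum((t - mean_t) * (s - mean_s) for t, s in zip(weeks, scores))
    variance = sum((t - mean_t) ** 2 for t in weeks)
    if variance == 0:
        raise ValueError("all observations share the same time point")
    return covariance / variance

# Invented weekly scores for illustration only.
print(score_trend([0, 1, 2, 3, 4], [30, 29, 29, 28, 27]))
```

A persistently negative slope over many sessions is the kind of signal that could be followed up, long before any single score crosses a diagnostic threshold.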

More background is here: "Towards tracking neurocognitive health", 3/24/2020. As that post explains, this is just the first step on what may be a long journey, but we will be making the data available to all interested researchers, so that the approaches that have worked elsewhere in AI research over the past 30 years can be applied to this problem as well.

Again, you can participate by enrolling. And please spread the word!


What do you hear?

Listen to this sound, and describe it in the comments below:

You can learn what the sound is, and why I care how you hear it, after the fold.


Standardized Project Gutenberg Corpus

Martin Gerlach and Francesc Font-Clos, "A standardized Project Gutenberg corpus for statistical analysis of natural language and quantitative linguistics", arXiv 12/19/2018:

The use of Project Gutenberg (PG) as a text corpus has been extremely popular in statistical analysis of language for more than 25 years. However, in contrast to other major linguistic datasets of similar importance, no consensual full version of PG exists to date. In fact, most PG studies so far either consider only a small number of manually selected books, leading to potential biased subsets, or employ vastly different pre-processing strategies (often specified in insufficient details), raising concerns regarding the reproducibility of published results. In order to address these shortcomings, here we present the Standardized Project Gutenberg Corpus (SPGC), an open science approach to a curated version of the complete PG data containing more than 50,000 books and more than 3×10^9 word-tokens. Using different sources of annotated metadata, we not only provide a broad characterization of the content of PG, but also show different examples highlighting the potential of SPGC for investigating language variability across time, subjects, and authors. We publish our methodology in detail, the code to download and process the data, as well as the obtained corpus itself on 3 different levels of granularity (raw text, timeseries of word tokens, and counts of words). In this way, we provide a reproducible, pre-processed, full-size version of Project Gutenberg as a new scientific resource for corpus linguistics, natural language processing, and information retrieval.


Long ago, in a narratology far away…

Louisa Shepard, "‘May the force be with you’ and other fan fiction favorites", Penn Today 12/18/2019:

Starting with Star Wars, Penn researchers create a unique digital humanities tool to analyze the most popular phrases and character connections in fan fiction. […]

The Penn team started with the script of “Star Wars: The Force Awakens” and created algorithms to analyze the words in the script against those in millions of fan fiction stories. The unique program identifies the most popular phrases, characters, scenes, and connections that are repurposed by these writers and then displays them in a simple graph format.

The results are now available on their “fan engagement meter”.
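The kind of phrase matching described in the excerpt above can be approximated by counting which word n-grams from a script recur in a set of stories. This is a toy sketch, not the Penn team's algorithm: the function names, the trigram length, and the example texts are all invented for illustration, and punctuation is not handled.

```python
from collections import Counter

def ngrams(text, n):
    """Return the list of lowercase word n-grams in a text."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def reused_phrases(script, stories, n=3):
    """Count how many stories contain each n-gram of the script."""
    script_phrases = set(ngrams(script, n))
    counts = Counter()
    for story in stories:
        # Use a set so each story counts a phrase at most once.
        counts.update(script_phrases & set(ngrams(story, n)))
    return counts

script = "may the force be with you"
stories = ["she said may the force be with us", "the force be with you always"]
print(reused_phrases(script, stories).most_common(3))
```

At the scale of millions of stories, exact set intersection like this would need indexing or hashing to stay practical.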

Serendipitously, today's xkcd:


Mrs. Transformer-XL Tittlemouse

This is another note on the amazing ability of modern AI learning techniques to imitate some aspects of natural-language patterning almost perfectly, while managing to miss common sense almost entirely. This probably tells us something about modern AI and also about language, though we probably won't understand what it's telling us until many years in the future.

Today's example comes from Zihang Dai et al., "Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context", arXiv 6/2/2019.


Canoe schemata nama gary anaconda

Following up on recent posts suggesting that speech-to-text is not yet a solved problem ("Shelties On Alki Story Forest", "The right boot of the warner of the baron", "AI is brittle"), here's a YouTube link to a lecture given in July of 2018 by Michael Picheny, "Speech Recognition: What's Left?" The whole thing is worth following, but I particularly draw your attention to the section starting around 50:06, where he reviews the state of human and machine performance with respect to "noise, speaking style, accent, domain robustness, and language learning capabilities", with the goal to "make the case that we have a long way to go in [automatic] speech recognition".


AI is brittle

Following up "Shelties On Alki Story Forest" (11/26/2019) and "The right boot of the warner of the baron" (12/6/2019), here's some recent testimony from engineers at Google about the brittleness of contemporary speech-to-text systems: Arun Narayanan et al., "Recognizing Long-Form Speech Using Streaming End-To-End Models", arXiv 10/24/2019.

The goal of that paper is to document some methods for making things better. But I want to underline the fact that considerable headroom remains, even with the massive amounts of training material and computational resources available to a company like Google.

Modern AI (almost) works because of machine learning techniques that find patterns in training data, rather than relying on human programming of explicit rules. A weakness of this approach has always been that generalization to material different in any way from the training set can be unpredictably poor. (Though of course rule- or constraint-based approaches to AI generally never even got off the ground at all.) "End-to-end" techniques, which eliminate human-defined layers like words, so that speech-to-text systems learn to map directly between sound waveforms and letter strings, are especially brittle.
