## A diarization corpus from Amazon

About a month ago, Zaid Ahmed and others in Amazon's speech research group released DiPCo ("Dinner Party Corpus"), "a new data set that will help speech scientists address the difficult problem of separating speech signals in reverberant rooms with multiple speakers".

The past decade has seen striking progress in Human Language Technology, brought about by new methods, more training data, and (especially) cheaper/faster computers. But this rapid progress highlights the fact that "All problems are not solved", as I wrote last year. In particular, the central problem of "diarization", or determining who spoke when, has turned out to be surprisingly difficult. And diarization is not just hard for conversations at dinner parties.

## Kabbalist NLP

Oscar Schwartz, "Natural Language Processing Dates Back to Kabbalist Mystics", IEEE Spectrum 10/28/2019 ("Long before NLP became a hot field in AI, people devised rules and machines to manipulate language"):

The story begins in medieval Spain. In the late 1200s, a Jewish mystic by the name of Abraham Abulafia sat down at a table in his small house in Barcelona, picked up a quill, dipped it in ink, and began combining the letters of the Hebrew alphabet in strange and seemingly random ways. Aleph with Bet, Bet with Gimmel, Gimmel with Aleph and Bet, and so on.

Abulafia called this practice "the science of the combination of letters." He wasn't actually combining letters at random; instead he was carefully following a secret set of rules that he had devised while studying an ancient Kabbalistic text called the Sefer Yetsirah. This book describes how God created "all that is formed and all that is spoken" by combining Hebrew letters according to sacred formulas. In one section, God exhausts all possible two-letter combinations of the 22 Hebrew letters.

By studying the Sefer Yetsirah, Abulafia gained the insight that linguistic symbols can be manipulated with formal rules in order to create new, interesting, insightful sentences. To this end, he spent months generating thousands of combinations of the 22 letters of the Hebrew alphabet and eventually emerged with a series of books that he claimed were endowed with prophetic wisdom.
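The two-letter arithmetic in the quoted passage is easy to check. Here is a minimal sketch, assuming that a "combination" means a pair of two distinct letters; the passage doesn't say whether order matters, so both counts are shown. The transliterated letter names are just labels, and spellings vary from source to source.

```python
from itertools import combinations, permutations

# The 22 letters by transliterated name (labels only; spellings vary).
HEBREW_LETTERS = [
    "Aleph", "Bet", "Gimmel", "Dalet", "He", "Vav", "Zayin", "Het",
    "Tet", "Yod", "Kaf", "Lamed", "Mem", "Nun", "Samekh", "Ayin",
    "Pe", "Tsadi", "Qof", "Resh", "Shin", "Tav",
]

# Assumption: a "combination" pairs two distinct letters.
unordered_pairs = list(combinations(HEBREW_LETTERS, 2))  # Aleph-Bet same as Bet-Aleph
ordered_pairs = list(permutations(HEBREW_LETTERS, 2))    # Aleph-Bet differs from Bet-Aleph

print(len(HEBREW_LETTERS), len(unordered_pairs), len(ordered_pairs))
# prints: 22 231 462
```

Either way the space is small: 22 × 21 / 2 = 231 pairs if order is ignored, and 462 if it counts.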

## Lombroso and Lavater, reborn as fake AI

Drew Harwell, "A face-scanning algorithm increasingly decides whether you deserve the job", WaPo 10/22/2019:

An artificial intelligence hiring system has become a powerful gatekeeper for some of America's most prominent employers, reshaping how companies assess their workforce — and how prospective employees prove their worth.

Designed by the recruiting-technology firm HireVue, the system uses candidates' computer or cellphone cameras to analyze their facial movements, word choice and speaking voice before ranking them against other applicants based on an automatically generated "employability" score.

HireVue's "AI-driven assessments" have become so pervasive in some industries, including hospitality and finance, that universities make special efforts to train students on how to look and speak for best results. More than 100 employers now use the system, including Hilton, Unilever and Goldman Sachs, and more than a million job seekers have been analyzed.

## "Protester dressed as Boris Johnson scales Big Ben"

Sometimes it's hard for us humans to see the intended meaning of an ambiguous phrase, like "Hospitals named after sandwiches kill five". But in other cases, the intended structure comes easily to us, and we have a hard time seeing the alternative, as in the case of "Extinction rebellion protester dressed as Boris Johnson scales Big Ben".

These two examples have essentially the same structure. There's a word that might be construed as a preposition linking a verb to a nominal argument ("named after sandwiches", "dressed as Boris Johnson"), or alternatively as a complementizer introducing a subordinate clause ("after sandwiches kill five", "as Boris Johnson scales Big Ben"). In the first example, the complementizer reading is the one the author intended, while in the second example, it's the preposition. But in both cases, most of us go for the preposition, presumably because "named after X" and "dressed as Y" are common constructions.

## Danger: Demo!

John Seabrook, "The Next Word: Where will predictive text take us?", The New Yorker 10/14/2019:

At the end of every section in this article, you can read the text that an artificial intelligence predicted would come next.

I glanced down at my left thumb, still resting on the Tab key. What have I done? Had my computer become my co-writer? That's one small step forward for artificial intelligence, but was it also one step backward for my own?

The skin prickled on the back of my neck, an involuntary reaction to what roboticists call the "uncanny valley"—the space between flesh and blood and a too-human machine.

## TO THE CONTRARYGE OF THE AND THENESS

Yiming Wang et al., "Espresso: A fast end-to-end neural speech recognition toolkit", ASRU 2019:

We present ESPRESSO, an open-source, modular, extensible end-to-end neural automatic speech recognition (ASR) toolkit based on the deep learning library PyTorch and the popular neural machine translation toolkit FAIRSEQ. ESPRESSO supports distributed training across GPUs and computing nodes, and features various decoding approaches commonly employed in ASR, including look-ahead word-based language model fusion, for which a fast, parallelized decoder is implemented. ESPRESSO achieves state-of-the-art ASR performance on the WSJ, LibriSpeech, and Switchboard data sets among other end-to-end systems without data augmentation, and is 4–11× faster for decoding than similar systems (e.g. ESPNET).

## Speed vs. efficiency in speech production and reception

An interesting new paper on speech and information rates as determined by neurocognitive capacity appeared a week ago:

Christophe Coupé, Yoon Oh, Dan Dediu, and François Pellegrino, "Different languages, similar encoding efficiency: Comparable information rates across the human communicative niche", Science Advances, 5.9 (2019):  eaaw2594. doi: 10.1126/sciadv.aaw2594.

Here's the abstract:

Language is universal, but it has few indisputably universal characteristics, with cross-linguistic variation being the norm. For example, languages differ greatly in the number of syllables they allow, resulting in large variation in the Shannon information per syllable. Nevertheless, all natural languages allow their speakers to efficiently encode and transmit information. We show here, using quantitative methods on a large cross-linguistic corpus of 17 languages, that the coupling between language-level (information per syllable) and speaker-level (speech rate) properties results in languages encoding similar information rates (~39 bits/s) despite wide differences in each property individually: Languages are more similar in information rates than in Shannon information or speech rate. These findings highlight the intimate feedback loops between languages' structural properties and their speakers' neurocognition and biology under communicative pressures. Thus, language is the product of a multiscale communicative niche construction process at the intersection of biology, environment, and culture.
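The trade-off described in the abstract comes down to a product: information rate (bits/s) is information per syllable (bits/syllable) times speech rate (syllables/s). Here is a minimal sketch of that arithmetic. The two profiles and all of their numbers are hypothetical, chosen only so that the products land on the ~39 bits/s figure; they are not values from the paper, and this is not the paper's estimation procedure.

```python
from dataclasses import dataclass


@dataclass
class LanguageProfile:
    """Hypothetical per-language averages, for illustration only."""
    name: str
    bits_per_syllable: float      # language-level property
    syllables_per_second: float   # speaker-level property

    def information_rate(self) -> float:
        # bits/syllable * syllables/second = bits/second
        return self.bits_per_syllable * self.syllables_per_second


# Made-up numbers: one "dense but slow" profile, one "light but fast".
profiles = [
    LanguageProfile("dense-but-slow (hypothetical)", 6.5, 6.0),
    LanguageProfile("light-but-fast (hypothetical)", 5.0, 7.8),
]

for p in profiles:
    print(f"{p.name}: {p.information_rate():.1f} bits/s")
```

The two invented profiles differ by 30% in each individual property, yet their products coincide, which is the pattern the abstract reports across its 17 languages.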

## Where the magic happens

From today's SMBC, an idea about AI that's obvious in retrospect but seems to be new:

## The Voder — and "emotion"

There was an interesting story yesterday on NPR's All Things Considered, "How We Hear Our Own Voice Shapes How We See Ourselves And How Others See Us". Shankar Vedantam starts with the case of a woman whose voice was altered because her larynx was accidentally damaged during an operation, leading to a change in her personality. And then it segues into an 80-year-old crowd pleaser, the Voder:

All the way back in 1939, Homer Dudley unveiled an organ-like machine he called the "Voder". It worked using special keys and a foot pedal, and it fascinated people at the World's Fair in New York.

Helen, will you have the Voder say 'She saw me'.

She … saw … me

That sounded awfully flat. How about a little expression? Say the sentence in answer to these questions.

Q: Who saw you?
A: SHE saw me.
Q: Whom did she see?
A: She saw ME.
Q: Well did she see you or hear you?
A: She SAW me.

## "Douchey uses of AI"

The book for this year's Penn Reading Project is Cathy O'Neil's Weapons of Math Destruction. From the PRP's description:

We live in the age of the algorithm. Increasingly, the decisions that affect our lives—where we go to school, whether we get a car loan, how much we pay for health insurance—are being made not by humans but by mathematical models. In theory, this should lead to greater fairness: everyone is judged according to the same rules, and bias is eliminated.

But as Cathy O'Neil reveals in this urgent and necessary book, the opposite is true.

I've been seeing lots of resonances of this concern elsewhere in popular culture, for example this recent SMBC, which focuses on the deflection of responsibility:

## Emotion detection

Taylor Telford, "'Emotion detection' AI is a $20 billion industry. New research says it can't do what it claims", WaPo 7/31/2019:

In just a handful of years, the business of emotion detection — using artificial intelligence to identify how people are feeling — has moved beyond the stuff of science fiction to a $20 billion industry. Companies such as IBM and Microsoft tout software that can analyze facial expressions and match them to certain emotions, a would-be superpower that companies could use to tell how customers respond to a new product or how a job candidate is feeling during an interview. But a far-reaching review of emotion research finds that the science underlying these technologies is deeply flawed.

The problem? You can't reliably judge how someone feels from what their face is doing.

A group of scientists brought together by the Association for Psychological Science spent two years exploring this idea. After reviewing more than 1,000 studies, the five researchers concluded that the relationship between facial expression and emotion is nebulous, convoluted and far from universal.

## ERNIE's here — is OSCAR next?

In "Contextualized Muppet Embeddings" (2/13/2019) I noted the advent of ELMo ("Embeddings from Language Models") and BERT ("Bidirectional Encoder Representations from Transformers"), and predicted ERNiE, GRoVEr, KERMiT, …

I'm happy to say that the first of these predictions has come true:

"Baidu's ERNIE 2.0 Beats BERT and XLNet on NLP Benchmarks", Synced 7/30/2019
"Baidu unveils ERNIE 2.0 natural language framework in Chinese and English", VentureBeat 7/30/2019

Actually I'm late reporting this, since ERNIE 1.0 came out in March:

But I'm ashamed to say that the Open System for Classifying Ambiguous Reference (OSCAR) is still just an idea, though I did recruit a collaborator who agreed in principle to work with me on it.