Archive for Computational linguistics

Standardized Project Gutenberg Corpus

Martin Gerlach and Francesc Font-Clos, "A standardized Project Gutenberg corpus for statistical analysis of natural language and quantitative linguistics", arXiv 12/19/2018:

The use of Project Gutenberg (PG) as a text corpus has been extremely popular in statistical analysis of language for more than 25 years. However, in contrast to other major linguistic datasets of similar importance, no consensual full version of PG exists to date. In fact, most PG studies so far either consider only a small number of manually selected books, leading to potential biased subsets, or employ vastly different pre-processing strategies (often specified in insufficient details), raising concerns regarding the reproducibility of published results. In order to address these shortcomings, here we present the Standardized Project Gutenberg Corpus (SPGC), an open science approach to a curated version of the complete PG data containing more than 50,000 books and more than 3×10⁹ word-tokens. Using different sources of annotated metadata, we not only provide a broad characterization of the content of PG, but also show different examples highlighting the potential of SPGC for investigating language variability across time, subjects, and authors. We publish our methodology in detail, the code to download and process the data, as well as the obtained corpus itself on 3 different levels of granularity (raw text, timeseries of word tokens, and counts of words). In this way, we provide a reproducible, pre-processed, full-size version of Project Gutenberg as a new scientific resource for corpus linguistics, natural language processing, and information retrieval.
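
To make the three levels of granularity concrete (raw text, a sequence of word tokens, and word counts), here's a minimal Python sketch. It is not the SPGC pipeline: the `tokenize` function, its regular expression, and the sample sentence are assumptions made purely for illustration, and the real corpus code may well tokenize differently.

```python
import re
from collections import Counter


def tokenize(raw_text: str) -> list[str]:
    """Level 2: lowercase the text and keep runs of letters,
    allowing apostrophes inside a word (e.g. "it's").
    This tokenization rule is an assumption, not SPGC's."""
    return re.findall(r"[a-z]+(?:'[a-z]+)*", raw_text.lower())


def count_words(tokens: list[str]) -> Counter:
    """Level 3: collapse the token sequence into word counts."""
    return Counter(tokens)


# Level 1: raw text (a made-up sample, not taken from the corpus).
raw_text = "The whale, the WHALE! It's the white whale."

tokens = tokenize(raw_text)
counts = count_words(tokens)

print(tokens)
print(counts.most_common(2))
```

Each level discards information the previous one kept: the token sequence drops punctuation and case, and the counts drop word order.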

Long ago, in a narratology far away…

Louisa Shepard, "'May the force be with you' and other fan fiction favorites", Penn Today 12/18/2019:

Starting with Star Wars, Penn researchers create a unique digital humanities tool to analyze the most popular phrases and character connections in fan fiction. […]

The Penn team started with the script of "Star Wars: The Force Awakens" and created algorithms to analyze the words in the script against those in millions of fan fiction stories. The unique program identifies the most popular phrases, characters, scenes, and connections that are repurposed by these writers and then displays them in a simple graph format.

The results are now available on their "fan engagement meter" at https://fanengagement.org.
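
The article doesn't say how the Penn team's algorithms work. As a rough illustration of the general idea of matching phrases from a script against a collection of stories, here's a minimal sketch that assumes plain word n-gram overlap. The function names `ngrams` and `phrase_reuse`, the choice of three-word phrases, and the toy stories are all hypothetical.

```python
import re
from collections import Counter


def ngrams(text: str, n: int) -> list[tuple[str, ...]]:
    """Split text into lowercase words and return all n-word sequences."""
    words = re.findall(r"[a-z']+", text.lower())
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def phrase_reuse(script: str, stories: list[str], n: int = 3) -> Counter:
    """Count how many stories reuse each n-word phrase of the script."""
    script_phrases = set(ngrams(script, n))
    reuse = Counter()
    for story in stories:
        # Using a set counts each phrase at most once per story.
        for phrase in set(ngrams(story, n)) & script_phrases:
            reuse[phrase] += 1
    return reuse


# Toy data, invented for this example.
script = "May the Force be with you."
stories = [
    "She whispered, may the Force be with you, and left.",
    "The Force be with us all, he said.",
]

reuse = phrase_reuse(script, stories, n=3)
for phrase, story_count in reuse.most_common():
    print(" ".join(phrase), story_count)
```

A real system working over millions of stories would need something more scalable and more tolerant of paraphrase than exact n-gram matching, but the counting step would look broadly similar.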

Serendipitously, today's xkcd:

Mrs. Transformer-XL Tittlemouse

This is another note on the amazing ability of modern AI learning techniques to imitate some aspects of natural-language patterning almost perfectly, while managing to miss common sense almost entirely. This probably tells us something about modern AI and also about language, though we probably won't understand what it's telling us until many years in the future.

Today's example comes from Zihang Dai et al., "Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context", arXiv 6/2/2019.

Canoe schemata nama gary anaconda

Following up on recent posts suggesting that speech-to-text is not yet a solved problem ("Shelties On Alki Story Forest", "The right boot of the warner of the baron", "AI is brittle"), here's a YouTube link to a lecture given in July of 2018 by Michael Picheny, "Speech Recognition: What's Left?" The whole thing is worth following, but I particularly draw your attention to the section starting around 50:06, where he reviews the state of human and machine performance with respect to "noise, speaking style, accent, domain robustness, and language learning capabilities", with the goal to "make the case that we have a long way to go in [automatic] speech recognition".

AI is brittle

Following up "Shelties On Alki Story Forest" (11/26/2019) and "The right boot of the warner of the baron" (12/6/2019), here's some recent testimony from engineers at Google about the brittleness of contemporary speech-to-text systems: Arun Narayanan et al., "Recognizing Long-Form Speech Using Streaming End-To-End Models", arXiv 10/24/2019.

The goal of that paper is to document some methods for making things better. But I want to underline the fact that considerable headroom remains, even with the massive amounts of training material and computational resources available to a company like Google.

Modern AI (almost) works because of machine learning techniques that find patterns in training data, rather than relying on human programming of explicit rules. A weakness of this approach has always been that generalization to material different in any way from the training set can be unpredictably poor. (Though of course rule- or constraint-based approaches to AI generally never even got off the ground at all.) "End-to-end" techniques, which eliminate human-defined layers like words, so that speech-to-text systems learn to map directly between sound waveforms and letter strings, are especially brittle.

The right boot of the warner of the baron

Here at the UNESCO LT4All conference, I've noticed that many participants assert or imply that the problems of human language technology have been solved for a few major languages, especially English, so that the problem on the table is how to extend that success to thousands of other languages and varieties.

This is not totally wrong — HLT is a practical reality in many applications, and is being rapidly spread to others. And the problem of digitally underserved speech communities is real and acute.

But it's important to understand that the problems are not all solved, even for English, and that the remaining issues also represent barriers for extensions of the technology to other communities, in that the existing approximate solutions are far too hungry for data and far too short on practical understanding and common sense.

Command your kitchen

…or at least the faucets in it, using Delta's VoiceIQ Technology.

Delta VoiceIQ Technology pairs with your connected home device to give you exactly the amount of water you need with features like metered dispensing and custom container commands.

I have to say that being able to tell my kitchen faucet to dispense 137 milliliters of hot water, or whatever, is not high on my list of desires. I'm happy enough with good old-fashioned indoor plumbing, reliable supplies of potable water, and filters to take care of residual issues. But apparently the market-research folks at Delta think that the faucet-buying public is more forward-looking than I am.

Shelties On Alki Story Forest

Last week I gave a talk at an Alzheimer's Association workshop on "Digital Biomarkers". Overall I told a hopeful story, about the prospects for a future in which a few minutes of interaction each month, with an app on a smartphone or tablet, will give effective longitudinal tracking of neurocognitive health.

But I emphasized the fact that we're not there yet, and that some serious research and development problems stand in the way. In particular, the current state of the art in speech recognition is not yet good enough for reliable automated evaluation of spoken responses.

A diarization corpus from Amazon

About a month ago, Zaid Ahmed and others in Amazon's speech research group released DiPCo ("Dinner Party Corpus"), "a new data set that will help speech scientists address the difficult problem of separating speech signals in reverberant rooms with multiple speakers".

The past decade has seen striking progress in Human Language Technology, brought about by new methods, more training data, and (especially) cheaper/faster computers. But this rapid progress highlights the fact that "All problems are not solved", as I wrote last year — and in particular, the central problem of "diarization", or determining who spoke when, has turned out to be a surprisingly difficult one. And diarization is not just hard for conversations at dinner parties.

Kabbalist NLP

Oscar Schwartz, "Natural Language Processing Dates Back to Kabbalist Mystics", IEEE Spectrum 10/28/2019 ("Long before NLP became a hot field in AI, people devised rules and machines to manipulate language"):

The story begins in medieval Spain. In the late 1200s, a Jewish mystic by the name of Abraham Abulafia sat down at a table in his small house in Barcelona, picked up a quill, dipped it in ink, and began combining the letters of the Hebrew alphabet in strange and seemingly random ways. Aleph with Bet, Bet with Gimmel, Gimmel with Aleph and Bet, and so on.

Abulafia called this practice "the science of the combination of letters." He wasn't actually combining letters at random; instead he was carefully following a secret set of rules that he had devised while studying an ancient Kabbalistic text called the Sefer Yetsirah. This book describes how God created "all that is formed and all that is spoken" by combining Hebrew letters according to sacred formulas. In one section, God exhausts all possible two-letter combinations of the 22 Hebrew letters.

By studying the Sefer Yetsirah, Abulafia gained the insight that linguistic symbols can be manipulated with formal rules in order to create new, interesting, insightful sentences. To this end, he spent months generating thousands of combinations of the 22 letters of the Hebrew alphabet and eventually emerged with a series of books that he claimed were endowed with prophetic wisdom.
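
The phrase "all possible two-letter combinations of the 22 Hebrew letters" can be read in more than one way, and the excerpt doesn't say which is meant. As a small worked example, the sketch below counts the pairs under three readings. The variable names are hypothetical, and building the alphabet by filtering out the five final letter forms from the Unicode Hebrew block is an implementation choice made for this illustration.

```python
import unicodedata
from itertools import combinations, permutations, product

# The Unicode range U+05D0..U+05EA holds 27 characters: the 22 letters
# plus 5 word-final forms, whose names contain "FINAL". Drop the finals.
letters = [
    chr(code)
    for code in range(0x05D0, 0x05EB)
    if "FINAL" not in unicodedata.name(chr(code))
]

# Reading 1: unordered pairs of two different letters.
unordered_pairs = list(combinations(letters, 2))

# Reading 2: ordered pairs of two different letters (AB and BA differ).
ordered_pairs = list(permutations(letters, 2))

# Reading 3: ordered pairs where a letter may repeat (AA allowed).
pairs_with_repeats = list(product(letters, repeat=2))

print(len(letters))             # 22
print(len(unordered_pairs))     # 22 * 21 / 2 = 231
print(len(ordered_pairs))       # 22 * 21 = 462
print(len(pairs_with_repeats))  # 22 * 22 = 484
```

Under any of these readings the inventory is small enough to write out by hand, which fits the description of exhausting the combinations.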

Lombroso and Lavater, reborn as fake AI

Drew Harwell, "A face-scanning algorithm increasingly decides whether you deserve the job", WaPo 10/22/2019:

An artificial intelligence hiring system has become a powerful gatekeeper for some of America's most prominent employers, reshaping how companies assess their workforce — and how prospective employees prove their worth.

Designed by the recruiting-technology firm HireVue, the system uses candidates' computer or cellphone cameras to analyze their facial movements, word choice and speaking voice before ranking them against other applicants based on an automatically generated "employability" score.

HireVue's "AI-driven assessments" have become so pervasive in some industries, including hospitality and finance, that universities make special efforts to train students on how to look and speak for best results. More than 100 employers now use the system, including Hilton, Unilever and Goldman Sachs, and more than a million job seekers have been analyzed.

"Protester dressed as Boris Johnson scales Big Ben"

Sometimes it's hard for us humans to see the intended meaning of an ambiguous phrase, like "Hospitals named after sandwiches kill five". But in other cases, the intended structure comes easily to us, and we have a hard time seeing the alternative, as in the case of "Extinction rebellion protester dressed as Boris Johnson scales Big Ben".

These two examples have essentially the same structure. There's a word that might be construed as a preposition linking a verb to a nominal argument ("named after sandwiches", "dressed as Boris Johnson"), or alternatively as a complementizer introducing a subordinate clause ("after sandwiches kill five", "as Boris Johnson scales Big Ben"). In the first example, the complementizer reading is the one the author intended, while in the second example, it's the preposition. But in both cases, most of us go for the preposition, presumably because "named after X" and "dressed as Y" are common constructions.

Danger: Demo!

John Seabrook, "The Next Word: Where will predictive text take us?", The New Yorker 10/14/2019:

At the end of every section in this article, you can read the text that an artificial intelligence predicted would come next.

I glanced down at my left thumb, still resting on the Tab key. What have I done? Had my computer become my co-writer? That's one small step forward for artificial intelligence, but was it also one step backward for my own?

The skin prickled on the back of my neck, an involuntary reaction to what roboticists call the "uncanny valley"—the space between flesh and blood and a too-human machine.
