Archive for Computational linguistics

What use electrolytic pickling?

Once you've written down your responses to the dozen audio clips in yesterday's perception experiment, you can check them against the truth, and also against the transcripts generated by Google's automatic captioning system, both given below.

Read the rest of this entry »

Comments (37)

Perception Experiment

Here are a dozen short audio clips from a lecture, stripped from YouTube, and re-encoded after editing as mp3 files. Despite being handicapped by this marginal sound quality, and even more by the lack of context, you will probably be able to transcribe them fairly well. Please do so, and retain your results for discussion tomorrow morning (where "tomorrow" = Wednesday 5/8/2013).

Read the rest of this entry »

Comments off

NPR: oyez.org finishes Supreme Court oral arguments project

"Once Under Wraps, Supreme Court Audio Trove Now Online", NPR All Things Considered 4/24/2013:

The court has been releasing audio during the same week as arguments only since 2010. Before that, audio from one term generally wasn't available until the beginning of the next term. But the court has been recording its arguments for nearly 60 years, at first only for the use of the justices and their law clerks, and eventually also for researchers at the National Archives, who could hear — but couldn't duplicate — the tapes. As a result, until the 1990s, few in the public had ever heard recordings of the justices at work.

But as of just a few weeks ago, all of the archived historical audio — which dates back to 1955 — has been digitized, and almost all of those cases can now be heard and explored at an online archive called the Oyez Project.

Read the rest of this entry »

Comments (8)

Anatomy of a spambot

We've often had occasion to wonder how spammy blog comments are linguistically constructed. (See, most recently, Mark Liberman's post, "Numerous upon the written content material," in which he refers to spam comments as "aleatoric sub-poetry.") Now, on Quartz, David Yanofsky and Zachary M. Seward expose how spam comments are engineered:

Comment spam follows a formula, which was made plain the other day when a spambot accidentally posted its entire template on the blog of programmer Scott Hanselman. With his permission, we’ve reproduced some of the spam comment recipes here and added colorful formatting to make it readable. The spambot constructs new, vaguely unique comments by selecting from each set of options. We hope you find it wonderful | terrific | brilliant | amazing | great | excellent | fantastic | outstanding | superb.
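The mechanics the quote describes — building a "vaguely unique" comment by picking one option from each pipe-delimited set — take only a few lines to implement. Here is a toy sketch in Python; the template string is a made-up example in the style of the leaked one, not Hanselman's actual spam recipe:

```python
import random
import re

def spin(template, rng=random):
    """Expand a spambot-style template by choosing one option
    from each {a | b | c} alternation."""
    pattern = re.compile(r"\{([^{}]*)\}")
    # Repeatedly replace the innermost alternation, so nested
    # option sets work too.
    while True:
        m = pattern.search(template)
        if m is None:
            return template
        options = [opt.strip() for opt in m.group(1).split("|")]
        template = template[:m.start()] + rng.choice(options) + template[m.end():]

# Hypothetical template in the style of the leaked one:
tmpl = "We hope you find it {wonderful | terrific | brilliant | amazing}."
print(spin(tmpl))
```

Each call yields one of the four possible sentences, which is exactly why the "aleatoric sub-poetry" varies slightly from blog to blog.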

Read the rest of this entry »

Comments (27)

Numerous upon the written content material

Another fragment of aleatoric sub-poetry, from the 5,036,601 spam comments that Akismet has caught since we installed it:

I image this might be numerous upon the written content material? nevertheless I nonetheless believe that it may be suitable for just about any type of topic material, because it could frequently be pleasant to resolve a warm and delightful face or possibly listen a voice whilst initial landing.

Read the rest of this entry »

Comments (12)

Depopularization in the limit

George Orwell, in his hugely overrated essay "Politics and the English Language", famously insists you should "Never use a metaphor, simile, or other figure of speech which you are used to seeing in print." He thinks modern writing "consists in gumming together long strips of words which have already been set in order by someone else" (only he doesn't mean "long") — joining together "ready-made phrases" instead of thinking out what to say. His hope is that one can occasionally, "if one jeers loudly enough, send some worn-out and useless phrase … into the dustbin, where it belongs." That is, one can eliminate some popular phrase from the language by mocking it out of existence. In effect, he wants us to collaborate in getting rid of the most widely-used phrases in the language. In a Lingua Franca post published today I called his program elimination of the fittest (tongue in cheek, of course: the proposal is actually just to depopularize the most popular).

For a while, after I began thinking about this, I wondered what would be the ultimate fate of a language in which this policy was consistently and iteratively implemented. I even spoke to a distinguished theoretical computer scientist about how one might represent the problem mathematically. But eventually I realized it was really quite simple; at least in a simplified ideal case, I knew what would happen, and I could do the proof myself.
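One simple-minded formalization (not necessarily the model in the Lingua Franca post): treat the language as a finite table of phrases and frequencies, and apply the policy literally, eliminating whichever phrase is currently most popular, over and over. The phrase inventory here is an invented example:

```python
def depopularize(freqs):
    """Iteratively remove the most popular phrase from a toy
    'language' (a dict of phrase -> frequency), returning the
    order of elimination. Applied consistently, the policy makes
    every phrase the most popular one left at some point, so the
    inventory eventually empties out."""
    eliminated = []
    freqs = dict(freqs)
    while freqs:
        top = max(freqs, key=freqs.get)
        eliminated.append(top)
        del freqs[top]
    return eliminated

lang = {"at the end of the day": 90, "in terms of": 70, "going forward": 40}
print(depopularize(lang))  # most popular first, until nothing is left
```

In this simplified ideal case — no new phrases entering, frequencies fixed — the limit is clear: jeering the most popular phrase into the dustbin at each step leaves no phrases at all.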

Read the rest of this entry »

Comments off

Androids in Amazonia: recording an endangered language

[Photo: Augustine Tembé, recording a story using a smartphone]

The village of Akazu’yw lies in the rainforest, a day’s drive from the state capital of Belém, deep in the Brazilian Amazon. Last week I traveled there, carrying a dozen Android phones with a specialized app for recording speech. It wasn't all plain sailing…

Read the full story here.

Comments (5)

Android app for oral language documentation

Steven Bird, "Cyberlinguistics: recording the world's vanishing voices", 3/11/2013:

Of the 7,000 languages spoken on the planet, Tembé is at the small end with just 150 speakers left. In a few days, I will head into the Brazilian Amazon to record Tembé – via specially-designed technology – for posterity. Welcome to the world of cyberlinguistics.

Our new Android app Aikuma is still in the prototype stage. But it will dramatically speed up the process of collecting and preserving oral literature from endangered languages, if last year’s field trip to Papua New Guinea is anything to go by.

Read the whole thing.

Read the rest of this entry »

Comments (8)

PP attachment is hard

Alex Williams, "Creating Hipsturbia", NYT 2/15/2013:

“When we checked towns out,” Ms. Miziolek recalled, “I saw some moms out in Hastings with their kids with tattoos. A little glimmer of Williamsburg!”

Read the rest of this entry »

Comments (6)

Word String frequency distributions

Several people have asked me about Alexander M. Petersen et al., "Languages cool as they expand: Allometric scaling and the decreasing need for new words", Nature Scientific Reports 12/10/2012. The abstract (emphasis added):

We analyze the occurrence frequencies of over 15 million words recorded in millions of books published during the past two centuries in seven different languages. For all languages and chronological subsets of the data we confirm that two scaling regimes characterize the word frequency distributions, with only the more common words obeying the classic Zipf law. Using corpora of unprecedented size, we test the allometric scaling relation between the corpus size and the vocabulary size of growing languages to demonstrate a decreasing marginal need for new words, a feature that is likely related to the underlying correlations between words. We calculate the annual growth fluctuations of word use which has a decreasing trend as the corpus size increases, indicating a slowdown in linguistic evolution following language expansion. This “cooling pattern” forms the basis of a third statistical regularity, which unlike the Zipf and the Heaps law, is dynamical in nature.

The paper is thought-provoking, and the conclusions definitely merit further exploration. But I feel that the paper as published is guilty of false advertising. As the emphasized material in the abstract indicates, the paper claims to be about the frequency distributions of words in the vocabulary of English and other natural languages. In fact, I'm afraid, it's actually about the frequency distributions of strings in Google's 2009 OCR of printed books — and this, alas, is not the same thing at all.

It's possible that the paper's conclusions also hold for the distributions of words in English and other languages, but it's far from clear that this is true. At a minimum, the paper's quantitative results clearly will not hold for anything that a linguist, lexicographer, or psychologist would want to call "words". Whether the qualitative results hold or not remains to be seen.
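For readers who want the two regularities at stake made concrete: Zipf's law concerns the rank-frequency list (frequency falls off roughly as a power of rank), while the Heaps-style "allometric" relation concerns how vocabulary size grows with corpus size. Here is a toy sketch (not the paper's method, and on a deliberately tiny "corpus") of how each quantity is computed:

```python
from collections import Counter

def zipf_and_heaps(tokens):
    """Compute the two quantities the abstract refers to:
    the rank-frequency list (Zipf) and the vocabulary-growth
    curve (Heaps: distinct types seen after each token)."""
    counts = Counter(tokens)
    # Zipf: frequencies sorted in decreasing order; rank 1 = most common.
    zipf = sorted(counts.values(), reverse=True)
    # Heaps: number of distinct types seen after each successive token.
    seen, heaps = set(), []
    for tok in tokens:
        seen.add(tok)
        heaps.append(len(seen))
    return zipf, heaps

tokens = "the cat sat on the mat and the dog sat too".split()
zipf, heaps = zipf_and_heaps(tokens)
print(zipf)   # [3, 2, 1, 1, 1, 1, 1, 1] -- "the" three times, "sat" twice
print(heaps)  # [1, 2, 3, 4, 4, 5, 6, 6, 7, 7, 8]
```

Note that both computations presuppose a decision about what counts as a "word" when tokenizing — which is precisely where, on Mark's account, OCR'd strings and words part company.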

Read the rest of this entry »

Comments (13)

Speech and silence

I recently became interested in patterns of speech and silence. People divide their discourse into phrases for many reasons: syntax, meaning, rhetoric; thinking about what to say next; running out of breath. But for current purposes, we're ignoring the content of what's said, and we're also ignoring the process of saying it. We're even ignoring the language being spoken. All we're looking at is the partition of the stream of talk into speech segments and silence segments.
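A partition of this kind can be computed crudely with a per-frame energy threshold. The sketch below is a toy stand-in for real voice-activity detection, not necessarily the method used in this work; the frame length and threshold are arbitrary illustrative values:

```python
def speech_silence_segments(samples, frame_len=400, threshold=0.01):
    """Partition an audio signal into speech/silence runs by
    thresholding average energy in fixed-length frames, then
    merging consecutive frames with the same label."""
    labels = []
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        energy = sum(x * x for x in frame) / len(frame)
        labels.append("speech" if energy > threshold else "silence")
    # Merge adjacent same-label frames into (label, start, end) segments.
    segments = []
    for j, lab in enumerate(labels):
        if segments and segments[-1][0] == lab:
            segments[-1] = (lab, segments[-1][1], j + 1)
        else:
            segments.append((lab, j, j + 1))
    return [(lab, start * frame_len, end * frame_len)
            for lab, start, end in segments]

# A stretch of silence followed by a loud square-wave-ish burst:
sig = [0.0] * 400 + [0.5, -0.5] * 200
print(speech_silence_segments(sig))  # [('silence', 0, 400), ('speech', 400, 800)]
```

The output is exactly the kind of object described above: a sequence of speech and silence segments with their boundaries in samples, stripped of everything about what was said or how.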

Why?

Read the rest of this entry »

Comments (10)

Dramatic reading of ASR voicemail transcription

Following up on the recent post about ASR error rates, here's Mary Robinette Kowal doing a dramatic reading of the Google Voice transcript of three phone calls (voicemail messages?) from John Scalzi:

Read the rest of this entry »

Comments (17)

High-entropy speech recognition, automatic and otherwise

Regular readers of LL know that I've always been a partisan of automatic speech recognition technology, defending it against unfair attacks on its performance, as in the case of "ASR Elevator" (11/14/2010). But Chin-Hui Lee recently showed me the results of an interesting little experiment that he did with his student I-Fan Chen, which suggests a fair (or at least plausible) critique of the currently-dominant ASR paradigm. His interpretation, as I understand it, is that ASR technology has taken a wrong turn, or more precisely, has failed to explore adequately some important paths that it bypassed on the way to its current success.

Read the rest of this entry »

Comments (23)