Archive for Computational linguistics

UM UH 3

[Warning: More than usually wonkish and quantitative.]

In two recent and one older post, I've referred to apparent gender and age differences in the usage of the English filled pauses normally transcribed as "um" and "uh" ("More on UM and UH", 8/3/2014; "Fillers: Autism, gender, and age", 7/30/2014; "Young men talk like old women", 11/6/2005).  In the hope of answering some of the many open questions, I decided to make a closer comparison between the Switchboard dataset (collected in 1990-91) and the Fisher dataset (collected in 2003).

More on speech overlaps in meetings

This post follows up on Mark Dingemanse's guest post, "Some constructive-critical notes on the informal overlap study", which in turn comments on Kieran Snyder's guest post, "Men interrupt more than women".

As part of a project on the application of speech and language technology to meetings, almost 15 years ago, researchers at the International Computer Science Institute (ICSI) recorded, transcribed and analyzed a large number of their regular technical meetings. The results were published by the Linguistic Data Consortium as the ICSI Meeting speech and transcripts. As the publication's documentation explains:

75 meetings collected at the International Computer Science Institute in Berkeley during the years 2000-2002. The meetings included are "natural" meetings in the sense that they would have occurred anyway: they are generally regular weekly meetings of various ICSI working teams, including the team working on the ICSI Meeting Project. In recording meetings of this type, we hoped to capture meeting dynamics and speaking styles that are as natural as possible given that speakers are wearing close-talking microphones and are fully cognizant of the recording process. The speech files range in length from 17 to 103 minutes, but generally run just under an hour each.

There are a total of 53 unique speakers in the corpus. Meetings involved anywhere from three to 10 participants, averaging six. The corpus contains a significant proportion of non-native English speakers, varying in fluency from nearly-native to challenging-to-transcribe.

There's an extensive set of "dialogue act" annotations of this material, available from ICSI, and described in Elizabeth Shriberg et al., "The ICSI Meeting Recorder Dialog Act (MRDA) Corpus", HLT 2004.

The shape of a spoken phrase in Mandarin

A few years ago, with Jiahong Yuan and Chris Cieri, I took a look at variation in English word duration by phrasal position, using data from the Switchboard conversational-speech corpus ("The shape of a spoken phrase", LLOG 4/12/2006; Jiahong Yuan, Mark Liberman, and Chris Cieri, "Towards an Integrated Understanding of Speaking Rate in Conversation", InterSpeech 2006). As is often the case for simple-minded analysis of large speech datasets, this exercise showed a remarkably consistent pattern of variation — the plot below shows mean duration by position for phrases from 1 to 12 words long:

The Mandarin Broadcast News collection discussed in a recent post ("Consonant effects on F0 in Chinese", 6/12/2014) lends itself to a similar analysis of phrase-position effects on speech timing. So for this morning's Breakfast Experiment™, I ran a couple of scripts to take a first look.
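
The tabulation behind this kind of analysis (mean duration by position, computed separately for each phrase length) can be sketched roughly as follows. This is only an illustration, not the scripts actually used: the function name `mean_duration_by_position`, the input layout (one list of per-unit durations in seconds for each phrase), and the toy values are all assumptions.

```python
from collections import defaultdict

def mean_duration_by_position(phrases, max_len=12):
    """Map (phrase_length, position) to mean duration.

    `phrases` is assumed to be a list of phrases, each a list of
    per-unit durations in seconds (hypothetical input format).
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for phrase in phrases:
        n = len(phrase)
        if not 1 <= n <= max_len:
            continue  # skip empty phrases and phrases longer than max_len
        for pos, dur in enumerate(phrase, start=1):
            sums[(n, pos)] += dur
            counts[(n, pos)] += 1
    return {key: sums[key] / counts[key] for key in sums}

# Toy durations, invented for illustration only.
phrases = [[0.25, 0.5], [0.75, 0.5], [0.5]]
table = mean_duration_by_position(phrases)
for (length, pos), mean in sorted(table.items()):
    print(length, pos, mean)
```

Keying the table by both phrase length and position keeps phrases of different lengths apart, which is what lets each phrase length be drawn as its own curve.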

Unfair Turing Test handicaps

Today's PhD Comics:

As in the recently-celebrated case of an alleged 13-year-old Ukrainian, there are circumstances in which the humanity of correspondents may be somewhat obscured.

More deceptive statements about Voice Stress Analysis

Leonard Klie, "Momentum Builds for Voice Stress Analysis in Law Enforcement", Speech Technology Magazine, Summer 2014:

Nearly 1,800 U.S. law enforcement agencies have dropped the polygraph in favor of newer computer voice stress analyzer (CVSA) technology to detect when suspects being questioned are not being honest, according to a report from the National Association of Computer Voice Stress Analysts.

Among those that have already made the switch are police departments in Atlanta, Baltimore, San Francisco, New Orleans, Nashville, and Miami, FL, as well as the California Highway Patrol and many other state and local law enforcement agencies.

The technology is also gaining momentum overseas. "The CVSA has gained international acceptance, and our foreign sales are steadily growing," reports Jim Kane, executive director of the National Institute for Truth Verification Federal Services, a West Palm Beach, FL, company that has been producing CVSA systems since 1988.

Random letter-partition advantages in baby names

Commenting on "QWERTY again", 5/14/2014, Rubrick suggested that

It seems like an extremely simple way to check the validity of this theory would be to repeat the analysis, but with the letters grouped into two random subsets, rather than right-left subsets. In fact, I'd think the original authors should have done this as a control. If this new grouping yields a graph with any meaningful-looking trends whatsoever (or if multiple repeats of the analysis with different random subsets yield such trends a significant percentage of the time), it would pretty soundly deflate the idea that the original trends are the result of "right-hand favoritism".

Steve Kass followed up on this suggestion, providing five examples, and commenting that

The graphs don't all look the same, but they all look interesting, and several of them practically beckon the storyteller. There's something interesting about this general kind of data and "advantage function" analysis worth discovering, I think.
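
A minimal sketch of the control Rubrick describes might look like the code below. It makes several assumptions that are not taken from the posts themselves: the "advantage" of a name is taken to be the number of its letters inside a chosen subset minus the number outside it, the right-hand set is the conventional touch-typing split of the QWERTY layout, and the names are placeholders rather than real baby-name data. The helper names `advantage`, `random_subset`, and `mean_advantage` are hypothetical.

```python
import random
import string

# Assumed right-hand letters under a conventional QWERTY touch-typing split.
QWERTY_RIGHT = set("yuiophjklnm")

def advantage(word, subset):
    """Letters of `word` inside `subset` minus letters outside it."""
    letters = [ch for ch in word.lower() if ch in string.ascii_lowercase]
    inside = sum(ch in subset for ch in letters)
    return inside - (len(letters) - inside)

def random_subset(size, rng):
    """Draw a random set of `size` distinct letters as a control partition."""
    return set(rng.sample(string.ascii_lowercase, size))

def mean_advantage(words, subset):
    """Average advantage over a list of words."""
    return sum(advantage(w, subset) for w in words) / len(words)

names = ["Liam", "Olivia", "Noah", "Emma"]  # placeholder names, not real data
rng = random.Random(0)  # fixed seed so the control partitions are repeatable

print("QWERTY right-hand set:", mean_advantage(names, QWERTY_RIGHT))
for trial in range(5):
    subset = random_subset(len(QWERTY_RIGHT), rng)
    print("random partition", trial, mean_advantage(names, subset))
```

In a real check, the mean would be computed per year over the full name counts, and the right-left curve compared against the spread of curves from many random partitions of the same size.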

QWERTY again

Various readers have pointed out to me that the "QWERTY Effect" is back. (For coverage of the first QWERTY-Effect paper, see "The QWERTY Effect", 3/8/2012; "QWERTY: Failure to Replicate", 3/13/2012; "Casasanto and Jasmin on the QWERTY effect", 3/17/2012; and "Response to Jasmin and Casasanto's response to me", 3/17/2012.)

The new paper is Casasanto, D., Jasmin, K., Brookshire, G. & Gijssels, T. "The QWERTY Effect: How typing shapes word meanings and baby names". In P. Bello, M. Guarini, M. McShane, & B. Scassellati (Eds.), Proceedings of the 36th Annual Conference of the Cognitive Science Society. Austin, TX: Cognitive Science Society, 2014.

As before, the idea is that typing letters with the right hand makes us like them more; or in the words of their abstract,

Filtering words through our fingers as we type appears to be changing their meanings. On average, words typed with more letters from the right side of the QWERTY keyboard are more positive in meaning than words typed with more letters from the left: This is the QWERTY effect (Jasmin & Casasanto, 2012), which was shown previously across three languages. In five experiments, here we replicate the QWERTY effect in a large corpus of English words, extend it to two new languages (Portuguese and German), and show that the effect is mediated by space-valence associations encoded at the level of individual letters. Finally, we show that QWERTY appears to be influencing the names American parents give their children. Together, these experiments demonstrate the generality of the QWERTY effect, and inform our theories of how people’s bodily interactions with a cultural artifact can change the way they use language.

The most interesting new result is the baby-names experiment, in my opinion; and since I'm stuck in Heathrow Airport for a while, I thought I'd take a quick look at it.

Draft words

Reuben Fischer-Baum, Aaron Gordon, and Billy Haisley, "Which Words Are Used To Describe White And Black NFL Prospects?", Deadspin 5/8/2014

Do NFL scouts talk about white players and black players differently? Are certain words reserved for white players? Are others used primarily to describe black players?

Let's try and find out. We've pulled the text from pre-draft scouting reports from NFL.com (written by the infamous Nolan Nawrocki), CBS, and ESPN, split them by player race, counted the number of times individual words appeared using the Voyant tool, and then calculated the rate at which each word appeared per 10,000 words. (In total we pulled 68,465 words on 99 white players—6,228 unique—and 223,868 words on 288 black players—10,580 unique). You can play with the data in the interactive below; simply plug a single word into the input field, hit search, and see how often the word appeared in black and white scouting reports.
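
The normalization described in that passage (occurrences per 10,000 words, computed separately for each group of reports) can be sketched as below. This is an illustration only: the article used the Voyant tool, not this code, and the tokenizer, the function name `rates_per_10k`, and the two toy texts are assumptions.

```python
from collections import Counter
import re

def rates_per_10k(text):
    """Return {word: occurrences per 10,000 running words} for one text."""
    tokens = re.findall(r"[a-z']+", text.lower())  # crude tokenizer (assumption)
    total = len(tokens)
    if total == 0:
        return {}
    counts = Counter(tokens)
    return {word: 10_000 * count / total for word, count in counts.items()}

# Toy stand-ins for the two groups of scouting reports (not real data).
group_a = "quick quick smart tough"
group_b = "quick tough tough tough strong"

rates_a = rates_per_10k(group_a)
rates_b = rates_per_10k(group_b)
for word in sorted(set(rates_a) | set(rates_b)):
    print(word, rates_a.get(word, 0.0), rates_b.get(word, 0.0))
```

Dividing by each group's own total is what makes the two columns comparable when one group has several times as many words as the other, as is the case for the counts quoted above.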

Accessibility and diarization

I spent this morning at an ICASSP-2014 session on "Speaker Diarization". As the picture indicates, the room was not exactly handicapped accessible…

Luckily this is not a problem for me, but my experience of three torn knee ligaments a few years ago sticks with me.

Anyhow, I made it up the stairway to Room Scherma, and learned some useful and interesting things about current techniques for speaker diarization, which is the problem of determining who spoke when in an arbitrary audio or video recording.

Ten years ago in LLOG

From 3/28/2004, a post that asks a question for which I still don't have a good answer:

How many times does a word or phrase need to be repeated in order to seem characteristic of a speaker or author? I think that the answer is "not very many times, maybe only once or twice, if the use in context is salient enough".

Ruminations on related issues can be found in "Strange Bookfellows" and "Captain Crunch among the Literati".  And since this question tells us as much about the reader or listener as it does about the writer or speaker, we should also consider the curious case of the president's pronouns.

A zero-tolerance approach to PP attachment

Deborah Ball, "Pope Francis Appoints Eight to Sex-Abuse Commission", WSJ 3/22/2014:

Pope Francis on Saturday appointed a victim of sexual abuse and a senior cardinal known for his zero-tolerance approach to a new group charged with advising the Catholic Church on how to respond to the problem of sexual abuse of children.

The sequence "zero-tolerance approach to a new group" sent Tim Leonard down a syntactic garden path — he had to get past "charged with advising the Catholic Church" before he figured out that the cardinal was appointed to the new group rather than having a zero-tolerance approach to it. So Tim forwarded the example to me, and I had exactly the same experience.

Erdogan's phone conversations

Recep Tayyip Erdoğan has been the prime minister of Turkey for 11 years. On Monday, someone posted on YouTube what purports to be recordings of a series of phone conversations between Erdoğan and his son, discussing how to hide a billion dollars or so in cash: "Başçalan Erdoğan'ın Yalanlarının ve Yolsuzluklarının Kaydı" = "Recording of Erdogan's lying and corruption". Here's an acted version of an English translation, from "Full transcript of voice recording purportedly of Erdoğan and his son", Today's Zaman 2/26/2014:

Rates of exchange
