Archive for Computational linguistics

Sex, age, and pronouns on Facebook

Andy Schwartz and others at the World Well-Being Project have worked with "Facebook posts from over 75,000 volunteers who also took the standard International Personality Item Pool (IPIP) personality test to measure the 'Big Five' personality traits", looking for linguistic features that correlate with the aspects of personality measured by that test.

Lyle Ungar talked about this work a few days ago (Andy was unfortunately out of town), for an audience of mostly first-year undergraduates. The venue was a weekly event, Dinners With Interesting People, held in the Quad, an undergraduate residence here at Penn.

This year, the DWIP talks (though still open to the public) are integrated into a Freshman Seminar called "The Landscape of Research and Innovation at Penn". The idea is to give the participants a general idea of what kinds of research go on around here, and how they might get involved. As part of the course, I've asked DWIP guests to provide a dataset that we can use in a course assignment in quantitative analysis. Since the students have widely varied backgrounds in mathematics, statistics, and programming, and since quantitative analysis is only one of several aspects of the course, each assignment starts with an R script that does something interesting, and the assigned task is to modify the script to do something a bit different.

In this case, Andy was kind enough to give me a table indicating the number of posts and the token counts for each "word" in their Facebook dataset, for males and females of each age. Inspired by Jamie Pennebaker's The Secret Life of Pronouns, I decided to focus the quantitative analysis assignment on the issue of pronoun usage. The body of this post lays out some of the things that I've noticed in setting the assignment up.
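As a rough illustration of the kind of calculation such a table supports, here is a minimal Python sketch (the course assignment itself used R). It assumes the table can be read as rows of (word, sex, age, token count); that layout, the short pronoun list, and the toy rows are all assumptions made for illustration, not the actual Facebook data.

```python
from collections import defaultdict

# Assumed, deliberately short pronoun list for illustration only.
PRONOUNS = {"i", "you", "he", "she", "it", "we", "they"}

# Invented placeholder rows in an assumed (word, sex, age, count) layout.
toy_rows = [
    ("i", "F", 20, 30), ("the", "F", 20, 70),
    ("i", "M", 20, 20), ("the", "M", 20, 80),
]

def pronoun_rates(rows, pronouns=PRONOUNS):
    """Return pronoun tokens as a percentage of all tokens per (sex, age) group."""
    totals = defaultdict(int)  # all tokens in each group
    hits = defaultdict(int)    # pronoun tokens in each group
    for word, sex, age, count in rows:
        key = (sex, age)
        totals[key] += count
        if word.lower() in pronouns:
            hits[key] += count
    return {key: 100.0 * hits[key] / totals[key] for key in totals}

rates = pronoun_rates(toy_rows)
for (sex, age), pct in sorted(rates.items()):
    print(f"{sex} age {age}: {pct:.1f}% pronouns")
```

Normalizing by each group's total token count matters here, since the number of posts differs across sex and age groups.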

Read the rest of this entry »

Comments (2)

Predictive poetry

A few years ago, people noticed that the predictive typing on Android smartphones could construct interesting phrases all on its own: "Your typical sentence", 6/13/2012. iOS 8 has caught up: Geoffrey Fowler and Joanna Stern, "iOS 8 Keyboard Makes Hilarious 'Mad Libs' For You", WSJ 9/17/2014:

Now the latest version of Apple’s iPhone software, iOS 8, adds a layer of smarts on top of autocorrect called QuickType, predictive typing of a sort previously found on Android. Not only does it suggest spelling, it also suggests words you might want to type next. If you keep following its train of robotic thought, QuickType will form entire sentences on your behalf.

The result is so goofy that it is brilliant. For the last week, we—your WSJ personal technology columnists—have been conducting serious tests of the new iPhones and iOS 8, while also holding nonsensical auto-generated conversations with each other.

Read the rest of this entry »

Comments (3)

Text analytics applied to applications of things like text analytics

South by Southwest (SXSW) uses a web-based voting method to choose panels, and so Jason Baldridge took a look at the titles submitted for Phil Resnik's "Putting a Real-Time Face on Polling" session, to

… see whether some straight-forward Unix commands, text analytics and natural language processing can reveal anything interesting about them.

He describes the results in "Titillating Titles: Uncoding SXSW Proposal Titles with Text Analytics", 9/2/2014.

Comments (1)

Nth Xest

In the course of writing about the "fourth highest of five levels", I looked around at how the pattern "Nth Xest" is used in general. I found that uses of such expressions overwhelmingly count from the "top" where X names a top-oriented scale (high, big, long, etc.), and count from the "bottom" where X names a bottom-oriented scale (low, small, short, etc.). In other words, unsurprisingly, "Nth Xest" normally counts (up or down) from whatever end of the scale "Xest" names.

Another (less logically necessary but still unsurprising) thing I noticed is that top-oriented counts are always a lot bigger than corresponding bottom-oriented counts, and that counts decrease almost-proportionately as N increases. Thus from Google Books ngrams:

          second   third   fourth   fifth   sixth
highest    34447    9692     3148    1411     784
lowest      6006    1455      491     293     138
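The two observations can be checked directly from the counts in the table. The short Python sketch below computes the ratio of "highest" to "lowest" counts for each ordinal, and the ratio of each count to the one before it; the variable and function names are just illustrative choices.

```python
# Google Books ngram counts copied from the table above.
ordinals = ["second", "third", "fourth", "fifth", "sixth"]
counts = {
    "highest": dict(zip(ordinals, [34447, 9692, 3148, 1411, 784])),
    "lowest": dict(zip(ordinals, [6006, 1455, 491, 293, 138])),
}

# How much more frequent is the top-oriented form at each ordinal?
top_to_bottom = {n: counts["highest"][n] / counts["lowest"][n] for n in ordinals}

def step_ratios(series):
    """Ratio of each count to the preceding ordinal's count."""
    return [series[b] / series[a] for a, b in zip(ordinals, ordinals[1:])]

for n in ordinals:
    print(f"{n:>6}: highest/lowest = {top_to_bottom[n]:.2f}")
for scale in ("highest", "lowest"):
    steps = ", ".join(f"{r:.2f}" for r in step_ratios(counts[scale]))
    print(f"{scale}: successive ratios = {steps}")
```

If the counts fell off exactly proportionately, the highest/lowest ratio would be the same at every ordinal, so the spread of those ratios is a quick measure of how close "almost" is.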

Read the rest of this entry »

Comments (1)

Sex and pronouns

Andy Schwartz recently gave me a copy of word counts by sex and age for the Facebook posts from the PPC's World Well-Being Project. So I thought I'd compare some of the Facebook counts to data from the LDC's archive of conversational speech transcripts. As a start, here's a comparison of rates of pronoun usage in the PPC Facebook sample and in the transcripts of the LDC's Fisher English datasets (combining Part 1 and Part 2).

Read the rest of this entry »

Comments (1)

Geoffrey Leech, 1936-2014

Geoffrey Leech, one of the giants of corpus-based computational linguistics, passed away yesterday. With the death of Chuck Fillmore in February, the field has lost two of its pillars this year.

Read the rest of this entry »

Comments (4)

Lorem China

Brian Krebs, "Lorem Ipsum: Of Good & Evil, Google & China", Krebs on Security 8/14/2014:

Imagine discovering a secret language spoken only online by a knowledgeable and learned few. Over a period of weeks, as you begin to tease out the meaning of this curious tongue and ponder its purpose, the language appears to shift in subtle but fantastic ways, remaking itself daily before your eyes. And just when you are poised to share your findings with the rest of the world, the entire thing vanishes.

Read the rest of this entry »

Comments (31)

Wrecking a nice beach

Under the subject line "Things you never thought you'd get to say", Bob Ladd sent me this note yesterday:

You are among the few people I know who will appreciate this anecdote:  

It's been unusually cool, wet, and windy in many parts of the Mediterranean this summer, including our part of Sardinia.  On our last full day there last week, our local beach was still unpleasantly rough and windy, so we decided to go to a place called La Licciola about 10 miles away, on the other side of the headland and therefore protected from the wind.  The last time we went there a couple of years ago, the final access was a long downhill stretch of dirt road with what amounted to a field to park in at the bottom.  It was fairly chaotic in a typically Italian way, with people managing to park along the edges of the dirt road when the field got full, but with everyone always leaving just enough room to get through.  Anyway, the other day we got to the top of the downhill road to discover that it has been properly paved, with an actual sidewalk along one side and no-parking signs on the other (though everyone was parking there anyway).  The parking field has been improved with clearly delineated spaces and there was a chain across the entrance because it was already full.  People were having a hard time turning around because the sidewalk has narrowed the driveable part of the downhill road, and new people kept coming in at the top of the hill looking for a space to park, creating more chaos.  We decided to give up and go somewhere else, but it took us the better part of 15 minutes to extract ourselves from the mess. It was only on the way back out to the main road that it occurred to me that, in trying to improve things, they had managed to, well, wreck a nice beach.  

It was my misfortune to be sharing the car with someone who wouldn't have understood why I was giggling. 

Read the rest of this entry »

Comments (17)

UM UH 3

[Warning: More than usually wonkish and quantitative.]

In two recent and one older post, I've referred to apparent gender and age differences in the usage of the English filled pauses normally transcribed as "um" and "uh" ("More on UM and UH", 8/3/2014; "Fillers: Autism, gender, and age", 7/30/2014; "Young men talk like old women", 11/6/2005).  In the hope of answering some of the many open questions, I decided to make a closer comparison between the Switchboard dataset (collected in 1990-91) and the Fisher dataset (collected in 2003).

Read the rest of this entry »

Comments (1)

More on speech overlaps in meetings

This post follows up on Mark Dingemanse's guest post, "Some constructive-critical notes on the informal overlap study", which in turn comments on Kieran Snyder's guest post, "Men interrupt more than women".

As part of a project on the application of speech and language technology to meetings, almost 15 years ago, researchers at the International Computer Science Institute (ICSI) recorded, transcribed and analyzed a large number of their regular technical meetings. The results were published by the Linguistic Data Consortium as the ICSI Meeting speech and transcripts. As the publication's documentation explains:

75 meetings collected at the International Computer Science Institute in Berkeley during the years 2000-2002. The meetings included are "natural" meetings in the sense that they would have occurred anyway: they are generally regular weekly meetings of various ICSI working teams, including the team working on the ICSI Meeting Project. In recording meetings of this type, we hoped to capture meeting dynamics and speaking styles that are as natural as possible given that speakers are wearing close-talking microphones and are fully cognizant of the recording process. The speech files range in length from 17 to 103 minutes, but generally run just under an hour each.

There are a total of 53 unique speakers in the corpus. Meetings involved anywhere from three to 10 participants, averaging six. The corpus contains a significant proportion of non-native English speakers, varying in fluency from nearly-native to challenging-to-transcribe.

There's an extensive set of "dialogue act" annotations of this material, available from ICSI, and described in Elizabeth Shriberg et al., "The ICSI Meeting Recorder Dialog Act (MRDA) Corpus", HLT 2004.

Read the rest of this entry »

Comments (13)

The shape of a spoken phrase in Mandarin

A few years ago, with Jiahong Yuan and Chris Cieri, I took a look at variation in English word duration by phrasal position, using data from the Switchboard conversational-speech corpus ("The shape of a spoken phrase", LLOG 4/12/2006; Jiahong Yuan, Mark Liberman, and Chris Cieri, "Towards an Integrated Understanding of Speaking Rate in Conversation", InterSpeech 2006). As is often the case for simple-minded analysis of large speech datasets, this exercise showed a remarkably consistent pattern of variation — the plot below shows mean duration by position for phrases from 1 to 12 words long:

The Mandarin Broadcast News collection discussed in a recent post ("Consonant effects on F0 in Chinese", 6/12/2014) lends itself to a similar analysis of phrase-position effects on speech timing. So for this morning's Breakfast Experiment™, I ran a couple of scripts to take a first look.
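For readers who want to try this kind of by-position averaging themselves, here is a minimal Python sketch. It is not the script used for these posts: it assumes each phrase is already available as a list of per-word durations in seconds, and the toy phrases are invented placeholders rather than Switchboard or Broadcast News data.

```python
from collections import defaultdict

def mean_duration_by_position(phrases, max_len=12):
    """Average word duration for each (phrase length, position) pair.

    `phrases` is an iterable of lists of word durations; positions are
    1-based. Phrases longer than `max_len` words (or empty) are skipped.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for durations in phrases:
        n = len(durations)
        if not 1 <= n <= max_len:
            continue
        for pos, dur in enumerate(durations, start=1):
            sums[(n, pos)] += dur
            counts[(n, pos)] += 1
    return {key: sums[key] / counts[key] for key in sums}

# Invented toy phrases: per-word durations in seconds.
toy_phrases = [[0.2, 0.4], [0.4, 0.6], [0.5]]
means = mean_duration_by_position(toy_phrases)
for (length, pos), mean in sorted(means.items()):
    print(f"length {length}, position {pos}: {mean:.2f} s")
```

Keeping phrase length as part of the key gives one curve per phrase length, so position effects in short and long phrases can be compared.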

Read the rest of this entry »

Comments (3)

Unfair Turing Test handicaps

Today's PhD Comics:

As in the recently-celebrated case of an alleged 13-year-old Ukrainian, there are circumstances in which the humanity of correspondents may be somewhat obscured.

Read the rest of this entry »

Comments (15)

More deceptive statements about Voice Stress Analysis

Leonard Klie, "Momentum Builds for Voice Stress Analysis in Law Enforcement", Speech Technology Magazine, Summer 2014:

Nearly 1,800 U.S. law enforcement agencies have dropped the polygraph in favor of newer computer voice stress analyzer (CVSA) technology to detect when suspects being questioned are not being honest, according to a report from the National Association of Computer Voice Stress Analysts.

Among those that have already made the switch are police departments in Atlanta, Baltimore, San Francisco, New Orleans, Nashville, and Miami, FL, as well as the California Highway Patrol and many other state and local law enforcement agencies.

The technology is also gaining momentum overseas. "The CVSA has gained international acceptance, and our foreign sales are steadily growing," reports Jim Kane, executive director of the National Institute for Truth Verification Federal Services, a West Palm Beach, FL, company that has been producing CVSA systems since 1988.

Read the rest of this entry »

Comments (14)

Random letter-partition advantages in baby names

Commenting on "QWERTY again", 5/14/2014, Rubrick suggested that

It seems like an extremely simple way to check the validity of this theory would be to repeat the analysis, but with the letters grouped into two random subsets, rather than right-left subsets. In fact, I'd think the original authors should have done this as a control. If this new grouping yields a graph with any meaningful-looking trends whatsoever (or if multiple repeats of the analysis with different random subsets yield such trends a significant percentage of the time), it would pretty soundly deflate the idea that the original trends are the result of "right-hand favoritism".

Steve Kass followed up on this suggestion, providing five examples, and commenting that

The graphs don't all look the same, but they all look interesting, and several of them practically beckon the storyteller. There's something interesting about this general kind of data and "advantage function" analysis worth discovering, I think.
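Rubrick's proposed control is easy to sketch in Python. The version below assumes that a name's "advantage" is the number of its letters in one group minus the number in the other, averaged over names weighted by how many babies received each name. That definition, the function names, and the toy name counts are assumptions for illustration, not the procedure from the paper or from Steve Kass's examples.

```python
import random
import string

def random_partition(rng):
    """Split the 26 letters into two random groups of 13."""
    letters = list(string.ascii_lowercase)
    rng.shuffle(letters)
    return set(letters[:13]), set(letters[13:])

def advantage(name, group_a, group_b):
    """Letters of `name` in group A minus letters in group B."""
    name = name.lower()
    return sum(c in group_a for c in name) - sum(c in group_b for c in name)

def mean_advantage(name_counts, group_a, group_b):
    """Count-weighted mean advantage over a {name: count} mapping."""
    total = sum(name_counts.values())
    weighted = sum(advantage(n, group_a, group_b) * k for n, k in name_counts.items())
    return weighted / total

# Invented toy counts per year; real use would load actual baby-name counts.
toy_years = {
    1960: {"Mary": 50, "John": 60},
    2010: {"Liam": 40, "Olivia": 45},
}

rng = random.Random(0)  # fixed seed so the partitions are repeatable
for trial in range(5):
    group_a, group_b = random_partition(rng)
    trend = {year: mean_advantage(names, group_a, group_b)
             for year, names in toy_years.items()}
    print(trial, {year: round(value, 2) for year, value in trend.items()})
```

Repeating this over many random partitions gives a reference distribution of trends against which the right-left split can be compared.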

Read the rest of this entry »

Comments (22)

QWERTY again

Various readers have pointed out to me that the "QWERTY Effect" is back. (For coverage of the first QWERTY-Effect paper, see "The QWERTY Effect", 3/8/2012; "QWERTY: Failure to Replicate", 3/13/2012; "Casasanto and Jasmin on the QWERTY effect", 3/17/2012; and "Response to Jasmin and Casasanto's response to me", 3/17/2012.)

The new paper is Casasanto, D., Jasmin, K., Brookshire, G. & Gijssels, T. "The QWERTY Effect: How typing shapes word meanings and baby names". In P. Bello, M. Guarini, M. McShane, & B. Scassellati (Eds.), Proceedings of the 36th Annual Conference of the Cognitive Science Society. Austin, TX: Cognitive Science Society, 2014.

As before, the idea is that typing letters with the right hand makes us like them more; or in the words of their abstract,

Filtering words through our fingers as we type appears to be changing their meanings. On average, words typed with more letters from the right side of the QWERTY keyboard are more positive in meaning than words typed with more letters from the left: This is the QWERTY effect (Jasmin & Casasanto, 2012), which was shown previously across three languages. In five experiments, here we replicate the QWERTY effect in a large corpus of English words, extend it to two new languages (Portuguese and German), and show that the effect is mediated by space-valence associations encoded at the level of individual letters. Finally, we show that QWERTY appears to be influencing the names American parents give their children. Together, these experiments demonstrate the generality of the QWERTY effect, and inform our theories of how people’s bodily interactions with a cultural artifact can change the way they use language.

The most interesting new result is the baby-names experiment, in my opinion; and since I'm stuck in Heathrow Airport for a while, I thought I'd take a quick look at it.

Read the rest of this entry »

Comments (53)