Language Log

Archive for Computational linguistics

The global language network

December 16, 2014 @ 10:16 am· Filed by Mark Liberman under Computational linguistics

Michael Erard has a nice discussion in Science magazine of a paper recently published in PNAS: "Want to influence the world? Map reveals the best languages to speak", 12/15/2014.

The original paper is Shahar Ronen et al., "Links that speak: the global language network and its association with global fame", PNAS 2014. And there's a cute interactive visualization.

Read the rest of this entry »

Another dumb Flesch-Kincaid exercise

October 26, 2014 @ 6:26 pm· Filed by Mark Liberman under Computational linguistics, Language and the media

E.J. Fox and Mike Spies, "Who was America's most well-spoken president?", vocativ.com 10/10/2014:

Using the Flesch-Kincaid readability test—the most well-known reading comprehension algorithm—Vocativ analyzed over 600 presidential speeches, going back to George Washington. We measured syllables along with word and sentence counts, and gave each speech a numerical grade. For instance, a grade of four means the content is accessible to a fourth-grader, while a grade of 12 corresponds to that of a high school graduate, a 15 to that of a college graduate and a 21 or higher to that of a PhD. Ultimately, we drew five conclusions, each of which was analyzed by Jeff Shesol, a historian and former speechwriter for Bill Clinton.
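To make concrete how little goes into such a grade, here is a minimal Python sketch of the standard Flesch-Kincaid grade-level formula, 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. The helper names `count_syllables` and `flesch_kincaid_grade` are invented for illustration, and the vowel-group syllable counter and punctuation-based sentence splitter are crude assumptions; nothing here is claimed to match Vocativ's actual procedure.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption): one syllable per run of vowels, minimum one."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid grade level from word, sentence, and syllable counts."""
    # Naive sentence split on runs of ., ! or ? (abbreviations will be miscounted).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


if __name__ == "__main__":
    print(flesch_kincaid_grade("The cat sat. The dog ran."))
```

Only two ratios enter the formula, so the result depends heavily on where the sentence boundaries are placed, which for transcribed speeches is largely an editorial decision.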

Read the rest of this entry »

"Voiceprints" again

October 14, 2014 @ 11:24 am· Filed by Mark Liberman under Computational linguistics

"Millions of voiceprints quietly being harvested as latest identification tool", The Guardian (AP), 10/13/2014:

Over the telephone, in jail and online, a new digital bounty is being harvested: the human voice.

Businesses and governments around the world increasingly are turning to voice biometrics, or voiceprints, to pay pensions, collect taxes, track criminals and replace passwords.

The article lists some successful applications:

Barclays plc recently experimented with voiceprinting as an identification for its wealthiest clients. It was so successful that Barclays is rolling it out to the rest of its 12 million retail banking customers.

“The general feeling is that voice biometrics will be the de facto standard in the next two or three years,” said Iain Hanlon, a Barclays executive.

Read the rest of this entry »

More fun with Facebook: THE

October 12, 2014 @ 5:58 am· Filed by Mark Liberman under Computational linguistics, Psychology of language

The script that I used to make that course assignment about Facebook pronouns ("Sex, age, and pronouns on Facebook", 9/19/2014; "More fun with Facebook pronouns", 9/27/2014) can trivially be focused on any other words — so here's "the":

Read the rest of this entry »

UM / UH in German

September 29, 2014 @ 7:41 am· Filed by Mark Liberman under Computational linguistics, Language and gender, Sociolinguistics

We've previously observed a surprisingly consistent pattern of age and gender effects on the relative frequency of filled pauses (or "hesitation sounds") with and without final nasals — what we usually write as "um" and "uh" in American English, or often as "er" and "erm" in British English.

Specifically, younger people use the UM form more than older people, while at any age, women use the UM form more than men do. We've seen this same pattern in various varieties of American English and in John Coleman's analysis of the spoken portion of the British National Corpus, and we found the sex effect in the HCRC Map Task Corpus, which involves task-oriented dialogues among college students from Glasgow in Scotland.

It was even more surprising that Martijn Wieling found the same pattern in a collection of Dutch conversational speech. And to make the puzzle more puzzling, Joe Fruehwald's analysis of the Philadelphia Neighborhood Corpus, which includes recordings across several decades of real time, suggests an on-going change in the direction of greater overall UM usage, as well as a life-cycle effect within each cohort of speakers. And Jack Grieve's analysis of Twitter data indicates a pattern of geographical variation within the U.S.

For additional details, see "Young men talk like old women", 11/6/2005; "Fillers: Autism, gender, age", 7/30/2014; "More on UM and UH", 8/3/2014; "UM UH 3", 8/4/2014; "Male and female word usage", 8/7/2014; "UM / UH geography", 8/13/2014; "Educational UM / UH", 8/13/2014; "UM / UH: Lifecycle effects vs. language change", 8/15/2014; "Filled pauses in Glasgow", 8/17/2014; "ER and ERM in the spoken BNC", 8/18/2014; "Um and uh in Dutch", 9/16/2014.

Now Martijn Wieling has found the same pattern in German. His guest post follows.

Read the rest of this entry »

More fun with Facebook pronouns

September 27, 2014 @ 11:25 am· Filed by Mark Liberman under Computational linguistics, Language and gender

Class discussion of the Facebook pronoun data brought out some interesting points.

We started by looking at the relationship between first-person singular pronouns ("I", "me", "my", "mine") and first-person plural pronouns ("we", "us", "our", "ours") as a function of the age of the poster. Here's the ratio of FPS/FPP frequencies:

Read the rest of this entry »
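For readers who want to try the FPS/FPP comparison themselves, here is a minimal Python sketch. It assumes the counts are available as (age, word, count) rows; that layout, the function name `fps_fpp_ratio_by_age`, and the sample numbers are all invented for illustration and do not reflect the actual Facebook table or the R script used in the course.

```python
# Pronoun sets as listed in the post.
FPS = {"i", "me", "my", "mine"}       # first-person singular
FPP = {"we", "us", "our", "ours"}     # first-person plural


def fps_fpp_ratio_by_age(rows):
    """rows: iterable of (age, word, count) tuples. Returns {age: FPS/FPP ratio}."""
    fps, fpp = {}, {}
    for age, word, count in rows:
        w = word.lower()
        if w in FPS:
            fps[age] = fps.get(age, 0) + count
        elif w in FPP:
            fpp[age] = fpp.get(age, 0) + count
    # Ages with no plural-pronoun tokens are skipped to avoid dividing by zero.
    return {age: fps.get(age, 0) / fpp[age] for age in sorted(fpp) if fpp[age] > 0}


if __name__ == "__main__":
    # Made-up counts, purely to show the call.
    sample = [
        (20, "I", 30), (20, "my", 10), (20, "we", 8), (20, "the", 100),
        (40, "me", 12), (40, "our", 6),
    ]
    print(fps_fpp_ratio_by_age(sample))
```

Because both pronoun classes share the same total token count within an age group, the ratio of raw counts equals the ratio of relative frequencies, so no normalization step is needed.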

Sex, age, and pronouns on Facebook

September 19, 2014 @ 8:46 am· Filed by Mark Liberman under Computational linguistics

Andy Schwartz and others at the World Well-Being Project have worked with "Facebook posts from over 75,000 volunteers who also took the standard Interpersonal Personality Item Pool (IPIP) personality test to measure the 'Big Five' personality traits", looking for linguistic features that correlate with those aspects of personality measured by that test.

Lyle Ungar talked about this work a few days ago (Andy was unfortunately out of town), for an audience of mostly first-year undergraduates. The venue was a weekly event, Dinners With Interesting People, held in the Quad, an undergraduate residence here at Penn.

This year, the DWIP talks (though still open to the public) are integrated into a Freshman Seminar called "The Landscape of Research and Innovation at Penn". The idea is to give the participants a general idea of what kinds of research go on around here, and how they might get involved. As part of the course, I've asked DWIP guests to provide a dataset that we can use as part of a course assignment in quantitative analysis. Since the students have widely varied backgrounds in mathematics, statistics, and programming, and since the quantitative analysis part of the course is only one of several aspects, the assignments start with an R script that does something interesting, with the assigned task being to modify the script to do something a bit different.

In this case, Andy was kind enough to give me a table indicating number of posts and token counts for each "word", in their Facebook dataset, for males and females of each age. Inspired by Jamie Pennebaker's The Secret Life of Pronouns, I decided to focus the quantitative analysis assignment around the issue of pronoun usage. The body of this post lays out some of the things that I've noticed in setting the assignment up.

Read the rest of this entry »

Predictive poetry

September 18, 2014 @ 6:43 am· Filed by Mark Liberman under Computational linguistics, Language and culture

A few years ago, people noticed that the predictive typing on Android smartphones could construct interesting phrases all on its own: "Your typical sentence", 6/13/2012. iOS 8 has caught up — Geoffrey Fowler and Joanna Stern, "iOS 8 Keyboard Makes Hilarious 'Mad Libs' For You", WSJ 9/17/2014:

Now the latest version of Apple’s iPhone software, iOS 8, adds a layer of smarts on top of autocorrect called QuickType, predictive typing of a sort previously found on Android. Not only does it suggest spelling, it also suggests words you might want to type next. If you keep following its train of robotic thought, QuickType will form entire sentences on your behalf.

The result is so goofy that it is brilliant. For the last week, we—your WSJ personal technology columnists—have been conducting serious tests of the new iPhones and iOS 8, while also holding nonsensical auto-generated conversations with each other.

Read the rest of this entry »

Text analytics applied to applications of things like text analytics

September 2, 2014 @ 4:22 pm· Filed by Mark Liberman under Computational linguistics

South by Southwest (SXSW) uses a web-based voting method to choose panels, and so Jason Baldridge took a look at the titles submitted for Phil Resnik's "Putting a Real-Time Face on Polling" session, to

… see whether some straight-forward Unix commands, text analytics and natural language processing can reveal anything interesting about them.

He describes the results in "Titillating Titles: Uncoding SXSW Proposal Titles with Text Analytics", 9/2/2014.

Nth Xest

September 1, 2014 @ 6:03 pm· Filed by Mark Liberman under Computational linguistics

In the course of writing about the "fourth highest of five levels", I looked around at how the pattern "Nth Xest" is used in general. I found that uses of such expressions overwhelmingly count from the "top" where X names a top-oriented scale (high, big, long, etc.), and count from the "bottom" where X names a bottom-oriented scale (low, small, short, etc.). In other words, unsurprisingly, "Nth Xest" normally counts (up or down) from whatever end of the scale "Xest" names.

Another (less logically necessary but still unsurprising) thing I noticed is that top-oriented counts are always a lot bigger than corresponding bottom-oriented counts, and that counts decrease almost-proportionately as N increases. Thus from Google Books ngrams:

	second	third	fourth	fifth	sixth
highest	34447	9692	3148	1411	784
lowest	6006	1455	491	293	138
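One way to check the "almost-proportionately" observation is to divide each "highest" count by the corresponding "lowest" count, using the numbers in the table above. The short Python sketch below does that; the function name `ratios` is just an illustrative choice.

```python
# Counts copied from the Google Books ngram table above.
ordinals = ["second", "third", "fourth", "fifth", "sixth"]
highest = [34447, 9692, 3148, 1411, 784]
lowest = [6006, 1455, 491, 293, 138]


def ratios(top, bottom):
    """Element-wise ratio of top-oriented to bottom-oriented counts."""
    return [t / b for t, b in zip(top, bottom)]


if __name__ == "__main__":
    for ordinal, r in zip(ordinals, ratios(highest, lowest)):
        print(f"{ordinal:>7}  highest/lowest = {r:.2f}")
```

For these five ordinals the ratio stays between roughly 4.8 and 6.7, even though the raw counts fall by a factor of more than 40 from "second" to "sixth".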

Read the rest of this entry »

Sex and pronouns

August 24, 2014 @ 12:22 pm· Filed by Mark Liberman under Computational linguistics, Language and gender

Andy Schwartz recently gave me a copy of word counts by sex and age for the Facebook posts from the PPC's World Well-Being Project. So I thought I'd compare some of the Facebook counts to data from the LDC's archive of conversational speech transcripts. As a start, here's a comparison of rates of pronoun usage in the PPC Facebook sample and in the transcripts of the LDC's Fisher English datasets (combining Part 1 and Part 2).

Read the rest of this entry »

Geoffrey Leech, 1936-2014

August 20, 2014 @ 3:28 pm· Filed by Ben Zimmer under Computational linguistics, Obituaries

Geoffrey Leech, one of the giants of corpus-based computational linguistics, passed away yesterday. With the death of Chuck Fillmore in February, the field has lost two of its pillars this year.

Read the rest of this entry »

Lorem China

August 20, 2014 @ 5:35 am· Filed by Mark Liberman under Computational linguistics, Humor

Brian Krebs, "Lorem Ipsum: Of Good & Evil, Google & China", Krebs on Security 8/14/2014:

Imagine discovering a secret language spoken only online by a knowledgeable and learned few. Over a period of weeks, as you begin to tease out the meaning of this curious tongue and ponder its purpose, the language appears to shift in subtle but fantastic ways, remaking itself daily before your eyes. And just when you are poised to share your findings with the rest of the world, the entire thing vanishes.

Read the rest of this entry »
