Archive for Computational linguistics

Fake account spotting on Facebook

One language-related story in the British press over the weekend was that Gavin McGowan was threatened by Facebook with having his account shut down… because they said his name was fake.

About ten years ago Gavin learned some Scottish Gaelic and started using the Gaelic spelling of his name: Gabhan Mac A Ghobhainn. Facebook is apparently running software designed to spot bogus accounts on the basis of the letter-strings used to name them. Gabhan's name evidently failed the test.


"They called for more structure"

From Kevin Knight's home page:

I think our approach to syntax in machine translation is best described in D. Barthelme's short story They called for more structure….


REAPER

A couple of days ago, I mentioned ("Sarah Koenig", 2/5/2015) that David Talkin was releasing a new pitch tracking program called REAPER (available from GitHub at the link). After a few minor improvements in documentation, it's ready for the general public.

The reaper program uses the EpochTracker class to simultaneously estimate the location of voiced-speech "epochs" or glottal closure instants (GCI), voicing state (voiced or unvoiced) and fundamental frequency (F0 or "pitch"). We define the local (instantaneous) F0 as the inverse of the time between successive GCI.

After trying it out, I can recommend it whole-heartedly — it's robust and accurate and fast. It's my new standard pitch tracker.
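
As a rough illustration of the definition quoted above (local F0 as the inverse of the time between successive GCI), here is a minimal Python sketch. It is not REAPER's code: the function name `f0_from_gci` and the sample times are invented for the example, and it ignores the voicing decision that the real tracker estimates jointly.

```python
def f0_from_gci(gci_times):
    """Return (time, F0 in Hz) pairs from glottal closure instants given in seconds.

    Each F0 value is the inverse of the interval between one GCI and the
    previous one, and is reported at the time of the later GCI.
    """
    pairs = []
    for prev, curr in zip(gci_times, gci_times[1:]):
        period = curr - prev
        if period <= 0:
            raise ValueError("GCI times must be strictly increasing")
        pairs.append((curr, 1.0 / period))
    return pairs


# Hypothetical GCI times (seconds): two 10 ms periods, then one 12.5 ms period.
gci = [0.100, 0.110, 0.120, 0.1325]
for t, f0 in f0_from_gci(gci):
    print(f"{t:.4f} s  {f0:.1f} Hz")
```

A 10 ms gap between epochs corresponds to 100 Hz and a 12.5 ms gap to 80 Hz, so the estimate can change from one glottal cycle to the next.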


Decreasing definiteness

During the course of the 20th century, the frequency of the English definite article "the" decreased gradually and radically. I first noticed this effect about a year ago, in a post about the history of State of the Union addresses ("SOTU evolution", 1/26/2014), where I observed, in reference to the graph on the right, that

The average frequency of "the" in the most recent 10 SOTU addresses (2004-2013) was 47,458 per million words; in the first 10 addresses (1790-1799, all delivered as speeches to Congress) it was 93,201 per million words, almost double the frequency. And the decline during the 20th-century era of oral addresses seems to have been a gradual one.

I speculated that

Maybe the style of speeches has been getting gradually less formal, and therefore gradually less like written style. Or maybe even formal styles have been changing.

And I noted that a corresponding effect can be seen in two other sources, the BYU Corpus of Historical American English (COHA) and the Google Books N-Gram viewer (GNG), though it is considerably smaller in magnitude:

COHA and the Google Books data pretty much agree, which is reassuring; and they both suggest a slight decline in the frequency of "the"; but the change that they show is very modest compared to the change in SOTU frequencies. So I feel that the explanation for the SOTU change remains to be found.

At that point, I turned my attention to other aspects of SOTU evolution. But a student paper recently reminded me of this issue.
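
For readers who want to try this kind of count themselves, a per-million-words frequency can be sketched as below. This is an illustration rather than the script behind the figures quoted above: the function name `per_million` and the regex tokenizer are assumptions, and a different tokenization would shift the numbers somewhat.

```python
import re


def per_million(text, word="the"):
    """Frequency of `word` per million tokens, using a crude tokenizer.

    Tokens are lower-cased runs of letters and apostrophes; this is an
    assumption made for the sketch, not the method used for the SOTU counts.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return 1_000_000 * tokens.count(word) / len(tokens)


# Toy text, just to exercise the function.
print(per_million("The cat saw the dog"))

# Ratio of the two SOTU figures quoted in the post.
early, recent = 93_201, 47_458
print(round(early / recent, 2))
```

The ratio of the two quoted figures comes out at about 1.96, which matches the "almost double" description.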


The global language network

Michael Erard has a nice discussion in Science magazine of a paper recently published in PNAS: "Want to influence the world? Map reveals the best languages to speak", 12/15/2014.

The original paper is Shahar Ronen et al., "Links that speak: the global language network and its association with global fame", PNAS 2014. And there's a cute interactive visualization.


Another dumb Flesch-Kincaid exercise

E.J. Fox and Mike Spies, "Who was America's most well-spoken president?", vocativ.com 10/10/2014:

Using the Flesch-Kincaid readability test—the most well-known reading comprehension algorithm—Vocativ analyzed over 600 presidential speeches, going back to George Washington. We measured syllables along with word and sentence counts, and gave each speech a numerical grade. For instance, a grade of four means the content is accessible to a fourth-grader, while a grade of 12 corresponds to that of a high school graduate, a 15 to that of a college graduate and a 21 or higher to that of a PhD. Ultimately, we drew five conclusions, each of which was analyzed by Jeff Shesol, a historian and former speechwriter for Bill Clinton.
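
For reference, the Flesch-Kincaid grade level is a simple function of words per sentence and syllables per word. The sketch below uses the standard published coefficients; the syllable counter (runs of vowel letters) and the punctuation-based sentence splitter are crude stand-ins chosen for the example, and nothing here is claimed to match whatever implementation Vocativ used.

```python
import re


def count_syllables(word):
    """Very rough syllable estimate: number of vowel-letter runs, at least 1."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def fk_grade(text):
    """Flesch-Kincaid grade level:
    0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    """
    # Sentences are whatever falls between runs of ., ! or ? (an assumption).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word and one sentence")
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


short = "The cat sat. The dog ran."
joined = "The cat sat and the dog ran."
print(round(fk_grade(short), 2), round(fk_grade(joined), 2))
```

Because sentence length enters directly, the score depends on where the sentence boundaries are placed: the second toy text differs from the first only in joining two sentences with "and", and it gets a higher grade.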


"Voiceprints" again

"Millions of voiceprints quietly being harvested as latest identification tool", The Guardian (AP), 10/13/2014:

Over the telephone, in jail and online, a new digital bounty is being harvested: the human voice.  

Businesses and governments around the world increasingly are turning to voice biometrics, or voiceprints, to pay pensions, collect taxes, track criminals and replace passwords.

The article lists some successful applications:

Barclays plc recently experimented with voiceprinting as an identification for its wealthiest clients. It was so successful that Barclays is rolling it out to the rest of its 12 million retail banking customers.  

“The general feeling is that voice biometrics will be the de facto standard in the next two or three years,” said Iain Hanlon, a Barclays executive.  


More fun with Facebook: THE

The script that I used to make that course assignment about Facebook pronouns ("Sex, age, and pronouns on Facebook", 9/19/2014; "More fun with Facebook pronouns", 9/27/2014) can trivially be focused on any other words — so here's "the":


UM / UH in German

We've previously observed a surprisingly consistent pattern of age and gender effects on the relative frequency of filled pauses (or "hesitation sounds") with and without final nasals — what we usually write as "um" and "uh" in American English, or often as "er" and "erm" in British English.

Specifically, younger people use the UM form more than older people, while at any age, women use the UM form more than men do. We've seen this same pattern in various varieties of American English and in John Coleman's analysis of the spoken portion of the British National Corpus, and we found the sex effect in the HCRC Map Task Corpus, which involves task-oriented dialogues among college students from Glasgow in Scotland.

It was even more surprising that Martijn Wieling found the same pattern in a collection of Dutch conversational speech.  And to make the puzzle more puzzling, Joe Fruehwald's analysis of the Philadelphia Neighborhood Corpus, which includes recordings across several decades of real time, suggests an on-going change in the direction of greater overall UM usage, as well as a life-cycle effect within each cohort of speakers. And Jack Grieve's analysis of Twitter data indicates a pattern of geographical variation within the U.S.

For additional details, see "Young men talk like old women", 11/6/2005; "Fillers: Autism, gender, age", 7/30/2014;  "More on UM and UH", 8/3/2014; "UM UH 3", 8/4/2014; "Male and female word usage", 8/7/2014; "UM / UH geography", 8/13/2014; "Educational UM / UH", 8/13/2014; "UM / UH: Lifecycle effects vs. language change", 8/15/2014; "Filled pauses in Glasgow", 8/17/2014; "ER and ERM in the spoken BNC", 8/18/2014; "Um and uh in Dutch", 9/16/2014.

Now Martijn Wieling has found the same pattern in German. His guest post follows.


More fun with Facebook pronouns

Class discussion of the Facebook pronoun data brought out some interesting points.

We started by looking at the relationship between first-person singular pronouns ("I", "me", "my", "mine") and first-person plural pronouns ("we", "us", "our", "ours") as a function of the age of the poster. Here's the ratio of FPS/FPP frequencies:
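
The FPS/FPP ratio itself is simple to compute from per-age word counts. The sketch below is in Python rather than the R used for the course assignment, and the data layout (a dictionary from age to word counts) and the toy numbers are invented for illustration; they are not the Facebook data.

```python
FPS = {"i", "me", "my", "mine"}      # first-person singular pronouns
FPP = {"we", "us", "our", "ours"}    # first-person plural pronouns


def fps_fpp_ratio(counts_by_age):
    """Map each age to (total FPS count) / (total FPP count).

    `counts_by_age` maps age -> {word: count}; this layout is an assumption.
    Ages with no plural pronouns at all get NaN rather than an error.
    """
    ratios = {}
    for age, counts in sorted(counts_by_age.items()):
        fps = sum(counts.get(w, 0) for w in FPS)
        fpp = sum(counts.get(w, 0) for w in FPP)
        ratios[age] = fps / fpp if fpp else float("nan")
    return ratios


# Made-up counts for two ages, only to show the calculation.
toy = {
    20: {"i": 900, "my": 300, "we": 100, "our": 50},
    50: {"i": 500, "me": 100, "we": 200, "us": 100},
}
print(fps_fpp_ratio(toy))
```

Taking the ratio of the two counts at each age means that differences in how much each age group posts cancel out.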


Sex, age, and pronouns on Facebook

Andy Schwartz and others at the World Well-Being Project have worked with "Facebook posts from over 75,000 volunteers who also took the standard Interpersonal Personality Item Pool (IPIP) personality test to measure the 'Big Five' personality traits", looking for linguistic features that correlate with those aspects of personality measured by that test.

Lyle Ungar talked about this work a few days ago (Andy was unfortunately out of town), for an audience of mostly first-year undergraduates. The venue was a weekly event, Dinners With Interesting People, held in the Quad, an undergraduate residence here at Penn.

This year, the DWIP talks (though still open to the public) are integrated into a Freshman Seminar called "The Landscape of Research and Innovation at Penn". The idea is to give the participants a general idea of what kinds of research go on around here, and how they might get involved. As part of the course, I've asked DWIP guests to provide a dataset that we can use as part of a course assignment in quantitative analysis.  Since the students have widely varied backgrounds in mathematics, statistics, and programming, and since the quantitative analysis part of the course is only one of several aspects, the assignments start with an R script that does something interesting, with the assigned task being to modify the script to do something a bit different.

In this case, Andy was kind enough to give me a table indicating number of posts and token counts for each "word", in their Facebook dataset, for males and females of each age.  Inspired by Jamie Pennebaker's The Secret Life of Pronouns,  I decided to focus the quantitative analysis assignment around the issue of pronoun usage. The body of this post lays out some of the things that I've noticed in setting the assignment up.


Predictive poetry

A few years ago, people noticed that the predictive typing on Android smartphones could construct interesting phrases all on its own: "Your typical sentence", 6/13/2012. iOS 8 has caught up  — Geoffrey Fowler and Joanna Stern, "iOS 8 Keyboard Makes Hilarious 'Mad Libs' For You", WSJ 9/17/2014:

Now the latest version of Apple’s iPhone software, iOS 8, adds a layer of smarts on top of autocorrect called QuickType, predictive typing of a sort previously found on Android. Not only does it suggest spelling, it also suggests words you might want to type next. If you keep following its train of robotic thought, QuickType will form entire sentences on your behalf.

The result is so goofy that it is brilliant. For the last week, we—your WSJ personal technology columnists—have been conducting serious tests of the new iPhones and iOS 8, while also holding nonsensical auto-generated conversations with each other.


Text analytics applied to applications of things like text analytics

South by Southwest (SXSW) uses a web-based voting method to choose panels, and so Jason Baldridge took a look at the titles submitted for Phil Resnik's "Putting a Real-Time Face on Polling" session,  to

… see whether some straight-forward Unix commands, text analytics and natural language processing can reveal anything interesting about them.

He describes the results in "Titillating Titles: Uncoding SXSW Proposal Titles with Text Analytics", 9/2/2014.

 
