Archive for Computational linguistics

Weird languages?

Tyler Schnoebelen, "The Weirdest Languages", Idibon blog 6/21/2013:

The World Atlas of Language Structures evaluates 2,676 different languages in terms of a bunch of different language features. These features include word order, types of sounds, ways of doing negation, and a lot of other things—192 different language features in total. […]

The data in WALS is fairly sparse, so we restrict ourselves to the 165 features that have at least 100 languages in them (at this stage we also knock out languages that have fewer than 10 of these—dropping us down to 1,693 languages).

Now, one problem is that if you just stop there you have a huge amount of collinearity. Part of this is just the nature of the features listed in WALS—there’s one for overall subject/object/verb order and then separate ones for object/verb and subject/verb. Ideally, we’d like to judge weirdness based on unrelated features. We can focus in on features that aren’t strongly correlated with each other (between two correlated features, we pick the one that has more languages coded for it). We end up with 21 features in total.
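For readers who want to see the shape of that filtering, here is a minimal sketch on made-up data. It is not the Idibon code. The excerpt doesn't say which association measure or cutoff was used, so Cramér's V and the `max_assoc` threshold are assumptions, as are the function names and the toy languages and features. The greedy pass visits features from most-coded to least-coded, which is one simple way to keep the better-covered member of a correlated pair.

```python
from collections import Counter
from math import sqrt

def coverage(data, feature):
    """Number of languages that have a value coded for this feature."""
    return sum(1 for feats in data.values() if feature in feats)

def cramers_v(data, f1, f2):
    """Cramér's V between two categorical features (assumed measure, not from the post)."""
    pairs = [(feats[f1], feats[f2]) for feats in data.values()
             if f1 in feats and f2 in feats]
    n = len(pairs)
    rows = Counter(a for a, _ in pairs)
    cols = Counter(b for _, b in pairs)
    k = min(len(rows), len(cols))
    if n == 0 or k < 2:
        return 0.0  # no overlap or a constant feature: treat as unassociated
    joint = Counter(pairs)
    chi2 = 0.0
    for a in rows:
        for b in cols:
            expected = rows[a] * cols[b] / n
            chi2 += (joint[(a, b)] - expected) ** 2 / expected
    return sqrt(chi2 / (n * (k - 1)))

def select_features(data, min_langs, min_feats, max_assoc):
    # Step 1: keep features coded for at least min_langs languages.
    all_feats = {f for feats in data.values() for f in feats}
    dense = {f for f in all_feats if coverage(data, f) >= min_langs}
    # Step 2: drop languages with fewer than min_feats of those features.
    langs = {}
    for lang, feats in data.items():
        kept = {f: v for f, v in feats.items() if f in dense}
        if len(kept) >= min_feats:
            langs[lang] = kept
    # Step 3: greedy pass, best-covered features first (ties broken by name);
    # skip any feature strongly associated with one already chosen.
    ranked = sorted(dense, key=lambda f: (-coverage(langs, f), f))
    chosen = []
    for f in ranked:
        if all(cramers_v(langs, f, g) <= max_assoc for g in chosen):
            chosen.append(f)
    return langs, chosen

# Toy data: "ov" is fully predictable from "order"; "tone" is unrelated.
toy = {
    "A": {"order": "SOV", "ov": "OV", "tone": "yes"},
    "B": {"order": "SVO", "ov": "VO", "tone": "no"},
    "C": {"order": "SOV", "ov": "OV", "tone": "no"},
    "D": {"order": "SVO", "ov": "VO", "tone": "yes"},
    "E": {"order": "SOV", "rare": "x"},
}
langs, feats = select_features(toy, min_langs=3, min_feats=2, max_assoc=0.5)
print(sorted(langs), feats)
```

With the real WALS data the thresholds quoted above would be `min_langs=100` and `min_feats=10`; the association cutoff would have to be chosen by hand.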

Read the rest of this entry »

Comments (57)

City of the big disjunctions

Continuing, in another connection, the exploration of real-estate listings that I discussed earlier ("Long is good, good is bad, nice is worse, and ! is questionable", 6/12/2013; "Significant (?) relationships everywhere", 6/14/2013), I stumbled on this curious factoid about the use of "and" and "or" in trulia.com's listings for the ten cities I've harvested so far:

Read the rest of this entry »

Comments (14)

Forensic linguistics in the Zimmerman case

The judge in the Zimmerman case has recently decided to let the jury decide for themselves about the source of the screams in the 911 tape ("Jury to decide whose voice on 911 call in Zimmerman case"). This decision is a stinging rebuke to the "expert" testimony of Tom Owen and Alan Reich, and supports the testimony of Peter French, George Doddington, and Hirotaka Nakasone. For a summary of the dueling experts, see Andrew Branca, "Zimmerman Case: Dr. Hirotaka Nakasone, FBI, and the low-quality 3-second audio file", Legal Insurrection 6/7/2013, "Zimmerman Prosecution’s Voice Expert admits: 'This is not really good evidence'", 6/8/2013, and "Zimmerman Case: Experts Call State’s Scream Claims 'Absurd' 'Ridiculous' and 'Imaginary Stuff'", 6/9/2013.

I don't have time this morning to discuss the issues at greater length, but it's clear that the judge's evaluation of the situation was correct.

Read the rest of this entry »

Comments (3)

Significant (?) relationships everywhere

While we're on the subject of maybe-meaningful data-mining output, let me share with you some semi-refined ore from the dataset of real-estate listings that I mentioned the other day.

Read the rest of this entry »

Comments (7)

Long is good, good is bad, nice is worse, and ! is questionable

Sanette Tanaka, "Fancy Real-Estate Listing, Fancier Verbiage", WSJ 6/6/2013:

Savvy real-estate agents know it's not just what you say. It's how long it takes you to say it.

More-expensive homes go hand-in-hand with longer real-estate agents' remarks—the language written by the agent that supplements the house description and photos in a listing. Agents use a median 250 characters for homes listed under $100,000, according to an analysis for The Wall Street Journal by real-estate listings company Zillow. For homes priced over $1 million, they go nearly twice as long, with a median 487 characters. (That's about the length of this paragraph.)

"Generally, what you find is that regardless of the region, the more expensive the home is, the more characters are used to describe that home," says Stan Humphries, chief economist at Zillow.

Read the rest of this entry »

Comments (16)

"The saddest tweeters live in Texas"

That's not from the chorus of a postmodern country song — it's the title of a National Geographic piece discussing Morgan R. Frank, Kameron Decker Harris, Peter Sheridan Dodds, and Christopher M. Danforth, "The Geography of Happiness: Connecting Twitter Sentiment and Expression, Demographics, and Objective Characteristics of Place", PLoS ONE 5/29/2013.

I don't have time this morning to do anything more than point to the article, but my previous interactions with Peter Dodds and others at the Vermont Complex Systems Center have been positive.

Comments (19)

Ngram morality

David Brooks has found a congenial story in Google ngrams, or rather in three papers about ngrammatical history, which he interprets to show that virtue, discipline, and concern for the common good have been declining, while subjectivity and concern for self-esteem have increased ("What Our Words Tell Us", NYT 5/20/2013).

Brooks doesn't cite or link to the papers, which in my opinion is a form of journalistic malpractice, so here they are:

Jean M. Twenge, W. Keith Campbell, and Brittany Gentile, "Increases in Individualistic Words and Phrases in American Books, 1960–2008", PLoS ONE 7/10/2012
Pelin Kesebir and Selin Kesebir, "The Cultural Salience of Moral Character and Virtue Declined in Twentieth Century America", Journal of Positive Psychology, Forthcoming
Daniel B. Klein, "Ngrams of the Great Transformations", GMU Working Paper in Economics, 2013

Read the rest of this entry »

Comments (22)

2013 Blizzard Challenge

From Simon King:

I am pleased to announce that the English section of this year's Blizzard Challenge listening test is now live. Please help us out by taking part, and encouraging your colleagues, students, friends, contacts, etc. to take part too. It's your chance to hear a range of speech synthesisers, including some really good ones. Please circulate this message widely – for example, on mailing lists, forums and using social media – we need to reach as many people as possible in the coming month or so.

Read the rest of this entry »

Comments (7)

Annals of algorithmic communication

Yesterday afternoon, I got this interesting email message:

The departure time for US Airways flight # 3314, from Detroit to Philadelphia on May 11 at 6:05 PM has changed. The flight is delayed due to air traffic at the destination airport. Your estimated time of Departure is 6:05 PM.

Read the rest of this entry »

Comments (26)

What use electrolytic pickling?

Once you've written down your responses to the dozen audio clips in yesterday's perception experiment, you can check them against the truth, and also against the transcripts generated by Google's automatic captioning system, both given below.

Read the rest of this entry »

Comments (37)

Perception Experiment

Here are a dozen short audio clips from a lecture, stripped from YouTube, and re-encoded after editing as mp3 files. Despite being handicapped by this marginal sound quality, and even more by the lack of context, you will probably be able to transcribe them fairly well. Please do so, and retain your results for discussion tomorrow morning (where "tomorrow" = Wednesday 5/8/2013).

Read the rest of this entry »

Comments off

NPR: oyez.org finishes Supreme Court oral arguments project

"Once Under Wraps, Supreme Court Audio Trove Now Online", NPR All Things Considered 4/24/2013:

The court has been releasing audio during the same week as arguments only since 2010. Before that, audio from one term generally wasn't available until the beginning of the next term. But the court has been recording its arguments for nearly 60 years, at first only for the use of the justices and their law clerks, and eventually also for researchers at the National Archives, who could hear — but couldn't duplicate — the tapes. As a result, until the 1990s, few in the public had ever heard recordings of the justices at work.

But as of just a few weeks ago, all of the archived historical audio — which dates back to 1955 — has been digitized, and almost all of those cases can now be heard and explored at an online archive called the Oyez Project.

Read the rest of this entry »

Comments (8)

Anatomy of a spambot

We've often had occasion to wonder how spammy blog comments are linguistically constructed. (See, most recently, Mark Liberman's post, "Numerous upon the written content material," in which he refers to spam comments as "aleatoric sub-poetry.") Now, on Quartz, David Yanofsky and Zachary M. Seward expose how spam comments are engineered:

Comment spam follows a formula, which was made plain the other day when a spambot accidentally posted its entire template on the blog of programmer Scott Hanselman. With his permission, we’ve reproduced some of the spam comment recipes here and added colorful formatting to make it readable. The spambot constructs new, vaguely unique comments by selecting from each set of options. We hope you find it wonderful | terrific | brilliant | amazing | great | excellent | fantastic | outstanding | superb.
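The "selecting from each set of options" step is easy to picture in code. What follows is a minimal sketch, not the spambot's actual implementation. The `{a|b|c}` brace syntax, the `spin` function name, and the shortened template are assumptions made for illustration; the quoted passage shows only the pipe-separated alternatives.

```python
import random
import re

def spin(template, rng):
    """Replace each {a|b|c} group with one option picked at random."""
    def pick(match):
        options = [opt.strip() for opt in match.group(1).split("|")]
        return rng.choice(options)
    # Matches only flat groups; nested braces are not handled in this sketch.
    return re.sub(r"\{([^{}]*)\}", pick, template)

# Hypothetical template with a shortened option list.
template = "We hope you find it {wonderful | terrific | brilliant}."
comment = spin(template, random.Random())
print(comment)
```

A template with several such groups yields the product of the option counts in distinct comments, which is how a short recipe can produce the "vaguely unique" output described above.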

Read the rest of this entry »

Comments (27)