Archive for Computational linguistics

Bookworm

When the Google Ngram Viewer came out, I tempered my enthusiastic praise with a complaint ("More on 'culturomics'", 12/17/2010):

The Science paper says that "Culturomics is the application of high-throughput data collection and analysis to the study of human culture". But as long as the historical text corpus itself remains behind a veil at Google Books, then "culturomics" will be restricted to a very small corner of that definition, unless and until the scholarly community can reproduce an open version of the underlying collection of historical texts.

I'm happy to say that the (non-Google part of the) Culturomics crew at the Harvard Cultural Observatory has taken a significant step in that direction, building on the work of the Open Library. You can check out what they've done with an alpha version of an online search interface at http://bookworm.culturomics.org/. But in my opinion, the online search interface, alpha or not, is the least important part of what's going on here.

Non-markovian yawp

Now that I've got morning internet access again, and the semester is more or less underway, it's time for another Breakfast Experiment™.

In "Markov's Heart of Darkness" (7/18/2011) and "Finch linguistics" (7/13/2011), we learned that Joseph Conrad's paragraphs are more markovian, at least in terms of their distribution of lengths, than zebra finch song bouts are. So I wondered about length distributions in some other sources: pause groups in conversational speech, and lines in Walt Whitman's poetry.

The visibilizing analyzer

Less than 50 years ago, this is what the future of data visualization looked like — H. Beam Piper, "Naudsonce", Analog 1962:

She had been using a visibilizing analyzer; in it, a sound was broken by a set of filters into frequency-groups, translated into light from dull red to violet paling into pure white. It photographed the light-pattern on high-speed film, automatically developed it, and then made a print-copy and projected the film in slow motion on a screen. When she pressed a button, a recorded voice said, "Fwoonk." An instant later, a pattern of vertical lines in various colors and lengths was projected on the screen.

This is in a future world with anti-gravity and faster-than-light travel.

Counting hierarchical kinds

Nick Collins, "Earth is home to 8.7 million species", The Telegraph 8/23/2011:

Previous guesses had put the total number of different species at anywhere between three million and 100 million, but a new calculation based on the way in which life forms are classified puts the estimate at the lower end of that scale.

The list of known species currently stands at about 1.2 million, but experts said that advances in technology meant that the remainder could be found and classified within the next century.

The study was undertaken by researchers from the Census of Marine Life, a ten-year project involving 2,700 scientists from more than 80 countries aimed at assessing the diversity of life in our seas and oceans which concluded in October 2010.

Authors vs. Speakers: A Tale of Two Subfields

The best part of Monday's post on the Facebook authorship-authentication controversy ("High-stakes forensic linguistics", 7/25/2011) was the contribution in the comments by Ron Butters, Larry Solan, and Carole Chaski. It's interesting to compare the situation they describe, and the frustration that they express about it, with the history of technologies for answering questions about the source of bits of speech rather than bits of text.

Empirical Foundations of Linguistics

I gave a talk a few weeks ago at the Laboratoire de Phonétique et Phonologie in Paris, founded in 1897 by L'abbé P.-J. Rousselot. Antonia Colazo-Simon took this picture of l'abbé and me:

High-stakes forensic linguistics

Over the past few months, there have been several developments in the legal battle between Paul Ceglia and Mark Zuckerberg over Ceglia's claim to part ownership of Facebook. As Ben Zimmer explains ("Decoding Your E-Mail Personality", NYT Sunday Review, 7/23/2011):

Mr. Ceglia says that a work-for-hire contract he arranged with Mr. Zuckerberg, then an 18-year-old Harvard freshman, entitles him to half of the Facebook fortune. He has backed up his claim with e-mails purported to be from Mr. Zuckerberg, but Facebook’s lawyers argue that the e-mail exchanges are fabrications. […]

The law firm representing Mr. Zuckerberg called upon Gerald McMenamin, emeritus professor of linguistics at California State University, Fresno, to study the alleged Zuckerberg e-mails. (Normally, other data like message headers and server logs could be used to pin down the e-mails’ provenance, but Mr. Ceglia claims to have saved the messages in Microsoft Word files.) Mr. McMenamin determined, in a report filed with the court last month, that “it is probable that Mr. Zuckerberg is not the author of the questioned writings.” Using “forensic stylistics,” he reached his conclusion through a cross-textual comparison of 11 different “style markers,” including variant forms of punctuation, spelling and grammar.

Google thinks Darwin is Freud

Or at least some automatically-derived Google thesaurus does:

For some searches including the term "Freud", a significant fraction of the hits (including the second one in the screenshot above) do not contain "Freud" (or derivatives like "Freudian") at all. At the same time, instances of "Darwin" in the displayed snippets are put into bold typeface, as if they were instances of one of the search terms.

Job trends

At the Revolutions blog, David Smith posts a nice little discussion about growth in jobs where people are making sense of data; he used job search site indeed.com to look at trends in job postings. Apparently postings involving "statistician" are not seeing a lot of growth, but "data scientists" have really started to catch on during the last year or so. (Hat tip to Joe Reisinger for tweeting this. He comments that data scientist is a "truly terrible name, but it's undeniably a different skill set: way too many statisticians can't code".)

Markov's Heart of Darkness

It seems that the length of Joseph Conrad's paragraphs — unlike the length of zebra finch song bouts — is well approximated by a two-state markov process.
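
One way to picture what a two-state markov process for lengths might look like is the sketch below. It is only an illustration under assumptions that are not in the post: the state names, the probabilities, and the rule that each token either ends the unit, hops to the other state, or continues in the current state are all hypothetical, and nothing here is fitted to Conrad's text.

```python
import random

# Hypothetical parameters, chosen only for illustration (not fitted to any text).
END_PROB = {"A": 0.20, "B": 0.02}     # chance that the unit ends after a token
SWITCH_PROB = {"A": 0.05, "B": 0.05}  # chance of hopping to the other state


def sample_length(rng, start_a_prob=0.5):
    """Draw one length (in tokens) from the two-state chain."""
    state = "A" if rng.random() < start_a_prob else "B"
    length = 0
    while True:
        length += 1  # emit one token in the current state
        r = rng.random()
        if r < END_PROB[state]:
            return length  # absorbed: the unit ends here
        if r < END_PROB[state] + SWITCH_PROB[state]:
            state = "B" if state == "A" else "A"  # hop to the other state
        # otherwise stay in the same state and keep going


def sample_lengths(n, seed=0):
    """Draw n lengths with a reproducible random seed."""
    rng = random.Random(seed)
    return [sample_length(rng) for _ in range(n)]


if __name__ == "__main__":
    lengths = sample_lengths(5000)
    print("mean length:", sum(lengths) / len(lengths))
    print("longest:", max(lengths))
```

With settings like these, the simulated lengths mix a quickly decaying "short" mode (state A) with a slowly decaying "long" mode (state B), which is the kind of shape a single constant stopping probability cannot produce.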

Finch linguistics

Andy Coughlan, "First evidence that birds tweet using grammar", New Scientist 6/26/2011:

They may not have verbs, nouns or past participles, but birds challenge the notion that humans alone have evolved grammatical rules.

Bengal finches have their own versions of such rules – known as syntax – says Kentaro Abe of Kyoto University, Japan. "Songbirds have a spontaneous ability to process syntactic structures in their songs," he says.

To show a sense of syntax in the animals, Abe's team played jumbled "ungrammatical" remixes of finch songs to the birds and measured the response calls.

The basic article is Kentaro Abe & Dai Watanabe, "Songbirds possess the spontaneous ability to discriminate syntactic rules", Nature Neuroscience 6/26/2011. And like the coverage in New Scientist, it's both true and misleading.

Biblical scholarship at the ACL

The 49th Annual Meeting of the Association for Computational Linguistics took place last week in Portland OR, and one of the papers presented there has gotten some (well deserved) press coverage: Moshe Koppel, Navot Akiva, Idan Dershowitz and Nachum Dershowitz, "Unsupervised Decomposition of a Document into Authorial Components", ACL2011.

Well, at least the AP covered it: Matti Friedman, "An Israeli algorithm sheds light on the Bible", AP 6/29/2011 (as usual published under different headlines in various publications, e.g. "Algorithm developed by Israeli scholars sheds light on the Bible’s authorship" (WaPo), "Software deciphers authorship of the Bible" (CathNews), etc.).

Spoken style correction: the iPeeve™

I just had a terrible idea that could probably make someone a modest fortune. I was inspired by Erin Gloria Ryan, "My Love Affair With 'Like'", Jezebel 6/26/2011:

I use the word "like" with embarrassing frequency. I've started paying attention to how other people talk as well, and it's amazing how many women who I know are very smart are similarly infected with like-itis.

Where does this come from? Why do we do this? […]

Since we know that saying "like" too much leads others to negatively judge our intelligence, maybe inserting "like" into a sentence is something that we do to purposefully make ourselves sound less intelligent and forceful and therefore less formidable than we actually are. We're sabotaging ourselves! […]

Maybe women of my generation have been taught, through positive social reinforcement, that we're supposed to pepper our speech with meaningless modifiers that make us sound a little less sure of ourselves, a little less credible. No one likes a show off or a know-it-all. Better temper your smart-talk with assurance to whoever you're speaking that you're not, like, a threat or anything. Any girl who's been teased for middle school nerdery has likely developed a long standing aversion for the feeling of being excluded for being too smart or opinionated. This is the way that socially acceptable people talk. This is the way that pretty people talk. Women are taught that it's more important to be pretty and socially accepted than it is to be smart. Ergo, like.