Archive for Computational linguistics

On the front lines of Twitter linguistics

I have a piece in today's New York Times Sunday Review section, "Twitterology: A New Science?" In the limited space I had, I tried to give a taste of what research is currently out there using Twitter to build various types of linguistic corpora. Obviously, there's a lot more that could be said about these projects and other fascinating ones currently underway. Herewith a few notes.

Read the rest of this entry »

Comments (14)

Where he at now?

That's the question on a t-shirt designed by John Allison, the author of the Bad Machinëry comic series:

Remember that dude? Always poppin' up in the corner? Wonder what he doin' now? Where he at now?

For those who are too young (or too old, or too fortunate in some other way) to have encountered Microsoft's Office Assistant "Clippit", nicknamed "Clippy", the Wikipedia page may be helpful.

Read the rest of this entry »

Comments (14)

Amy was found dead in his apartment

I'm spending three days in Tampa at the kick-off meeting for DARPA's new BOLT program. Today was Language Sciences Day, and among many other events, there was a "Semantics Panel", in which half a dozen luminaries discussed ways that the analysis of meaning might play a role again in machine translation. The "again" part comes up because, as Kevin Knight observed in starting the panel off, natural language processing and artificial intelligence went through a bitter divorce 20 years ago. ("And", Gene Charniak added, "I haven't spoken to myself since.")

The various panelists had somewhat different ideas about what to do, and the question period uncovered a substantially larger range of opinions represented in the audience. But it occurred to me that there's a simple and fairly superficial kind of semantic analysis that is not used in any of the MT systems that I'm familiar with, to their considerable detriment — despite the fact that algorithms with decent performance on this task have been around for many years.

Read the rest of this entry »

Comments (15)

Replicating the snuckward trend

In yesterday's post "Deceptively valuable", I made use of counts from the Google Books ngram dataset, as seen through Mark Davies' convenient interface. That was a case where the ngram dataset's flaws (uncertain metadata, lack of ability to look at context, etc.) are more than balanced by its virtues. In thinking about some of the other issues involved, I remembered a case that makes it possible to check the ngram dataset's answers against those given by another historical collection: the trend over the past century for Americans to replace "sneaked" with "snuck".

Read the rest of this entry »

Comments (15)

Deceptively valuable

A couple of weeks ago, Eric Baković posted about phrases of the form deceptively <ADJECTIVE>, and gave the results of an online survey of more than 1500 LL readers ("Watching the deceptive", 10/2/2011), who were each asked to interpret one of two phrases:

Interpretation          The exam was deceptively easy.   The exam was deceptively hard.
The exam was easy.      56.8%                            11.8%
The exam was hard.      36.0%                            84.0%
The exam was neither.   7.2%                             4.2%

Eric suggested that this variability in judgments, and also the asymmetry between easy and hard, might be connected to the phenomenon of misnegation. And there were many other interesting observations and speculations in Eric's post and the 64 comments on it. But a simple tally of collocational frequency for the word deceptively suggests a couple of relevant factors that neither Eric nor any of the commenters noticed.
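A tally of this kind is easy to sketch. The Python below is only an illustration of counting the words that immediately follow deceptively; the function name `collocates_after` and the sample sentences are made up for the example, and it is not the corpus or the method used for the counts discussed in the full post.

```python
import re
from collections import Counter

def collocates_after(text, target="deceptively"):
    """Count the words that immediately follow `target` in `text`."""
    # Lowercase and split into simple word tokens (letters and apostrophes).
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    # Walk over adjacent token pairs and record what follows the target.
    for word, following in zip(tokens, tokens[1:]):
        if word == target:
            counts[following] += 1
    return counts

# Toy sample text, invented for illustration (not corpus data).
sample = (
    "The exam was deceptively easy. The trail looked deceptively simple. "
    "It was a deceptively simple plan, and a deceptively easy win."
)

tally = collocates_after(sample)
for word, n in tally.most_common():
    print(word, n)
```

On a real corpus the same loop would run over far more text, and the interesting part is the ranking of the adjectives rather than any single count.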

Read the rest of this entry »

Comments (28)

Sirimania

Yesterday's Doonesbury joins the parade of praise for Siri:

Read the rest of this entry »

Comments (10)

Political voices

Like other regular readers of Andrew Sullivan's web log, I was not surprised that he was happy about Sarah Palin's decision not to run for U.S. president in 2012. However, one aspect of his commentary ("Rejoice!", 10/5/2011) did surprise me. The puzzle is in the second sentence:

Our Three Year National Nightmare Is Over!

Palin talks to Mark Levin here (her voice is the deeper one).

Mark Levin is a radio talk show host, and Sullivan's link goes to a page on Levin's web site that not only includes the text of Palin's statement but also links to an mp3 file of a 15-minute segment of his show. My interest here, of course, is not in the politics but in the phonetics. Is it really true that Sarah Palin's voice is deeper (i.e. lower in pitch) than Mark Levin's?
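For readers who want a sense of what "lower in pitch" means operationally, here is a minimal sketch of one textbook way to estimate fundamental frequency: picking the lag with the largest autocorrelation. It runs on synthetic sine tones rather than on the recording, and the names `estimate_f0` and `tone`, the 75-400 Hz search range, and the 8000 Hz sample rate are all assumptions chosen for the example, not the analysis used in the full post. Real speech would also need voiced-frame detection and per-frame tracking.

```python
import math

def estimate_f0(samples, sample_rate, fmin=75.0, fmax=400.0):
    """Estimate fundamental frequency (Hz) from the autocorrelation peak."""
    # Only consider lags corresponding to the assumed pitch range.
    min_lag = int(sample_rate / fmax)
    max_lag = int(sample_rate / fmin)
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Unnormalized autocorrelation of the signal with itself at this lag.
        score = sum(samples[i] * samples[i + lag]
                    for i in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag

def tone(freq, sample_rate=8000, duration=0.05):
    """Synthetic sine wave standing in for a stretch of voiced speech."""
    n = int(sample_rate * duration)
    return [math.sin(2 * math.pi * freq * i / sample_rate) for i in range(n)]

# Two hypothetical "voices": the one with the lower estimate is the deeper one.
f0_a = estimate_f0(tone(200.0), 8000)
f0_b = estimate_f0(tone(100.0), 8000)
print(f0_a, f0_b)
```

Comparing two speakers would then mean comparing the distributions of such estimates over many voiced frames, not a single number per voice.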

Read the rest of this entry »

Comments (18)

Raising his voice

FDR had his weekly "Fireside chats", and in 1982 Ronald Reagan began the modern tradition of weekly presidential addresses, which U.S. presidents since then have maintained. I don't think that very many people actually listen to these things — no one that I've asked has ever admitted to regular consumption. But I've been collecting them since 2004, and listening to most of them, and a few days ago I noticed something.

Read the rest of this entry »

Comments (12)

MAGE pHTS

Comments (2)

Bookworm

When the Google Ngram Viewer came out, I tempered my enthusiastic praise with a complaint ("More on 'culturomics'", 12/17/2010):

The Science paper says that "Culturomics is the application of high-throughput data collection and analysis to the study of human culture". But as long as the historical text corpus itself remains behind a veil at Google Books, then "culturomics" will be restricted to a very small corner of that definition, unless and until the scholarly community can reproduce an open version of the underlying collection of historical texts.

I'm happy to say that the (non-Google part of the) Culturomics crew at the Harvard Cultural Observatory have taken a significant step in that direction, building on the work of the Open Library. You can check out what they've done with an alpha version of an online search interface at http://bookworm.culturomics.org/. But in my opinion, the online search interface, alpha or not, is the least important part of what's going on here.

Read the rest of this entry »

Comments (9)

Non-markovian yawp

Now that I've got morning internet access again, and the semester is more or less underway, it's time for another Breakfast Experiment™.

In "Markov's Heart of Darkness" (7/18/2011) and "Finch linguistics" (7/13/2011), we learned that Joseph Conrad's paragraphs are more markovian (at least in terms of their distribution of lengths) than zebra finch song bouts are. So I wondered about length distributions in some other sources: pause groups in conversational speech, and lines in Walt Whitman's poetry.

Read the rest of this entry »

Comments (5)

The visibilizing analyzer

Less than 50 years ago, this is what the future of data visualization looked like — H. Beam Piper, "Naudsonce", Analog 1962:

She had been using a visibilizing analyzer; in it, a sound was broken by a set of filters into frequency-groups, translated into light from dull red to violet paling into pure white. It photographed the light-pattern on high-speed film, automatically developed it, and then made a print-copy and projected the film in slow motion on a screen. When she pressed a button, a recorded voice said, "Fwoonk." An instant later, a pattern of vertical lines in various colors and lengths was projected on the screen.

This is in a future world with anti-gravity and faster-than-light travel.

Read the rest of this entry »

Comments (24)

Counting hierarchical kinds

Nick Collins, "Earth is home to 8.7 million species", The Telegraph 8/23/2011:

Previous guesses had put the total number of different species at anywhere between three million and 100 million, but a new calculation based on the way in which life forms are classified puts the estimate at the lower end of that scale.

The list of known species currently stands at about 1.2 million, but experts said that advances in technology meant that the remainder could be found and classified within the next century.

The study was undertaken by researchers from the Census of Marine Life, a ten-year project involving 2,700 scientists from more than 80 countries aimed at assessing the diversity of life in our seas and oceans which concluded in October 2010.

Read the rest of this entry »

Comments (17)