Archive for Computational linguistics

The case of the missing spamularity

A recent diary post by Charlie Stross ("It's made out of meat", 12/22/2010) poses a striking paradox. Or rather, he makes a prediction about a process whose trajectory, as so far observable, seems paradoxical to me.

Read the rest of this entry »

Comments (35)

Word lens

Competing with Culturomics for meme room today is Word Lens, which has a great YouTube ad:

Read the rest of this entry »

Comments (26)

Humanities research with the Google Books corpus

In Science yesterday, there was an article called "Quantitative analysis of culture using millions of digitized books" [subscription required] by at least twelve authors (eleven individuals, plus "the Google Books team"), which reports on some exercises in quantitative research performed on what is by far the largest corpus ever assembled for humanities and social science research. Culled from the Google Books collection, it contains more than 5 million books published between 1800 and 2000 (at a rough estimate, 4 percent of all the books ever published), of which two-thirds are in English and the others distributed among French, German, Spanish, Chinese, Russian, and Hebrew. (The English corpus alone contains some 360 billion words, dwarfing better-structured data collections like the corpora of historical and contemporary American English at BYU, which top out at a paltry 400 million words each.)

I have an article on the project appearing in today's Chronicle of Higher Education, which I'll link to here, and in later posts Ben or Mark will probably be addressing some of the particular studies, like the estimates of English vocabulary size, as well as the wider implications of the enterprise. For now, some highlights:

Read the rest of this entry »

Comments (58)

"Utterly noxious retail" as Search Engine Optimization

David Segal, "A bully finds a pulpit on the web", NYT 11/26/2010:

Today, when reading the dozens of comments [at getsatisfaction.com] about DecorMyEyes, it is hard to decide which one conveys the most outrage. It is easy, though, to choose the most outrageous. It was written by Mr. Russo/Bolds/Borker himself.

“Hello, My name is Stanley with DecorMyEyes.com,” the post began. “I just wanted to let you guys know that the more replies you people post, the more business and the more hits and sales I get. My goal is NEGATIVE advertisement.”

Read the rest of this entry »

Comments (8)

The robot army

Randall Stross, "When the Software Is the Sportswriter", NYT 11/27/2010:

ONLY human writers can distill a heap of sports statistics into a compelling story. Or so we human writers like to think.

StatSheet, a Durham, N.C., company that serves up sports statistics in monster-size portions, thinks otherwise. The company, with nine employees, is working to endow software with the ability to turn game statistics into articles about college basketball games.

Read the rest of this entry »

Comments (18)

Speech-based quantification of Parkinson's Disease

Earlier this year, I discussed an interesting paper from a poster session at ICASSP 2010 ("Clinical applications of speech technology", 3/18/2010), which used an automated evaluation of dysphonia measures in short speech samples to match clinicians' evaluations of Parkinson's Disease severity.

That work, extended and improved, has been published as Athanasios Tsanas et al., "Nonlinear speech analysis algorithms mapped to a standard metric achieve clinically useful quantification of average Parkinson's disease symptom severity", J. Roy. Soc. Interface, 11/17/2010.

Read the rest of this entry »

Comments (6)

ASR Elevator

This is funny, though unfair:

Read the rest of this entry »

Comments (39)

Statistical MT – with meter and rhyme

I promised in an earlier post to report on some of the many interesting presentations here at InterSpeech 2010. But various other obligations and opportunities have cut into my blogging time, and so for now, I'll just point you to the slides for my own presentation here: Jiahong Yuan and Mark Liberman, "F0 Declination in English and Mandarin Broadcast News Speech".

I still hope to blog about some of the other interesting things I've learned here, but it's already time for me to head out on the next leg of my journey. Worse, I've already got a list of things to blog about from the next conference where I'm co-author on a presentation, EMNLP 2010 — which hasn't even started yet. At the top of that list is Dmitriy Genzel, Jakob Uszkoreit and Franz Och, "'Poetic' Statistical Machine Translation: Rhyme and Meter".

Read the rest of this entry »

Comments (6)

Sproat asks the question

The "Last Words" segment of the latest issue of Computational Linguistics is by Richard Sproat: "Ancient Symbols, Computational Linguistics, and the Reviewing Practices of the General Science Journals". Richard reviews and extends the analysis (partly contributed by him and by Cosma Shalizi) in "Conditional entropy and the Indus Script" (4/26/2009) and "Pictish writing?" (4/2/2010), and poses the question that I was too polite to ask:

How is it that papers that are so trivially and demonstrably wrong get published in journals such as Science or the Proceedings of the Royal Society?

Read the rest of this entry »

Comments (24)

A synthetic singing president?

A couple of days ago, Gary Marcus told me about the Beatles Complete on Ukulele project, and introduced me to its creator, David Barratt.

Gary got involved because he's working on a book about "learning to become musical at the age of 40", and so he's joining a roster of performers that includes the Fort Greene Childrens Choir (Age 7 and Under Section), Samantha Fox, and many others (82 so far), recording voice-and-ukulele versions of all 185 songs in the Beatles catalog. Gary is of course singing With a Little Help from My Friends (because, he explains, "otherwise I couldn't carry a tune in a bucket"), and his contribution is scheduled to be released on July 19, 2011.

So how does Language Log come into this? Well, David wants to recruit Barack Obama to sing Let it Be, and Gary thought that I could help. In turn, I believe that YOU can help.

Read the rest of this entry »

Comments (11)

The long tail of religious studies?

Google Books isn't the only outfit that sometimes has trouble with metadata. I happened to notice this morning that Oxford University Press has classified Herbert A. Simon's "On a class of skew distribution functions" (Biometrika 43:425-440, 1955) as "Religious Studies..Death":


Read the rest of this entry »

Comments (22)

South Hadley & surrounded by trees

Reader FM wrote to draw our attention to what he thought might be "another example of poor machine reading, à la Embuggerance and Feisty". FM is referring back to this epic error, where a line in a review listing interesting vocabulary items

embuggerance, elevate, feisty, holistic,

somehow attracted the attention of an algorithm for un-inverting comma-separated lists of author names, resulting in the hypothetical author pair "Elevate Embuggerance and Holistic Feisty", and a striking citation in Google Scholar.
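To see how such a mis-parse could arise, here is a minimal sketch in Python. It is a guess at the general kind of heuristic involved, not Google Scholar's actual code: the function name `uninvert_authors` and the pairing rule are assumptions made purely for illustration. It naively treats any comma-separated list as alternating "Last, First" items and flips each pair.

```python
def uninvert_authors(text):
    """Naively read a comma-separated list as 'Last, First, Last, First, ...'.

    Hypothetical heuristic for illustration only; it does no checking
    that the items are actually names.
    """
    # Split on commas and drop empty pieces (e.g. from a trailing comma).
    tokens = [t.strip() for t in text.split(",") if t.strip()]
    # Pair the items as (last, first) and flip each pair into "First Last".
    names = [
        f"{first.capitalize()} {last.capitalize()}"
        for last, first in zip(tokens[0::2], tokens[1::2])
    ]
    return " and ".join(names)


line = "embuggerance, elevate, feisty, holistic,"
print(uninvert_authors(line))  # prints "Elevate Embuggerance and Holistic Feisty"
```

A heuristic like this works on a genuine inverted name such as "Simon, Herbert", which is presumably why it would be applied at all; the trouble is that nothing in it distinguishes a byline from a vocabulary list.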

Read the rest of this entry »

Comments (20)

Dialect geography and social networks

There are a variety of factors that are believed to be involved in the establishment and maintenance of the language varieties that are commonly called "dialects". Among these are substrate or contact influences, patterns of initial settlement, group identity, and patterns of communication. Some of these factors, such as settlement patterns, mainly re-distribute existing variation in geographical and social space. But others, such as patterns of communication, affect the way that innovations arise and spread.

The rise of internet-based social media offers new pictures of such patterns of communication, and a few months ago, I came across an interesting analysis of the geography of Facebook friend links: Pete Warden, "How to split up the U.S.", 2/6/2010.

Read the rest of this entry »

Comments (26)