Archive for Computational linguistics

In defense of spell-checking

In a post a few days ago ("Why you shouldn't use spellcheckers", 4/7/2009), Bill Poser argued that "if English had a decent writing system there would be no use for [spellchecking] software". I'm no defender of our current writing system — it makes life much harder than it should be for writers and readers alike, especially in the early stages of learning. But I think that Bill is overselling the potential benefits of reform.

Comments (45)

In defense of Amazon's Mechanical Turk

I can find no better description of Amazon's Mechanical Turk than in the "description" tag at the site itself:

The online market place for work. We give businesses and developers access to an on-demand scalable workforce. Workers can work at home and make money by choosing from thousands of tasks and jobs.

This is followed by a "keywords" meta tag:

make money, make money at home, make money from home, make money on the internet, make extra money, make money …

This makes the site sound a bit like the next stop on Dave Chappelle's tour of his imagined Internet as physical place, and indeed it does have its seamy side. But I come to defend Mechanical Turk as a useful tool for linguistic research: a quick and inexpensive way to gather data and conduct simple experiments.

Comments (11)

Term-mining Obama's inaugural address

TerMine is a system for recognizing multiword terms. The algorithm was originally presented in Katerina Frantzi, Sophia Ananiadou, and Hideki Mima, Automatic recognition of multi-word terms, International Journal of Digital Libraries 3(2): 117-132, 2000. You can try it out on a site at the National Centre for Text Mining (NaCTeM) at the University of Manchester in the UK, where they have a web demonstration that will analyze short (<2 MB) texts or URLs for you.

As you'll find if you try, the results are not always perfect, but I think that the algorithm is remarkably good at guessing multi-word terms from small amounts of text. For example, if I try it out on a page (~2000 words) of lecture notes about "Statistical estimation for Large Numbers of Rare Events", it comes up with a large number of sensible things like good-turing estimate, maximum likelihood, population frequency, belief tax, and negative binomial distribution — along with a few clunkers like cnew = cnew./token and some other fragments of Matlab code. (Maybe it was unfair to give it a sample that included such things…)
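For readers curious about what sits behind those guesses, here is a minimal sketch of the C-value score from the Frantzi, Ananiadou and Mima paper, in Python. It is not TerMine's implementation: the real system also applies a part-of-speech filter and context weighting, and the function names and the toy frequencies below are invented purely for illustration. The idea is that a candidate's frequency is discounted by the average frequency of the longer candidates that contain it, then weighted by the log of its length in words.

```python
import math


def contains(longer, shorter):
    """True if `shorter` occurs as a contiguous word sequence inside `longer`."""
    n, m = len(longer), len(shorter)
    return m < n and any(longer[i:i + m] == shorter for i in range(n - m + 1))


def c_value(freq):
    """Score each candidate term (a tuple of words) with the C-value formula.

    `freq` maps candidate terms to their total corpus frequency.
    """
    scores = {}
    for term, f in freq.items():
        # Frequencies of longer candidates in which this term is nested.
        nesting = [freq[other] for other in freq if contains(other, term)]
        if nesting:
            # Discount by the average frequency of the containing terms.
            f = f - sum(nesting) / len(nesting)
        # log2 of the length in words: single words score 0 by construction.
        scores[term] = math.log2(len(term)) * f
    return scores


# Hypothetical counts, not taken from the lecture notes mentioned above.
candidates = {
    ("negative", "binomial", "distribution"): 4,
    ("binomial", "distribution"): 6,
    ("maximum", "likelihood"): 5,
}

scores = c_value(candidates)
for term, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(" ".join(term), round(score, 2))
```

With these made-up counts, "binomial distribution" is penalized because most of its occurrences are inside "negative binomial distribution", which is the behavior that lets the method prefer the longer term.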

Jock McNaught recently reminded me of this service by trying it out on President Obama's inaugural address.

Comments (2)

New results on Austronesian linguistic phylogeny

Published today: R. D. Gray, A. J. Drummond, and S. J. Greenhill, "Language Phylogenies Reveal Expansion Pulses and Pauses in Pacific Settlement", Science 323(5913):479-483, 23 January 2009. The abstract:

Debates about human prehistory often center on the role that population expansions play in shaping biological and cultural diversity. Hypotheses on the origin of the Austronesian settlers of the Pacific are divided between a recent "pulse-pause" expansion from Taiwan and an older "slow-boat" diffusion from Wallacea. We used lexical data and Bayesian phylogenetic methods to construct a phylogeny of 400 languages. In agreement with the pulse-pause scenario, the language trees place the Austronesian origin in Taiwan approximately 5230 years ago and reveal a series of settlement pauses and expansion pulses linked to technological and social innovations. These results are robust to assumptions about the rooting and calibration of the trees and demonstrate the combined power of linguistic scholarship, database technologies, and computational phylogenetic methods for resolving questions about human prehistory.

An unusually clear explanation of the project, along with a great deal of background information, is available on the web here.

Comments (13)

Inaugural anticipation

There's an extraordinary amount of anticipation about Barack Obama's inaugural address, due in a few hours. A small sample of the anticipatory commentary: "The speech"; "'The Speech': An Experts' Guide"; "Inaugural Words: 1789 to the Present"; "Obama's Inaugural Address: Great Expectations"; and literally thousands of other articles. We've contributed our mite, in the form of Geoff Pullum's post "Presidential inaugurals: the form and the content", 1/15/2009 (though this belongs to a somewhat smaller body of work, the meta-anticipatory commentary). No doubt after the event there'll be an even greater flood of discussion, meta- and otherwise.

Comments (2)

Messing around

The latest xkcd:

The mouseover title: "And the ten minutes striking up a conversation with that strange kid in homeroom sometimes matters more than every other part of high school combined."

Comments (10)

Reproducible research

For the last few days, I've been in Düsseldorf for the Berlin 6 Open Access conference, where I organized a session on "Open Data and Reproducible Research". Here's the abstract:

In many scientific and technical fields, research is increasingly based on published data. Researchers also often publish detailed instructions or even executable recipes for reproducing their results. Combined with inexpensive networked computing and mass storage, these trends can radically accelerate the pace of research, by lowering barriers to entry and decreasing the time required to reproduce and extend innovations. These changes may also modify the balance between data collection and data analysis, and between experimental and theoretical work.

Nevertheless, these potentially revolutionary developments are mostly happening below the surface, with uneven progress across disciplines, and little general discussion of how to guide or react to the process. The goal of this panel is to publicize the experience of several communities who have up to two decades of experience with what Jon Claerbout has termed "reproducible research", and to begin a general discussion of the broader implications for scientific, technical and scholarly publication.

Comments (11)

Androids, electric sheep, plastic tongues…

For your edification and amusement: An articulator-based, rather than acoustic, speech synthesis device.

The original context, here on Botjunkie, says that the ultimate goal is a voice compression system for cellphones. I'm a bit confused about this — I *think* that the idea is that representing speech articulatorily will be less data-intensive than representing it acoustically, but that seems wildly improbable to me.

Here's the description of the system on the Takanishi Labs page. Amazingly, they even have a rubber set of vocal cords at work! (Scroll down to see them in action.)

Comments (30)

Interactive visualization for computational linguistics

I didn't make it to ACL2008 back in June, but Ani Nenkova, who was the tutorials chair, recently sent me a link to some really terrific slides from a tutorial by Christopher Collins, Gerald Penn and Sheelagh Carpendale on "Interactive Visualization for Computational Linguistics". (Warning: it's a 13.8 MB .pdf file).

Comments (2)