Archive for Computational linguistics

The Dowdbot challenge

A few weeks ago, Maureen Dowd fantasized about a secret Google team trying to simulate her in software ("Dinosaur at the Gate", 4/14/2009):

When I ask [Eric Schmidt] if human editorial judgment still matters, he tries to reassure me: “We learned in working with newspapers that this balance between the newspaper writers and their editors is more subtle than we thought. It’s not reproducible by computers very easily.”

I feel better for a minute, until I realize that the only reason he knew that I wasn’t so easily replaceable is that Google had been looking into how to replace me.

There's a lot of far-out stuff over at Google Labs. But I'd be surprised to find that designing an army of Robot Maureens is in the mix, even though digital Dowd design poses some interesting challenges.

Industrial bullshitters censor linguists

A bullshit lie detector company run by a charlatan has managed to semi-successfully censor a peer-reviewed academic article. And I don't like it one bit. But first, some background, and then we'll get to the censorship stuff.

Five years ago I wrote a Language Log post entitled "BS conditional semantics and the Pinocchio effect" about the nonsense spouted by a lie detection company, Nemesysco. I was disturbed by the marketing literature of the company, which suggested a 98% success rate in detecting evil intent of airline passengers, and included crap like this:

The LVA uses a patented and unique technology to detect "Brain activity finger prints" using the voice as a "medium" to the brain and analyzes the complete emotional structure of your subject. Using wide range spectrum analysis and micro-changes in the speech waveform itself (not micro tremors!) we can learn about any anomaly in the brain activity, and furthermore, classify it accordingly. Stress ("fight or flight" paradigm) is only a small part of this emotional structure

The 98% figure, as I pointed out, and as Mark Liberman made even clearer in a follow-up post, is meaningless. There is no type of lie detector in existence whose performance can reasonably be compared to the performance of fingerprinting. It is meaningless to talk about someone's "complete emotional structure", and there is no interesting sense in which any current technology can analyze it. It is not the case that looking at speech will provide information about "any anomaly in the brain activity": at most it will tell you about some anomalies. Oh, the delicious irony: a lie detector company that engages in wanton deception.

Good is dead

Irving John "Jack" Good, who died on April 5 at the age of 92, is best known to linguists as the author of a paper on mathematical ecology. The paper is I.J. Good, "The Population Frequencies of Species and the Estimation of Population Parameters", Biometrika 40(3-4) 237-264 (1953), and its abstract reads as follows:

A random sample is drawn from a population of animals of various species. (The theory may also be applied to studies of literary vocabulary, for example.) If a particular species is represented r times in the sample of size N, then r/N is not a good estimate of the population frequency, p, when r is small. Methods are given for estimating p, assuming virtually nothing about the underlying population. The estimates are expressed in terms of smoothed values of the numbers n_r (r = 1, 2, 3, …), where n_r is the number of distinct species that are each represented r times in the sample. (n_r may be described as `the frequency of the frequency r'.) Turing is acknowledged for the most interesting formula in this part of the work. An estimate of the proportion of the population represented by the species occurring in the sample is an immediate corollary. Estimates are made of measures of heterogeneity of the population, including Yule's 'characteristic' and Shannon's 'entropy'. Methods are then discussed that do depend on assumptions about the underlying population. It is here that most work has been done by other writers. It is pointed out that a hypothesis can give a good fit to the numbers n_r but can give quite the wrong value for Yule's characteristic. An example of this is Fisher's fit to some data of Williams's on Macrolepidoptera.
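
To make the "frequency of frequencies" idea concrete, here is a minimal Python sketch of the basic, unsmoothed Turing estimate: a species seen r times gets the adjusted count r* = (r + 1) n_(r+1) / n_r, and the share of the population belonging to unseen species is estimated as n_1 / N. The function names and the toy sample are made up for illustration, and the sketch deliberately skips the smoothing of the n_r values that the abstract calls for, so it should not be read as Good's full method.

```python
from collections import Counter

def frequency_of_frequencies(sample):
    """Return (N, n), where n[r] is the number of distinct species seen exactly r times."""
    species_counts = Counter(sample)          # species -> r
    n = Counter(species_counts.values())      # r -> n_r
    return len(sample), n

def turing_estimates(sample):
    """Unsmoothed Turing estimates (hypothetical helper, no smoothing of n_r).

    Returns (unseen_mass, adjusted), where unseen_mass = n_1 / N and
    adjusted[r] = (r + 1) * n_(r+1) / n_r, or None when n_(r+1) is zero
    and the raw formula breaks down.
    """
    N, n = frequency_of_frequencies(sample)
    unseen_mass = n[1] / N
    adjusted = {}
    for r in sorted(n):
        if n[r + 1] > 0:
            adjusted[r] = (r + 1) * n[r + 1] / n[r]
        else:
            adjusted[r] = None  # this gap is why smoothed n_r values are needed
    return unseen_mass, adjusted

# Toy sample: one species seen 3 times, two seen twice, three seen once.
sample = "a a a b b c c d e f".split()
unseen_mass, adjusted = turing_estimates(sample)
print(unseen_mass)  # estimated probability that the next animal is a new species
print(adjusted)     # adjusted counts r*; divide by N to get probability estimates
```

The None entry for the largest r shows where the raw formula fails for lack of data, which is the practical reason for working with smoothed n_r.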

Conditional entropy and the Indus Script

A recent publication (Rajesh P. N. Rao, Nisha Yadav, Mayank N. Vahia, Hrishikesh Joglekar, R. Adhikari, and Iravatham Mahadevan, "Entropic Evidence for Linguistic Structure in the Indus Script", Science, published online 23 April 2009; also supporting online material) claims a breakthrough in understanding the nature of the symbols found in inscriptions from the Indus Valley Civilization.

Two major types of nonlinguistic systems are those that do not exhibit much sequential structure (“Type 1” systems) and those that follow rigid sequential order (“Type 2” systems). […] Linguistic systems tend to fall somewhere between these two extremes […] This flexibility can be quantified statistically using conditional entropy, which measures the amount of randomness in the choice of a token given a preceding token. […]

We computed the conditional entropies of five types of known natural linguistic systems […], four types of nonlinguistic systems […], and an artificially-created linguistic system […]. We compared these conditional entropies with the conditional entropy of Indus inscriptions from a well-known concordance of Indus texts.

We found that the conditional entropy of Indus inscriptions closely matches those of linguistic systems and remains far from nonlinguistic systems throughout the entire range of token set sizes.
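
For readers who want to see what is being measured, here is a minimal Python sketch of the conditional entropy of a token given the preceding token, H(Y|X) = -Σ p(x, y) log2 p(y|x), estimated by plain maximum likelihood from bigram counts. The function name and the toy sequences are hypothetical, and the paper's actual estimation procedure may well differ; this only illustrates the quantity itself.

```python
import math
from collections import Counter

def conditional_entropy(sequences):
    """Entropy (in bits) of a token given the preceding token.

    Maximum-likelihood estimate from bigram counts; no smoothing,
    which is an assumption of this sketch.
    """
    bigrams = Counter()
    for seq in sequences:
        bigrams.update(zip(seq, seq[1:]))      # (previous token, token) pairs
    total = sum(bigrams.values())
    if total == 0:
        return 0.0
    preceding = Counter()                      # how often each token occurs as a predecessor
    for (x, _), count in bigrams.items():
        preceding[x] += count
    h = 0.0
    for (x, y), count in bigrams.items():
        p_xy = count / total                   # joint probability p(x, y)
        p_y_given_x = count / preceding[x]     # conditional probability p(y | x)
        h -= p_xy * math.log2(p_y_given_x)
    return h

# A rigidly ordered toy sequence: each token fully determines the next.
rigid_h = conditional_entropy([list("ABCABCABC")])
# A toy sequence in which either token is equally likely after A or B.
loose_h = conditional_entropy([list("AABBA")])
print(rigid_h, loose_h)
```

The rigid sequence comes out at zero bits and the loose one at one bit, the maximum for two tokens. These correspond to the two extremes described in the quoted passage.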

In defense of spell-checking

In a post a few days ago ("Why you shouldn't use spellcheckers", 4/7/2009), Bill Poser argued that "if English had a decent writing system there would be no use for [spellchecking] software". I'm no defender of our current writing system — it makes life much harder than it should be for writers and readers alike, especially in the early stages of learning. But I think that Bill is overselling the potential benefits of reform.

In defense of Amazon's Mechanical Turk

I can find no better description of Amazon's Mechanical Turk than in the "description" tag at the site itself:

The online market place for work. We give businesses and developers access to an on-demand scalable workforce. Workers can work at home and make money by choosing from thousands of tasks and jobs.

This is followed by a "keywords" meta tag:

make money, make money at home, make money from home, make money on the internet, make extra money, make money …

This makes the site sound a bit like the next stop on Dave Chappelle's tour of his imagined Internet as physical place, and indeed it does have its seamy side. But I come to defend Mechanical Turk as a useful tool for linguistic research: a quick and inexpensive way to gather data and conduct simple experiments.

Term-mining Obama's inaugural address

TerMine is a system for recognizing multiword terms. The algorithm was originally presented in Katerina Frantzi, Sophia Ananiadou, and Hideki Mima, Automatic recognition of multi-word terms, International Journal of Digital Libraries 3(2): 117-132, 2000. You can try it out on a site at the National Centre for Text Mining (NaCTeM) at the University of Manchester in the UK, where they have a web demonstration that will analyze short (<2 MB) texts or URLs for you.

As you'll find if you try, the results are not always perfect, but I think that the algorithm is remarkably good at guessing multi-word terms from small amounts of text. For example, if I try it out on a page (~2000 words) of lecture notes about "Statistical estimation for Large Numbers of Rare Events", it comes up with a large number of sensible things like good-turing estimate, maximum likelihood, population frequency, belief tax, and negative binomial distribution — along with a few clunkers like cnew = cnew./token and some other fragments of Matlab code. (Maybe it was unfair to give it a sample that included such things…)

Jock McNaught recently reminded me of this service by trying it out on President Obama's inaugural address.

New results on Austronesian linguistic phylogeny

Published today: R. D. Gray, A. J. Drummond, and S. J. Greenhill, "Language Phylogenies Reveal Expansion Pulses and Pauses in Pacific Settlement", Science 323(5913):479-483, 23 January 2009. The abstract:

Debates about human prehistory often center on the role that population expansions play in shaping biological and cultural diversity. Hypotheses on the origin of the Austronesian settlers of the Pacific are divided between a recent "pulse-pause" expansion from Taiwan and an older "slow-boat" diffusion from Wallacea. We used lexical data and Bayesian phylogenetic methods to construct a phylogeny of 400 languages. In agreement with the pulse-pause scenario, the language trees place the Austronesian origin in Taiwan approximately 5230 years ago and reveal a series of settlement pauses and expansion pulses linked to technological and social innovations. These results are robust to assumptions about the rooting and calibration of the trees and demonstrate the combined power of linguistic scholarship, database technologies, and computational phylogenetic methods for resolving questions about human prehistory.

An unusually clear explanation of the project, along with a great deal of background information, is available on the web here.

Inaugural anticipation

There's an extraordinary amount of anticipation about Barack Obama's inaugural address, due in a few hours. A small sample of the anticipatory commentary: "The speech"; "'The Speech': An Experts' Guide"; "Inaugural Words: 1789 to the Present"; "Obama's Inaugural Address: Great Expectations"; and literally thousands of other articles. We've contributed our mite, in the form of Geoff Pullum's post "Presidential inaugurals: the form and the content", 1/15/2009 (though this belongs to a somewhat smaller body of work, the meta-anticipatory commentary). No doubt after the event there'll be an even greater flood of discussion, meta- and otherwise.

Messing around

The latest xkcd:

The mouseover title: "And the ten minutes striking up a conversation with that strange kid in homeroom sometimes matters more than every other part of high school combined."

Reproducible research

For the last few days, I've been in Düsseldorf for the Berlin 6 Open Access conference, where I organized a session on "Open Data and Reproducible Research". Here's the abstract:

In many scientific and technical fields, research is increasingly based on published data. Researchers also often publish detailed instructions or even executable recipes for reproducing their results. Combined with inexpensive networked computing and mass storage, these trends can radically accelerate the pace of research, by lowering barriers to entry and decreasing the time required to reproduce and extend innovations. These changes may also modify the balance between data collection and data analysis, and between experimental and theoretical work.

Nevertheless, these potentially revolutionary developments are mostly happening below the surface, with uneven progress across disciplines, and little general discussion of how to guide or react to the process. The goal of this panel is to publicize the experience of several communities who have up to two decades of experience with what Jon Claerbout has termed "reproducible research", and to begin a general discussion of the broader implications for scientific, technical and scholarly publication.

Androids, electric sheep, plastic tongues…

For your edification and amusement: An articulator-based, rather than acoustic, speech synthesis device.

The original context, here on Botjunkie, says that the ultimate goal is a voice compression system for cellphones. I'm a bit confused about this — I *think* that the idea is that representing speech articulatorily will be less data-intensive than representing it acoustically, but that seems wildly improbable to me.

Here's the description of the system on the Takanishi Labs page. Amazingly, they even have a rubber set of vocal cords at work! (Scroll down to see them in action.)

Interactive visualization for computational linguistics

I didn't make it to ACL2008 back in June, but Ani Nenkova, who was the tutorials chair, recently sent me a link to some really terrific slides from a tutorial by Christopher Collins, Gerald Penn and Sheelagh Carpendale on "Interactive Visualization for Computational Linguistics". (Warning: it's a 13.8 MB .pdf file).
