Archive for Computational linguistics

Please don't tell me about it

Those who can read German may be interested in some recent work by Gerd Fritz, of the Zentrum für Medien und Interaktivität at the Justus-Liebig-Universität Gießen, on "Texttypen im Language Log" ("Text types in Language Log"). Prof. Fritz tells me that this is "a brief summary of a longer paper to be published shortly".

Read the rest of this entry »

Comments (14)

Speech-based lie detection in Russia

Andrew E. Kramer, "Russian A.T.M. With an Ear for the Truth", NYT 6/8/2011:

Russia’s biggest retail bank is testing a machine that the old K.G.B. might have loved, an A.T.M. with a built-in lie detector intended to prevent consumer credit fraud.

Consumers with no previous relationship with the bank could talk to the machine to apply for a credit card, with no human intervention required on the bank’s end.

The machine scans a passport, records fingerprints and takes a three-dimensional scan for facial recognition. And it uses voice-analysis software to help assess whether the person is truthfully answering questions that include “Are you employed?” and “At this moment, do you have any other outstanding loans?”

The voice-analysis system was developed by the Speech Technology Center, a company whose other big clients include the Federal Security Service — the Russian domestic intelligence agency descended from the Soviet K.G.B.

Dmitri V. Dyrmovsky, director of the center’s Moscow offices, said the new system was designed in part by sampling Russian law enforcement databases of recorded voices of people found to be lying during police interrogations.

Read the rest of this entry »

Comments (12)

Remembering 9/11/2001

Like almost everyone else, I was happy to learn that Osama bin Laden is now an ex-terrorist; and I was mildly surprised to learn that he had been holed up in a large and luxurious compound located less than a mile by road from PMA Kakul, Pakistan's equivalent of West Point.

Read the rest of this entry »

Comments (38)

Phonemic diversity decays "out of Africa"?

A striking recent paper by Quentin Atkinson ("Phonemic Diversity Supports a Serial Founder Effect Model of Language Expansion from Africa", Science 4/15/2011) has been the subject of a lot of discussion recently. Its abstract:

Human genetic and phenotypic diversity declines with distance from Africa, as predicted by a serial founder effect in which successive population bottlenecks during range expansion progressively reduce diversity, underpinning support for an African origin of modern humans. Recent work suggests that a similar founder effect may operate on human culture and language. Here I show that the number of phonemes used in a global sample of 504 languages is also clinal and fits a serial founder–effect model of expansion from an inferred origin in Africa. This result, which is not explained by more recent demographic history, local language diversity, or statistical non-independence within language families, points to parallel mechanisms shaping genetic and linguistic diversity and supports an African origin of modern human languages.
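The serial founder-effect mechanism the abstract describes can be illustrated with a toy simulation (all parameters and the innovation rate here are invented for illustration; this is not Atkinson's actual method): each successive founding population carries away only part of the parent inventory, so inventory size tends to decay with the number of bottlenecks from the origin.

```python
import random

def serial_founder(inventory_size=60, bottlenecks=10, retain=0.9, seed=1):
    """Toy serial founder effect: at each expansion step, each phoneme
    survives with probability `retain`, and a brand-new phoneme is
    occasionally innovated. Returns inventory sizes along the chain."""
    rng = random.Random(seed)
    inv = set(range(inventory_size))
    sizes = [len(inv)]
    for step in range(bottlenecks):
        inv = {p for p in inv if rng.random() < retain}
        if rng.random() < 0.3:          # rare innovation
            inv.add(("new", step))
        sizes.append(len(inv))
    return sizes

sizes = serial_founder()
print(sizes)  # inventories shrink, on average, with each bottleneck
```

On this sketch, languages many bottlenecks from the origin end up with smaller inventories, which is the clinal pattern the paper tests against geographic distance from Africa.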

Read the rest of this entry »

Comments (68)

Word-order "universals" are lineage-specific?

This post is the promised short discussion of Michael Dunn, Simon J. Greenhill, Stephen C. Levinson & Russell D. Gray, "Evolved structure of language shows lineage-specific trends in word-order universals", Nature, published online 4/13/2011. [Update: free downloadable copies are available here.] As I noted earlier, I recommend the clear and accessible explanation that Simon Greenhill and Russell Gray have put on the Austronesian Database website in Auckland — in fact, if you haven't read that explanation, you should go do so now, because I'm not going to recapitulate what they did and their reasons for doing it, beyond quoting the conclusion:

These family-specific linkages suggest that language structure is not set by innate features of the cognitive language parser (as suggested by the generativists), or by some over-riding concern to "harmonize" word-order (as suggested by the statistical universalists). Instead language structure evolves by exploring alternative ways to construct coherent language systems. Languages are instead the product of cultural evolution, canalized by the systems that have evolved during diversification, so that future states lie in an evolutionary landscape with channels and basins of attraction that are specific to linguistic lineages.

And I should start by saying that I'm neither a syntactician nor a typologist. The charitable way to interpret this is that I don't start with any strong prejudices on the subject of syntactic typology. From this unbiased perspective, it seems to me that this paper adds a good idea that has been missing from most traditional work in syntactic typology, but at the same time, it misses two good ideas that have been extensively developed in the related area of historical syntax.

Read the rest of this entry »

Comments (96)

Oice-vay Earch-say

According to the Official Google Research Blog,

As you might know, Google Voice Search is available in more than two dozen languages and dialects, making it easy to perform Google searches just by speaking into your phone.

Today it is our pleasure to announce the launch of Pig Latin Voice Search! […]

To configure Pig Latin Voice Search in your Android phone just go to Settings, select “Voice input & output settings”, and then “Voice recognizer settings”. In the list of languages you’ll see Pig Latin. Just select it and you are ready to roll in the mud!

It also works on iPhone with the Google Search app. In the app, tap the Settings icon, then "Voice Search" and select Pig Latin.
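For the curious, the transformation in the post's title is easy to sketch. This toy converter implements one common variant of the rules, hyphenated as in the title ("Oice-vay Earch-say"); real Pig Latin conventions vary, and this is of course not Google's recognizer.

```python
def pig_latin(word):
    """One common variant of Pig Latin: move the initial consonant
    cluster to the end, add "ay", and hyphenate; vowel-initial words
    just get "-yay"."""
    vowels = "aeiou"
    i = next((k for k, ch in enumerate(word) if ch.lower() in vowels),
             len(word))
    if i == 0:
        return word + "-yay"
    out = word[i:] + "-" + word[:i].lower() + "ay"
    return out.capitalize() if word[0].isupper() else out

print(" ".join(pig_latin(w) for w in "Voice Search".split()))
# prints "Oice-vay Earch-say"
```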

Read the rest of this entry »

Comments (10)

Waseda talker

"This is cool", writes John Coleman — and it is. More later.

Comments (8)

Two Breakfast Experiments™: Literally

A couple of days ago, following up on Sunday's post about literally, Michael Ramscar sent me this fascinating graph:

What this shows us is a remarkably lawful relationship between the frequency of a verb and the probability of its being modified by literally, as revealed by counts from the 410-million-word COCA corpus. (The R² value means that a verb's frequency accounts for 88% of the variance in its chances of being modified by literally.)
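For readers who want to see where such a variance-explained figure comes from, the squared correlation can be computed directly. Here is a minimal sketch on invented (verb frequency, modification count) pairs; these are not the actual COCA numbers, and the direction of the toy relationship is illustrative only.

```python
import math

def r_squared(xs, ys):
    """Squared Pearson correlation: the share of variance in ys
    accounted for by a linear fit on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov ** 2 / (vx * vy)

# hypothetical (verb frequency, count modified by "literally") pairs
data = [(120000, 15), (45000, 9), (9000, 4), (2500, 2), (800, 1)]
log_freq = [math.log10(f) for f, _ in data]
log_rate = [math.log10(m / f) for f, m in data]

print(round(r_squared(log_freq, log_rate), 2))
```

An R² of 0.88, as in the graph, would mean the fitted line leaves only 12% of the variance in modification rate unexplained by frequency.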

Read the rest of this entry »

Comments (40)

Intellectual automation

Following up on the recent discussion of legal automation, I note that Paul Krugman has added a blog post ("Falling Demand for Brains?", 3/5/2011) and an Op-Ed column ("Degrees and Dollars", 3/6/2011), pushing an idea that he first suggested in a 1996 NYT Magazine piece ("White Collars Turn Blue", 9/29/1996), where he wrote as if from the perspective of 2096:

When something becomes abundant, it also becomes cheap. A world awash in information is one in which information has very little market value. In general, when the economy becomes extremely good at doing something, that activity becomes less, rather than more, important. Late-20th-century America was supremely efficient at growing food; that was why it had hardly any farmers. Late-21st-century America is supremely efficient at processing routine information; that is why traditional white-collar workers have virtually disappeared.

Read the rest of this entry »

Comments (18)

Legal automation

Over the past few days, we've discussed the possible relevance of corpus evidence in legal evaluations of ordinary-language meaning. Another (and socio-economically more important) legal application of computational linguistics is featured today in John Markoff's article, "Armies of Expensive Lawyers, Replaced by Cheaper Software", NYT 3/4/2011:

When five television studios became entangled in a Justice Department antitrust lawsuit against CBS, the cost was immense. As part of the obscure task of “discovery” — providing documents relevant to a lawsuit — the studios examined six million documents at a cost of more than $2.2 million, much of it to pay for a platoon of lawyers and paralegals who worked for months at high hourly rates.

But that was in 1978. Now, thanks to advances in artificial intelligence, “e-discovery” software can analyze documents in a fraction of the time for a fraction of the cost. In January, for example, Blackstone Discovery of Palo Alto, Calif., helped analyze 1.5 million documents for less than $100,000.

Read the rest of this entry »

Comments (12)

Now on The Atlantic: The corpus in the court

On Tuesday, the Supreme Court ruled in FCC v. AT&T that corporations are not entitled to a right of "personal privacy," even if corporations can be construed as "persons." In reaching this decision, the justices were aided by an amicus brief by Neal Goldfarb that presented corpus evidence on the types of nouns that the adjective "personal" typically modifies. Here on Language Log, Mark Liberman posted about the case on the day the decision was released, and now I have a piece for The Atlantic discussing the use of corpus analysis in the courtroom.
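The kind of evidence in that brief amounts to collocation counting: tally which nouns follow "personal" in a corpus and see what the adjective typically modifies. A minimal sketch on a few invented snippets (a real study would query a tagged corpus such as COCA, and would handle tagging and normalization far more carefully):

```python
import re
from collections import Counter

# hypothetical snippets standing in for corpus hits
corpus = """
her personal life was private . a personal letter arrived .
the personal privacy of citizens . his personal life again .
"""

# count the word immediately following "personal"
nouns_after_personal = Counter(re.findall(r"\bpersonal (\w+)", corpus))
print(nouns_after_personal.most_common(2))
```

If "personal" overwhelmingly modifies nouns describing individual human concerns, that distribution is itself an argument about what "personal privacy" ordinarily means.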

Read the rest of this entry »

Comments (2)

…with just a hint of Naive Bayes in the nose

Coco Krumme, "Velvety Chocolate With a Silky Ruby Finish. Pair With Shellfish.", Slate 2/23/2011:

Using descriptions of 3,000 bottles, ranging from $5 to $200 in price from an online aggregator of reviews, I first derived a weight for every word, based on the frequency with which it appeared on cheap versus expensive bottles. I then looked at the combination of words used for each bottle, and calculated the probability that the wine would fall into a given price range. The result was, essentially, a Bayesian classifier for wine. In the same way that a spam filter considers the combination of words in an e-mail to predict the legitimacy of the message, the classifier estimates the price of a bottle using its descriptors.

The analysis revealed, first off, that "cheap" and "expensive" words are used differently. Cheap words are more likely to be recycled, while words correlated with expensive wines tend to be in the tail of the distribution. That is, reviewers are more likely to create new vocabulary for top-end wines. The classifier also showed that it's possible to guess the price range of a wine based on the words in the review.
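The classifier Krumme describes can be sketched in a few lines. This is a standard multinomial Naive Bayes with add-one smoothing on a tiny invented training set; her actual data, features, and smoothing choices are not published here, so treat everything below as illustrative.

```python
import math
from collections import Counter

# hypothetical mini training set: (review words, price class)
reviews = [
    ("pleasant fruity soft easy".split(), "cheap"),
    ("juicy soft fruity good".split(), "cheap"),
    ("velvety tobacco cuvee elegant".split(), "expensive"),
    ("complex tannic old velvety".split(), "expensive"),
]

def train(reviews):
    """Tally per-class word counts and class counts."""
    word_counts = {"cheap": Counter(), "expensive": Counter()}
    class_counts = Counter()
    for words, label in reviews:
        word_counts[label].update(words)
        class_counts[label] += 1
    return word_counts, class_counts

def classify(words, word_counts, class_counts):
    """Multinomial Naive Bayes with add-one smoothing."""
    vocab = set().union(*word_counts.values())
    total = sum(class_counts.values())
    best, best_lp = None, -math.inf
    for label in class_counts:
        lp = math.log(class_counts[label] / total)       # class prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in words:
            lp += math.log((word_counts[label][w] + 1) / denom)
        if lp > best_lp:
            best, best_lp = label, lp
    return best

wc, cc = train(reviews)
print(classify("soft fruity".split(), wc, cc))      # prints "cheap"
print(classify("velvety complex".split(), wc, cc))  # prints "expensive"
```

The "cheap words are recycled, expensive words are in the tail" observation falls out of the same counts: rare descriptors get most of their probability mass from the expensive class, so a single unusual word can move a review sharply up-market.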

Read the rest of this entry »

Comments (15)

Could Watson parse a snowclone?

Today on The Atlantic I break down Watson's big win over the humans in the Jeopardy!/IBM challenge. (See previous Language Log coverage here and here.) I was particularly struck by the snowclone that Ken Jennings left on his Final Jeopardy response card last night: "I, for one, welcome our new computer overlords." I use that offhand comment as a jumping-off point to dismantle some of the hype about Watson's purported ability to "understand" natural language.

Read the rest of this entry »

Comments (32)