Archive for Computational linguistics

Sexual orders

In the comments on "The order of ancestors" (12/24/2009), there was some discussion about the possible role of gender bias in determining the preference for orders like "mothers and fathers" over "fathers and mothers". This discussion faced a basic empirical problem: there were more plausibly-relevant principles (a long list of apparent semantic and phonological preferences) than there were facts to explain.

In this post, I'll review in more depth the evidence about the preferred orders of English binomial expressions for gendered categories of humans. This review will leave us in the same logical impasse. Then I'll tell you about the clever solution found by Saundra Wright, Jennifer Hay and Tessa Bent in their paper "Ladies first? Phonology, frequency, and the naming conspiracy", Linguistics 43(3): 531–561, 2005.

Read the rest of this entry »

Comments (25)

Quotes with and without quotes

Chris is puzzled by these Google counts for famous quotations, searched with and without quotation marks around the search string:

Gone With The Wind
about 797,000 for "Frankly, my dear, I don't give a damn!"
about 163,000 for Frankly, my dear, I don't give a damn!

Taxi Driver
about 17,500,000 for "You talkin' to me?"
about 7,450,000 for You talkin' to me?

As he explains: "I discovered something weird. In some cases, the more restrictive, double-quoted query returned more hits than the unquoted query. A lot more."

Read the rest of this entry »

Comments (6)

Literary Alzheimer's

One of the items featured in the New York Times Magazine's "Ninth Annual Year in Ideas", under the heading "Literary Alzheimer's", is a summary of Ian Lancashire and Graeme Hirst, "Vocabulary Changes in Agatha Christie’s Mysteries as an Indication of Dementia: A Case Study", presented at the 19th Annual Rotman Research Institute Conference, Cognitive Aging: Research and Practice, 8–10 March 2009, Toronto.

Read the rest of this entry »

Comments (17)

Rhymes

Andrew Gelman is justifiably impressed by Laura Wattenberg's ruminations on rhyme (warning: the second link triggers one of those insufferable ads that starts playing loud sounds as soon as the page comes up, so mute your audio before clicking). Ms. Wattenberg without the musical background:

Here's a little pet peeve of mine: nothing rhymes with orange. You've heard that before, right? Orange is famous for its rhymelessness. There's even a comic strip called "Rhymes with Orange." Fine then, let me ask you something. What the heck rhymes with purple?

If you stop and think about it, you'll find that English is jam-packed with rhymeless common words. What rhymes with empty, or olive, or silver, or circle? You can even find plenty of one-syllable words like wolf, bulb, and beige. Yet orange somehow became notorious for its rhymelessness, with the curious result that people now assume its status is unique.

Andrew wrote to ask about this, and so I did a bit of looking around for information about the statistics of rhyme.
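For readers who want to poke at this themselves, here is a minimal sketch of one way to count rhymeless words. It is not necessarily the method used in the rest of the post. It assumes a pronouncing dictionary in ARPAbet notation with stress digits (the CMU Pronouncing Dictionary uses this format), and it treats two words as rhyming when they share everything from the last primary-stressed vowel onward, which is a simplification. The `TOY_LEXICON` entries are hand-entered for illustration, so "rhymeless" here only means "rhymeless within this tiny list".

```python
from collections import defaultdict

# Hypothetical hand-entered lexicon in ARPAbet; a real study would load a
# full pronouncing dictionary instead of these eight words.
TOY_LEXICON = {
    "orange": "AO1 R AH0 N JH",
    "purple": "P ER1 P AH0 L",
    "silver": "S IH1 L V ER0",
    "circle": "S ER1 K AH0 L",
    "cat": "K AE1 T",
    "hat": "HH AE1 T",
    "money": "M AH1 N IY0",
    "honey": "HH AH1 N IY0",
}


def rhyme_key(pronunciation):
    """Return the phones from the last primary-stressed vowel to the end."""
    phones = pronunciation.split()
    for i in range(len(phones) - 1, -1, -1):
        if phones[i].endswith("1"):  # "1" marks primary stress
            return tuple(phones[i:])
    return tuple(phones)  # no primary stress marked: use the whole word


def rhymeless_words(lexicon):
    """Return the words whose rhyme key is shared with no other word."""
    groups = defaultdict(list)
    for word, pronunciation in lexicon.items():
        groups[rhyme_key(pronunciation)].append(word)
    return sorted(w for words in groups.values() if len(words) == 1 for w in words)


print(rhymeless_words(TOY_LEXICON))  # ['circle', 'orange', 'purple', 'silver']
```

Run over a full dictionary and weighted by word frequency, the same grouping would give a rough estimate of how common rhymeless words really are.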

Read the rest of this entry »

Comments (125)

Supreme Court open infrastructure

Yesterday and today, I'm at Washington University in St. Louis at a meeting on open infrastructure for studies of the U.S. Supreme Court, organized by Andrew Martin at the Center for Empirical Research in the Law. (That sentence sets some kind of local record for prepositional phrase density, but a couple of quick attempts to fix it made things worse. Just to start with, you've got CERL, which has two, and WUSL, which adds one more…)

Read the rest of this entry »

Comments (30)

Body loses Supreme Court appeal

This morning, I appealed the somebody-vs.-someone story to the Supreme Court of the United States. The decision came quickly — details are below.

Read the rest of this entry »

Comments (21)

Authors of the month

A few weeks ago, we featured Elevate Embuggerance and Holistic Feisty, authors (according to Google Scholar) of The Linguistics of Laughter:

Now, thanks to research by Steven Landsburg and Aaron Mandel, we're proud to introduce you to the prolific writer "Ass Meat Research Group", who is listed at amazon.com as the author of 88 books:

Read the rest of this entry »

Comments (16)

Wombling

The second talk in a workshop on "Natural Algorithms", to be held at Princeton on Nov. 2-3, is Jorge Cortés, "Distributed wombling by robotic sensor networks". But you don't need to be able to attend the workshop in order to learn about this fascinating topic, since the author has recently published a version of the same material. The abstract:

This paper proposes a distributed coordination algorithm for robotic sensor networks to detect boundaries that separate areas of abrupt change of spatial phenomena. We consider an aggregate objective function, termed wombliness, that measures the change of the spatial field along the closed polygonal curve defined by the location of the sensors in the environment. We encode the network task as the optimization of the wombliness and characterize the smoothness properties of the objective function. In general, the complexity of the spatial phenomena makes the gradient flow cause self-intersections in the polygonal curve described by the network. Therefore, we design a distributed coordination algorithm that allows for network splitting and merging while guaranteeing the monotonic evolution of wombliness.

Read the rest of this entry »

Comments (10)

A new target language for machine translation

Weasel-speak, as featured in today's Tank McNamara:

There's clearly money in it — and quite a bit of training material out there.

Comments (11)

Another nail in the ATEOTD=manager coffin

Some people are hard to persuade. In response to my post "'At the end of the day' not management-speak", Peter Taylor commented:

I argue that the first question to ask is whether hearing someone use the phrase "At the end of the day" conveys information on whether they are likely to be a manager…

Well, a definitive determination of the information gain involved, aside from its limited general interest, would require more resources than I can bring to bear over my morning coffee. But we can make a plausible guess, and the answer turns out to be that the "information gain" is probably pretty small, and is just about as likely to point away from the conclusion that the speaker or writer is a manager as towards it.
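To make "information gain" concrete, here is a minimal sketch of the standard calculation: the entropy of the manager/non-manager distinction, minus the expected entropy once we know whether the phrase was used. The function names and the four-way count table are assumptions for illustration, and the numbers in the usage lines are invented, not estimates from any corpus.

```python
import math


def entropy(counts):
    """Shannon entropy, in bits, of the distribution given by raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(mgr_phrase, mgr_no_phrase, other_phrase, other_no_phrase):
    """Bits of information about manager status gained from phrase use.

    Arguments are counts of speakers in a 2x2 table:
    manager or not, crossed with uses the phrase or not.
    """
    total = mgr_phrase + mgr_no_phrase + other_phrase + other_no_phrase
    prior = entropy([mgr_phrase + mgr_no_phrase, other_phrase + other_no_phrase])
    n_phrase = mgr_phrase + other_phrase
    n_no_phrase = mgr_no_phrase + other_no_phrase
    conditional = (
        (n_phrase / total) * entropy([mgr_phrase, other_phrase])
        + (n_no_phrase / total) * entropy([mgr_no_phrase, other_no_phrase])
    )
    return prior - conditional


# Invented counts: managers and others use the phrase at the same rate,
# so hearing it tells us nothing (gain is zero, up to rounding).
print(information_gain(10, 90, 10, 90))

# Invented counts: only managers use the phrase, so it settles the question.
print(information_gain(50, 0, 0, 50))
```

The quantity is never negative, so it says how much the phrase tells us but not which way it points. For the direction, compare the rate of phrase use among managers with the rate among everyone else.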

Read the rest of this entry »

Comments (19)

Google Scholar: another metadata muddle?

Following on the critiques of the faulty metadata in Google Books that I offered here and in the Chronicle of Higher Education, Peter Jacso of the University of Hawaii writes in the Library Journal that Google Scholar is laced with millions of metadata errors of its own. These include wildly inflated publication and citation counts (which Jacso compares to Bernie Madoff's profit reports), numerous missing author names, and phantom authors assigned by the parser that Google elected to use to extract metadata, rather than using the metadata offered them by scholarly publishers and indexing/abstracting services:

In its stupor, the parser fancies as author names (parts of) section titles, article titles, journal names, company names, and addresses, such as Methods (42,700 records), Evaluation (43,900), Population (23,300), Contents (25,200), Technique(s) (30,000), Results (17,900), Background (10,500), or—in a whopping number of records— Limited (234,000) and Ltd (452,000). 

What makes this a serious problem is that many people regard the Google Scholar metadata as a reliable index of scholarly influence and reputation, particularly now that there are tools like the Google Scholar Citation Count gadget by Jan Feyereisl and the Publish or Perish software produced by Tarma Software, both of which take Google Scholar's metadata at face value. True, the data provided by traditional abstracting and indexing services are far from perfect, but their errors are dwarfed by those of Google Scholar, Jacso says.

Of course you could argue that Google's responsibilities with Google Scholar aren't quite analogous to those with Google Books, where the settlement has to pass federal scrutiny and where Google has obligations to the research libraries that provided the scans. Still, you have to feel sorry for any academic whose tenure or promotion case rests in part on the accuracy of one of Google's algorithms.

Comments (9)

Semantic fail

Leena Rao at TechCrunch points out a case where semantic search turned into anti-semitic search.

Read the rest of this entry »

Comments (41)

Serial improvement

Although I share Geoff Nunberg's disappointment in some aspects of Google's metadata for books, I've noticed a significant (though apparently unheralded) recent improvement. I decided to check this by following up on Bill Poser's post yesterday about insect species, which I thought was likely to turn up an example of the right sort. And in fact, the third hit in a search for {hemipteran} is a relevant one: Irene McCulloch, "A comparison of the life cycle of Crithidia with that of Trypanosoma in the invertebrate host", University of California Publications in Zoology, 19(4): 135-190, October 4, 1919.

This paper appears in a volume that is part of a serial publication. And until recently, Google Books routinely gave all such publications the date of the first in the series, even if the result was a decade or a century out of whack.

Read the rest of this entry »

Comments (12)