Archive for Computational linguistics

Google Books: A Metadata Train Wreck

Mark has already extensively blogged the Google Books Settlement Conference at Berkeley yesterday, where he and I both spoke on the panel on "quality" — which is to say, how well is Google Books doing this and what if anything will hold their feet to the fire? This is almost certainly the Last Library, after all. There's no Moore's Law for capture, and nobody is ever going to scan most of these books again. So whoever is in charge of the collection a hundred years from now — Google? UNESCO? Wal-Mart? — these are the files that scholars are going to be using then. All of which lends a particular urgency to the concerns about whether Google is doing this right.

My presentation focussed on GB's metadata, a feature absolutely necessary to doing most serious scholarly work with the corpus. It's well and good to use the corpus just for finding information on a topic: entering some key words and barrelling in sideways. (That's what "googling" means, isn't it?) But for scholars looking for a particular edition of Leaves of Grass, say, it doesn't do a lot of good just to enter "I contain multitudes" in the search box and hope for the best. Ditto for someone who wants to look at early-19th-century French editions of Le Contrat Social, or for linguists, historians or literary scholars trying to trace the development of words or constructions: Can we observe the way happiness replaced felicity in the seventeenth century, as Keith Thomas suggests? When did "the United States are" start to lose ground to "the United States is"? How did the use of propaganda rise and fall by decade over the course of the twentieth century? And so on for all the questions that have made Google Books such an exciting prospect for all of us wordinistas and wordastri. But to answer those questions you need good metadata. And Google's are a train wreck: a mish-mash wrapped in a muddle wrapped in a mess.
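To make the dependence on metadata concrete, here is a minimal sketch of the kind of by-decade count those questions call for. The function name, the (year, text) record format, and the toy records are all hypothetical illustrations, not anything Google Books provides; the point is only that every count is filed under whatever publication year the metadata supplies.

```python
import re
from collections import Counter

def phrase_counts_by_decade(records, phrase):
    """Count occurrences of `phrase` per decade in (year, text) records."""
    # Match the phrase as whole words, ignoring case.
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    counts = Counter()
    for year, text in records:
        decade = (year // 10) * 10  # e.g. 1848 -> 1840
        counts[decade] += len(pattern.findall(text))
    return counts

# Hypothetical toy records; real work needs a reliable year for every volume.
records = [
    (1845, "The United States are resolved to act."),
    (1848, "The United States are at peace."),
    (1905, "The United States is a great power."),
]
are_counts = phrase_counts_by_decade(records, "the United States are")
is_counts = phrase_counts_by_decade(records, "the United States is")
```

If the 1905 volume were misdated to 1845, its "is" would land in the 1840s bucket, and nothing in the text itself would flag the error.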

Read the rest of this entry »

Comments (81)

"Team, Meet Girls; Girls, Meet Team"

The ideal David Bowie song, according to (Nick Troop's interpretation of) the output of Jamie Pennebaker's LIWC program, correlated with sales figures across Bowie's oeuvre:

Read the rest of this entry »

Comments (8)

Computational eggcornology

Chris Waigl, keeper of the Eggcorn Database, brings to our attention a paper that was presented at CALC-09 (Workshop on Computational Approaches to Linguistic Creativity, held in conjunction with NAACL HLT in Boulder, Colorado, on June 4, 2009). As part of a session on "Metaphors and Eggcorns," Sravana Reddy (University of Chicago Dept. of Computer Science) delivered a paper entitled "Understanding Eggcorns." Here's the abstract:

An eggcorn is a type of linguistic error where a word is substituted with one that is semantically plausible – that is, the substitution is a semantic reanalysis of what may be a rare, archaic, or otherwise opaque term. We build a system that, given the original word and its eggcorn form, finds a semantic path between the two. Based on these paths, we derive a typology that reflects the different classes of semantic reinterpretation underlying eggcorns.

You can read the PDF of Reddy's paper here. Yet another advance in the recognition of eggcornology as a legitimate linguistic subdiscipline.

Comments (2)

The and a sex: a replication

On the basis of recent research in social psychology, I calculate that there is a 53% probability that Geoff Pullum is male. That estimate is based on the percentage of the and a/an in a recent Language Log post, "Stupid canine lexical acquisition claims", 8/12/2009.

But we shouldn't get too excited about our success in correctly sexing Geoff: the same process, applied to Sarah Palin's recent "Death Panel" Facebook post ("Statement on the Current Health Care Debate", 8/7/2009), estimates her probability of being male at 56%.
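The input to estimates like these is just the share of word tokens that are the or a/an. A minimal sketch of that measurement follows, assuming a crude regular-expression tokenizer; the function name is hypothetical, and the step that turns the percentages into a probability of being male is not given in this excerpt, so it is deliberately left out.

```python
import re

def article_percentages(text):
    """Return (pct_the, pct_a_an) as percentages of all word tokens."""
    # Lowercase and split into rough word tokens (letters and apostrophes).
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0, 0.0
    n_the = sum(1 for t in tokens if t == "the")
    n_a_an = sum(1 for t in tokens if t in ("a", "an"))
    total = len(tokens)
    return 100.0 * n_the / total, 100.0 * n_a_an / total

sample = "The dog chased a cat and an owl over the fence."
pct_the, pct_a_an = article_percentages(sample)
```

On a single post of a few hundred words, these percentages rest on a few dozen article tokens, which is one reason to treat the resulting probabilities with caution.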

Read the rest of this entry »

Comments (8)

Thanks, Bill Dunn!

In a comment on a recent LL post, Daniel C. Parmenter wrote:

In my MT days (starting in the early nineties) we used the WSJ corpus a lot. I read recently that the availablity of this corpus was in no small part thanks to you. And so I thank you. In those pre-and-early Google/Altavista days the WSJ corpus was an enormous help. Thanks!

Daniel is referring to an archive of text from the Wall Street Journal, covering 1987-1989, originally published with some other raw material for corpus linguistics by the Data Collection Initiative of the Association for Computational Linguistics (ACL/DCI). And the person who most deserves thanks for the availability of the WSJ part of this publication, perhaps its most important part, is Bill Dunn, who was the head of Dow Jones Information Services in the late 1980s.

As far as I know, Bill's role in making this corpus available is not documented anywhere, so I'll take this opportunity to tell some of the story as I remember it. (The rest of this post is a slightly edited version of an email that I sent on 5/1/2008 to someone at the WSJ who had corresponded with Geoff Pullum about an article on the use of corpus materials in linguistic research.)

Read the rest of this entry »

Comments (7)

NLTK Book on Sale Now

The NLTK book, Natural Language Processing with Python, went on sale yesterday.

"This book is here to help you get your job done." I love that line (from the preface). It captures the spirit of the book. Right from the start, readers/users get to do advanced things with large corpora, including information-rich visualizations and sophisticated theory implementation. If you've started to see that your research would benefit from some computational power, but you have limited (or no) programming experience, don't despair — install NLTK and its data sets (it's a snap), then work through this book.

Read the rest of this entry »

Comments (5)

Everyone to obey the orders and guidelines Mzmlh call girl

Over the past couple of days, I've continued to use Google's alpha Persian-English translation system as part of an attempt to keep track of what's happening in Iran.

On long passages, the results are still at the fever-dream stage of machine translation, where enough relevant words and phrase-fragments emerge to leave a sort of impressionistic residue of content, but without much overall coherence. For example, I tried it on a bulletin from Mehr News yesterday evening that claimed to be a statement from the Assembly of Experts announcing full support for Khamenei's speech on Friday. This sentence

به گزارش خبرگزاری مهر ، در این بیانیه آمده است: مجلس خبرگان رهبری ضمن تشکر از حضور شکوهمند و حماسه‌ساز مردم در انتخابات ریاست جمهوری، حمایت قاطع خود را از بیانات روشنگرانه، وحدت‌بخش و داهیانه‌ مقام معظم رهبری در نماز جمعه تهران اعلام می‌دارد و با شکرگزاری به درگاه الهی نسبت به نعمت عظما و بی‌بدیل ولایت فقیه، این رکن رکین حدوث و تداوم انقلاب؛ همگان را به تبعیت از دستورات و رهنمودهای معظم‌‌له فرا می‌خواند.

comes out in the automatic translation as

Mehr News Agency reported, the statement states: the Assembly of Experts also thanked the glorious presence Hmas·hsaz and presidential elections, support their statements Rvshngranh decisive, and Vhdtbkhsh Dahyanh Ayatollah Khamenei Friday Prayers in Tehran and ready Thanksgiving Portal to the Divine favor Zma Bybdyl and velayat-e faqih, the pillars of the revolution and continuity Rkyn Hdvs; everyone to obey the orders and guidelines Mzmlh call girl.

Read the rest of this entry »

Comments (3)

Green verdure tone desert liquidation

Researchers at Google have responded to current events in Iran by offering an alpha version of Persian-to-English machine translation. I'm a big fan of statistical MT, and for that matter of Google's MT team, and current events in Iran are gripping, so I thought I'd try it out.

Read the rest of this entry »

Comments (7)

496M hits for "language log"? Alas, no.

You've probably heard about Microsoft's new search site Bing. I don't know much about it yet, but I did observe a couple of things that may be of interest to those of us who try to use web-search counts as data.

Read the rest of this entry »

Comments (6)

The Dowdbot challenge

A few weeks ago, Maureen Dowd fantasized about a secret Google team trying to simulate her in software ("Dinosaur at the Gate", 4/14/2009):

When I ask [Eric Schmidt] if human editorial judgment still matters, he tries to reassure me: “We learned in working with newspapers that this balance between the newspaper writers and their editors is more subtle than we thought. It’s not reproducible by computers very easily.”

I feel better for a minute, until I realize that the only reason he knew that I wasn’t so easily replaceable is that Google had been looking into how to replace me.

There's a lot of far-out stuff over at Google Labs. But I'd be surprised to find that designing an army of Robot Maureens is in the mix, even though digital Dowd design poses some interesting challenges.

Read the rest of this entry »

Comments (10)

Industrial bullshitters censor linguists

A bullshit lie detector company run by a charlatan has managed to semi-successfully censor a peer-reviewed academic article. And I don't like it one bit. But first, some background, and then we'll get to the censorship stuff.

Five years ago I wrote a Language Log post entitled "BS conditional semantics and the Pinocchio effect" about the nonsense spouted by a lie detection company, Nemesysco. I was disturbed by the marketing literature of the company, which suggested a 98% success rate in detecting evil intent of airline passengers, and included crap like this:

The LVA uses a patented and unique technology to detect "Brain activity finger prints" using the voice as a "medium" to the brain and analyzes the complete emotional structure of your subject. Using wide range spectrum analysis and micro-changes in the speech waveform itself (not micro tremors!) we can learn about any anomaly in the brain activity, and furthermore, classify it accordingly. Stress ("fight or flight" paradigm) is only a small part of this emotional structure

The 98% figure, as I pointed out, and as Mark Liberman made even clearer in a follow-up post, is meaningless. There is no type of lie detector in existence whose performance can reasonably be compared to the performance of fingerprinting. It is meaningless to talk about someone's "complete emotional structure", and there is no interesting sense in which any current technology can analyze it. It is not the case that looking at speech will provide information about "any anomaly in the brain activity": at most it will tell you about some anomalies. Oh, the delicious irony: a lie detector company that engages in wanton deception.

Read the rest of this entry »

Comments (30)

Good is dead

Irving John "Jack" Good, who died on April 5 at the age of 92, is best known to linguists as the author of a paper on mathematical ecology. The paper is I.J. Good, "The Population Frequencies of Species and the Estimation of Population Parameters", Biometrika 40(3-4) 237-264 (1953), and its abstract reads as follows:

A random sample is drawn from a population of animals of various species. (The theory may also be applied to studies of literary vocabulary, for example.) If a particular species is represented r times in the sample of size N, then r/N is not a good estimate of the population frequency, p, when r is small. Methods are given for estimating p, assuming virtually nothing about the underlying population. The estimates are expressed in terms of smoothed values of the numbers n_r (r = 1, 2, 3, …), where n_r is the number of distinct species that are each represented r times in the sample. (n_r may be described as `the frequency of the frequency r'.) Turing is acknowledged for the most interesting formula in this part of the work. An estimate of the proportion of the population represented by the species occurring in the sample is an immediate corollary. Estimates are made of measures of heterogeneity of the population, including Yule's 'characteristic' and Shannon's 'entropy'. Methods are then discussed that do depend on assumptions about the underlying population. It is here that most work has been done by other writers. It is pointed out that a hypothesis can give a good fit to the numbers n_r but can give quite the wrong value for Yule's characteristic. An example of this is Fisher's fit to some data of Williams's on Macrolepidoptera.
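For readers who want to see the arithmetic, here is a minimal sketch of the quantities the abstract describes, with word types standing in for species. It uses the raw counts n_r rather than the smoothed values the paper calls for, so it is an illustration of the basic formula only; the function names are hypothetical. The adjusted count is r* = (r + 1) n_{r+1} / n_r, the estimate of p for a species seen r times is r*/N, and the estimated share of the population covered by species already seen is 1 - n_1/N.

```python
from collections import Counter

def frequency_of_frequencies(sample):
    """Map r -> n_r, the number of distinct species seen exactly r times."""
    species_counts = Counter(sample)
    return Counter(species_counts.values())

def turing_adjusted_count(r, n):
    """Unsmoothed Turing estimate r* = (r + 1) * n_{r+1} / n_r."""
    if n.get(r, 0) == 0:
        raise ValueError(f"no species observed exactly {r} times")
    return (r + 1) * n.get(r + 1, 0) / n[r]

def sample_coverage(n, size):
    """Estimated proportion of the population covered by seen species."""
    return 1.0 - n.get(1, 0) / size

# Toy sample: 'a' seen 3 times, 'b' and 'c' twice, 'd', 'e', 'f' once each.
sample = "a a a b b c c d e f".split()
n = frequency_of_frequencies(sample)
r_star_for_singletons = turing_adjusted_count(1, n)
coverage = sample_coverage(n, len(sample))
```

In this toy sample the adjusted count for r = 3 comes out as zero because n_4 is zero, which is exactly the kind of gap that makes smoothing the n_r necessary in practice.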

Read the rest of this entry »

Comments (8)

Conditional entropy and the Indus Script

A recent publication (Rajesh P. N. Rao, Nisha Yadav, Mayank N. Vahia, Hrishikesh Joglekar, R. Adhikari, and Iravatham Mahadevan, "Entropic Evidence for Linguistic Structure in the Indus Script", Science, published online 23 April 2009; also supporting online material) claims a breakthrough in understanding the nature of the symbols found in inscriptions from the Indus Valley Civilization.

Two major types of nonlinguistic systems are those that do not exhibit much sequential structure (“Type 1” systems) and those that follow rigid sequential order (“Type 2” systems). […] Linguistic systems tend to fall somewhere between these two extremes […] This flexibility can be quantified statistically using conditional entropy, which measures the amount of randomness in the choice of a token given a preceding token. […]

We computed the conditional entropies of five types of known natural linguistic systems […], four types of nonlinguistic systems […], and an artificially-created linguistic system […]. We compared these conditional entropies with the conditional entropy of Indus inscriptions from a well-known concordance of Indus texts.

We found that the conditional entropy of Indus inscriptions closely matches those of linguistic systems and remains far from nonlinguistic systems throughout the entire range of token set sizes.
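The quantity at issue can be sketched in a few lines. This is a plain maximum-likelihood estimate of H(next | previous) = -Σ p(x, y) log2 p(y | x) from bigram counts; it is not the authors' procedure, whose estimation details and token set sizes are not given in this excerpt, and the function name and toy sequences are hypothetical.

```python
import math
from collections import Counter

def conditional_entropy(tokens):
    """H(next | previous) in bits, from maximum-likelihood bigram counts."""
    bigrams = list(zip(tokens, tokens[1:]))
    if not bigrams:
        return 0.0
    pair_counts = Counter(bigrams)
    prev_counts = Counter(prev for prev, _ in bigrams)
    total = len(bigrams)
    entropy = 0.0
    for (prev, nxt), c in pair_counts.items():
        p_pair = c / total                       # p(previous, next)
        p_next_given_prev = c / prev_counts[prev]  # p(next | previous)
        entropy -= p_pair * math.log2(p_next_given_prev)
    return entropy

# A rigidly ordered sequence: each token fully determines the next.
rigid = list("abcabcabcabc")
# A looser sequence: each token can be followed by either of two tokens.
mixed = list("aabbaabbaabb")
h_rigid = conditional_entropy(rigid)
h_mixed = conditional_entropy(mixed)
```

The rigid sequence scores zero bits and the looser one close to one bit, which is the contrast between "Type 2" rigidity and greater flexibility that the quoted passage describes.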

Read the rest of this entry »

Comments off