Archive for Books

Google Books: A Metadata Train Wreck

Mark has already extensively blogged the Google Books Settlement Conference at Berkeley yesterday, where he and I both spoke on the panel on "quality" — which is to say, how well is Google Books doing this and what if anything will hold their feet to the fire? This is almost certainly the Last Library, after all. There's no Moore's Law for capture, and nobody is ever going to scan most of these books again. So whoever is in charge of the collection a hundred years from now — Google? UNESCO? Wal-Mart? — these are the files that scholars are going to be using then. All of which lends a particular urgency to the concerns about whether Google is doing this right.

My presentation focussed on GB's metadata — a feature absolutely necessary to doing most serious scholarly work with the corpus. It's well and good to use the corpus just for finding information on a topic — entering some key words and barrelling in sideways. (That's what "googling" means, isn't it?) But for scholars looking for a particular edition of Leaves of Grass, say, it doesn't do a lot of good just to enter "I contain multitudes" in the search box and hope for the best. Ditto for someone who wants to look at early-19th-century French editions of Le Contrat Social, or for linguists, historians or literary scholars trying to trace the development of words or constructions: Can we observe the way happiness replaced felicity in the seventeenth century, as Keith Thomas suggests? When did "the United States are" start to lose ground to "the United States is"? How did the use of propaganda rise and fall by decade over the course of the twentieth century? And so on for all the questions that have made Google Books such an exciting prospect for all of us wordinistas and wordastri. But to answer those questions you need good metadata. And Google's are a train wreck: a mish-mash wrapped in a muddle wrapped in a mess.
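To see why the by-decade questions above depend on the dates, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the `records` list, its years and its texts are invented, and `hits_by_decade` is a hypothetical helper, not anything Google Books provides. The last record stands in for a book catalogued under the wrong year.

```python
from collections import Counter

# Hypothetical (year-from-metadata, text) pairs, invented for illustration.
records = [
    (1915, "the propaganda of the war office"),
    (1938, "propaganda and more propaganda"),
    (1942, "radio propaganda reached millions"),
    # Imagine this one was really printed in the 1940s but is
    # catalogued under 1899: its hit lands in the wrong decade.
    (1899, "wartime propaganda leaflets"),
]


def hits_by_decade(records, word):
    """Count occurrences of `word` per decade, trusting each record's year."""
    counts = Counter()
    for year, text in records:
        decade = (year // 10) * 10  # e.g. 1938 -> 1930
        counts[decade] += text.lower().split().count(word)
    return dict(sorted(counts.items()))


result = hits_by_decade(records, "propaganda")
print(result)
```

The count is only as good as the year attached to each text: one misdated record is enough to put a twentieth-century word into the 1890s.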


The Google Books Settlement

I'm spending today at Berkeley, participating in a one-day conference on "The Google Books Settlement and the Future of Information Access". I'll live-blog the discussion as the day unfolds, leaving comments off until it's over. I believe that the sessions are being recorded, and the recordings will be available on the web at some time in the near future. [Gary Price at Resource Shelf provides some other links here, and a press round-up here. Another summary by an attendee is here.]

Regular LL readers will know that we've been long-time users and supporters of Google Books, with occasional complaints about the poor quality of its metadata. For a lucid discussion of some issues with the terms of the proposed settlement, read Pamela Samuelson's articles "The Audacity of the Google Books Settlement", Huffington Post, 8/10/2009, and "Why is the Antitrust Division Investigating the Google Books Search Settlement?", Huffington Post, 8/19/2009.


Down the memory hole into bibliomysticism

For the better part of the past two years, I've resisted the temptation to run out and buy a Kindle. (Well, OK, I wouldn't have to "run out" to do it, nor could I; as far as I can tell, the Kindle can only be ordered on Amazon. But whatever, it's still an appropriate figure of speech.) The Kindle just seems made for me. I love books and I love to read, and I'm also a ridiculously huge fan of electronic publishing of all kinds, and (especially) of the idea of carrying a library's worth of books with me wherever I go, because hey, you never know when you might want to read any one of them. (This also explains why my iPod touch overfloweth with just-in-case music.) The Kindle seems like it should be the best of both worlds: it's all there but the actual page-turning.

But still, I've resisted. I suppose I've been waiting for a sign, or at the very least for a definitive review of the Kindle — something other than the lap-doggish panting that was all I'd seen thus far. And in the space of the past two weeks, I've had both.


NLTK Book on Sale Now

The NLTK book, Natural Language Processing with Python, went on sale yesterday:

[Image: cover of Natural Language Processing with Python]

"This book is here to help you get your job done." I love that line (from the preface). It captures the spirit of the book. Right from the start, readers/users get to do advanced things with large corpora, including information-rich visualizations and sophisticated theory implementation. If you've started to see that your research would benefit from some computational power, but you have limited (or no) programming experience, don't despair — install NLTK and its data sets (it's a snap), then work through this book.


Slang affixation: it's all mystery-y-ish-y

If you haven't picked up a copy of Michael Adams' new book, Slang: The People's Poetry, well, what are you waiting for? For starters, it's a lively and engaging look at English slang and its multitudinous forms. At the same time, it's a thoughtful interrogation of what "slang" actually is, and how we might determine its boundaries. One way that Michael expands traditional notions of slang is in his treatment of affixation, or what he amusingly calls "unorthodox lexifabricology." I talked to Michael about slangy affixation in the second part of my two-part interview with him for the Visual Thesaurus. An excerpt follows below.


A BIG baseball book

A little while back, a representative of the publishers of the third edition of Paul Dickson's Baseball Dictionary wrote to offer me a free copy, in the hope that I would review the book on Language Log. I replied that I was an idiot about baseball — yes, I know, this totally undercuts any claim I might have to being a real American man, but I coped with that long ago — and so was not the person they wanted to take on this task.

But I did buy the book, because I knew that Dickson's dictionary was a work of serious lexicographic scholarship (with careful citations and thoughtful definitions, the sort of thing that could be accommodated in a revision of the OED). Many specialized dictionaries are not like this, and for good reason: in many domains, the evidence for usages in written texts is very hard to come by, and very spotty.


Shattering the illusions of texting

In my capacity as executive producer of the Visual Thesaurus, I recently had the opportunity to interview David Crystal about his new book, Txtng: The Gr8 Db8, a careful demolition of the myths surrounding text messaging. You can read the first part of my interview on the Visual Thesaurus website here, with parts two and three to follow in coming weeks. As Mark Liberman has noted, texting is only now achieving levels of popularity in the US that Europe and parts of Asia saw about five years ago. That means the US is also about five years behind the curve on the concomitant hysteria over how texting presages the death of the language.

Time and time again we've seen this strain of "hell in a handbasket" degenerationism pervading attitudes about contemporary language use (e.g., here, here, here, and here). But the furore over texting in the United Kingdom, which Crystal says began with a 2003 Internet myth about a school essay written entirely in textisms, takes this alarmism to new levels. Will the US be whipped up into the same fervor, five years later? Geoff Nunberg gave some indications of this possibility in a "Fresh Air" commentary a few months ago about excessive reactions to a Pew Research Center study on texting. The publication of Crystal's book in the US is therefore remarkably well-timed, since it can serve as a useful antidote to this sort of overheated discourse.
