Language Log

Google Books: A Metadata Train Wreck

August 29, 2009 @ 5:46 pm · Filed by Geoff Nunberg under Books, Computational linguistics

Mark has already extensively blogged the Google Books Settlement Conference at Berkeley yesterday, where he and I both spoke on the panel on "quality" — which is to say, how well is Google Books doing this and what if anything will hold their feet to the fire? This is almost certainly the Last Library, after all. There's no Moore's Law for capture, and nobody is ever going to scan most of these books again. So whoever is in charge of the collection a hundred years from now — Google? UNESCO? Wal-Mart? — these are the files that scholars are going to be using then. All of which lends a particular urgency to the concerns about whether Google is doing this right.

My presentation focussed on GB's metadata — a feature absolutely necessary to doing most serious scholarly work with the corpus. It's well and good to use the corpus just for finding information on a topic — entering some key words and barrelling in sideways. (That's what "googling" means, isn't it?) But for scholars looking for a particular edition of Leaves of Grass, say, it doesn't do a lot of good just to enter "I contain multitudes" in the search box and hope for the best. Ditto for someone who wants to look at early-19th century French editions of Le Contrat Social, or to linguists, historians or literary scholars trying to trace the development of words or constructions: Can we observe the way happiness replaced felicity in the seventeenth century, as Keith Thomas suggests? When did "the United States are" start to lose ground to "the United States is"? How did the use of propaganda rise and fall by decade over the course of the twentieth century? And so on for all the questions that have made Google Books such an exciting prospect for all of us wordinistas and wordastri. But to answer those questions you need good metadata. And Google's are a train wreck: a mish-mash wrapped in a muddle wrapped in a mess.

Start with dates. To take GB's word for it, 1899 was a literary annus mirabilis, which saw the publication of Raymond Chandler's Killer in the Rain, The Portable Dorothy Parker, André Malraux' La Condition Humaine, Stephen King's Christine, The Complete Shorter Fiction of Virginia Woolf, Raymond Williams' Culture and Society, Robert Shelton's biography of Bob Dylan, Fodor's Guide to Nova Scotia, and the Portuguese edition of the book version of Yellow Submarine, to name just a few. (You can find images of most of these on my slides, here — I'm not giving the url's since I expect Google will fix most of these particular errors now that they're aware of them).

And while there may be particular reasons why the 1899 date comes up so much, these misdatings are spread out all over the place. A book on Peter Drucker is dated 1905, a book of Virginia Woolf's letters is dated 1900, Tom Wolfe's The Bonfire of the Vanities is dated 1888, and an edition of Henry James's 1897 What Maisie Knew is dated 1848.

It might seem easy to cherry-pick howlers from a corpus as extensive as this one, but these errors are endemic. Do a search on "internet" in books written before 1950 and Google Scholar turns up 527 hits.

Or try searching on the names of writers or other famous people, restricting your search to works published before the years of their birth. You turn up 182 hits for Charles Dickens, more than 80 percent of them misdated books referring to the writer as opposed to someone else of the same name. The same search turns up 81 hits for Rudyard Kipling, 115 for Greta Garbo, and 29 for Barack Obama. (Or maybe that was another Barack Obama.)

A search on books mentioning candy bar that were published before 1920 turns up 66 hits, of which 46, or 70 percent, are misdated. I'd be surprised if that proportion of errors or anything like it held up in general for books in that range, and dating errors are far denser for older works than for the ones Google received from publishers. But even if the proportion is only 5 percent, that suggests hundreds of thousands of dating errors.
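The born-after-publication check is simple enough to sketch. What follows is only an illustration under stated assumptions, not anything Google or I actually ran: the record fields (`stated_year`, `mentions`), the `BIRTH_YEARS` table, and the sample records are all hypothetical.

```python
# Hypothetical sketch: flag records whose stated publication year precedes
# the birth year of a person the text mentions. Field names and sample data
# are invented for illustration.

BIRTH_YEARS = {"Charles Dickens": 1812, "Rudyard Kipling": 1865}

def suspect_records(records, birth_years=BIRTH_YEARS):
    """Return records dated earlier than the birth of someone they mention."""
    flagged = []
    for rec in records:
        for name in rec["mentions"]:
            born = birth_years.get(name)
            if born is not None and rec["stated_year"] < born:
                flagged.append(rec)
                break  # one anachronism is enough to flag the record
    return flagged

def misdated_share(misdated, hits):
    """Share of hits confirmed as misdated, as a rounded percentage."""
    return round(100 * misdated / hits)

records = [
    {"title": "Robinson Crusoe (reprint)", "stated_year": 1719,
     "mentions": ["Charles Dickens"]},
    {"title": "A later anthology", "stated_year": 1920,
     "mentions": ["Rudyard Kipling"]},
]

print([r["title"] for r in suspect_records(records)])  # ['Robinson Crusoe (reprint)']
print(misdated_share(46, 66))                          # 70
```

A flagged record is only a suspect: the name may belong to an earlier namesake, so each hit still has to be checked by hand before it counts as a misdating.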

In discussion after my presentation, Dan Clancy, the Chief Engineer for the Google Books project, said that the erroneous dates were all supplied by the libraries. He was woolgathering, I think. It's true that there are a few collections in the corpus that are systematically misdated, like a large group of Portuguese-language works all dated 1899. But a very large proportion of the errors are clearly Google's doing. Of the first ten full-view misdated books turned up by a search for books published before 1812 that mention "Charles Dickens", all ten are correctly dated in the catalogues of the Harvard, Michigan, and Berkeley libraries they were drawn from. Most of the misdatings are pretty obviously the result of an effort to automate the extraction of pub dates from the OCR'd text. For example the 1604 date from a 1901 auction catalogue is drawn from a bookmark reproduced in the early pages, and the 1574 dating (as of this writing) on a 1901 book about English bookplates from the Harvard Library collections is clearly taken from the frontispiece, which displays an armorial bookplate dated 1574.

Similarly, the 1719 date on a 1919 edition of Robinson Crusoe in which Dickens' name appears in an advertisement is drawn from the line on the title page that says the book is reprinted from the author's edition of 1719. And the 1774 date assigned to an 1890 book called London of to-day is derived from a front-matter advertisement for a firm that boasts it was founded in that year.
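To see how an error like the Robinson Crusoe one could arise, here is a minimal sketch. It assumes, purely hypothetically, an extractor that takes the earliest four-digit year it finds in the front matter; I have no knowledge of Google's actual pipeline, and the function names and sample text are invented for illustration.

```python
import re

# Matches plausible publication years from 1500 through 2099.
YEAR = re.compile(r"\b(1[5-9]\d{2}|20\d{2})\b")

def naive_year(front_matter):
    """Assumed heuristic: take the earliest plausible year in the OCR'd text."""
    years = [int(y) for y in YEAR.findall(front_matter)]
    return min(years) if years else None

def best_year(front_matter, catalogue_year=None):
    """Prefer a library catalogue date when one exists; fall back to the OCR guess."""
    if catalogue_year is not None:
        return catalogue_year
    return naive_year(front_matter)

title_page = ("Robinson Crusoe. Reprinted from the author's "
              "edition of 1719. London, 1919.")

print(naive_year(title_page))       # 1719
print(best_year(title_page, 1919))  # 1919
```

The point of the second function is only that a date already sitting in the catalogue record should outrank anything guessed from bookplates, advertisements, or reprint notices.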

Then there are the classification errors. William Dwight Whitney's 1891 Century Dictionary is classified as "Family & Relationships," along with Mencken's The American Language. A French edition of Hamlet and a Japanese edition of Madame Bovary are both classified as "Antiques and Collectibles." An edition of Moby Dick is classed under "Computers"; a biography of Mae West is classified as "Religion"; The Cat Lover's Book of Fascinating Facts falls under "Technology & Engineering." A 1975 reprint of a classic topology text is "Didactic Poetry"; the medievalist journal Speculum is classified "Health & Fitness."

And a catalogue of copyright entries from the Library of Congress listed under "Drama" — though I had to wonder if maybe that was just Google's little joke.

Here again, the errors are endemic, not simply sporadic. Of the first ten hits for Tristram Shandy, four are classified as fiction, four as "Family & Relationships," one as "Biography & Autobiography," and one is not classified. Other editions of the novel are classified as "Literary Collections," "History," and "Music." The first ten hits for Leaves of Grass are variously classified as "Poetry," "Juvenile Nonfiction," "Fiction," "Literary Criticism," "Biography & Autobiography," and mystifyingly, "Counterfeits and Counterfeiting."

Various editions of Jane Eyre are classified as "History," "Governesses," "Love Stories," "Architecture," and "Antiques & Collectibles" ("Reader, I marketed him").

In his response on the panel, Dan Clancy said that here, too, the libraries were to blame, along with the publishers. But the libraries can't be responsible for books mislabeled as "Health and Fitness" and "Antiques and Collectibles," for the simple reason that those categories are drawn from the BISAC codes that the book industry uses to tell booksellers where to put books on the shelves, not from any of the classification systems used by libraries. And inasmuch as BISAC classifications weren't in use until about 20 years ago, only Google could be responsible for their misapplications on books published earlier than that: the 1904 edition of Leaves of Grass assigned to "Juvenile Nonfiction"; the 1919 edition of Robinson Crusoe assigned to "Crafts & Hobbies"; the 1845 number of the Edinburgh Review assigned to "Architecture"; the 1907 edition of Sir Thomas Browne's 1658 Hydriotaphia: Urne-Buriall, or a discourse of the sepulchrall urnes lately found in Norfolk assigned to "Gardening"; and countless others.

Google's fine Bayesian hand reveals itself even in the classifications of works published after the BISAC categories came into use, such as the 2003 edition of Susan Bordo's Unbearable Weight: Feminism, Western Culture and the Body (misdated 1899), which is assigned to "Health & Fitness" — not a classification you could imagine coming from the University of California Press, though you can see how a probabilistic classifier could come up with it, like the "Religion" tag on the Mae West biography subtitled "Icon in Black and White."

But whether it gets the BISAC categories right or wrong, the question is why Google decided to use those headings in the first place. (Clancy denies that they were asked to do so by the publishers, though this might have to do with their own ambitions to compete with Amazon.) The BISAC scheme is well suited to organizing the shelves of a modern 35,000-square-foot chain bookstore or a small public library where ordinary consumers or patrons are browsing for books on the shelves. But it's not particularly helpful if you're flying blind in a library with several million titles, including scholarly works, foreign works, and vast quantities of books from earlier periods. For example, the BISAC "Juvenile Nonfiction" subject heading has almost 300 subheadings, including separate categories for books about "New Baby," "Skateboarding," and "Deer, Moose, and Caribou." By contrast, the "Poetry" subject heading has just 20 subdivisions in all. That means that Bambi and Bullwinkle get a full shelf to themselves, while Schiller, Leopardi, and Verlaine have to scrunch together in the lone subheading reserved for "Poetry/Continental European." In short, Google has taken the great research collections of the English-speaking world and returned them in the form of a suburban mall bookstore.

These don't exhaust the metadata errors by any means. There are a number of mismatches of titles and texts. Click on the link from the 1818 Théorie de l'Univers, a work on cosmology by the Napoleonic mathematician and general Jacques Alexander François Allix, and it takes you to Barbara Taylor Bradford's 1983 novel Voices of the Heart, whereas the link on a misdated number of Dickens' Household Words takes you to a 1742 Histoire de l'Académie Royale des Sciences. The link from the title Supervision and Clinical Psychology takes you to a book called American Politics in Hollywood Film. Numerous entries mix up the names of authors, editors, and writers of introductions, so that the "about this book" page for an edition of one French novel shows the striking attribution, "Madame Bovary By Henry James."

More mysterious is the entry for a book called The Mosaic Navigator: The essential guide to the Internet Interface, which is dated 1939 and attributed to Sigmund Freud and Katherine Jones. My guess is that this is connected to Jones' having been the translator of Freud's Moses and Monotheism, which must have somehow triggered the other sense of the word mosaic, though the details of the process leave me baffled.

For the present, then, linguists, humanists and social scientists will have to forgo their visions of using Google Books to assemble all the early nineteenth-century book sale catalogues mentioning Alexander Pope or tracking the use of "Gentle Reader" in Victorian novels: the metadata and classifications are simply too poor.

Google is certainly aware of many of these problems (if not on this scale) and they've pledged to fix them, though they've acknowledged that this isn't a priority. I don't doubt their sincere desire to get this stuff right. But it isn't clear whether they plan to go about this in the same way they're addressing the many scanning errors that users report, correcting them one by one as they're notified of them. That isn't adequate here: there are simply too many errors. And while Google's machine classification will certainly improve, extracting metadata mechanically simply isn't sufficiently reliable for scholarly purposes. After some early back-and-forth, Google decided it did want to acquire the library records for scanned books along with the scans themselves, and now it evidently has them, but I understand the company hasn't licensed them for display or use; hence, presumably, the odd automated stabs at recovering dates from the OCR that are already present in the library records associated with the file.

In our panel discussion, Dan Clancy suggested that it should fall on users to take some of the responsibility for fixing these errors, presumably via some kind of independent cataloguing effort. But there are hundreds of thousands of errors to pick up on here, not to mention an even larger number of files with simply poor metadata or virtually no metadata at all. Beyond clearing up the obvious errors, the larger question is whether Google's engineers should be trusted to make all the decisions about metadata design and implementation for what will probably wind up being the universal library for a long time to come, with no contractual obligation, and only limited commercial incentives, to get it right. That's probably one of the questions the Antitrust Division of the Justice Department should be asking as it ponders the Google Books Settlement over the coming month.

Some of the slack here may be picked up by the HathiTrust, a consortium of a number of participating libraries that is planning to make available several million of the books that Google scanned along with their WorldCat records. But at present HathiTrust is only going to offer the out-of-copyright books, which are about 25 percent of the Google collection, since libraries have no right to share the orphan works. And it isn't clear what search functionalities they'll be offering, or to whom — or, in the current university climate, for how long. In any event, none of this should let Google off the hook. Google Books is unquestionably a public good, but as Pam Samuelson pointed out in her remarks at another panel, a great public good also implies a great public trust.

81 Comments

Graeme said,

August 29, 2009 @ 6:57 pm

Meta-data disaster.

Borges would be proud.
Chris said,

August 29, 2009 @ 7:19 pm

As always, you get no more than what you paid for (and, as always, usually less). High-quality libraries, which is to say paid-for libraries, will continue to exist, because people need them. I'm not worried that Google will be the "last library", simply because it's too crappy. It's a superlative mall bookstore, so it's still a net win.

GN: The thing of it is that up to now libraries have been giving you a whole lot more than you pay for, with governments and universities picking up the tab. But research libraries are under a lot of financial pressure, particularly in the current climate — and now that Google Book is out there, more people are saying they're basically duplicative: why spend all that money storing books? Then too, if the settlement is approved and Google is permitted a legal monopoly, they'll be the only one who can give you access to orphan works, crappy or no.
language hat said,

August 29, 2009 @ 8:38 pm

I don't doubt their sincere desire to get this stuff right.

I'm not sure how you can say that, given these other excerpts from your post:

they've acknowledged that this isn't a priority.

Dan Clancy suggested that it should fall on users to take some of the responsibility for fixing these errors

The errors, while infuriating, don't bother me as much as the smugness and stonewalling: why say "Sorry, we screwed up" when you can lie and blame the libraries? Show some class, Google, and earn the respect and affection you've been basking in.
Jon Orwant said,

August 29, 2009 @ 8:41 pm

There's a lot to respond to here, but one quick comment about the prevalence of 1899 dates: here at Google, we recently received a large number of Brazilian records from a metadata provider. This provider seemingly used 1899 as a placeholder for "no date", which is why we have 250,000 books incorrectly identified as being published in 1899.

Our providers have millions of errors like these, and we do what we can to eliminate them. We have made substantial improvements over the past year, but I'm sure we can all agree there's a great deal more to do.
GN: I did mention the Portuguese collection here, though in fact only one of the books I cited with an 1899 date is in Portuguese, and that only because I couldn't resist mentioning the misdated Yellow Submarine. Mark conjectures that the frequency of the 1899 dating in other collections may result from the programmer's use of "99" as a way to code "original state, nothing entered yet", which someone else's software might have translated to "1899." But there are huge numbers of misdated books from other years, and some errors, as I noted, are the obvious results of Google's attempts to automatically extract a publication date from OCR'd text, when the correct date was already contained in the provider's record. Similarly, the BISAC misclassifications of books before the 1980's could only be Google's handiwork (as could many of the misclassifications of books after that). So it simply isn't true that all or even most of the metadata errors are the fault of providers who supplied erroneous information: a great number have clearly been introduced by Google itself.

The larger question remains: with the best of intentions, does Google Book have a clear idea what it intends to do about metadata, or what it would mean to get this right (which is not at all the same as getting it correct)? There's no indication that they've given much thought to how people — scholars included — might want to use Google Book other than as a way of getting at the useful information contained in books or of what various purposes metadata might actually serve (the data-mining they envision is another issue). But then, this has never been Google's line of work.
mgh said,

August 29, 2009 @ 8:42 pm

Mark's post yesterday made passing reference to comparisons between the Google Books project and the Human Genome Project. The comment below suggests using the genome project as a model to fix the metadata problem; I'm sorry the comment is so lengthy.

Mark said the genome project and GB are not really comparable, 1) because the genome project was publicly funded and its results are publicly held, while GB is privately funded and privately held and 2) because the genome project was a competitive affair while GB is a one-player game.

But, Geoff's points here suggest that GB is very much like HALF the genome project — the privately funded half. Many of you may know that the race to the genome involved a public consortium, which took a "slow but steady" approach where fragments of DNA were sequenced in an orderly end-to-end fashion, competing with a private company effort that was taking a "shotgun" approach, rapidly sequencing many many many small chunks in random order and hoping to reassemble them properly later.

In this analogy, the raw DNA sequence is like the text of the books; the mapping of each DNA sequence to a position on a chromosome is the metadata. From Geoff's description here, it sounds like what Google is generating is much like what Celera (the private venture) generated: vast rapidly-acquired stretches of raw sequence. What is missing is the "alignment" of those texts into an orderly assembly, with each text in its proper category.

In the end, the public and private efforts saw that their approaches were complementary — one had piles of sequence, the other had the map on which to align it — and they collaborated. (The real push of course was that they'd both rather tie the race than lose it.)

It sounds like a happy solution would be for a public (or academic) consortium to negotiate to provide the metadata — which is a lot of work to generate — for Google, if Google would agree to a more public repository of the GB data. This would be very much like what happened in the genome race, except there's no competition here.

I'd emphasize that there would not be re-scanning of entire books: much like the public genome consortium was able to align Celera's short DNA reads onto the public backbone assembly, the group that assembles the metadata would simply drop the existing GB scans into their catalog index (maybe scanning a couple pages of each book to be sure they are "aligning" the proper edition?).
Mike Aubrey said,

August 29, 2009 @ 9:08 pm

"But research libraries are under a lot of financial pressure, particularly in the current climate — and now that Google Book is out there, more people are saying they're basically duplicative: why spend all that money storing books?"

If research libraries with financial pressure go under, what happens to the high level publishers who *only publish* for libraries (e.g. Brill)?
Theo said,

August 29, 2009 @ 9:15 pm

(Incidentally, your link to http://ischool.berkeley.edu/~nunberg/GBook/GoogleBookMetadataSh.pdf in the third paragraph should point to http://ischool.berkeley.edu/~nunberg/GBook/GoogBookMetadataSh.pdf. Also, you probably don't want random people like me to have access to the directory data that let me figure that out. The easiest short-term fix is to create a http://ischool.berkeley.edu/~nunberg/GBook/index.html file, and in the long term is to change the permissions/.htaccess file in your public_html directory.)
Nick Lamb said,

August 29, 2009 @ 10:26 pm

“Then too, if the settlement is approved and Google is permitted a legal monopoly, they'll be the only one who can give you access to orphan works, crappy or no.”

That's not clear to me at all. Why can't someone else scan the books? Who would be granting this hypothetical monopoly over works for which there is already a de jure monopoly right (copyright) which is unexercised (hence the phrase "orphan works") ? If it becomes established that this is a legitimate thing to do (ie won't get you locked up or sued out of existence), you can expect volunteers to do it on a much, much bigger scale than Google has attempted.

GN: Since I'm not an attorney, I'll refer you to Pam Samuelson's explanation of this point in a posting called "Why is the Antitrust Division Investigating the Google Book Search Settlement?" An excerpt:

Google already has a five-year head start, an ability to integrate GBS with other products and services, and licenses in place with many institutions. Any firm contemplating a competitive product would quickly realize that it couldn't offer a comparably complete database of books. Google's head start may, moreover, provide sufficient time for network effects to kick in, which would further deter entry. Google is thus likely to have a de facto monopoly on institutional subscriptions. The license to out-of-print books that Google would get if the settlement is approved is a key and perhaps an insurmountable barrier to entry for other firms.

Now: On the metadata thing, where the metadata is recoverable (ie a human reader can figure out which edition this is by reading it, rather than needing say to look at a document proving when the volume was purchased by the library) it's well suited to crowd sourcing. A million metadata entries is too many for a single person or even a small team, but it's no challenge for the resource that built Wikipedia and hand-transcribed most of the early eBooks (in Project Gutenberg).

Too tedious? Don't believe it. Volunteers have transcribed Britain's census (100+ year old census paperwork is released to the public on the basis that most people mentioned in it are long dead) and other public records which are every bit as dull as the phone book. BUT to make it happen Google need to reassure people that they're not being taken advantage of, the facts collected must be irrevocably put into the public domain.

GN Not all metadata is recoverable from the text (e.g., various intertextual relations or the fact that "Currer Bell" was Charlotte Bronte). Another problem here is that crowds are not very good at building catalogues that require a high degree of formal and conceptual consistency across the entries — by "catalogue" I mean not just a library catalogue but a dictionary or a system like Gracenote. Paul Duguid discussed this a while ago in a First Monday article. More to the point, we already have pretty good metadata for most of these books compiled by skilled cataloguers in an institutional setting. But all that said, it's easy to imagine collaborative efforts aimed at elaborating and enhancing current metadata for particular collections or subjects.
JenJen said,

August 29, 2009 @ 10:59 pm

You don't believe GB is adequate for scholarship, and I hope you're right. I hope scholarship can remain diligent and detail-oriented and committed to getting it right. I hope the "good enough is good enough" attitude I see every day among our students somehow vanishes when they become the scholars of the future. When more Americans trust Joe the Plumber than NASA, I worry that at some point nobody's going to really care when Henry James wrote Madame Bovary (!) , so long as they can get the full text online. Hey, at least there are some fruitful lines of inquiry in the study of Google Books!
Lars said,

August 29, 2009 @ 11:57 pm

Surely there is enough goodwill in the world that if Google was to make the process available through a public wiki, users could help to correct the meta data as they go, in somewhat the same way that Sun Systems allows the public to participate in the production of OpenOffice.org. Under any circumstances it will be a long time before GB meets the requirements of Academia, but in the meantime there are millions of people who are grateful for access, which they would otherwise be denied, to the books available through GB.
George J said,

August 30, 2009 @ 12:00 am

I remember when Yahoo was starting up they were vocal about hiring catalogers and other librarians (skeptics said they were doing it just to gain credibility)…and if I recall correctly, Google in its infancy was also talking up the importance of getting librarians involved in their efforts. Perhaps it has just been symbolic collaboration. In any case, I suppose that for the foreseeable future, we'll be toggling between our favorite library catalogs for the metadata and GB for some content.
Brandon said,

August 30, 2009 @ 12:10 am

I once came across a nineteenth-century book of theology that GB said was authored by the Holy Trinity.

I like a lot of things about GB, but the number of errors really is extraordinary at times; dealing with them will require a systematic plan, not the haphazard approach to correction currently in place. That is, if GB is to live up to even half the hype Google will really need to approach this as not just a mass scanning problem but as a library organization problem.
dr pepper said,

August 30, 2009 @ 12:26 am

How about a meta meta analysis of the categories and tags assigned by users of Library Thing?
John Cowan said,

August 30, 2009 @ 3:43 am

I can't say very much (the first rule of Google is that you don't talk about Google), but according to an internal talk I heard from an outside research group (who haven't published yet) that is collaborating with Google, there is now in existence a really huge corpus of books with high-quality metadata, approximately one BNC-worth (100 million words) for every year from 1801 to 2006, and lesser amounts of books for a century or more before that. They are discovering things about the history of collocations that have been a complete mystery till now, and this will go on. Further deponent sayeth (alas!) not.
GN: Well, I'll attend deponent's further depositions. This would certainly be a boon to historical linguists, pace various reservations about access and the restrictive conditions that the settlement places on text-mining. But it wouldn't answer to most of the needs of humanists and historians, who generally need access not to a representative or balanced corpus, but the whole shebang.

On the other hand, I tend to be a little more optimistic about the prospects than some of my colleagues (or some of the commenters here). Not that I'm counting on pure public-spiritedness to motivate Google to invest the time and resources in getting this right. But I do have a sense that a lot of this is due to the company's fumbling as it tries to master what it is only beginning to realize is a very different domain from the Web. And if recent history teaches us anything, it's that Google is a very quick study.
Sili said,

August 30, 2009 @ 7:51 am

First of all I have to mention that, infuriating though it must be to the scholar, I love all this found poetry. There should indeed be a religion dedicated to Mae West (beats Lauraism by a mile) and as a failed mathematician, I like the idea of topology as didactic poetry.

I don't think the idea of letting the users correct the errors is all bad. Galaxy Zoo classified millions of galaxies via crowdsurfing in much shorter time than expected. If the looking for metadata is made into a game in the same way, there won't be a lack of people wasting time procrastinating while fixing the errors. It's essentially just making good use of the effort that people otherwise put into playing patience.
James Kabala said,

August 30, 2009 @ 9:08 am

In case anyone doubts one of the claims above:

http://books.google.com/books?id=sxYEAAAAQAAJ&pg=PA1&dq=%22Holy+Trinity%22#v=onepage&q=&f=false

Apparently it is an autobiography! I think we can agree this should be an authoritative book in all future theology classes.

More seriously, at least that book has no listed author on the title page. Here is one attributed to "Stratford-upon-Avon Holy Trinity" (the name of the church where the sermons were delivered) even though the actual author, Rev. W.K.W. Chafy-Chafy (really), is listed on the title page:

http://books.google.com/books?id=EbAHAAAAQAAJ&printsec=frontcover&dq=%22Holy+Trinity%22&lr=#v=onepage&q=&f=false
John Mark Ockerbloom said,

August 30, 2009 @ 9:23 am

"And a catalogue of copyright entries from the Library of Congress listed under "Drama" — though I had to wonder if maybe that was just Google's little joke."

Actually, that might not be quite as out-of-the-blue as it might sound, since a number of the volumes of the Catalog of Copyright Entries are specifically about copyright registrations for dramatic works. When these are separately cataloged, they are sometimes assigned subject categories related to drama (which makes sense, since, among other things, they can be used to help compile bibliographies for drama.)

The volume being shown, as far as I can tell, does include dramatic copyright registrations, but it also includes other copyright registrations, though. So in this case it was not the best single category to assign.

It is possible in some cases for third parties to try to improve on the description (as aptly criticized here) as well as the organization (GBS has never handled multi-volume works like this one well) of what's in the corpus. I've done it, with some help, for the Catalog of Copyright Entries volumes scanned by various folks, including Google. (My organized set of links can be found at http://onlinebooks.library.upenn.edu/cce/ ).

I can't say my current processes scale as well as Google's, though. My own index of free online books (which has fairly high-quality metadata) contains fewer than 40,000 records; Google's contains millions. (Though there is some boiling down that occurs; for instance, my recent addition of the Dictionary of National Biography distilled hundreds of records at Google and the Internet Archive down to one record in my database.) But Google could be making better use of the higher-quality library metadata that's out there. And I'm looking into ways that I can scale up my index to get bigger with limited, controllable compromises in quality.
language hat said,

August 30, 2009 @ 9:50 am

Our providers have millions of errors like these, and we do what we can to eliminate them.

Once again, you are not acknowledging that 1) most of the errors do not come from the providers, and 2) a great many of them come about because you have pigheadedly refused to use the metadata available from the providers. This does not inspire goodwill and confidence.
Leonardo Boiko said,

August 30, 2009 @ 3:39 pm

Lots of people suggesting wiki-like metadata editing. One more reason why Google Books and librarything should be made to work together.
Ray Girvan said,

August 30, 2009 @ 5:03 pm

Yep. I'm glad to see a major LL post on this, as it's been an irritation for a long time. For example, I was recently trying to use GB to see if I could find pre-1900 citations for the word "spaceship". The dates for this search – supposed to find occurrences of "spaceship" between 1600 and 1900 – are completely broken.
Marty Manley said,

August 30, 2009 @ 6:14 pm

Great work, and highly entertaining. Biggest embarrassment for Google since the China train wreck. BTW, the slides from Geoff's preso are here http://tinyurl.com/lhjvns.

I did not see the presentation or attend the conference (disclosure: my wife is the I-School Dean), but it was by all accounts productive and useful. Based on this post, the slides, and the Quentin Hardy post, I wish Geoff had pushed a bit further into what exactly Google should do.

There are a fair number of bibliographic services and public bibliographic online protocols that Google could seemingly use to enhance their data. In general these folks are loath to license to commercial enterprises or to companies that make use of their data online (I founded Alibris, which sells a lot of out of print books. We lived variations on the book metadata problem for many years).

But I cannot see why Google could not license Library of Congress and OCLC catalogs to improve the metadata. It would have to strictly limit how much of this data it gave consumers in order to get the license (at least from OCLC), although as a practical matter most of this data is available under the industry Z39.50 data protocols that most academic libraries use to run online card catalogs. (Melvyl at Berkeley being one example).

Using industry data (from Ingram or Baker and Taylor) would be far better than BISG data, which is a joke within the industry. The problem with this data is that its quality degrades pretty quickly before about 1975 and pretty much dies in the world before ISBNs came out in 1970.

Policy questions about whether Google should be required to maintain high quality metadata, whether they are really the last library (I seriously doubt it), and whether the truly awful quality of some of the scans will ever be searchable or indexable in the first place are important issues beyond this post.
—
GN: I don't know nearly as much about cataloguing as you do, Marty, but licensing seems the only feasible solution. It's hard to imagine algorithms or crowd-sourcing replicating 10 million WorldCat entries.
Ryan Shaw said,

August 30, 2009 @ 8:44 pm

John Cowan wrote:

"…according to an internal talk I heard from an outside research group (who haven't published yet) that is collaborating with Google, there is now in existence a really huge corpus of books with high-quality metadata, approximately one BNC-worth (100 million words) for every year from 1801 to 2006, and lesser amounts of books for a century or more before that. They are discovering things about the history of collocations that have been a complete mystery till now…"

Unfortunately for this outside research group, no one will take their work seriously unless it can be confirmed and replicated. Just claiming that you have access to a super-secret stash of "books with high-quality metadata" doesn't fly: you have to actually make that stash available to others. On the other hand, if they're just using regular Google Books and not a super-secret stash–perhaps hoping that the metadata is "good enough"–then Nunberg's presentation should be enough to cast serious doubt on any conclusions they've drawn. Either way, using the Google Book corpus to do computational linguistic or literary research doesn't seem like a wise choice right now.
Joshua said,

August 30, 2009 @ 9:44 pm

Part of the problem is that Google Books doesn't have adequate options for users to correct errors, even when the visible part of the book provides enough information for the user to know what the correct information is.

Example: A book titled "DEUS DE BARACK OBAMA, O: PORQUE NAO EXISTE LIDERANÇA SEM FE" is one of the Portuguese-language books dated 1899. The copyright page bears a 2008 copyright date. But clicking on "Feedback" reveals only the following:

"Report a problem on the page
Current page: Page 4

[] Part of the page is unreadable
[] Missing page"

But the page isn't unreadable or missing. The problem that needs correction is the date of the whole book, and there's nowhere obvious to submit a correction for that.
Nick Lamb said,

August 31, 2009 @ 12:13 am

I don't want to get into this too deeply, but Gracenote is a terrible example. The state of the art in volunteer CD metadata has for years been something like Musicbrainz.
Stephen Jones said,

August 31, 2009 @ 4:08 am

Another problem here is that crowds are not very good at building catalogues that require a high degree of formal and conceptual consistency across the entries

As a quick glance at Wikipedia will show, crowds are not very good at formal and conceptual consistency within a single entry, let alone across the entries.
Evan said,

August 31, 2009 @ 4:30 am

What solution do you propose besides "Google should prioritize this problem more highly"? Do you have some services you might volunteer?
language hat said,

August 31, 2009 @ 9:53 am

Do you have some services you might volunteer?

Why should people volunteer to help a fantastically rich for-profit company fix its own stupid mistakes? Why haven't they been dealing with this serious problem themselves? When it turns out a car company has a problem with its engines, do you show up with a wrench and offer to fix it for them?
Seth Finkelstein said,

August 31, 2009 @ 10:54 am

I wonder if the "1899" year issue comes from a bug where many time formatting routines work with years-since-1900, and then a value of "-1" for error or unknown (i.e. years-since-1900 + "-1" value for error or unknown = "1899"). Just a thought.
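To make the hypothesis concrete, here is a minimal sketch. It assumes a date routine that stores years as an offset from 1900 (as C's `struct tm` does in its `tm_year` field) and a `-1` sentinel for "unknown"; the names `UNKNOWN_YEAR` and `display_year` are invented for illustration and are not taken from any real system.

```python
# Hypothetical sentinel meaning "no date available" (an assumption).
UNKNOWN_YEAR = -1


def display_year(years_since_1900: int) -> int:
    """Turn a years-since-1900 offset into a calendar year.

    Deliberately omits any check for the sentinel, which is the
    bug being hypothesized.
    """
    return 1900 + years_since_1900


print(display_year(83))            # a book from 1983
print(display_year(UNKNOWN_YEAR))  # the sentinel leaks through as 1899
```

If something like this were the cause, the fix would be a single check for the sentinel before formatting.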
Gene Golovchinsky said,

August 31, 2009 @ 12:15 pm

Thanks for a great analysis and drawing attention to this important problem. I would have loved to have been at the conference, but I suppose it wasn't widely publicized because it was by-invitation only.

Other aspects related to problems with the settlement include its effect on used book stores, and the marginalization of other media that will be trapped in copyright purgatory without more considered policies. More here.
Camilla S. said,

August 31, 2009 @ 6:04 pm

Since dates are also used to determine whether a book is in copyright or has entered the public domain, I would think Google would have a strong incentive to get this right. Anyone whose book is made available in full text due to a misclassification and who has opted out of the settlement would be able to bring another suit against Google. The misclassified / misdated books mentioned above are mostly only limited view or no preview only, but I've found at least one misclassified book available in full view (as it was assumed to be a public domain book).
Jon Orwant said,

September 1, 2009 @ 1:51 am

Let me apologize in advance for the length of what follows. I manage the Google Books metadata team, and in concert with technical lead Leonid Taycher and our metadata librarian Kurt Groetsch, we'd like to respond to Geoff's post. I'll explain why we display the metadata we do, but before I dive in I'd like to make a few broad comments.

GN: I'm grateful for this, which opens up the discussion — the more information everyone has, the more productive the discussion can be. Up to now, Google has been playing its catalogue cards a little close to its chest, which has contributed to the confusion. I was clearly wrong in some of my guesses about how these bad metadata arose. But Jon's post does raise or underscore several questions, which I'll take up below.

First, we know we have problems. Oh lordy we have problems. Geoff refers to us having hundreds of thousands of errors. I wish it were so. We have millions. We have collected over a trillion individual metadata fields; when we use our computing grid to shake them around and decide which books exist in the world, we make billions of decisions and commit millions of mistakes. Some of them are eminently avoidable; others persist because we are at the mercy of the data available to us. The quality of our metadata now is a lot better than it was six months ago, and it'll be better still six months from now. We will never stop improving it.

Second, spare a thought for what we are trying to do. An individual library has the tough goal of correctly cataloging all the books in its collection, which might be as many as 20 million for a library like Harvard. We are trying to correctly amalgamate information about all the books in the world. (Which numbered precisely 168,178,719 when we counted them last Friday.) We have a cacophony of metadata sources — over a hundred — and they often conflict. A particular library only has one set of cataloging practices to deal with (sometimes more, since cataloging practices inside a library change over the years and decades), and we have to dynamically adapt to every library, every union catalog, every commercial metadata provider.

Now, I'd like to go through Geoff's post point by point. Researching his observations over the past 48 hours has brought us face to face with a lot of metadata errors — some Google's, others external, and I'd like to thank him for going to the effort. Where the error was ours I will admit it. Where the error was not ours, I will describe the source but not name it; there's entirely too much finger-jabbing in the world for my taste. Where we discover systemic errors in external metadata, we try to notify the metadata provider so that they can correct the errors in their own database and avoid polluting other metadata customers.

Geoff begins by underscoring the importance of getting metadata right. No argument here. I wouldn't call Google Books "the Last Library" — we are not a library, and rely on brick-and-mortar libraries and flesh-and-blood librarians to practice genuine librarianship — but eagerly acknowledge that it's critical to properly curate the collection we have. Without good metadata, effective search is impossible, and Google really wants to get search right.

In paragraph three, Geoff describes some of the problems we have with dates, and in particular the prevalence of 1899 dates. This is because, as I said in my earlier post, we recently began incorporating metadata from a Brazilian metadata provider that, unbeknownst to us, used 1899 as the default date when they had no other. Geoff responded by saying that only one of the books he cited was in Portuguese. However, that metadata provider supplies us with metadata for all the books they know about, regardless of language. To them, Stephen King's Christine was published in 1899, as well as 250,000 other books.

To which I hear you saying, "if you have all these metadata sources, why can't the correct dates outvote the incorrect ones?" That is exactly what happens. We have dozens of metadata records telling us that Stephen King's Christine was written in 1983. That's the correct date. So what should we do when we have a metadata record with an outlier date? Should we ignore it completely? That would be easy. It would also be wrong. If we put in simple common sense checks, we'd occasionally bury uncommonly strange but genuine metadata. Sometimes there is a very old book with the same name as a modern book. We can either include metadata that is very possibly wrong, or we can prevent that metadata from ever being seen. The scholar in me — if he's even still alive — prefers the former.

GN: It's an interesting argument, but I'm going to ask my colleague Paul Duguid to respond to this one in a separate post, since I'd like to hear his take on this.

This Brazilian provider is an extreme, but we've learned the hard way that when you're dealing with a trillion metadata fields, one-in-a-million errors happen a million times over. We've special cased this provider so that their 1899 dates — and theirs alone — are ignored. You should see the improvements live on Google Books by the end of September.
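The approach described above (let the sources vote, keep genuine outliers visible, but ignore one provider's known placeholder year) can be sketched roughly as follows. This is only an illustration under stated assumptions, not Google's actual pipeline: the provider identifiers, the `KNOWN_DEFAULT_YEARS` table, and the function `resolve_year` are all hypothetical.

```python
from collections import Counter

# Hypothetical table: placeholder years that a given provider is known
# to use when it has no real date. Only that provider's values are dropped.
KNOWN_DEFAULT_YEARS = {"provider_br": {1899}}


def resolve_year(records):
    """Pick a display year from (provider, year) pairs.

    Returns (winning_year, outlier_years). Outliers are kept rather
    than discarded, so a strange but genuine date is never buried.
    """
    usable = [
        (provider, year)
        for provider, year in records
        if year not in KNOWN_DEFAULT_YEARS.get(provider, set())
    ]
    if not usable:
        return None, []
    votes = Counter(year for _, year in usable)
    winner, _ = votes.most_common(1)[0]
    outliers = sorted(year for year in votes if year != winner)
    return winner, outliers


records = [
    ("library_a", 1983),
    ("library_b", 1983),
    ("provider_br", 1899),  # dropped: this provider's known placeholder
    ("library_c", 1899),    # kept as an outlier: a different source
]
print(resolve_year(records))
```

The design choice worth noticing is that the filter is keyed on the provider and the year together, so an 1899 date from any other source still counts as evidence.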

Paragraph four claims that these errors are widespread. Again, no disagreement here. But in our defense, let me explain where these errors came from. The 1905 date for the Drucker book was courtesy of a New Jersey metadata provider, which used 1905 in the same way that the Brazilian provider used 1899 (this, by the way, is a large part of the reason why there are so many books purportedly mentioning "Internet" prior to 1950). The 1900 Virginia Woolf date came from a British union catalog that has multiple MARC 260.c fields, some with the correct date and some without, but the 1900 field also occurs in the record's MARC 008 field. A time-traveling Tom Wolfe wrote The Bonfire of the Vanities in 1888 rather than 1988 because one of our all-too-human humans miskeyed the date. Henry James wrote What Maisie Knew in 1848 rather than 1897 because a French union catalog tells us so. Four bad dates, four different causes.
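For readers unfamiliar with MARC, the kind of disagreement described for the Woolf record can be sketched like this. In the MARC 21 bibliographic format, character positions 07-10 of the fixed-length 008 field hold "Date 1", while 260 subfield c holds the transcribed publication date. The sample record below is fabricated for illustration (the year 1927 is an arbitrary stand-in), and the helper names are hypothetical.

```python
import re


def date1_from_008(field_008: str):
    """Return Date 1 (positions 07-10 of MARC 008) as an int, or None."""
    raw = field_008[7:11]
    return int(raw) if raw.isdigit() else None


def years_from_260c(subfields):
    """Pull the first four-digit year out of each 260$c value."""
    years = []
    for value in subfields:
        match = re.search(r"\d{4}", value)
        if match:
            years.append(int(match.group()))
    return years


def conflicting_dates(field_008: str, subfields_260c):
    """Return the sorted distinct years if the record disagrees with itself."""
    candidates = set(years_from_260c(subfields_260c))
    date1 = date1_from_008(field_008)
    if date1 is not None:
        candidates.add(date1)
    return sorted(candidates) if len(candidates) > 1 else []


# Fabricated 40-character 008 field: entry date, date type "s", Date 1 = 1900.
sample_008 = "850101" + "s" + "1900" + " " * 29
print(conflicting_dates(sample_008, ["1927.", "1900"]))
```

A check like this does not say which date is right; it only flags records that need a second look.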

Let's turn to Dickens. Geoff points out 182 hits for Chas prior to his birth year of 1812. I hope I won't be thought cavalier for not listing each case, instead focusing on the top hit: a British library associated a barcode for a 1740 book, Histoire de l'Académie Royale des Sciences, with the bibliographic record for Household Words. Regrettably, Geoff missed an even better chance to poke fun at us: we date one edition of A Christmas Carol from a shockingly pre-Gutenberg 1135. While I personally believe that some Dickensian themes are timeless, I wish this British union catalog had left that record out altogether. (And it's a different British union catalog than the one I mentioned earlier.)

In paragraph eight, Geoff cites Dan as saying that the erroneous dates were all supplied by the libraries. I wasn't at the conference, but would be astonished if Dan said exactly that.

He knows that not all our metadata sources are libraries, and he knows that Google sometimes inadvertently introduces its own errors. This seems like the sort of comment that is easily misheard: maybe someone said "Hey, you've got a lot of metadata errors", Dan replied that "there are a lot of errors in library catalogs", and it's interpreted as Dan asserting that libraries are to blame for all the errors visible on Google Books. I talk to Dan all the time about metadata and can assure you that he has a thorough understanding of the problems. Sometimes nuance and complexity get lost on stage.

GN: True, and I have a singularly rotten memory. So I recorded Dan's remarks so I could be sure of getting them right — I'll put the audio up later if I get a moment, but in the meantime here's a transcript of the relevant section:

There were a number of different types of metadata that Geoff talked about. One was the bisac and the classification. The other was the date of publication, okay? And what we do with metadata is we combine… we get metadata from libraries, we get metadata from OCLC, and we do get metadata from commercial partners as well, Ingram, Bowker, a number of commercial partners. For books we scan with a library, we get it from the library, okay? So invariably, when you see a snippet […?], when when you see wrong metadata, we got that metadata from the library, in terms of the date of publication, okay? And in fact, we have a system — when Cliff mentioned why it's not integrated with the existing infrastructures to do this, it is, actually. We get updates from the libraries, we get updates every week or two with new metadata to help correct the errors that are in there. The interesting thing is, for all the dates that we identify, many of those dates existed, almost all of them existed, for sure, in one of the sources that we got, okay? Prior to full-text search we didn't know there were these errors. It's only as we start searching these books that we're finding a lot of errors that the existing infrastructure sometimes did not detect.

So there's no indication here that Google is responsible for any of the metadata errors. And I actually heard another person from Google make the same statement at a meeting a number of us had at Berkeley prior to the conference. So to put this kindly, Google hasn't exactly been coming clean about this — until now.

Geoff also suggests that "most of the misdatings are pretty obviously the result of an effort to automate the extraction of pub dates from the OCR'd text." However, we don't extract publication dates from OCR. Every misdating came from a human — some inside Google, most outside. Where the misdates come from the frontmatter (e.g., the frontispiece or the title page, as in the two examples Geoff cites) the error is more likely to have been a person inside Google. We are investigating the best ways to fix these — through better training for those people, through automated ways to identify the errors, and maybe someday through user-supplied metadata corrections.

GN: It's only a little reassuring to know that Google isn't trying to extract publication dates from the OCR's, since this raises more questions than it answers. It wasn't an unreasonable assumption that these dates came from machine parsing of the texts, given the kinds of errors that are turning up. Take the book London of to-day, from the Harvard Library. The date of 1890 is plainly evident on the cover (despite a botched scan), which reads "London of to-day 1890." It's clearly repeated on the title page: "Boston: Roberts Brothers. 1890." And it's correctly recorded in the Harvard record for the book. But Google dated the book 1774, presumably on the basis of the front-matter advertisement for a shirtmaker that boasts it was established in that year: "Harborow's Shirt & Hosiery Manufacturers/To the Royal Family/ 15, Cockspur Street, Charing Cross, S. W, Established 1774."

I simply assumed that this mistake must have been the work of a program, rather than a human — I mean, could someone really misread that ad as providing a publication date? The answer, according to Jon, is, well, actually, somebody did. Which only goes to show that the Turing test can work both ways: do something dumb enough, and it's hard to tell you from a machine. The immediate question is where "inside Google" the person who made this error came from (and how much they're being paid an hour). But more significant, why in the world was Google paying somebody to determine the publication date of a book from the Harvard Library that was correctly dated in the library's catalogue? Why pay people to compile bad metadata when you already have the good stuff? There's an obvious disconnect here, which calls out for an explanation: Why hasn't Google secured the rights to present the Harvard data?

Now I'll turn, as Geoff did, to classification errors. There have been many attempts to create ontologies of book subjects, such as the Dewey Decimal System that Americans learn about in grade school, the fine-grained Library of Congress classifications, or the retailer-friendly BISAC categories. Geoff identifies a number of absurd subject classifications that we display.

First, he points out that the 1891 Century Dictionary and The American Language are classified as "Family & Relationships". He is correct and this is our fault. When we lack a BISAC category for a book, we try to guess one. We guess correctly about 90% of the time and Geoff's comments prompted the engineer responsible to suggest some improvements that we will roll out over the coming months. I would be more specific, but he suggested a few different approaches, and we're not yet sure which to take. (In case you're wondering why such an absurd subject appears, it's because the full inferred subject category is "Family & Relationships / Baby Names" and the actual library-supplied subject is "Names/US". I'm not trying to excuse the mapping, just explain it. A similar mistake causes us to classify Speculum as "Health & Fitness".)
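How a library subject like "Names/US" could end up under baby names is easy to imagine with a naive keyword lookup. The sketch below is purely hypothetical: the comment above does not say how the guessing actually works, and the one-entry table and the function `guess_bisac` are invented for illustration.

```python
import re

# Hypothetical keyword table; NOT Google's actual mapping.
KEYWORD_TO_BISAC = {
    "names": "Family & Relationships / Baby Names",
}


def guess_bisac(library_subject: str):
    """Guess a BISAC category from the first keyword that matches."""
    tokens = re.split(r"[^a-z]+", library_subject.lower())
    for token in tokens:
        if token in KEYWORD_TO_BISAC:
            return KEYWORD_TO_BISAC[token]
    return None


print(guess_bisac("Names/US"))
```

A lookup of this kind has no way of knowing that "names" in a dictionary's subject heading refers to words rather than babies, which is the sort of context a cataloger supplies.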

In contrast, the edition of Moby Dick identified as being about computers is the fault of a Korean commercial metadata provider. The Mae West biography ostensibly about religion (the jokes just write themselves, don't they?) is from a North Carolina commercial metadata provider. Ditto for The Cat Lover's Book of Fascinating Facts falling under "Technology & Engineering". Geoff identifies a topology text (I assume this is Curvature and Betti Numbers) as belonging to Didactic Poetry; this beaut comes to us from an aggregator of library catalogs. Perhaps the subject heading "Differential Geometry" was next to it in an alphabetic list, and a cataloger chose wrong.

Geoff adds, "And a catalogue of copyright entries from the Library of Congress listed under 'Drama' — though I had to wonder if maybe that was just Google's little joke."

Hey now. We would never ingest our own mirth into metadata records. There's too much there already. Like the time one of our partner libraries supplied us with a catalog record for a turkey baster. Not a book about turkey basting. An actual turkey baster, presumably to be found in the stacks. One European library classified Darwin's Origin of Species as fiction. And there's a copyright record for a book that has no writer, only a psychic who received the text "clairaudiently."

But this one we got right — take a look at the actual book, which is in full view. It is for Class D copyrights, which include dramatic works.

GN: Not so fast. This raises another issue, which you could think of as "Ceci n'est pas une pièce." Why classify a catalogue containing play copyrights as drama? A catalogue of things is not a member of the same category as the things it enumerates: Section B of the LOC copyright list, for foreign books in foreign languages, is an American book in English. So, no, you didn't get this one right. Harvard did, in their record for the book, which came from their collection: they have it under "American literature — Bibliography — Catalogs." So again, why didn't Google use Harvard's records? This is their line of work, after all, not Google's: as they say in the ads, "Don't try this at home."

I'll skip over the misclassifications for Tristram Shandy and Leaves of Grass, although if anyone is interested, let me know and I'll dive in. As with other examples in Geoff's post, it's a mix of our mistakes and others' mistakes. But I do want to explain why one edition of Leaves of Grass is identified as "Counterfeits and Counterfeiting." It's because a library cataloger decided that was the appropriate subject for a pirated book. That was picked up by an aggregator of library catalogs (in the MARC 650 field) and you can find it online under that subject heading if you know where to look.

An Australian union catalog holds that Jane Eyre is about governesses; a Korean commercial provider claims it's about Antiques & Collectibles. We suspect that the prevalence of Antiques & Collectibles for some classic editions derives from a cataloger's conflation of a particular item's worth ("that first edition is a real collectible!") with the subject classification for the edition. The architecture subject heading was our fault.

While it's true that BISAC didn't exist when many of these books were published, it's not the case that Google necessarily invented the BISAC classifications for them. Sometimes we did, but often commercial metadata providers (not publishers or libraries) provided them, for the benefit of retailers.

Geoff asks why we decided to infer BISAC subjects in the first place. There is only one reason: we thought our end users would find it useful. As I mentioned above, we estimate that we get it right 90% of the time. I hear loud and clear from Geoff that 90% is not enough. Is 95%? 99.9%? Tell us what you think. If the accuracy needed is in excess of what we can provide, we'll simply stop inferring BISAC subjects and chalk it up to a failed experiment.

GN: The question is, why did you think end-users would find this useful? Which end-users did you talk to about this? I don't think you'd find a whole lot of scholars who would embrace the idea of using the BISAC classifications in place of other library classification schemes. In fact, why would anybody think that a scheme designed for organizing the shelves of a Barnes & Noble outlet would be appropriate for a collection assembled out of the holdings of major research libraries? This was, frankly, a silly choice, and suggests that Google really didn't think this through. Again, if Google licensed the WorldCat subject codes for presentation, it could at least make them available for toggling on and off and for faceted search.

The 1818 Théorie de l'Univers links to Barbara Taylor Bradford's Voices of the Heart because of a barcoding error (ours) while Dickens' Household Words linking to the Histoire de l'Académie Royale des Sciences was another barcoding error (the library's). When Supervision and Clinical Psychology links to American Politics in Hollywood Film, that's because two books scanned one after the other, with no boundary in between — again, our fault.

Madame Bovary was written by Henry James and not Flaubert according to an aggregator of library catalogs; they could tell you which library made the original cataloging error.

Geoff says, "More mysterious is the entry for a book called The Mosaic Navigator: The essential guide to the Internet Interface, which is dated 1939 and attributed to Sigmund Freud and Katherine Jones. My guess is that this is connected to Jones' having been the translator of Freud's Moses and Monotheism, which must have somehow triggered the other sense of the word mosaic, though the details of the process leave me baffled." The explanation behind Mosaic is more prosaic: an Armenian union catalog got it wrong, and we believed them.

Commenter Brandon said that he found a theology book for which we listed the author as "Holy Trinity." Here it is. That is direct from the library catalog, and you can find it online if you search for it. The best part is that Holy Trinity is listed in their metadata record as the corporate name. Which means that the actual author was a contractor.

Geoff says, "I understand [Google] hasn't licensed [library records] for display or use — hence, presumably, the odd automated stabs at recovering dates from the OCR that are already present in the library records associated with the file." First, as mentioned above, we do not recover dates from OCR — all publication dates come from external metadata sources or from occasionally fallible humans. Second, we certainly do use and display metadata from library records. (We don't display the raw library metadata because of a contractual obligation with a library catalog aggregator that forbids us from doing so. Given how they pay their bills, this is understandable.)

GN: I'm genuinely puzzled here. If you can "use" the OCLC or library metadata, why do you have humans trying to extract it independently, and getting it wrong? There's some shoe that isn't being dropped here.

Now, if anyone is still with me after all that, I'd like to address Geoff's broader point about Google's intentions. It is hard to figure out how to answer this. We're committed to get metadata right, but you shouldn't take my word for it: promises are sometimes broken, and good intentions are never enough. So let me talk a little about how things work inside Google: we measure everything. Internally, we have a number of different ways to measure our metadata progress, and we measure ourselves by how much improvement we make. That's an approach that's worked well for Google in other areas, so perhaps it can be externalized: if you care about our metadata, come up with your own measure and track it over time. I'm confident that while there may be occasional quality dips (say, when we get data from a provider that misdates 250,000 books), the trend will be positive. It may not be as fast as you'd like, but if it's any consolation, it won't be as fast as we'd like either.
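For anyone who wants to take up the suggestion to "come up with your own measure and track it over time", one very simple measure is the share of a hand-checked sample whose displayed date differs from the verified one. The sketch below is just one possible measure; the function names and the sample data structure are hypothetical, and the sample years reuse examples mentioned in this thread.

```python
from datetime import date


def error_rate(sample):
    """Fraction of (displayed_year, verified_year) pairs that disagree."""
    if not sample:
        raise ValueError("sample must not be empty")
    wrong = sum(1 for shown, verified in sample if shown != verified)
    return wrong / len(sample)


def record_measurement(history, day, sample):
    """Append (day, error_rate) to history and return the rate."""
    rate = error_rate(sample)
    history.append((day, rate))
    return rate


history = []
sample = [
    (1899, 1983),  # placeholder date shown instead of the real one
    (1983, 1983),
    (1890, 1890),
    (1774, 1890),  # date taken from an advertisement in the front matter
]
record_measurement(history, date(2009, 9, 1), sample)
print(history)
```

Re-running the same check on the same sample every few months would show whether the trend really is positive.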

GN: What's missing here is a sense that Google can't and shouldn't go this alone. The reason this is all so frustrating — and not just for scholars — is that Google Books represents such an extraordinary resource, and already one that numerous researchers are trying to exploit. But you have the sense that the decisions about metadata and related issues are being shaped by a bunch of engineers sitting over their free Odwallas in Mountain View, who haven't really tried to determine what scholars need to make this work for them — or for that matter, how people in general can use the resource (I don't think anybody would have gone with the BISAC choice if they had worked this through). And there's the suspicion, too, that the Google people don't deeply understand the cataloguing process as professionals undertake it. Those are the crucial disconnects, and until they're bridged it's hard to see how Google Books can live up to its potential, for all the best intentions of the people there.

Finally, Geoff's efforts will have singlehandedly improved nearly one million metadata records in our repository once the code changes that his blog post inspired wend their way through our systems. While I winced at times reading his message and the conclusions he drew about our intentions and abilities, I can't deny that he's done Google a great service via his research. So: thank you, Geoff.

At the beginning of this message I mentioned three of the people most deeply involved with our metadata efforts: the technical lead, our metadata librarian, and myself. Against my better judgment, my colleagues insisted that they include their email addresses here: the technical lead's is his first name concatenated with "+metadata@google.com"; our librarian's is his first initial and last name concatenated with "+metadata@google.com". And I won't let them suffer alone: my email address is my last name "+metadata@google.com". (By the way, I know the "+metadata" isn't fooling anybody, but it's a neat trick — available with every gmail account — that makes filtering email easy.)
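The "+metadata" trick works because everything between the plus sign and the @ is a tag on the same mailbox, so a filter can key on it. A minimal sketch of how a filter might split such an address follows; the function name and the example address are made up for illustration.

```python
def split_plus_address(address: str):
    """Split 'user+tag@domain' into (user, tag, domain); tag may be empty."""
    local, _, domain = address.rpartition("@")
    user, _, tag = local.partition("+")
    return user, tag, domain


print(split_plus_address("someone+metadata@example.com"))
```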

GN: Again, this is a very welcome intervention, and one hopes that it will be the beginning, as the fellow said, of a beautiful friendship.
James Grimmelmann said,

September 1, 2009 @ 10:54 am

Jon, if you're still reading, when you catch mistakes in metadata given to you by providers, do you pass the corrections back upstream to them?
Dan T. said,

September 1, 2009 @ 1:15 pm

There may be several relevant dates for historical reference and for copyright status purposes; the date of the particular edition, as well as the date(s) of creation of the text and images within.
John Wilkin said,

September 1, 2009 @ 1:48 pm

This is a wonderful and regrettably amusing treatment of the metadata problems in Google Book Search that everyone, particularly Google, interested in digital libraries should read. There are, however, a few significant errors and vague innuendos such as the tiresome and fear-mongering ‘de facto monopoly’ argument that has been trundled out in response to commercial digitization efforts for the last fifteen years. As Executive Director of HathiTrust, the error I need to respond to, however, is the characterization of HathiTrust.

Nunberg states that HathiTrust may “only provide access to books in the public domain,” and this is simply not true. We may provide access to books within the parameters established by the law. Most notably, this allows us to open access to works where the individual or organization gives us permission. I won’t argue that this has happened on a very large scale, but then again we have yet to undertake the work with our communities—communities of scholars—to make that happen. I came to work today to find nearly a dozen signed permissions agreements requesting we open access to works whose rights have reverted to the authors, and this is indeed what we’ll do.

It would also be wrong to think that this sort of open reading access is the only meaningful use HathiTrust institutions can make of these works. One of the most significant uses is their preservation. The widespread use of acidic paper for most of the 19th and 20th centuries means that nearly all of the works being digitized are deteriorating. Preserving these works is a key library function sanctioned by the law and doing so in a digital form allows the HathiTrust libraries to share the burden of preservation much more effectively. There are other uses established by the law, including access by our users with print disabilities and supporting computational research. Nunberg’s grudging “only provide access to books in the public domain” fails to acknowledge these important activities by HathiTrust partners.

It is worth pointing out a couple of subtler quibbles with Nunberg’s characterization of HathiTrust and the problem of orphan works. First, it needs to be said that many works assumed to be in-copyright orphans are actually in the public domain, and it’s the arduous work of establishing rights that keeps some of these waters muddied. By coming together as they have, HathiTrust institutions can attack this particular problem with shared resources. With generous support from the Institute of Museum and Library Services, we are in the process of creating a Copyright Review Management System and, even in the planning and development stages, our work serves to “free” several thousand titles each month. Second, although HathiTrust is indeed “a consortium of participating libraries” (and I believe Nunberg implies here “*Google* participating libraries”), HathiTrust’s intention is to bring together *research libraries*, whether Google partners or not. We are in active discussions with several research libraries that are not Google partners, discussions that will expand our collective collections and bring even more library resources to bear on these questions of preservation and access.

I should add one final note about the search capabilities HathiTrust plans to offer. Our plans for reliable and comprehensive bibliographic and full text search across both in-copyright and public domain works are ambitious and well-documented on the HathiTrust website. For example, our full text search initiatives are covered in detail at http://www.hathitrust.org/large_scale_search, and we recently announced plans to launch our comprehensive search service in October, 2009.

GN:Thanks very much for this, which is quite useful. I couldn't do justice to HathiTrust in a sentence or two — I basically wanted to make the point that however it's done it doesn't let Google off the hook — though one thing I should have stressed is that the project's first goal, appropriately, is the preservation of the files. As for access to the orphan works, as John says, this is up in the air, both as to which texts might be available and who would have access, though absent the kind of global permission that Google would have under the settlement it's hard to see how the Trust could duplicate even a fraction of Google Books' resources unless some orphan works legislation is passed. (John doesn't mention one other very important potential feature of HathiTrust, in its ability to offer material, as from special collections, that Google hasn't scanned.)

I've also looked pretty carefully at HathiTrust's plans regarding search, which strike me as quite thoughtful (e.g., in the discussions of how to augment Lucene, how to treat stop words and reduce index size, improve performance, and so on). I'm hoping, among other things, that they'll improve on the very flaky hit-count-estimation algorithms that Google uses, provide for proximity search, wild-card searches, kwic, and other functionalities from the wordinista shopping list, and that they'll offer robust faceted searching, as seems to be their intention. So I'll look forward to seeing how search works after the October launch.
Paul Duguid said,

September 1, 2009 @ 3:55 pm

Jon writes:

"Should we ignore it completely? That would be easy. It would also be wrong. If we put in simple common sense checks, we'd occasionally bury uncommonly strange but genuine metadata. Sometimes there is a very old book with the same name as a modern book. We can either include metadata that is very possibly wrong, or we can prevent that metadata from ever being seen. The scholar in me — if he's even still alive — prefers the former".

This seems to me an odd dichotomy: Either include 'very possibly wrong' metadata or 'prevent this metadata from ever being seen'. Is that a warning that any criticism (from people like us) will drastically suppress records? Better, in such an event, to suppress criticism. Surely there's a path between the two. You can show the book and add metadata later. After all, many books in GBS are not given BISAC categories, and some are not given dates. Or you could set aside and work more quickly on the outliers–they are not, after all, going to make a huge portion of the collection. Or you could have a system for tagging dubious records for both Google and the public to review.

Google wants and deserves some sympathy for what it is doing–the task is colossal. And Jon deserves a great deal of praise for being so forthcoming. But Google took on the task, including vacuuming metadata from a variety of sources, trying to match metadata from different libraries, and in particular adding BISAC categories liberally to records from scholarly libraries.

Yet everyone knows that library metadata always has a fair proportion of mistakes. And few admire metadata providers for their scholarly integrity. Surely, then, the likelihood of someone sending 250,000 records dated 1899 or 1905 was something to be planned for by someone who has taken on this task. More generally, anyone hoping to mix and match from such different sources would have to be aware of the trouble that could ensue if it wasn't done cautiously and must have known that a day of reckoning would come and, consequently, couldn't be too surprised at the fallout.
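
The kind of planning described above, catching a flood of records that all carry the same suspicious year, can be sketched in a few lines. This is only an illustration under stated assumptions: the function name, the `ratio` threshold, and the sample counts are hypothetical, and nothing here reflects how Google actually screens its records.

```python
from collections import Counter

def flag_date_spikes(years, ratio=5.0):
    """Return years whose record count dwarfs the average of the two adjacent years.

    `ratio` is a hypothetical threshold chosen for illustration only.
    """
    counts = Counter(years)
    flagged = []
    for year, n in sorted(counts.items()):
        neighbours = (counts.get(year - 1, 0) + counts.get(year + 1, 0)) / 2
        # A year with many records but sparse neighbours looks like a default value.
        if n > ratio * max(neighbours, 1):
            flagged.append(year)
    return flagged

# Made-up counts: two years are wildly over-represented.
sample = [1898] * 40 + [1899] * 900 + [1900] * 45 + [1904] * 50 + [1905] * 700 + [1906] * 55
print(flag_date_spikes(sample))  # [1899, 1905]
```

Flagged years would then be set aside for review rather than suppressed, which is the middle path suggested above.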
Lucy H said,

September 1, 2009 @ 9:48 pm

In his response to Jon's comment, Geoff says, "The question is, why did you think end-users would find this useful? Which end-users did you talk to about this? I don't think you'd find a whole lot of scholars who would embrace the idea of using the BISAC classifications in place of other library classification schemes. In fact, why would anybody think that a scheme designed for organizing the shelves of a Barnes & Noble outlet would be appropriate for a collection assembled out of the holdings of major research libraries?"

As a random user of libraries (and of Google Books), the answer is, "because library subject headings are often bloody useless." In particular, my grandmother (an archivist) recently tried to find books on mosses (those little plant thingies) in our local public library, using a subject search. She found maybe 5 books, none in our actual branch. By going to the relevant shelf, she found dozens of books _in situ_.

Similarly, a search by subject heading in Google Books for "mosses" brings up 627 books. The 2001 book _Plants and Plant Life: Mosses and ferns_ by Jill Bailey does not appear in that list, though it's highly relevant.

Doing a basic search for "mosses" gets more than 10,000 (including the aforementioned book). But, of course, some of those are by authors named "Mosses", or are editions of a certain Hawthorne book. (All quite correct, that's what I asked for.) When I first checked this a few weeks ago (before the big UI changeover?), the overview listed some inferred subjects, which did substantially better. (Sadly, they appear to be gone.)

Anyway, for those of us who wish to search for books beyond the scope of our particular scholarship, BISAC may not be a replacement, but it or something like it is a necessary supplement.
Evan said,

September 1, 2009 @ 10:50 pm

Why should people volunteer to help a fantastically rich for-profit company fix its own stupid mistakes? Why haven't they been dealing with this serious problem themselves? When it turns out a car company has a problem with its engines, do you show up with a wrench and offer to fix it for them?

That's a hard comparison to make, as a car company has never given me their products for free (and yes, the GBS metadata is free).
Ben O'Steen said,

September 2, 2009 @ 7:24 am

So, shame on Google for mishandling MARC records. I don't think this is a useful argument to pursue further as I think many already have the opinion that Google's descriptive metadata is lacking or inaccurate. Your article provides a powerful assay that backs up that hunch.

The bottom line is that the institutions that took part in the scheme have access to download all of their scans for the price of a net connection. The administrative work they undertook to enable google to scan the books, while not cheap, was certainly far less expensive than attempting it themselves.

The quality is far from perfect, but considering the frankly herculean task that google set for themselves, they did alright. The fact that I can find a phrase in a book that had sat in a deep, dark archive, and then go and get an image of that very page is incredible enough.

And as you have said, we have already paid the price for the creation of high quality metadata – it's available from the OPACs in theory.

The next logical step is to link the OPACs directly to the books, so that people can use the OPACs' higher-quality metadata and incorporate the scanned resources 'invisibly' from google? The ball is surely in our court now? Google provides ways to link directly to the book and even a page using a very simple technique.
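
A minimal sketch of what such linking from an OPAC record might look like follows. The `id` parameter is the one visible in the Google Books URLs quoted elsewhere in this thread; the `pg` parameter and its `PA<number>` format are an assumption about how page links are formed, and `book_link` is a hypothetical helper name.

```python
from urllib.parse import urlencode

BASE = "http://books.google.com/books"

def book_link(volume_id, page=None):
    """Build a link to a volume, optionally to a single page."""
    params = {"id": volume_id}
    if page is not None:
        # Assumption: pages are addressed as pg=PA<number>.
        params["pg"] = f"PA{page}"
    return f"{BASE}?{urlencode(params)}"

# Volume id taken from a link posted later in this thread.
print(book_link("51PEAAAACAAJ"))
print(book_link("51PEAAAACAAJ", page=12))
```

An OPAC could store only the volume id next to its own catalogue record and generate the link on display, so the library's metadata stays authoritative.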
language hat said,

September 2, 2009 @ 10:20 am

[Reposting from the thread Jon originally posted his comment in, since this seems to be where the action is:]

Jon: Many thanks for your comment. It must have been hard to write without the slightest hint of defensiveness, let alone belligerence; nobody likes being criticized as harshly as you guys have been, and as one of the harsher critics (and as someone given to sometimes unhelpful belligerence in online exchanges), I am impressed (and suspect you rewrote it more than once).

The reason my comments about Google Books have gotten harsher over the several years I have been complaining (which I was at first reluctant to do, because Google Books has improved my life so greatly, both personally and professionally in my capacity as copyeditor/factchecker) is that I have seen no sign that Google even acknowledged the problem beyond a cavalier "Well, of course there are the occasional glitches, we have a lot of data to deal with, we're working on it, now shut up and eat your gruel." I can't begin to tell you how much good it does me to hear you say "Yes, there's far more bad data than there should be and much of it is our fault, we appreciate your criticism and are taking it into account, and here's how." For the first time I feel that the people in charge there are taking the problem seriously and really doing something about it. So don't rue the time you spent crafting that comment — it was time well spent.
John Cowan said,

September 2, 2009 @ 1:56 pm

Ryan Shaw: The complete secrecy is only until the research group publishes, which should be some time this year, if I remember correctly. I'm somewhat pushing the envelope by even mentioning the project's existence in a public forum.

Obviously, releasing the full machine-readable text of in-copyright books, especially recent in-print ones, is not going to happen. (Field linguists don't always release their raw recordings, either; they may contain slander, to mention just one problem.) Assuming I was not being lied to, though, the corpus does exist and its metadata (or at least its publication date) was hand-checked for accuracy. I would certainly expect the identities (titles, authors, etc.) of the books in the corpus to be public knowledge.
Adrian said,

September 2, 2009 @ 2:48 pm

http://news.bbc.co.uk/1/hi/technology/8233324.stm
Ryan Shaw said,

September 2, 2009 @ 6:31 pm

John Cowan wrote:

"Obviously, releasing the full machine-readable text of in-copyright books, especially recent in-print ones, is not going to happen."

They wouldn't necessarily have to release the full machine-readable text to allow their results to be replicated. Depending on what they've done, bags of words with frequency counts could be sufficient.
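
To make the idea concrete, here is a small sketch of a bag of words with frequency counts. The tokenizer is a deliberately crude assumption (lowercase letters and apostrophes only), and `bag_of_words` is a hypothetical name.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Reduce a text to word frequencies, discarding word order."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

bag = bag_of_words("The United States are large. The United States is large.")
print(bag["united"], bag["are"], bag["is"])  # 2 1 1
```

Because word order is discarded, the original text cannot be reassembled from the counts. The same loss means a question about a phrase such as "the United States is" would need counts of multi-word sequences rather than single words, which is presumably why it depends on what the researchers have done.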

Regardless, the proposed settlement says the research corpus will contain the full machine-readable text of in-copyright books unless they have been removed by the rightsholders.

“Research Corpus” means a set of all Digital Copies of Books made in connection with the Google Library Project, other than Digital Copies of Books that have been Removed by Rightsholders pursuant to Section 3.5 (Right to Remove or Exclude) or withdrawn pursuant to Section 7.2(d)(iv) (Right to Withdraw Library Scans), which Google provides to a Host Site or that Google, if and as a Host Site, uses.).

http://thepublicindex.org/archives/category/settlement/s-1/s-1-130

"… the corpus does exist and its metadata (or at least its publication date) was hand-checked for accuracy."

Let's hope they did a better job than Google's hand-checkers.
Kip W said,

September 2, 2009 @ 6:46 pm

Mae West can't be a religious 'icon' — she's no angel!
Virginia Faulkner said,

September 2, 2009 @ 10:01 pm

Marty Manley wrote:

"But I cannot see why Google could not license Library of Congress and OCLC catalogs to improve the metadata. It would have to strictly limit how much of this data it gave consumers in order to get the license (at least from OCLC), although as a practical matter most of this data is available under the industry Z39.50 data protocols that most academic libraries use to run online card catalogs. (Melvyl at Berkeley being one example)."

As an antiquarian bookseller who uses OCLC (Worldcat), I know that it is stuffed full of errors. Multiple listings for the same edition of the same title are the norm, not the exception. Reprint editions such as Grosset & Dunlap editions are routinely misdated because they are listed under the original copyright date. I recently traced an odd entry back to the original university catalog and found that an ownership provenance in the original catalog had changed to an author's name on OCLC, and this is only one of many examples I could give of OCLC problems.

I trust Library of Congress entries far more than I trust OCLC entries, and I also trust individual major library catalogs more than I trust OCLC. Using OCLC would improve Google Books, but it would also put a lot of bad data into it.

Nothing is perfect, but I have yet to identify any errors in the English Short Title Catalogue of pre-1801 books.
David Jones said,

September 4, 2009 @ 5:26 am

«When did "the United States are" start to lose ground to "the United States is"?»

If a researcher is going to answer this sort of question based on textual analysis and metadata, then I would expect any reasonable conclusion to be accompanied by a quantification of the errors. Google Books is no different from Harvard in this regard. If your paper said "years of publication for the works examined were extracted from Harvard Libraries electronic catalogue which I assumed to be infallible" then that would be laughably poor. Surely it's up to the individual researcher to ensure that they have good bounds on the errors. No doubt Google could have fewer errors, but your suggestion that their errors make it unsuitable for scholarly research invites us to conclude that the catalogues that people presumably already use for research are error free. Which is surely not the case.
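
One way a researcher might put bounds on metadata errors is to hand-check a random sample of records and compute a confidence interval for the error rate. The sketch below uses the Wilson score interval; the figures (12 bad dates in a sample of 200) are invented for illustration and say nothing about any real catalogue.

```python
import math

def wilson_interval(errors, sample_size, z=1.96):
    """Wilson score interval for an error rate; z=1.96 gives roughly 95% coverage."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    p = errors / sample_size
    denom = 1 + z**2 / sample_size
    centre = (p + z**2 / (2 * sample_size)) / denom
    half = z * math.sqrt(p * (1 - p) / sample_size + z**2 / (4 * sample_size**2)) / denom
    return centre - half, centre + half

# Hypothetical audit: 12 wrong dates found among 200 hand-checked records.
low, high = wilson_interval(12, 200)
print(f"estimated error rate between {low:.1%} and {high:.1%}")
```

The same audit applies equally to Google Books or to a library catalogue, which is the point made above: the source matters less than knowing how wrong it is likely to be.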
Marian Veld said,

September 4, 2009 @ 1:37 pm

Virginia Faulkner wrote:"As an antiquarian bookseller who uses OCLC (Worldcat), I know that it is stuffed full of errors. Multiple listings for the same edition of the same title are the norm, not the exception. "

Big problem.

"Reprint editions such as Grosset & Dunlap editions are routinely misdated because they are listed under the original copyright date."

This is because you are trying to use library cataloging for something it was not intended for. The dates in a catalog record are supposed to be exactly what is displayed on the title page or verso. If it is a copyright date, it should have a c in front of it. The publication information field in a catalog record is intended to be an exact transcription of what is in the book and nothing more.

" I recently traced an odd entry back to the original university catalog and found that an ownership provenance in the original catalog had changed to an author's name on OCLC, and this is only one of many examples I could give of OCLC problems."

Yes, OCLC has so many contributors with varying levels of experience and training. This is a problem.
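
The transcription convention described in this comment, where a copyright date carries a leading c, suggests that software reading catalog dates should keep that distinction rather than flatten everything to a bare year. A rough sketch, assuming only simple forms such as "c1923", "1905." or "[1899]"; real catalog date strings are far more varied, and the labels returned here are hypothetical.

```python
import re

# Assumed simple forms only: optional bracket, optional "c", four digits.
DATE_RE = re.compile(r"^\s*\[?(c)?\s*(\d{4})\]?\.?\s*$")

def parse_catalog_date(raw):
    """Return (year, kind), or None when the string is not a simple date."""
    m = DATE_RE.match(raw)
    if m is None:
        return None
    kind = "copyright" if m.group(1) else "as printed"
    return int(m.group(2)), kind

print(parse_catalog_date("c1923"))  # (1923, 'copyright')
print(parse_catalog_date("1905."))  # (1905, 'as printed')
print(parse_catalog_date("n.d."))   # None
```

A reprint transcribed as "c1923" would then be recognisable as carrying a copyright date, not necessarily the year that particular copy was printed.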
Richard Volpato said,

September 5, 2009 @ 9:32 pm

Excellent work on finding errors. But this very act shows one of the two ways forward.

Just like Wikipedia, Google will seek participation of people passionate about books. An "error notification" form is not impossible to imagine, with perhaps thousands of people offering corrections, especially if, for corrections committed, the 'error detective' gets a voucher for accessing premium levels of the Google Books offering. Larger 'trades' of quality-enhancing labours and access to results will be done even between Google and other companies (and start-ups). This is the internet, not inter-organisational protocols for transfer of meta-data!

The second way forward is through Google's use of OCRopus. Just look at the activity in improving that amazing collection of software. Look at how many other players are helping make OCRopus a better piece of software. It is getting better all the time. And, because the book scanning is itself stored (Google has not only the resulting OCR'ed book, but all the transformations done on the original digital scan), material and specifically targeted areas of a document can be RE-scanned with better intelligence. Further, the re-scanning can be guided by the generalization of any errors spotted. For instance in the list of errors, there are clearly extra rules that can guide scanning (e.g., ignore dates relating to the Publisher's logo).

Participation and adaptive software together give Google a pathway to quality. Significantly this pathway to quality comes via the drive for coverage. It is the coverage that will draw in the 'error detectives' and provide the 'input streams' for Bayesian belief networks (and the like).
Ray Girvan said,

September 6, 2009 @ 10:28 am

Richard Volpato: Just like Wikipedia, Google will seek participation of people passionate about books. An "error notification" form is not impossible to imagine and perhaps thousands of people offering corrections — especially if for corrections committed, the 'error detective' gets a voucher for accessing premium levels of Google Books offering.

Strongly agreed. As Joshua said, the Feedback form does exist, but the report categories assume readability to be the sole error criterion.

Minor, content is readable
Some problems, but still readable
Content is very difficult to read

It would be helpful too if one could send an error report about the results of a search, as this often picks up general metadata faults.
Shawne D. Miksa said,

September 7, 2009 @ 10:49 am

GN wrote: "And there's the suspicion, too, that the Google people don't deeply understand the cataloguing process as professionals undertake it. Those are the crucial disconnects, and until they're bridged it's hard to see how Google Books can live up to its potential, for all the best intentions of the people there."

I have worked as a cataloger and now teach it. At the best of times, the practice of cataloguing and classification takes time to learn and years to master. Unfortunately, impatience and/or unwillingness to invest in that time is one of the key culprits in the misinformation found in most all records, whether created by a library or some other institution or agency.

Google's intentions are not dishonorable, no more than the first 'librarians' who sought to collect and make available (to some extent) recorded works of humankind. But, they didn't do their homework either, relying instead on (or perhaps just taking) what others had done seemingly without earning the knowledge themselves. In the process they have started to experience the same trials and errors that catalogers discovered years (years!) ago. It's a complicated, messy thing to organize information objects–no matter how much brute force we throw at it. (Lest I make it sound as if librarians do no wrong–my research has shown that undereducated graduates of MLS programs go forth and propagate the very mistakes that have been discussed here because they, too, did not take the time to learn.)

Yes, there are crucial disconnects. One of which is disregarding/failing to respect, or just plain being ignorant of the traditions of practice.
Leo Goodstadt said,

September 7, 2009 @ 4:25 pm

I am slightly disappointed by the churlish response to Jon Orwant's post.
As a member of various genome sequencing consortia (including the human genome), the comparisons with Google books are apt. Strangely enough, Mark has got the competition part back to front. The initial release of the publicly funded human genome was a train wreck precisely because of the competition with Craig Venter's private efforts (which were in an even worse state, with hindsight).

The apposite point, however, is that genome sequencing centres, like Google, have always argued that early releases even with all the errors are immensely valuable to researchers. Contrary to the most pessimistic predictions, improvements have continued steadily since the original publications. Like Google Books, as soon as one starts diving into the human genomes, one finds endless series of problems and errors, usually similarly due to human error. There seems to be less pessimism that this is due to the incompetence of the annotators or that the problems are not being tackled.

I suspect that you and a few other of the comment authors are not able to grasp the sheer scale of the Google Books project, and hence (for all Google's resources) the extremely limited resources for manual curation that are available *per book*.

GN writes: "And there's the suspicion, too, that the Google people don't deeply understand the cataloguing process as professionals undertake it." The point is that Google Books will *never* be able to provide any cataloguing process as "professionals undertake it."

You say "The question is, why did you think end-users would find this [BISAC classifications] useful?… (I don't think anybody [any serious minded academic?] would have gone with the BISAC choice if they had worked this through)". Why do you think you represent the only worthwhile group of Google Book users? If as Jon Orwant claims, the BISAC classification is only 90% accurate, that might still be a useful guide for the casual browser. For an academic project, you may find that *all* of what you are interested in is in the 10% of errors, and the BISAC codes will be stupid, irrelevant and misleading. So don't use them!

Finally, GN says "you have the sense that the decisions about metadata and related issues are being shaped by a bunch of engineers sitting over their free Odwallas in Mountain View, who haven't really tried to determine what scholars need to make this work for them". I cannot understand how you can still say this after Jon Orwant's reasoned, remarkably friendly post. This seems remarkably churlish: that if you are not involved in the project, those who are undertaking must be stupid, ill-informed and malicious.

"Extracting metadata mechanically [Do you mean using computer algorithms?] simply isn't sufficiently reliable for scholarly purposes." The underlying fallacy is the lack of appreciation that the book-by-book careful curation approach cannot apply when you are dealing with hundreds of millions of volumes. All the sources of metadata are hopelessly full of errors, contradictory and incomplete. Privileging one source of metadata a priori means that there will be no way downstream of spotting problems. The sheer number of books in the collection means that allowing wikipedia-style manual corrections will not be a panacea either, especially at these early stages, when the scanning is incomplete. (Volunteers have indeed transcribed Britain's Census but this would not have worked (think through the psychology) if the public had been invited on board before the census data was complete.) I have no doubt that Google will eventually add some sort of system later on.

Google's approach of gathering all annotated metadata under the sun, and working out how to sort it out later at least allows the metadata to be corrected progressively later on. If you believe that you have better sources of metadata, there is nothing to prevent you from creating your own links between your more accurate metadata and Google Book search results. It seems a trivial exercise to correct publication dates…
Karen Coyle said,

September 7, 2009 @ 5:31 pm

There is another possible reason for the poor quality of the metadata in GBS, and I'm afraid it lies right on the libraries' doorstep: the restrictions on use of library metadata that are part of the WorldCat database. This is a long and tortuous tale, but two starting points are:

my recent blog post, with links to an earlier post showing how metadata for the Google Books project has been truncated.

the controversy around OCLC's attempt to limit use of library metadata

This doesn't account for all of the metadata problems, but it does show evidence that Google is NOT making use of the metadata created by libraries for the books in its database.
Curt G said,

September 7, 2009 @ 8:58 pm

What this proves is that the catalogs are very error-prone.

Why not just use the date from the scan of each book?

Of course, those aren't perfect — but this would eliminate most of the very serious errors and save everyone a lot of tedious, contentious work.
Bob Blair said,

September 7, 2009 @ 9:06 pm

I'm more concerned about much more prominent meta data. Google books is plagued with errors in title and author information. I can't count the times I have searched google with inauthor: or intitle: tags and found nothing; and then searched for text that I knew to be in the book and found it.

It's hard that you must already have a copy of a book in order to find it on Google.

But I don't mean this as discouragement. Google has made it possible, though not easy, to do research when I'm 90 miles from the nearest decent university library. I thank them for that and hope they will continue to work on their massive backlog of cataloging errors.
Christian Treczoks said,

September 8, 2009 @ 3:23 am

As one who has written software to aggregate more-or-less organized data from several sources into a single index, I have great respect for the big job that Google did here. In comparison, my job was easy – I dealt with about 15000 records from two main sources and a bunch of additions. But the data I had for input was a bloody mess.

Hats off to Google for doing what they are doing – comparing sources, finding errors and helping their sources to correct them in their own records.
Michael R. Bernstein said,

September 8, 2009 @ 10:33 am

I'd like to draw some attention to another form of corrupt metadata: The copyright status of works.

There exists within the GB corpus a large, easily identified, corpus of public domain works that are nevertheless assumed to be under copyright, and therefore are not available for full view.

These are works authored by the federal government and its various branches.

Consider the following example searches:

http://books.google.com/books?q=inauthor%3A%22United+States+Government%22

http://books.google.com/books?q=inauthor%3A%22U.S.+Government%22

http://books.google.com/books?q=inauthor%3A%22United+States+Senate%22

http://books.google.com/books?q=inauthor%3A%22U.S.+Senate%22

http://books.google.com/books?as_auth=%22house+of+representatives%22

http://books.google.com/books?q=inauthor%3A%22United+States+Congress%22

http://books.google.com/books?q=inauthor%3A%22U.S.+Congress%22

Nearly all works published after 1923 are assumed to be under copyright even if, as works authored by the US govt., they are actually in the public domain.

In some cases the works in question actually were first published before 1923, but the copyright metadata is derived from a reprint date. Here is one example: http://books.google.com/books?id=51PEAAAACAAJ&dq=inauthor:"United+States+Government"&lr=&ei=22SmSvG6PI7okATrjoCTCA

However, even assuming that the copyright date is correct, the works should still be in the public domain, due to their authorship.

Most egregiously, for some works, no preview is available in spite of a pre-1923 publication date:

http://books.google.com/books?as_auth="U.S.+Senate"&as_drrb_is=b&as_minm_is=0&as_miny_is=1776&as_maxm_is=0&as_maxy_is=1923

In summary, Google Books needs a better process for identifying public domain works and switching on the full view for them.
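
For anyone who wants to repeat these author searches, the sketch below builds the same kind of URL as the examples above, with the quotation marks and colon percent-encoded so the link survives being pasted into a comment form. The `q` parameter and `inauthor:` operator are taken from the URLs listed above; `inauthor_search_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

def inauthor_search_url(author):
    """Build a Google Books author search URL with the query safely encoded."""
    query = f'inauthor:"{author}"'
    return "http://books.google.com/books?" + urlencode({"q": query})

print(inauthor_search_url("U.S. Senate"))
# http://books.google.com/books?q=inauthor%3A%22U.S.+Senate%22
```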
Michael R. Bernstein said,

September 8, 2009 @ 10:40 am

My apologies. The example of a pre-1923 work with a post-1923 publication date is:

http://books.google.com/books?id=51PEAAAACAAJ

And hopefully this link to several pre-1923 works with no preview will fare better.
Michael R. Bernstein said,

September 8, 2009 @ 10:48 am

Well, that didn't work.

In lieu of linking to the pre-1923 search, here is one of the results of that search:

http://books.google.com/books?id=fFl9PgAACAAJ

It is "Important Serial Documents published by the Government, and how to find them" by James M. Baker (Assistant Librarian, U.S. Senate).
Aleta said,

September 8, 2009 @ 10:59 am

Google started something from scratch when there was no need to. This seems to happen more often than not, projects started without the time or effort to see what has come before, and grow from that. GB is now experiencing problems that catalogers have been dealing with for years. Wouldn't it have been wonderful if they had sat down before starting and discussed meta data, what problems have come up in the past, and what they could do to avoid or quickly fix them upon implementation? They could have pushed the creation and use of meta data ahead decades! Instead, they are mired in the same problems that all catalogers have been aware of, and seem surprised. What I hear is the "Had we known…" Well, they should have known, if they had taken the time to research all the work catalogers have been doing and seen all the issues that we deal with daily.
David Cortesi said,

September 8, 2009 @ 12:37 pm

Both GN and Google greatly underestimate the power of crowdsourcing. As a longtime participant in Project Gutenberg Distributed Proofreaders (pgdp.net) and Galaxy Zoo (galaxyzoo.org) I have seen how effective and productive a vast horde of contributors can be, even when each is only lightly trained, even when any one actively participates only a few minutes a day. PGDP has corrected the OCR errors in tens of thousands of books. GalaxyZoo has yielded excellent science.

As a start at harnessing the crowd, Google could very easily expand the reader-comment form to contain check-boxes for catalog error:date, catalog error:subject, etc. That alone would quickly generate a stream of well-targeted exceptions for them to examine.

Such a stream of error reports might well be too voluminous to handle with paid employees; the solution would be to give the initial processing over to volunteers on the PGDP model. That is, allow people who care about books to look at and verify the reported errors.

At this stage you would apply the crucial methodological principle shared by PGDP and GalaxyZoo: multiple views on every item. PGDP forces every OCR page to go through six readings (and finds new errors at every stage). GalaxyZoo has each candidate galaxy categorized many dozens of times. Similarly, every Google Books error report could be examined by multiple reviewers, with only the ones that get a majority vote of "real error" being forwarded to the paid employee who can make the correction.

Beyond that, the same method could be applied to correlating and correcting catalog data. A program could merge the metadata records for a given title from all the available sources. When they agree, fine; when the program finds any apparent contradiction, the merged record could be forwarded to a queue to be examined by volunteer catalogers for analysis.
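To make the two steps above concrete, here is a minimal sketch in Python of majority voting over reviewer verdicts and of merging catalog records while flagging contradictions. The function names, source names, and sample records are hypothetical illustrations, not anything Google or PGDP actually uses.

```python
def majority_says_error(votes):
    """Return True when more than half of the reviewers marked the report a real error."""
    return sum(1 for v in votes if v) * 2 > len(votes)


def merge_records(records):
    """Merge per-source metadata dicts for a single title.

    Fields on which every source agrees go into `merged`; fields with
    contradictory values go into `conflicts`, to be queued for volunteers.
    """
    merged, conflicts = {}, {}
    fields = {field for rec in records.values() for field in rec}
    for field in sorted(fields):
        values = {src: rec[field] for src, rec in records.items() if field in rec}
        distinct = set(values.values())
        if len(distinct) == 1:
            merged[field] = distinct.pop()
        else:
            conflicts[field] = values
    return merged, conflicts


# Hypothetical sample data: three sources describing the same title.
records = {
    "library_a": {"title": "Leaves of Grass", "year": 1855},
    "library_b": {"title": "Leaves of Grass", "year": 1855},
    "vendor_c": {"title": "Leaves of Grass", "year": 1899},
}
merged, conflicts = merge_records(records)
```

Here the title is accepted automatically, while the disputed year lands in `conflicts` with each source's value attached, which is what a volunteer cataloger would need to see.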
Ann C. Davidson said,

September 8, 2009 @ 12:41 pm

As a cataloger of some 20 years' experience, I can attest to the fact, as some in the comments to Mr. Nunberg's post already have, that the Library of Congress, British Library, the National Libraries of Australia and Canada, and OCLC all have their fair share of errors, some as unintentionally funny as Google's. I take Mr. Orwant at his word that Google is serious about correcting its metadata problems, so please allow me to add my voice to those who have already suggested that Google institute an error-correcting mechanism through the submission of an online form, a practice that OCLC has used for a long time.

OCLC also recently expanded the ability of professional catalogers to submit corrections directly to errors found in bibliographic records in its database without recourse to the formerly cumbersome process of allowing only the simplest corrections to be added to its records online, while having to submit notice of more serious errors by faxing the recto and verso of the title page along with a form that had to be filled out by hand. Since Google Books already has scans of the chief sources of information for a work, may I suggest that Google explore a similar method by allowing libraries and catalogers, as well as members of the public, to submit metadata corrections using an online form if they so desire? The form should include a section asking the submitter to verify the level of their expertise (supplying an LC or OCLC institution code, for example, along with contact information). Google's internal metadata quality control personnel would thus be better able to ascertain whether the suggested correction is valid or not. This will also prevent some of the more obvious problems of the Wikipedia approach, which allows anyone to change an entry whether they possess the requisite subject matter expertise or not.
John Brice said,

September 8, 2009 @ 3:00 pm

To start with, I have worked over thirty years in libraries and have cataloged many books in my day. At our library we use Library of Congress cataloging and then modify it to meet our needs. Usually the LoC cataloging is fine, but about once a week or so there are obvious errors in the cataloging: sometimes the error is factual, sometimes it lies in a judgment call about how to classify, and sometimes the wrong subject heading is used.

That is the problem with all cataloging, or what this discussion calls metadata. Anyone can create metadata and store it on the Internet. Anyone can create local rules, and within those local rules the information is accurate; however, outside of the local context it is inaccurate. You put the same book in front of ten different catalogers and you can get different call numbers and subject headings. That is the problem with metadata: there is no quality control. The similar problems with Google's metadata can be found in any library catalog including Library of Congress, Harvard, Yale, University of Michigan, etc.

The question to ask today is not that Google's metadata is corrupt, but why do we need metadata? Metadata was created so books could be cataloged (a description of a book) and classified (how does this individual book fit in within a framework of a collection of items). Classification is done by assigning it a number (Dewey, LC, etc.) and/or subject headings, including Library of Congress, Sears, etc. All classification schemes have serious flaws in them. Dewey Classification has many editions, and a single book can be classified in many different areas depending on the edition of Dewey being used. Library of Congress subject headings still use many out-of-date terms, such as bile for gallbladder. The reason the classification schemes developed was that books were cataloged onto index cards, and only the broadest generalizations were made so they could fit onto the cards. The 19th-century catalogers couldn't index every word, so they generalized.

Why are we still using 19th-century analog information processes in the 21st century? There is very little need for metadata information. The only needs I see are date of publication, physical description and publisher. With today's information technology every word in every book can be recorded in a flick of an eye. If you want a book on a certain subject it should be possible to type in the terms you are looking for and what you're not looking for and have a list appear in front of you for your judgment and perusal. An index of every word in every book is a much more reliable resource than metadata developed over two centuries using changing standards with no quality control.

Having said that, I do believe that the storage, indexing and retrieval of the index of information should be an open process. I much prefer the Project Gutenberg model over the Google Open Book model for this reason. However, Google, and so far Google alone, has had the vision, funding and resources to create such a huge and usable resource. We should be thankful to Google for their epic efforts. I do, however, have severe reservations and believe that Google needs to create a more open process. Let's not criticize Google over inaccurate metadata; let's have a serious discussion on how libraries can access a digital repository in an open fashion and retrieve the data in as clean and pure a fashion as possible.

Melvil Dewey is now remembered as a great librarian, cataloger, and linguist (he is the person responsible for dropping "ue" from the word catalog). However, what is not well known today is that he had problems when he instituted the Dewey Decimal System at Columbia University. Prior to Dewey the books at Columbia were organized by size. When the collection was reorganized the professors were so incensed that they had Dewey fired. Melvil Dewey once said that "the one must be a library militant before being a librarian triumphant". Right now, from my perspective, Google is the library militant.

In other words, the library profession has failed to properly use 21st-century information technology. In order to rectify this situation there needs to be an effort to replicate what Google is doing in the library world. If the library world can create a usable online digital depository that has open standards and processes (transparency) then Google can do what they like. What we have now is not a failure of metadata but a failure of vision.
bruce said,

September 8, 2009 @ 8:55 pm

I'm satisfied with GB: it found 'Everyman his own Poet', Roberts' '41 Years in India' and some George Stevenson stuff right off. Beats paying for Questia.

I'm sure it has huge flaws. Libraries do. That's why we don't ever want a 'last library'.
Hugh said,

September 9, 2009 @ 2:29 am

I agree with Karen Coyle that it seems obvious that GB is not using cataloging information from the source libraries. It is bizarre that a Google spokesman would put forward as a serious reason for an error about an entry listing Freud as an author of an internet text that it arose out of a reliance on an Armenian union catalog. Not only is this goofy but it further suggests that Google doesn't have a system to weight the data they do use, or if they do it is really, really bad. I mean an Armenian catalog might be excellent in how it treats Armenian texts, it might be good on Russian texts, and possibly less good with regard to Turkic and Azeri ones. And who knows its treatment of English texts might be good too. But would you weight its data over that from various American/English cataloging sources on English language texts? And if it were the only source wouldn't you or your weighting algorithm flag it? The paradigm that Google is using for this project seems a lot like the one they employ in their search engine. It is meant to generate the maximum number of hits and list them by their popularity. As a rough and ready method where precision is not expected and bad or fruitless searches are common, it still gets many of us where we want to go. But for purposes of research where precision matters, you want to be able to define parameters and have a reasonably good expectation that the database you are using delivers all, and only, those data which fit the defined parameters. It isn't just that GB doesn't do this. It is more like it is antithetical to their whole philosophy and approach.
Ann C. Davidson said,

September 9, 2009 @ 1:03 pm

John Brice writes, "Anyone can create metadata and store it on the Internet. Anyone can create local rules, and within those local rules the information is accurate; however, outside of the local context it is inaccurate. You put the same book in front of ten different catalogers and you can get different call numbers and subject headings. That is the problem with metadata: there is no quality control. The similar problems with Google's metadata can be found in any library catalog including Library of Congress, Harvard, Yale, University of Michigan, etc.

"The question to ask today is not that Google's metadata is corrupt, but why do we need metadata?"

Mr. Brice then goes on to describe the two main types of metadata used in a typical library record: descriptive cataloging and classification (call numbers and subject headings). However, his chief criticisms of catalog metadata concern the classification and subject headings. He then states, "The only needs I see are date of publication, physical description and publisher."

Only!!!

If I understand his premise, he would toss out classification schemes and topical headings (perhaps 5 – 10% of a typical catalog record), and keep "only" descriptive cataloging (90 – 95% of the record). That's fine, as far as it goes, but getting that 90 to 95% of metadata right is precisely the point. Recall that Mr. Nunberg's opening salvo concerned date of publication issues. I would agree that Mr. Brice's "flick of an eye" or full-text search would certainly bring up results; the real question is how many of those results would be valuable. Anyone who has tried to comb through a Google search returning a million records knows of what I speak. Even the advanced search tools in Google seldom return a search that doesn't include a lot of garbage.

And by the way, the Library of Congress Subject Headings (LCSH) do not use "bile" for "gallbladder"; the term "gallbladder" has been available in the LCSH since 1986. LCSH, as with much else in cataloging, has long since moved to the digital environment, and is continually changing and, hopefully, improving.
Carol Seiler said,

September 9, 2009 @ 2:37 pm

I have been following this discussion with avid interest. I have been in libraries (practically all areas) for a number of years. Of late I have been a trainer specializing in cataloging. As such I find the metadata of extreme import.
The example I often use is of a local tea shop. I like to use loose tea instead of tea bags. A local shop sells loose tea, and it is very good loose tea. I wanted to go by after work to make a purchase but I did not know the shop’s name nor hours. I “googled” it. I searched every term I could think of and scrolled thru several pages of results. Remember, I am a librarian (a cataloger!) and persistent. I searched for about 15 minutes before giving up and calling a friend to learn the name of the shop. Once I knew the name I could find the shop via Google very easily. Why did it not appear with my searching on “loose tea” and location (city, general area within city, etc.)? Because their site was built in Flash without metadata. It is essentially invisible unless you happen to know the name of the shop which appears as the URL.
@John Brice – I see your point that the terminology we use in cataloging is generalist; however, I disagree that it should be ended. I think instead, there should be more specific metadata with a backup of a thesaurus (or “authority file”).
In the above discussion, I see the point that searching the full text produces too many false hits (setting aside the blatant errors in the data itself). Google’s success was in developing algorithms that weighted search results. Thank you Google! Prior to this, everything was weighted the same and hits came based on the number of times your term appeared in the data. For Google to succeed with the e-book project, I think they need to (1) repair the errors in the data (and yes, there will always be errors, but by using the suggestions above on better reporting and open editing Google can reduce the figure) and (2) utilize a backend thesaurus or authority file to help users (“you typed ‘canine’, do you mean ‘dog’?” and “you typed ‘cnaine’, do you mean ‘canine’?”). These, of course, will not make everything perfect but certainly will improve matters.
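As a rough illustration of point (2), the following Python sketch combines a tiny authority file with fuzzy matching from the standard library's `difflib`. The `AUTHORITY` table and the `suggest` function are hypothetical; a real authority file would be far larger and structured differently.

```python
import difflib

# Hypothetical authority file: variant term -> preferred heading.
AUTHORITY = {"canine": "dog", "dog": "dog", "feline": "cat", "cat": "cat"}


def suggest(term):
    """Return a 'do you mean' suggestion for a query term, or None if none applies."""
    if term in AUTHORITY:
        preferred = AUTHORITY[term]
        # A known variant maps to its preferred heading; a preferred term needs no hint.
        return None if preferred == term else preferred
    # Unknown term: treat it as a possible misspelling of a known one.
    close = difflib.get_close_matches(term, list(AUTHORITY), n=1, cutoff=0.6)
    return close[0] if close else None
```

With this table, `suggest("canine")` offers "dog" and `suggest("cnaine")` offers "canine", matching the two prompts described above.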
T.W. said,

September 10, 2009 @ 4:10 pm

The world may never know how & why Google managed to take books that were all nicely cataloged by the participating libraries, and completely fail to preserve that cataloging information. Google has digitized huge multivolume scholarly collections, but not the volume numbers!

I do recommend the newly updated Harvard online catalog —
http://hollisweb.harvard.edu/
It makes access to anything Google scanned from Harvard (a lot!) a breeze. But we shouldn't have to go around to 10 library catalogs when Google could and should assemble all the bibliographic information into its own proper catalog.

My best guess is that Google's creed that the world is made better by being able to search all of creation for little snippets blinded them to the fact that book readers are usually not looking for little snippets but for the book they want!

Meanwhile, we are in the ridiculous position of relying on amateur experts and outsiders to catalog Google's collection for them.
David Prager Branner said,

September 13, 2009 @ 8:57 pm

The only people who might be surprised at the misassignments are those who have never used large libraries before — they are crawling with miscatalogued materials. At least now it's easier to catch these errors — and perhaps correct them.

In the past, at the University of Washington especially, I've often been told, "Oh, we got the data from Library of Congress and there's nothing that can be done to correct it…"

Shine sunlight on the corrupt and moldy insides of those places!
John L said,

September 17, 2009 @ 1:31 pm

I have a different take on all this. Yes, Google is full of awful misinformation. But, if I can get access to the digitized book, I will do the reading and data extraction myself.

If I can. But early on, I discovered that several libraries were producing page after page of out-of-focus digital images, or fingers or even entire hands obscuring the text. Or, a horribly common failing, pages cut at the margins forcing you to guess what might have been there.

Some of these problems were "solved" by a Google message "Mark this page as unreadable".

Fine for the pages that have the problems I have mentioned. But what if a page or multiple pages are missing? There is no way to let Google know that anything is missing. I should say, no easy way. I discovered, after great effort, that it is possible to send messages to the Google Book Team, and often they will respond with something like "Thank you for letting us know about this problem". And sometimes, after many months, a problem will get solved.

But it is very frustrating if you are living somewhere where there is no major library and no inter-library loan available. Google is all we have, and I want to make it absolutely clear: I couldn't do my work without Google.

I only hope I live long enough for this resource to be reliable as well as useful.

So, although I find the discussion about Google mis-information about the contents of the books fascinating, I could live with it. But having the books themselves — with the information intact — seems far more important.
S.W. said,

October 2, 2009 @ 10:53 am

It seems to me that what we really need to be worried about is the point GN makes in passing near the beginning of the comments:

—-
GN: The thing of it is that up to now libraries have been giving you a whole lot more than you pay for, with governments and universities picking up the tab. But research libraries are under a lot of financial pressure, particularly in the current climate — and now that Google Book is out there, more people are saying they're basically duplicative: why spend all that money storing books?
—-

Google has said over and over that their digitized collections are in no way intended to replace actual libraries. So it's fine if there are literally millions of errors, illegible pages, missing illustrations, etc –because the researcher can always get the actual book from an actual library. If GB's metadata is lousy –no problem, just search WorldCat or the British Library Catalogue or the ESTC or some other research tool created with great effort and care by highly-educated professionals.

But what happens if funding for research libraries is cut drastically "because everything is on Google now"? Considering the struggling economy, the declining status of the humanities in our society, and the corporatization of universities, this is probably inevitable. So, while we should obviously try to push Google Books to improve, the future of research libraries is a much bigger concern.
T.W. said,

October 23, 2009 @ 12:08 pm

Someone above has written, "The similar problems with Google's metadata can be found in any library catalog including Library of Congress, Harvard, Yale, University of Michigan, etc."

This is absurd. Suppose I want to follow up a couple of basic references, say

Rudolf Hirzel. 1914. “Die Person.” Sitzungsberichte der Königlich Bayerischen Akademie der Wissenschaften: Philosophisch-philologische und historische Klasse, Abh. 10.

or maybe some particular volume with an edition of an old text (with its own author metadata) in some massive series like Monumenta Germaniae Historica or Patrologia Graeca.

If I'm using Harvard or Michigan's catalog, no problem. If I'm using Google's catalog, I find, well, it's not really a catalog. Two crucial points here:

1. Google got these books from libraries who had already compiled all the metadata and linked it to a bar code on the physical book! It is unforgivable that Google doesn't seem to have bothered to import MARC records for each scanned item and reproduce the librarians' cataloging information. (This makes all the cries of, "What can we do?" "Can we crowdsource a solution?" "What could Google have done?" spectacularly beside the point.)

2. Google's fundamental error, its Original Sin, was assuming we just want to find snippets of texts. They couldn't imagine the situation of someone who wants to do scholarship and find obscure books & articles to read as if their life depended on it. It is beyond obvious that they did not have a single Ph.D. student in a library-intensive discipline on their team to answer the basic question, "Are we completely travestying the basic idea of what a library catalog is supposed to do for the items within it?" In other words, full-text search is brilliant, but basic library-catalog functionality is equally important, and Google doesn't show any signs of having seen that.
aimstarathome said,

January 15, 2010 @ 2:35 pm

I do not think the idea of users editing error will lose all. Then. If payment has been approved and Google will allow the legal monopoly, they are only people who can access the orphans, poor or not.
aimstar4u said,

February 6, 2010 @ 6:41 am

I think that many providers may have thousands or millions of errors similar to these, but we do what we can to eliminate them because dates are being used to determine whether a book is in copyright or has entered the public domain. The misclassified or misdated books mentioned above are mostly limited view or no preview only, but I've found at least one misclassified book available in full view (as it was assumed to be a public domain book).
So you are not acknowledging that
1. most of the errors do not come from the providers.
2. many of them come about because you have refused to use the metadata available from the providers.
I hope that these errors will be eliminated as soon as possible.
Steve Johnson said,

April 9, 2010 @ 6:30 pm

Google's "no preview" of countless works in the public domain is another variation on the train wreck theme. How can Google reasonably claim uncertainty about the copyright status of US government publications issued more than a hundred years ago? How can Google claim to be an authority on the copyright status of the works in its collection?
Sam@ Learning Language said,

June 29, 2010 @ 9:40 am

"Kip W said: Mae West can't be a religious 'icon' — she's no angel!"

Hey Kip, don't be so hard on her, I bet 20 years from now even Madonna could be a 'religious icon' ;)
Mike O'Malley said,

October 25, 2010 @ 11:38 am

This may simply be naive of me, but isn't it possible to just say "the hell with metadata?" I used to use LOC subject headings to get me in the ballpark; then I'd start reading. Now I search for terms, words and phrases across all the metadata: The ability to ignore the metadata is in fact one of the most creative and fruitful things about digital media. In my own research, in fact, metadata is kind of the thing I want to escape.

I don't want to be glib or superficial here. I can certainly agree with the need for accurate metadata but can someone explain to me what I'm not getting here?
Ryan Shaw said,

October 25, 2010 @ 3:20 pm

@Mike O'Malley:

Mike, your post is puzzling. You begin by declaring, "the hell with metadata," yet you conclude by agreeing that we need accurate metadata. Which is it?

Furthermore, you claim in the same breath to search across all metadata and to achieve creative freedom by ignoring it. Again, which is it? Searching across metadata isn't ignoring it or escaping it, it's using it.

Metadata is just data. Sure, one can ignore it, just as I can ignore the blathering of art historians because what's important to me is my numinous experience of the artifact itself. (The art historians' contextualization is the kind of thing I want to escape.) But if I were designing the World's Last Art Museum, I wouldn't decree "the hell with art history."
Mike O'Malley said,

October 25, 2010 @ 6:39 pm

It's to hell with it, but I want to be educated as to why I might be wrong.

If I enter a search query into Google Books, the metadata–which I take to be subject classification, place of publication, author, date–is of very little importance to me. I want to know these things, but I can get them from looking at the title page, which is what I did with the paper book. What's most valuable about digital searching is finding connections that cross subject categories, the kinds of questions which LOC metadata impeded. I'd like the metadata to be accurate, but it's very low on my list of desires from Google Books. So what am I missing?

I'm not sure that the world's last library needs the metadata devised for the world's first library, but I'm willing to listen.
Ryan Shaw said,

October 25, 2010 @ 10:39 pm

@Mike O'Malley:

Consider the scholar who wishes to obtain the text of all books published between 1889 and 1907 by British publishing houses, so that she can do a statistical analysis of the text. Unless the publishing metadata is correct, she cannot do that. Simply searching for the names of the publishers or the years 1889-1907 will not cut it, since a book published in 1972 could mention a publisher, etc. etc.

Regarding finding connections that cross subject categories: you assume that all books that can be connected across subjects use the same vocabulary, and thus will be retrieved by a full-text search. That is not always the case, especially when searching across long historical periods, as vocabulary changes. In some cases, assigned subject headings and the connections between them can help find connections that full-text search cannot. They are very far from perfect, but they're an additional tool that we should hesitate to simply throw away.
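The first point can be shown with a toy example in Python. The `Book` class, its fields, and the two sample records are hypothetical and stand in for whatever catalog schema is actually available; the sketch only contrasts filtering on publication metadata with matching strings in the text.

```python
from dataclasses import dataclass


@dataclass
class Book:
    title: str
    year: int      # publication year, from the catalog record
    country: str   # country of publication, from the catalog record
    text: str      # full text of the book


def by_metadata(books, start, end, country):
    """Select books whose catalog record falls in the date range and country."""
    return [b for b in books if start <= b.year <= end and b.country == country]


def by_full_text(books, term):
    """Select books whose text merely mentions the term."""
    return [b for b in books if term in b.text]


# Hypothetical sample corpus.
books = [
    Book("A", 1895, "GB", "Printed in London."),
    Book("B", 1972, "US", "A history of London publishing houses, 1889-1907."),
]
```

Filtering with `by_metadata(books, 1889, 1907, "GB")` returns only book A, whereas `by_full_text(books, "1889")` returns only book B, the 1972 book that happens to mention the period. The metadata query is only as good as the `year` and `country` values, which is why their accuracy matters.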
John Cowan said,

December 16, 2010 @ 7:49 pm

Very late comment: The n-gram engine from the corpus I mentioned above is now available at http://ngrams.googlelabs.com; try it out. Some of the n-gram files themselves can be downloaded there as well; Google is staging their release. As I predicted, the whole corpus isn't available; unfortunately, the whole metadata isn't available either. The research article is available at Science Express (paywall); it says that the corpus contains just over five million books, about 4% of all books ever published. (In this domain, the notion of "the whole shebang" is just meaningless: nobody has it, nobody will have it.)
Karen Patrick said,

January 16, 2014 @ 7:13 am

In the beginning, there was Google Books. Well, not exactly. But one can certainly argue that the project is as old as Google itself.
