Language Log

Serial improvement

August 31, 2009 @ 2:32 pm · Filed by Mark Liberman under Computational linguistics

Although I share Geoff Nunberg's disappointment in some aspects of Google's metadata for books, I've noticed a significant — though apparently unheralded — recent improvement. So I decided to check this out by following up Bill Poser's post yesterday about insect species, which I thought was likely to turn up an example of the right sort. And in fact, the third hit in a search for {hemipteran} is a relevant one: Irene McCulloch, "A comparison of the life cycle of Crithidia with that of Trypanosoma in the invertebrate host", University of California Publications in Zoology, 19(4) 135-190, October 4, 1919.

This paper appears in a volume that is part of a serial publication. And until recently, Google Books routinely gave all such publications the date of the first in the series, even if the result was a decade or a century out of whack.

But no longer.

True, Dr. McCulloch's article is categorized as "Juvenile Nonfiction"… But the date is correctly given as 1919, despite the fact that the series in question began in 1902. The volume in which McCulloch's article was published contains articles dated 1919 and 1920, and the treatment of the 1920 portions is a bit variable:

Still, this is a big step forward over the situation a couple of years ago. See for example the discussion in "Shack!", 7/23/2007, where I observed that Google Book Search misdated the Nov. 1956 issue of Boeing Magazine "as 1934, following its usual unfortunate practice of dating all issues of a serial in terms of the earliest issue".

The relevant page now comes up correctly dated:

And in general, serial publications now seem to be given the date of the bound volume that was scanned, rather than the date of the first publication in the series. This still leaves many errors in the date fields of Google Books' metadata. In fact, I guess it's possible that some of the errors that Geoff found (and that anyone can turn up in a few seconds of searching) were actually created by a process for inferring publication dates from OCR output, rather than relying on whatever catalog information was previously giving them the start date for all subsequent issues of a periodical or serial publication.

Overall, it would be nice if Google were a bit more open about what's going on with their metadata for books — the successes as well as the messes — but they deserve credit for this success, even if the solution created some additional messes.

12 Comments

Dan T. said,

August 31, 2009 @ 5:27 pm

I think there are plenty of grounds to think positively about the prospects for gradually improving metadata. The earlier entry here mentioned the premise that it's unlikely that the massive task of scanning all of these books will be done more than once, so the scans Google is doing now will have to suffice for researchers for all eternity; even if that's true, it doesn't mean that the metadata won't be steadily improved, since that part of the task can be done without the physically difficult labor of re-scanning the books. It only takes some gradual tweaking of metadata by human intervention, and/or improvement in the automated algorithms that determine this data, in some combination; and this process can continue to occur over the years, decades, and centuries to come, making use of every new advance in artificial intelligence or human effort organization (e.g., Wikipedia-style crowdsourcing).
Garrett Wollman said,

August 31, 2009 @ 10:15 pm

I found it fascinating (in the train-wreck sort of way) the way so many people in your description of the meeting seemed to be saying "we trust Google, but…". Leaving aside the question of whether Google should be trusted in general, the widespread botched metadata suggests to me that maybe this was not a task that they *should* have been trusted with. (Another important bit of our heritage that Google appears to have totally botched, and shows no evident interest in fixing, is the Usenet archive they got when they bought DejaNews. This archive pretty clearly *can't* be replicated by anyone else, since Google owns the only surviving copies of much of that early data.)
Jon Orwant said,

September 1, 2009 @ 1:24 am

Let me apologize in advance for the length of what follows. I manage the Google Books metadata team, and in concert with technical lead Leonid Taycher and our metadata librarian Kurt Groetsch, we'd like to respond to Geoff's post. I'll explain why we display the metadata we do, but before I dive in I'd like to make a few broad comments.

First, we know we have problems. Oh lordy we have problems. Geoff refers to us having hundreds of thousands of errors. I wish it were so. We have millions. We have collected over a trillion individual metadata fields; when we use our computing grid to shake them around and decide which books exist in the world, we make billions of decisions and commit millions of mistakes. Some of them are eminently avoidable; others persist because we are at the mercy of the data available to us. The quality of our metadata now is a lot better than it was six months ago, and it'll be better still six months from now. We will never stop improving it.

Second, spare a thought for what we are trying to do. An individual library has the tough goal of correctly cataloging all the books in its collection, which might be as many as 20 million for a library like Harvard. We are trying to correctly amalgamate information about all the books in the world. (Which numbered precisely 168,178,719 when we counted them last Friday.) We have a cacophony of metadata sources — over a hundred — and they often conflict. A particular library only has one set of cataloging practices to deal with (sometimes more, since cataloging practices inside a library change over the years and decades), and we have to dynamically adapt to every library, every union catalog, every commercial metadata provider.

Now, I'd like to go through Geoff's post point by point. Researching his observations over the past 48 hours has brought us face to face with a lot of metadata errors — some Google's, others external, and I'd like to thank him for going to the effort. Where the error was ours I will admit it. Where the error was not ours, I will describe the source but not name it; there's entirely too much finger-jabbing in the world for my taste. Where we discover systemic errors in external metadata, we try to notify the metadata provider so that they can correct the errors in their own database and avoid polluting other metadata customers.

Geoff begins by underscoring the importance of getting metadata right. No argument here. I wouldn't call Google Books "the Last Library" — we are not a library, and rely on brick-and-mortar libraries and flesh-and-blood librarians to practice genuine librarianship — but eagerly acknowledge that it's critical to properly curate the collection we have. Without good metadata, effective search is impossible, and Google really wants to get search right.

In paragraph three, Geoff describes some of the problems we have with dates, and in particular the prevalence of 1899 dates. This is because, as I said in my earlier post, we recently began incorporating metadata from a Brazilian metadata provider that, unbeknownst to us, used 1899 as the default date when they had no other. Geoff responded by saying that only one of the books he cited was in Portuguese. However, that metadata provider supplies us with metadata for all the books they know about, regardless of language. To them, Stephen King's Christine was published in 1899, as well as 250,000 other books.

To which I hear you saying, "if you have all these metadata sources, why can't the correct dates outvote the incorrect ones?" That is exactly what happens. We have dozens of metadata records telling us that Stephen King's Christine was written in 1983. That's the correct date. So what should we do when we have a metadata record with an outlier date? Should we ignore it completely? That would be easy. It would also be wrong. If we put in simple common sense checks, we'd occasionally bury uncommonly strange but genuine metadata. Sometimes there is a very old book with the same name as a modern book. We can either include metadata that is very possibly wrong, or we can prevent that metadata from ever being seen. The scholar in me — if he's even still alive — prefers the former.

This Brazilian provider is an extreme, but we've learned the hard way that when you're dealing with a trillion metadata fields, one-in-a-million errors happen a million times over. We've special cased this provider so that their 1899 dates — and theirs alone — are ignored. You should see the improvements live on Google Books by the end of September.
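The outvoting and special-casing described above can be sketched in a few lines. This is an illustration only, assuming a simple majority vote; the provider names, the `PLACEHOLDER_YEARS` table, and `resolve_year` are hypothetical and say nothing about how the Google Books pipeline is actually written.

```python
from collections import Counter

# Hypothetical table: provider id -> the year that provider uses as a
# placeholder when it has no real date. All names here are invented.
PLACEHOLDER_YEARS = {"provider_br": 1899, "provider_nj": 1905}


def resolve_year(records):
    """Pick a publication year from (provider, year) pairs.

    A placeholder year is dropped only for the provider known to emit it,
    so a genuine 1899 date from any other source still counts as a vote.
    Returns (winning_year, sorted_outlier_years); outliers are reported
    rather than silently discarded.
    """
    votes = Counter(
        year
        for provider, year in records
        if PLACEHOLDER_YEARS.get(provider) != year
    )
    if not votes:
        return None, []
    winner, _ = votes.most_common(1)[0]
    outliers = sorted(y for y in votes if y != winner)
    return winner, outliers


records = [
    ("library_a", 1983),
    ("library_b", 1983),
    ("retailer_c", 1983),
    ("provider_br", 1899),  # placeholder, ignored for this provider only
    ("library_d", 1984),    # minority date, kept as an outlier
]
print(resolve_year(records))  # → (1983, [1984])
```

Keeping the outliers in the return value reflects the trade-off in the text: an odd date is outvoted, but it is not prevented from ever being seen.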

Paragraph four claims that these errors are widespread. Again, no disagreement here. But in our defense, let me explain where these errors came from. The 1905 date for the Drucker book was courtesy of a New Jersey metadata provider, which used 1905 in the same way that the Brazilian provider used 1899 (this, by the way, is a large part of the reason why there are so many books purportedly mentioning "Internet" prior to 1950). The 1900 Virginia Woolf date came from a British union catalog that has multiple MARC 260.c fields, some with the correct date and some without, but the 1900 field also occurs in the record's MARC 008 field. A time-traveling Tom Wolfe wrote The Bonfire of the Vanities in 1888 rather than 1988 because one of our all-too-human humans miskeyed the date. Henry James wrote What Maisie Knew in 1848 rather than 1897 because a French union catalog tells us so. Four bad dates, four different causes.
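For readers unfamiliar with MARC, the kind of disagreement in the Woolf example can be detected mechanically. The sketch below relies on two facts about the MARC 21 bibliographic format: character positions 07-10 of the 008 field hold "Date 1", and 260 subfield c holds the publication date as free text. Everything else (the function names and the sample record) is made up for illustration and is not taken from any real catalog.

```python
import re


def year_from_008(field_008):
    """Return Date 1 (positions 07-10 of MARC 008) as an int, or None."""
    date1 = field_008[7:11]
    # Partially unknown dates such as "19uu" do not count as a year.
    return int(date1) if re.fullmatch(r"[0-9]{4}", date1) else None


def years_from_260c(values):
    """Extract four-digit years from 260 $c strings like 'c1925.' or '[1900?]'."""
    years = []
    for value in values:
        years.extend(int(m) for m in re.findall(r"(?<!\d)\d{4}(?!\d)", value))
    return years


def date_conflict(field_008, values_260c):
    """True when the 260 $c years are not exactly the single 008 year."""
    y008 = year_from_008(field_008)
    y260 = set(years_from_260c(values_260c))
    return y008 is not None and bool(y260) and y260 != {y008}


# Invented sample record: 008 says 1900, while 260 $c carries two dates.
sample_008 = "850101s1900    enk           000 0 eng d"
sample_260c = ["1925.", "[1900]"]
print(date_conflict(sample_008, sample_260c))  # → True
```

A check like this only flags the record; deciding which date is right still needs another source or a person.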

Let's turn to Dickens. Geoff points out 182 hits for Chas prior to his birth year of 1812. I hope I won't be thought cavalier for not listing each case, instead focusing on the top hit: a British library associated a barcode for a 1740 book, Histoire de l'Académie Royale des Sciences, with the bibliographic record for Household Words. Regrettably, Geoff missed an even better chance to poke fun at us: we date one edition of A Christmas Carol from a shockingly pre-Gutenberg 1135. While I personally believe that some Dickensian themes are timeless, I wish this British union catalog had left that record out altogether. (And it's a different British union catalog than the one I mentioned earlier.)

In paragraph eight, Geoff cites Dan as saying that the erroneous dates were all supplied by the libraries. I wasn't at the conference, but would be astonished if Dan said exactly that. He knows that not all our metadata sources are libraries, and he knows that Google sometimes inadvertently introduces its own errors. This seems like the sort of comment that is easily misheard: maybe someone said "Hey, you've got a lot of metadata errors", Dan replied that "there are a lot of errors in library catalogs", and it's interpreted as Dan asserting that libraries are to blame for all the errors visible on Google Books. I talk to Dan all the time about metadata and can assure you that he has a thorough understanding of the problems. Sometimes nuance and complexity get lost on stage.

Geoff also suggests that "most of the misdatings are pretty obviously the result of an effort to automate the extraction of pub dates from the OCR'd text." However, we don't extract publication dates from OCR. Every misdating came from a human — some inside Google, most outside. Where the misdates come from the frontmatter (e.g., the frontispiece or the title page, as in the two examples Geoff cites) the error is more likely to have been a person inside Google. We are investigating the best ways to fix these — through better training for those people, through automated ways to identify the errors, and maybe someday through user-supplied metadata corrections.

Now I'll turn, as Geoff did, to classification errors. There have been many attempts to create ontologies of book subjects, such as the Dewey Decimal System that Americans learn about in grade school, the fine-grained Library of Congress classifications, or the retailer-friendly BISAC categories. Geoff identifies a number of absurd subject classifications that we display.

First, he points out that the 1891 Century Dictionary and The American Language are classified as "Family & Relationships". He is correct and this is our fault. When we lack a BISAC category for a book, we try to guess one. We guess correctly about 90% of the time and Geoff's comments prompted the engineer responsible to suggest some improvements that we will roll out over the coming months. I would be more specific, but he suggested a few different approaches, and we're not yet sure which to take. (In case you're wondering why such an absurd subject appears, it's because the full inferred subject category is "Family & Relationships / Baby Names" and the actual library-supplied subject is "Names/US". I'm not trying to excuse the mapping, just explain it. A similar mistake causes us to classify Speculum as "Health & Fitness".)

In contrast, the edition of Moby Dick identified as being about computers is the fault of a Korean commercial metadata provider. The Mae West biography ostensibly about religion (the jokes just write themselves, don't they?) is from a North Carolina commercial metadata provider. Ditto for The Cat Lover's Book of Fascinating Facts falling under "Technology & Engineering". Geoff identifies a topology text (I assume this is Curvature and Betti Numbers) as belonging to Didactic Poetry; this beaut comes to us from an aggregator of library catalogs. Perhaps the subject heading "Differential Geometry" was next to it in an alphabetic list, and a cataloger chose wrong.

Geoff adds, "And a catalogue of copyright entries from the Library of Congress listed under 'Drama' — though I had to wonder if maybe that was just Google's little joke."

Hey now. We would never ingest our own mirth into metadata records. There's too much there already. Like the time one of our partner libraries supplied us with a catalog record for a turkey baster. Not a book about turkey basting. An actual turkey baster, presumably to be found in the stacks. One European library classified Darwin's Origin of Species as fiction. And there's a copyright record for a book that has no writer, only a psychic who received the text "clairaudiently."

But this one we got right — take a look at the actual book, which is in full view. It is for Class D copyrights, which include dramatic works.

I'll skip over the misclassifications for Tristram Shandy and Leaves of Grass, although if anyone is interested, let me know and I'll dive in. As with other examples in Geoff's post, it's a mix of our mistakes and others' mistakes. But I do want to explain why one edition of Leaves of Grass is identified as "Counterfeits and Counterfeiting." It's because a library cataloger decided that was the appropriate subject for a pirated book. That was picked up by an aggregator of library catalogs (in the MARC 650 field) and you can find it online under that subject heading if you know where to look.

An Australian union catalog holds that Jane Eyre is about governesses; a Korean commercial provider claims it's about Antiques & Collectibles. We suspect that the prevalence of Antiques & Collectibles for some classic editions derives from a cataloger's conflation of a particular item's worth ("that first edition is a real collectible!") with the subject classification for the edition. The architecture subject heading was our fault.

While it's true that BISAC didn't exist when many of these books were published, it's not the case that Google necessarily invented the BISAC classifications for them. Sometimes we did, but often commercial metadata providers (not publishers or libraries) provided them, for the benefit of retailers.

Geoff asks why we decided to infer BISAC subjects in the first place. There is only one reason: we thought our end users would find it useful. As I mentioned above, we estimate that we get it right 90% of the time. I hear loud and clear from Geoff that 90% is not enough. Is 95%? 99.9%? Tell us what you think. If the accuracy needed is in excess of what we can provide, we'll simply stop inferring BISAC subjects and chalk it up to a failed experiment.

The 1818 Théorie de l'Univers links to Barbara Taylor Bradford's Voices of the Heart because of a barcoding error (ours) while Dickens' Household Words linking to the Histoire de l'Académie Royale des Sciences was another barcoding error (the library's). When Supervision and Clinical Psychology links to American Politics in Hollywood Film, that's because two books scanned one after the other, with no boundary in between — again, our fault.

Madame Bovary was written by Henry James and not Flaubert according to an aggregator of library catalogs; they could tell you which library made the original cataloging error.

Geoff says, "More mysterious is the entry for a book called The Mosaic Navigator: The essential guide to the Internet Interface, which is dated 1939 and attributed to Sigmund Freud and Katherine Jones. My guess is that this is connected to Jones' having been the translator of Freud's Moses and Monotheism, which must have somehow triggered the other sense of the word mosaic, though the details of the process leave me baffled." The explanation behind Mosaic is more prosaic: an Armenian union catalog got it wrong, and we believed them.

Commenter Brandon said that he found a theology book for which we listed the author as "Holy Trinity." Here it is. That is direct from the library catalog, and you can find it online if you search for it. The best part is that Holy Trinity is listed in their metadata record as the corporate name. Which means that the actual author was a contractor.

Geoff says, "I understand [Google] hasn't licensed [library records] for display or use — hence, presumably, the odd automated stabs at recovering dates from the OCR that are already present in the library records associated with the file." First, as mentioned above, we do not recover dates from OCR — all publication dates come from external metadata sources or from occasionally fallible humans. Second, we certainly do use and display metadata from library records. (We don't display the raw library metadata because of a contractual obligation with a library catalog aggregator that forbids us from doing so. Given how they pay their bills, this is understandable.)

Now, if anyone is still with me after all that, I'd like to address Geoff's broader point about Google's intentions. It is hard to figure out how to answer this. We're committed to get metadata right, but you shouldn't take my word for it: promises are sometimes broken, and good intentions are never enough. So let me talk a little about how things work inside Google: we measure everything. Internally, we have a number of different ways to measure our metadata progress, and we measure ourselves by how much improvement we make. That's an approach that's worked well for Google in other areas, so perhaps it can be externalized: if you care about our metadata, come up with your own measure and track it over time. I'm confident that while there may be occasional quality dips (say, when we get data from a provider that misdates 250,000 books), the trend will be positive. It may not be as fast as you'd like, but if it's any consolation, it won't be as fast as we'd like either.

Finally, Geoff's efforts will have singlehandedly improved nearly one million metadata records in our repository once the code changes that his blog post inspired wend their way through our systems. While I winced at times reading his message and the conclusions he drew about our intentions and abilities, I can't deny that he's done Google a great service via his research. So: thank you, Geoff.

At the beginning of this message I mentioned three of the people most deeply involved with our metadata efforts: the technical lead, our metadata librarian, and myself. Against my better judgment, my colleagues insisted that they include their email addresses here: the technical lead's is his first name concatenated with "+metadata@google.com"; our librarian's is his first initial and last name concatenated with "+metadata@google.com". And I won't let them suffer alone: my email address is my last name "+metadata@google.com". (By the way, I know the "+metadata" isn't fooling anybody, but it's a neat trick — available with every gmail account — that makes filtering email easy.)
Jon Orwant said,

September 1, 2009 @ 1:42 am

Aw, man. After typing all that, I realize that I meant to post it here instead! Maybe we shouldn't be entrusted with any metadata at all if we can't handle simple blog comments.
Jon Orwant said,

September 1, 2009 @ 1:55 am

Mark, since I bollixed my earlier post, let me just thank you on behalf of the Google Books metadata team for noticing that (as of a few weeks ago) we've started to include serial and set information. More to come…
Anonymous said,

September 1, 2009 @ 10:30 am

And there's a copyright record for a book that has no writer, only a psychic who received the text "clairaudiently."

The Koran? <gd&rlh>
Dan T. said,

September 1, 2009 @ 11:55 am

Who should the Book of Mormon be credited to?
Coby Lubliner said,

September 1, 2009 @ 2:17 pm

I wonder if Google's serial dating – wait, that can have another meaning – never mind – if Google's practice in dating serial publications is based on IMDB's similar (but perfectly reasonable) practice of dating TV series by the date of the first episode.
Dan T. said,

September 1, 2009 @ 7:17 pm

Wikipedia sometimes disambiguates things of the same title by date, which in the case of single works like books and movies uses the actual release year, but for serial things like periodicals or TV series, uses the starting date. Thus, you have Liberty (1987) for the magazine named "Liberty" which began in 1987, to distinguish it from some other publications of the same title of different eras. That magazine is still being published now, so the 1987 date doesn't give the publication date of everything ever published in that serial.
language hat said,

September 2, 2009 @ 10:12 am

Jon: Many thanks for your comment. It must have been hard to write without the slightest hint of defensiveness, let alone belligerence; nobody likes being criticized as harshly as you guys have been, and as one of the harsher critics (and as someone given to sometimes unhelpful belligerence in online exchanges), I am impressed (and suspect you rewrote it more than once).

The reason my comments about Google Books have gotten harsher over the several years I have been complaining (which I was at first reluctant to do, because Google Books has improved my life so greatly, both personally and professionally in my capacity as copyeditor/factchecker) is that I have seen no sign that Google even acknowledged the problem beyond a cavalier "Well, of course there are the occasional glitches, we have a lot of data to deal with, we're working on it, now shut up and eat your gruel." I can't begin to tell you how much good it does me to hear you say "Yes, there's far more bad data than there should be and much of it is our fault, we appreciate your criticism and are taking it into account, and here's how." For the first time I feel that the people in charge there are taking the problem seriously and really doing something about it. So don't rue the time you spent crafting that comment — it was time well spent.
J. W. Brewer said,

September 2, 2009 @ 12:10 pm

@ Dan T. etc.: The meta-metadata problem about authorship (by which I mean that getting Google Books' coding to accurately reflect what the book jacket, title page, etc. themselves assert about the work does not solve all problems) is not confined to non-secular situations. For example, should the "author" of Profiles in Courage be coded as John F. Kennedy or Pierre Salinger? There are lots of other situations in which various social conventions or "polite fictions" about authorship may have undesirable real-world consequences. For example, until the Beatles finally broke up, songs were almost universally labelled as co-written by Lennon & McCartney even though after a certain point they were mostly written exclusively by one or the other – with what is by now a substantial scholarly/fan consensus on who actually wrote what. But under the copyright law in many jurisdictions a song written by Lennon alone will fall into the public domain at-least-28-years-and-counting earlier than one co-written by Lennon & McCartney, so there will in the fullness of time perhaps be an interesting question as to whether courts will look behind the "title page" to the actual historical reality.

A similar meta-metadata problem is that the "cover dates" for many issues of periodicals do not accurately reflect either actual publication dates or finalization-of-text dates, even though one or the other of the latter is what one would actually be interested in for any sort of historical research. The same I think may be true to some extent for books, with new titles coming out around Thanksgiving/Christmas sometimes claiming on their face to have been published in the next calendar year.
Dan T. said,

September 2, 2009 @ 1:25 pm

In the comic book field, it wasn't until the 1970s that the use of correct and complete credits for the writing, artwork, and other creative works was standard; before that, many comics were published with corporate anonymity, and others had incomplete, incorrect, and/or inconsistent credits, with perhaps a signature of an artist's name or nickname somewhere in a panel of the first page of the story, or perhaps a credit line mandated contractually that didn't necessarily reflect reality (for a long time, all Batman stories were required by contract to say "by Bob Kane" even though Kane rarely had anything to do with writing or drawing them beyond an early stage of the character's history, and is now not believed to be the sole creator even at the beginning). All Disney-owned comics had "Walt Disney's…" prominently displayed even long after Mr. Disney's death, and Marvel comics used to have "Stan Lee Presents…" at the top of the title page, persisting after Mr. Lee stopped being the editor in chief there. When there were any credits at all, they were most likely to be of the artist, with writers rarely getting credited unless they were famous, like some science fiction authors like Ray Bradbury who sometimes wrote stories adapted by EC Comics in the 1950s (but did they actually write the comic book adaptation, or did some anonymous writer adapt it from the original prose story?) There is much comic-geek activity in determining and cataloging actual creators of comic stories.
