Language Log

Google Scholar: another metadata muddle?

September 26, 2009 @ 12:31 pm · Filed by Geoff Nunberg under Computational linguistics, Language on the internets

Following on the critiques of the faulty metadata in Google Books that I offered here and in the Chronicle of Higher Education, Peter Jacso of the University of Hawaii writes in the Library Journal that Google Scholar is laced with millions of metadata errors of its own. These include wildly inflated publication and citation counts (which Jacso compares to Bernie Madoff's profit reports), numerous missing author names, and phantom authors assigned by the parser that Google elected to use to extract metadata, rather than using the metadata offered them by scholarly publishers and indexing/abstracting services:

In its stupor, the parser fancies as author names (parts of) section titles, article titles, journal names, company names, and addresses, such as Methods (42,700 records), Evaluation (43,900), Population (23,300), Contents (25,200), Technique(s) (30,000), Results (17,900), Background (10,500), or—in a whopping number of records— Limited (234,000) and Ltd (452,000).
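To make the failure concrete, here is a minimal Python sketch of the kind of sanity check that would catch such phantom authors. It is only an illustration: the stoplist is taken from the examples quoted above, and the function name and sample record are hypothetical, not anything Google or Jacso describes.

```python
# Hypothetical stoplist built from the phantom "authors" quoted above.
PHANTOM_AUTHORS = {
    "methods", "evaluation", "population", "contents",
    "technique", "techniques", "results", "background",
    "limited", "ltd",
}


def flag_phantom_authors(authors):
    """Return the author strings that match a known phantom name."""
    flagged = []
    for name in authors:
        # Normalize: drop surrounding whitespace and trailing periods, ignore case.
        key = name.strip().rstrip(".").lower()
        if key in PHANTOM_AUTHORS:
            flagged.append(name)
    return flagged


# Made-up author list for a single record, for illustration only.
record_authors = ["P Jacso", "Ltd.", "Methods"]
print(flag_phantom_authors(record_authors))
```

A stoplist this small would miss most parser errors, of course; the point is only that even a trivial check against section titles and company suffixes would flag the cases Jacso counts.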

What makes this a serious problem is that many people regard the Google Scholar metadata as a reliable index of scholarly influence and reputation, particularly now that there are tools like the Google Scholar Citation Count gadget by Jan Feyereisl and the Publish or Perish software produced by Tarma Software, both of which take Google Scholar's metadata at face value. True, the data provided by traditional abstracting and indexing services are far from perfect, but their errors are dwarfed by those of Google Scholar, Jacso says.

Of course you could argue that Google's responsibilities with Google Scholar aren't quite analogous to those with Google Books, where the settlement has to pass federal scrutiny and where Google has obligations to the research libraries that provided the scans. Still, you have to feel sorry for any academic whose tenure or promotion case rests in part on the accuracy of one of Google's algorithms.

9 Comments

mollymooly said,

September 26, 2009 @ 12:36 pm

I'm gonna change my name to Molly Limited and get myself tenure.

Rosie Redfield said,

September 26, 2009 @ 2:04 pm

Where accuracy matters, most academics have access to the Web of Science, whose citation data is probably much more solid. At least, when I look at the citations it lists, they're real. Google Scholar, on the other hand, once listed as 25 separate citations the 25 job application letters a former student of mine had created on a web-accessible server, because each included the citation info for that paper.

Sili said,

September 26, 2009 @ 2:24 pm

What is this? Arrogance, or just ignorance? To me it smells of the former: our software is unsurpassed. So what if the (meta)data is already there, ready to be used, we can do better ourselves. So neener-neener!

language hat said,

September 26, 2009 @ 2:26 pm

I have to agree with Sili, but I hope someone from Google will show up and offer the case for the defense. Seriously, Google, just because we all love you for good and sufficient reasons doesn't mean that you can pull stuff like this with impunity.

Gregory Murphy said,

September 27, 2009 @ 2:06 am

I get 454,000 hits on "ltd". I sampled a few dozen results, and it looks like in many cases the legitimate author's name is correctly recorded in the metadata. The problem is that, for search purposes, the "author" field appears to cover just about everything found in the frontispiece or byline or whatnot. Even stranger is that this seems to include examples from Google Books, where the metadata for author are limited to what looks to me to be just the person who wrote the book.

Kenny Easwaran said,

September 28, 2009 @ 7:54 pm

Rosie – Web of Science may be great for people working in the sciences, but JSTOR and the like don't seem to have similarly useful features for those working in the humanities. And even in the sciences, I'm sure that some fields have better or worse coverage.

J. Spenader said,

October 6, 2009 @ 3:22 pm

Agree with Kenny's point. Web of Science doesn't actually even index a number of important language research journals (somewhat objectively defined as A-list journals on the ERIH-list Linguistics from the European Science Foundation, which categorizes journals as A, B, or C). It mostly lists psycholinguistic and neurolinguistic journals only. And it doesn't index conference proceedings, which in some fields are very important (comp ling, AI, etc.). So Google Scholar is at least one alternative for many language researchers to see what kind of impact a conference paper or even a journal paper had.

Zach said,

October 7, 2009 @ 10:41 am

I actually ran into exactly this problem last month, as I was finishing up my dissertation and trying to get my BibTeX files in order to generate the bibliography. Agree that Google Scholar's metadata are worse than useless, not only because the author line is generally gibberish, but also because journal names are often truncated. I had to clean up dozens of entries by hand before I discovered that ADSABS and PubMed actually do have clean metadata, and it's very easy to locate articles by DOI on one or the other of those two services.

Don't know about the humanities, though.

Vacilando said,

February 12, 2011 @ 9:40 am

For those interested, here is a story of an encounter with phantom authors, doubts about the "Author" meta tag relevance, and more pressure on Google to start putting things in order… http://vacilando.net/node/411418
