All hail the Hathi Trust

Anyone who has ever tried to use Google Book Search for serious historical research has had to grapple with its highly frustrating limitations. I've griped about the situation on several occasions (here, here, here, here). The problem is twofold: GBS is plagued by inaccurate or misleading dating, particularly for serial publications, and it does not offer full page images even for many works that are clearly in the public domain (namely, pre-1923 US works and noncopyrightable government publications). Many of us have been patiently waiting for Google to ease up on its viewing restrictions, which would simultaneously ameliorate the dating problem: if you can skim through page images, then you can determine if the year that Google gives you in the metadata is actually correct.

Help is on the way — but not from Google, exactly. Rather, several of Google's partners in its library scanning project are stepping up to the plate. Jesse Sheidlower of the Oxford English Dictionary passes on the news that the Hathi Trust has been established by the thirteen university libraries that make up the Committee on Institutional Cooperation. This includes the University of Michigan, which has contributed a major portion of Google's scanned material thus far. The Hathi Trust is not nearly as wary as Google in providing page images and fully searchable text for public domain materials. What this means is that if you find something on GBS that only gives you "snippet view," "limited preview," or "no preview available," you may be able to find the full page images by going to a CIC library site. The University of Michigan has already implemented this as part of its Mirlyn Library Catalog, with links to public domain material provided under the name "HathiTrust Digital Library." (Roy Tennant of Library Journal has also mocked up a prototype search service, but it still needs some work.)

Below the jump, an example of Hathi goodness in action.

In my post, "Jottings on the 'Jamaica' joke," I traced some of the history of an old bit of British comedy. ("My wife's gone to the West Indies!" "Jamaica?" "No, she went of her own accord!") In the comments, Ray Girvan chipped in with an example he found on GBS from 1914, in The Railroad Telegrapher. Fortunately, that volume is available in full view, so there's no question that it is indeed from 1914 (evidently from the June 1914 issue, if you scroll back from the joke's appearance on p. 993 to the beginning of the issue on p. 945). Inspired by Ray's find, I ran my own GBS query, looking for the search string "My wife's gone to the West Indies" in works dated 1913 or earlier. When I ran the search, I found two examples, both ostensibly from 1913. One, from The Medical Sentinel, is in full view, so again we can verify the exact date: in this case it's from the December 1913 issue (scrolling back from p. 1317 to p. 1271). But the other example listed on the search results page, from The Spatula: A Magazine for Pharmacists, has no preview at all. The metadata on the "About this book" page tells us it's in Volume 20, from 1913-1914, but we have no clue what issue it's actually in (or if it really is in that volume).

Enter Hathi. The Google metadata also tells us "Original from the University of Michigan," so we can head over to Mirlyn to check it out. Once we locate Volume 20, we can click on the Hathi link to get to the page images (along with other viewing options, including full text and PDF). GBS already informed us that the joke appears on p. 684, so we can go straight to that page image. Sure enough, it's there. Now we can scroll back until we find the first page of the issue: on p. 633 we can see it's actually in the September 1914 issue, so now we know it appeared later than both the Medical Sentinel example (December 1913) and the Railroad Telegrapher example (June 1914).

Unfortunately, Hathi's search functionality does not seem to be as robust as that of GBS. If I search for the word "Indies" in that 1913-14 volume of The Spatula, it shows four hits, but not the one with the "Jamaica" joke. For now, at least, it seems that GBS should be used for the heavy-duty searching, and Hathi can then be used to zero in on the page images once you know where to look. That's not such a bad arrangement, I think, and it ultimately shows that the partnership between Google and the university libraries can be extremely beneficial to the research community.

(By the way, the Hathi Trust FAQ explains that hathi is "the Hindi word for elephant, an animal highly regarded for its memory, wisdom, and strength.")



19 Comments

  1. dw said,

    September 16, 2008 @ 6:35 pm

    hathi itself comes from Sanskrit "hastin", meaning "the one with a hand". The ancient Indo-Aryans, having arrived in India from the northwest, must have lacked a native word for elephant. Seeing the huge creatures for the first time, they observed that they used their trunks as humans used their hands. I've always found this strangely touching.

  2. Ray Girvan said,

    September 16, 2008 @ 6:38 pm

    It's a PITA. I've developed a few techniques that may or may not be obvious – see Belloc's Lord High-Bo and More on finding poetry … and search tricks – for tracking semi-systematically through texts that don't give a full view or limited view.

    As the situation stands, in many cases you need to build the text, Dead Sea Scrolls style, from the different glimpses you get in the top level Google Books result, the Snippet View, and the "Search in this book" result, which are generally all different.

    As you say, more collaboration – so Google can find texts that are online in what's still the "Deep Web" – would be nice.

  3. Benjamin Zimmer said,

    September 16, 2008 @ 8:05 pm

    From Ray's blog post:

    Another Google trick you can use is a more systematic 'sideways' search, looking for adjacent material by searching on distinctive strings at the edges of that already found.

    I use this (laborious) trick all the time to outwit Snippet View. But sometimes searching on an "edge" doesn't uncover additional text, in which case I find that inserting asterisks as full-word wildcards often works. Thus "search string * * *" can reveal more text after the string, and "* * * search string" can reveal more before it.

    The Hathi Trust arrangement lessens the need to resort to this sort of trickery, at least for public domain materials. Still far from ideal, though.

  4. dr pepper said,

    September 16, 2008 @ 11:19 pm

    Perhaps Hathi can save researchers from having to play the blind men like that.

  5. Aaron F. said,

    September 17, 2008 @ 12:22 am

    This includes the University of Michigan, which has contributed a major portion of Google's scanned material thus far….
    The University of Michigan has already implemented this as part of its Mirlyn Library Catalog….

    Oh man, that's so cool! I'm a senior at U of M, and I knew that librarians here were scanning stuff for Google Books, but I had no idea how important Michigan's contribution was. I also had no idea that public-domain scans would be accessible through Mirlyn… I ought to check that out!

  6. Jeremy said,

    September 17, 2008 @ 1:53 am

    In The Spatula, it is actually "My wife's gone's", with an "s" after "gone". I wonder what the "s" could have stood for.

  7. Stuart said,

    September 17, 2008 @ 2:10 am

    hathi itself comes from Sanskrit "hastin", meaning "the one with a hand". The ancient Indo-Aryans, having arrived in India from the northwest, must have lacked a native word for elephant. Seeing the huge creatures for the first time, they observed that they used their trunks as humans used their hands. I've always found this strangely touching.

    Thank you so much for this! I've often wondered if there was any connection between हाथ, hand and हाथी, elephant. Can you recommend a good etymological dictionary for Hindi?

  8. Andy J said,

    September 17, 2008 @ 5:23 am

    An interesting post that was, for me, rendered confusing by BZ's perfectly legitimate use of the abbreviation GBS; to me that will always translate as George Bernard Shaw.

  9. Kenny Easwaran said,

    September 17, 2008 @ 7:17 am

    I studied Sanskrit for a few weeks from a book called "Teach Yourself Sanskrit". So I never got very far (barely learned about half the alphabet), but I was pretty sure that I learned that the word for "elephant" was "gajah". And in fact, a Google search for "gajah" turns up several photos of elephants on the first page. Did that word fall out of favor to be replaced by the other, or are there some highly non-transparent sound changes, or did something else go on?

  10. Benjamin Zimmer said,

    September 17, 2008 @ 8:54 am

    Jeremy wrote:

    In The Spatula, it is actually "My wife's gone's", with an "s" after "gone". I wonder what the "s" could have stood for.

    Looks like a perseveratory typo.

  11. Chris said,

    September 17, 2008 @ 10:00 am

    Actually, there are links to all digitized volumes from Mirlyn, even the in-copyright ones. Although you cannot view the text, for researchers at UM, it is valuable to be able to search a text for the words or phrases they are interested in and see the number of times the word appears and on what pages. You can decide then if it is worth your time to go find the volume.

  12. Maria said,

    September 17, 2008 @ 10:12 am

    It may be of interest to readers of this post that UM also just received a significant grant from IMLS to undertake and record copyright determinations for works that are post 1923 but may never have had their copyright renewed. This should open up a considerable body of work. More information here.

  13. dw said,

    September 17, 2008 @ 10:46 am

    Can you recommend a good etymological dictionary for Hindi?

    There is one online: "Platts, John T. (John Thompson). A dictionary of Urdu, classical Hindi, and English. London: W. H. Allen & Co., 1884." at this link: http://dsal.uchicago.edu/dictionaries/platts/ Once you've figured out its quirks it's pretty useful.

    The entry for "hathi" shows that it is derived via Prakrit "hatyio" from Sanskrit "hastikaH", which itself consists of "hastin" plus the suffix "kaH".

    http://dsal.uchicago.edu/cgi-bin/philologic/search3advanced?dbname=platts&query=%E0%A4%B9%E0%A4%BE%E0%A4%A5%E0%A5%80&matchtype=exact&display=utf8

    As for "gajaH" versus "hastin", a Google Book search on "sanskrit dictionary elephant" reveals this entry. Once you've magnified the page large enough to read the text, you see "gajaH", "hastin" and several other words for "elephant". As in most things, Sanskrit is blessed with a multitude of synonyms.

  14. Benjamin Zimmer said,

    September 17, 2008 @ 11:45 am

    Gajah is the word that Javanese and Malay borrowed from Sanskrit for 'elephant'. The legendary prime minister of Java's 14th-century Majapahit Empire was named Gajah Mada, which Zoetmulder's Old Javanese-English Dictionary relates to Sanskrit matta-gaja, 'mad/rutting elephant'. One of Indonesia's most prestigious universities is named after him.

  15. Benjamin Zimmer said,

    September 17, 2008 @ 2:36 pm

    Here is an example of what Chris is talking about — a "keyword searchable only" text that is not displayed in full view because of copyright issues. This particular title, The Sounds of Spoken English by Walter Ripman, is actually pre-1923 (1914, to be precise — there's a "Jamaica" joke in it). But because it's published in the UK, it's not subject to the same public-domain rules as US works.

  16. Leonard Zwilling said,

    September 17, 2008 @ 3:51 pm

    And then there's the old Sanskritists' joke–every noun in Sanskrit has four meanings–it means itself, its opposite, elephant, and a position in sexual intercourse.

  17. Stuart said,

    September 17, 2008 @ 4:21 pm

    As for "gajaH" versus "hastin", a Google Book search on "sanskrit dictionary elephant" reveals this entry. Once you've magnified the page large enough to read the text, you see "gajaH", "hastin" and several other words for "elephant". As in most things, Sanskrit is blessed with a multitude of synonyms.

    Indeed. When I checked my preferred online Hindi dictionary, I was quite surprised to find "haathii" listed LAST as a translation of elephant. First came गज gaja, then गजराज gajaraj, then हास्ती haastii and finally हाथी haathi. This dictionary is wiki-style, so it seems that its contributors have a real preference for the more Sanskritised Hindi.

  18. Stuart said,

    September 17, 2008 @ 4:35 pm

    A belated thanks to DW for the reminder about the online Platts. My print copy of Platts has basically zero etymology, so it's nice to know that the online one is better.

  19. John Cowan said,

    September 18, 2008 @ 12:16 am

    (This comment is also posted to LH.)

    I'll just mention that it's very much worthwhile, if you've gotten a Google Books link through regular Google search, to do a Books-specific search from Google Advanced Book Search specifying the author and title. You can often find another entry that is full-text.

    For example, if you search for the phrase "intrinsec service" [sic], the first hit is a Google Books link to Pollock & Maitland's The History of English Law Before the Time of Edward I. Unfortunately, it's a limited-page view dated 1996, a reprint, probably from the publisher. But if you do the advanced search for those authors and title, the first hit is the full text of the 1899 second edition, from a university library.

    Disclaimer: I work for Google, but not on Book Search, and I don't know how it works specifically.
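
For anyone who wants to try the asterisk trick from comment 3 without typing each variant by hand, here is a minimal sketch in Python. It only builds the quoted query strings; it does not contact any search service. The function name `wildcard_queries` and the `max_words` parameter are made up for illustration, and whether a given search engine still treats `*` as a full-word wildcard is an assumption you would need to check yourself.

```python
def wildcard_queries(phrase: str, max_words: int = 3) -> list[str]:
    """Build quoted phrase queries padded with full-word asterisk wildcards.

    For each n from 1 to max_words, one query pads the phrase with n
    asterisks after it (to expose following text) and one pads it with
    n asterisks before it (to expose preceding text).
    """
    queries = []
    for n in range(1, max_words + 1):
        stars = " ".join(["*"] * n)
        queries.append(f'"{phrase} {stars}"')  # text after the known string
        queries.append(f'"{stars} {phrase}"')  # text before the known string
    return queries


if __name__ == "__main__":
    # Print each candidate query so it can be pasted into a search box.
    for query in wildcard_queries("My wife's gone to the West Indies"):
        print(query)
```

Starting with one asterisk and working up keeps each query as specific as possible; the three-asterisk forms match the "search string * * *" and "* * * search string" patterns described above.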
