When the Google Ngram Viewer came out, I tempered my enthusiastic praise with a complaint ("More on 'culturomics'", 12/17/2010):
The Science paper says that "Culturomics is the application of high-throughput data collection and analysis to the study of human culture". But as long as the historical text corpus itself remains behind a veil at Google Books, then "culturomics" will be restricted to a very small corner of that definition, unless and until the scholarly community can reproduce an open version of the underlying collection of historical texts.
I'm happy to say that (the non-Google part of) the Culturomics crew at the Harvard Cultural Observatory have taken a significant step in that direction, building on the work of the Open Library. You can check out what they've done with an alpha version of an online search interface at http://bookworm.culturomics.org/. But in my opinion, the online search interface, alpha or not, is the least important part of what's going on here.
Let's start by laying out what they've done, in their own words:
What is this?
Bookworm demonstrates a new way of interacting with the millions of recently digitized library books. The Harvard Cultural Observatory already collaborated with Google Books on the Google ngrams viewer that has data for years. Bookworm doesn't work so closely with Google Books: instead, it uses books in the public domain so you can explore the information we know about a book from many angles at once: genre, author information, publication place, and so on. We're submitting it as part of the Digital Public Library of America's Beta Sprint initiative.
As the DPLA's 5/20/2011 press release explains,
The Beta Sprint seeks ideas, models, prototypes, technical tools, user interfaces, etc. – put forth as a written statement, a visual display, code, or a combination of forms – that demonstrate how the DPLA might index and provide access to a wide range of broadly distributed content. The Beta Sprint also encourages development of submissions that suggest alternative designs or that focus on particular parts of the system, rather than on the DPLA as a whole.
The current Bookworm interface is interesting: it expands on the Google Ngram interface in some ways (e.g. author metadata) while being more limited in others (e.g. no multi-word sequences yet):
What can I do with it?
Library metadata makes all sorts of interesting queries possible. For example:
Say you want to know about the history of Social Darwinism: when did "evolution" cross over from the sciences into the social sciences? You can compare the paths of keywords like "evolution" in different genres.
You can also use geographical information to make comparisons. Suppose that you want to know whether British or American fiction has more female characters. Searching for female pronouns shows you that American literature does seem to use 'she' a little bit more. But you'll need to do some more searches, and look at some books, to be sure.
Although you can't (yet) do multiword phrases, you are able to combine words if you want to search for things like plurals or places that have two names; you can, for example examine the history of the "long-s" by comparing the usage of the words "fo" and "so" together and apart.
You don't have to plot by publication year, either: you can use a number of different variables, including the age of the author when the book was published. Death and taxes may be the only two constants in life, but authors seem to care about them at different ages: the young and old talk more about death, while only the safely middle-aged seem to care about taxes.
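To make the "together and apart" idea concrete, here is a minimal sketch of how per-year counts for several word forms might be combined and compared. It does not use Bookworm's actual back end or data: the function name `combined_frequency`, the dictionary layout, and all the counts are assumptions made up for illustration.

```python
from collections import defaultdict


def combined_frequency(counts, totals, words):
    """Sum per-year counts over several word forms, then divide by the
    total number of words seen in that year (a relative frequency)."""
    summed = defaultdict(int)
    for word in words:
        for year, n in counts.get(word, {}).items():
            summed[year] += n
    # Skip years with no known total, to avoid dividing by zero.
    return {y: summed[y] / totals[y] for y in sorted(summed) if totals.get(y)}


# Invented counts, for illustration only (not real corpus data).
counts = {
    "fo": {1780: 900, 1820: 50},   # OCR misreading of long-s "so"
    "so": {1780: 100, 1820: 950},
}
totals = {1780: 100_000, 1820: 100_000}

# "Together": the two forms treated as a single word.
together = combined_frequency(counts, totals, ["fo", "so"])

# "Apart": the share of the combined count that appears as "fo".
fo_share = {
    y: counts["fo"].get(y, 0) / (counts["fo"].get(y, 0) + counts["so"].get(y, 0))
    for y in together
}
```

With numbers like these, the combined frequency stays flat while the share of "fo" drops, which is the pattern you would expect if the word itself was stable and only the typography changed.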
The important thing about this collection is that others (including you!) can in principle get at the whole thing, not just "culturomic trajectories" or lists of common ngrams:
What Books does this use?
All of our site builds on the amazing work of the Open Library and Internet Archive projects. The Internet Archive makes scans of books publically available to the public with Optical Character Recognition already performed. The books come mostly from major research libraries and are scanned by the Internet Archive itself, Google, Microsoft and other scanning initiatives. The Open Library is the Internet Archive's cataloging wing; they hope to create a publically editable library catalogue with an entry for every book ever published. We hope to include all the million or so books listed in both the Open Library and the Internet Archive; currently we have roughly 300,000. (We'll have feedback soon on the number of books inside the book collections you make–for now, you can use the "Raw Counts" function to get a rough idea.)
If you find mistakes in the catalog information (which you will!), you can go to a book's page at Open Library and correct whatever's wrong; when we next refresh our data against theirs, we'll get your changes in our system.
There's a lot that still needs to be done. The OCR in the collection is of variable quality, and always pretty far from perfect; the metadata is similarly fallible, and the Open Library's process of FRBRization is incomplete:
We are also analyzing relationships between works (example: all of these editions of Tom Sawyer are all editions of the same conceptual work). From this we can add relationships to each object and create new objects (like works). This process is known in the library world as "FRBRization". See http://frbr.org for more information.
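As a rough illustration of what grouping editions into works involves, here is a minimal sketch that clusters catalog records by a normalized author and title. This is not the Open Library's actual procedure, which I haven't seen; the record fields, the helper names `work_key` and `group_into_works`, and the sample records are all assumptions made up for the example.

```python
import re
from collections import defaultdict


def _norm(text):
    """Lowercase, replace punctuation with spaces, drop a leading article."""
    words = re.sub(r"[^a-z0-9 ]", " ", text.lower()).split()
    if words and words[0] in {"the", "a", "an"}:
        words = words[1:]
    return " ".join(words)


def work_key(record):
    """Crude stand-in for a 'work' identity: normalized (author, title)."""
    return (_norm(record["author"]), _norm(record["title"]))


def group_into_works(records):
    """Map each work key to the list of edition ids that share it."""
    works = defaultdict(list)
    for record in records:
        works[work_key(record)].append(record["id"])
    return dict(works)


# Hypothetical catalog records, for illustration only.
records = [
    {"id": "ed1", "author": "Twain, Mark", "title": "The Adventures of Tom Sawyer"},
    {"id": "ed2", "author": "Twain, Mark", "title": "Adventures of Tom Sawyer."},
    {"id": "ed3", "author": "Twain, Mark", "title": "Life on the Mississippi"},
]

works = group_into_works(records)
```

Here `ed1` and `ed2` end up under the same key, and `ed3` under its own. A key this crude will miss retitled reprints and will wrongly merge distinct works that share a title, which is one reason user feedback on the catalog matters.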
From the point of view of linguistic history, we need to deal not only with multiple editions (and multiple digitizations), but also with the more difficult question of when passages were written as opposed to when they were published.
Diving into one random bookworm.culturomics.org search turned up two "different" hits that are different digitizations of the same book:
Both are given the (correct) publication date of 1866 (which is why I noticed the duplication, since they appeared in adjacent spots on the same list), and are correctly attributed to J. Hain Friswell (1825-1878). But the contents were mostly not actually written in 1866 by a 41-year-old man (i.e. Friswell), but rather are reproduced from much earlier writings, for instance a work by Lodowick Muggleton originally published in 1651, or another by Sir Thomas Browne, written in the 1670s and originally published in 1716.
Another random dive into the same set of search results turns up several other duplicate digitizations of works published in 1871, e.g.
as well as a case where the titles appear to be different but (some of?) the content may be the same:
There are also multiple cases of works published in 1871 but written much earlier, e.g.
Presumably the Google Ngram corpus has many similar issues, but there's no way for users to find or fix them. The Open Library's approach, unlike Google's, explicitly asks for active feedback from users to improve the data and metadata. There's a lot of room for improvement — but there are potentially a lot of users.
N.B. The intellectual leadership of the Harvard Cultural Observatory comes from Jean-Baptiste Michel and Erez Lieberman. The Bookworm interface (and the back-end work behind it?) was done by Martin Camacho and Ben Schmidt.