
When the Google Ngram Viewer came out, I tempered my enthusiastic praise with a complaint ("More on 'culturomics'", 12/17/2010):

The Science paper says that "Culturomics is the application of high-throughput data collection and analysis to the study of human culture". But as long as the historical text corpus itself remains behind a veil at Google Books, then "culturomics" will be restricted to a very small corner of that definition, unless and until the scholarly community can reproduce an open version of the underlying collection of historical texts.

I'm happy to say that the non-Google part of the Culturomics crew at the Harvard Cultural Observatory has taken a significant step in that direction, building on the work of the Open Library. You can check out what they've done with an alpha version of an online search interface, Bookworm. But in my opinion, the online search interface, alpha or not, is the least important part of what's going on here.

Let's start by laying out what they've done, in their own words:

What is this?

Bookworm demonstrates a new way of interacting with the millions of recently digitized library books. The Harvard Cultural Observatory already collaborated with Google Books on the Google ngrams viewer that has data for years. Bookworm doesn't work so closely with Google Books: instead, it uses books in the public domain so you can explore the information we know about a book from many angles at once: genre, author information, publication place, and so on. We're submitting it as part of the Digital Public Library of America's Beta Sprint initiative.

As the DPLA's 5/20/2011 press release explains,

The Beta Sprint seeks ideas, models, prototypes, technical tools, user interfaces, etc. – put forth as a written statement, a visual display, code, or a combination of forms – that demonstrate how the DPLA might index and provide access to a wide range of broadly distributed content. The Beta Sprint also encourages development of submissions that suggest alternative designs or that focus on particular parts of the system, rather than on the DPLA as a whole.

The current bookworm interface is interesting: it expands the Google Ngram interface in some ways (e.g. author metadata) while limiting it further in others (e.g. no multi-word sequences yet):

What can I do with it?

Library metadata makes all sorts of interesting queries possible. For example:
Say you want to know about the history of Social Darwinism: when did "evolution" cross over from the sciences into the social sciences? You can compare the paths of keywords like "evolution" in different genres.
You can also use geographical information to make comparisons. Suppose that you want to know whether British or American fiction has more female characters. Searching for female pronouns shows you that American literature does seem to use 'she' a little bit more. But you'll need to do some more searches, and look at some books, to be sure.
Although you can't (yet) do multiword phrases, you are able to combine words if you want to search for things like plurals or places that have two names; you can, for example, examine the history of the "long-s" by comparing the usage of the words "fo" and "so" together and apart.
You don't have to plot by publication year, either: you can use a number of different variables, including the age of the author when the book was published. Death and taxes may be the only two constants in life, but authors seem to care about them at different ages: the young and old talk more about death, while only the safely middle-aged seem to care about taxes.
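The pronoun comparison mentioned above (counting "she" in American vs. British fiction) boils down to relative word frequency. A minimal sketch in Python, using made-up sample strings rather than real corpus data:

```python
import re

def rel_freq(text, word):
    """Relative frequency of `word` among all word tokens in `text`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return tokens.count(word) / len(tokens) if tokens else 0.0

# Toy stand-ins for the American and British fiction subcorpora.
us_sample = "She said she would go. He nodded and she left."
uk_sample = "He said he would go. She nodded and he left."

us_she = rel_freq(us_sample, "she")  # 3 of 10 tokens
uk_she = rel_freq(uk_sample, "she")  # 1 of 10 tokens
```

A real version would aggregate counts per book and per year, but the quantity being plotted is just this sort of simple ratio of counts.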

The important thing about this collection is that others (including you!) can in principle get at the whole thing, not just "culturomic trajectories" or lists of common ngrams:

What Books does this use?

Our site builds on the amazing work of the Open Library and Internet Archive projects. The Internet Archive makes scans of books publicly available, with Optical Character Recognition already performed. The books come mostly from major research libraries and are scanned by the Internet Archive itself, Google, Microsoft, and other scanning initiatives. The Open Library is the Internet Archive's cataloging wing; they hope to create a publicly editable library catalogue with an entry for every book ever published. We hope to include all the million or so books listed in both the Open Library and the Internet Archive; currently we have roughly 300,000. (We'll have feedback soon on the number of books inside the book collections you make; for now, you can use the "Raw Counts" function to get a rough idea.)

If you find mistakes in the catalog information (which you will!), you can go to a book's page at Open Library and correct whatever's wrong; when we next refresh our data against theirs, we'll get your changes in our system.

There's a lot that still needs to be done. The OCR in the collection is of variable quality, and often pretty far from perfect; the metadata is similarly fallible, and the Open Library's process of FRBRization is incomplete:

We are also analyzing relationships between works (example: all of these editions of Tom Sawyer are editions of the same conceptual work). From this we can add relationships to each object and create new objects (like works). This process is known in the library world as "FRBRization". See for more information.
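The FRBRization step described above amounts to clustering edition records into conceptual works. A crude sketch of the idea follows; this is not the Open Library's actual algorithm, and the normalization rules are invented for illustration:

```python
import re
from collections import defaultdict

def work_key(title, author):
    """Crude normalization: lowercase, drop punctuation and a leading article."""
    t = re.sub(r"[^a-z0-9 ]", "", title.lower())
    t = re.sub(r"^(the|a|an) ", "", t).strip()
    a = re.sub(r"[^a-z ]", "", author.lower()).strip()
    return (t, a)

def group_editions(editions):
    """Group edition records that look like the same conceptual work."""
    works = defaultdict(list)
    for ed in editions:
        works[work_key(ed["title"], ed["author"])].append(ed)
    return works

editions = [
    {"title": "The Adventures of Tom Sawyer", "author": "Mark Twain", "year": 1876},
    {"title": "Adventures of Tom Sawyer", "author": "Mark Twain", "year": 1920},
    {"title": "Roughing It", "author": "Mark Twain", "year": 1872},
]
works = group_editions(editions)  # two works; Tom Sawyer has two editions
```

Real FRBRization has to cope with translations, retitlings, and OCR'd metadata, which is exactly why it is hard and still incomplete.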

From the point of view of linguistic history, we need to deal not only with multiple editions (and multiple digitizations), but also with the more difficult question of when passages were written as opposed to when they were published.

Diving into one random search turned up two "different" hits that are different digitizations of the same book:

Both are given the (correct) publication date of 1866 (which is why I noticed the duplication, since they appeared in adjacent spots on the same list), and are correctly attributed to J. Hain Friswell (1825-1878). But the contents were mostly not actually written in 1866 by a 41-year-old man (i.e. Friswell), but rather are reproduced from much earlier writings, for instance a work by Lodowick Muggleton originally published in 1651, or another by Sir Thomas Browne, written in the 1670s and originally published in 1716.

Another random dive into the same set of search results turns up several other duplicate digitizations of works published in 1871, e.g.

as well as a case where the titles appear to be different but (some of?) the content may be the same:

There are also multiple cases of works published in 1871 but written much earlier, e.g.

Presumably the Google Ngram corpus has many similar issues, but there's no way for users to find or fix them. The Open Library's approach, unlike Google's, explicitly asks for active feedback from users to improve the data and metadata. There's a lot of room for improvement — but there are potentially a lot of users.

N.B. The intellectual leadership of the Harvard Cultural Observatory comes from Jean-Baptiste Michel and Erez Lieberman. The bookworm interface (and the back-end work behind it?) was done by Martin Camacho and Ben Schmidt.


  1. David Y. said,

    September 23, 2011 @ 11:23 pm

    I worked a while back to locate electronic copies of the entire series of the State Department's annual "Country Reports on Human Rights Practices," a document of some 1000-3000 pages, depending on the year. By the time I had located at least one scanned version of each year's issue, I had located as many as FIVE distinct scans (identifiable from differences in quality or, more obviously, library stamps on the first few pages), and I found at least two of most years, in about ten different databases.

    In other words, some poor work-study students at these libraries have duplicated not a few, but THOUSANDS, of pages of effort. And this is just one publication.

    Think how many more things could have been scanned if there were some kind of system for allocating the scanning resources. And how much easier it would be to deal with the results if the institutions could get their acts together to create a single library of scanned public-domain sources instead of 10+ libraries with parts of the overall task.

    [(myl) This is a bit misleading. Most of the scanning, as I understand it, has been and is being done using machines that handle most of the work — one company advertises that one worker can manage 5 scanning machines simultaneously, each one scanning 2,000 pages/hour. So it would almost certainly be more work (and more difficult work) to ensure that exactly one copy of everything got scanned than just to go systematically through each participating library's collection. The same thing applies, I think, to multiple editions within one library's holdings (where there may be other reasons to want to digitize all of them, anyhow.)

    The thing that still largely needs to be done is the FRBRization that would allow users to ensure that a given digital collection had just one copy of each work.]

  2. David Y. said,

    September 24, 2011 @ 1:50 am

    Your links in response to my post are promising, but about half my duplicates — on a quick check of the first dozen or so pages of each — have clear shots of fingers at the edge of the scans. Federal work study funding may be more reliably available than funding for expensive machinery, for at least some libraries. But I realize that you're referring mainly to participants in the Google Books library project, where money may be less of a problem.

    I do think that the fact that multiple scanning projects are proceeding simultaneously, feeding into different databases, is making it increasingly difficult to find — and for librarians to catalog — what's out there. I found at least one of my scanned "Country Reports" on the State Department's own website (including a well-hidden obsolete page buried on the State Department's server for some older issues), on FDsys, Google Books, HathiTrust, Lexis, Barnes & Noble, and Amazon (and I'm forgetting a couple of sources).

    My closest research institution's library doesn't even have a correct index of which issues they have on paper, much less which databases contain which issues. Google Books is actually surprisingly incomplete when it comes to government documents; I can't tell whether they're omitting them intentionally.

  3. Joe said,

    September 24, 2011 @ 5:29 am

    "Suppose that you want to know whether British or American fiction has more female characters. Searching for female pronouns shows you that American literature does seem to use 'she' a little bit more"

    This is probably OT, but could this method really work? The problem as I see it is that the plural pronoun is unmarked for gender, and if a given novel had, say, one female character who was often referred to with a 3rd person singular, but another had several who were often referred to collectively (with the unmarked plural), then the novel with one female character would appear to have more. Have there been any studies that would rule this possibility out (i.e., by counting plural forms and seeing how well they correlate with the singular)?

    [(myl) As with all attempts to explain observational differences, there are many possible explanations for the observed difference in the relative frequency of the pronoun "she", which is that (after 1870 or so), bookworm's count for "she" in American "Fiction and juvenile belles lettres" runs about 0.08% (absolute) higher than the same estimate for the same genre in the UK — e.g. in 1921, about 0.7008% vs. 0.6155%. You raise one alternative explanation, but there are many others: Perhaps American fiction authors use the same number of female characters overall, but refer to them with pronouns more often, as opposed to names or definite descriptions; perhaps British novelists are somewhat more likely to use one of their female characters as narrator, thus replacing third-person pronouns with first-person pronouns; and so on.

    It's not going to be easy to distinguish these alternatives on the basis of overall differences in the time function of the frequency of single words. However, given access to the full texts of the works involved, the situation changes.

    For example, a future task for "information extraction" technology is to compile a dramatis personae list for each work of fiction, with the demographic metadata attributed to each fictional character. And then if we had (even approximate) sales figures for each novel, or some other proxy measure for how many readers it had, we'd be in a position to do some serious cultural analysis.

    Given a large collection of texts, with full access shared by a community of scholars, this sort of thing is quite possible, and even (in my opinion) inevitable.]

  4. Chris said,

    September 24, 2011 @ 7:18 am

    @Joe: I think you've hit on exactly why these projects are so misleading. Simple frequency counts don't tell us much, yet they capture the public attention and enable wild misinterpretation (recall the various discussions of presidential pronoun frequencies). As these technologies mature, I think we'll see more sophisticated analysis.

    More to Mark's point, I've also wondered about the anthology problem. Imagine an anthology published in 1924 that contains various political essays from the 1700s. Is there yet a way to automatically determine that that language use is NOT representative of 1924?

    [(myl) Certainly human readers can figure out fairly accurately what was written when. It's plausible that automatic text-analysis technology could some day do the same thing; but we're not there yet.

    Again, given a large collection of texts accessible to a community of scholars, it's likely that this analysis will eventually be done, and the results will be available to all.

    There are some more pressing things to improve at the moment, in my opinion, such as the (often inexcusably bad) quality of the OCR in the Open Library's collection.

    But one relatively easy thing to determine, on the basis of a sampling experiment where a sample of the hits from sample searches are checked by human readers, would be what fraction of passages published in year X were actually written within a decade of the year of publication, as opposed to being quoted, anthologized, or reprinted from significantly earlier works. (Or later ones — one of the ~15 volumes that I checked in my quick scan of the results of one bookworm search was this, which is identified as being Tennyson's Poetical Works published in 1870, but appears in fact to be A Textbook of Mechanical Drawing, published in 1904.) On the basis of the (obviously inadequate) sample of things checked in the body of this post, I conjecture that the out-of-decade percentage in bookworm results may be non-trivial, perhaps on the order of 10%.

    This may not matter for "culturomics", I'm not sure — if something was published in 1870, it was in some sense part of the cultural landscape in 1870, even if it was written in 1770 or 1670. (Though if it was actually published in 1904, not so much — and also, it would be nice to know how many copies were printed and sold, so that we could distinguish something read by a hundred people from something read by ten million.) But from the point of view of the history of the language, we need to know when and where something was written, not when and where it was subsequently edited, anthologized, or republished.]
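The sampling experiment proposed in that reply is just a binomial proportion estimate. A toy version follows; the (publication, composition) pairs below echo the examples discussed in the post and are not real checked data:

```python
import math

def out_of_decade_estimate(sample):
    """Estimate the fraction of passages whose composition date is more than
    ten years from the publication date, with a rough normal-approximation
    95% confidence interval."""
    n = len(sample)
    k = sum(1 for pub, written in sample if abs(pub - written) > 10)
    p = k / n
    half = 1.96 * math.sqrt(p * (1 - p) / n)
    return p, max(0.0, p - half), min(1.0, p + half)

# (publication_year, composition_year) pairs, echoing cases from the post.
sample = [(1866, 1651), (1866, 1866), (1871, 1871), (1871, 1700), (1870, 1904)]
p, lo, hi = out_of_decade_estimate(sample)  # p = 0.6 on this tiny sample
```

With a sample this small the interval is uselessly wide, of course; a few hundred hand-checked hits would be enough to pin down whether the out-of-decade rate is really on the order of 10%.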

  5. Jadwiga said,

    September 24, 2011 @ 8:19 am

    you can, for example examine the history of the "long-s" by comparing the usage of the words "fo" and "so" together and apart.

    I hope this is a typo, or did they really replace all ſ by f?

    [(myl) It's not what "they" (the folks at the Harvard Cultural Observatory) did, but what the OCR software used in the digitization process did.]
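The f-for-ſ confusion in the exchange above is the kind of systematic OCR error that can be attacked mechanically. A naive dictionary-based sketch (a serious corrector would use context and a language model; the lexicon here is invented):

```python
def fix_long_s(token, lexicon):
    """If an OCR token isn't a known word, try turning each 'f' into 's'
    (the long ſ is usually misread as f) and keep the first swap that
    yields a known word."""
    if token in lexicon:
        return token
    for i, ch in enumerate(token):
        if ch == "f":
            candidate = token[:i] + "s" + token[i + 1:]
            if candidate in lexicon:
                return candidate
    return token

lexicon = {"so", "same", "first", "passage"}
```

On this lexicon, `fix_long_s("fo", lexicon)` recovers "so", while a genuine f-word like "first" is left alone because it is already in the lexicon.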

  6. languagehat said,

    September 24, 2011 @ 8:27 am

    by Lodowick Maggleton originally published in 1651

    I believe the surname should be Muggleton. (I was quite a fan of the Muggletonians in my youth.)

    [(myl) Sorry — I should have known that a passage rendered by OCR as

    A Remonstrance from the Eternal God ; declaring several Spiritual Transactions unto the Parliament and Commonwealth of JSnglcmdj ^c, 8fc. By John Reeve and Lodowick Maggleton. 1651.

    was not to be trusted even in the areas not obviously garbled.]

  7. mollymooly said,

    September 24, 2011 @ 9:23 am

    One eventual benefit of having duplicate scans of the same edition is that the OCR result of processing both versions together might be better than for either version alone. Once the data has been annotated to indicate that two scans are of the same volume, it ought to be easy to determine which pages of each scan correspond.
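mollymooly's suggestion can be sketched with a token-level alignment: keep the tokens the two transcriptions agree on, and where they disagree prefer the variant that is a known word. (A serious system would vote at the character level and use a language model; this toy only shows the shape of the idea.)

```python
import difflib

def merge_ocr(a_tokens, b_tokens, lexicon):
    """Merge two OCR transcriptions of the same page."""
    sm = difflib.SequenceMatcher(a=a_tokens, b=b_tokens)
    merged = []
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op == "equal":
            merged.extend(a_tokens[i1:i2])
        else:
            # Disagreement: prefer whichever variant is in the lexicon.
            # (zip silently drops unpaired tokens -- fine for a sketch.)
            for x, y in zip(a_tokens[i1:i2], b_tokens[j1:j2]):
                merged.append(x if x in lexicon else y)
    return merged

lexicon = {"the", "same", "passage", "rendered", "twice"}
a = ["the", "fame", "passage", "rendered", "twice"]
b = ["the", "same", "pafsage", "rendered", "twice"]
merged = merge_ocr(a, b, lexicon)  # each scan corrects the other's error
```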

  8. Matthew said,

    September 25, 2011 @ 6:27 am

    The editorial response to David Y's comment points to one of the bigger problems with automated machine scanning, specifically that in using puffs of air to turn pages automatically, the process cannot handle fold-outs (the publisher's f-word). Any large map, graphic, or data table otherwise crucial to the material is shown as a folded piece of paper. Not so important for linguistic analysis, I know, but a crucial flaw. We can, and do, crowd-source the correction of poor OCRing, such as returning f to s for the long-s, etc., but the fold-outs will require rehandling the books and then the manual insertion of new images into the scanned files … a laborious process that I just don't see happening!

    [(myl) Actually, in my opinion, the right thing to do with poor OCR is to replace it with good OCR. There's no reason to waste good human time fixing the crap that passes for OCR in a large part of the Internet Archive's holdings — much better OCR software is available, and it could be improved further (e.g. via adaptive language modeling). It makes sense to use crowdsourcing techniques to correct *good* OCR, which will still have a few errors here and there.

    Here's a small random example: the page image of a paragraph, the text version from Google Books, and the text version from the Open Library, which is based on exactly the same scanned page image from the Google Books project….

    As for fold-outs, it should be possible to diagnose their presence automatically, and eventually to scan them and add the results. But surely the proportion of books where this is an issue is very, very small.]

  9. blahedo said,

    September 25, 2011 @ 8:27 pm

    @myl "But from the point of view of the history of the language, we need to know when and where something was written, not when and where it was subsequently edited, anthologized, or republished.":

    I'd say that we need to know when and where it was written AS WELL AS where it was republished.[0] And if you're storing your metadata in anything like a proper database, it is trivial to track both sets of information (once it is known, whether by automatic algorithms or manual annotation). The problem comes in that so many people's conceptions of databases these days are influenced by things like iTunes or Excel-as-database, where everything is shoehorned into a single ginormous table, however poor a fit that may be. If you acknowledge that "written work" and "physical manifestation of written work" are not in a 1-to-1 relationship and build your database to reflect that, you can have your cake and eat it too.

    [0] For example: I remember how surprised I was when I was taking French and chose to read "Don Juan" by Moliere for an assignment — when I went into the university stacks to find a copy, I discovered that if the physical book that I chose was older than some cutoff — iirc around 1900 — then what I saw used a very different orthography (starting with the title, "Dom Juan"). The newer books used the newer orthography, but there was no indication whatsoever that this was not the literal original text flowing from Moliere's pen. Someone tracking certain kinds of usage patterns would be ill-served by simply marking a 1940 printing of Don Juan as "written in 1665". (They'd also be ill-served by simply marking it as written in 1940, of course!)
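blahedo's database point is easy to make concrete: a works table and an editions table in a one-to-many relationship, so that a 1940 printing of a 1665 play carries both dates. A minimal sqlite sketch (the column names are invented for illustration, and the printing years below are made-up examples, not a real publication history):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE works (
    work_id INTEGER PRIMARY KEY,
    title TEXT,
    author TEXT,
    year_written INTEGER
);
CREATE TABLE editions (
    edition_id INTEGER PRIMARY KEY,
    work_id INTEGER REFERENCES works(work_id),
    year_published INTEGER,
    orthography TEXT
);
""")
conn.execute("INSERT INTO works VALUES (1, 'Dom Juan', 'Moliere', 1665)")
conn.executemany(
    "INSERT INTO editions VALUES (?, 1, ?, ?)",
    [(1, 1683, "original"), (2, 1890, "original"), (3, 1940, "modernized")],
)
rows = conn.execute("""
    SELECT w.year_written, e.year_published
    FROM works w JOIN editions e ON w.work_id = e.work_id
""").fetchall()  # every printing still knows its composition date
```

Nothing here is exotic; the point is only that once "work" and "manifestation" are separate tables, tracking both dates is trivial, exactly as the comment says.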
