Standardized Project Gutenberg Corpus

Martin Gerlach and Francesc Font-Clos, "A standardized Project Gutenberg corpus for statistical analysis of natural language and quantitative linguistics", arXiv 12/19/2018:

The use of Project Gutenberg (PG) as a text corpus has been extremely popular in statistical analysis of language for more than 25 years. However, in contrast to other major linguistic datasets of similar importance, no consensual full version of PG exists to date. In fact, most PG studies so far either consider only a small number of manually selected books, leading to potential biased subsets, or employ vastly different pre-processing strategies (often specified in insufficient details), raising concerns regarding the reproducibility of published results. In order to address these shortcomings, here we present the Standardized Project Gutenberg Corpus (SPGC), an open science approach to a curated version of the complete PG data containing more than 50,000 books and more than 3×10⁹ word-tokens. Using different sources of annotated metadata, we not only provide a broad characterization of the content of PG, but also show different examples highlighting the potential of SPGC for investigating language variability across time, subjects, and authors. We publish our methodology in detail, the code to download and process the data, as well as the obtained corpus itself on 3 different levels of granularity (raw text, timeseries of word tokens, and counts of words). In this way, we provide a reproducible, pre-processed, full-size version of Project Gutenberg as a new scientific resource for corpus linguistics, natural language processing, and information retrieval.
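
To make the abstract's three "levels of granularity" concrete, here is a minimal sketch of how raw text, a token time series, and word counts relate. This uses a toy lowercase/regex tokenizer for illustration, not the paper's actual pre-processing pipeline:

```python
from collections import Counter
import re

# Level 1: raw text of a single book (after boilerplate removal).
raw_text = "Call me Ishmael. Some years ago, never mind how long precisely."

# Level 2: time series of word tokens, i.e. the tokens in the order
# they occur in the book. This toy tokenizer just lowercases and keeps
# alphabetic strings; the SPGC pipeline defines its own rules.
tokens = re.findall(r"[a-z]+", raw_text.lower())
# -> ['call', 'me', 'ishmael', 'some', 'years', 'ago', ...]

# Level 3: counts of words, i.e. the bag-of-words representation
# obtained by collapsing the time series.
counts = Counter(tokens)
# -> Counter({'call': 1, 'me': 1, 'ishmael': 1, ...})
```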

One of the reasons that this is a wonderful idea is that the format of Project Gutenberg texts has changed over time, so that turning downloaded texts into usable data has always been a chore. See e.g. "Literary moist aversion", 12/27/2012, where I complained about the April 2010 Project Gutenberg DVD:

Let me say first that from the point of view of corpus analysis, the Gutenberg DVD is a mess. The text files have multiple formats, with different sorts of boilerplate fore and aft; there are many duplicate works, arranged in ways that make automated un-duplication difficult; there is no master list of path names to texts; in some cases, hyphenation and other relics of print editions are preserved; and so on. I say this not mainly to complain about the quality of free ice cream (though I hope that at some time in the future, someone will produce a more analysis-friendly version), but to make the point that the clean-up I was able to accomplish in the space of an hour may very well have a few bugs. So take the following with a suitably-sized grain of salt.
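
For readers who want to see what the "boilerplate fore and aft" problem looks like in code, here is a hedged sketch of the usual heuristic: scan for the `*** START OF ...` / `*** END OF ...` marker lines that most recent Project Gutenberg files carry, and keep only the text between them. The marker wording varies across eras of the archive (and older files may have none at all), which is precisely why a standardized, pre-cleaned corpus is welcome:

```python
import re

# Markers used in most recent Project Gutenberg files; older files use
# different wording (or none at all), so this heuristic will miss some.
START_RE = re.compile(r"\*\*\* ?START OF (THIS|THE) PROJECT GUTENBERG", re.I)
END_RE = re.compile(r"\*\*\* ?END OF (THIS|THE) PROJECT GUTENBERG", re.I)

def strip_pg_boilerplate(text: str) -> str:
    """Return the body of a PG text, dropping header/footer boilerplate."""
    lines = text.splitlines()
    start, end = 0, len(lines)
    for i, line in enumerate(lines):
        if START_RE.search(line):
            start = i + 1      # body begins after the START marker
        elif END_RE.search(line):
            end = i            # body ends before the END marker
            break
    return "\n".join(lines[start:end]).strip()
```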

The SPGC dataset is available from https://github.com/pgcorpus/gutenberg, and now comprises 59,503 books.
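
The repository also documents scripts for downloading and re-deriving the corpus. I haven't verified the exact file layout, but if the per-book count files are simple tab-separated word/count lists, as the paper's description of the "counts of words" level suggests, reading one might look like this (the filename `PG2701_counts.txt` is hypothetical):

```python
from collections import Counter

def load_counts(path: str) -> Counter:
    """Read a per-book counts file: one 'word<TAB>count' entry per line."""
    counts = Counter()
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            word, _, n = line.rstrip("\n").partition("\t")
            if n:
                counts[word] = int(n)
    return counts

# Hypothetical usage: word counts for one book.
# counts = load_counts("PG2701_counts.txt")
# print(counts.most_common(10))
```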

Some texts are available from Wikisource but not from Project Gutenberg; perhaps someday those will be added as well.



1 Comment

  1. mg said,

    December 29, 2019 @ 8:14 pm

    What a valuable resource! And a massive undertaking that must have been a labor of love.
