Help Wanted: Sharing Data for Research on Reading and Writing

On Friday, July 20, at the 2012 meeting of the Council of Writing Program Administrators in Albuquerque NM, there will be a session called "Help Wanted: Sharing Data for Research on Reading and Writing".  Here's the proposal that was submitted for this session:

Should there be a large, open collection of student writing, representing the range of ability and accomplishment among American high school and college students today? We think so, but we’d like to hear your opinions.

“We” are a group of linguists, psychologists, computer scientists, and writing-program professionals, and we believe that a large collection of student writing, as part of a larger collection of texts and annotations, would provide an essential basis for many important kinds of research.

Our general idea is to create an open and evolving dataset of both student writing and expert writing, combined with an open and evolving collection of layers of annotation. The annotations might be linguistic (syntax, word senses, co-reference, discourse structures), editorial (mistakes, infelicities, suggested corrections), or psychological (eye tracking, EEG or MEG, reading comprehension, readers' evaluations, etc.). Since the collection would be a large one, not all kinds of annotation would be applied to all parts of it. And of course, not all users of the data would be interested in all types of annotation.

This collection could be used to estimate how much trouble students of different kinds at different levels have with different aspects of writing. It could be used to study the effects of writers' choices on readers' uptake, and might thus help to create better interactive advice for writers. And in addition to these and other innovative uses, such a collection would provide a larger and more diverse basis for standard sorts of reading research.

There are many available collections (billions of words) of expert writing, plenty of reading researchers who are willing to share their data, and plenty of computational linguists who are willing to share their algorithms and even their programs. The piece of the puzzle that is still entirely missing is a large and diverse collection of student writing, as well as editorial annotation, commentary, and evaluation for some of it.

In this session, we’ll sketch some of the kinds of research that such a collection would facilitate, and solicit the opinions of CWPA attendees about problems and opportunities.

This idea emerged from a series of discussions among a loosely affiliated group of people at several institutions. The names that ended up on the proposal, in alphabetical order, were Jonathan Brennan (Children's Hospital of Philadelphia), Chris Callison-Burch (Johns Hopkins University), Andrea Feldman (University of Colorado at Boulder), Al Filreis (University of Pennsylvania), Roger Levy (University of California at San Diego), Mark Liberman (University of Pennsylvania), Ani Nenkova (University of Pennsylvania), Rolf Norgaard (University of Colorado at Boulder), and John Trueswell (University of Pennsylvania).

If you're interested in joining the discussions, please get in touch with me.


  1. Nathan said,

    May 18, 2012 @ 8:18 am

    How can such corpora be collected, distributed, and used in the era of automatic copyright?

    [(myl) Through the use of copyright licensing agreements. There is a spectrum of possibilities, from complete Open Access via some type of Creative Commons license, to more restricted licenses negotiated with authors, publishers, broadcasters, or other IPR holders. The Linguistic Data Consortium has published hundreds of collections of material from tens of thousands of authors and hundreds of publishers, broadcasters, and so on; every year the LDC distributes thousands of copies of these more-restricted collections to hundreds of institutions, for purposes of education, research, and development.

    Follow any of the links in the LDC catalogue to learn about the copyright status of individual collections.

    In this case, a minimum requirement would be copyright releases from the individual contributors — of the same general kind that is used for publication in conference proceedings and the like. Of course there would also have to be informed consent to the planned publication and use of the collections, along with assurances of anonymity and so on. ]

  2. Chris said,

    May 18, 2012 @ 8:48 am

    I taught college writing courses for 12 years and the most painful part was my inability to find good examples of actual student research papers. Writing books tend to be filled with essays by famous authors, not research papers by students. It was always a struggle to teach students how to write without good examples for them to model from. I hope the project is successful. It could revolutionize the teaching of writing.

  3. C Thornett said,

    May 18, 2012 @ 12:34 pm

    Such a collection could also be useful in developing better tools for assessing and preparing adult literacy and ESL students for GED and other tests, and for higher or further education.

    I'd like to see something similar in the UK as well.

  4. M. Knight said,

    May 19, 2012 @ 2:09 am

    claims to have 220+ million archived student papers. Perhaps they would be willing to share (or sell).

  5. Dakota said,

    May 22, 2012 @ 1:07 am

    Some universities require an English placement exam, which consists of a timed written essay. I know I had to take one before they would let me graduate. In those days it was a paper exam; surely by now they have it computerized. Such tests would be less likely to be plagiarized, and would offer a large cross-section of students who have graduated from high school.
