Corpus-based Linguistic Research: From Phonetics to Pragmatics

Finding or Creating Resources

An amazing range and amount of stuff is Out There, though it's sometimes hard to find it, and often hard to get it once you find out about it. This lecture aims to give you some ideas where and how to look, and how to get what you find, both for published (or semi-published or unpublished) datasets and for ways to collect your own data by sipping from the enormous volume of text, speech and video streaming through the intertubes...

1. Indices, Catalogues, Collections of Collections
2. Single Datasets or Specialized Repositories
3. Places to Hunt, Fish, and Gather
4. Useful Online Search Sites
5. Some small "Use Cases"...

Indices, Catalogues, Collections of Collections

Some of these are long-established, and some are just getting started; some are small and specialized and some are large; some aim at engineers and some at humanists; some are like libraries, some are like bookstores, some are like building contractors, ...

The lecture will provide some comments on contents, usefulness, and non-obvious ways to get access.

[A sample of self-descriptions, in alphabetical order...]

Appen Butler Hill:

Appen Butler Hill is a language technology solutions and consulting firm; recognized as a global leader in the quality, range and caliber of its expertise. We provide sophisticated speech and language technology services and products to our clients, major international technology companies and government organizations, and help technology firms extend products with core linguistic components into worldwide markets.

CLARIN (Common Language Resources and Technology Infrastructure):

...aims at providing easy and sustainable access for scholars in the humanities and social sciences to digital language data (in written, spoken, video or multimodal form) and advanced tools to discover, explore, exploit, annotate, analyse or combine them, independent of where they are located. To this end CLARIN is in the process of building a networked federation of European data repositories, service centres and centres of expertise, with single sign-on access for all members of the academic community in all participating countries. (CLARIN centres)

Corpora4Learning:

This page offers short descriptions of the most widely known English language corpora.

DOBES (Documentation of Endangered Languages):

The DOBES Archive contains language documentation data from a great variety of languages from around the world that are in danger of becoming extinct. This portal gives access to the material in the archive and provides information about DOBES.

ELRA (European Language Resources Association):

ELRA is the driving force to make available the language resources for language engineering and to evaluate language engineering technologies. In order to achieve this goal, ELRA is active in identification, distribution, collection, validation, standardisation, improvement, in promoting the production of language resources, in supporting the infrastructure to perform evaluation campaigns and in developing a scientific field of language resources and evaluation. (ELDA catalogue, Universal catalogue, R&D catalogue...)

ICAME (International Computer Archive of Modern and Medieval English):

ICAME is an international organization of linguists and information scientists working with English machine-readable texts. The aim of the organization is to collect and distribute information on English language material available for computer processing and on linguistic research completed or in progress on the material, to compile an archive of English text corpora in machine-readable form, and to make material available to research institutions.

ICE (International Corpus of English):

The International Corpus of English (ICE) began in 1990 with the primary aim of collecting material for comparative studies of English worldwide. Twenty-four research teams around the world are preparing electronic corpora of their own national or regional variety of English. Each ICE corpus consists of one million words of spoken and written English produced after 1989.

LAP (Linguistic Atlas Project):

We offer information about English as it is spoken in the United States. Most of the projects included here present the results of survey research carried out between 1930 and 1980; some are more recent.

LDC (Linguistic Data Consortium):

The Linguistic Data Consortium supports language-related education, research and technology development by creating and sharing linguistic resources: data, tools and standards.

LDC-IL (Linguistic Data Consortium for Indian Languages):

MISSION STATEMENT: Annotated, quality language data (both-text & speech) and tools in Indian Languages to Individuals, Institutions and Industry for Research & Development - Created in-house, through outsourcing and acquisition.

MetaShare:

META-NET is designing and implementing META-SHARE, a sustainable network of repositories of language data, tools and related web services documented with high-quality metadata, aggregated in central inventories allowing for uniform search and access to resources. Data and tools can be both open and with restricted access rights, free and for-a-fee. META-SHARE targets existing but also new and emerging language data, tools and systems required for building and evaluating new technologies, products and services.

OLAC (Open Language Archives Community):

OLAC, the Open Language Archives Community, is an international partnership of institutions and individuals who are creating a worldwide virtual library of language resources by: (i) developing consensus on best current practice for the digital archiving of language resources, and (ii) developing a network of interoperating repositories and services for housing and accessing such resources.

Oxford Text Archive:

The University of Oxford Text Archive develops, collects, catalogues and preserves electronic literary and linguistic resources for use in Higher Education, in research, teaching and learning. We also give advice on the creation and use of these resources, and are involved in the development of standards and infrastructure for electronic language resources.

Speechocean:

Speechocean, as a global provider of language resources and data services, has more than 200 large-scale databases available in 80+ languages and accents covering the fields of Text to Speech, Automatic Speech Recognition, Text, Machine Translation, Web Search, Videos, Images etc.

TalkBank:

The goal of TalkBank is to foster fundamental research in the study of human and animal communication. It will construct sample databases within each of the subfields studying communication. It will use these databases to advance the development of standards and tools for creating, sharing, searching, and commenting upon primary materials via networked computers.

Single Datasets or Specialized Repositories

[Again an incomplete sample of self-descriptions, in alphabetical order]

Buckeye:

The Buckeye Corpus of conversational speech contains high-quality recordings from 40 speakers in Columbus OH conversing freely with an interviewer. The speech has been orthographically transcribed and phonetically labeled. The audio and text files, together with time-aligned phonetic labels, are stored in a format for use with speech analysis software (Xwaves and Wavesurfer). Software for searching the transcription files is currently being written. The corpus is FREE for noncommercial uses.

CMU Pronouncing Dictionary:

The Carnegie Mellon University Pronouncing Dictionary is a machine-readable pronunciation dictionary for North American English that contains over 125,000 words and their transcriptions.
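As a rough illustration of what "machine-readable" buys you, here is a minimal Python sketch for loading entries in the dictionary's plain-text layout. It assumes the commonly distributed format: one entry per line, the word followed by ARPAbet phones, stress digits on the vowels, `;;;` for comment lines, and a parenthesized number after the word for alternate pronunciations. The function names and the sample lines are illustrative, not part of the CMU distribution, and details such as the numbering of variants differ between releases.

```python
import re

def parse_cmudict(lines):
    """Parse CMUdict-style lines into {word: [list of phone lists]}."""
    entries = {}
    for line in lines:
        line = line.strip()
        # Skip blank lines and ';;;' comment lines.
        if not line or line.startswith(";;;"):
            continue
        word, *phones = line.split()
        # Alternate pronunciations look like WORD(1); strip the marker
        # so that all variants are grouped under the same key.
        word = re.sub(r"\(\d+\)$", "", word)
        entries.setdefault(word, []).append(phones)
    return entries

def syllable_count(phones):
    """Count syllables: vowels are the phones that end in a stress digit."""
    return sum(p[-1].isdigit() for p in phones)

# Illustrative sample lines (assumed layout, not read from the real file).
sample = [
    ";;; a comment line",
    "CORPUS  K AO1 R P AH0 S",
    "DATA  D EY1 T AH0",
    "DATA(1)  D AE1 T AH0",
]

lexicon = parse_cmudict(sample)
for word, prons in lexicon.items():
    for phones in prons:
        print(word, " ".join(phones), syllable_count(phones))
```

With the real file, you would pass an open file handle instead of `sample`; check the encoding of the release you download before assuming UTF-8.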

IViE (Intonational Variation in English):

The IViE corpus contains recordings of nine urban dialects of English spoken in the British Isles. Recordings of male and female speakers were made in London, Cambridge, Cardiff, Liverpool, Bradford, Leeds, Newcastle, Belfast in Northern Ireland and Dublin in the Republic of Ireland. Three of our speaker groups are from ethnic minorities: we have recorded bilingual Punjabi/English speakers, bilingual Welsh/English speakers and speakers of Caribbean descent.

FRED (Freiburg English Dialect Corpus):

The primary aim of compiling FRED is to provide a sound database that helps strengthen research on morpho-syntactic variation in the British Isles. [...] The full version of FRED (2.5 million words) is available to researchers and (visiting) scholars at the University of Freiburg only, due to copyright restrictions. However, a 1-million word sampler version of the corpus [...] is going to be published on the next ICAME-CD. In the meantime, you can email us to obtain FRED-S.

MICASE (Michigan Corpus of Academic Spoken English):

The Michigan Corpus of Academic Spoken English (MICASE) is a collection of nearly 1.8 million words of transcribed speech (almost 200 hours of recordings) from the University of Michigan (U-M) in Ann Arbor, created by researchers and students at the U-M English Language Institute (ELI). MICASE contains data from a wide range of speech events (including lectures, classroom discussions, lab sections, seminars, and advising sessions) and locations across the university.

IcePaHC (Icelandic Parsed Historical Corpus):

The corpus is released under a free and open source license (LGPL) and there is no registration wall. The current release is version 0.9 of 1,002,390 words total from every century between the 12th and the 21st centuries inclusive. All of the text for version 1.0 is already included but some minor corrections remain to be finished. We recommend use of released versions to ensure that results can be replicated but between releases you can watch the development at Github.

EMILLE (Enabling Minority Language Engineering):

...was a 3 year EPSRC project at Lancaster University and Sheffield University. Its end product was a 97 million word electronic corpus of South Asian languages, especially those spoken in the UK.

Penn Corpora of Historical English:

The Penn Corpora of Historical English, including the Penn-Helsinki Parsed Corpus of Middle English, second edition (PPCME2), the Penn-Helsinki Parsed Corpus of Early Modern English (PPCEME), and the Penn Parsed Corpus of Modern British English (PPCMBE), are running texts and text samples of British English prose across its history - from the earliest Middle English documents up to the First World War. The texts come in three forms: simple text, part-of-speech tagged text and syntactically annotated text. The syntactic annotation (parsing) permits searching not only for words and word sequences, but also for syntactic structure. All of the annotation has been carefully checked by expert human annotators for accuracy and consistency. The corpora are designed for the use of students and scholars of the history of English, especially the historical syntax of the language, and they are publicly available to individuals, research groups and libraries.

UMass Amherst Linguistics Sentiment Corpora:

The UMass Amherst Linguistics Sentiment Corpora consist of n-gram counts extracted from over 700,000 online product reviews in Chinese, English, German, and Japanese. The files are UTF-8 encoded text. They are formatted to be read in as R data frames, but they can easily be manipulated with other tools. We are releasing them under a Creative Commons Share Alike license.

Places to Hunt, Fish, and Gather

Again, this is an incomplete and somewhat English-biased sample...

18thConnect:

A sister-organization for NINES, 18thConnect gathers together a community of scholars that shapes the world of digital resources. Our main concerns are: Access via plain-text searching for all scholars to open access and proprietary and digital archives including EEBO and ECCO, even if their institutions are unable to afford those resources; Peer-review of the growing number of digital resources and archives for which 18thConnect offers an online finding aid; Reflection on Best Practices with scholars who are negotiating new modes of publication and scholarly production.

The American Presidency Project:

The American Presidency Project is the only online resource that has consolidated, coded, and organized into a single searchable database: The Messages and Papers of the Presidents: Washington - Taft (1789-1913); The Public Papers of the Presidents: Hoover to G.W. Bush (1929-2007) & Obama (2009); The Weekly Compilation of Presidential Documents: Carter - G.W. Bush (1977-2009); The Daily Compilation of Presidential Documents: Obama (2009-2012); Our archives also contain thousands of other documents such as party platforms, candidates' remarks, Statements of Administration Policy, documents released by the Office of the Press Secretary, and election debates. [103,802 documents in total]

ECCO:

Based upon the English Short Title Catalogue (ESTC) bibliography and printed works in Gale's The Eighteenth Century microfilm collection, Eighteenth Century Collections Online (ECCO) offers students and researchers access to the most comprehensive online library of 18th century book titles printed in the United Kingdom.

EEBO:

From the first book published in English through the age of Spenser and Shakespeare, this incomparable collection now contains more than 125,000 titles listed in Pollard & Redgrave's Short-Title Catalogue (1475-1640) and Wing's Short-Title Catalogue (1641-1700) and their revised editions, as well as the Thomason Tracts (1640-1661) collection and the Early English Books Tract Supplement. Libraries possessing this collection find they are able to fulfill the most exhaustive research requirements of graduate scholars - from their desktop - in many subject areas: including English literature, history, philosophy, linguistics, theology, music, fine arts, education, mathematics, and science.

Gallica: The digital library of the Bibliothèque nationale de France. 320,000 books, 830,000 newspapers and magazines.

Project Gutenberg:

Project Gutenberg offers over 42,000 free ebooks: choose among free epub books, free kindle books, download them or read them online. We carry high quality ebooks: All our ebooks were previously published by bona fide publishers. We digitized and diligently proofread them with the help of thousands of volunteers.

Hathi Trust:

HathiTrust is a partnership of academic & research institutions, offering a collection of millions of titles digitized from libraries around the world. [10,748,740 total volumes; 5,638,387 book titles; 280,517 serial titles; 3,762,059,000 pages; 482 terabytes; 3,406,276 volumes (~32% of total) in the public domain.]

Internet Archive (texts, audio, video, ...):

The Internet Archive is a 501(c)(3) non-profit that was founded to build an Internet library. Its purposes include offering permanent access for researchers, historians, scholars, people with disabilities, and the general public to historical collections that exist in digital format. Founded in 1996 and located in San Francisco, the Archive has been receiving data donations from Alexa Internet and others. In late 1999, the organization started to grow to include more well-rounded collections. Now the Internet Archive includes texts, audio, moving images, and software as well as archived web pages in our collections, and provides specialized services for adaptive reading and information access for the blind and other persons with disabilities. [347 billion web pages, 1,313,654 movies, 117,542 concerts, 1,655,392 audio recordings, 4,614,106 texts ...]

Librivox ("Acoustical liberation of books in the public domain"):

LibriVox volunteers record chapters of books in the public domain, and then we release the audio files back onto the net for free. All our audio is in the public domain, so you may use it for whatever purpose you wish. Languages: Ancient Greek (12), Arabic (12), Bengali (1), Bisaya (4), Bulgarian (8), Catalan (2), Chinese (527), Church Slavonic (8), Czech (1), Danish (52), Dholuo (1), Dutch (188), English (18,138), Esperanto (16), Farsi (1), Finnish (16), French (689), German (1652), Greek (18), Hebrew (21), Hungarian (22), Indonesian (6), Irish (3), Italian (290), Japanese (126), Javanese (23), Korean (3), Latin (49), Latvian (4), Middle English (6), Multilingual (147), Norwegian (3), Old English (6), Polish (48), Portuguese (160), Romanian (15), Russian (34), Spanish (437), Swedish (22), Tagalog (25), Tamil (4), Turkish (4), Ukrainian (1), Urdu (36), Welsh (2), Yiddish (14). [advanced search]

NINES (Networked Infrastructure for Nineteenth-Century Electronic Scholarship):

... is a scholarly organization devoted to forging links between the material archive of the nineteenth century and the digital research environment of the twenty-first.

Usenet:

Open Access Journals: the Directory of Open Access Journals lists 9,790 journals from 120 countries, comprising 1,133,315 articles.

Oral Histories: According to Wikipedia,

Oral history is the collection and study of historical information about individuals, families, important events, or everyday life using audiotapes, videotapes, or transcriptions of planned interviews. These interviews are conducted with people who participated in or observed past events and whose memories and perceptions of these are to be preserved as an aural record for future generations. Oral history strives to obtain information from different perspectives, and most of these cannot be found in written sources. Oral history also refers to information gathered in this manner and to a written work (published or unpublished) based on such data, often preserved in archives and large libraries.

The Oral History Association lists 54 "Centers and Collections" in the U.S. But web search (e.g. for "PLACENAME oral history") will often turn up other things, e.g. the Regional Oral History Office (ROHO) at Berkeley, or the interview logs from History 400 at Cleveland State University, or the archive of the Ann Arbor Farmers Market Oral History project, or ...

There are lots of radio/TV programs with both audio and transcripts (e.g. here), and even more that just have transcripts (e.g. here or here) or just have audio (as podcasts or in streaming form).

There are also plenty of political recordings with transcripts (e.g. here). And of course, podcasts of all sorts.

oyez.org:

The Oyez Project at Chicago-Kent is a multimedia archive devoted to the Supreme Court of the United States and its work. It aims to be a complete and authoritative source for all audio recorded in the Court since the installation of a recording system in October 1955.

Wikipedia: Enough said.

YouTube:

Useful Online Search Sites

corpus.byu.edu:

The corpora were created by Mark Davies, Professor of Linguistics at Brigham Young University in Provo, Utah, USA. In most cases (although see notes on the BNC and Strathy in #2 below) this involved designing the corpora, collecting the texts, editing and annotating them, creating the corpus architecture, and designing and programming the web interfaces.

The Sketch Engine:

The Sketch Engine is for anyone wanting to research how words behave. It is a Corpus Query System incorporating word sketches, one-page, automatic, corpus-derived summaries of a word's grammatical and collocational behaviour.

Many of the sites listed above have search capabilities -- and Google site search will also often work.
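For instance, a site-restricted query can be assembled programmatically when you want to check many sites or search terms in turn. The sketch below is a minimal Python example that relies on Google's `site:` operator and the `q` parameter of its search URL; the helper name `site_search_url` and the example site and terms are illustrative assumptions, and the same query string can simply be typed into the search box.

```python
from urllib.parse import urlencode

def site_search_url(site, terms):
    """Build a Google search URL restricted to one site via the site: operator."""
    query = f"site:{site} {terms}"
    # urlencode percent-escapes the colon and quotes and turns spaces into '+'.
    return "https://www.google.com/search?" + urlencode({"q": query})

# Hypothetical example: look for oral-argument transcripts on one of the
# sites mentioned above.
url = site_search_url("oyez.org", '"oral argument" transcript')
print(url)
```

This only builds the URL to open in a browser. Fetching result pages automatically is a separate matter and is subject to the search engine's terms of use.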

Some personal examples of small collections for limited purposes

A partial collection of Barack Obama's radio addresses and George W. Bush's radio addresses.

And selections from the Librivox reading of Camilo Castelo Branco's Amor de Perdição (1862) -- 0, 1a, 1b, ...