The BYU Law corpora (updated)


[Cross-posted on LAWnLinguistics.]

I’d imagine that most people who’ve been actively involved with corpus linguistics are familiar with the BYU corpora—a collection of web-accessible corpora created by Brigham Young University linguistics professor Mark Davies. These corpora (and BYU’s corpus-linguistics program more generally) have played an essential part in the development of what I’ll call the corpus-linguistic turn in legal interpretation. The BYU corpora served as my entry-point into corpus linguistics, and they have provided the corpus data that has been used in most of the law-and-corpus-linguistics work that has been done to date. And beyond that, the BYU Law School has played an enormous role, in a variety of ways, in Law and Corpus Linguistics becoming a thing.

One of the things that the law school has been doing has been happening largely behind the scenes. For the past two or three years, people there have been developing the Corpus of Founding Era American English (COFEA), a historical corpus that is intended as a resource for studying language usage in the time leading up to the drafting and ratification of the U.S. Constitution. At this year’s conference on law and corpus linguistics (the third such conference, all of them hosted by the BYU Law School), we were given a preview of COFEA. And via a tweet by the law school’s dean, Gordon Smith, I’ve now learned that a beta version of COFEA is up and available for public playing-around-with, as are beta versions of two other corpora: the Corpus of Early Modern English and the Corpus of Supreme Court of the United States.

All three corpora are hosted on a new website titled BYU Law Corpus Linguistics, the URL for which seems familiar to me, for some reason that I can’t put my finger on. Be that as it may, here’s how the website describes the three corpora:

Corpus of Founding Era American English (COFEA)
95,133 texts
138,892,619 words
The Corpus of Founding Era American English covers the time period starting with the reign of King George III, and ending with the death of George Washington (1760-1799). COFEA contains documents from ordinary people of the day, the Founders, and legal sources, including letters, diaries, newspapers, non-fiction books, fiction, sermons, speeches, debates, legal cases, and other legal materials. Three sources have provided the majority of texts, the National Archive Founders Online; William S. Hein & Co., HeinOnline; Text Creation Partnership (TCP) Evans Bibliography (University of Michigan).

Corpus of Early Modern English
40,300 texts
1,283,475,411 words
The Corpus of Early Modern English cover texts from 1475–1800 that were included in the Evans Bibliography, the Early English Books Online (EBO), Eighteenth Century Collections Online (ECCO) corrected by the Text Creation Partnership (TCP) Evans Bibliography (University of Michigan).

Corpus of Supreme Court of the United States
31,682 texts
140,853,673 words
The Corpus of the United States Supreme Court includes all opinions in the United States Reports and opinions published by the Supreme Court through the 2017 term.

All three corpora sport a new user interface that is designed to be more lawyer-friendly than the interface for the existing BYU corpora. My initial impression is that the new interface looks like it will be a step in the right direction, with the ways to invoke the site’s functionality being more immediately visible or at least more easily find-able than is the case with the older interface. (User-interface developers undoubtedly have at least 100 words for the kind of thing that I'm talking about. Unfortunately, I don't know any of them.)

Nevertheless, the interface is definitely still at the beta stage. It’s not self-explanatory, and if there are any help files, I couldn’t find them. In order to take COFEA out for a test-drive, I searched for instances of the string been increased, and although I learned that there were 115 instances of that string in the corpus, I couldn’t figure out how to display any of them. Every time I clicked on what I thought was an appropriate place, all I got was a lot of blankness. And at this point it becomes relevant for me to note that in addition to there apparently being no help files yet, there is no link for reporting issues to the developers. But I am sure that these kinks will be worked out. I will make some inquiries and will report back on what I learn.

Finally, I want to note that although the motivation behind the development of these corpora has been to create tools for dealing with legal issues, it may turn out that the Corpus of Early Modern English, with its wide temporal coverage (325 years as compared to COFEA's 39), will be of interest more to historical linguists than to law-and linguists. However, whether that turns out to be the case could depend on a variety of factors, so we’ll have to await the historical linguists’ verdict.


I've been informed that the developers are aware of the issue I've raised and are working on it.

In the meantime, I've figured out how to get the results of my search displayed. I don't know whether it's because a fix has been implemented, or because I just stumbled on what needs to be done. So here's how to display results for a two-word string of the form word1 word2.

  1. At the top of the screen, click on "Matches."
  2. Enter word1 in the Query box.
  3. Immediately above the Query box, where there are choices "Matches," "Sections," and "Collocates," click on "Collocates."
  4. In the Collocate box (to the right of the Query box) delete the asterisk and enter word2.
  5. To the right of the Collocate box are boxes labeled "Left" and "Right." In the Left box, set the value to 0. In the Right box, set the value to 2—even though you're only interested in word2 when it immediately follows word1. For some reason, if you set the value to 1, you won't be able to see the results. [Update: This has now been fixed, so you can set the value to 1 if you only want to see the hits for collocates immediately to the right of the keyword.]
  6. Hit Enter. You will then get a line showing the number of hits.
  7. Click on that line.

That's what worked for me. Hopefully it will work for you, too.
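
For anyone curious about what the Left and Right settings amount to, here is a minimal sketch in Python of a windowed collocate search. To be clear, this is not how the BYU site is implemented (I have no idea what's under the hood); the function name `collocate_hits`, the crude tokenizer, and the sample sentence are all my own assumptions, chosen only to illustrate the idea.

```python
import re

def collocate_hits(text, keyword, collocate, left=0, right=1):
    """Hypothetical helper: find each occurrence of `keyword` that has
    `collocate` within `left` tokens before it or `right` tokens after it.
    Returns a list of (token_position, context_string) pairs."""
    # Crude tokenizer (an assumption): lowercase runs of letters/apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    hits = []
    for i, tok in enumerate(tokens):
        if tok != keyword:
            continue
        start = max(0, i - left)
        end = min(len(tokens), i + right + 1)
        # The window holds the neighbors only, not the keyword itself.
        window = tokens[start:i] + tokens[i + 1:end]
        if collocate in window:
            hits.append((i, " ".join(tokens[start:end])))
    return hits

# Made-up sample text, not drawn from COFEA.
sample = ("The duties have been increased. The number has been greatly "
          "increased, and it had been reduced before.")

# Right = 1: only "increased" immediately after "been" counts.
print(collocate_hits(sample, "been", "increased", left=0, right=1))
# Right = 2: "been greatly increased" is picked up as well.
print(collocate_hits(sample, "been", "increased", left=0, right=2))
```

With Right set to 1 the sketch returns only the hit where word2 immediately follows word1; widening the window to 2 also admits hits with one intervening word, which is why the wider setting can return more than the exact two-word string.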


  1. geekosaur said,

    May 7, 2018 @ 1:27 am

    From experience, most of those hundred words aren't suitable for polite company.

  2. J.W. Brewer said,

    May 7, 2018 @ 8:44 am

    I look forward to CoEME getting out of beta. I note that one of its sources of texts appears to be EEBO. I was told some years ago by a professor who used it a fair amount that EEBO was originally (perhaps as a condition of some of its funding) supposed to be gotten into a shape where it would be freely accessible to anyone with an internet connection w/o needing a university-library subscription but there had been multiple delays in getting it to that point. If BYU is finally making that happen (plus throwing in some other sources), more power to them.

    Obviously as one gets back past 1800, English orthography gets less and less standardized. Whether the user interface will end up offering tools to deal with that issue in a semi-automated way I guess remains to be seen?

  3. Jonathan said,

    May 7, 2018 @ 9:05 am

    I think the UI developer word you're looking for is 'discoverable'.

  4. Thomas Shaw said,

    May 7, 2018 @ 9:15 am

    I was able to get a list by clicking on the result line (e.g. on the 115 you mentioned), and to get context by clicking on any of the quotes that came up after that. However, the search for "been increased" seems to bring up instances of "been + [word]", whether or not [word] is "increased". (Although not every instance of "been + [word]". Searching for "been" on its own brings up many others).

  5. Coby Lubliner said,

    May 7, 2018 @ 3:47 pm

    I took a stab at using COFEA in order to substantiate my hunch that "freedom of the press" has to do with printing, not with journalism, and so far so good. Thomas Bradbury Chandler defines it as "the liberty of publishing, by means of the press, remarks upon, objections to, and discussions of, all public transactions, whether relating to religion or government." And, when "the press" is used metonymically to denote a profession, it is that of printer, not journalist.

  6. J.W. Brewer said,

    May 7, 2018 @ 4:57 pm

    @Coby Lubliner: In an age where more politically controversial content in a newspaper was likely to be either anonymous or pseudonymous, the fellow who had to worry the most about being sued or prosecuted for libel was the fellow who published the newspaper, who would be easier to find. In those days, that tended almost invariably to be the same fellow who owned and operated the physical printing press. Thus, in the controversy in 1733-34 over the New-York Weekly Journal, the aggrieved royal governor first cracked down on the printer/publisher (John Peter Zenger, who beat the rap due to jurors hostile to the governor's crackdown), while attempting via the offer of reward money to find out the identities of some of the anonymous authors of specific articles that had been critical of his administration.

  7. Dennis Baron said,

    May 7, 2018 @ 6:49 pm

    that collocate search worked! is there any way to save a search? i've got well over 200 hits for bear arms in COFEA and I'd like to go over them at leisure, if there is such a thing as leisure.

  8. Dennis Baron said,

    May 7, 2018 @ 11:57 pm

    Sorry, J. Scalia, you got it wrong in Heller. I just ran "bear arms" through BYU's EMne [Early Modern English (ng)] and Founding Era American English corpora, and of about 1500 matches (not counting the duplicates), all but a handful are clearly military.
