Language Log

NPR: oyez.org finishes Supreme Court oral arguments project

April 25, 2013 @ 6:05 am · Filed by Mark Liberman under Computational linguistics

"Once Under Wraps, Supreme Court Audio Trove Now Online", NPR All Things Considered 4/24/2013:

The court has been releasing audio during the same week as arguments only since 2010. Before that, audio from one term generally wasn't available until the beginning of the next term. But the court has been recording its arguments for nearly 60 years, at first only for the use of the justices and their law clerks, and eventually also for researchers at the National Archives, who could hear — but couldn't duplicate — the tapes. As a result, until the 1990s, few in the public had ever heard recordings of the justices at work.

But as of just a few weeks ago, all of the archived historical audio — which dates back to 1955 — has been digitized, and almost all of those cases can now be heard and explored at an online archive called the Oyez Project.

Some of the funding for the digitization and transcription, and all of the funding for the technology used in text/audio alignment and speaker identification, came from an NSF grant "ITR-SCOTUS: A Resource for Collaborative Research in Speech Technology, Linguistics, Decision Processes and the Law", which was due to run from 2003-2007 but actually ended (after a few no-cost extensions) in 2010. The P.I.s were Jerry Goldman (then at Northwestern), Tim Johnson (University of Minnesota) and me.

The NPR story embeds an instance of oyez.org's very nice Flash app for interacting with individual transcripts and recordings.

Documentation of the speech technology involved can be found in Jiahong Yuan and Mark Liberman, "Speaker Identification on the SCOTUS Corpus", Acoustics 2008. Jiahong did most of this work.

You may wonder why speaker identification was needed. Traditionally, transcripts of U.S. Supreme Court oral arguments did not identify individual justices by name, but instead referred to any one of them as "The Court".

In addition to developing the applied speech technology for text/audio alignment and speaker identification, Jiahong and I have used part of the SCOTUS corpus in some phonetic investigations, such as "Investigating /l/ variation in English through forced alignment", InterSpeech 2009. In that analysis, we used only the data from the 2001 term, which contained 21,706 tokens of /l/. The entire 60-year archive will soon be published in a form that will allow phoneticians as well as legal scholars to use it as a basis for research.


8 Comments

John Singler said,

April 25, 2013 @ 9:10 am

I became aware of Oyez.org just before the semester started (at a post-performance discussion of a theater piece by the NYC group Elevator Repair Service). As a result, Nate LaFave, Amanda Montell, Allison Shapp, and I have been conducting a lifespan study of Ruth Bader Ginsburg's speech, strictly on the basis of Oyez.org recordings. The link between transcript and recording (especially with the "clip & share" feature that enables one to specify which part of a recording to download) makes Oyez.org a valuable tool. Identifying the Justice is, as you indicate, a major help as well. Since we're considering Ginsburg, and she and Sandra Day O'Connor don't sound at all alike, identifying her in the earlier years of her time on the bench, when Justices weren't distinguished, has been possible; however, being able simply to search the transcript for "Ginsburg" is vastly faster.
Thanks, Mark and Jiahong. Quite apart from the public service, you've made a rich source of data readily accessible to linguists.

Jerry Goldman said,

April 25, 2013 @ 2:39 pm

Thanks, Mark and Jiahong, for your significant contributions. I'm still hoping that someone will experiment with speaker identification based solely on disfluencies, filled pauses, etc. that 'voicemark' or profile a particular individual. We sit on tenterhooks for our plan to create public-facing websites and apps for all appeals courts (state and federal).

Rubrick said,

April 25, 2013 @ 4:11 pm

My brain immediately jumped to this xkcd: http://xkcd.com/1199/, substituting Clarence Thomas for John Cage.

Bob said,

April 25, 2013 @ 7:15 pm

Are there plans for making the data open/available for acquisition by the public?

Eorrfu said,

April 25, 2013 @ 11:25 pm

Is there a corpus of transcripts for the modern court? I would love to know the percentage of cases in which Breyer has used the phrase "da da da". (He usually uses it in the sense of "etc." after a long, complicated setup for a hypothetical.)

Lazar said,

April 26, 2013 @ 6:24 pm

Listening to the first 15 minutes or so of Loving, I did see a few transcription errors – at 05:39 it should be "the purity of" rather than "purely", both times; at 07:40 it should be "races kept" rather than "races as kept"; at 08:20 I'm pretty sure it should be "and some got 'em then"; at 14:39 it should be "I needn't read the whole quote" rather than "I didn't read the whole quote".

Sheldon Stolowich said,

April 27, 2013 @ 12:07 am

Speaking of transcription errors, may I add that at 01:52 it should be "Commonwealth" [sc. "of Virginia"] rather than "common law."

Now You Too Can Hear The Old Justices Argue | Pasco Phronesis said,

May 10, 2013 @ 10:30 pm

[…] has reported (H/T Language Log/Twitter) that the archive, which dates back to 1955, has been digitized and made available at the […]
