Spontaneous SCOTUS


Years ago, Jerry Goldman (then at Northwestern) created the oyez.org website as

 a multimedia archive devoted to making the Supreme Court of the United States accessible to everyone. It is the most complete and authoritative source for all of the Court’s audio since the installation of a recording system in October 1955. Oyez offers transcript-synchronized and searchable audio, plain-English case summaries, illustrated decision information, and full-text Supreme Court opinions

He rescued decades of tapes and transcripts from the National Archives, digitized and improved them, and arranged the website's interactive presentations of the available recordings. Jiahong Yuan and I played a role, by devising and validating a program to identify which justice was speaking when (see "Speaker Identification on the Scotus Corpus", 2008).

More recently, Jerry has inspired an effort to recreate oral arguments from famous cases that took place before the recording system was installed, starting with Brown v. Board of Education. Rejecting the idea of producing "deep fakes" using the existing transcripts and extant recordings of the justices involved, he and his colleagues decided to create what we might call "shallow fakes", where actors will perform (selections from) the transcripts, and a voice morphing system will then be used to make their recordings sound like the target speakers. The recreated clips will be embedded in explanatory material.

All the scripts have been written, and in a few months, you'll be able to hear the results — which I expect will be terrific.

But Jerry has pointed out to me that there's another issue that's harder to fix. There are many things besides individual vocal identity that transcripts leave out. And even a skilled actor reading a standard transcript will not put most of these things back in, either in a generic way or in a way that's faithful to the original.

Here's a random recent example, from the 10/11/2023 oral argument in Alexander v. South Carolina State Conference of the NAACP.

The first justice to speak was Clarence Thomas:

The official transcript:

Mr. Gore, we review this for clear error. And the district court credited the plaintiffs' expert and found your experts non-credible. So how does that meet the clear error standard?

What he actually said, with filled pauses added and line breaks at clear silent pauses:

uh mister Gore we uh review this for uh clear error
and uh
the uh district court
credited uh the plaintiff's
expert and found your experts

non-credible
so
uh how does that meet the uh clear error standard?

In a fluent reading of the cleaned-up transcript, the distribution of silent pauses would certainly be different. And the actual delivery adds 8 filled pauses to the 30 words of the cleaned-up transcript. So if we count "uh" as a word, then filled pauses make up 8/38 = 21% of the words; if not, then the ratio of filled pauses to words is 8/30 = 27%.
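The counts above can be checked with a few lines of Python. This is only a minimal sketch: the `FILLED_PAUSES` inventory, the tokenizing rule, and the `count_tokens` helper are my own assumptions for illustration, not part of any established transcription tool.

```python
import re

# Assumed inventory of filled pauses; this excerpt happens to use only "uh".
FILLED_PAUSES = {"uh", "um"}

# The verbatim rendering of Justice Thomas's question, copied from the post.
VERBATIM = """\
uh mister Gore we uh review this for uh clear error
and uh
the uh district court
credited uh the plaintiff's
expert and found your experts

non-credible
so
uh how does that meet the uh clear error standard?
"""


def count_tokens(text):
    """Return (filled_pause_count, other_word_count) for a verbatim transcript."""
    # Treat each run of letters, apostrophes, and hyphens as one token,
    # so "plaintiff's" and "non-credible" each count once.
    tokens = re.findall(r"[a-z'-]+", text.lower())
    fillers = sum(1 for t in tokens if t in FILLED_PAUSES)
    return fillers, len(tokens) - fillers


fillers, words = count_tokens(VERBATIM)
share_if_uh_is_a_word = fillers / (fillers + words)  # 8 / 38
ratio_if_not = fillers / words                       # 8 / 30

print(fillers, words)                   # 8 30
print(f"{share_if_uh_is_a_word:.0%}")   # 21%
print(f"{ratio_if_not:.0%}")            # 27%
```

Note that this simple tokenizer would also count a truncated fragment such as "th-" as a word, so the later examples with false starts would need a decision about how to treat such fragments.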

That kind of transcript cleaning is absolutely standard, plausibly necessary for ease of reading, and in fact rather similar to what our perceptual system does in understanding the speech.

And Clarence Thomas is not the only member of the court to exhibit such features of spontaneous speech. Here's the next contribution, from John G. Roberts Jr.:

The official transcript does include a couple of the false starts:

Well, I thought — I thought he said that as far as geographic contiguity, that the — the size of the different districts was a adequate proxy for that.

But as expected, it leaves out the filled pauses, and doesn't indicate where the silent pauses happen:

well I thought- I thought he said that
((th-)) as far as geographic uh
contiguity
um
uh that the- the size of the different districts was a
adequate proxy for that

A bit later, we get this from Elena Kagan:

Again, the official transcript includes the four repetitions of "by", but not the filled pause, the self-corrected first syllable of "geography", or the silent pause locations:

Because that would have been the easiest way to undermine the theory. I mean, as I understand it, this was hardly touched upon by — by — by — by the state below. And, certainly, the state did not do what would seem to be the — the normal thing if you were really concerned about this, which is to say: Look at our study. We controlled for geography. The results are entirely different.

Because that would have been the easiest way to undermine
the theory. I mean as I understand it
uh this was hardly touched upon
by- uh by- by- by the state below
and certainly
the state did not
do what would
seem to be the-
the normal thing if you were really concerned about this, which is to say
look at our study, we controlled for /dʒɑɪ/- geography
the results are entirely different.

And of course, individual justices have different characteristic styles of spontaneity. For example, Anthony Kennedy, who served on the Supreme Court from 1988 to 2018, often displayed an unusually large number of syllable or onset repetitions. Here's his first contribution to the 2/19/2002 oral argument in Alabama v. Shelton:

The official SCOTUS transcript leaves out essentially all of the disfluencies, as they used to do (and misspells "ruse"):

it seems to me that if — if you say that the sentence cannot be reimposed, you're saying that the State courts are in the position of imposing a sentence that is something of a rouse. Why should you put your own courts in this position?

The oyez.org transcript puts (most of?) the repetitions back in (but leaves out the last sentence, for some reason):

It It It seems to me that if i- if you say that the s- sentence cannot be reim- im- imposed, you're saying that the State courts are in the position of imposing a sentence th- th- that is s- s- s- s- s- s- s- s- something of a ruse.

A more accurate version, noting a couple of filled pauses, more repetitions, and some silent pauses:

uh i- i- it seems to me that
if

i- i- if
you say that the- the sentence cannot be reim- im- imposed
you're saying that the state courts
are in the position of imposing
a sentence
th- that is s- s-
uh s- s- s- s- something of a ruse
why should you put your own courts in this position?

Kennedy's contributions to oral arguments often feature similarly unusual numbers of repetitions — here's another example from later in the same recording:

The official SCOTUS transcript gives us two of the repetitions, though none of the filled pauses:

Well, put in its raw form, if the judge said, now, if — if you agree not to have a counsel, I'll agree not to impose a jail sentence, that — that wouldn't be permitted.

More accurately (though less readably):

Well p- -put- put in its raw form, if the judge said now
if- if you agree not to have a counsel, I'll agree not to
uh impose a jail sentence uh
that- that wouldn't be permitted.

In contrast, Kennedy's public speaking seems to have been entirely fluent, as in this 2006 keynote address to the American Bar Association, perhaps because he was reading (or at least performing) prepared material. For a similar observation about George Carlin, see "Rhetoric as music", 9/17/2023.

As we've seen, standard transcripts typically omit all filled and silent pauses within individual speaker turns, as well as most false starts and onset or syllable repetitions.

But the actual pattern of inter-speaker interaction is also not accurately represented in a turn-by-turn transcript. Thus in Choctaw Nation v. Oklahoma, during the oral argument (part 1) on 10/22/1969, an early Q&A between Potter Stewart and Lon Kile is represented this way in the official SCOTUS transcript:

But the actual exchange, disfluencies aside, involves much more interactive overlap:

There's been a lot of work over the years on modeling the locations of filled and silent pauses, fluent repetitions, and inter-speaker "co-construction" of dialogue. So in principle it will be possible to supplement the voice morphing with a "spontaneity creation" algorithm, which would insert appropriate interpolations in appropriate places, recreate naturalistic overlaps, and so on. (Though there won't be time to do it for this first recreation — and we'd need a discussion about whether it's the right thing to do in any case…)

There are also some interesting questions about the production and perception of the features that distinguish spontaneous speech from reading or memorized material.

On the production side, in my experience, people (and even actors) are not very good at performing veridical transcripts in a natural-sounding way, probably because reading and remembering scripts inserts the instructions at the wrong level of the neurological system.

And in listening, all the spontaneous-speech interpolations (filled and silent pauses, onset and syllable repetitions, etc.) tend to evaporate quickly from our memory. A good way to feel that process in action is to try to create an accurate transcription of spontaneous speech. Try this with the samples presented earlier in this post, and I think you'll find that it's very hard to get the location and number of the interpolations right without using a DAW or similar program to isolate short segments of the recording for repeated listening.

Update — Jerry Goldman notes:

After listening to so many hours, I could identify a speaker within 1-2 seconds often by the disfluency style. I remember that Byron White would frequently clear his throat before he spoke. And you tagged Kennedy with his unique manner.

 



2 Comments

  1. D.O. said,

    March 2, 2024 @ 1:45 pm

    Did anyone do an experiment where an actor impersonated ("did") someone and then the frequency of all those filled pauses, disfluencies, breaks and the like were compared between a genuine article and a reproduction? Probably when done for comedic effect, some features of the target are exaggerated, but surely there is a lot of material from the neutral settings. In other words, how important are these conventionally edited out details for the listener?

  2. Seth said,

    March 2, 2024 @ 5:20 pm

    One "fake" type problem which strikes me is that the transcript gives no information on the tone of a statement. Emphasis, rising intonation indicating amount of uncertainty, thundering indignation – all of these are not usually recorded. Thus the recreation becomes more like a radio play, where the argument transcript is the script text. I wonder if there will be temptation to perform it like a radio play, with some dramatic oratorical intonation used, giving an impression of emotion which was not present at the real event. After all, in our present imagination, these cases are viewed as totemic events, in a fight of good against evil. And while that may be so, the reality of the situation can be pretty dry due to the formality involved.

    This is especially the case if using actors, who are trained to view this sort of text as a dramatic performance, and not to speak as a non-actor would.
