Language Log

Non-markovian yawp

September 18, 2011 @ 7:42 am · Filed by Mark Liberman under Computational linguistics

Now that I've got morning internet access again, and the semester is more or less underway, it's time for another Breakfast Experiment™.

In "Markov's Heart of Darkness" (7/18/2011) and "Finch linguistics" (7/13/2011), we learned that Joseph Conrad's paragraphs are more markovian, at least in terms of their distribution of lengths, than zebra finch song bouts are. So I wondered about length distributions in some other sources: pause groups in conversational speech, and lines in Walt Whitman's poetry.

By "pause group" I mean simply the stretch of speech between silent pauses, as described in "The shape of a spoken phrase", 4/12/2006. As a source of data, I used the Mississippi State word alignments for Switchboard. (In particular, the version that I used is here, and the word counts for the 509,242 pause groups in the corpus, as I extracted them from the *word.text files, are here.)

The minimum length is one, and the maximum is 63:

The mode is 1 — as it would have to be for a two-state markov process to be responsible for the data — and the mean is 6.03. However, the empirical probability of continuing after N words is by no means constant:

(The histogram counts are here, and the calculated empirical probabilities of continuation are here.)
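For concreteness, here is a minimal Python sketch of one way to compute an empirical probability of continuation from a list of lengths: among the units that reach length N, the proportion that go on to be longer than N. The function name and the toy counts are illustrative assumptions, not the code used for this post.

```python
from collections import Counter

def continuation_probabilities(lengths):
    """Return {n: P(length > n | length >= n)} for n = 1 .. max observed length."""
    counts = Counter(lengths)
    at_least = sum(counts.values())  # number of units with length >= 1
    probs = {}
    for n in range(1, max(counts) + 1):
        longer = at_least - counts.get(n, 0)  # units with length > n
        probs[n] = longer / at_least
        at_least = longer
    return probs

# Toy data (made up): counts halve at each length, as a simple markov
# process with continuation probability 0.5 would predict.
toy = [1] * 8 + [2] * 4 + [3] * 2 + [4, 5]
print(continuation_probabilities(toy))
# {1: 0.5, 2: 0.5, 3: 0.5, 4: 0.5, 5: 0.0}
```

In the toy data the probability stays flat at 0.5 until the sample runs out, which is the constant-continuation signature that the Switchboard counts fail to show.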

I conjecture that the special behavior of very short pause-groups reflects the fact that these conversational pause groups are a mixture of at least two quite different processes, one process generating quasi-independent contributions like

yeah i mean
for somebody who is
you know for most of their life has has
uh
not just merely had a farm but had ten children
had a farm
ran everything because her husband was away in the coal mines and
and you know facing that situation it it's quite a dilemma
i think

and the other generating short backchannel feedback like "right", "I know", "yeah really", "oh I see", …
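To see how a mixture alone could depress the probability of continuation at very short lengths, here is a small Python sketch that computes it exactly for a blend of two simple markov (geometric) length processes. The mixing weight and the two continuation probabilities are invented for illustration, not estimated from Switchboard.

```python
def mixture_continuation(n, weight_short, p_short, p_long):
    """P(length > n | length >= n) for a mixture of two geometric processes.

    For a single geometric process with continuation probability p,
    P(length >= n) = p**(n - 1) and P(length > n) = p**n.
    """
    at_least = weight_short * p_short ** (n - 1) + (1 - weight_short) * p_long ** (n - 1)
    longer = weight_short * p_short ** n + (1 - weight_short) * p_long ** n
    return longer / at_least

# Hypothetical parameters: 30% backchannel-like units that rarely continue,
# 70% substantive units that usually do.
for n in range(1, 6):
    print(n, round(mixture_continuation(n, 0.3, 0.3, 0.85), 3))
```

Under these assumptions the continuation probability starts low and climbs toward the value for the longer process as the short units drop out. A mixture of this kind would account only for the initial rise, not for any later decline.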

The larger-scale process seems to gradually run out of steam, in the sense that beyond 3, the probability of continuing falls gradually and steadily up to and beyond 25 words. This decline is systematic and statistically significant, and it means that the length distribution of even the longer pause groups can't reflect a simple markov process (for the reasons explained here). However, the fall is quite gradual, and the distribution is quite a smooth one, so that (especially if we treat length one as special) it would be pretty well approximated by an exponential decay:

What about sequential effects? At least taking each conversational side by itself, there is a statistically significant but small (r = 0.1) positive correlation between the lengths of adjacent pause groups. Linear regression yields

G_{n+1} = 5.4 + 0.1*G_n

That is, the length in words of pause group n+1 is predicted to be 5.4 plus one tenth of the length of pause group n. There's enough data that the slope of this relationship is clearly different from zero — p < 2*10^(-16), according to R — but only around 1% of the variance in length is being accounted for.
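The numbers above came from R, but the same kind of lag-1 fit can be sketched in a few lines of plain Python. The function name is hypothetical, and the sketch assumes it is given the pause-group lengths of a single conversational side in order, so that no pair straddles two speakers.

```python
def lag1_regression(lengths):
    """Least-squares fit of lengths[i + 1] on lengths[i].

    Returns (intercept, slope, r), where r is the Pearson correlation
    between each length and the one that follows it.
    """
    x = lengths[:-1]
    y = lengths[1:]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / (sxx * syy) ** 0.5
    return intercept, slope, r

# Made-up sequence in which each length is exactly one more than the last.
print(lag1_regression([1, 2, 3, 4, 5]))
```

The proportion of variance accounted for is r squared, which is why r = 0.1 corresponds to about 1%.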

In order to look at line lengths in Whitman's poetry, I downloaded the 1881-1882 edition of Leaves of Grass from the Whitman Archive, and ran the .html file through a little script to eliminate (I hope) everything but poem text and titles, with the titles used to delimit poems but otherwise ignored. Run-on lines (marked with line-initial spaces in this text) were joined. The histogram of line lengths in the result is as follows:

Here we can see some evidence that a different process is involved in creating the counts for very short lines; and in this case, we can tell more exactly what it is. I didn't take the time to distinguish section numbers and internal section names from lines of poetry, and most of these (there are about 200 of them in the 10,202 lines) are of length 1. And in fact, most of the "lines" of length 1 are such section numbers and internal section names.
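As a rough illustration of the preprocessing described above, here is a Python sketch that joins run-on lines (those starting with whitespace) to the line before them and then counts whitespace-separated words. It assumes the HTML has already been reduced to plain text lines. The function names are made up, the wrap point in the sample is invented, and this is not the script actually used.

```python
def join_run_on_lines(raw_lines):
    """Merge each line that starts with whitespace into the preceding line."""
    joined = []
    for raw in raw_lines:
        text = raw.rstrip()
        if not text.strip():
            continue  # skip blank lines
        if text[0].isspace() and joined:
            joined[-1] += " " + text.strip()  # continuation of a run-on line
        else:
            joined.append(text.strip())
    return joined

def line_lengths(lines):
    """Length of each line in whitespace-separated words."""
    return [len(line.split()) for line in lines]

sample = [
    "The spotted hawk swoops by and accuses me, he complains of my gab",
    "    and my loitering.",  # invented wrap, marked by leading spaces
    "I too am not a bit tamed, I too am untranslatable,",
    "I sound my barbaric yawp over the roofs of the world.",
]
print(line_lengths(join_run_on_lines(sample)))
# [16, 11, 11]
```

A filter for section numbers and internal section names would slot in between the two steps, removing most of the spurious "lines" of length 1.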

The modal line length is 10 words — with or without omitting the lines of length 1 — and the mean is 11.1. Omitting "lines" of length 1, the mean is 11.3.

This seems to be a distinctly different pattern from Conrad's paragraphs and Switchboard pause groups. But a plot of the empirical probability of continuation has some qualitative similarities to the Switchboard plot:

In particular, there's an initial rise (here from 1 to 2) reflecting the fact that most very short "lines" are the output of a quite different process; and then a steady fall (here from 2 to 15) reflecting the same non-markovian "running out of steam" phenomenon that we saw in the zebra finch song bouts ("Finch linguistics", 7/13/2011) as well as in the Switchboard pause groups.

The patterns are quantitatively quite different — but perhaps a wider range of free-verse authors, and a wider range of speech styles, would yield more quantitative overlap. In particular, it seems possible that skilled extemporaneous narrative might have a modal pause-group length more like Whitman's modal line length.

What about sequential effects in the Whitman line-length data? There's also a positive correlation between the lengths of adjacent lines, and it's a bit larger than in Switchboard pause groups: r=0.3, 9% of variance accounted for. The coefficients of a linear model are

L_{n+1} = 7.9 + 0.3*L_n

A two-dimensional histogram exhibits the relationship graphically:

The aspect of all of this that intrigues me the most is the prevalence of processes in which the probability of continuing decreases gradually as the generated string lengthens. We see the same thing in zebra finch song bouts, conversational pause groups, and lines of free verse. It's a simple idea, and easy to implement algorithmically, but I haven't seen a mathematical or neurological treatment (though this is at least as likely to reflect my ignorance as the state of the literature).
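To show how little it takes to implement such a process, here is a minimal Python sketch in which the probability of adding another item falls as the string grows. The linear decline and the parameter values are assumptions chosen for illustration; nothing in the data above says the real decline is linear.

```python
import random

def continue_probability(n, start=0.9, step=0.03):
    """Probability of adding another item after n items (declines linearly)."""
    return max(0.0, start - step * n)

def generate_length(rng, start=0.9, step=0.03):
    """Grow a string one item at a time, stopping ever more readily."""
    n = 1
    while rng.random() < continue_probability(n, start, step):
        n += 1
    return n

rng = random.Random(0)  # fixed seed so the sketch is repeatable
sample = [generate_length(rng) for _ in range(20000)]
print(min(sample), max(sample), sum(sample) / len(sample))
```

Tabulating the empirical probability of continuation for `sample` should show a steady fall, by construction. Replacing `continue_probability` with a constant turns this back into the simple markov case.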

Anyhow, Whitman also saw a connection between his poetry and avian vocalizations:

The spotted hawk swoops by and accuses me, he complains of my gab and my loitering.

I too am not a bit tamed, I too am untranslatable,
I sound my barbaric yawp over the roofs of the world.


5 Comments

Jerry Friedman said,

September 18, 2011 @ 1:06 pm

In Rhyme's Reason, John Hollander distinguishes and imitates a number of typical kinds of free verse:

Translations of verse in the Bible

Broken up by syntactic units

Deliberately breaking syntactic units

Having stanzas that look like traditional stanzas, such as equal-line quatrains or Sapphics

Containing traditional meters disguised by the line breaks

"Pulsing and oceanic", like Whitman

Having short lines to slow reading down (or occasionally speed it up), like the Imagists

Scattered over the page, like Cummings or Ferlinghetti.

He then mentions rhyming with irregular meter, as in the Golden Trashery
Of Ogden Nashery.

Perhaps these styles can be distinguished by frequency plots of line length.

(Hollander doesn't mention any poets; the examples are all mine. The rhyme about Nash is due to Louis Untermeyer, I believe.)

What sort of process would not have a probability of continuing that decreases gradually as the generated string lengthens? Something that people tend to repeat, even perseverate at, so once you start you want to continue?

I can't help adding that the first seven paragraphs of MYL's post, comprising 266 words according to NeoOffice, contain no passive constructions. For comparison, I can't imagine anyone accidentally writing that much prose without the letter "a". I'm sure the distributions of letter occurrences, such as the number of letters between each occurrence of "a", are well-studied (grammotactics?). I'd guess that for the common letters, there'd be a steep cut-off, likely so steep that you can't even tell whether it's exponential. I'm not going to guess what the pattern would look like for the rare letters. Likewise, but no doubt with much more difficulty, one could plot the number of words or tensed clauses between passive constructions. I'm not going to guess what that would look like either.

Jerry Friedman said,

September 18, 2011 @ 1:17 pm

Sorry, I overlooked "as described in 'The shape of a spoken phrase'", which is a "bare passive", right.

Bill Benzon said,

September 18, 2011 @ 2:01 pm

Mark, in the mid-90s Wallace Chafe did some work based on a relatively small corpus of spoken language and came up with three kinds of speech units: substantive, regulatory, and fragmentary. The regulatory units are your back channel feedback. Here's what I said about Chafe's work in a longish paper on "Kubla Khan":

Nonetheless, the linguist Wallace Chafe has quite a bit to say about what he calls an intonation unit, and that seems germane to any consideration of the poetic line. In Discourse, Consciousness, and Time Chafe asserts that the intonation unit is “a unit of mental and linguistic processing” (Chafe 1994, pp. 55 ff., 290 ff.). He begins developing the notion by discussing breathing and speech (p. 57): “Anyone who listens objectively to speech will quickly notice that it is not produced in a continuous, uninterrupted flow but in spurts. This quality of language is, among other things, a biological necessity.” He goes on to observe that “this physiological requirement operates in happy synchrony with some basic functional segmentations of discourse,” namely “that each intonation unit verbalizes the information active in the speaker’s mind at its onset” (p. 63). . . .

Chafe identifies three different kinds of intonation units. Substantive units tend to be roughly five words long on average and, as the term suggests, present the substance of one’s thought. Regulatory units are generally a word or so long (e.g. and then, maybe, mhm, oh, and so forth), and serve to regulate the flow of ideas, rather than to present their substance. Given these durations, a single line of poetry can readily encompass a substantive unit or both a substantive and a regulatory unit.

The third kind of unit, fragmentary, results when one of the other types is aborted in mid-execution. That is to say, one is always listening to one’s own speech and is never quite sure, at the outset of a phrase, whether or not one’s toss of the syntactic line will reel-in the right fish. If things do not go as intended, the phrase may be aborted. Fragments do not concern us, as we are dealing with a text that has been thought-out and, presumably, edited, rather than with free speech, which is what Chafe studied.

Chafe’s notion is consistent with an observation made initially by Ernst Pöppel. After reviewing studies by others and offering some of his own, Pöppel concluded that our awareness of the present extends roughly three to four seconds. That suggested that lines of poetry last no longer than that and that, where written lines appeared to take longer to read, they have a strong break in the middle. Working with a poet and critic, Frederick Turner, Pöppel found evidence for these notions in the poetry of several cultures, thus showing how versification technique deals with this constraint (cf. Turner and Pöppel 1983, Pöppel 1985, pp. 75-82).

I don't recall whether or not Whitman was included in the texts examined by Turner and Pöppel.

Peter said,

September 18, 2011 @ 9:21 pm

@Jerry Friedman: Hollander’s catalogue seems to be missing “Those which, from a distance, look like flies.”

Jerry Friedman said,

September 19, 2011 @ 12:18 am

@Peter: I think you mean swans.

(Dishonorable mention.)

Borges aside, maybe I should make it clear that the definitions are mine too. What Hollander did was define each one in a poem in the style he was defining (having done that earlier in the book with sonnets, villanelles, etc.)
