Language Log

Markov's Heart of Darkness

July 18, 2011 @ 3:34 pm · Filed by Mark Liberman under Computational linguistics

It seems that the length of Joseph Conrad's paragraphs — unlike the length of zebra finch song bouts — is well approximated by a two-state markov process.

This starts with a post by Bill Benzon over at the Valve: "HD7: Digital Humanities Sandbox Goes to the Congo", 7/18/2011. Bill looked at the distribution of paragraph lengths in Joseph Conrad's Heart of Darkness, and was surprised to find a lawful-looking exponential distribution.

Bill's post reminded me of my recent post about the non-markovian nature of zebra finch motif repetitions ("Finch linguistics", 7/13/2011). So I downloaded Heart of Darkness from Project Gutenberg, calculated and plotted a histogram of paragraph lengths, and fit the following trivial model:

1. Reach into your bag of words and pick one. Write it down.
2. Flip an unfair coin with probability P of coming up heads.
3. If the coin comes up heads, insert a paragraph boundary.
4. Repeat until you've written enough paragraphs.

For a given choice of P, this predicts a certain distribution of paragraph lengths.
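The four steps above can be sketched in a few lines of Python. This is only an illustration of the model, not the code used for the plots in this post; the function names, the seed, and the choice to count the word before each boundary as part of the paragraph are all assumptions. Under the model, the chance that a paragraph is exactly n words long is (1-P)^(n-1) P, which is what `geometric_pmf` computes.

```python
import random


def simulate_paragraph_lengths(p, n_paragraphs, seed=0):
    """Simulate the coin-flip model and return a list of paragraph lengths.

    After every word, a paragraph boundary is inserted with probability p.
    (Hypothetical helper for illustration; the words themselves are ignored.)
    """
    rng = random.Random(seed)
    lengths = []
    current = 0
    while len(lengths) < n_paragraphs:
        current += 1              # step 1: write down one word
        if rng.random() < p:      # steps 2-3: heads means a paragraph boundary
            lengths.append(current)
            current = 0
    return lengths


def geometric_pmf(n, p):
    """Probability that a paragraph is exactly n words long under the model."""
    return (1 - p) ** (n - 1) * p
```

With a small P, most simulated paragraphs are short and a few are very long, and the average length should come out close to 1/P.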

In this case, "enough paragraphs" is 200 — the number I found in the Project Gutenberg version of HoD.

I compared the observed and expected counts of paragraphs of lengths 1-100, 101-200, … , 1501-1600, and chose P so as to minimize the sum of squared differences. To five digits, the best value for P is .00536:

According to the method of interocular trauma (what "strikes the eye"), this is a pretty good fit.
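For readers who want to try the same kind of fit, here is a rough sketch of the binned least-squares procedure described above. It assumes you already have a list of paragraph lengths in words; the grid search, its step size, and the upper limit `p_max` are my own choices for illustration and are not necessarily how the value above was found.

```python
def binned_counts(lengths, bin_width=100, n_bins=16):
    """Count paragraphs in bins 1-100, 101-200, ..., ignoring any beyond the last bin."""
    counts = [0] * n_bins
    for n in lengths:
        k = (n - 1) // bin_width
        if 0 <= k < n_bins:
            counts[k] += 1
    return counts


def expected_counts(p, total, bin_width=100, n_bins=16):
    """Expected bin counts for `total` paragraphs with geometric lengths.

    Uses P(length > n) = (1 - p)**n, so each bin is a difference of two tails.
    """
    q = 1.0 - p
    return [total * (q ** (k * bin_width) - q ** ((k + 1) * bin_width))
            for k in range(n_bins)]


def fit_p_least_squares(lengths, p_max=0.05, step=1e-5):
    """Grid-search p in (0, p_max] to minimize the sum of squared count differences."""
    observed = binned_counts(lengths)
    total = len(lengths)
    best_p, best_err = None, float("inf")
    for i in range(1, int(round(p_max / step)) + 1):
        p = i * step
        expected = expected_counts(p, total)
        err = sum((o - e) ** 2 for o, e in zip(observed, expected))
        if err < best_err:
            best_p, best_err = p, err
    return best_p
```

A finer grid or a proper one-dimensional optimizer would do the same job; the grid is used here only because it is easy to check by eye.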

What does it mean? I don't know. Is this an instance of a well-known phenomenon? Don't know that either.

But if you have data, models, references, etc., Bill Benzon wants to hear from you.

Update — Cosma Shalizi writes:

Because the holding-time distribution is geometric in a discrete-time Markov chain, one can use the exact MLE here, which is 1/mean(paragraph length) = 0.005192217. I attach a histogram of the actual distribution, the fitted geometric (solid blue) and the best-fitting Gaussian (which as you say here would indicate some interesting temporal structure).
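Cosma's estimate is easy to reproduce in outline. The sketch below assumes a plain-text file in which paragraphs are separated by blank lines and words by whitespace; that is my guess at a reasonable convention for a Project Gutenberg text, not a description of the exact preprocessing used here, and both function names are made up for the example.

```python
import re


def paragraph_lengths(text):
    """Return the word count of each paragraph, splitting paragraphs on blank lines."""
    chunks = re.split(r"\n\s*\n", text)
    return [len(chunk.split()) for chunk in chunks if chunk.split()]


def geometric_mle(lengths):
    """Maximum-likelihood estimate of p for geometric lengths: 1 / mean(length)."""
    return len(lengths) / sum(lengths)
```

For example, `geometric_mle(paragraph_lengths(text))` gives the estimate for a text already read into the string `text`. Front matter, chapter headings and license boilerplate would need to be stripped first, or they will be counted as short paragraphs.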

Update #2 — Nostromo has a similar distribution, though the longest of its 2184 paragraphs is only 859 words:

Update #3 — The distribution of paragraph lengths in Henry James' The Golden Bowl seems to be exponential out to about 200 words, where something else takes over:

And if you look at the distribution of paragraph lengths across the book, the >200-word paragraphs tend to cluster one after another in a few regions:

Someone who likes Henry James could probably tell us what's going on in those long-paragraph sections, for example around paragraph 600:

This does seem to be a real effect, and not a bug or artefact of Project Gutenberg's edition of this work, or of the cumulative ten minutes of processing and interpretation that I've devoted to its analysis. The start of the cluster of over-long paragraphs in the previous graph corresponds exactly to the start of Book Two, for example.


46 Comments

Dan Davies Brackett said,

July 18, 2011 @ 3:45 pm

To what extent is this attributable to Conrad's publisher's editorial staff rather than Conrad? I don't know to what extent editors change the paragraphing of works, but it seems to me a necessary aspect of the job.
Ross Presser said,

July 18, 2011 @ 4:01 pm

Conrad wrote a paragraph 1500 words long? That seems … excessive.
JL said,

July 18, 2011 @ 4:20 pm

Dan Davies Brackett: that was my first question, too — especially since the book was serialized in a magazine before it was published. If you look at the original MS, here, http://beinecke.library.yale.edu/dl_crosscollex/SetsSearchExecXC.asp?srchtype=ITEM, you can see that only some of the paragraphing was Conrad's own: he tended to write in long, undivided sections, without breaking for, e.g., the start of a new quotation. So the result, interesting though it is, may say more about the conventions of magazines in 1899 than about Conrad's prose style.
Zythophile said,

July 18, 2011 @ 4:39 pm

A 1500-word paragraph? The horror! The horror!

(Well, someone was bound to say it eventually, so I thought it might as well be me …)
dl said,

July 18, 2011 @ 5:05 pm

It looks like a two-exponential fit would be significantly better.
Bill Benzon said,

July 18, 2011 @ 5:46 pm

Thanks Mark.

@Ross Presser & Zythophile: Not only that, but two 1100-word paragraphs.

@Dan Davies Brackett & JL: That’s a complicated matter on which I’m not particularly knowledgeable. Conrad wrote longhand, his wife produced a typescript, Conrad checked it, and that went to the editors. The story first appeared in Blackwood’s Magazine, but was reprinted later, with Conrad having a hand in the reprints. Mark and I are both working from the Gutenberg version, which contains no information that I could see about where it came from. I’ve not compared that text with the one published in the Norton Critical Edition, which also contains interesting commentary on the text and its variations. The commentary is mostly about punctuation and spelling and says nothing about paragraphing.

Such intuitions as I have on the matter tell me that, if there IS something lawful and systematic here, it involves more than just that distribution. It also involves: 1) where paragraphs of various lengths are placed in the text from beginning to end, and 2) what's in various paragraphs and their relationship to the whole story.

In a previous post on HoD I’d argued that the longest paragraph, the 1500-word one, is at the structural center of the text. That structural center does not straddle the midpoint as defined by word count, but is somewhat after that, and just before the final third, again by word count. One of the two 1100-word paragraphs is before the structural center and the other is after it. Both are closer to the structural center than either is to its respective end.

Part 1 of 2.
Bill Benzon said,

July 18, 2011 @ 5:48 pm

(continued from previous post)

Why is that long paragraph the structural center? Well, if you really want to know, you should read my earlier post. I’m arguing that the structure of HoD is similar to the ring-forms that the late Mary Douglas explored in Thinking in Circles. It’s a point of, shall we say, semantic integration, pulling things together from the whole text. And that’s a mighty funny thing to have in this position in a long string of words, because, at the time you read it, you’ve only read the front end of the text, the back end is still ahead of you. What sense does it make to integrate material you've not read?

And, wouldn’t you know, there’s a section in this 1500-word paragraph that jumps the temporal gun, giving us important information about something that happens later than the point where we’re currently at in the story. That temporal displacement’s what brought this paragraph to my attention. It’s only after I’d noticed that that I noticed its extreme length.

This is the only such temporal displacement in the text.

And so forth and so on. I’d appreciate any leads anyone can give me.
A said,

July 18, 2011 @ 6:02 pm

"method of interocular trauma"
!
I've never heard this phrase before, but it's fantastic!
Brett said,

July 18, 2011 @ 6:14 pm

What I've always wanted to know about the paragraph breaks in Heart of Darkness is why they suddenly become much more "normal" in the last scene. In the meeting with Kurtz's fiancee, the paragraphs are brief, and there are breaks between speakers' turns. It's actually quite jarring after getting used to Conrad's usually much more dense prose.

The only theory I've heard of why is that it's supposed to suggest a very stilted conversation and so to enhance the contrast between Kurtz's European betrothed and the African woman who mourns his departure from the inner station.
Bill Benzon said,

July 18, 2011 @ 6:56 pm

Very good question, Brett. It IS jarring, and that's the longest run of short paragraphs in the text (as you can see in my third chart). Other than the fact that it is a conversation, the longest in the text, I'm not sure what's going on. That whole conversation contains fewer words (1257) than the single longest paragraph.
Bill Benzon said,

July 18, 2011 @ 7:31 pm

Well, a thought about that last conversation. There are other conversations in the text, but they’re mostly with men. In fact, they may all be with men except for a remark by a secretary early in the story.

I’ve not reviewed all the conversations. But at least some of them are completely unmarked by paragraphing. The back and forth takes place entirely within a single paragraph.

So maybe we have the paragraphing at the end to separate male from female. Marlow made a BIG DEAL about that at the beginning of that longest paragraph.

I just took a quick run through the text. Perhaps that’s what’s going on. The male-male conversations seem to be pretty much internal to paragraphs.

Curiouser and curiouser.
Q. Pheevr said,

July 18, 2011 @ 7:55 pm

Well, at the very least, I think this is strong evidence that Joseph Conrad was not a zebra finch.
Sybil said,

July 18, 2011 @ 10:40 pm

As soon as I read "Bill looked at the distribution of paragraph lengths in Joseph Conrad's Heart of Darkness" I thought "exponential distribution", but I can't come up with a coherent explanation for that thought.

I'm mostly curious now to know whether this is something peculiar to Conrad (or, rather, that varies among writers).

Very interesting.
Jerry Friedman said,

July 18, 2011 @ 11:19 pm

@Bill Benzon: On your blog you wrote, "For reasons having to do with a psychology we don’t understand, the nexus had to be toward the center of the book."

Has this sort of thing been studied psychologically? I can imagine having a large number of people who haven't read HoD reading either it or another version with apparently critical parts of the nexus changed. Would they rate its literary power differently? More practically, I can imagine having people compare the original version of a poem with a version that has been changed to lose something that critics have pointed to as important. Would people prefer the original version? Would it matter whether they saw the interpretation or analysis of that feature, and if so, would it matter whether they noticed it themselves, with or without a hint, or it was pointed out to them?

This last question is prompted by Nabokov's Lectures on Literature, in which he seems to assume that if readers notice some subtle repetition of a motif, they will enjoy a book more. I would have assumed that any effect that repetition has would be unconscious, only on readers who don't notice it. Can these questions be settled empirically?
JL said,

July 18, 2011 @ 11:45 pm

As one who writes novels for a living, allow me to chime in here. There are a few things I might be able to clarify for you — though in a way I'd prefer to think of it as muddying an analysis that's perhaps a little clearer than it should be. And please forgive me if I sound a bit prickly or pedantic.

First off, please bear in mind that any narrative is more or less organically made, and while it'll be possible to find all sorts of patterns in it, circles and spirals or what have you, that's going to be a critical superimposition upon what, for the author, is almost certainly an unconscious, or at least, less rule-bound process. Poems, to some degree, lend themselves to this sort of quantitative scrutiny. Novels don't — except bad ones. Indeed, your invocation of Douglas and Rings and all that, while it's clever, seems to obscure more than it reveals. It would be easier, it seems to me, to simply say that HoD is structured as a classic frame tale (a form which goes back to Plato, comes up through the 1001 Nights – which is vastly more complicated – and, for example, the Decameron, and appears again, not just with Conrad but with The Turn of the Screw, which was published just the year before). In Conrad's case, he includes a sort of set piece within the inner story frame, which is not a further embedded frame, strictly speaking, but which is both climactic and temporally complicated. That's about all you need to say: adding Circles and other such Joseph Campbell-type garnishes doesn't really improve the taste.

And counting the words in sentences is only going to produce the sort of semi-scientific, accidental isomorphism that has led, for example, so many Straussians astray. Throw enough bell curves and algorithms and what-not at a piece of writing, and a few of them are bound to stick, but I don't think they add much beyond what a more casual account provides. It's like thinking God must have been a mathematician because he made gravity obey the inverse square law. Jeremy Bernstein has a very good essay on the dangers of this sort of thing, the name of which I'm afraid escapes me now. My point is that these are not at all "lawful and systematic": art almost never is. They are merely intuitive, regardless of whatever formula you may have found to describe them.

Secondly, it's not at all surprising that a book should reach its greatest narrative complexity about two-thirds of the way through. That's a natural point, because it gives readers enough background to build upon, and a little time to decompress afterwards. By then, a reader should be in your hands; you work your magic, and then gently release them back into the real world.

Again, one very common way of signalling that point is by cutting down on your paragraph breaks. It has a hypnotic effect, and it also means that the reader has no natural place to stop and put your story down. It creates a swell, sometimes an overwhelming one, like the fourth movement in Beethoven's Pastoral Symphony. Proust uses the same sort of effect as the (slightly off-)centerpiece of Sodom and Gomorrah, producing a single sentence several thousand words long, which is both enormously complicated and almost impossible to stop reading. It's not the only way to do things, but it's pretty natural. (For contrast, try a writer like Kleist, or Thomas Bernhard, who can modulate a reader's attention using few if any paragraph breaks at all.)

Moreover, it's conventional to start a new paragraph when a new speaker is speaking, but it's by no means obligatory. Many writers (myself included) will sometimes save such an interruption for voices which are distinctly different than the ones that preceded it, whether they be female, changing the subject, differently accented, or what have you. When you're writing a novel, you're very much concerned with manipulating the reader's attention: it's one of your most powerful tools. Many writers will break all sorts of conventions to do so, without resorting to, or even falling into, neat formulae, universal archetypes, structural analyses, and the like. We generally leave that sort of thing to Procrustean critics.
Bill Benzon said,

July 19, 2011 @ 6:00 am

@Jerry Friedman: A fascinating set of questions. Perhaps I'll say more later. For now, as far as I know, no one's tried such an experiment with this text. But I do believe such things have been done. You should take a look at David Miall, Literary Reading: Empirical and Theoretical Studies (Peter Lang, 2007). He occasionally blogs at OnFiction, which frequently reports on empirical work about literature.

@JL, with all due respect, I’m sure you’re aware that what you’ve said has been said many times and many ways (thanks Mel!) for years and years and years. And years. I’ve been fully aware of that stance for well over three decades and have long since taken it into account.

Mel Torme’s “Christmas Song” isn’t going to loose its magic just because someone’s running Nat King Cole’s version through some instrumentation and examining a sound spectrogram on the back end. Heart of Darkness won’t loose its power just because I’ve counted the number of words in each paragraph.

As far as I’m concerned your stance, time honored though it may be, is a sophisticated form of anti-intellectualism that’s as pernicious in this sphere of inquiry – culture, the arts, the human mind – as creationism and climate denialism are in their spheres. You might want to ask yourself whether readerly attention follows laws known intuitively by writers (who are, after all, readers themselves) or whether it's just coin flipping all the way down.
JL said,

July 19, 2011 @ 6:31 am

Well, OK, as long as we're being frank, I suppose I can be more terse: the question isn't whether you're causing HoD to "loose (sic) its magic". The question is whether you're adding anything at all to anyone's understanding of the book, or whether, instead, you're indulging in puffery and pseudoscience, bringing in spurious mathematical models and childish anthropological "theories" to gaudy up a banal and perfectly obvious account of a very well known phenomenon. (Look! I've discovered that the ratio of vowels to consonants in 'The Waste Land' is exactly equivalent to Planck's Constant! Which is used to explain black holes! See? It really is a depressing poem…)
William Allison said,

July 19, 2011 @ 6:59 am

As best I can make out, the zebra finch song does conform to a Markov process. The song of the Bengalese finch is more complex.
http://www.technologyreview.com/blog/arxiv/26025/

Perhaps Natural Language Processing is a good subject area in which to find more information germane to this discussion (see comments by "Late to the Party" in the blog indicated above).

[(myl) It's true that Bengalese finches sing more complex songs. However, with respect to the simple question of the distribution of motif-counts in zebra-finch song bouts, your best understanding in this case is mistaken, as explained in adequate detail here:

Interestingly, though the syntax of zebra-finch song bouts is extremely simple — to a first approximation, it's just a variable number of motif repetitions — it's trivial to prove that a finite-state (markovian) grammar is not adequate to describe it.

After producing a motif, our hypothetical markov process will be in a state where it must decide whether to stop or to produce another motif. By the markovian assumption ("history is bunk"), the probability of stopping will always be the same, regardless of how many motifs have been produced in the current bout.

Let P_stop denote this probability of stopping after a given motif. Then the probability of producing a sequence of length 1 will be P_stop, the probability of producing a sequence of length 2 will be (1-P_stop)P_stop, and the probability of a sequence of length N will be (1-P_stop)^(N-1) P_stop. This implies that the modal number of motif repetitions will always be 1, and that the relative frequency of higher numbers of repetitions will fall off exponentially, at a rate determined by P_stop.

But zebra finches don't sing that way.

See the cited post for the rest of the demonstration. The work cited in the Tech Review blog post is interesting and worthwhile, but irrelevant to this point.]
Bill Benzon said,

July 19, 2011 @ 7:38 am

It depends on what you mean by "understanding" of the book. Do I think anyone needs to know that sort of thing in order to enjoy or appreciate the book. No, not at all. That's NOT why I'm doing it. Is the phenomenon well-known and understood?

Well, we've known about salt long before chemical theory told us that it was sodium chloride. Does that make chemistry all puffery and pseudoscience?
ENKI-][ said,

July 19, 2011 @ 8:45 am

I have a sneaking suspicion that many works would have similar patterns in paragraph length. I've done analyses on real data (which is to say Gutenberg ebooks) of things like markov model probability shifts per word and found fairly stable patterns (in the case of markov model probability shifts per word, I had a stairstep pattern where the width and height of the stairs were themselves subject to a stairstep pattern, whose width and height appeared to be the result of an exponential by the ocular trauma method). I suspect that structured texts like novels have other elements that are equally as structured as the apparent elements (just as a markov chain can make a convincing Lovecraft paragraph, we can make convincing Lovecraft paragraph breaks and convincing Lovecraft word consistency change patterns).
JL said,

July 19, 2011 @ 9:03 am

Mihi a docto Doctore
Domandatur causam et rationem quare
Opium facit dormire:
À quoi respondeo,
Quia est in eo
Virtus dormitiva…
Lucy Kemnitzer said,

July 19, 2011 @ 10:35 am

I'm delighted to see investigations of the mechanical details of writing. Where a writer places paragraph breaks is an interesting topic, but this doesn't address that. I'm not sure what this investigation tells us about writing, at all (although the curve is beautiful). It seems to say only that shorter paragraphs are impressively more common than longer ones, even in Conrad's writing. But didn't we know that?

What I would find much more interesting is to look at the length of paragraphs by their position in the narrative, as some of the comments have done. It could be done crudely by how close the paragraphs are to the end, or more finely by identifying certain events in the story and counting how close the paragraphs are to the nearest one (and whether they are before or after). It could be done even more finely by relating the content of the paragraphs to the events preceding or coming after them, as well.

My guess is that, if this is undertaken with several works by several authors, strong patterns will emerge for some writers and not for others. And if it is undertaken with a very large number of works by a very large number of authors, a weak pattern will emerge overall.
Bill Benzon said,

July 19, 2011 @ 12:01 pm

@Lucy Kemnitzer: Yes, paragraph position is important, as is qualitative analysis of what’s going on in the paragraphs. The third chart in my original post shows the length of paragraphs as a function of their order in the text. You can see that the three longest paragraphs are toward the center, and you can see the string of short conversational paragraphs at the end. I’ve also written a post in which I break that long structurally central paragraph into 18 chunks and comment on them individually, with a few comments here and there on how this paragraph is related to the rest of the text.

And, yes, this kind of work needs to be done with other texts and other authors. Without such work we don’t really know what we’re dealing with. In this case I’m looking at three phenomena in a single text: 1) distribution of paragraph lengths by size, 2) distribution of paragraph lengths by position in the text, 3) qualitative analysis of the single longest paragraph which argues that it is structurally central. Are these phenomena independent of one another so that their co-presence in this text is just a fortuitous accident? Or is their co-occurrence a necessary consequence of the underlying causal mechanisms?

I don’t know.

[(myl) The arrangement of paragraph lengths in Nostromo looks pretty random — or do you think the long paragraphs roughly at the 1/5 and 4/5 points are important?]
Bill Benzon said,

July 19, 2011 @ 1:21 pm

Yeah, that looks pretty random to me. Can't say that I'm surprised or disappointed.
JL said,

July 19, 2011 @ 6:05 pm

Sorry, let me see if I have this straight.

1. You read Heart of Darkness for the first time about two weeks ago. (This is not clear to me, but it seems to be what you're saying on your website).

2. Your exposure to the secondary literature on the book consists of a Norton Critical Edition and a single monograph. Never mind the other thousands of papers and books written on Conrad.

3. You're uncertain who divided the book into paragraphs, Conrad, his wife, his magazine editor, or his book editor.

4. You ran a statistical analysis of paragraph length that showed that it matched a specific form of randomness.

5. You didn't notice, and were surprised to discover, that paragraph breaks seem to appear more frequently when the conversation includes a woman.

6. You didn't bother to run such an analysis on any other book or story by Conrad, any other 19th century work, any other serialized fiction, any other works written in any other language, or for that matter, by any other authors writing in English who were raised speaking another language. Etc. When someone else does run such an analysis — on one other Conrad work — which seems to contradict your theory, your response is that you're not "surprised or disappointed".

7. You nevertheless intuit that there's something "systematic and lawful" going on here — indeed, that "readerly attention follows laws", laws which you yourself have discovered (or has the stuff on Nostromo convinced you to give up that position?)

I'm not going to comment on this. I'm just wondering if I have my facts right here.
Uneasy Rider « finding.my.name said,

July 19, 2011 @ 7:44 pm

[…] trauma" for "what strikes the eye"; I found this phrase while reading the linguistics blog, Language Log. On a related note, test your vocabulary! It's a mostly-scientific survey, and from what I can see, […]
Xmun said,

July 19, 2011 @ 9:33 pm

It seems to me there's been enough clavicular trauma (hammering of the keyboard) on this subject by now.
Jerry Friedman said,

July 19, 2011 @ 11:15 pm

@Bill Benzon: Thanks for the suggestion. That's the sort of thing I was looking for, though I didn't see anything about the kind of experiment I mentioned (maybe for good reason).
Bill Benzon said,

July 19, 2011 @ 11:40 pm

@JL

1. Yes.

2. Not quite. No monographs, but essays in both the Norton Critical and the Bedford/St. Martin's casebook. I'm also generally knowledgeable about academic literary criticism (I have a Ph.D. in English lit) and know generally what kinds of work are done. Looking at distributions of paragraph length isn't something literary scholars routinely do, if at all. One reason for contacting Language Log is that I figured that linguists might have done that sort of thing, especially corpus linguists.

3. Yes.

4. "specific form of randomness"?

5. Instead of "more frequently" it seems to be more like "almost exclusively".

6. I'm a long way from having anything remotely close to a "theory." The question is whether I have anything worth noticing. While I think that I do, prudence dictates that the question is still open. Maybe yes, maybe no. If yes: What is it? The reason for posting at three blogs, for putting notices on two listservs, and for enlisting Language Log, is to find out if anyone knows anything about paragraph distributions. So far the answer is: not much.

Mark Liberman has now run an analysis on The Golden Bowl and contacted me about it privately. It displays a different pattern. No surprise there. What kinds of patterns do we generally see for distribution of paragraph lengths?

7. I'm assuming that readerly attention does follow laws. Why do I assume that? More or less as a consequence of a more general assumption that we live in a lawful universe. What are those laws? I don't know. Perhaps attending to paragraph lengths will tell us something about them. Not everything. Just something.
Ken Brown said,

July 20, 2011 @ 5:59 am

pwned!
Ken Brown said,

July 20, 2011 @ 6:00 am

That should have been

Q: 1. You read Heart of Darkness for the first time about two weeks ago. (This is not clear to me, but it seems to be what you're saying on your website).
A: 1. Yes.

Q: 3. You're uncertain who divided the book into paragraphs, Conrad, his wife, his magazine editor, or his book editor.
A: 3. Yes.

= "pwned"!
JL said,

July 20, 2011 @ 8:42 am

Ah, Bill, good: I think we're getting at the source of our disagreement here, which I think I can sum up with one, possibly overstated (though I don't think so) point:

There are no psychological laws.

There are no cognitive laws.

There are no behavioural laws.

Not one, so far as I know; not even close; not in the sense that most scientifically-minded people use 'law' — that is, very roughly, as a statement which is considered to be flawlessly predictive in the appropriate context. Human cognition is simply not, as they say, nomological — possibly as a matter of principle, possibly as a matter of our ignorance. And if it's the latter, I doubt (though 'doubt' is really too weak a word) that a phenomenon as complicated as reading is going to yield laws, or anything like them, by way of a statistical analysis of paragraph length in a single Conrad novel. Nor by way of a thousand such analyses of a thousand texts. So either you're using 'law' in a very non-standard way, or you're being extremely hyperbolic, or (and this seems to be most likely) you're simply barking up the wrong tree.

[(myl) Simply as a matter of common usage, the history of psychology is full of things called "laws": the Weber-Fechner Law, Stevens' Power Law, Hebb's Law, Fitts' Law, Hick's Law, and so on.

There are also lots of behavioral or cognitive "laws" in historical linguistics — Grimm's Law, Verner's Law, Grassmann's Law, etc. (Hermann Grassmann also has a psychophysical law named after him…)

And of course there's Zipf's Law, and the range of similar observations (about cities, companies, etc., rather than words) that are sometimes referred to as instances of Pareto's Law.

This usage corresponds pretty well, in my opinion, to "the sense [in which] most scientifically-minded people use '[scientific] law'" — a regular and reliable relationship among phenomena, whether derived from observation or deduced from theory or both.

There are thousands of other well-documented law-like relationships among psychological, cognitive, or behavioral phenomena, which happen not to be called "laws" — I think simply because the fashion for such nomenclature faded at some point in the 20th century — for example Saul Sternberg's classic short-term memory results.

Are you ignorant of this history, or are you just trolling?]
JL said,

July 20, 2011 @ 8:54 am

(Actually, that was three points, I guess.)

And, yes, "specific form of randomness" — that's exactly what I meant. Stochastic processes come in various shapes and colors. You didn't think that every random pattern was random in the same way, did you? — This getting worser and worser…
Bill Benzon said,

July 20, 2011 @ 9:20 am

@JL: "This getting worser and worser…"

Sure is.
Bill Benzon said,

July 20, 2011 @ 10:11 am

Up above JL began:

"First off, please bear in mind that any narrative is more or less organically made, and while it'll be possible to find all sorts of patterns in it, circles and spirals or what have you, that's going to be a critical superimposition upon what, for the author, is almost certainly an unconscious, or at least, less rule-bound process. Poems, to some degree, lend themselves to this sort of quantitative scrutiny. Novels don't — except bad ones."

That comparison with poetry got me thinking.

Poetry is very much about manipulating the physical substance of language, rhyme and meter, and scads of other sound patterns, many of which have Greek names, etc. And we’ve got scads of verse forms, which are listed in handbooks, etc. What’s the parallel phenomenon for prose fiction? Where are the lists of ways and forms of language manipulation in prose fiction?

They don’t exist. We distinguish between novels, novellas and short stories. And we talk about style, and analyze it in various ways, including statistics – statistical stylistics is a fairly well-developed discipline. But we don’t have lists of devices and forms. Maybe, as JL pretty much said, they don’t exist.

And maybe we just haven’t known how to look for them.

What I’m thinking is that those patterns of paragraph-length distribution are to prose fiction what patterns of, say, line length and rhyme are to poetry. It’s the basic physical stuff the writer is manipulating in the course of creating patterns of verbal meaning.

So, in Heart of Darkness we have one pattern, by which I mean BOTH the distribution by size and the distribution by temporal order. Nostromo exhibits a different pattern from HoD; it has a similar size distribution but a different time distribution. The Golden Bowl has still another pattern. How many such patterns are there? What are they like?

Further, it’s clear to me that each chapter needs to be examined individually. The first chapter is the only one that starts from nothing; the last chapter is the only one that ends in nothing. The inner chapters all have to pick up a story in progress, move it forward, and then leave it unfinished. Does that yield different patterns of paragraph length? Don’t know. Have to check.

And so forth.
JL said,

July 20, 2011 @ 10:13 am

@myl, I'm neither ignorant of the history, nor trolling. If I'm not mistaken — and I may be — all of the psychological laws you referred to, with the possible exception of Hick's, are not so much 'psychological' in the sense of trafficking in content — in beliefs and desires — as they are neuro-physiological. And I can't tell if Hick's Law is well-confirmed. Moreover, my impression is that even economists, working in what may be the most confirmable of the social sciences, are generally wary of calling things laws, except in a semi-casual or metaphorical way, and if pressed, might admit that 'law' isn't really what they mean. I don't believe too many people consider "the law of supply and demand" to be on equal footing with the laws of motion. And that's with something as quantifiable as money. When the subject is aesthetic effect, I think we're getting awfully far from what anyone would consider law-like phenomena.

[(myl) No, the psychological laws (and the historical-linguistics laws) that I cited are claims about behavioral dispositions or behavioral facts, not about either neuro-physiology or beliefs and desires. And anyhow, we're not talking about beliefs and desires here, we're talking about paragraph lengths. Why should a "lawful" distribution of paragraph lengths be any more problematic than a lawful distribution of word frequencies, or city sizes, or whatever?

Furthermore, there's a rather old tradition to the effect that belief ought to follow laws of logic — and there's a tradition almost as old, that tries to characterize in lawful terms the deviations of human belief from logical norms.]

I mean, it's called 'Murphy's Law', too — and I suppose you could argue that that sort of usage is sufficient to redefine what we mean by 'law'. But when Benzon speaks of this sort of thing as being potentially "systematic and lawful", it certainly seems to me that he means something more rigorous by it.

[(myl) The rigor of Zipf's Law, for example, is quite comparable to the rigor of Kepler's Laws, or Boyle's Gas Law, or whatever. Murphy's Law is, of course, a joke that depends on pretending that something is more systematic and lawful than it really is. But Zipf's Law and Grimm's Law are not jokes, they're "laws" in the same traditional sense as the laws of Boyle and Kepler.]
JL said,

July 20, 2011 @ 10:30 am

Let me rephrase this more simply. The assertion that there are no laws in the social sciences is by no means unique to me. You, as a linguist, may take the position that if it's called a law, it is a law, or at least a candidate for one, and that's a reasonable way to look at it, though not one I share. But Benzon seems to be looking for something stronger than a 'regular and reliable relationship among phenomena' (and roughly how regular and reliable, by the way?), and even if he's looking for something less strictly law-like, I doubt very much that he's going to find it, and I'm pretty confident that one pass on one text doesn't begin to provide enough data to even broach the possibility responsibly. The rest is hubris and hot-dogging.

[(myl) You seem to be shifting ground here: you've moved from claiming that there are no psychological, cognitive or behavioral laws to the claim that there are "no laws in the social sciences", which is quite a different assertion (though I think it's equally false). And it's not very convincing support for an apparently groundless position to note that "it's by no means unique" to you — this argues in favor of flat earth and hollow earth theories, the theory that everything is made up of four elements, etc.

As for "law" in the sense used to describe regularities observed in the course of rational investigation, how should we determine what the word means other than by examining how it's been used over the past couple of centuries? (As far as I know, the earliest use of "law" in this sense in the humanities was "Porson's Law", which first appeared in 1797.)

You may have in mind some different concept, which is genuinely lacking in all psychological, cognitive, behavioral, economic, and other human-related phenomena. But you can't simply grab an existing word and pretend that it means what you want it to mean — at least you have to say "when I write 'law', I don't mean it in the usual sense, I mean X". I doubt that you can do this in any coherent or interesting way that systematically differentiates human-related phenomena from everything else, but you've given me no opportunity to figure this out one way or the other. ]
Bill Benzon said,

July 20, 2011 @ 11:36 am

Yes, I am LOOKING for "something stronger than a 'regular and reliable relationship among phenomena'." What I've GOT right now is some very interesting clues on where and how to look for a regular and reliable relationship that hasn't been investigated yet.

This is getting tiresome. You're making misleading and exaggerated claims about what I'm doing (and not just me at this point) and then taking potshots at those exaggerated claims. It's a troll's game.
JL said,

July 20, 2011 @ 1:16 pm

This is the last I'll comment on this. Yes, you can count words in paragraphs to your heart's content, and treating them as raw data you may well discover law-like patterns. But that strikes me as a disingenuous gambit: if it's true, it's trivial. It's certainly not going to yield anything remotely law-like about Conrad, as such, or about novels, or fiction as opposed to non-fiction, and so on. The very best you could hope to get is a little extra data that might help in identifying the author of, say, an anonymous sonnet, and even that would fall well short of certainty. If that's all you're after, have at it. As soon as you try to apply that data to things like an author's intent or a reader's response, you're necessarily going to bring in beliefs and desires. At that point, the idea of law-like statements becomes, at the very least, considerably more problematic, if not absurd. I mean, come on; even taking 'laws' in its widest sense, we don't have useful laws for why people choose Coke over Pepsi (if we did, one of them would be out of business), let alone how a work of literature gets written or read (based on one flawed analysis of one book).

Mark, you may be talking about paragraph lengths; but he's talking about literature. (And, I might add, at this point you've already done twice as much data-crunching as he has. It's like watching Tom Sawyer convince his friends to paint the fence for him.) Benzon, you talk a lot about possibilities, clues, and so on. Tell you what: get some real data for me — say, 100,000 books? — and we'll talk. Maybe. Until then, it's like claiming you've found some interesting clues about true love by counting the buttons on your girlfriend's shirt.

As for my being a "troll": no, this is simply something I care about. We're having a conversation: we're disagreeing. And I'm not exaggerating your claims, I'm merely pointing out how flawed your own exaggerated claims are. You wrote: "I'm assuming that readerly attention does follow laws. Why do I assume that? More or less as a consequence of a more general assumption that we live in a lawful universe." Unless you think 'readerly attention' and the 'universe' are similar kinds of phenomena (and they may be, but we won't know that until neuroscience gets much, much, much better than it is now), your assumption is utterly ungrounded.
Bill Benzon said,

July 20, 2011 @ 1:32 pm

First, a further clarification. If we’re looking at anything at all, I don’t think it would be like Zipf’s Law, which characterizes a distribution you get for every text. I think we’d get a variety of patterns. After all, we’ve already done three texts and we have three patterns. But they’re not wildly different. They’re all recognizably – by the method of interocular trauma – in the same universe.

So, let’s say we were to do a pilot study of, say, 30 texts. How should we choose our texts?

The idea is to sample the space so we can figure out if we should make a serious commitment and do, say, 1000 or 10,000 texts. We’d also like to learn something that would be useful in doing that larger study.

We’ve already begun work on three texts, Heart of Darkness, Nostromo, and The Golden Bowl. We need 27 more. We could do a random draw from, say, Project Gutenberg’s list. But I think there’s a more useful way to “sample the space” of possible patterns. Here’s what I’d do:

1) Eight more Joseph Conrad texts.
2) Use The Golden Bowl and pick nine more texts from 1900 plus or minus 25 years. Let’s get a mix of American and British, male and female, high art and popular.
3) Five chosen because they’d be interesting and “different” from the above, say: The King James Bible, Tristram Shandy, Pride and Prejudice, Ulysses, Naked Lunch.
4) Five at random.

The first set will give us a feel for the range of patterns in a set of texts that are “constrained” to the capabilities of a single author. The second set gives us a selection of distinctly different authors that are, however, from the same time period as the first. The last two sets go wider still.
Bill Benzon said,

July 21, 2011 @ 9:37 am

@dl or anyone who knows: “two exponential fit”?

What I’m thinking is that there’s more than one process for generating a paragraph. So, process A tends to produce short paragraphs and process B tends to produce long paragraphs. The overall distribution of paragraph lengths, then, reflects the interaction of these two processes.

I suppose to make that work you’d need to specify the distributions produced by each process independently and then the relative contributions of the two processes. Which strikes me as being messy.
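
One way to read the two-process idea is as a mixture of two coin-flip (geometric) models like the one in the original post: each paragraph is assigned to process A or process B, and each process has its own per-word probability of ending the paragraph. The sketch below is only an illustration under that assumption. The function names and the parameter values (`p_short`, `p_long`, `weight_short`) are hypothetical and are not fitted to Heart of Darkness or any other text.

```python
import random


def sample_paragraph_length(p_short, p_long, weight_short, rng):
    """Draw one paragraph length (in words) from a two-process mixture.

    With probability `weight_short` the paragraph comes from process A
    (boundary probability `p_short` after each word); otherwise it comes
    from process B (boundary probability `p_long`).
    """
    p = p_short if rng.random() < weight_short else p_long
    length = 1
    # Keep adding words until the unfair coin comes up heads.
    while rng.random() >= p:
        length += 1
    return length


def simulate(n_paragraphs, p_short=0.05, p_long=0.003, weight_short=0.5, seed=0):
    """Simulate `n_paragraphs` lengths; all default values are illustrative."""
    rng = random.Random(seed)
    return [
        sample_paragraph_length(p_short, p_long, weight_short, rng)
        for _ in range(n_paragraphs)
    ]


def mixture_mean(p_short, p_long, weight_short):
    """Expected paragraph length: weighted average of the two geometric means."""
    return weight_short / p_short + (1 - weight_short) / p_long


if __name__ == "__main__":
    lengths = simulate(200)
    print("simulated mean:", sum(lengths) / len(lengths))
    print("theoretical mean:", mixture_mean(0.05, 0.003, 0.5))
```

Even in this simplest form the model has three free parameters (two boundary probabilities and a mixing weight) against one for the single-coin model, which is the messiness mentioned above.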
ENKI-][ said,

July 22, 2011 @ 11:14 am

So, I tried to replicate this on the Project Gutenberg etexts I had lying around (apologies for the G+ link, to whoever can't access it). Certainly not replicating the pattern in Heart of Darkness reliably, but interesting results nevertheless. The part of the filename between 'pg' and the first non-numeric character is the etext ID, which you can look up in Gutenberg's index.
ENKI-][ said,

July 22, 2011 @ 11:15 am

Whoops. The link got pulled out. Here it is: https://plus.google.com/_/notifications/ngemlink?&emid=CIDLrNbDkKoCFUqA7AodU9shMA&path=%2Fphotos%2F107123055092301586940%2Falbum%2F5631488078459628145%3Fgpinv%3DAGXbFGyWUZOe1xP_MxrIIR648E-XbOtJFT-vLiydx0SnNIej42i3_hS3Z6qPHoI87jj_eMVp4OuKpat2oz2LbnRDgtjV3ug00WHV7hVGXzIe51jF0hB_toM%26hl%3Den_US
Bill Benzon said,

July 22, 2011 @ 4:16 pm

Thanks for that!

Now we know that we have lots of different patterns.

James Cutting is a psychologist at Cornell who's interested in film. He's analyzed shot length in film in a fairly sophisticated way. I've got some PDFs I can send to anyone who's interested or you can go to his site and start downloading.
Bill Benzon said,

July 26, 2011 @ 4:21 am

There's more, though of a different kind:

http://www.thevalve.org/go/valve/article/conrads_special_k_periodicity_in_heart_of_darkness/

This time I simply plotted the appearance of the word "Kurtz" against 'time' (that is, against position in the text) and found out that it appears in 'bursts' that are periodic. "Kurtz" is, of course, the name of the central character. Moreover, Kurtz and Marlow (the main narrator) are the only characters known by name. All others are designated by their role or function, e.g. the manager, the Intended.
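
For anyone who wants to try the same kind of plot on another text, here is a minimal sketch of how the positions could be extracted. It is an assumption about the procedure, not a description of how the linked analysis was actually done. The helper names `word_positions` and `gaps` are hypothetical, and the tokenizer is deliberately crude: it splits on anything that is not a letter, so a possessive like "Kurtz's" counts as an occurrence of "Kurtz" followed by a separate "s".

```python
import re


def word_positions(text, target):
    """Return the 0-based word index of every occurrence of `target`."""
    words = re.findall(r"[A-Za-z]+", text)
    wanted = target.lower()
    return [i for i, word in enumerate(words) if word.lower() == wanted]


def gaps(positions):
    """Distances, in words, between consecutive occurrences."""
    return [later - earlier for earlier, later in zip(positions, positions[1:])]


if __name__ == "__main__":
    sample = "Kurtz spoke. The manager waited. Kurtz's voice fell. Then Kurtz left."
    positions = word_positions(sample, "Kurtz")
    print(positions)
    print(gaps(positions))
```

Plotting the positions along a line, or a histogram of the gaps, gives a first look at whether occurrences cluster into bursts. Whether those bursts are periodic would need a separate test.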
JL said,

September 2, 2011 @ 12:38 pm

Oh, for Christ's sake, Mark — you absolutely don't know what you're talking about and you're making stuff up. Even Wikipedia has a clearer grasp of what a 'law' is than you do. http://en.wikipedia.org/wiki/Physical_law.
