It seems that the length of Joseph Conrad's paragraphs, unlike the length of zebra finch song bouts, is well approximated by a two-state Markov process.
This starts with a post by Bill Benzon over at the Valve: "HD7: Digital Humanities Sandbox Goes to the Congo", 7/18/2011. Bill looked at the distribution of paragraph lengths in Joseph Conrad's Heart of Darkness, and was surprised to find a lawful-looking exponential distribution.
Bill's post reminded me of my recent post about the non-Markovian nature of zebra finch motif repetitions ("Finch linguistics", 7/13/2011). So I downloaded Heart of Darkness from Project Gutenberg, calculated and plotted a histogram of paragraph lengths, and fit the following trivial model:
1. Reach into your bag of words and pick one. Write it down.
2. Flip an unfair coin with probability P of coming up heads.
3. If the coin comes up heads, insert a paragraph boundary.
4. Repeat until you've written enough paragraphs.
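For anyone who wants to play along, the four steps above can be sketched in a few lines of Python. This is only an illustration, not the code I actually ran: the function name, the seed, and the example value P = 0.005 are all assumptions, and the "bag of words" is reduced to a word counter, since only the lengths matter here.

```python
import random


def simulate_paragraph_lengths(p, n_paragraphs, seed=None):
    """Simulate the coin-flip model and return a list of paragraph lengths.

    After every word, a paragraph boundary is inserted with probability p.
    """
    rng = random.Random(seed)
    lengths = []
    current = 0
    while len(lengths) < n_paragraphs:
        current += 1               # step 1: write down one more word
        if rng.random() < p:       # steps 2-3: heads means a paragraph boundary
            lengths.append(current)
            current = 0
    return lengths                 # step 4: stop once we have enough paragraphs


# Hypothetical example values, chosen only for illustration.
lengths = simulate_paragraph_lengths(p=0.005, n_paragraphs=200, seed=0)
print(len(lengths), sum(lengths) / len(lengths))
```

With P around 0.005 the average simulated paragraph should come out somewhere near 1/P = 200 words, give or take sampling noise.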
For a given choice of P, this predicts a certain distribution of paragraph lengths.
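To spell that out: under this model a paragraph has exactly k words with probability (1-P)^(k-1) * P, a geometric distribution, so the chance of a length between a and b inclusive is (1-P)^(a-1) - (1-P)^b. Here is a minimal sketch of the expected histogram counts, assuming 100-word bins; the function name and default arguments are mine, made up for illustration.

```python
def expected_bin_counts(p, n_paragraphs, bin_width=100, n_bins=16):
    """Expected number of paragraphs in each length bin under the coin-flip model.

    Bin i covers lengths i*bin_width + 1 through (i + 1)*bin_width, inclusive.
    """
    q = 1.0 - p
    counts = []
    for i in range(n_bins):
        lo = i * bin_width + 1
        hi = (i + 1) * bin_width
        # P(lo <= length <= hi) for a geometric distribution on 1, 2, 3, ...
        prob = q ** (lo - 1) - q ** hi
        counts.append(n_paragraphs * prob)
    return counts


# Hypothetical example values, chosen only for illustration.
for i, c in enumerate(expected_bin_counts(p=0.005, n_paragraphs=200)):
    print(f"{i * 100 + 1:5d}-{(i + 1) * 100:<5d} {c:6.2f}")
```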
In this case, "enough paragraphs" is 200 — the number I found in the Project Gutenberg version of HoD.
I compared the observed and expected counts of paragraphs of lengths 1-100, 101-200, … , 1501-1600, and chose P so as to minimize the sum of squared differences. To five decimal places, the best value for P is .00536:
According to the method of interocular trauma (what "strikes the eye"), this is a pretty good fit.
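A fit of this kind can be sketched as a brute-force grid search over P. This is a guess at the general shape of the procedure, not a record of what I ran: the grid step, the upper limit on P, and the function names are assumptions, and the "observed" counts in the example are synthetic ones generated from the model itself, not the real Heart of Darkness histogram.

```python
def expected_bin_counts(p, n_paragraphs, bin_width=100, n_bins=16):
    """Expected count per length bin (1-100, 101-200, ...) under the coin-flip model."""
    q = 1.0 - p
    return [
        n_paragraphs * (q ** (i * bin_width) - q ** ((i + 1) * bin_width))
        for i in range(n_bins)
    ]


def fit_p_least_squares(observed, n_paragraphs, grid_step=1e-5, p_max=0.02):
    """Return the grid value of P that minimizes the sum of squared differences."""
    best_p, best_sse = None, float("inf")
    n_steps = int(round(p_max / grid_step))
    for k in range(1, n_steps + 1):
        p = k * grid_step
        expected = expected_bin_counts(p, n_paragraphs, n_bins=len(observed))
        sse = sum((o - e) ** 2 for o, e in zip(observed, expected))
        if sse < best_sse:
            best_p, best_sse = p, sse
    return best_p


# Synthetic "observed" counts from the model with P = 0.005 (NOT the real data).
observed = expected_bin_counts(0.005, 200)
print(fit_p_least_squares(observed, n_paragraphs=200))
```

A grid search is crude, but with one parameter and sixteen bins it is fast, and there is no risk of a fancier optimizer quietly stopping in the wrong place.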
What does it mean? I don't know. Is this an instance of a well-known phenomenon? Don't know that either.
But if you have data, models, references, etc., Bill Benzon wants to hear from you.
Update — Cosma Shalizi writes:
Because the holding-time distribution is geometric in a discrete-time Markov chain, one can use the exact MLE here, which is 1/mean(paragraph length) = 0.005192217. I attach a histogram of the actual distribution, the fitted geometric (solid blue) and the best-fitting Gaussian (which, as you say here, would indicate some interesting temporal structure).
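The estimator Cosma describes needs no search at all: it is just the number of paragraphs divided by the total number of words. A minimal sketch, where the function name and the toy list of lengths are made up for illustration:

```python
def geometric_mle(lengths):
    """Maximum-likelihood estimate of P for geometric lengths on 1, 2, 3, ...

    Equals 1 / mean(lengths), i.e. (number of paragraphs) / (total words).
    """
    if not lengths:
        raise ValueError("need at least one paragraph length")
    return len(lengths) / sum(lengths)


# Toy lengths, not real data: the mean is 200 words.
print(geometric_mle([100, 200, 300]))
```

Read backwards, his value of 0.005192217 corresponds to a mean paragraph length of about 192.6 words.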
Update #2 — Nostromo has a similar distribution, though the longest of its 2184 paragraphs is only 859 words:
And if you look at the distribution of paragraph lengths across the book, the >200-word paragraphs tend to cluster one after another in a few regions:
Someone who likes Conrad could probably tell us what's going on in those long-paragraph sections, for example around paragraph 600:
This does seem to be a real effect, and not a bug or artefact of Project Gutenberg's edition of this work, or of the cumulative ten minutes of processing and interpretation that I've devoted to its analysis. The start of the cluster of over-long paragraphs in the previous graph corresponds exactly to the start of Book Two, for example.
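For those who would rather count than squint at a graph, here is one rough way to locate such clusters: list the maximal runs of consecutive paragraphs longer than some threshold. The 200-word threshold comes from the text above; the function name and the toy input are assumptions for illustration only.

```python
def long_runs(lengths, threshold=200):
    """Return (start_index, run_length) for each maximal run of long paragraphs.

    A paragraph counts as "long" if it has more than `threshold` words.
    Indices are 0-based positions in the sequence of paragraphs.
    """
    runs = []
    start = None
    for i, n in enumerate(lengths):
        if n > threshold:
            if start is None:
                start = i          # a new run of long paragraphs begins here
        elif start is not None:
            runs.append((start, i - start))
            start = None
    if start is not None:          # close a run that reaches the end of the book
        runs.append((start, len(lengths) - start))
    return runs


# Toy lengths, not real data.
print(long_runs([50, 250, 300, 10, 400]))
```

Under the coin-flip model the lengths of successive paragraphs are independent, so long runs of long paragraphs should be rare; many such runs would suggest the kind of clustering seen in the graph.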