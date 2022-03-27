

About six weeks from now, I'm scheduled to give a (virtual) talk with the (provisional) title "Historical trends in English sentence length and syntactic complexity". The (provisional) abstract:

It's easy to perceive clear historical trends in the length of sentences and the depth of clausal embedding in published English text. And those perceptions can easily be verified quantitatively. Or can they? Perhaps the title should be "Historical trends in English punctuation practices", or "Historical trends in English conjunctions and discourse markers." The answer depends on several prior questions: What is a sentence? What is the boundary between syntactic structure and discourse structure? How is message structure encoded in speech (spontaneous or rehearsed) versus in text? This presentation will survey the issues, look at some data, and suggest some answers — or at least some fruitful directions for future work.

So I've started the "look at some data" part, so far mostly by extending some of the many relevant earlier LLOG Breakfast Experiment™ explorations, such as "Inaugural embedding", 9/9/2005, or "Real trends in word and sentence length", 10/31/2011, or "More Flesch-Kincaid grade-level nonsense", 10/23/2015.

In most cases, the extensions just provide more data to support the ideas in the earlier posts. But sometimes, further investigation turns up some twists.



For example, in "Death before syntax?", 10/20/2014, I quoted from Ursula K. Le Guin's essay "Introducing Myself" (as published in The Wave in the Mind, 2004). Here's the passage I quoted, and a bit more besides:

What it comes down to, I guess, is that I am just not manly. Like Ernest Hemingway was manly. The beard and the guns and the wives and the little short sentences. I do try. I have this sort of beardoid thing that keeps trying to grow, nine or ten hairs on my chin, sometimes even more; but what do I do with the hairs? I tweak them out. Would a man do that? Men don’t tweak. Men shave. Anyhow white men shave, being hairy, and I have even less choice about being white or not than I do about being a man or not. I am white whether I like being white or not. The doctors can do nothing for me. But I do my best not to be white, I guess, under the circumstances, since I don’t shave. I tweak. But it doesn’t mean anything because I don’t really have a real beard that amounts to anything. And I don’t have a gun and I don’t have even one wife and my sentences tend to go on and on and on, with all this syntax in them. Ernest Hemingway would have died rather than have syntax. Or semicolons. I use a whole lot of half-assed semicolons; there was one of them just now; that was a semicolon after “semicolons,” and another one after “now.”

And another thing. Ernest Hemingway would have died rather than get old. And he did. He shot himself. A short sentence. Anything rather than a long sentence, a life sentence. Death sentences are short and very, very manly. Life sentences aren’t. They go on and on, all full of syntax and qualifying clauses and confusing references and getting old. And that brings up the real proof of what a mess I have made of being a man: I am not even young. Just about the time they finally started inventing women, I started getting old. And I went right on doing it. Shamelessly. I have allowed myself to get old and haven’t done one single thing about it, with a gun or anything.

And in that post, I used some short samples from Le Guin and from Hemingway to suggest that their styles were less different than she suggests, at least in terms of superficial features like sentence length and semicolon usage. (Of course, her essay is not really about prose style, but never mind that for now…)

So this morning, I compared those superficial features on a larger sample — all of Le Guin's essay collection The Wave in the Mind, and all of Hemingway's 1964 memoir A Moveable Feast. At that scale, Le Guin is right about semicolon usage rates for her vs. Hemingway:

| Source | Semicolons | Total Characters | Semicolons per 100k Characters |
|---|---|---|---|
| The Wave in the Mind | 411 | 520,607 | 78.95 |
| A Moveable Feast | 58 | 319,654 | 18.14 |
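The rate in the last column is just the semicolon count divided by the total character count, scaled to 100,000 characters. Here is a minimal sketch of that arithmetic in Python. The function names are hypothetical, and the choice to count every character (whitespace included) is an assumption, not necessarily how the counts above were produced.

```python
def rate_per_100k(count, total_chars):
    """Scale a raw count to a rate per 100,000 characters."""
    if total_chars <= 0:
        raise ValueError("total_chars must be positive")
    return 100_000 * count / total_chars


def semicolons_per_100k(text):
    """Semicolons per 100,000 characters of a plain-text string.

    Assumption: every character counts toward the total,
    including spaces and newlines.
    """
    return rate_per_100k(text.count(";"), len(text))


# Reproduce the two rates from the counts in the table.
print(round(rate_per_100k(411, 520_607), 2))  # The Wave in the Mind
print(round(rate_per_100k(58, 319_654), 2))   # A Moveable Feast
```

Applied to a whole book, `semicolons_per_100k` would be called on the text read from a file; differences in formatting between editions would shift the character totals a little, but not the rates by much.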

But it probably won't surprise you to learn that gender is not a good predictor of semicolon usage rate. There's a general historical tendency towards decreased rates, but if there's a correlation with gender, it's going to be fairly small and we'll need quite a bit of data to determine whether it even exists:

| Source | Date | Semicolons | Total Chars | Semicolons/100k Chars |
|---|---|---|---|---|
| Pamela | 1740 | 4,676 | 1,126,913 | 414.94 |
| Decline and Fall of the Roman Empire | 1788 | 39,907 | 12,936,452 | 308.48 |
| Camilla | 1796 | 5,786 | 1,975,887 | 292.83 |
| Pride and Prejudice | 1813 | 1,538 | 680,359 | 226.06 |
| American Notes | 1842 | 1,464 | 579,209 | 252.76 |
| Little Men | 1871 | 925 | 548,683 | 168.59 |
| Middlemarch | 1872 | 1,874 | 1,761,476 | 106.39 |
| Tom Sawyer | 1876 | 642 | 379,164 | 169.32 |
| The River War | 1899 | 629 | 738,261 | 85.20 |
| The Wonderful Wizard of Oz | 1900 | 194 | 202,966 | 95.58 |
| The Great Gatsby | 1925 | 60 | 262,323 | 22.87 |
| To the Lighthouse | 1927 | 941 | 381,272 | 246.81 |
| Murder Must Advertise | 1933 | 241 | 639,852 | 37.66 |
| Murder on the Orient Express | 1934 | 20 | 338,879 | 5.90 |
| V. | 1963 | 761 | 1,028,507 | 73.99 |
| A Moveable Feast | 1964 | 58 | 319,654 | 18.14 |
| Oryx and Crake | 2003 | 362 | 597,829 | 60.55 |
| The Wave in the Mind | 2004 | 411 | 520,607 | 78.95 |

(Titles in red have female authors; those in blue have male authors. Editorial as well as authorial fashions have probably played a role in the history. And my versions of the texts have various sources and in some cases different formatting styles, but the counts and relative frequencies would not be changed much by regularization.)

There are hundreds more texts and numbers, which I'll spare you for now.

One relevant point, though, is that semicolons are not very "syntactic". Instead, they're usually paratactic. Semicolons can concatenate a sequence of sentences that might very well be separated by periods; they can mark the junctures of a series of conjoined phrases; and they can set off appositive phrases. Le Guin's example is the sentence-concatenating kind:

I use a whole lot of half-assed semicolons;

there was one of them just now;

that was a semicolon after “semicolons,” and another one after “now.”

Virginia Woolf's To the Lighthouse stands out in the table above as especially semicolonic for its date of publication — and the examples in that novel are in most cases also paratactic. Here are the first two semicolonized sentences:

Such were the extremes of emotion that Mr. Ramsay excited in his children's breasts by his mere presence; standing, as now, lean as a knife, narrow as the blade of one, grinning sarcastically, not only with the pleasure of disillusioning his son and casting ridicule upon his wife, who was ten thousand times better in every way than he was (James thought), but also with some secret conceit at his own accuracy of judgement.

He was incapable of untruth; never tampered with a fact; never altered a disagreeable word to suit the pleasure or convenience of any mortal being, least of all of his own children, who, sprung from his loins, should be aware from childhood that life is difficult; facts uncompromising; and the passage to that fabled land where our brightest hopes are extinguished, our frail barks founder in darkness (here Mr. Ramsay would straighten his back and narrow his little blue eyes upon the horizon), one that needs, above all, courage, truth, and the power to endure.

Here's an example from Letter I in Pamela:

God bless him! and pray with me, my dear father and mother, for a blessing upon him, for he has given mourning and a year's wages to all my lady's servants; and I having no wages as yet, my lady having said she should do for me as I deserved, ordered the housekeeper to give me mourning with the rest; and gave me with his own hand four golden guineas, and some silver, which were in my old lady's pocket when she died; and said, if I was a good girl, and faithful and diligent, he would be a friend to me, for his mother's sake.

That style certainly makes for long sentences, but the resulting depth of embedding is not generally very great. (Though the syntactic treatment of discourse-structure relations is a wild card here, as always — see e.g. "Parataxis in Pirahã", 5/19/2006.)

And for another (more offensively stereotyped) take on the gender associations of semicolons, there's Kurt Vonnegut in A Man Without a Country:

Here is a lesson in creative writing. First rule: Do not use semicolons. They are transvestite hermaphrodites representing absolutely nothing. All they do is show you've been to college.

How about Le Guin's joking references to sentence length? In The Wave in the Mind, the mean sentence length is 17.18 words, and the median is 13 words. In Hemingway's A Moveable Feast, the mean sentence length is 16.78 words, and the median is 12 words. A plot of sentence-length quantiles in the two books shows remarkable agreement throughout:

Just to show that things can sometimes be different, let's add Charles Dickens' American Notes to the plot:
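For readers who want to try this kind of comparison on their own texts, here is a rough sketch of how sentence-length means, medians and quantiles might be computed. It is not the script behind the numbers above. The splitting rule (a sentence ends at ".", "!" or "?" followed by whitespace) is a crude assumption, and the function names and the `split_on_semicolons` option are hypothetical. The option is there because the answer to "What is a sentence?" changes the counts.

```python
import re
import statistics


def sentence_lengths(text, split_on_semicolons=False):
    """Return the length in words of each 'sentence' in text.

    Assumption: a sentence ends at '.', '!' or '?' followed by
    whitespace. With split_on_semicolons=True, ';' also counts as
    a sentence boundary.
    """
    enders = ".!?;" if split_on_semicolons else ".!?"
    pattern = r"(?<=[" + re.escape(enders) + r"])\s+"
    pieces = re.split(pattern, text.strip())
    lengths = [len(piece.split()) for piece in pieces]
    return [n for n in lengths if n > 0]


def summarize(lengths):
    """Mean, median and decile cut points of a list of lengths."""
    return {
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
        "deciles": statistics.quantiles(lengths, n=10),
    }


sample = "He shot himself. A short sentence. Anything rather than a long sentence, a life sentence."
print(summarize(sentence_lengths(sample)))
```

Comparing two books would then amount to plotting the quantiles from one against the quantiles from the other. Abbreviations like "Mr." and quoted dialogue will trip up a splitter this simple, so any serious count needs something more careful.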

That's all I have time for today, but (some of) the loose ends will be followed up later…

