Embedding depth


In "Trends" (3/27/2022) I compared the distributions of sentence lengths in Ernest Hemingway's A Moveable Feast and Ursula K. Le Guin's The Wave in the Mind. The background, and some of the conclusions, can be found in the slides for my SHEL12 presentation. Hemingway is known for his short and simple sentences (see e.g. "Homo Hemingwayensis", 1/9/2005, for some discussion), but as I showed, his average sentence length is actually a bit on the long side for his time. And his overall distribution of sentence lengths is essentially identical to that found in (later) work by Ursula K. Le Guin, despite her hilarious discussion of an alleged difference in her 1992 essay "Introducing Myself":

But in that same presentation, I discussed the traditional distinction between hypotaxis (= syntactic subordination) and parataxis (= stringing things together) — which raises the possibility that Hemingway's notorious "short sentences" are really not so much short as paratactic.

So I installed the Berkeley Neural Parser, and used it to parse all of A Moveable Feast and The Wave in the Mind. I then mapped the parser output to xml, used the lxml library to extract the paths from the sentence root to each leaf token, and counted each token's embedding depth.

To illustrate the process, here are the results for a short sentence from A Moveable Feast (with the xml paths translated into /-separated strings for easier reading):

The women drunkards were called poivrottes which meant female rummies.

(S (NP (DT The) (NNS women) (NNS drunkards)) 
(VP (VBD were) (VP (VBN called) (S (NP (NNS poivrottes))) 
(SBAR (WHNP (WDT which)) (S (VP (VBD meant) 
(NP (JJ female) (NNS rummies))))))) (. .))

/S/NP/DT The
/S/NP/NNS women
/S/NP/NNS drunkards
/S/VP/VBD were
/S/VP/VP/VBN called
/S/VP/VP/S/NP/NNS poivrottes

3 The
3 women
3 drunkards
3 were
4 called
6 poivrottes
6 which
7 meant
8 female
8 rummies
2 .

(Of course, the parser is not always right, and a different syntactic theory would also yield different numbers, but this should do to go on with…)
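
The path and depth listings above can be reproduced with a few lines of Python. This is only a sketch: instead of the XML-plus-lxml route described earlier, it walks the bracketed parse directly with a stack, and the function name `leaf_paths` is made up for illustration. Depth here is the number of node labels on the path from the root to the token, part-of-speech tag included, which is what the numbers above imply.

```python
import re
from typing import List, Tuple

# The bracketed parse of the example sentence, copied from the post.
PARSE = """
(S (NP (DT The) (NNS women) (NNS drunkards))
(VP (VBD were) (VP (VBN called) (S (NP (NNS poivrottes)))
(SBAR (WHNP (WDT which)) (S (VP (VBD meant)
(NP (JJ female) (NNS rummies))))))) (. .))
"""


def leaf_paths(bracketed: str) -> List[Tuple[List[str], str]]:
    """Return (labels-from-root, token) for each leaf of a bracketed parse.

    Hypothetical helper: assumes well-formed Penn-style brackets in which
    every "(" is immediately followed by a node label.
    """
    pieces = re.findall(r"\(|\)|[^\s()]+", bracketed)
    stack: List[str] = []  # labels of the constituents currently open
    leaves: List[Tuple[List[str], str]] = []
    i = 0
    while i < len(pieces):
        piece = pieces[i]
        if piece == "(":
            stack.append(pieces[i + 1])  # the label follows the open bracket
            i += 2
        elif piece == ")":
            stack.pop()
            i += 1
        else:
            leaves.append((list(stack), piece))  # a word under its POS tag
            i += 1
    return leaves


if __name__ == "__main__":
    for labels, word in leaf_paths(PARSE):
        # Depth = number of labels on the path, e.g. /S/NP/DT has depth 3.
        print(len(labels), "/" + "/".join(labels), word)
```

The stack-based reader was chosen only to keep the example free of external dependencies; an XML conversion followed by path extraction should yield the same paths.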

As it turns out, the overall distributions of embedding depths between those works are not strikingly different:

Hemingway has more tokens at depth=2, and also more at depths from 16 to 29 — which Le Guin makes up for at depths 4 to 7. (More later on where these small differences come from…)
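
Since the books differ in length, a comparison like this presumably rests on proportions of tokens at each depth rather than raw counts. The post does not say how the counts were normalized, so the following is a guess at the simplest version, using a hypothetical helper `depth_distribution` and the depths of the example sentence as input.

```python
from collections import Counter
from typing import Dict, Iterable


def depth_distribution(depths: Iterable[int]) -> Dict[int, float]:
    """Map each embedding depth to the proportion of tokens found at it.

    Assumption: plain relative frequency, with no smoothing or binning.
    """
    counts = Counter(depths)
    total = sum(counts.values())
    return {depth: counts[depth] / total for depth in sorted(counts)}


if __name__ == "__main__":
    # Depths of the eleven tokens in the poivrottes sentence from the post.
    example = [3, 3, 3, 3, 4, 6, 6, 7, 8, 8, 2]
    for depth, share in depth_distribution(example).items():
        print(f"depth {depth}: {share:.3f}")
```

Run over every token of a book, the resulting proportions can be set side by side for two authors regardless of how many tokens each text contains.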

But to see what a more genuinely hypotactic style would look like, let's add Charles Dickens' American Notes, which has fewer tokens at depths 2-9, and strikingly more at depths 10-30:

If we look only at the depth of clausal embedding (i.e. how many S-nodes separate each token from the root, ignoring all other non-terminals), we see a similar result, with Le Guin and Hemingway nearly identical and Dickens significantly more hypotactic:
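
One way to compute such a clausal depth is sketched below. Two details are assumptions rather than anything stated in the post: the root S is not counted, and by default only nodes labeled exactly S count (SBAR can be added through the `clause_labels` parameter). The function name `clausal_depths` is likewise hypothetical.

```python
import re
from typing import List, Sequence, Tuple

# The bracketed parse of the example sentence, copied from the post.
PARSE = """
(S (NP (DT The) (NNS women) (NNS drunkards))
(VP (VBD were) (VP (VBN called) (S (NP (NNS poivrottes)))
(SBAR (WHNP (WDT which)) (S (VP (VBD meant)
(NP (JJ female) (NNS rummies))))))) (. .))
"""


def clausal_depths(
    bracketed: str, clause_labels: Sequence[str] = ("S",)
) -> List[Tuple[str, int]]:
    """Return (token, number of clause nodes between the root and the token).

    Assumes well-formed brackets; the root node itself is excluded.
    """
    pieces = re.findall(r"\(|\)|[^\s()]+", bracketed)
    stack: List[str] = []  # labels of the constituents currently open
    result: List[Tuple[str, int]] = []
    i = 0
    while i < len(pieces):
        piece = pieces[i]
        if piece == "(":
            stack.append(pieces[i + 1])  # the label follows the open bracket
            i += 2
        elif piece == ")":
            stack.pop()
            i += 1
        else:
            # Skip stack[0] (the root) and count the remaining clause nodes.
            depth = sum(1 for label in stack[1:] if label in clause_labels)
            result.append((piece, depth))
            i += 1
    return result


if __name__ == "__main__":
    for word, depth in clausal_depths(PARSE):
        print(depth, word)
```

With the default setting, "which" sits at clausal depth 0 because its only clause-like ancestor below the root is an SBAR; passing `clause_labels=("S", "SBAR")` moves it to depth 1, so the choice of labels matters for the absolute numbers.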

And of course, as discussed in the earlier post, Dickens has a similarly large difference in distribution of sentence lengths:

That's all I have time for this morning — more later on some broader historical data.


  1. hatsu! said,

    November 28, 2022 @ 9:13 am

    great job! I love how you explained this topic with graphs

  2. Jerry Packard said,

    November 28, 2022 @ 5:55 pm

    Fascinating. I would imagine that if we could do these measurements on spoken language over time, we might find less embedding now, say, than 50 years ago. My intuition is that humans are moving in the direction of more parataxis and less hypotaxis in their speech production.

  3. Bill Benzon said,

    November 28, 2022 @ 9:12 pm

    FWIW, I've had my doubts about the standard line on Hemingway's style, though not much in the way of evidence to back up those doubts.

    However, some years ago I undertook to make some informal comments about two rather long sentences (131 and 170 words) in his book of essays about bullfighting, Death in the Afternoon. I argued that, in these two sentences, each describing a bullfighter's motions, the syntax was such as to imitate – loosely speaking – those motions. You can find the analysis in this blog post, where I quote George Lakoff as asserting that, in some investigation (but not involving Hemingway's prose) Teenie Matlock "demonstrated that subjects actually traced metaphorical fictive motion sentences (as in 'The road runs along the cliffs above the ocean') in real time via mental simulation."

  4. Bloix said,

    November 29, 2022 @ 10:16 pm

    Here are a few sentences from the opening to A Moveable Feast –
    the same paragraph that contains the sentence you quoted
    ("The women drunkards were called poivrottes which meant female rummies.")

    –Then there was the bad weather. It would come in one day when the fall
    was over. We would have to shut the windows in the night against the
    rain and the cold wind would strip the leaves from the trees in the
    Place Contrescarpe. The leaves lay sodden in the rain and the wind
    drove the rain against the big green autobus at the terminal
    and the Café des Amateurs was crowded and the windows misted
    over from the heat and the smoke inside.

    And this is the opening to A Farewell to Arms:

    — In the late summer of that year we lived in a house in a
    village that looked across the river and the plain to the
    mountains. In the bed of the river there were pebbles and
    boulders, dry and white in the sun, and the water was
    clear and swiftly moving and blue in the channels. Troops
    went by the house and down the road and the dust they
    raised powdered the leaves of the trees. The trunks of the
    trees too were dusty and the leaves fell early that year and
    we saw the troops marching along the road and the dust
    rising and leaves, stirred by the breeze, falling and the
    soldiers marching and afterwards the road bare and white
    except for the leaves.

    Any reasonably sophisticated reader who somehow had never read Hemingway would know that they were from the hand of the same author. If the characteristics of these sentences that unite them and separate them from Le Guin and Dickens aren't revealed by your graphs then you need some more graphs.

    PS- this is Cormac McCarthy, from All the Pretty Horses:

    –In his sleep he could hear the horses stepping among the rocks and he could hear them drink from the shallow pools in the dark where the rocks lay smooth and rectilinear as the stones of ancient ruins and the water from their muzzles dripped and rang like water dripping in a well and in his sleep he dreamt of horses and the horses in his dream moved gravely among the tilted stones like horses come upon an antique site where some ordering of the world had failed and if anything had been written on the stones the weathers had taken it away again and the horses were wary and moved with great circumspection carrying in their blood as they did the recollection of this and other places where horses once had been and would be again.

    Our sophisticated reader would recognize this as perhaps by the same author as the first two examples on a bad day, or perhaps by someone imitating that author. It's a very long sentence but it's nothing like Le Guin or Dickens, is it?

  5. Mark Liberman said,

    November 30, 2022 @ 6:46 am

    @Bloix: Style is multi-dimensional, and the subjective impressions of "sophisticated readers" are often wrong, at least in the factors that they identify as responsible for those impressions.

    My first observation was that one such widespread attribution is wrong, namely that Hemingway's style is characterized by short sentences. That left open the possibility that his style is characterized by more paratactic sentence structure — but the current post undermines this variant hypothesis.

    So your comment misses the point, which was not to understand Hemingway, but rather to understand the observations of many "sophisticated readers" (including James Thurber as well as Ursula K. Le Guin) who have attributed their perceptions of his style to his "short sentences".

    In the cited SHEL12 presentation, I offered one dimension that does seem to characterize Hemingway's style in contrast to that of other 20th-century authors, namely lexical diversity:

    There are many other ways to explore writing style quantitatively — do you have some to suggest, which might give us more insight into the stylistic differences among writers, Hemingway included? In particular, it's still somewhat mysterious why people think his sentences are short. Could perception of sentence length be influenced by lexical diversity? Maybe, but I suspect more is going on than that.

  6. Philip Anderson said,

    November 30, 2022 @ 8:26 am

    I’m not that sophisticated a reader, but to me your final example feels very different from Hemingway: it’s sprinkled with similes and interpretation and long Latinate words.

  7. Mark Liberman said,

    November 30, 2022 @ 9:04 am

    @Philip Anderson: Standard quantitative textual analysis includes features like word frequency, word length, age of acquisition, concreteness, and so on. Etymological source is not as commonly used, but probably should be. Quantification of metaphor, metonymy, simile, allusion, etc. is harder but is relevant where possible.

    But what "sophisticated readers" say about Hemingway is not that he avoids similes and Latinate vocabulary. And he doesn't — at least he writes things like "a face fresh as a newly minted coin", "hair black as a crow's wing", "succulent texture", etc.

    The relative frequencies of such things might be different in Hemingway compared to other writers, though I don't know where anyone has investigated that. Whatever the outcome, it doesn't seem to help us understand the issue of sentence-length perception.

  8. Bloix said,

    November 30, 2022 @ 10:33 am

    "misses the point." Well, we both have points. Your point is that the usual explanations for Hemingway's style don't hold up to statistical analysis and my point is that nonetheless he had a unique and identifiable style. So what are the characteristics of that style? Surely there's a quantifiable answer. You've asked what characteristics I might suggest. One, as many have noted – is the use of "and" to link what appear to be unrelated sentences, in a way that grade school teachers have been marking as wrong for generations, but which in Hemingway carry unstated meaning. Another is the sparing use of commas. A more difficult characteristic to quantify is, it seems to me, his abjuration of the free indirect style, which since Austen had been the hallmark of literary fiction but whose absence in Hemingway's hands seems paradoxically to give the reader a more direct – less authorially mediated – connection to the characters.

    I do think that Hemingway's style changed over time, and I suspect that his novels and other book-length writing differ somewhat from the early stories. It's likely that experts have written about this. In my wholly inexpert opinion, the collections In Our Time (1925) and Men Without Women (1927) seem to be full of very short sentences, many of them in dialogue, in which what is most important is what is not said. In his long-form work, I believe, his style became less laconic. We tend to privilege the novels, perhaps because nowadays short stories are out of fashion, but the stories established the public opinion of his style.

    A Moveable Feast was written in the 1950s but left incomplete, and published only after his death. I tend to doubt that it's a reliable example of the style that made him famous in the 1920s and 30s.

  9. Jerry Packard said,

    November 30, 2022 @ 7:19 pm

    Thank you for the Cormac McCarthy post. A quick read reveals mostly parataxis and virtually no hypotaxis.
