Lexical display rates in novels

In some ongoing research on linguistic features relating to clinical diagnosis and tracking, we've been looking at "lexical diversity". It's easy to measure the rate of vocabulary display — you can just use a type-token graph, which plots the count of distinct words ("types") against the count of total words ("tokens"). It's less obvious how to turn such a curve into a single number that can be compared across sources — for a survey of some alternative measures, see e.g. Scott Jarvis, "Short texts, best-fitting curves and new measures of lexical diversity", Language Testing 2002; and for the measure that we've settled on, see Michael Covington and Joe McFall, "Cutting the Gordian knot: The moving-average type–token ratio (MATTR)", Journal of Quantitative Linguistics 2010. More on that later.
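To make the two ideas concrete, here's a minimal Python sketch of both a type-token curve and MATTR. The lowercasing, whitespace-free tokens, and the default window of 500 are illustrative choices on my part, not settings taken from the cited papers:

```python
from collections import Counter

def type_token_curve(tokens):
    """Running count of distinct word types at each token position."""
    seen = set()
    curve = []
    for tok in tokens:
        seen.add(tok.lower())
        curve.append(len(seen))
    return curve

def mattr(tokens, window=500):
    """Moving-average type-token ratio: average the type/token ratio
    over every window of fixed length, sliding one token at a time."""
    tokens = [t.lower() for t in tokens]
    if len(tokens) < window:
        return len(set(tokens)) / len(tokens)
    counts = Counter(tokens[:window])          # types in the first window
    ratios = [len(counts) / window]
    for i in range(window, len(tokens)):
        out_tok, in_tok = tokens[i - window], tokens[i]
        counts[out_tok] -= 1                   # token leaving the window
        if counts[out_tok] == 0:
            del counts[out_tok]
        counts[in_tok] += 1                    # token entering the window
        ratios.append(len(counts) / window)
    return sum(ratios) / len(ratios)
```

Because every window has the same length, MATTR sidesteps the text-length dependence that makes a plain type-token ratio hard to compare across books.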

For now, I want to make a point that depends only on type-token graphs. Over time, I've accumulated a small private digital corpus of more than 100 English-language fiction titles, from Tristram Shandy forward to 2019. It's clear that different authors have different characteristic rates of vocabulary display, and for today's post, I want to present the authors in my collection with the highest and lowest characteristic rates.

Thomas Pynchon registers the highest rates, and Ernest Hemingway the lowest:

The unfolding of the type-token graph depends on the author's style, but also on changes in topic, which introduce new words. So it's worth comparing the same plots after randomly permuting each book's words:
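The permutation baseline is easy to reproduce. A sketch (the fixed seed and whitespace tokenization are my own illustrative choices):

```python
import random

def permuted_curve(tokens, seed=0):
    """Type-token curve after shuffling word order, which removes topic
    structure while preserving the overall word-frequency distribution."""
    shuffled = list(tokens)
    random.Random(seed).shuffle(shuffled)
    seen, curve = set(), []
    for tok in shuffled:
        seen.add(tok)
        curve.append(len(seen))
    return curve
```

The shuffled curve necessarily ends at the same total type count as the original; only the shape along the way changes.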

The type-token plots for the randomly-permuted texts are smoother, but otherwise similar. We can see a more articulated effect of differences due to changing style and topic by comparing Hemingway with the Bible:

Note: In analyzing the King James Bible, I removed the verse numbers, so that

01:001:001 In the beginning God created the heaven and the earth.

01:001:002 And the earth was without form, and void; and darkness was
upon the face of the deep. And the Spirit of God moved upon
the face of the waters.

01:001:003 And God said, Let there be light: and there was light.

01:001:004 And God saw the light, that it was good: and God divided the
light from the darkness.

01:001:005 And God called the light Day, and the darkness he called
Night. And the evening and the morning were the first day.

becomes

In the beginning God created the heaven and the earth.

And the earth was without form, and void; and darkness was
upon the face of the deep. And the Spirit of God moved upon
the face of the waters.

And God said, Let there be light: and there was light.

And God saw the light, that it was good: and God divided the
light from the darkness.

And God called the light Day, and the darkness he called
Night. And the evening and the morning were the first day.
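For anyone replicating the preprocessing: a regular expression like the one below handles verse references in the format shown above. The pattern is my reconstruction from the sample, not necessarily the script actually used:

```python
import re

# Matches a line-initial reference like "01:001:001 " (book:chapter:verse).
# Continuation lines, which lack a reference, are left untouched.
VERSE_REF = re.compile(r"^\d{2}:\d{3}:\d{3}\s*", flags=re.MULTILINE)

def strip_verse_numbers(text):
    return VERSE_REF.sub("", text)
```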

Some relevant previous posts:

"Word counts", 11/28/2006
"Britain's scientists risk becoming hypocritical laughing-stocks, research suggests", 12/16/2006
"Only 20 words for a third of what they say: A replication", 12/16/2006
"Cultural specificity and universal values", 12/22/2006
"Vicky Pollard's Revenge", 1/2/2007
"Ask Language Log: Comparing the vocabularies of different languages", 3/31/2008
"Betting on the poor boy: Whorf strikes back", 4/5/2009
"Nick Clegg and the Word Gap", 10/16/2010
"Lexical bling: Vocabulary display and social status", 11/20/2014
"Political vocabulary display", 9/10/2015
"Vocabulary display in the CNN debate", 9/18/2015
"Lexical limits?", 12/5/2015
"Why estimating vocabulary size by counting words is (nearly) impossible", 12/8/2015
"R2D2", 3/27/2016
"More political text analytics", 4/15/2016

1. Tim Leonard said,

April 18, 2020 @ 12:57 pm

You say "It's less obvious how to turn such a curve into a single number". How about using Zipf’s law? Sort the words by frequency, and use the slope of the line plotting log(rank order) vs. log(frequency).

[(myl) See the discussion in Scott Jarvis's paper cited above, or the longer discussion in Harald Baayen's book, Word Frequency Distributions. The key problem for all measures is how to scale across text lengths, especially for short texts. Lexical frequency estimates are notoriously problematic, even where we're fitting an overall model, as discussed in this earlier post, which includes this plot:

And for short texts — e.g. lengths of 50 to 500 words — word-type frequency estimates are nearly meaningless as a prediction of the true distribution, and quite unstable as a way of quantifying lexical diversity. ]
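[For readers who want to try Tim Leonard's suggestion: sort the word frequencies, then fit a least-squares line to log frequency against log rank. A minimal Python version — the tokenization and the ordinary least-squares fit are my own choices:

```python
import math
from collections import Counter

def zipf_slope(tokens):
    """Least-squares slope of log(frequency) vs. log(rank);
    a Zipf-like text gives a slope close to -1."""
    freqs = sorted(Counter(t.lower() for t in tokens).values(), reverse=True)
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var
```

As the reply above notes, for short texts the estimated frequencies — and hence this slope — are quite unstable.]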

2. ahkow said,

April 18, 2020 @ 1:13 pm

The comparison with the Bible is really interesting, since the Bible is a collection of works by different authors. I wonder what the type-token plots might look like at the next level down, such as Genesis vs. Psalms vs. Gospels. Setting aside complications due to transmission and translation, we should expect to see clearly different type-token profiles.

3. Not a naive speaker said,

April 18, 2020 @ 1:22 pm

Dear Mark,

For better readability it might make sense to assign each curve a number and print that number both next to the curve and beside the name in the legend. For me, the colors in the legend are very similar.

thank you

[(myl) I tried that, but found it difficult to place the numbers so that they didn't overlap and it remained clear which number applied to which plot. The order of items in the legend is the top-to-bottom order of the plotted lines, which may help.]

4. AntC said,

April 18, 2020 @ 7:03 pm

what @ahkow said.

How about contrasting the Books that have all those 'begat's with the more ethereal such as the Song of Songs?

Indeed is it representative to use the Bible at all? That's a translation (and the KJV is a pretty bad translation, in terms of retaining the language diversity in the original).

5. Rodger C said,

April 19, 2020 @ 12:07 pm

Genesis, I believe, has at least three major authors, or layers.