2020-04-18

In some ongoing research on linguistic features relating to clinical diagnosis and tracking, we've been looking at "lexical diversity". It's easy to measure the rate of vocabulary display: you can just use a type-token graph, which plots the count of distinct words ("types") against the count of total words ("tokens"). It's less obvious how to turn such a curve into a single number that can be compared across sources. For a survey of some alternative measures, see e.g. Scott Jarvis, "Short texts, best-fitting curves and new measures of lexical diversity", Language Testing 2002; and for the measure that we've settled on, see Michael Covington and Joe McFall, "Cutting the Gordian knot: The moving-average type–token ratio (MATTR)", Journal of Quantitative Linguistics 2010. More on that later.

For now, I want to make a point that depends only on type-token graphs. Over time, I've accumulated a small private digital corpus of more than 100 English-language fiction titles, from Tristram Shandy forward to 2019. It's clear that different authors have different characteristic rates of vocabulary display, and for today's post, I want to present the authors in my collection with the highest and lowest characteristic rates.

Thomas Pynchon registers the highest rates, and Ernest Hemingway the lowest:

The unfolding of the type-token graph depends on the author's style, but also on changes in topic introducing new words. So it's worth comparing the same plots after randomly permuting each book's words:
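The permutation step can be sketched along these lines. Again, this is only an illustration under my own assumptions: the function names and the fixed `seed` are hypothetical, chosen so that a run can be repeated. Shuffling keeps each book's word frequencies intact but destroys any clustering of vocabulary by chapter or topic.

```python
import random


def type_token_curve(tokens):
    """Number of distinct types seen after each token, in order."""
    seen = set()
    curve = []
    for tok in tokens:
        seen.add(tok)
        curve.append(len(seen))
    return curve


def permuted_curve(tokens, seed=0):
    """Shuffle a copy of the tokens and return (shuffled_tokens, type_token_curve).

    The seed is an assumption made for reproducibility; any seed should give
    a similar curve for a long text.
    """
    rng = random.Random(seed)
    shuffled = list(tokens)  # copy, so the original order is left untouched
    rng.shuffle(shuffled)
    return shuffled, type_token_curve(shuffled)
```

Both curves end at the same point, since the total numbers of types and tokens are unchanged; only the path between the origin and that endpoint differs.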

The type-token plots for the randomly permuted texts are smoother, but otherwise similar. The effect of changes in style and topic shows up more clearly when we compare Hemingway with the Bible:

Note: In analyzing the King James Bible, I removed the verse numbers, so that

01:001:001 In the beginning God created the heaven and the earth. 01:001:002 And the earth was without form, and void; and darkness was upon the face of the deep. And the Spirit of God moved upon the face of the waters. 01:001:003 And God said, Let there be light: and there was light. 01:001:004 And God saw the light, that it was good: and God divided the light from the darkness. 01:001:005 And God called the light Day, and the darkness he called Night. And the evening and the morning were the first day.

becomes

In the beginning God created the heaven and the earth. And the earth was without form, and void; and darkness was upon the face of the deep. And the Spirit of God moved upon the face of the waters. And God said, Let there be light: and there was light. And God saw the light, that it was good: and God divided the light from the darkness. And God called the light Day, and the darkness he called Night. And the evening and the morning were the first day.
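A clean-up of this kind can be done with a single regular expression. The sketch below assumes that every verse marker has the two-digit, three-digit, three-digit shape seen in the sample above (book:chapter:verse); the function name is hypothetical, and a different edition of the text may need a different pattern.

```python
import re

# Assumed marker shape, taken from the sample: NN:NNN:NNN followed by whitespace.
VERSE_MARKER = re.compile(r"\b\d{2}:\d{3}:\d{3}\s*")


def strip_verse_numbers(text):
    """Remove book:chapter:verse markers, leaving the running text."""
    return VERSE_MARKER.sub("", text).strip()


sample = (
    "01:001:001 In the beginning God created the heaven and the earth. "
    "01:001:002 And the earth was without form, and void;"
)
print(strip_verse_numbers(sample))
```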

