Word frequencies in LOTR vs. Dickens
Following up on "Meadow writing", I thought it might be interesting to look at LOTR-associated word frequencies, using the "weighted log-odds-ratio, informative Dirichlet prior" algorithm from Monroe, Colaresi, and Quinn 2009, "Fightin' Words", as discussed in seven previous LLOG posts. In particular, I thought I'd compare The Fellowship of the Ring to 16 of Charles Dickens' works.
Given existing scripts, this was an easy half-hour Breakfast Experiment™.
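For readers who haven't seen the method, here is a minimal Python sketch of the per-word z-score as I understand it from the Monroe et al. paper. It is not the script used for this post: the function name, the prior total `ALPHA_0 = 1000`, and the choice of the pooled frequency as the prior are all assumptions made for illustration, so it should not be expected to reproduce the scores in the listing in this post. The example counts are the ones that listing gives for "her".

```python
import math

def weighted_log_odds_z(y_i, n_i, y_j, n_j, alpha_w, alpha_0):
    """z-score of the log-odds-ratio for one word, with a Dirichlet prior.

    y_i, y_j : counts of the word in corpus i and corpus j
    n_i, n_j : total token counts of the two corpora
    alpha_w  : prior pseudo-count for this word (hypothetical choice below)
    alpha_0  : sum of the prior pseudo-counts over the whole vocabulary
    """
    odds_i = (y_i + alpha_w) / (n_i + alpha_0 - y_i - alpha_w)
    odds_j = (y_j + alpha_w) / (n_j + alpha_0 - y_j - alpha_w)
    delta = math.log(odds_i) - math.log(odds_j)
    # Approximate variance of the log-odds-ratio estimate.
    variance = 1.0 / (y_i + alpha_w) + 1.0 / (y_j + alpha_w)
    return delta / math.sqrt(variance)

# Counts and per-million rates for "her" as given in the post's listing.
x_count, x_per_million = 161, 862.734      # The Fellowship of the Ring
y_count, y_per_million = 29807, 7404.64    # 16 Dickens books
pooled_per_million = 7114.8

# Corpus sizes recovered from count / per-million rate.
n_x = x_count / x_per_million * 1e6
n_y = y_count / y_per_million * 1e6

# Assumption: prior proportional to the pooled frequency, total mass 1000.
ALPHA_0 = 1000.0
alpha_her = ALPHA_0 * pooled_per_million / 1e6

z_her = weighted_log_odds_z(x_count, n_x, y_count, n_y, alpha_her, ALPHA_0)
print(f"z-score for 'her': {z_her:.2f}")
```

With the first corpus taken to be Tolkien, a negative value marks a word as Dickens-associated, which is the sign convention in the listing. The magnitude depends on the prior and on any scaling applied, neither of which is specified here.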
And the results were mostly as expected. The Fellowship of the Ring end of the list is mostly populated with LOTR proper names, like frodo, gandalf, bilbo, hobbits, pippin, etc. There are also a fair number of landscape-related words, as expected given that the plot involves a mostly-outdoor journey: mountains, trees, hills, path, forest, river, woods, etc. And the Dickens end of the list was also (mostly) not a surprise, at least in retrospect:
her 161 (862.734) 29807 (7404.64) 29968 (7114.8) -8.225
mr 158 (846.658) 28604 (7105.79) 28762 (6828.48) -8.031
she 158 (846.658) 19771 (4911.5) 19929 (4731.41) -6.253
my 487 (2609.64) 25091 (6233.1) 25578 (6072.56) -4.927
mrs 5 (26.793) 8128 (2019.15) 8133 (1930.88) -4.784
which 249 (1334.29) 16272 (4042.28) 16521 (3922.31) -4.571
sir 48 (257.213) 8392 (2084.74) 8440 (2003.77) -4.308
man 64 (342.95) 8640 (2146.35) 8704 (2066.45) -4.186
his 1569 (8407.64) 51118 (12698.7) 52687 (12508.6) -4.092
with 1115 (5974.84) 39135 (9721.9) 40250 (9555.89) -4.076
miss 12 (64.3032) 5914 (1469.15) 5926 (1406.91) -3.950
me 457 (2448.88) 20273 (5036.21) 20730 (4921.58) -3.903
's 678 (3633.13) 26201 (6508.84) 26879 (6381.43) -3.816
The differences in morpho-syntactic style might be interesting — which is 3 times more common in Dickens, and 's is almost twice as common — but Tolkien's lack of female pronouns (her is more than 8 times more common in Dickens, and she is almost six times more common) is an obvious consequence of the gender composition of the Fellowship.
As explained before, the lines in the output files have the fields
WORD XCount (XPerMillion) YCount (YPerMillion) ZCount (ZPerMillion) SCORE
…where in this case X=The Fellowship of the Ring, Y=16 Dickens books, and Z is the sum of X and Y.
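If you want to work with the output files yourself, a line in this format can be read with a few lines of Python. This is only a sketch: the `Row` type and `parse_row` function are names I made up for illustration, and it assumes the eight whitespace-separated fields shown above, with the per-million values in parentheses.

```python
from typing import NamedTuple

class Row(NamedTuple):
    word: str
    x_count: int
    x_per_million: float
    y_count: int
    y_per_million: float
    z_count: int
    z_per_million: float
    score: float

def parse_row(line: str) -> Row:
    """Parse 'WORD XCount (XPerMillion) YCount (YPerMillion) ZCount (ZPerMillion) SCORE'."""
    word, xc, xpm, yc, ypm, zc, zpm, score = line.split()
    return Row(
        word,
        int(xc), float(xpm.strip("()")),
        int(yc), float(ypm.strip("()")),
        int(zc), float(zpm.strip("()")),
        float(score),
    )

row = parse_row("her 161 (862.734) 29807 (7404.64) 29968 (7114.8) -8.225")

# How many times more frequent (per million words) the word is in Y than in X.
ratio = row.y_per_million / row.x_per_million
print(f"{row.word}: {ratio:.1f} times more common in Y than in X")
```

The same ratio of per-million rates is what lies behind statements like "her is more than 8 times more common in Dickens".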
Kenny Easwaran said,
April 11, 2026 @ 5:11 pm
Clicking through to the full list, I'm a bit surprised at how words like "trees" and "hills" and "they" and "ring" manage to appear so much higher in the list than words like "Isildur" and "Gondor" and "Galadriel". I guess this is some effect of the Dirichlet prior that I'm not entirely understanding, where the fact that "they" is about 1% of Tolkien's words and only 0.3% of Dickens's means that we have a lot of information to tell us that Tolkien is using that word more than Dickens, while the fact that "trees" is only 0.1% of Tolkien's and 0.008% of Dickens's gives us less information, despite the ratio being higher, and the fact that "Galadriel" is only 0.02% of Tolkien's words and 0 of Dickens's means that we're still closer to the prior and thus are not as confident that Tolkien really uses it more?
D.O. said,
April 12, 2026 @ 12:13 am
Kenny Easwaran, I don't think priors play much of a role here. The basic insight is that, by the law of large numbers, it is much harder to have a large difference in relative frequencies just by chance if the frequencies themselves are large. The simplistic measure that I like most to compensate for that is (f1-f2)/sqrt(f1+f2). Your three examples give, on my simplistic no-priors scale, 0.61, 0.28, and 0.14, respectively. Really, though, (f1-f2)/sqrt(n1+n2) should work better if the sizes of the texts are somewhat different, and informative priors should help if the sizes are very different.
I've done the calculation this simplistic way (the second one) on Prof. Liberman's data. On the Dickens side, the first difference in ranking is that the 18th and 19th words are switched; on Tolkien's side it comes much earlier, with the 7th and 8th (Aragorn and Shire) switched, but they are very close anyway.
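The first of these measures is easy to check. Here is a minimal sketch, assuming f1 and f2 are the frequencies expressed in percent (my assumption; with that reading the three quoted values come out as stated). The function name and the dictionary of examples are made up for illustration, and the percentages are the rough ones from Kenny Easwaran's comment.

```python
import math

def simple_contrast(f1: float, f2: float) -> float:
    """(f1 - f2) / sqrt(f1 + f2), with f1 and f2 in percent (assumed)."""
    return (f1 - f2) / math.sqrt(f1 + f2)

# Rough percentages quoted in the comment above: (Tolkien, Dickens).
examples = {
    "they": (1.0, 0.3),
    "trees": (0.1, 0.008),
    "Galadriel": (0.02, 0.0),
}

scores = {word: round(simple_contrast(f1, f2), 2)
          for word, (f1, f2) in examples.items()}
print(scores)
```

The ordering puts "they" first, then "trees", then "Galadriel", even though the frequency ratios run the other way.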
F said,
April 12, 2026 @ 8:53 am
You have to go a bit farther down the list to what I think of as hallmarks of Tolkien's somewhat elevated/archaizing stylization, stuff like "sprang" and "halted". (To compare, "jumped" is clearly still on the Tolkien end but much farther down, and "stopped" is in Dickens' half.) I suppose the high rank of "the" is related to this somehow.
Anyway, the nature descriptions were always what kept me coming back to LotR…