Zipf's demon
George Kingsley Zipf is famous for his work on the power-law distribution of word frequencies, which has come to be known as Zipf's Law. And he's also known for the related "Law of Abbreviation", and the hypothesized balance between effort and efficacy.
In his 1945 paper "The repetition of words, time-perspective, and semantic balance", Zipf looks at a different distribution, which is much less famous:
In the present study we shall attempt to show in preliminary outline how the rate of repetition of words in the stream of speech may be useful not only in indicating what we shall presently define as "time-perspective" but also in elucidating what we shall presently refer to as "semantic balance" – two terms of potential significance in the understanding of personality variants.
"Personality variants?" Wait for it…
That paper's Figure 2, which presents its main empirical evidence about word-repetition intervals, gives us a clue about why the initial uptake for this idea was so slow:
Caption: The Number of Intervals of Like Sizes (in Terms of Pages) between the Repetitions of Words Occurring Five Times in James Joyce's Ulysses with Interval-Sizes Taking on Integral Values from 1 Through 50 Pages Inclusive.
Zipf could start from published word-count data in that case (M.L. Hanley's 1937 Word Index to James Joyce's Ulysses), but the analysis was still a labor-intensive addition to Hanley's labor-intensive foundation. Digital text and computer analysis make such analyses easy today by comparison, though few people have carried them out. More on that in a later post.
For now, I want to share with you a striking (or maybe weird) idea that occupies most of Zipf's 1945 paper, presenting a mathematical model for a demon ringing a set of bells.
Zipf introduces his bell-demon this way:
Let us take n bells that are equivalent in size and equally difficult to ring, and then let us attach them to a long straight board in such a manner that the bells are equally spaced along the board. At one end of the board we shall place a blackboard ruled with n-columns for the respective bells; and we shall also station a demon there to act as bell-ringer. The demon must ring one bell once each second of time, and after he has finished ringing a bell once he must return to the blackboard to record that fact in the bell's column. Thus in order to ring one bell 10 times, or 10 bells once each, he will make 10 round trips down the board and back in the space of 10 seconds, and will have 10 marks therefor on the blackboard. (And we shall ask the demon to make his round trips over shortest distances).
This analogue is interesting for many reasons. First of all the demon's work, w, in terms of making a round trip to ring a given bell, will increase in direct proportion to the bell's distance, d, from the blackboard (or w = d). And since the distance of the respective bells increases integrally from the blackboard (i.e., 1d, 2d, 3d, …, nd), it follows that the bells are arranged in respect of the demon's work, w, in getting to and from them according to the simple series, 1w, 2w, 3w, …, nw.
Now if we ask our demon to ring each bell with a frequency, f, that is inversely proportionate to the round-trip work involved, or in equation form, $w \times f = C$, he will ring the closer (and easier) bells proportionately more often than the distant (and harder) bells. And since the ranked-frequency in decreasing order, r, with which each bell is rung will be equal to the bell's w above, we come upon the familiar equation:
(1) $r \times f = C$
However if we now ask the demon to ring all bells according to Equation 1 but to stop after he has rung the nth and farthest bell once (n = C) and after he has rung all other bells their allotted times, then the n bells will have been rung approximately according to the equation
$$F \cdot S_n = \frac{F}{1} + \frac{F}{2} + \frac{F}{3} + \cdots + \frac{F}{n}$$
in which $F \cdot S_n$ represents the total of round trips made (as well as the total number of running seconds of time) and where $F$ represents the total number of times the nearest bell is rung, and where $\frac{F}{n} = 1$ (or, if you will, where $F = n$), with p omitted above because it equals 1.
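To make the arithmetic concrete, here is a minimal Python sketch of the ring counts implied by the passage above. The function name `bell_schedule` and the choice of n = 4 are illustrative assumptions, not anything from Zipf's paper, and exact fractions are used because the allotted counts F/r are not always whole numbers (which is presumably why Zipf says "approximately").

```python
from fractions import Fraction


def bell_schedule(n):
    """Return (counts, total) for n bells, assuming F = n as in the text.

    counts[r - 1] is the allotted number of rings F / r for the bell of
    rank r; total is the sum of all counts, i.e. the total number of
    round trips (and seconds).
    """
    F = n  # the farthest bell is rung once, so F / n = 1
    counts = [Fraction(F, r) for r in range(1, n + 1)]
    total = sum(counts)
    return counts, total


counts, total = bell_schedule(4)
print([str(c) for c in counts])  # ['4', '2', '4/3', '1']
print(total)                     # 25/3

# Rank times frequency is the same constant C = F for every bell.
products = [r * f for r, f in enumerate(counts, start=1)]
print(all(p == 4 for p in products))  # True
```

With four bells the demon makes 25/3 round trips in total, which is 4 times the harmonic sum 1 + 1/2 + 1/3 + 1/4.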
This gives him his power law for the counts of individual bells, but so far, it puts no constraint on their inter-ringing intervals. As he observes:
Of course the above equation puts no restriction upon the order in which the demon rings the bells. Thus he may ring the nearest bell its allotted $F$ times before ringing the 2nd nearest bell its allotted $F / 2$ times, and so on progressively down the board until he has rung the nth and farthest bell a single time. In short he might always ring "the easiest remaining bell first," while postponing as long as possible the more distant and hence more difficult bells. The chief drawback of ringing "the easiest first" is that the demon will be forced to run faster and faster, and therefore to work at an ever increasing rate, as he proceeds farther and farther down the board, if he is to complete each round-trip within the prescribed second. And in so doing he will be unevenly distributing his work over time with the risk of collapsing before he gets the nth bell rung.
So he adds a policy to optimize the demon's effort:
In order to correct this uneven distribution of work over time, we may ask the demon to distribute his work as evenly as possible over time while still ringing his bells according to Equation 3. Yet as soon as he does distribute his work evenly over time, he will automatically ring the bells in such a way that the sizes of the interval, $I_{f}$, between the respective repetitions of the bells will approximate the equation:
(4) $N^{p} \cdot I_{f} =$ a constant
with the exponent, $p$, equal to 1.
For more demonic mathematics about "balancing the frequency of easy acts against the rarity of difficult acts", read the paper, if you're interested. For present purposes, let's jump to Zipf's observation about the "abnormal time-perspective […] represented by the median 1.20 slope of Joyce's Ulysses […] which suggests a slightly abnormal preference for longer intervals".
Thus having once "rung a bell," Joyce tends systematically to avoid its repetition abnormally. In other words, events of the past (as represented by words) seem to be systematically more remote from the present than is actually the case with 1.00 time-perspective. Although this general type of over-long time distortion is probably not infrequent among those personalities who focus their attention primarily upon the present moment, it is interesting to note that this particular distortion of time is found in a novel that is characterized for just that attribute (if we may so interpret the words, "stream of consciousness" writing).
And now the punch line:
Other types of time-perspective — and not necessarily linear — can be defined in terms of the bell-analogy, yet there is one we mention cursorily lest it be ignored. We refer to the case in which the demon saves work and simplifies the problem of distributing his work evenly over time by simply bending the straight board into a quasi-arc. In this fashion the distant bells become nearer, and the demon can take short-cuts to them. This type of time-distortion we shall call schizophrenic unbalance and we shall treat it in greater detail in a future publication.
Time-perspective, in terms of the distribution of minimalized work over time (with all its endless ramifications) would seem to be an inviting topic for the study of the normal and abnormal of human mental behavior.
As far as I can tell, Zipf never actually treated "schizophrenic unbalance […] in greater detail in a future publication". This may be because he died in 1950 at the age of 48.
Nor did anyone else follow up on inter-word repetition statistics as a sign of "schizophrenic unbalance", at least not using the same phrase, though there are some adjacent things, like this paper, and commenters may be able to point us to others.
Update — To avoid further misunderstandings, let me point out that the cited 2013 paper (Todder et al., "Non-Linear Dynamic Analysis of Inter-Word Time Intervals in Psychotic Speech") is based on a completely different measure.
Zipf's metric was the interval (in pages) between two occurrences of the same word, e.g. the word "accurate" occurs in the novel Ulysses on pages 434, 575, 590, 605, and 615, yielding intervals of 141, 15, 15, and 10 pages.
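As a minimal sketch of that metric in Python (the function name `repetition_intervals` is just an illustrative choice), the intervals are the differences between successive page numbers on which the word occurs:

```python
def repetition_intervals(pages):
    """Differences between successive occurrences of one word.

    `pages` is the sorted list of page numbers on which the word occurs.
    """
    return [later - earlier for earlier, later in zip(pages, pages[1:])]


# Page numbers for "accurate" in Ulysses, as given in the text above.
accurate_pages = [434, 575, 590, 605, 615]
intervals = repetition_intervals(accurate_pages)
print(intervals)  # [141, 15, 15, 10]
```

A word occurring five times always yields four intervals, which is why Zipf's Figure 2 can pool the words of one frequency class.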
Todder et al. measure the interval in seconds between successive words (whatever they are) in the stream of speech, so that the production in TIMIT of SA1 by speaker FLNH0
0.188 0.378 she
0.378 0.637 had
0.587 0.703 your
0.703 1.010 dark
1.010 1.339 suit
1.339 1.426 in
1.426 1.773 greasy
1.773 2.091 wash
2.149 2.478 water
2.478 2.643 all
2.643 2.938 year
yields inter-word-onset intervals in seconds of
0.190 0.209 0.116 0.307 0.329 0.087 0.347 0.376 0.329 0.165
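For comparison, here is a small Python sketch of this second measure, using the start time, end time, and word triples listed above. The variable names are illustrative, and rounding to three decimals is an assumption made only to suppress floating-point noise.

```python
# (start, end, word) in seconds, copied from the alignment shown above.
alignment = [
    (0.188, 0.378, "she"), (0.378, 0.637, "had"), (0.587, 0.703, "your"),
    (0.703, 1.010, "dark"), (1.010, 1.339, "suit"), (1.339, 1.426, "in"),
    (1.426, 1.773, "greasy"), (1.773, 2.091, "wash"), (2.149, 2.478, "water"),
    (2.478, 2.643, "all"), (2.643, 2.938, "year"),
]

# Onset of each word, regardless of which word it is.
onsets = [start for start, _end, _word in alignment]

# Interval between each onset and the next one.
onset_gaps = [round(b - a, 3) for a, b in zip(onsets, onsets[1:])]
print(onset_gaps)
```

The contrast with Zipf's metric is that nothing here depends on word identity: eleven words give ten intervals whether or not any word repeats.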
I should also note that Todder et al. don't cite Zipf 1945, and show no signs of having been influenced in any way by that work.
Apologies for confusing people by waving a hand in the direction of the 2013 paper…
D.O. said,
October 25, 2024 @ 3:02 pm
I remember a few years ago when LLog discussed changing frequency of "the" over time, I took a longish open source text and checked the distribution of the-the distances in terms of words. It was not exponential (corresponding to no correlation), but closer to some sort of gamma, consistent with the idea that the text was not uniform in its formality throughout, but had some local peaks and troughs in terms of density of "the". Zipf's observation seems to be in the opposite direction.
Jason M said,
October 26, 2024 @ 11:49 am
I am curious about the interpretation of page-to-page intervals in examples like that of Joyce. For there to be a pattern (one that implies something about cognition or linguistics) of word appearance relative to page number in a book, the book would have to be written in some sort of sequence relative to the order in which it was eventually published. For example, how would one interpret word frequency vs page number when an author might have actually written chapter 25 first, then went back to 14 and 15, and they finished the book by finally rewriting chapters 1 to 3?
Now, the assumption with Ulysses would be that, because it is stream-of-consciousness, it might have been written in order, but we would need historical info on Joyce’s writing process (which probably exists).
In the particular case of Ulysses, it was released as a serial, so maybe it was written in the same order as the published page numbers, but what if it was only released as a serial but wasn’t written that way?
In any case, I’d expect generalizability of frequency vs published page number would be either often confounded by these issues, or else there would have to be some other, non author-related, cognition/linguistics underlying reason for any correlations.
Philip Taylor said,
October 27, 2024 @ 5:44 am
Jason, may I ask what leads you to believe that an author might "have […] written chapter 25 first, then [gone] back to 14 and 15, and […] finished the book by […] rewriting chapters 1 to 3" ? I can understand why a film might be made this way, because if the opening and closing scenes were set 12 000 miles from the studio it would make perfect sense to make a single trip to film both, but why might an author elect to write a book in a non-linear manner ?
Mark Liberman said,
October 27, 2024 @ 5:46 am
@Jason M:
Good questions.
Bell-ringing demons aside, Zipf doesn't really give us an answer — just a reference to his standard idea about the joint optimization of effort and efficacy.
FWIW, the same general pattern of inter-occurrence intervals can be found in other works, as I'll show in an upcoming post. These include cases that no one would call "stream of consciousness".
Idran said,
October 27, 2024 @ 4:46 pm
@Philip Taylor: That might not be common in a first draft, and I'm not sure how common it is in nonfiction works, but it's _incredibly_ common for edits (even major edits) to be nonlinear in prose. You might realize that one chapter's ending doesn't flow as well as it could into the next chapter, or that you should set up for a later scene better in an earlier chapter, or that a less plot-relevant scene would fit the flow better if you moved it earlier, or things like that.
Benjamin E. Orsatti said,
October 28, 2024 @ 8:42 am
Ack! My comments disappeared! I'll try again:
Not so surprising, maybe. Clever fellow's Muse whispers to him to look for a relation between an observed linguistic expression (i.e., the length of words) and the internal state of a _normal_ mind using a given language within a given cultural milieu. He finds just such a relation! He builds a physical model (the demon-dulcimer thingie) that elegantly fits the statistical mathy-stuff and then wonders what happens when he alters the physical model in some way. He then re-does the math and notices — lo and behold — that this particular statistical deformation _does_ manifest itself in the wild, among schizophrenics! But, realizing he didn't have enough neuropsychiatric "chops" to carry that thought much further, he had to leave it lie.
[…]
Zipf was counting spaces between identical written words, and Todder, et al., are counting time intervals between adjacent spoken words, which, apparently, differ interestingly among schizophrenic-treated, schizophrenic-untreated, and non-schizophrenic subjects. As to exactly _how_ they differ interestingly, you "shall treat it in greater detail in a future publication" (har!).
[…]
[Y]ou can't leave me (and, perhaps, literally ones of people out there) hanging on the baffling precipice of:
. I've looked everywhere, and I got 2, _maybe_ three degrees out there, if you believe my parents who said there's one more out in the storage bin, but I don't recall ever having seen it there…