Language Log

Depopularization in the limit

April 4, 2013 @ 8:13 am · Filed by Geoffrey K. Pullum under Computational linguistics, Language and politics, Politics of language, Prescriptivist poppycock, Style and register, Usage advice, Words words words, Writing


George Orwell, in his hugely overrated essay "Politics and the English language", famously insists you should "Never use a metaphor, simile, or other figure of speech which you are used to seeing in print." He thinks modern writing "consists in gumming together long strips of words which have already been set in order by someone else" (only he doesn't mean "long"), joining together "ready-made phrases" instead of thinking out what to say. His hope is that one can occasionally, "if one jeers loudly enough, send some worn-out and useless phrase … into the dustbin, where it belongs." That is, one can eliminate some popular phrase from the language by mocking it out of existence. In effect, he wants us to collaborate in getting rid of the most widely used phrases in the language. In a Lingua Franca post published today I called his program elimination of the fittest (tongue in cheek, of course: the proposal is actually just to depopularize the most popular).

For a while, after I began thinking about this, I wondered what would be the ultimate fate of a language in which this policy was consistently and iteratively implemented. I even spoke to a distinguished theoretical computer scientist about how one might represent the problem mathematically. But eventually I realized it was really quite simple; at least in a simplified ideal case, I knew what would happen, and I could do the proof myself.

For this purpose, we can take a language to be just a huge collection of sequences of words. It is customary in mathematical linguistics to assume that the collection is denumerably infinite, but that the words come from a finite dictionary, and I will make those assumptions here. I will also assume that phrases all have different frequencies. (It isn't crucial: if two phrases with identical frequency topped the frequency chart, they could be banned simultaneously.)

Without loss of generality we can ask what would happen if it were always two-word sequences that were at issue. (The effects would be similar, mutatis mutandis, for phrases of any other length, though banning two-word sequences is more radical and powerful in its effects.) Computational linguists refer to sequences of length 2 as bigrams.

Every sentence is of course made up entirely of bigrams. For example, that last sentence is made up of the bigrams (1) "every sentence", (2) "sentence is", (3) "is of", (4) "of course", (5) "course made", (6) "made up", (7) "up entirely", (8) "entirely of", and (9) "of bigrams". Each of them will have some frequency of occurrence, which you can check with the Google Ngram Viewer (with its utterly inscrutable mode of frequency reporting, criticized by Mark Liberman here): the bigram "is of" is shown with a frequency of around 0.007% to 0.008%, while "of course" has a higher frequency, generally in the range of 0.010% to 0.014%.
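For readers who want to see the decomposition mechanically, here is a minimal Python sketch. The function name bigrams and the crude tokenization (lowercasing, stripping a final period, splitting on whitespace) are my own assumptions for illustration, not anything a real corpus tool is claimed to do.

```python
def bigrams(sentence):
    """Return the adjacent word pairs of a sentence, in order.

    Tokenization is deliberately naive (an assumption for this sketch):
    lowercase, drop a trailing period, split on whitespace.
    """
    words = sentence.lower().rstrip(".").split()
    # Pair each word with its successor: n words give n - 1 bigrams.
    return list(zip(words, words[1:]))


example = "Every sentence is of course made up entirely of bigrams."
pairs = bigrams(example)

for number, (first, second) in enumerate(pairs, start=1):
    print(f"({number}) {first} {second}")
```

A ten-word sentence yields nine bigrams, matching the list above.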

Now, it is crucially relevant that only a finite number of bigrams exist for any finite dictionary. To be precise, if the dictionary contains N words, there are a maximum of N² bigrams. This includes bigrams that don't normally occur, like of of. (Google seems to lie about such things: it says "of of" occurs in the page at http://www.upenn.edu/, but it does not.) More generally, given N words there are exactly N^k possible k-grams for each k greater than 1.
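The count is easy to check by brute force on a tiny dictionary. The sketch below is illustrative only: the three-word dictionary toy and the helper all_kgrams are hypothetical names of mine, and enumerating every k-gram is obviously feasible only for very small N.

```python
from itertools import product


def all_kgrams(dictionary, k):
    """Enumerate every possible k-word sequence over the dictionary."""
    # product(..., repeat=k) yields all ordered k-tuples, with repetition,
    # so sequences like ("go", "go") are included.
    return list(product(dictionary, repeat=k))


toy = ["go", "duck", "slab"]  # hypothetical dictionary with N = 3

for k in (1, 2, 3):
    print(k, len(all_kgrams(toy, k)), len(toy) ** k)
```

For each k the enumerated count and N^k agree, and the repeated-word bigrams (the analogues of "of of") are among those counted.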

Every sentence is made up of a selection of these N² bigrams, and what it does to the language as a whole when you ban a specific bigram wx is that all the sentences containing that bigram become impermissible. What remains is just the sentences that do not contain wx. Others may contain w or x, but not juxtaposed as wx. When another bigram is banned, say uw, other sentences become impermissible.

So fewer and fewer sentences are permissible as this banning of the most frequent is implemented. Bit by bit the language is whittled away, and the end point, in the limit, is that only one-word sentences are permissible.

To prove this, suppose (for contradiction) that all N² bigrams over some dictionary D of N words have been banned, but some multi-word sentence S is nonetheless permitted. S must have a first word, call it y, and a second word, call it z. But y and z must both be in D, so yz is a bigram of words in D. Since all bigrams have been forbidden, S is impermissible in virtue of its first two words: a contradiction. Hence there can be no such S.

There may be weaker sets of assumptions under which Orwell's depopularization in the limit does not yield this conclusion; for example, it could be stipulated that after a few decades of exile some once high-frequency phrases come back at low frequency. But under the simple procedure that you look to see which word sequence of some designated length k is currently the most frequently occurring and you ban it, and then iterate, the result is always that the language empties out of all sentences of length equal to or greater than k.
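The simple procedure can be tried out on a toy scale. What follows is only a sketch under loud assumptions: a real language is infinite, whereas this uses a small finite list of sentences; the names kgrams, depopularize and corpus are hypothetical; frequency is just a raw count over the currently permitted sentences; and ties at the top are banned simultaneously, as suggested earlier.

```python
from collections import Counter


def kgrams(words, k):
    """All contiguous k-word sequences in a list of words."""
    return [tuple(words[i:i + k]) for i in range(len(words) - k + 1)]


def depopularize(corpus, k=2):
    """Repeatedly ban the most frequent k-gram until none is left.

    Returns (surviving sentences, list of banned k-grams).
    Toy model: the corpus is a finite list of sentences (an assumption).
    """
    permitted = [sentence.split() for sentence in corpus]
    banned = []
    while True:
        counts = Counter(g for s in permitted for g in kgrams(s, k))
        if not counts:
            # No k-grams remain, so nothing is left to ban.
            return [" ".join(s) for s in permitted], banned
        top = max(counts.values())
        # Ban every k-gram tied for the highest frequency.
        worst = {g for g, c in counts.items() if c == top}
        banned.extend(sorted(worst))
        # A sentence containing any banned k-gram becomes impermissible.
        permitted = [s for s in permitted if not worst & set(kgrams(s, k))]


corpus = ["go", "duck", "of course", "of course not", "go away", "slab"]
survivors, banned = depopularize(corpus, k=2)
print(survivors)
print(banned)
```

Each round removes at least one sentence, so the loop must stop, and what survives is exactly the sentences shorter than k words. Note that a k-gram need not be banned explicitly to vanish: here "course not" disappears as soon as "of course" is banned, because every sentence containing it is already gone.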

Orwell, a great protector of the English language? The man was its worst enemy! English would actually be endangered if we'd gone the way he wanted: it would eventually have emptied out of all sentences other than one-word utterances like "Go!" or "Duck!" or "Slab!". (Until the method started being used on the most frequent one-word utterances. Then the language would start being completely evacuated, and eventually there would be no more permissible utterances of any length.)

In case you were wondering, though, things didn't go the way he wanted. Take the case of two phrases he specifically claims had been killed by 1946:

Silly words and expressions have often disappeared, not through any evolutionary process but owing to the conscious action of a minority. Two recent examples were explore every avenue and leave no stone unturned, which were killed by the jeers of a few journalists.

No they weren't. You can use the Google Ngram Viewer to check for yourself that explore every avenue first starts appearing in books around the end of the First World War, and its (fairly low) frequency rose very slightly during the period from 1944 to 1948; and while the considerably more frequent leave no stone unturned did take a slight dip in frequency around the time Orwell's essay was being written, it continued on happily through the rest of his life and onward to the present day.

These and other such facts show that Orwell's idea of elimination of the most popular phrases in the language, in addition to being an absolutely potty idea for style improvement, has been an abject failure.
