The paucity of two-letter words


The number of possible two-letter lower-case strings over the English alphabet (not including the apostrophe) is 26² = 676. This morning I ran a script to test which two-letter sequences show up as words in the standard 25,143-word list supplied with many Unix-derived systems (usually at /usr/share/dict/words). I found that the proportion of two-letter sequences that are two-letter words is roughly 9 percent (59/676 ≈ 0.09). That is, more than 90 percent of the logically possible two-letter combinations from aa to zz do not occur as spellings of common English words.

You might think a lot of the explanation lies in phonetics: vowelless combinations like pq or bn are unpronounceable. But I then did the same thing for two-letter standard Unix commands: bc (basic calculator), cp (copy files), ls (list files), mv (move or rename files), etc. These arbitrarily adopted program names do not have to be pronounceable, and usually aren't. And I found that the proportion of two-letter sequences that are Unix commands (more precisely, two-letter commands that have manual entries on Apple OS X version 10.6.8) is almost exactly the same (62/676 ≈ 0.09).

Why? Could it be that some kind of natural law discourages packing too many meanings into character strings (or phoneme sequences) of a given length, because it is likely to give rise to confusion or mnemonic problems? Does every language waste (as it were) at least 90 percent of the space available in the length-N sequences of letters or sounds that it uses, possibly for every N > 1?
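For readers who want to repeat the word-list count, here is a minimal sketch in Python. It is not the script used for the post: the function name `two_letter_fill` is made up for illustration, and the path /usr/share/dict/words is assumed to exist, which will not be true on every system. Word lists differ between systems, so the count may not come out at 59.

```python
import string
from itertools import product
from pathlib import Path


def two_letter_fill(words):
    """Return (hits, total, proportion) for lower-case two-letter strings.

    `words` is any iterable of strings, e.g. the lines of a word list.
    Capitalised entries such as "Al" are not counted as hits.
    """
    letters = string.ascii_lowercase
    # All 26 * 26 = 676 candidate strings from "aa" to "zz".
    candidates = {a + b for a, b in product(letters, repeat=2)}
    listed = {w.strip() for w in words}
    hits = sorted(candidates & listed)
    return hits, len(candidates), len(hits) / len(candidates)


if __name__ == "__main__":
    # Assumed location of the word list; adjust for your system.
    path = Path("/usr/share/dict/words")
    if path.exists():
        lines = path.read_text(encoding="utf-8", errors="replace").splitlines()
        hits, total, proportion = two_letter_fill(lines)
        print(f"{len(hits)}/{total} = {proportion:.3f}")
        print(" ".join(hits))
```

Printing the hits as well as the count makes it easy to see how many of them are abbreviations rather than words.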

Just something I was wondering about today, for no particular reason. Perhaps statistical linguists have considered the question somewhere. It's not important; ignore me if you wish. The best way to ignore me, of course, would be to refuse to enter any comments in the area below this post. You have the right to remain silent, but anything you do say will be taken down and kept on the Internet forever and may be used against you in a court of law or mocked by morons on fark.com or answered here or elsewhere.

[Update: The first few versions of the above post made several different mistakes about the numbers. For example, I wrote 26²⁶ by mistake for 26², and the text sometimes talked about ratios when I wanted to talk about proportions. Sorry about that. My bad! These things are (I think) fixed in the above post now, and the comments pointing them out (thank you, especially Chris Hunt) are being removed to avoid confusing searches for things that aren't there.]



42 Comments

  1. Adam Funk said,

    September 3, 2014 @ 10:36 am

    That reminded me of the obscure words acceptable in Scrabble. The Scrabble wordlist for the UK has 124 two-letter words (18%), the US wordlist has 101 (15%), and the "Enable1" public-domain list has a mere 96 (14%).

  2. wally said,

    September 3, 2014 @ 10:53 am

    why n > 1? It seems true for n=1 in English. A and I are used, but E and U and maybe O are also wasted.

  3. Cass said,

    September 3, 2014 @ 10:55 am

    I was just reading about the incidence of these morphemes in Polynesian languages (I think it was (C)V rather than strictly two-letter words, but Polynesian languages generally only allow syllables of type V or CV) and it was much higher than the numbers you cite. When I get my hands on my book again I'll report back with the numbers cited, assuming someone else doesn't get there first. But well over 50% of the one-syllable morphemes are filled in most of the Polynesian languages, as I recall.

  4. Brett said,

    September 3, 2014 @ 11:04 am

    @wally: The first sentence of the Just So Stories includes the word "O": "In the sea, once upon a time, O my Best Beloved, there was a Whale, and he ate fishes." Would we be less inclined to usually spell that word "oh" if there weren't the possibility of confusion with the numeral zero?

  5. Adam Funk said,

    September 3, 2014 @ 11:06 am

    @wally, @Brett: don't forget Thurber's The Wonderful O!

  6. RW said,

    September 3, 2014 @ 11:29 am

    There are 85 two letter words in the scrabble word list for Spanish, compared to 124 in the English list. 56 of the words occur in both lists.

  7. dw said,

    September 3, 2014 @ 11:44 am

    I would attribute this to English spelling conventions. There are many open monosyllables that, for one reason or another, are spelled with more than two letters.

    For example, of the following 7 monosyllables beginning with /l/, only one has two letters:

    LEE
    LAY
    LA
    LAW
    LOW
    LOO (British English)
    LIE

    Compare French

    LIT
    LAIT
    LE
    LA
    LU
    LOUE (usually a monosyllable nowadays)

    which has three out of six: slightly more efficient.

  8. Piyush said,

    September 3, 2014 @ 11:45 am

    On my Linux Mint system, after eliminating duplicates, I get 52 two letter commands (across all levels of man pages) and 64 two letter words in the word list. However, the latter include such gems as 'rs', 'ow', 'ti', 'ms', 'mi' and 'ma'. So I am not sure that the 10% bound is significant; the actual percentage is likely lower.

  9. Ted Powell said,

    September 3, 2014 @ 11:54 am

    bc is rather more than a "basic calculator" (as described above). It began life as "a preprocessor for dc providing infix notation and a C-like syntax which implements functions and reasonable control structures for programs" (quoting from the Seventh Edition Programmer's Manual). dc is a reverse-polish desk calculator; bc's user interface is considerably more user-friendly, resulting in its becoming known as Better Calculator.

  10. David L said,

    September 3, 2014 @ 12:00 pm

    Surely the number of plausible two-letter words in English is much smaller than 676. There are 5 x 20 x 2 combinations of vowel+consonant or consonant+vowel, excluding y, then another 50 combinations of y with any other letter except itself, for a total of 250. But that includes several that would likely be homophones, e.g. *c/*k/*q, where * is any vowel. So if there are 100 or more legit two-letter words, according to the Scrabble gods, around half of the possible combinations are in use.

    This excludes oddities such as aa and hm, which are Scrabble-permitted, but there are not many like that.
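The count of 250 in the comment above can be checked by enumeration. The following is only a sketch of that counting scheme: the function name `plausible_pairs` is hypothetical, "vowel" here means a, e, i, o, u, and no attempt is made to merge likely homophones.

```python
import string

VOWELS = set("aeiou")
# The 20 consonant letters, treating y separately as the comment does.
CONSONANTS = set(string.ascii_lowercase) - VOWELS - {"y"}


def plausible_pairs():
    """Enumerate vowel+consonant and consonant+vowel pairs, plus pairs with y."""
    pairs = set()
    for v in VOWELS:
        for c in CONSONANTS:
            pairs.add(v + c)  # e.g. "ox"
            pairs.add(c + v)  # e.g. "go"
    # y combined with any of the other 25 letters, in either position.
    for other in set(string.ascii_lowercase) - {"y"}:
        pairs.add("y" + other)
        pairs.add(other + "y")
    return pairs


print(len(plausible_pairs()))
```

Under these assumptions the set has 5 x 20 x 2 + 50 = 250 members, and it leaves out strings such as aa and hm.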

  11. Ted Powell said,

    September 3, 2014 @ 12:10 pm

    @Brett: I don't consider "O" and "oh" to be different spellings of the same word (perhaps because of doing five years of Latin and three of Greek sixty years ago). To me, "O" is for forming the vocative case, as in addressing one's Best Beloved; "oh" is an exclamation. Since the latter can be used to attract a person's attention, some people conflate it with the former.

  12. rosie said,

    September 3, 2014 @ 1:29 pm

    There's a curious rule of English spelling — cited by Vivian Cook in The English Writing System — that every content word must be spelt with at least 3 letters. (I say "content word" to exclude e.g. prepositions and pronouns.) Hence the doubled consonants in the words add, ebb, egg, inn, odd. This doesn't account for ox, but perhaps for the purpose of this rule x counts as two because it represents a sequence of two sounds.

    I'm surprised that the proportion of Unix commands among two-letter strings is that low.

  13. Gunnar H said,

    September 3, 2014 @ 1:31 pm

    In Norwegian (with 29 letters), the fill-rate is about 19% (ca. 159/841), depending on the language variety and the dictionary you use. (I combined several to get adequate coverage.)

    The higher proportion is partly due to umlauts in irregular verbs, e.g. "se, så" ("see, saw") and "le, lo" ("laugh, laughed"), partly to the existence of three additional vowels (which increases the number of possible words significantly), and partly simpler spelling conventions.

  14. Bloix said,

    September 3, 2014 @ 2:07 pm

    Although Prof Pullum is on record as comparing an interest in determining the number of words in English with certain other unnatural obsessions, the answer may have relevance in this case.

    Let's say there are a million different words in English, counting all versions as separate words – plurals, tenses, prefixes and suffixes, etc. That seems to be a generally accepted estimate. Presumably it's correct within an order of magnitude.

    If we could use letter strings of any length, had no concerns about matching orthography to sound or sense, and randomly assigned strings to words, what would be the longest string of letters required?

    1 letter strings: 26^1= 26 combinations
    2 letter strings: 26^2 = 676
    3 letter strings: 26^3 = 17,576
    4 letter strings: 26^4 = 456,976
    5 letter strings: 26^5 = 11,881,376
    6 letter strings: 26^6 = 308,915,776

    If all words were five letters or fewer, we would have about 12 times the number of combinations needed for all the words in the language. If all words were six letters or fewer, we would have over 300 times more combinations than we need to write all words. And for more than six letters, the percentage of strings that we use becomes very small.

    Prof. Pullum informs us that for 2-letter strings, we use about 10% of the available strings. This seems likely to be higher than for any other string length, although perhaps the 3's give the 2's a run for the money.

    So it does appear a reasonable hypothesis that languages in general "waste" at least 90% of the available space in writing for every string length, and generally much more – although I suppose there might be a language that has a lot of very short words.
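The arithmetic in this comment is easy to check. Here is a small sketch; the helper name `strings_up_to` is invented for illustration, and the one-million-word figure is the commenter's estimate, not a measured value.

```python
def strings_up_to(max_len, alphabet_size=26):
    """Number of distinct strings of length 1 through max_len."""
    return sum(alphabet_size ** n for n in range(1, max_len + 1))


ESTIMATED_WORDS = 1_000_000  # the comment's rough estimate for English

for n in range(1, 7):
    exact = 26 ** n                # strings of exactly n letters
    cumulative = strings_up_to(n)  # strings of n letters or fewer
    print(f"{n}: {exact:>13,} {cumulative:>13,} {cumulative / ESTIMATED_WORDS:8.2f}x")
```

The last column is the number of available strings per word, which is the quantity behind the "about 12 times" and "over 300 times" figures.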

  15. Ted Powell said,

    September 3, 2014 @ 2:28 pm

    @Bloix: So it does appear a reasonable hypothesis that languages in general "waste" at least 90% of the available space in writing for every string length …

    A single-letter typo can invert the meaning of a sentence: e.g. "it is now true that", "it is not true that". Perhaps there's a good reason for 90% of the string space consisting of recognizably-invalid sequences.

  16. D.O. said,

    September 3, 2014 @ 2:48 pm

    27 letters (including the white space, but excluding the apostrophe and all other punctuation marks) have maximum entropy of 4.76 bits/letter. Second-letter entropy of an English text is estimated by Shannon as 4.03 bits/letter, which makes 11 letters to be redundant. His estimate for next-letter-entropy-after-a-word is 2.14, which makes necessary the retention of 4.4 letters. Thus, it's about 84% redundancy.

  17. Tim Leonard said,

    September 3, 2014 @ 3:02 pm

    How could the fraction of words stay the same as the string length increases? The number of strings increases exponentially, and that's surely not true for words beyond length 7, say. So I just counted the number of words of each length in the same Unix dictionary (though I didn't merge spellings that differed in capitalization). Here are the counts for lengths 2 through 8: 160, 1420, 5272, 10230, 17706, 23869, 29989. That's roughly linear growth, if you omit length 2, not exponential growth.

    [Good point! You must be right. —GKP]
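A count of this kind might be reproduced along the following lines. This is a sketch, not the commenter's code: the name `length_counts` is hypothetical, the path is an assumption, and the numbers will depend on which word list is installed.

```python
from collections import Counter
from pathlib import Path


def length_counts(words):
    """Count entries by length; spellings differing only in case stay separate."""
    stripped = (w.strip() for w in words)
    return Counter(len(w) for w in stripped if w)


if __name__ == "__main__":
    # Assumed location of the word list; adjust for your system.
    path = Path("/usr/share/dict/words")
    if path.exists():
        lines = path.read_text(encoding="utf-8", errors="replace").splitlines()
        counts = length_counts(lines)
        for n in range(2, 9):
            # Words of length n, and the share of the 26**n possible strings.
            print(n, counts[n], f"{counts[n] / 26 ** n:.2e}")
```

Dividing each count by 26**n shows how quickly the filled fraction falls as the strings get longer.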

  18. J. W. Brewer said,

    September 3, 2014 @ 3:04 pm

    There's probably some way to generate a decent list (combining spelling conventions and phonotactics) of potential "natural" (i.e. not either orthographically or phonologically weird) monosyllabic English words, although they would include lots of four- and five-letter words. Quite a lot of them don't mean anything (and/or are recent coinages, like "spork," which is perfectly cromulent in terms of orthography/phonotactics but had just been lying around unused in inventory until its referent came along and needed a name — you can even find within-my-lifetime coinages drawn from the same drawer of the inventory, i.e., unused monosyllables that rhyme with fork – Mork and SLORC would be proper-name examples). I assume the number of such phonologically/orthographically well-formed strings that are in active use is substantially >10%, but it might still be <50%. English is rather unlike e.g. Mandarin where virtually all phonologically possible monosyllables (a much smaller set due to different phonotactics) are in use and most of them are homophonous.

  19. Eric P Smith said,

    September 3, 2014 @ 3:09 pm

    Like Ted Powell, I like to distinguish between O and oh, though I think the former is not just for contexts answering to Latin or Greek vocative case, but for exclamations of certain types: O praise God, O for the wings of a dove. The distinction seems to be disappearing though, and modern hymn books are quite capable of writing for example "Oh, for a closer walk with God".

  20. J. W. Brewer said,

    September 3, 2014 @ 3:16 pm

    Someone could do a bit of fieldwork asking an appropriately selected sample of The Young People to write out in full the underlying fixed phrase they thought was represented by OMG (or OMFG!!! etc etc), and seeing what ratio of O to Oh you got in the responses.

  21. KeithB said,

    September 3, 2014 @ 3:32 pm

    I would think that two letter unix commands are "older", and less-cryptic commands were desirable when its popularity increased.

  22. Jonathan Mayhew said,

    September 3, 2014 @ 3:32 pm

    Content words of two letters include: am, is, go, ax, ox, ad, do, id… I don't think it's a rule that content words cannot have two letters.

  23. Ethan said,

    September 3, 2014 @ 3:55 pm

    @rosie @jonathan: I don't think the Vivian Cook website claimed that "content words have more than 2 letters" was a "rule". I took it as an observation that in the case of homonyms (so/sew by/buy or/ore) the shorter one was a functional word (whatever that means) rather than a content word. But even that weaker claim falls apart in the case of noun pairs like ad/add, do/due.

  24. Bloix said,

    September 3, 2014 @ 4:04 pm

    And Gunnar H proves that what I said is wrong even before I said it! Blog threads- now with time travel.

  25. leoboiko said,

    September 3, 2014 @ 8:09 pm

    What about Japanese kanji, a character set that's orders of magnitude larger? Quick experiment: I took the set of 2136 characters taught in school (the Jōyō set), plus the 5 that were removed in 2010, for a total of 2141. So there's 2141×2141 = 4,583,881 possible two-character words. I then grepped the EDICT file for two-character entries where both characters are in the set, and eliminated homographs (duplicates). The result: 39963 two-kanji words, that is, about 0.87%.

    But of course: kanji typically represent morphemes, and they don't combine as readily as phonemes – many morpheme combinations don't make sense or have little use (we have "blueberries" and "cranberries", but we don't expect to find "woodberries", "antiberries" or "pirateberries" in a dictionary).

    What about phonetic kana? This is harder to test because many words may be spelt in either kanji or kana, so the conjectured restriction need not apply. Let's test it anyway. In modern spelling there are 70 basic hiragana, giving 70² = 4900 possibilities. Add 33 special digraphs representing onset glides for a total of 4933 possibilities. I looked for them in EDICT, again eliminating duplicates, and found:

     316 (~ 6.4%) words written exclusively in kana;
    1716 (~34.8%) words with kanji spellings;
     398 (8.06%) of them having the EDICT "usually in kana" tag.

    I then tried adding categories 1 and 3 (that is, to estimate words "always or usually" written in kana) and eliminating duplicates; this gave me 655 words, or ~13.28%.

  26. Robert Coren said,

    September 3, 2014 @ 10:28 pm

    Lo, I have beheld a two-letter "content" word that others seem to have overlooked.

  27. Peter said,

    September 4, 2014 @ 4:10 am

    One pressure against the existence of two-letter words is surely the homophony/ambiguity they give rise to?

    Of course, homophony occurs to some extent. But there is presumably some pressure against languages having too much of it; and words of this form would seem to contribute substantially to it.

    It’s like a fuzzier version of the entropy bound on prefix-free codes. When a code (for some distribution of messages) contains short codewords, then they rule out large chunks of the code space: other codewords have to be longer, to avoid ambiguity.

    On a related note, two-letter words aren’t arbitrary single-syllable words, they’re particularly simple ones. They can have either an initial or final consonant, but not both; and with the exception of x, it really is limited to being a single consonant, not a cluster. And similarly, they are limited mostly to single vowels, with afaics by, my… the only diphthong available.

  28. Rodger C said,

    September 4, 2014 @ 6:51 am

    @Piyush: Ow, Ma, you threw out my MS!

  29. flow said,

    September 4, 2014 @ 7:07 am

    i'm doing research on Chinese characters, and according to my data, you need about 500…800 components to adequately describe the 10'000 most commonly used CJK characters in use in China, Taiwan, Japan, and Korea. with 'adequately' i here mean 'in a way that you can describe the composition of a given complex character in terms of recurring, mid-sized chunks', that means without going down to the stroke-level, and without using highly complex characters in the decomposition. example: 握 is analyzed as ⿰扌屋, but 屋 is in itself a composite, so the preference is to say that 握 is ⿰扌⿸尸至, and 'has' 3 components.

    for a total of 9855 characters, i found 3504 glyphs with 2 guides, 3868 glyphs with 3 guides, 1582 glyphs with 4 guides (and a minority of characters with 5 or more). currently, there are 759 components on record, 526 of which also appear as 'stand-alone' characters. the exact figures are highly contentious as the Chinese script does not have a fixed and agreed-upon inventory of letterlike symbols; there is not even agreement on exactly what stroke types there are.

    now, there are 759^n possible characters with n components; the ratios to those actually encountered are 526/759 = 1/1.44, 3504/576081 = 1/164, 3868/437245479 = 1/113041, and 1582/3e11 = 1/213e6. when you go from 10'000 to 88'000 characters, you get a ratio of 1/160 for binary and 1/198424 for ternary compositions, which is slightly better but very far off from any 10%. i guess there may be strong ties between components to or not to co-appear in a single character, but i don't have data on that as yet.
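The ratios in the paragraph above follow directly from the counts given. A brief sketch, using only the figures quoted in the comment; the helper name `one_in` is made up for illustration.

```python
def one_in(observed, inventory, n):
    """Return x such that observed / inventory**n equals 1/x."""
    return inventory ** n / observed


COMPONENTS = 759  # components on record, as quoted in the comment

# (number of components, characters observed with that many components)
for n, observed in [(1, 526), (2, 3504), (3, 3868)]:
    print(f"n={n}: {observed} of {COMPONENTS ** n:,} = 1/{one_in(observed, COMPONENTS, n):,.2f}")
```

Each added component multiplies the space of possible characters by 759, while the number of attested characters changes very little, which is why the ratio collapses so quickly.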

    it does look like CJK characters employ a rather large set of individually shaped constructs to form highly distinct composites where the Latin alphabet recycles a small set of components at the price of having less distinct perceptual units (words). i guess that's the reason we don't write in bits: it's possible but the perceptual recurrence is so high that you can't make out the patterns.

    as for the coverages of *sounds* of Mandarin, it also depends a lot on the particular analysis you do: in any language the theoretical space of phone combinations is much greater than the theoretical space of phonemic combinations. in Chinese, there are some obvious gaps in the syllabary, like there are zhai, chai, and zhei, but no *chei. BUT do you count the missing *tü, *dü? there seems to be a phonotactic rule that ü only combines with n- and l-, also to the exclusion of m-.

    do you count the missing *ki, *king, *kian, *küan? there is a very obvious rule that prohibits dorsovelar onsets with palatal medials/nuclei which has led many (but not all) analysts to posit phonemes like /K/ which are realized as [kʰ] before non-palatal and as [tɕʰ] before other vowels. all of a sudden there is no gap here, as all of /Ka/, /Ki/, /Kɤ/, /Ku/, /Kü/ etc do exist. other solutions are well possible; in fact, one can criticize phonological theory for not coming up with solutions that really cover the gaps even where inconvenient for alphabetic notation (and people like Karlgren have done that). just saying, but to me the missing syllables *iei, *uou, *iai, *uau are not so much gaps, they are a priori excluded (there is a rare beast yuanyai 懸崖, but it has only survived in Taiwan, it seems).

  30. leoboiko said,

    September 4, 2014 @ 8:21 am

    @flow: I'm interested in this topic (hànzì components); if you're publishing anything, could you please send me a link at leoboiko@namakajiri.net ? Also, I once blogged a simple analysis of phonetic components in Japanese kanji; if one of these days you'd like to read it, I'd welcome any criticism.

  31. Piyush said,

    September 4, 2014 @ 10:52 am

    @Rodger C:

    Do you know it cost me 200 Rs., and I had to sell off all my TI stock for it?

    But jokes apart, it does seem that the default dict word list includes a lot of two letter strings that most people would call abbreviations rather than words.

  32. BZ said,

    September 4, 2014 @ 11:45 am

    This got me thinking about one-letter words. In Russian, about 1/3 of all available letters are words with a good chunk of consonants, while in English I can think of 3 (A, I, O), which is 12%. I wonder why this would be. I thought it might be due to some sort of aversion to consonants as words in English, but the sample size is too low to test this.

    As for two-letter words in English, I can think of some that are double consonants, although their wordhood might be debated: mm (variant of umm?), ss (hissing sound), and zz (sleeping)

  33. maidhc said,

    September 4, 2014 @ 1:46 pm

    There are a number of two-letter personal names: Al, Sy, Vi, Jo, Mo.

    I don't know if it counts as English, but there are quite a few two-letter Asian personal names that we use in English: Oh, Li, Ly, Ng, Ma, etc.

  34. Robert said,

    September 4, 2014 @ 6:05 pm

    @Wally:
    >> why n > 1? It seems true for n=1 in English. A and I are used, but E and U and maybe O are also wasted,

    Who says one-letter words must consist of a single vowel? Polish has w, meaning in, which I think is the cognate of в in Russian. Russian also has к and a few others.

  35. John Coleman said,

    September 4, 2014 @ 6:09 pm

    Many years ago I enumerated all phonotactically legal English monosyllables. The proportion of them that are attested words was similarly very low, somewhat surprisingly. (We briefly considered registering all the others as trademarks!)

  36. Doug said,

    September 4, 2014 @ 7:11 pm

    Cass said:

    "I was just reading about the incidence of these morphemes in Polynesian languages (I think it was ( C )V rather than strictly two-letter words, but Polynesian languages generally only allow syllables of type V or CV) and it was much higher than the numbers you cite"

    I'm tempted to hypothesize a correlation: a language with very simple syllable structure, and thus very few allowable short words, can't afford to "waste" as many phonologically possible words. (Since if it did, the speakers would have to use inconveniently long words.)

  37. Jerry Friedman said,

    September 4, 2014 @ 10:05 pm

    Jonathan Mayhew: I don't think "am", "is", and "do" are content words. The "rule" does have exceptions, as you point out, but they're very few.

    John Coleman: Ha!

  38. Milan said,

    September 5, 2014 @ 6:39 am

    Re:'O'/'Oh'
    I remember from my high school days (EFL, reading Shakespeare in excerpts) that 'Oh' is supposed to be stressed, while 'O' is unstressed, but that might have been only a convention of the edition we used

  39. Jonathan Mayhew said,

    September 5, 2014 @ 10:00 am

    It's not a "rule." I don't know why you don't think that verbs are content words? The claim I was responding to was that only prepositions and pronouns had two-letter spellings, yet three very common verbs, go, do, and be (and some forms of the verb to be) are digraphs. At best there is a tendency to disambiguate by spelling some words with three letters to distinguish them from homonyms (aw or awe, in or inn, etc…). I think Ethan cleared that up in a comment on Sept. 3.

  40. Jerry Friedman said,

    September 5, 2014 @ 3:11 pm

    Jonathan Mayhew and Ethan: Rosie's version of the claim included "(I say "content word" to exclude e.g. prepositions and pronouns.)" (Emphasis added.) As I understand it, content words are those that are not function words or grammar words in the sense of this Wikipedia article. The article excludes auxiliary verbs from content words. Likewise Vivian Cook's Web site includes be as a grammar word, not a content word.

    I agree that it's only a tendency, but it may be stronger than you thought.

  41. mollymooly said,

    September 9, 2014 @ 7:59 am

    Wikipedia's three letter rule article could do with expansion. And maybe a hyphen.

  42. Ran Ari-Gur said,

    September 13, 2014 @ 10:29 pm

    > Does every language waste (as it were) at least 90 percent of the space available in the length-N sequences of letters or sounds that it uses, possibly for every N > 1?

    At least for the case of "letters", the answer is "no"; I went through a Hebrew–English dictionary I have handy, and checked for an entry for each possible two-letter word. Hebrew has 22 letters, so 22² = 484 letter pairs; I found entries for 240 of them, or 49.6%. (And that approach skips some two-letter words that don't have entries; for example, פי /pi/ "(the) mouth of; my mouth" doesn't have an entry because it's an inflected form of the noun פה /pe/ "mouth". So a fully comprehensive figure would be more than 50%.)
