Language Log

In favor of the microlex

November 14, 2012 @ 6:44 am · Filed by Mark Liberman under Computational linguistics

Bruce Schneier quotes Stubborn Mule citing R.A. Howard:

Shopping for coffee you would not ask for 0.00025 tons (unless you were naturally irritating), you would ask for 250 grams. In the same way, talking about a 1/125,000 or 0.000008 risk of death associated with a hang-gliding flight is rather awkward. With that in mind, Howard coined the term “microprobability” (μp) to refer to an event with a chance of 1 in 1 million and a 1 in 1 million chance of death he calls a “micromort” (μmt). We can now describe the risk of hang-gliding as 8 micromorts and you would have to drive around 3,000km in a car before accumulating a risk of 8μmt, which helps compare these two remote risks.

This reminds me of the Google Ngram Viewer's habit of citing word frequencies as percentages, with uninterpretably large numbers of leading zeros after the decimal point:

Talk about "naturally irritating" — asking someone to bag up 0.00025 tons of coffee for you is a shining beacon of cooperative communication, compared to telling some random internet pilgrim that the word "uncommunicative" had a frequency of 0.0000173478% in books published in 1921.

Ever since linguists started counting the frequency of words and phrases in text collections, they've been citing such frequencies in terms of the convenient and easily-interpretable unit of occurrences per million words. In this case,

0.0000173478% = 0.000000173478
0.000000173478*1000000 = 0.173478 per million words

Compare

… from which we conclude that in books published in 1921, "communication" had a frequency of

0.0031511613% = 0.000031511613
0.000031511613*1000000 = 31.51161 per million words

Removing some of the probably-meaningless extra digits, we get the fact that in books published in 1921, "uncommunicative" had a frequency of about 0.17 per million words, while "communication" had a frequency of about 31.5 per million words.

You can compare approximations to these quantities easily in your head — 30 is about 150 times greater than 0.2 — and you can easily transfer the numbers to a calculator, without obnoxious and error-prone counting of leading zeros, to learn that 185 is a more accurate ratio. Easy, right?
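For anyone who would rather let a machine do the zero-counting, here is a minimal Python sketch of the conversion above. The function name `percent_to_per_million` is made up for illustration; the two percentages are the 1921 figures quoted in the post. Note that the ratio of the rounded values (31.5 and 0.17) comes out near 185, while the unrounded values give roughly 182.

```python
def percent_to_per_million(percent: float) -> float:
    """Convert a percentage (as displayed by the Ngram Viewer) to occurrences per million words."""
    # percent / 100 gives a proportion; multiplying by 1,000,000 gives per-million.
    return percent * 10_000


uncommunicative = percent_to_per_million(0.0000173478)
communication = percent_to_per_million(0.0031511613)

print(f'"uncommunicative": {uncommunicative:.2f} per million words')
print(f'"communication":   {communication:.1f} per million words')

# Ratio from the rounded figures used in the text, and from the unrounded ones.
print(f"ratio (rounded inputs):   {31.5 / 0.17:.0f}")
print(f"ratio (unrounded inputs): {communication / uncommunicative:.0f}")
```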

Come on, guys, enough nerdview. Present the results in a form that human beings can understand.

[The unit designations "N per million words" (or "N/MW") are clear enough, so the idea of introducing the "microlex" is just a convenient headline. Though it might help that μl is only two characters long and seems kind of science-y — it worked for the chemists. And you could say that "uncommunicative" had a frequency of 170 nanolex in 1921. And "the", which had a frequency of 5.795557836% in 1921, would come in at 58 millilex (ml)…

Seriously, what I'm really recommending is "N per million words" and its relatives, such as "per thousand words", "per billion words", and so on.]
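If the tongue-in-cheek units were ever taken seriously, picking the right prefix could be automated. The following is only a sketch: `format_lex` and the `PREFIXES` table are hypothetical names, and the "lex" is assumed to be a plain proportion between 0 and 1, as the examples above suggest.

```python
# Prefix table, largest first. "lex" is the joke unit from the post,
# taken here to mean a proportion between 0 and 1.
PREFIXES = [(1e-3, "millilex"), (1e-6, "microlex"), (1e-9, "nanolex")]


def format_lex(proportion: float) -> str:
    """Render a proportion using the largest prefix that keeps the number at 1 or above."""
    for scale, name in PREFIXES:
        if proportion >= scale:
            return f"{proportion / scale:.0f} {name}"
    # Anything rarer than 1 nanolex is still reported in nanolex.
    return f"{proportion / 1e-9:.3g} nanolex"


print(format_lex(0.05795557836))    # "the" in 1921
print(format_lex(0.000031511613))   # "communication" in 1921
print(format_lex(0.000000173478))   # "uncommunicative" in 1921
```

With the unrounded 1921 figure, "uncommunicative" prints as 173 nanolex, which is the "170 nanolex" above before rounding to two significant digits.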

27 Comments

Avinor said,

November 14, 2012 @ 7:10 am

Nerdview would be writing "1.73478e-7". Actually, I suspect this is a failed attempt to adapt to non-technical users. Somebody thought: "Ordinary people want percentages."
Eric TF Bat said,

November 14, 2012 @ 7:13 am

The microlex is a contrary unit for one extra reason: because the lex itself is largely useless. SI base units are usually useful — the gram, the metre and so on, all are sensible, human-manageable quantities of whatever they measure. What word has a frequency of 1 lex? That is, what word is used 100% of the time? I can only think of one example in one particular instance: the word Malkovich, during the scene in Being John Malkovich when the actor uses the portal to go inside his own head. (My apologies for the impenetrability of that description to anyone who hasn't seen that movie.) So: frequency of the word "Malkovich" = 1 lex. Given that this is the only time that value is valid, perhaps the lex should be renamed the malkovich.

[(myl) In fact, some animal communication elements arguably have a frequency not much below 1 lex:

You could regard this as being the communicative (and entropic) equivalent of absolute zero in temperature.

Alternatively, we could name the microlex after some corpus-linguistics pioneer(s), like Kucera or Francis. Then a frequency of 1 per million words would be (say) 1 kucera; 1 per 10 million words would be 1 deci-kucera; etc. But frankly, I think that N per hundred/thousand/million/billion words is simple and clear enough…]
Avinor said,

November 14, 2012 @ 7:17 am

I'd call it the buffalo.
Jukka Kohonen said,

November 14, 2012 @ 7:24 am

Why reinvent the wheel? "Parts per million" (ppm) already exists.

"In books published in 1921, 'communication' had a frequency of 31 ppm."

[(myl) "Parts per million" seems most appropriate for continuously-measurable quantities, rather than for things that occur as discrete events. Maybe "instances per million" (ipm) would be more appropriate.]
Andy Averill said,

November 14, 2012 @ 7:46 am

Still wondering whether Google Books is an acceptable corpus for heavy-duty linguistics research. Their materials before about 1930 are particularly unreliable due to numerous OCR errors and the like. Not to mention that they seem to have a disproportionately large number of copies of magazines like Boy's Life and Popular Mechanics.

Project Gutenberg, a much better curated collection, also has a search engine (called Anacleto), but unfortunately the results don't seem to be sortable by date. "Uncommunicative" yields hits from 431 books, which are almost all, presumably, in the public domain.
janwo said,

November 14, 2012 @ 8:38 am

Would you mind if I enter these terms, credited to you, into glottopedia?
Rohan F said,

November 14, 2012 @ 9:31 am

@Eric TF Bat:

"The microlex is a contrary unit for one extra reason: because the lex itself is largely useless."

That's not without precedent, though. In genetics, the distance between two genes on a chromosome is measured in centimorgans, each of which is one hundredth of a morgan. A centimorgan is defined as the distance between two genes for which one meiotic product in 100 undergoes a dividing recombinant crossover event within that distance. In practice, though, one can only measure up to 50 centimorgans, because genes at opposite ends of a chromosome have a one in two chance of being divided by crossover: either the number of crossovers in the intervening space is even, in which case they will wind up on the same chromosome and not be divided, or it is odd, in which case they will be.
Cameron said,

November 14, 2012 @ 9:35 am

The microlex reminds me of the millihelen, the practical unit of beauty. A millihelen is the amount of beauty required to launch one ship.

[(myl) We should not forget, in this context, the Lenat:

The unit of bogosity, derived from the fictional field of Quantum Bogodynamics. The Lenat is seldom used, as it is understood that it is too large for normal conversation. Its most common form is the microlenat.

A similar joke used to be current, involving a different name associated with the international unit of insincerity, but I believe that it has been withdrawn.]
Circe said,

November 14, 2012 @ 9:50 am

I think the undisputed leader among units which are only useful in their diminutive forms is the standard international unit for magnetic field: the tesla (T). From Wikipedia:

100.75 T: "strongest (pulsed) magnetic field yet obtained non-destructively in a laboratory" (National High Magnetic Field Laboratory, Los Alamos National Laboratory, USA).

730T: "strongest pulsed magnetic field yet obtained in a laboratory, destroying the used equipment, but not the laboratory itself" (Institute for Solid State Physics, Tokyo).

(emphasis mine)

Surely that last qualification ("destroying the used equipment but not the laboratory itself") intrigues you? Wikipedia comes to our aid again:

2.8kT: "strongest (pulsed) magnetic field ever obtained (with explosives) in a laboratory (VNIIEF in Sarov, Russia, 1998)."

I wonder what is the highest lexical density (in µl) that can be obtained in the laboratory in the three scenarios above: non-destructive, destroying the equipment but not the laboratory, and destroying the laboratory itself.

[(myl) Because 0 Lex and 1 Lex are limiting values, since these are really just measures of proportion, you'd want to use some sort of transform, like the logit, in your pursuit of records. And I don't think that explosives would be helpful, unfortunately.]
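For readers unfamiliar with the logit, here is a minimal sketch. It maps a proportion strictly between 0 and 1 onto the whole real line, so there is no ceiling at 1 or floor at 0 to bump into. The function below is a plain textbook definition, not anything specific to the Ngram Viewer.

```python
import math


def logit(p: float) -> float:
    """Log-odds of a proportion p, defined only for 0 < p < 1."""
    if not 0 < p < 1:
        raise ValueError("logit is only defined for 0 < p < 1")
    return math.log(p / (1 - p))


# The 1921 proportions quoted in the post, as plain proportions.
for word, p in [("the", 0.05795557836),
                ("communication", 0.000031511613),
                ("uncommunicative", 0.000000173478)]:
    print(f"{word}: logit = {logit(p):.2f}")
```

A proportion of exactly one half has a logit of 0; values closer to 0 or 1 run off toward minus or plus infinity, which is what makes the transform suitable for record-chasing at either extreme.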
Henning Makholm said,

November 14, 2012 @ 10:46 am

How about measuring rarity instead — as the negative logarithm of frequency? We could call it "pLex".
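A quick sketch of what such a "pLex" might look like, by analogy with pH. The base of the logarithm is not specified above, so base 10 is an assumption here, and `p_lex` is a made-up name.

```python
import math


def p_lex(proportion: float) -> float:
    """Rarity as the negative base-10 logarithm of a word's proportion (base 10 is an assumption)."""
    return -math.log10(proportion)


print(f'"communication":   pLex {p_lex(0.000031511613):.2f}')
print(f'"uncommunicative": pLex {p_lex(0.000000173478):.2f}')
```

On this scale a word occurring once per million words has a pLex of 6, and each additional unit means ten times rarer.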
Adam said,

November 14, 2012 @ 11:57 am

MYL wrote "In fact, some animal communication elements arguably have a frequency not much below 1 lex"

It's been a while, but you forgot the Chicken language!

[(myl) Nice catch.]
Boyang said,

November 14, 2012 @ 12:10 pm

0.000000173478 = 0.000000173478*1000000 ???
Really?
MattF said,

November 14, 2012 @ 12:24 pm

Still, technology can change things– when I was a lad, a 1 Farad capacitor was a room-sized object that one tiptoed around, just in case a bit of the stored charge decided to go rogue. But in the modern world of ultracapacitors, you can buy a multi kiloFarad item that fits in your pocket for under $100. And it won't always be on the verge of exploding.

[(myl) Yes, and some TED talks have been estimated to produce brief bogosity bursts in the kiloLenat range…]
D.O. said,

November 14, 2012 @ 2:22 pm

Results larger than 1 lex are certainly possible. If 1 lex means once per word, then some letters would have frequencies of more than 1 lex in specially constructed texts (heck, I know a couple of Russian poems, all of questionable aesthetic merit of course, with each word beginning with the same letter). If letters are not good enough, we can go for the number of vertical strokes in texts set in some typeface.
The other Mark P said,

November 14, 2012 @ 4:25 pm

= 0.173478 per million words

Six significant figures? For that last digit to be accurate the corpus would need to be accurate to the last word for every million-million. Even if the word count is in the quadrillion category, I doubt the count is that accurate.

= 0.17 pm is the correct statistic, as it doesn't imply a totally unfounded level of accuracy.

————————————————–

I sometimes try to introduce the concept of long distances being measured in megametres (Mm) but it never seems to catch on, despite being pure metric logic. A long drive seems much more manageable if it is only 0.8 Mm!
Joe Fineman said,

November 14, 2012 @ 4:38 pm

Similarly, no-one (that I know of) would refer to a factor of 10 in power as a bel; it is always 10 dB.
word of the day: bogosity « Dadge said,

November 14, 2012 @ 4:55 pm

[…] Today's Language Log post about Google's n-gram viewer posits the term "microlex", which reminded a commenter of the milliHelen, which reminded the author of the Lenat, which is, of course, the international unit of bogosity.[…]
Ken Brown said,

November 14, 2012 @ 6:04 pm

There is a notorious scene in the first episode of "The Wire" that approaches one lex. OK, it's probably nearer 0.3 if you count variant forms of the same word as different.
Ken Brown said,

November 14, 2012 @ 6:05 pm

Nuts. I meant first series, not first episode.
Jonathan D said,

November 15, 2012 @ 12:31 am

The other Mark P, I was reenergised in my use of Mm when my brother used "K's" to refer to thousands of kilometres, despite "K's" being the standard term here for plain kilometres.
Jon Orwant said,

November 15, 2012 @ 10:17 am

I'm responsible for the nerdview percentages in the Ngram Viewer, and I agree with Mark. I don't know when I'll get to it (certainly not in 2012), but I'll make this change.

The words microlex, nanolex, etc. are tempting…
Erez Lieberman Aiden said,

November 15, 2012 @ 11:40 am

Mark: you're right about everything, of course, but you failed to cite the milliDarwin!

http://www.sciencemag.org/site/feature/misc/webfeat/gonzoscientist/episode14/index.xhtml
Anonymous said,

November 15, 2012 @ 4:49 pm

Mark,

Parts per million doesn't only refer to continuous distributions. It is already widely used in manufacturing to refer to the incidence of defects and the like.
Dan M. said,

November 15, 2012 @ 8:55 pm

Am I the only one who immediately thought of using a log scale? And then given that all the measurements are small fractions of a natural maximum, it seems natural to think of them as an attenuation, so decibels are the obvious unit to use.

Uncommunicative has an occurrence of -225dB, while communication has -150dB. Obviously, those are 75dB apart, which is 1.4 times 128, or 180. If you want to think in parts per million, you just add 200 to the raw dB values.
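The figures in this comment can be checked with a short sketch. One caveat, which is an inference from the numbers rather than anything stated above: -225 and -150 match ten times the base-2 logarithm of the 1921 proportions for "uncommunicative" and "communication", whereas conventional decibels (ten times the base-10 logarithm) would give about -68 and -45. Both are computed below; the function names are made up for illustration.

```python
import math

UNCOMMUNICATIVE = 0.000000173478  # 1921 proportion from the post
COMMUNICATION = 0.000031511613    # 1921 proportion from the post


def decibels(proportion: float) -> float:
    """Conventional decibels: 10 * log10 of a ratio."""
    return 10 * math.log10(proportion)


def base2_tenths(proportion: float) -> float:
    """Assumed scale behind the quoted figures: 10 * log2, so 10 units is one doubling."""
    return 10 * math.log2(proportion)


for word, p in [("uncommunicative", UNCOMMUNICATIVE), ("communication", COMMUNICATION)]:
    print(f"{word}: {decibels(p):.1f} dB (base 10), {base2_tenths(p):.1f} (base 2)")

# A gap of 75 base-2 units is 7.5 doublings: 2**7 = 128, times sqrt(2), about 1.4.
gap = base2_tenths(COMMUNICATION) - base2_tenths(UNCOMMUNICATIVE)
print(f"gap: {gap:.0f} units, i.e. a ratio of about {2 ** (gap / 10):.0f}")
```

On the base-2 reading, "add 200" works because 2 to the 20th power is 1,048,576, close enough to a million for mental arithmetic.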
Matt McIrvin said,

November 16, 2012 @ 10:55 am

Scientific notation is how scientists and engineers dealt with the uninterpretable-string-of-zeroes problem long, long ago, and in principle it removes the need even for most SI prefixes.

Unfortunately, it seems to have been deemed threatening to laypeople.
Sili said,

November 18, 2012 @ 8:53 am

Am I the only one who immediately thought of using a log scale? And then given that all the measurements are small fractions of a natural maximum, it seems natural to think of them as an attenuation, so decibels are the obvious unit to use.

Uncommunicative has an occurrence of -225dB, while communication has -150dB. Obviously, those are 75dB apart, which is 1.4 times 128, or 180. If you want to think in parts per million, you just add 200 to the raw dB values.

Yes. You're nerdviewing.

I think the undisputed leader among units which are only useful in their diminutive forms is the standard international unit for magnetic field: the tesla (T). From Wikipedia:

I think the farad and the coulomb are better bets. Most capacitors in daily use are in the μF range.
Tim J said,

November 25, 2012 @ 10:12 pm

My favourite nano-unit is the nanolightsecond. This is of course the distance travelled by light in one nanosecond. It's about 0.2998 metres, which comes to 0.9836 ft (11.80 inches).

I propose we adopt this unit for everyday use. It should be called the astronomical foot.
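The arithmetic behind the "astronomical foot" is easy to verify. This sketch uses the defined values of the speed of light (299,792,458 m/s) and the international foot (0.3048 m); the variable names are only illustrative.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458  # exact by definition of the metre
METRES_PER_FOOT = 0.3048              # exact by definition of the international foot

# Distance light travels in one nanosecond.
nanolightsecond_m = SPEED_OF_LIGHT_M_PER_S * 1e-9
nanolightsecond_ft = nanolightsecond_m / METRES_PER_FOOT
nanolightsecond_in = nanolightsecond_ft * 12

print(f"{nanolightsecond_m:.4f} m")
print(f"{nanolightsecond_ft:.4f} ft")
print(f"{nanolightsecond_in:.2f} in")
```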
