Language Log

Q. Pheevr's Law

May 17, 2016 @ 6:59 am · Filed by Mark Liberman under Computational linguistics

In a comment on one of yesterday's posts ("Adjectives and Adverbs"), Q. Pheevr wrote:

It's hard to tell with just four speakers to go on, but it looks as if there could be some kind of correlation between the ADV:ADJ ratio and the V:N ratio (as might be expected given that adjectives canonically modify nouns and adverbs canonically modify verbs). Of course, there are all sorts of other factors that could come into this, but to the extent that speakers are choosing between alternatives like "caused prices to increase dramatically" and "caused a dramatic increase in prices," I'd expect some sort of connection between these two ratios.

So since I have a relatively efficient POS tagging script, and an ad hoc collection of texts lying around, I thought I'd devote this morning's Breakfast Experiment™ to checking the idea out.

This turned out to be one of the easiest experiments that I've done — it required a five-line shell script that took about 10 minutes to run, and a five-line R script to make this plot, which suggests that Q. is on to something:

The correlation is r=0.870.

The point in the upper-right corner is the U.S. Constitution. The point in the lower-left corner is Peter Pan.

That plot used the count of "true verbs", i.e. the count of verbs minus the count of forms of to be.
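For readers who want to try something similar, here is a minimal Python sketch of the computation. It is not the shell and R scripts mentioned above, which are not reproduced in this post. It assumes input that is already tagged with Penn Treebank-style tags (`NN*`, `VB*`, `JJ*`, `RB*`). The function names, the `BE_FORMS` list, and the hand-assigned tags on the two example phrasings are illustrative assumptions.

```python
from collections import Counter
from math import sqrt

# Assumption: these are the forms of "to be" left out of the "true verb" count.
# Contracted forms such as "'s" or "'re" are not handled in this sketch.
BE_FORMS = {"be", "am", "is", "are", "was", "were", "been", "being"}

def pos_counts(tagged):
    """Count nouns, true verbs, adjectives and adverbs in (word, tag) pairs."""
    counts = Counter()
    for word, tag in tagged:
        if tag.startswith("NN"):
            counts["noun"] += 1
        elif tag.startswith("VB") and word.lower() not in BE_FORMS:
            counts["true_verb"] += 1
        elif tag.startswith("JJ"):
            counts["adj"] += 1
        elif tag.startswith("RB"):
            counts["adv"] += 1
    return counts

def ratios(counts):
    """Return (noun/true_verb, adj/adv) for one text.

    Raises ZeroDivisionError if the text has no true verbs or no adverbs,
    which is only likely for very short inputs.
    """
    return counts["noun"] / counts["true_verb"], counts["adj"] / counts["adv"]

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# The two phrasings from Q. Pheevr's comment, tagged by hand for illustration.
verby = [("caused", "VBD"), ("prices", "NNS"), ("to", "TO"),
         ("increase", "VB"), ("dramatically", "RB")]
nouny = [("caused", "VBD"), ("a", "DT"), ("dramatic", "JJ"),
         ("increase", "NN"), ("in", "IN"), ("prices", "NNS")]
print(dict(pos_counts(verby)))
print(dict(pos_counts(nouny)))
```

The verb-centered phrasing contributes one noun, two true verbs and one adverb; the noun-centered one contributes two nouns, one true verb and one adjective. Applying `ratios` to each work's counts and passing the two resulting lists to `pearson_r` gives the kind of correlation reported above.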

If we do the same thing including the to be counts, we get this, with a slightly lower correlation of r=0.853:

Adding the four politicians to the first plot, we get this (Trump=T, Clinton=C, Sanders=S, Cruz=Z):

If you're curious about the rest of the distribution, I've linked to the list of works sorted by Noun/TrueVerb ratios, and alternatively sorted by Adj/Adv ratios. And the (slightly massaged) output of my POS-counting script is here.

As far as I know, this relationship has not previously been noted, so I tentatively name it Q. Pheevr's Law.

Overall, this makes me suspect that there are some interesting stylistic dimensions lurking in the distributions of word types, including simple semantic categories as well as parts of speech, along the lines of Biber 1991, but offering a finer-grained view of prose styles. But breakfast is over, and duty calls…


17 Comments

Guy said,

May 17, 2016 @ 7:21 am

There's something very satisfying about seeing a correlation that clean show up in a chart. What's impressive is that even an extreme outlier like the Constitution still seems to obey the linear relationship quite well, suggesting that there's very little going on in one of these measures that isn't in the other.

Off topic, but I'd guess that another measure on which the Constitution is an outlier is the shall/be ratio, which appears to exceed one at least among the first few paragraphs.
cs said,

May 17, 2016 @ 8:00 am

I looked at the lists to see what were the outliers. The noun-adverb-heavy outlier at around (0.50,1.4) is apparently On Bullshit. The verb-adjective-heavy outlier at around (0.58,1.0) is apparently Dorian Gray.
cs said,

May 17, 2016 @ 8:05 am

Correction: not On Bullshit, but Pepys's Diary.
D.O. said,

May 17, 2016 @ 8:42 am

Is there anything to the observation that all four politicians are on the lower side of the apparent line? Would it mean that in spoken (informal) language adjectives are more prevalent over adverbs than nouns over (true) verbs? There is more than one way to view the a/b=c/d hypothesis, the alternative being a/c=b/d. That shouldn't affect the correlation (not much), but it might be more easily interpreted as "correlation because grammar".
languagehat said,

May 17, 2016 @ 8:42 am

Though Pepys would have been pleased by On Bullshit, I suspect.

[(myl) Agreed. The three apparent outliers in that segment of the plot are:
```
             Adj/Adv    N/TV
BleedingEdge  1.150    1.551
OnBullShit    1.055    1.679
PepysDiary    0.984    1.554
```
Pepys would probably also have liked Pynchon, and Bleeding Edge in particular.]
Mark Meckes said,

May 17, 2016 @ 9:00 am

The Constitution looks like a major outlier, not in terms of the correlation between the two ratios, but in terms of their actual values. But looking at the list of texts, it's natural to guess that it's because the Constitution is rather an outlier there in terms of the purpose of the text. I wonder whether other English-language legal texts would be similarly noun- and adjective-heavy.
D.O. said,

May 17, 2016 @ 9:21 am

I think I was wrong. If the main effect behind the correlation is that adjectives are attached to nouns and adverbs are attached to verbs, then adj/noun and adv/verb should be largely uncorrelated, but very stable independently of the proportion of nouns and verbs. So maybe in this case the correlation is just a distraction, and we should look directly at adj/noun vs. % noun and adv/verb vs. % verb….

[(myl) Not much signal there:

]
Q. Pheevr said,

May 17, 2016 @ 11:24 am

Well, this is very exciting! I'd be very surprised if this hasn't been noticed before, so I don't expect the name to catch on, but in any case, it's nice to see how well the correlation holds up. Thanks for taking the time to check it out!
Rubrick said,

May 17, 2016 @ 3:52 pm

The point in the upper-right corner is the U.S. Constitution. The point in the lower-left corner is Peter Pan.

Taken out of context, my favorite quote so far today.
Brett said,

May 17, 2016 @ 4:55 pm

@Rubrick: Second point to the lower left, and straight on til morning.
Jonathon Owen said,

May 17, 2016 @ 8:50 pm

I wonder what this means for Strunk and White's advice to "write with nouns and verbs, not with adjectives and adverbs" (other than, you know, the fact that it's terrible advice).
Jerry Friedman said,

May 17, 2016 @ 11:56 pm

Jonathon Owen: An interesting longer-than-breakfast experiment would be to correlate adverb and adjective frequency with readers' judgements of how good a text was, or quiz results showing how well they remembered it or to what degree it affected their opinions.
D.O. said,

May 18, 2016 @ 8:02 am

Prof. Liberman, thank you for the graphs!

So, if adj/noun ~ 0.37+/-0.05 (roughly from the graph) and adv/verb ~ 0.35+/-0.05, and both ratios do not depend on the counts of verbs and nouns, then (adj/noun)/(adv/verb) = (adj/adv)/(noun/verb) ~ 1.06+/-0.2, which is about what you've got. In other words, there is nothing special about this four-count correlation.

[(myl) Makes sense — but the relations are not nearly as impressive:

The correlations are r=0.526 and r=0.545.]
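The arithmetic in D.O.'s comment above can be checked with the usual error-propagation rule for a quotient, which adds relative errors in quadrature. The sketch below is only an illustration: it assumes the two +/-0.05 figures are independent standard deviations, which the comment does not state, and the function name is hypothetical.

```python
from math import sqrt

def ratio_with_error(a, da, b, db):
    """Return (a/b, uncertainty of a/b).

    Relative errors are combined in quadrature, which assumes that the
    errors on a and b are independent.
    """
    q = a / b
    dq = q * sqrt((da / a) ** 2 + (db / b) ** 2)
    return q, dq

# adj/noun ~ 0.37 +/- 0.05 and adv/verb ~ 0.35 +/- 0.05, as read off the graph.
q, dq = ratio_with_error(0.37, 0.05, 0.35, 0.05)
print(f"{q:.2f} +/- {dq:.2f}")  # prints "1.06 +/- 0.21"
```

Under that independence assumption the result agrees with the 1.06+/-0.2 quoted in the comment.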
D.O. said,

May 19, 2016 @ 10:33 am

Prof. Liberman, it is an impressive relation. The "null hypothesis" is that the ratios adv/verb and adj/noun are stable quantities, not very much related to anything else. But your last graphs show that there is a correlation between these 2 ratios, which we can stereotypically call "floweriness": someone writing in a more embellished style (once again, maybe it's just a prejudice and the underlying reasons are completely different) tends to increase the relative numbers of both adjectives and adverbs. The correlation is not very strong, but there is something to it.

Basically, your original graphs show that there are more "nouny" texts and more "verby" ones and adjectives and adverbs probably just march along, but with a few more graphs we see that there is that, but also a bit of something else.
Bob Ladd said,

May 19, 2016 @ 11:06 am

I agree with D.O.: the added graphs suggest that "floweriness" (or perhaps what we might call more boringly the "modification index") is a property of texts that is at least partially independent of their nouniness or verbiness. I also agree with Jerry Friedman that it would be interesting to see if there's any correlation between readers' ratings of written style and any of these indices.

It would probably be harder to assess automatically, but is there any easy way to distinguish whether the adverbs are being used to modify verbs or adjectives (in phrases like extremely interesting data, etc.)?
D.O. said,

May 19, 2016 @ 12:55 pm

Let me try to put this discussion in a more academic-looking frame. We have 4 types of counts (nouns, verbs, adjectives, adverbs). Together they make up about 60% of the words in a typical text. That suggests there are 2-3 other significant part-of-speech chunks, which probably allows us not to normalize the total percentages of these groups.

Now, we may try to do the PCA with 5 components (nouns, verbs, adjectives, adverbs, other), obviously adding up to 1, or with 4 components without centering. The prediction is that the first two components will look like (nouns-verbs+adjectives-adverbs) and (nouns+verbs-adjectives-adverbs). Judging from just 4 points in the third graph of the OP, it would also be interesting to see whether the "adverbs die first when spoken" hypothesis has any merit, if we can get a decent corpus of spoken language. In fact, if I can just induce COCA to give me overall POS counts in its various subcorpora, or at least convince myself that the first hundred most frequent words are a good enough approximation, I'll look into it later.

[(myl) I'll be interested to see what you come up with. For my part, I'm about to head off to Slovenia for LREC2016, and if I have any spare time while I'm there I need to finish an ICA2016 paper on Spanish phonetics from audiobooks, and then I have three overdue papers on other topics, but at some point in the next few weeks I'll try to compile a larger-scale POS table from a few hundred works of different types, and see what PCA, CCA, and/or other dimensionality-reduction methods come up with…]
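A rough sketch of the first step of the PCA that D.O. describes above, for anyone who wants to try it on their own POS table. It is not anyone's actual analysis: it uses plain power iteration on the covariance matrix in place of a library PCA routine, it centers the columns (the five-column variant), and it returns only the first component. The function name and the input layout (one list of proportions per text, in the order noun, verb, adjective, adverb, other) are assumptions.

```python
from math import sqrt

def first_principal_component(rows, iterations=500):
    """Approximate the first principal component of `rows` by power iteration.

    `rows` holds one list of POS proportions per text. Columns are centered
    before the covariance matrix is computed.
    """
    n, k = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(k)]
    centered = [[row[j] - means[j] for j in range(k)] for row in rows]
    # Sample covariance matrix (k x k).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(k)]
           for i in range(k)]
    # Arbitrary non-uniform start vector; power iteration fails if it happens
    # to be exactly orthogonal to the leading component.
    v = [1.0 + 0.1 * j for j in range(k)]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(k)) for i in range(k)]
        norm = sqrt(sum(x * x for x in w))
        if norm == 0.0:  # no variance in the data: nothing to extract
            break
        v = [x / norm for x in w]
    # Fix the sign so that the first loading is non-negative.
    if v[0] < 0:
        v = [-x for x in v]
    return v
```

If the prediction holds, the loadings returned for a real table should have the same sign for nouns and adjectives, and the opposite sign for verbs and adverbs.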
D.O. said,

May 22, 2016 @ 3:49 pm

I looked at the Brown corpus, and the results are more or less as expected, with 3 notable observations. Its fiction samples are much heavier on true verbs (no be, do, have or modals) than the rest. The fiction part is also less diverse in, roughly, (adj+adv)/(noun+verb) ratios. And the proportion of adverbs doesn't seem very informative, in the sense that the second most dispersed dimension, roughly speaking, measures the relative percentage of adjectives. I'll try to post more precise results later (maybe a few days later).