Language Log

Nth Xest

September 1, 2014 @ 6:03 pm · Filed by Mark Liberman under Computational linguistics

In the course of writing about the "fourth highest of five levels", I looked around at how the pattern "Nth Xest" is used in general. I found that uses of such expressions overwhelmingly count from the "top" where X names a top-oriented scale (high, big, long, etc.), and count from the "bottom" where X names a bottom-oriented scale (low, small, short, etc.). In other words, unsurprisingly, "Nth Xest" normally counts (up or down) from whatever end of the scale "Xest" names.

Another (less logically necessary but still unsurprising) thing I noticed is that top-oriented counts are always a lot bigger than the corresponding bottom-oriented counts, and that counts decrease almost proportionately as N increases. Thus from Google Books ngrams:

	second	third	fourth	fifth	sixth
highest	34447	9692	3148	1411	784
lowest	6006	1455	491	293	138

The numbers from COCA are pretty much in proportion, though lower:

	second	third	fourth	fifth	sixth
highest	305	95	33	23	12
lowest	55	9	4	3	2

Here are the Google Books counts for a larger set of values of X (values of 0 generally reflect cases where the count didn't reach the threshold of 40 required for retention of ngram counts):

	second	third	fourth	fifth	sixth
highest	34447	9692	3148	1411	784
lowest	6006	1455	491	293	138
biggest	6001	1402	608	264	156
largest	124598	50022	20712	10595	6246
greatest	8333	1762	423	209	162
smallest	2703	605	200	92	49
most	114727	28723	8192	4028	2163
least	988	302	57	58	0
best	55695	7009	2337	649	426
worst	2417	501	142	95	0
oldest	14955	3041	661	202	128
youngest	2772	454	92	0	0
longest	3739	1660	713	412	171
strongest	3087	735	151	46	45
richest	1486	683	228	136	91
poorest	598	196	82	82	0

Adding them all up column-wise:

The left-hand figure below plots the counts on a log scale. On the right, I've normalized the top-oriented and bottom-oriented counts by the count for "second Xest":
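The column sums and the normalization by the "second Xest" count can be recomputed from the Google Books table above. What follows is a minimal sketch in Python, not the code behind the original figures. The assignment of each word to the top-oriented or bottom-oriented group (the `BOTTOM` set) is an assumption made here for illustration, and the helper names `column_sums` and `normalize` are hypothetical.

```python
# Google Books counts for "Nth Xest", copied from the table above.
ORDINALS = ["second", "third", "fourth", "fifth", "sixth"]

COUNTS = {
    "highest":   [34447, 9692, 3148, 1411, 784],
    "lowest":    [6006, 1455, 491, 293, 138],
    "biggest":   [6001, 1402, 608, 264, 156],
    "largest":   [124598, 50022, 20712, 10595, 6246],
    "greatest":  [8333, 1762, 423, 209, 162],
    "smallest":  [2703, 605, 200, 92, 49],
    "most":      [114727, 28723, 8192, 4028, 2163],
    "least":     [988, 302, 57, 58, 0],
    "best":      [55695, 7009, 2337, 649, 426],
    "worst":     [2417, 501, 142, 95, 0],
    "oldest":    [14955, 3041, 661, 202, 128],
    "youngest":  [2772, 454, 92, 0, 0],
    "longest":   [3739, 1660, 713, 412, 171],
    "strongest": [3087, 735, 151, 46, 45],
    "richest":   [1486, 683, 228, 136, 91],
    "poorest":   [598, 196, 82, 82, 0],
}

# ASSUMPTION: this top/bottom split is a guess at the grouping, made for illustration.
BOTTOM = {"lowest", "smallest", "least", "worst", "youngest", "poorest"}


def column_sums(words):
    """Sum the counts for the given words, one total per ordinal."""
    rows = [COUNTS[w] for w in words]
    return [sum(col) for col in zip(*rows)]


def normalize(sums):
    """Divide every total by the total for "second Xest"."""
    return [s / sums[0] for s in sums]


top_words = [w for w in COUNTS if w not in BOTTOM]
bottom_words = [w for w in COUNTS if w in BOTTOM]
top_sums = column_sums(top_words)
bottom_sums = column_sums(bottom_words)

for label, sums in (("top", top_sums), ("bottom", bottom_sums)):
    shares = ", ".join(f"{o}={x:.3f}" for o, x in zip(ORDINALS, normalize(sums)))
    print(f"{label:6s} sums={sums}")
    print(f"{label:6s} normalized: {shares}")
```

Because counts below the retention threshold are recorded as 0, the bottom-oriented sums for "fifth" and "sixth" are somewhat understated in this sketch.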

The same things for COCA counts:

It would be nice if the recently-developed distributional semantics methods could induce patterns of this type — but I don't think that they can do so yet.


1 Comment

D.O. said,

September 2, 2014 @ 11:51 am

Raw counts of the ordinal number words (without any coöccurrences) also show an approximately exponential fall with a somewhat diminishing exponent. Data from Google ngrams averaged for years 2000-2008 (they are really pretty stable over many decades), in words per million:
first	815.7
second	264.4
third	130.5
fourth	36.3
fifth	21.9
sixth	13.3
seventh	10.7
eighth	8.8
ninth	6.5
tenth	7.8

I also included "first", which is not in Prof. Liberman's counts for obvious reasons. Counts for "first" through "fourth" fall with an exponent of 1 (that is, by a factor of e for each subsequent number), quite close to what happens with Nth Xest. So far, excluding the obvious case of the first Xest, there is no evidence that the use of ordinals with rankings is any different from the use of ordinals overall.
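The exponent quoted above can be checked directly from the listed frequencies. This is a rough sketch in Python that assumes a model of the form f(n+1) = f(n) · exp(-k) and estimates k from the endpoints of a range; the function name `mean_decay_exponent` is hypothetical, and an endpoint estimate is only one of several reasonable ways to fit the decay.

```python
import math

# Words per million, copied from the list above.
FREQ = {
    "first": 815.7, "second": 264.4, "third": 130.5, "fourth": 36.3,
    "fifth": 21.9, "sixth": 13.3, "seventh": 10.7, "eighth": 8.8,
    "ninth": 6.5, "tenth": 7.8,
}


def mean_decay_exponent(values):
    """Average k such that each step multiplies the value by exp(-k)."""
    return math.log(values[0] / values[-1]) / (len(values) - 1)


values = list(FREQ.values())

# Per-step exponents: log of the ratio between consecutive ordinals.
step_exponents = [math.log(a / b) for a, b in zip(values, values[1:])]

first_to_fourth = mean_decay_exponent(values[:4])
fourth_to_tenth = mean_decay_exponent(values[3:])

print(f"first..fourth mean exponent: {first_to_fourth:.2f}")
print(f"fourth..tenth mean exponent: {fourth_to_tenth:.2f}")
for name, k in zip(list(FREQ)[1:], step_exponents):
    print(f"step to {name:8s} k = {k:+.2f}")
```

The per-step values are uneven, and the last step is negative because "tenth" is more frequent than "ninth" in these data, so the mean over a range is more informative than any single step.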
