Language Log

More models of binomial order

December 29, 2009 @ 8:17 am · Filed by Mark Liberman under Computational linguistics

Following up on "The order of ancestors" (12/24/2009) and "Sexual orders" (12/27/2009), I need to note one other important recent paper: Sarah Benor and Roger Levy, "The Chicken or the Egg? A Probabilistic Analysis of English Binomials", Language 82(2): 233-278, 2006. And several readers have pointed me to an older tradition of corpus linguistics that comes to a different set of conclusions about binomial ordering: Mishnah Keritot 6:9, etc.

Here's the abstract of the Benor and Levy paper:

Why is it preferable to say salt and pepper over pepper and salt? Based on an analysis of 692 binomial tokens from online corpora, we show that a number of semantic, metrical, and frequency constraints contribute significantly to ordering preferences, overshadowing the phonological factors that have traditionally been considered important. The ordering of binomials exhibits a considerable amount of variation. For example, although principal and interest is the more frequent order, interest and principal also occurs. We consider three frameworks for analysis of this variation: traditional optimality theory, stochastic optimality theory, and logistic regression. Our best models—using logistic regression—predict 79.2% of the binomial tokens and 76.7% of types, and the remainder are predicted as less frequent—but not ungrammatical—variants.

B & L take their examples from a number of tagged corpora, using a method described as follows:

The corpus search was conducted on three tagged corpora: the Switchboard (spoken), Brown (varied genres, written), and Wall Street Journal (WSJ; newspaper) sections of the Penn Treebank III, available from the Linguistic Data Consortium (Marcus et al. 1993).1 These corpora were searched for constructions of N and N, V and V, Adj and Adj, and Adv and Adv, where both X and X were part of the same XP. The search yielded 3,680 distinct binomials. Using the beginnings and ends of each corpus’s search results, we took a total of 411 input binomial TYPES—distinct sets A, B for some binomial sequence A and B—for analysis. This total consisted of 120 nouns, 103 verbs (including gerunds and participals), 118 adjectives, and 70 adverbs. We did not include binomials formed from personal names, because idiosyncratic factors frequently determine the ordering of names in a conjunction (however, we did not exclude the names of political entities such as countries or states). We discarded binomials formed with extender phrases, such as and stuff, as they are not in theory reversible (i.e. politics and everything cannot be everything and politics). For each of these binomials, we noted whether we considered each to be frozen (for example, by and large and north and south are frozen; honest and stupid and slowly and thoughtfully are not). We then searched for all occurrences of each binomial and its reverse in all three corpora, and included all such occurrences in our final corpus, yielding 692 tokens. Like Gustafsson (1976), we found that very few of the binomials occurred more than once in the three corpora. Most of those that did are frozen binomials, such as back and forth, which occurred forty-nine times.

Their technique has several important advantages. For one thing, the use of parsed corpora allows them to avoid apparent binomials like dogs and desserts from the string "…selling hamburgers, hot dogs and desserts", or dogs and columns from the string "a most unique newspaper, one that carries no headlines, photographs of cats and dogs and columns with names like 'The Downieville Dragnet.'". And this approach provides a valid sample of the binomials (common or otherwise) that happen to occur in a chosen chunk of text.

It also has an important disadvantage: the amount of text analyzed is only about three million words. 692 binomial tokens is thus a rate of about 231 per million. This is pretty common — it's about the same frequency as the word America, or the sequence "from a". But their observation that "very few of the [individual] binomials occurred more than once in the three corpora" is both expected, and telling. The nature of LNRE ("large numbers of rare events") distributions guarantees that the resulting sample will present a very noisy picture of the population frequency and the population order statistics for individual binomials. And this guarantee is honored by the facts, as can be seen in the following table, which compares a random selection of their 411 binomial types with counts from some larger corpora:

	B&L	COCA	LDC News
English and Americans	1 0	7 6	10 8
Connecticut and Massachusetts	1 0	15 23	140 190
slowly and thoughtfully	1 0	7 0	3 0
abused and neglected	1 0	86 18	336 57
acute and correct	1 0	0 0	0 0
approved and commended	1 0	0 0	0 0
strawberries and bananas	1 0	2 4	10 9
oranges and grapefruit	1 0	9 8	59 19
warm and fuzzy	1 0	154 5	1121 6
fruits and nuts	1 0	54 14	192 27
T-ball and soccer	2 0	1 2	2 2
pinks and greens	2 0	13 1	18 10
gold and silver	4 0	428 165	3287 548
principal and interest	5 2	55 33	980 787

(In each cell, the first number is the count for the cited order of the binomial, and the second number is the count for the reversed order.)
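To make the comparison easier to read, here is a minimal Python sketch that turns each cell of the table into the share of tokens occurring in the cited order. The counts are copied from the table; the dictionary key "Treebank" (for the first column, Benor and Levy's sample) and the helper names `cited_share` and `fmt` are my own labels, not anything from the paper.

```python
# Counts copied from the table above: (cited order, reversed order) per corpus.
# "Treebank" labels the first column (the Benor and Levy sample); the name is an assumption.
COUNTS = {
    "English and Americans":         {"Treebank": (1, 0), "COCA": (7, 6),     "LDC News": (10, 8)},
    "Connecticut and Massachusetts": {"Treebank": (1, 0), "COCA": (15, 23),   "LDC News": (140, 190)},
    "slowly and thoughtfully":       {"Treebank": (1, 0), "COCA": (7, 0),     "LDC News": (3, 0)},
    "abused and neglected":          {"Treebank": (1, 0), "COCA": (86, 18),   "LDC News": (336, 57)},
    "acute and correct":             {"Treebank": (1, 0), "COCA": (0, 0),     "LDC News": (0, 0)},
    "approved and commended":        {"Treebank": (1, 0), "COCA": (0, 0),     "LDC News": (0, 0)},
    "strawberries and bananas":      {"Treebank": (1, 0), "COCA": (2, 4),     "LDC News": (10, 9)},
    "oranges and grapefruit":        {"Treebank": (1, 0), "COCA": (9, 8),     "LDC News": (59, 19)},
    "warm and fuzzy":                {"Treebank": (1, 0), "COCA": (154, 5),   "LDC News": (1121, 6)},
    "fruits and nuts":               {"Treebank": (1, 0), "COCA": (54, 14),   "LDC News": (192, 27)},
    "T-ball and soccer":             {"Treebank": (2, 0), "COCA": (1, 2),     "LDC News": (2, 2)},
    "pinks and greens":              {"Treebank": (2, 0), "COCA": (13, 1),    "LDC News": (18, 10)},
    "gold and silver":               {"Treebank": (4, 0), "COCA": (428, 165), "LDC News": (3287, 548)},
    "principal and interest":        {"Treebank": (5, 2), "COCA": (55, 33),   "LDC News": (980, 787)},
}
CORPORA = ("Treebank", "COCA", "LDC News")


def cited_share(cited, reversed_):
    """Share of tokens in the cited order, or None if the pair never occurs."""
    total = cited + reversed_
    return cited / total if total else None


def fmt(share):
    """Format a share for printing, marking unattested pairs as n/a."""
    return "  n/a" if share is None else f"{share:5.2f}"


for binomial, by_corpus in COUNTS.items():
    shares = [fmt(cited_share(*by_corpus[corpus])) for corpus in CORPORA]
    print(f"{binomial:32s}" + "  ".join(shares))
```

Connecticut and Massachusetts, for instance, comes out at 1.00 in the Treebank sample but below one half in both larger corpora, and the two pairs that never occur in COCA or LDC News have no estimable share there at all.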

Given that their model assigns weights to 20 "semantic, pragmatic, metrical, phonological, and word-frequency factors that may affect the ordering of binomials", and that the patterning of these factors in their 411 binomial types is far from a factorial design (as expected in real-world linguistic data), this amount of noise in type-token relations will certainly degrade the predictive power of the result.
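A toy simulation shows why a preponderance of once-only types is what an LNRE distribution leads us to expect. The sketch below draws 692 tokens from a purely hypothetical population in which the probability of a type is proportional to 1/rank. The population size of 100,000 types, the 1/rank shape, and the function name `sample_type_counts` are all assumptions chosen for illustration; none of them is estimated from the Benor and Levy data.

```python
import random
from collections import Counter


def sample_type_counts(n_types, n_tokens, seed=0):
    """Draw n_tokens from a hypothetical Zipf-like population (p(rank) ~ 1/rank)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    weights = [1.0 / rank for rank in range(1, n_types + 1)]
    tokens = rng.choices(range(n_types), weights=weights, k=n_tokens)
    return Counter(tokens)


# 692 matches the number of binomial tokens in the paper; 100_000 is an arbitrary assumption.
counts = sample_type_counts(n_types=100_000, n_tokens=692)
singletons = sum(1 for c in counts.values() if c == 1)

print(f"types observed:          {len(counts)}")
print(f"types seen exactly once: {singletons} ({singletons / len(counts):.0%})")
print(f"most frequent type:      {max(counts.values())} tokens")
```

Under these assumptions most observed types turn up exactly once while a handful account for a large share of the tokens, which is qualitatively the pattern B & L report (mostly singletons, plus a few frozen forms like back and forth).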

As they observe, "Because our full logistic-regression model uses a large number of constraints relative to the size of the dataset, it is not possible to draw detailed conclusions from the specific values of resulting constraint weights". This would be true even if the estimated frequencies of binomial types were reasonably accurate; it's much more of a problem given that their counts are nearly all 1, and thus almost meaningless as a basis for predicting population frequency. (This is especially true if the model is tested via cross-validation. As far as I can tell, though, they tested on their training set, which makes the reported 77% performance surprisingly low.)
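One way to put a number on "almost meaningless" is a confidence interval for the proportion of tokens in the cited order. The sketch below uses the Wilson score interval, which is my choice for illustration and not a method used in the paper; the function name `wilson_interval` and the 95% level (z = 1.96) are likewise assumptions.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 gives roughly 95%)."""
    if n == 0:
        return (0.0, 1.0)  # no data: every proportion remains possible
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


# A binomial seen once, in the cited order (a "1 0" cell in the table above).
print(wilson_interval(1, 1))

# gold and silver in the LDC News counts from the table: 3287 cited, 548 reversed.
print(wilson_interval(3287, 3287 + 548))
```

For a single observation the lower bound is about 0.21, so a "1 0" cell is compatible with a population in which the cited order is actually the minority one. With a few thousand tokens the interval shrinks to a couple of percentage points.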

At the start of this post, I mentioned an older corpus-linguistics tradition that also must deal with the problem of binomial order in a small corpus (about half a million words). This older tradition, without access to generalized linear models, draws a different sort of conclusion from the fact that binomial order is hard to predict and apparently variable. Thus:

“This is the same Aaron and Moshe to whom G-d told, ‘Take the Jewish people, all of their hosts, out of Egypt.’” (Shemot 6:26)

The Tosefta at the end of Masekhet Keritot asks: Why does Aaron precede Moshe in this verse, whereas Moshe usually precedes Aaron? […]

[T]he Torah, one verse after another, switches the order of their names. When it speaks about the actual Exodus – “to whom G-d told, ‘Take the Jewish people, all of their hosts, out of Egypt’” – where Moshe was central, it lists Aaron first – “Aaron and Moshe.” (Shemot 6:26) Then, in the next verse when it talks of speaking to Pharaoh – “They are the ones who speak to Pharaoh the king of Egypt . . .” – it lists Moshe first – “this is Moshe and Aaron.” (Shemot 6:27) This switching of the names actually teaches a lesson. By listing Aaron first concerning the area where Moshe was central and listing Moshe first in the area where Aaron was central, it makes it clear that both had an equal role in the mission.

Or again:

Dealing with the duties and the relationship of the child to its parents:

a) Honor your father and your mother, (Exodus 20:12; Deut. 5:16)

b) Ye shall fear every man his mother and his father (Levit. 19:3)

[In the matter of honor due to parents, the father is mentioned first; in the matter of reverence due to them, the mother is mentioned first. From this we infer that both are to be equally honored and revered. …]

And:

4. "You shall revere every man his mother, and his father"

Rabbi Yosi says that whoever fears their mother and father observes the Shabbat. He wonders why the mother is mentioned first, and Rabbi Shimon explains that the mother does not have the power to instill fear that the father does, therefore she is mentioned first. Rabbi Yehuda says that just as heaven and earth were created simultaneously, both parents are equal in fear and honor. Rabbi Shimon tells us about the sanctification below during mating and the supernal mating above.

Some similar arguments are advanced about sheep and goats, pigeons and doves, and perhaps other binomials. But here, I think, we have an even more problematic instance of testing on a training set with small type and token counts.

8 Comments

J. W. Brewer said,

December 29, 2009 @ 3:02 pm

As to the Mishnah there's perhaps a very specific presupposition that the Author of the Torah had no free variation in His idiolect — every detail of the text is the way it is and not some other way for a specific semantic-nuance reason potentially discernable by the right interpreter. That's an unhelpful (although certainly not uncommon) way to think about ordinary human language as ordinarily used. Whether it's a sensible way to think about this particular text, assuming arguendo certain extralinguistic claims about its authorship, is perhaps a question outside the competence of modern secular Sprachwissenschaft.

It would I suppose be interesting to see if there are similarly obsessive parsings of merely semi-divine texts (e.g. Homer, Shakespeare) which discern deliberate semantic nuance in 100% of all apparent instances of free variation (or variation seemingly motivated by non-semantic considerations like phonology or meter).

[(myl) The New Critics' practice of close reading could be seen as treating all poetry this way (even if they also wanted to remove the poets from the poems). There's a mystical/religious expression of this idea in I.A. Richards' poem The Daughter Thought, about King Acrisius, Danae in the bronze tower, and the conception of Perseus:

But, but within all let or lour,
There rules an order without bound:
Cyclic, unsearchable, but found
When the unthinkable junctures flower.

And startling as the fore-felt rime
The sense resists and would refuse
Justly and when it's due to lose
The step denied steps in in time.

[…]

The Golden Shower; the Virgin Birth;
The Infant Voyage: what Moses in you all
But draws his power thence? his Call
Clearer; firmer his title to the Earth.

]

Charles Belov said,

December 30, 2009 @ 3:23 am

As one might expect, things are not necessarily the same in other languages. One of my favorite Chinese dishes is (literally) pepper salt frog (or tofu, depending) and they prefer eastnorth (79 million Google hits) over northeast (16 million), westnorth (33 million) over northwest (16 million) and westsouth (39 million) over southwest (16 million). Oddly, southeast (16 million) trumped eastsouth (4 million).

[(myl) In my limited understanding, the Chinese cardinal directions are

東 dōng east
西 xī west
南 nán south
北 běi north

and in combination

東北 dōng běi northeast (= east north)
東南 dōng nán southeast (= east south)
西北 xī běi northwest (= west north)
西南 xī nán southwest (= west south)

南北 nán běi south and north
東西 dōng xī east and west

Where are you finding 南東, and what is it used to mean? CEDICT claims no knowledge. Could it be part of a transliteration? or perhaps this is another instance of the unreliability of Google counts?]

rootlesscosmo said,

December 30, 2009 @ 2:30 pm

Intersection binomials seem to have preferred order as well, from a (purely arbitrary) search:

"seventh avenue and 14th street" 95,600 ghits
"14th street and seventh avenue" 153,000 ghits

"24th and mission" 14,700 ghits
"mission and 24th" 96,700 ghits

"4th and townsend" 576 ghits*
"fourth and townsend" 113,000 ghits

"townsend and fourth" 176,000 ghits

"broadway and 96th" 191,000 ghits
"96th and broadway" 146,000 ghits

"hollywood and vine" 142,000 ghits
"vine and hollywood" 19,000 ghits

"mason and california" 259,000 ghits
"california and mason" 397,000 ghits

*This seemed suspiciously low but a new search got the same result.

Dan Lufkin said,

December 30, 2009 @ 11:09 pm

A Scandinavian oddity of order: to say "thank God" Swedes say "tack och lov" [thanks and praise] but Norwegians say "lov og takk" [other way round]. It sounds strange the first time you hear it, from either direction.

Then there's the Royal Order of Adjectives. Is that due to David Crystal or is it old?

LovinRoman said,

January 2, 2010 @ 2:30 am

I'm surprised that no one has mentioned the Prague School/MIT linguist Roman Jakobson in connection with this discussion. His work on markedness relations, semantic ordering and the poetic function of language (immanent in the utterance contrasted with poetic discourse, or "poetry") might shed some light on this discussion. One question is what people are saying more often, another is why the pairs are ordered in the way that they are. His article "Linguistics and Poetics" might be interesting for y'all to look at.

Also…perhaps the Norwegians do it to precisely mark that they are not Swedes. Wouldn't be surprising, given their history. That's supposedly why Americans started switching their hands to cut food and eat it–to mark themselves off and to be able to detect British spies in their midst (American English was not so far from British English at that point…).

Sili said,

January 9, 2010 @ 6:09 pm

I realise that parsing is not altogether trivial, but I wonder if it could not be done in such a way as to harvest some citizen science.

Unfortunately I don't know how to make sentences as interesting as galaxies and supernovae, but it must at least be possible to set up similar software to let random strangers parse a selected corpus in such a manner that consensus wins out in a self-correcting manner. The real problem then lies in getting people to use such a 'GrammarZoo' as their time-sink of choice.

Eli Anne said,

February 16, 2010 @ 10:44 am

Dan Lufkin: Where have you heard that? It sounds terrible and not idiomatic at all :-/

Varför sötsur inte sursöt? « ÖVERSÄTTARBLOGGEN said,

May 26, 2010 @ 5:09 pm

[…] reading on that topic here. And an essay on antonyms […]
