In a couple of earlier posts, I noted a gradual change in the tendency of American newspapers and U.S. Supreme Court opinions to use the phrase "the United States" as a syntactic subject ("The United States as a subject", 10/6/2009; "'The United States' as a subject at the Supreme Court", 10/20/2009). Thus in a small sample of instances of "the United States" in SCOTUS opinions from each of 6 years from 1800 to 2000, the percentage of instances in subject position increased from 1.8% to 19%:

YEAR Rate per 100

It's now possible to parse unrestricted text automatically but fairly accurately, and I expect to see large collections of automatically-parsed text become generally available soon (see e.g. Courtney Napoles, Matthew Gormley, and Benjamin Van Durme, "Annotated Gigaword", Proc. of the Joint Workshop on Automatic Knowledge Base Construction & Web-scale Knowledge Extraction, ACL-HLT 2012). And I was recently trying to persuade some colleagues that parsing a large historical books collection would be a Good Thing, even for people who aren't interested in syntactic structure per se. So for this morning's Breakfast Experiment™, I decided to take a look at how often each of three country names appears in subject position in three geographically diverse news sources.

The news sources that I chose were The New York Times, the English-language news service of Agence France Presse, and the English-language news service of Xinhua (the official press agency of the People's Republic of China). I used stories from these news services as reproduced in the English Gigaword Fifth Edition corpus from LDC; took the first 100 instances of each country name in the text from each news service in the published corpus (from 1994 or 1995, depending on the service); and classified the 900 resulting examples by hand as being in subject position or not.
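The sampling step described above can be sketched roughly as follows. This is a minimal illustration, not the procedure actually used for the post: it assumes the stories for one news service are already available as a list of plain-text strings in publication order, and the function name `first_instances`, the case-sensitive whole-phrase matching, and the 40-character context window are all hypothetical choices. It does not read the Gigaword file format, and the subject/non-subject classification itself was done by hand.

```python
import re
from itertools import islice


def first_instances(documents, name, limit=100, context=40):
    """Return up to `limit` (doc_index, snippet) pairs for `name`.

    `documents` is assumed to be a list of plain-text stories in
    publication order. Matching is case-sensitive and on word
    boundaries; both are illustrative assumptions.
    """
    pattern = re.compile(r"\b" + re.escape(name) + r"\b")

    def hits():
        # Walk the stories in order, yielding each match with some
        # surrounding text so a human can judge its syntactic role.
        for index, doc in enumerate(documents):
            for match in pattern.finditer(doc):
                start = max(0, match.start() - context)
                yield index, doc[start:match.end() + context]

    # islice stops as soon as `limit` matches have been collected.
    return list(islice(hits(), limit))
```

Calling `first_instances(stories, "the United States")` once per country name and per news service would give the nine samples of (at most) 100 snippets each that then need hand classification.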

The results were as I expected:

News source           "China"  "France"  "the United States"
Xinhua                19%      11%       14%
Agence France Presse  26%      34%       22%
The New York Times    2%       14%       26%
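One way to read the table (the reading spelled out in the reply to the first comment below) is that each news source puts its home country's name in subject position more often than the other two names. A minimal sketch of that check, using only the percentages in the table; the `home` mapping and the variable names are illustrative assumptions:

```python
# Percentage of sampled instances in subject position, from the table.
rates = {
    "Xinhua": {"China": 19, "France": 11, "the United States": 14},
    "Agence France Presse": {"China": 26, "France": 34, "the United States": 22},
    "The New York Times": {"China": 2, "France": 14, "the United States": 26},
}

# Home country of each news source (assumed mapping for this check).
home = {
    "Xinhua": "China",
    "Agence France Presse": "France",
    "The New York Times": "the United States",
}


def top_subject(row):
    """Return the country name with the highest subject rate in a row."""
    return max(row, key=row.get)


# For each source, does the home country have the highest subject rate?
matches = {source: top_subject(row) == home[source] for source, row in rates.items()}

for source, ok in matches.items():
    print(f"{source}: top = {top_subject(rates[source])}, matches home = {ok}")
```

With these nine numbers the home country comes out on top in all three rows, though with 100 instances per cell the margins are not large.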

I'm looking forward to trying this out on a larger scale when I have the parsed corpus in hand (which should be soon). For example, in the AFP stories in the English Gigaword corpus, "France" occurs 238,631 times, "China" occurs 372,420 times, and "the United States" occurs 315,590 times. Given the scale of the collection, it will be possible to test hypotheses about differential effects of types of verb (e.g. active vs. passive, verbs where the subject is agent vs. verbs where the subject is experiencer); about the effects of tense and aspect; about changes over time from 1994 to 2010; and so forth.

(Note: There are a few empirical wrinkles in this experiment, all of which I ignored. For example, a small number of instances of the word china in my sample referred to the pottery rather than to the country; and some of the nation-name instances in the Xinhua sample were part of reports of sports scores. And the sample is far too small for any reliable conclusions to be drawn. Still, it came out the way I expected.)



  1. Chad Nilep said,

    July 22, 2012 @ 10:31 pm

    I'm left wondering what the confirmed expectation was. The earlier posts suggested that "the United States" as subject was becoming more common over time, but this data is synchronic. I would guess the expectation might be either a) country-name occurs as subject around 20-plus percent of the time, in line with the trend from SCOTUS data, or b) the name of the country where the news source is located appears as subject more often than the names of the other countries. I have very little confidence, though, that either of these actually was the expectation in question. Could it be some effect of the quasi-animacy condition on thinking about China? a transfer from French syntax? rough correspondence between automatic parsing and hand-coding?

    [(myl) Sorry for not making it clearer -- the (I thought) obvious prediction is that in French news, "France" is in subject position more often than "China" or "the U.S." is; in Chinese news, "China" is in subject position more often than "France" or "the U.S." is; and in U.S. news, "the United States" is in subject position more often than "France" or "China" is.]

  2. Per Stinchcombe said,

    July 26, 2012 @ 3:12 pm

    "The sample is far too small for any reliable conclusions to be drawn"? My back-of-the-envelope calculations suggest that in the absence of a real relationship, the probability of predicting the right highest-percentage-of-subject-usages three times out of three is (1/3)^3, which I think most folks would call statistically significant.

    [(myl) But I have higher standards than that -- I'd like to know that the pattern is maintained over longer periods of time, for example.]
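The back-of-the-envelope figure in the second comment can be worked out explicitly. This sketch takes the commenter's framing as given, which is an assumption rather than a tested model: under the null hypothesis, each of the three country names is equally likely to have the highest subject rate in a given news source, and the three sources are independent.

```python
from fractions import Fraction

# Chance that the predicted name comes out on top in one source,
# assuming all three names are equally likely to be highest.
p_single = Fraction(1, 3)

# Chance that the prediction holds in all three sources,
# assuming the sources are independent of one another.
p_all_three = p_single ** 3

print(p_all_three)         # exact value as a fraction
print(float(p_all_three))  # decimal approximation
```

The result is 1/27, or roughly 0.037, which is below the conventional 0.05 threshold. It says nothing about whether the pattern would persist over longer periods, which is the further standard asked for in the reply above.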

