In a couple of earlier posts, I noted a gradual change in the tendency of American newspapers and U.S. Supreme Court opinions to use the phrase "the United States" as a syntactic subject ("The United States as a subject", 10/6/2009; "'The United States' as a subject at the Supreme Court", 10/20/2009). Thus in a small sample of instances of "the United States" in SCOTUS opinions from each of 6 years from 1800 to 2000, the percentage of instances in subject position increased from 1.8% to 19%:
|YEAR||Rate per 100|
It's now possible to parse unrestricted text automatically but fairly accurately, and I expect to see large collections of automatically parsed text become generally available soon (see e.g. Courtney Napoles, Matthew Gormley, and Benjamin Van Durme, "Annotated Gigaword", Proc. of the Joint Workshop on Automatic Knowledge Base Construction & Web-scale Knowledge Extraction, ACL-HLT 2012). And I was recently trying to persuade some colleagues that parsing a large historical books collection would be a Good Thing, even for people who aren't interested in syntactic structure per se. So for this morning's Breakfast Experiment™, I decided to take a look at how often three country names appear in subject position in three geographically diverse news sources.
The news sources that I chose were The New York Times, the English-language news service of Agence France Presse, and the English-language news service of Xinhua (the official press agency of the People's Republic of China). I used stories from these news services as reproduced in the English Gigaword Fifth Edition corpus from LDC; took the first 100 instances of each country name in the text that the corpus includes from each news service (from 1994 or 1995, depending on the service); and classified the 900 resulting examples by hand.
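For readers who want to repeat the sampling step, here is a minimal Python sketch. It is not the script used for this post: the function name `first_instances`, the context width, and the case-sensitive whole-word match are all assumptions, and reading the Gigaword files themselves is left out. It collects the first 100 occurrences of a name, each with some surrounding context, so that the examples can then be classified by hand.

```python
import re
from itertools import islice


def first_instances(texts, name, limit=100, width=60):
    """Return up to `limit` (left, match, right) context triples for `name`.

    `texts` is any iterable of story strings, in corpus order (hypothetical
    input format). Matching is case-sensitive and on word boundaries, which
    is an assumption, not something the post specifies.
    """
    pattern = re.compile(r"\b" + re.escape(name) + r"\b")

    def hits():
        for text in texts:
            for m in pattern.finditer(text):
                left = text[max(0, m.start() - width):m.start()]
                right = text[m.end():m.end() + width]
                yield left, m.group(0), right

    # islice stops reading the corpus as soon as `limit` hits are found.
    return list(islice(hits(), limit))
```

Calling `first_instances(stories, "the United States")` on the stories from one news service would give the 100 examples for one cell of the design; the subject/non-subject judgment is still made by a human reader.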
The results were as I expected (percentage of each sample of 100 instances in which the country name is in subject position):
|News source||"China"||"France"||"the United States"|
|Agence France Presse||26%||34%||22%|
|The New York Times||2%||14%||26%|
I'm looking forward to trying this out on a larger scale when I have the parsed corpus in hand (which should be soon). For example, in the AFP stories in the English Gigaword corpus, "France" occurs 238,631 times, "China" occurs 372,420 times, and "the United States" occurs 315,590 times. Given the scale of the collection, it will be possible to test hypotheses about differential effects of types of verb (e.g. active vs. passive, verbs where the subject is agent vs. verbs where the subject is experiencer); about the effects of tense and aspect; about changes over time from 1994 to 2010; and so forth.
(Note: There are a few empirical wrinkles in this experiment, all of which I ignored. For example, a small number of instances of the word "china" in my sample referred to the pottery rather than to the country; and some of the nation-name instances in the Xinhua sample were part of reports of sports scores. And the sample is far too small for any reliable conclusions to be drawn. Still, it came out the way I expected.)
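To put a number on "far too small", one option is to attach a confidence interval to each percentage. The sketch below uses the Wilson score interval, which is a choice made here for illustration and not part of the original experiment; the function name `wilson_interval` and the 95% level (`z=1.96`) are likewise assumptions.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion.

    `successes` is the number of subject-position instances and `n` is the
    sample size (100 per cell in this experiment). z=1.96 corresponds to an
    approximate 95% interval.
    """
    p = successes / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half, centre + half


# 26 subject-position instances out of 100, as in the table's 26% cells.
low, high = wilson_interval(26, 100)
print(f"{low:.3f} to {high:.3f}")
```

For a cell with 26 out of 100, the interval runs from roughly 18% to 35%, so differences of a few percentage points between cells are well within sampling noise at this sample size.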