Archive for Computational linguistics

As part of an exercise/demonstration for a course, last night I ran Neville Ryant's second-best speech activity detector (SAD) on Barack Obama's Weekly Radio Addresses for 2010 (50 of them), and George W. Bush's Weekly Radio Addresses for 2008 (48 of them). The distributions of speech and silence durations, via R's kernel density estimation function, look like this:
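The kind of computation behind that plot can be sketched as follows — a Gaussian kernel density estimate, which is what R's density() computes. The durations below are synthetic lognormal draws standing in for the actual SAD output, and the bandwidth rule is Silverman's rule of thumb, an assumption on my part:

```python
import numpy as np

rng = np.random.default_rng(0)
speech = rng.lognormal(mean=0.5, sigma=0.8, size=1000)    # synthetic speech-segment durations (s)
silence = rng.lognormal(mean=-1.0, sigma=0.7, size=1000)  # synthetic pause durations (s)

def kde(data, grid):
    """Gaussian kernel density estimate with Silverman's rule-of-thumb bandwidth."""
    h = 1.06 * data.std() * len(data) ** (-0.2)
    z = (grid[:, None] - data[None, :]) / h
    return np.exp(-0.5 * z ** 2).mean(axis=1) / (h * np.sqrt(2 * np.pi))

grid = np.linspace(0, 10, 500)
speech_density = kde(speech, grid)
silence_density = kde(silence, grid)

# Pauses are much shorter than speech segments, so the silence density
# peaks earlier on the duration axis than the speech density does.
print(grid[silence_density.argmax()], grid[speech_density.argmax()])
```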
I'm sure that these patterns mean something. But it seems a little weird that OF as a proportion of all prepositions should correlate r=0.953 with the proportion of instances of OF immediately followed by THE, and it seems weirder that OF as a proportion of all prepositions should correlate r=0.913 with the proportion of adjective-noun sequences immediately preceded by THE.
So let's hope that what these patterns mean is that the secular decay of THE has somehow seeped into some but not all of the other counts, or that some other hidden cause is governing all of the correlated decays. The alternative hypothesis is that there's a problem with the way the underlying data was collected and processed, which would be annoying.
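One reason to suspect a shared hidden cause rather than a meaningful link: any two quantities that decline smoothly over the same period will correlate strongly, whatever the mechanism. A small illustration with invented numbers — two made-up declining proportions plus small independent noise, not the actual corpus counts:

```python
import numpy as np

years = np.arange(1900, 2001, 10)
rng = np.random.default_rng(1)

# Two hypothetical proportions that both drift downward over the century,
# with independent measurement noise added to each.
of_share = np.linspace(0.30, 0.22, years.size) + rng.normal(0, 0.003, years.size)
of_the_share = np.linspace(0.18, 0.12, years.size) + rng.normal(0, 0.003, years.size)

r = np.corrcoef(of_share, of_the_share)[0, 1]
print(round(r, 3))  # close to 1: the common downward trend does the work
```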
And in a comment on a comment, I noted that the corresponding data from the Corpus of Historical American English, which is a balanced corpus collected from sources largely or entirely distinct from the Google Books dataset, shows similar unexpected correlations.
So today I'd like to point out that much simpler data — frequencies of a few of the commonest words — shows some equally strong correlations over time in these same datasets.
This is a brief progress report on "The case of the disappearing determiners", which I've continued to poke at in my spare time.
As the red line in the plot below shows, the proportion of nouns immediately preceded by THE decreased over the course of the 20th century, from an average of 18.9% for books published in 1900-1910 to 13.5% for books published in 1990-2000. The blue line shows that the proportion of adjective+noun sequences immediately preceded by THE was higher, overall, but followed a remarkably similar falling trajectory, from 29.1% in 1900-1910 to 21.2% in 1990-2000:
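For concreteness, the two proportions can be computed from POS-tagged text roughly as follows. This is a toy sketch over one invented Penn-Treebank-tagged sentence, not the actual corpus pipeline:

```python
# Toy tagged sentence; tags follow the Penn Treebank convention
# (DT determiner, JJ adjective, NN noun, VBD past-tense verb, IN preposition).
tagged = [("the", "DT"), ("old", "JJ"), ("house", "NN"), ("stood", "VBD"),
          ("on", "IN"), ("a", "DT"), ("hill", "NN"), ("near", "IN"),
          ("the", "DT"), ("river", "NN")]

nouns = adj_nouns = the_nouns = the_adj_nouns = 0
for i, (word, tag) in enumerate(tagged):
    if tag.startswith("NN"):
        nouns += 1
        # Nouns immediately preceded by THE (the red-line proportion).
        if i > 0 and tagged[i - 1][0].lower() == "the":
            the_nouns += 1
        # Adjective+noun sequences, and those immediately preceded by THE
        # (the blue-line proportion).
        if i > 0 and tagged[i - 1][1] == "JJ":
            adj_nouns += 1
            if i > 1 and tagged[i - 2][0].lower() == "the":
                the_adj_nouns += 1

print(the_nouns / nouns)          # proportion of nouns immediately preceded by THE
print(the_adj_nouns / adj_nouns)  # proportion of adj+noun sequences preceded by THE
```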
Following up on yesterday's post "The case of the disappearing determiners", Gosse Bouma sent me some data from the CGN ("Corpus Gesproken Nederlands"), about determiner use in spoken Dutch by people born between 1914 and 1987. According to the CGN website,
The Spoken Dutch Corpus project was aimed at the construction of a database of contemporary standard Dutch as spoken by adults in The Netherlands and Flanders. […] In version 1.0, the results are presented that have emerged from the project. The total number of words available here is nearly 9 million (800 hours of speech). Some 3.3 million words were collected in Flanders, well over 5.6 million in The Netherlands.
It's not clear to me exactly when the recordings were made, but the project ran from 1998 to 2004.
Gosse sent data focused on the word de, which is the definite article for masculine and feminine ("common") nouns in Dutch, cognate with English the. (The definite article for neuter nouns, het, is less frequent and also can be used as a pronoun.)
The results are similar to those that I reported earlier for English: Older people use the definite article more frequently than younger people (at least for people born from the 1950s onwards), and at every age, men use the definite article more than women.
For the past century or so, the commonest word in English has gradually been getting less common. Depending on data source and counting method, the frequency of the definite article THE has fallen substantially — in some cases at a rate as high as 50% per 100 years.
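Reading "50% per 100 years" as compound yearly decay — one possible interpretation, not necessarily how every source computes it — the arithmetic works out to a steady loss of roughly 0.7% per year:

```python
# Per-year retention factor that halves the frequency over a century,
# assuming compound (multiplicative) yearly decay.
annual = 0.5 ** (1 / 100)
print(round((1 - annual) * 100, 2))  # percent lost per year
print(round(annual ** 100, 3))       # back to half after a century
```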
At every stage, writing that's less formal has fewer THEs, and speech generally has fewer still, so to some extent the decline of THE is part of a more general long-term trend towards greater informality. But THE is apparently getting rarer even in speech, so the change is more than just the (normal) shift of writing style towards the norms of speech.
There appear to be weaker trends in the same direction, at overall lower rates, in German, Italian, Spanish, and French.
I'll lay out some of the evidence for this phenomenon, mostly collected from earlier LLOG posts. And then I'll ask a few questions about what's really going on, and why and how it's happening. [Warning: long and rather wonky.]
Randy Olson and Ritchie King, "How The Internet* Talks [*Well, the mostly young and mostly male users of Reddit, anyway]", fivethirtyeight.com 11/18/2015. The interactive viewer reveals some interesting trends:
Alberto Acerbi, Vasileios Lampos, Philip Garnett, & R. Alexander Bentley, "The Expression of Emotions in 20th Century Books", PLOS ONE 3/20/2013:
We report here trends in the usage of “mood” words, that is, words carrying emotional content, in 20th century English language books, using the data set provided by Google that includes word frequencies in roughly 4% of all books published up to the year 2008. We find evidence for distinct historical periods of positive and negative moods, underlain by a general decrease in the use of emotion-related words through time. Finally, we show that, in books, American English has become decidedly more “emotional” than British English in the last half-century, as a part of a more general increase of the stylistic divergence between the two variants of English language.
Christiaan H Vinkers et al., "Use of positive and negative words in scientific PubMed abstracts between 1974 and 2014: retrospective analysis", BMJ 2015:
Design Retrospective analysis of all scientific abstracts in PubMed between 1974 and 2014.
Methods The yearly frequencies of positive, negative, and neutral words (25 preselected words in each category), plus 100 randomly selected words were normalised for the total number of abstracts. […]
Results The absolute frequency of positive words increased from 2.0% (1974-80) to 17.5% (2014), a relative increase of 880% over four decades.
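The normalisation described in the Methods can be sketched as follows. The per-year counts here are invented (chosen to reproduce the reported 2.0% and 17.5% rates); the point is the normalisation step, which divides hits by the number of abstracts so that growth of the literature itself cancels out:

```python
# Invented counts: abstracts per period and hits for positive words.
abstracts = {1974: 200_000, 2014: 1_200_000}
positive_hits = {1974: 4_000, 2014: 210_000}

# Normalise hits by the total number of abstracts for each period.
rate = {y: positive_hits[y] / abstracts[y] for y in abstracts}
print(rate)  # {1974: 0.02, 2014: 0.175} — i.e. 2.0% rising to 17.5%
```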
Dyami Hayes writes to point out that there has been a change over the past century in the relative popularity (at least in printed text) of constructions like these:
What this book sets out to do is to provide some tools, ideas and suggestions for tackling non-verbal reasoning questions.
What it attempts to do is provide a framework for understanding how local governments are organized.
Yesterday I explained why the long-tailed ("Zipf's Law") distribution of word frequencies makes it almost impossible to estimate vocabulary size by counting word types in samples of writing or speaking ("Why estimating vocabulary size by counting words is (nearly) impossible"). In a comment on that post, "flow" suggested that similar problems might afflict attempts to estimate vocabulary size by checking someone's knowledge of random samples from a dictionary.
But in fact this worry is groundless. There are many problems with the method — especially defining the list to sample from, and defining what counts as "knowing" an item in the sample — but the nature of word-frequency distributions is not one of them.
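A quick simulation shows why the sampling step itself is sound. Draw k random headwords from an N-word list, check how many are "known", and scale up: the estimate's sampling error depends only on k, not on how skewed the words' text frequencies are, which is why Zipf's law is not the problem here. All numbers below are invented:

```python
import random

random.seed(42)
N = 100_000                                         # size of the word list sampled from
true_vocab = set(random.sample(range(N), 40_000))   # the words this person "knows"

k = 500                                             # dictionary sample size
sample = random.sample(range(N), k)
known = sum(1 for w in sample if w in true_vocab)

# Binomial estimate: known fraction of the sample, scaled to the full list.
estimate = known / k * N
print(estimate)  # should land near the true 40,000, within binomial error
```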
A few days ago, I expressed skepticism about a claim that "the human lexicon has a de facto storage limit of 8,000 lexical items", which was apparently derived from counting word types in various sorts of texts ("Lexical limits?", 12/5/2015). There are many difficult questions here about what we mean by "word", and what it means to be "in" the lexicon of an individual or a language — though I don't see how you could answer those questions so as to come up with a number as low as 8,000. But today I'd like to focus on some of the reasons that even after settling the "what is a word" questions, it's nearly hopeless to try to establish an upper bound by counting "word" types in text.
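The type-counting problem is easy to demonstrate by simulation: sample tokens from a Zipf-like (1/rank) distribution over a vocabulary of 50,000 words and count the distinct types seen. Growth is slow and nowhere near saturating, so the type count mostly reflects sample size, not vocabulary size. The parameters are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
V = 50_000                          # true vocabulary size
probs = 1.0 / np.arange(1, V + 1)   # Zipf-like: probability proportional to 1/rank
probs /= probs.sum()

types_seen = {}
for n in (1_000, 10_000, 100_000):
    tokens = rng.choice(V, size=n, p=probs)
    types_seen[n] = int(np.unique(tokens).size)
    print(n, types_seen[n])         # types grow with n but stay far below V
```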