Randy Olson and Ritchie King, "How The Internet* Talks [*Well, the mostly young and mostly male users of Reddit, anyway]", fivethirtyeight.com 11/18/2015. The interactive viewer reveals some interesting trends:
Archive for Computational linguistics
Alberto Acerbi, Vasileios Lampos, Philip Garnett, & R. Alexander Bentley, "The Expression of Emotions in 20th Century Books", PLOS ONE 3/20/2013:
We report here trends in the usage of “mood” words, that is, words carrying emotional content, in 20th century English language books, using the data set provided by Google that includes word frequencies in roughly 4% of all books published up to the year 2008. We find evidence for distinct historical periods of positive and negative moods, underlain by a general decrease in the use of emotion-related words through time. Finally, we show that, in books, American English has become decidedly more “emotional” than British English in the last half-century, as a part of a more general increase of the stylistic divergence between the two variants of English language.
Christiaan H Vinkers et al., "Use of positive and negative words in scientific PubMed abstracts between 1974 and 2014: retrospective analysis", BMJ 2015:
Design Retrospective analysis of all scientific abstracts in PubMed between 1974 and 2014.
Methods The yearly frequencies of positive, negative, and neutral words (25 preselected words in each category), plus 100 randomly selected words were normalised for the total number of abstracts. […]
Results The absolute frequency of positive words increased from 2.0% (1974-80) to 17.5% (2014), a relative increase of 880% over four decades.
Dyami Hayes writes to point out that there has been a change over the past century in the relative popularity (at least in printed text) of constructions like these:
What this book sets out to do is to provide some tools, ideas and suggestions for tackling non-verbal reasoning questions.
What it attempts to do is provide a framework for understanding how local governments are organized.
Yesterday I explained why the long-tailed ("Zipf's Law") distribution of word frequencies makes it almost impossible to estimate vocabulary size by counting word types in samples of writing or speaking ("Why estimating vocabulary size by counting words is (nearly) impossible"). In a comment on that post, "flow" suggested that similar problems might afflict attempts to estimate vocabulary size by checking someone's knowledge of random samples from a dictionary.
But in fact this worry is groundless. There are many problems with the method — especially defining the list to sample from, and defining what counts as "knowing" an item in the sample — but the nature of word-frequency distributions is not one of them.
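To make the dictionary-sampling method concrete, here is a minimal sketch, assuming a fixed list of headwords and a yes/no test of whether the person "knows" each sampled item. The function name `estimate_vocabulary`, the 50,000-entry toy dictionary, and the simulated person who knows 20,000 entries are all invented for illustration. Because the sample is drawn uniformly from the list, the estimate depends only on the fraction of entries known, not on how often those words occur in text.

```python
import random

def estimate_vocabulary(dictionary, knows_word, sample_size, seed=0):
    """Estimate how many entries of `dictionary` a person knows.

    Draws a uniform random sample of entries, tests each one with
    `knows_word`, and scales the known fraction up to the full list.
    Returns (estimate, standard_error), both in numbers of entries.
    """
    rng = random.Random(seed)
    sample = rng.sample(dictionary, sample_size)
    known = sum(1 for word in sample if knows_word(word))
    p = known / sample_size
    # Binomial standard error of the proportion
    # (ignores the small finite-population correction).
    se = (p * (1 - p) / sample_size) ** 0.5
    return p * len(dictionary), se * len(dictionary)

# Hypothetical toy data: 50,000 headwords, of which the person knows 20,000.
dictionary = [f"word{i}" for i in range(50_000)]
known_set = set(dictionary[:20_000])

estimate, stderr = estimate_vocabulary(dictionary, known_set.__contains__, 500)
print(f"estimated vocabulary: {estimate:.0f} +/- {stderr:.0f} entries")
```

The hard parts remain the ones named above: choosing the list to sample from and deciding what `knows_word` should mean.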
A few days ago, I expressed skepticism about a claim that "the human lexicon has a de facto storage limit of 8,000 lexical items", which was apparently derived from counting word types in various sorts of texts ("Lexical limits?", 12/5/2015). There are many difficult questions here about what we mean by "word", and what it means to be "in" the lexicon of an individual or a language — though I don't see how you could answer those questions so as to come up with a number as low as 8,000. But today I'd like to focus on some of the reasons why, even after settling the "what is a word" questions, it's nearly hopeless to try to establish an upper bound by counting "word" types in text.
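A small simulation illustrates why counting types is so unreliable. This is a sketch under stated assumptions, not an analysis of real text: tokens are drawn independently from an idealized Zipf distribution (probability proportional to 1/rank) over a hypothetical vocabulary of 100,000 types, and the function name `type_counts` is invented here. The number of distinct types observed keeps climbing as the sample grows, and stays well short of the true vocabulary size.

```python
import random
from itertools import accumulate

def type_counts(vocab_size, checkpoints, seed=0):
    """Count distinct types seen after each checkpoint number of tokens.

    Tokens are sampled independently with probability proportional to
    1/rank, an idealized Zipf distribution over `vocab_size` types.
    """
    rng = random.Random(seed)
    cum_weights = list(accumulate(1.0 / r for r in range(1, vocab_size + 1)))
    tokens = rng.choices(range(vocab_size), cum_weights=cum_weights,
                         k=max(checkpoints))
    marks = set(checkpoints)
    seen = set()
    counts = {}
    for n, token in enumerate(tokens, start=1):
        seen.add(token)
        if n in marks:
            counts[n] = len(seen)
    return counts

counts = type_counts(100_000, [1_000, 10_000, 100_000])
for n_tokens, n_types in sorted(counts.items()):
    print(f"{n_tokens:>7} tokens -> {n_types:>6} distinct types")
```

Any type count from a finite sample is therefore a lower bound that depends on the sample size, not an estimate of the underlying inventory.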
Greg Corrado, "Computer, respond to this email", Google Research Blog 11/3/2015:
I get a lot of email, and I often peek at it on the go with my phone. But replying to email on mobile is a real pain, even for short replies. What if there were a system that could automatically determine if an email was answerable with a short reply, and compose a few suitable responses that I could edit or send with just a tap? […]
Some months ago, Bálint Miklós from the Gmail team asked me if such a thing might be possible. I said it sounded too much like passing the Turing Test to get our hopes up… but having collaborated before on machine learning improvements to spam detection and email categorization, we thought we’d give it a try. […]
We’re actually pretty amazed at how well this works. We’ll be rolling this feature out on Inbox for Android and iOS later this week, and we hope you’ll try it for yourself! Tap on a Smart Reply suggestion to start editing it. If it’s perfect as is, just tap send. Two-tap email on the go — just like Bálint envisioned.
A couple of great posts by Ben Schmidt at Bookworm: "Vector space models for the digital humanities", 10/25/2015; and "Rejecting the gender binary: a vector-space operation", 10/30/2015.
Update — A quick experiment by a Penn grad student, which confirms that somewhat plausible things emerge from fairly small and fairly noisy datasets…
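For readers unfamiliar with the operation that the second title alludes to, here is a minimal sketch of generic vector rejection: removing from a word vector its component along some direction, such as the difference between the vectors for "she" and "he". This is not Schmidt's code, and the three-dimensional vectors below are invented toy values rather than output from any trained model.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def reject(v, direction):
    """Return v minus its projection onto `direction`.

    The result is orthogonal to `direction`, so whatever v had in
    common with that direction has been removed.
    """
    scale = dot(v, direction) / dot(direction, direction)
    return [a - scale * b for a, b in zip(v, direction)]

# Hypothetical toy embeddings (real models use hundreds of dimensions).
he = [1.0, 0.2, 0.5]
she = [-0.8, 0.4, 0.5]
word = [0.6, 0.3, 0.8]

gender = [s - h for s, h in zip(she, he)]  # direction from "he" to "she"
neutral = reject(word, gender)
print(neutral, dot(neutral, gender))
```

The second printed value is zero up to floating-point error, which is what it means for the gender direction to have been "rejected" from the word vector.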
I read Ancillary Justice, the first book in Ann Leckie's Imperial Radch series, at some point in the spring of 2014, and so I was not at all surprised to find Brad DeLong referring to her as "an extremely sharp observer […] author of the devastatingly-good Ancillary Justice", in a blog post "Ann Leckie on David Graeber's "Debt: The First 5000 Mistakes": Handling the Sumerian Evidence Smackdown", 11/24/2014, where he quotes at length from her blog post "Debt", 2/24/2013.
And if you haven't read Ann Leckie's trilogy, you should do yourself a favor and start doing so right away. But this is Language Log, not Science Fiction Book Review Log or Unreliable Economic History Log, so why am I bringing up Ann Leckie now?
Yogi Berra may or may not have said that "You can observe a lot just by watching". He didn't add that you can learn a lot just by counting — but as a baseball person, he surely knew the power of simple statistics.
You can learn a lot about G.K. Chesterton from the Wikipedia article about him, including his observation that "The whole modern world has divided itself into Conservatives and Progressives. The business of Progressives is to go on making mistakes. The business of the Conservatives is to prevent the mistakes from being corrected." But Wikipedia won't tell you that his fiction writing has a striking, perhaps unique, statistical property: he hardly ever uses feminine pronouns.
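The kind of counting involved is simple enough to sketch in a few lines. What follows is an illustration under assumptions, not the procedure actually used for the Chesterton counts: the pronoun lists, the tokenizing regular expression, and the function name `pronoun_counts` are choices made here, and the sample sentence is invented rather than taken from Chesterton.

```python
import re

FEMININE = {"she", "her", "hers", "herself"}
MASCULINE = {"he", "him", "his", "himself"}

def pronoun_counts(text):
    """Return (feminine, masculine) third-person pronoun counts in `text`."""
    words = re.findall(r"[a-z']+", text.lower())
    fem = sum(1 for w in words if w in FEMININE)
    masc = sum(1 for w in words if w in MASCULINE)
    return fem, masc

# Invented sample sentence, for illustration only.
sample = "He said that she had taken his hat, but he gave her his coat himself."
fem, masc = pronoun_counts(sample)
print(f"feminine: {fem}, masculine: {masc}")
```

Run over the full text of a novel, the two counts (or their ratio) give a single number per book that can be compared across authors.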
Brad DeLong linked to a paywalled Financial Times article by Lisa Pollack about problems with spreadsheet usage, and observed that
[C]onsiderations like these make me extremely hesitant when I think of asking my students in Econ 1 next spring to do problems sets in Excel. Shouldn’t I be asking them to do it in R via R Studio or R Commander instead? Audit trails are very valuable. Debuggability is very valuable. Excel ain’t got it…
The first comment, from "Captaindomestic":
I'm biased as a MathWorks employee, but you may want to look into MATLAB. It is really strong in the kinds of data analysis and plotting that econ students need to do. MATLAB has a pretty non-programmer friendly editor and model that helps new users.
Alexander Spangher, "Building the Next New York Times Recommendation Engine", NYT 8/11/2015:
The New York Times publishes over 300 articles, blog posts and interactive stories a day.
Refining the path our readers take through this content — personalizing the placement of articles on our apps and website — can help readers find information relevant to them, such as the right news at the right times, personalized supplements to major events and stories in their preferred multimedia format.