Last month, I taught a short course on "Corpus-based Linguistic Research" at the LSA Institute in Ann Arbor, in which the participants were asked to do individual projects. One of the undergraduates in the class, Alex R., undertook to examine the time-course of variability in English spelling, starting with the Paston Letters, which are "a collection of letters and papers consisting of the correspondence of members of the Paston family of Norfolk gentry, and others connected with them in England, between the years 1422 and 1509".
Archive for Computational linguistics
The social-science journal Emerging Adulthood recently invited Jean M. Twenge and one of her most prominent critics, Jeffrey Arnett, to debate “whether today’s emerging adults are excessively ‘narcissistic,’ ” as Dr. Twenge asserts. Both wrote papers outlining their positions, then each wrote a reply to the other.
Worth reading if you were interested in the computational stylometric analysis by Patrick Juola that helped to unmask J. K. Rowling as the author of The Cuckoo's Calling: an article in The Chronicle of Higher Education about Juola's work.
At the recent Language Diversity Congress in Groningen, one of many interesting presentations was Martijn Wieling and John Nerbonne's "Inducing and using phonetic similarity". More than a thousand LL readers played a role in the creation of this work, by responding to a request back in May ("Rating American English Accents", 5/19/2012) to participate in an online experiment.
The Sunday (UK) Times recently revealed that J.K. Rowling wrote the detective novel The Cuckoo's Calling under the pen name Robert Galbraith. The newspaper explained that, as part of their investigation, they sought the assistance of two scholars who have developed software to help with authorship attribution: Peter Millican of Oxford University and Patrick Juola of Duquesne University. Given the public interest in the Rowling revelation, I asked Patrick to write a guest post describing the authorial analysis that he conducted. (For more on the story, see my post on the Wall Street Journal's Speakeasy blog.)
This is an illustrative Breakfast Experiment™ for my course at the LSA Institute (on "Corpus-Based Linguistic Research"). It starts from an earlier LL post, "When men were men, and verbs were passive", 8/4/2006, where I observed that Winston Churchill, often cited as a model of forceful eloquence, used the passive voice for 30-50% of his verbs in various passages from his 1899 memoir The River War — several times the rate noted in statistical usage studies from the 1960s and later.
So I thought I'd do a quick historical survey of passive-voice rates, as an example of what can be done with Mark Davies' COHA corpus.
From Charles Stross, Neptune's Brood. It's 7000 AD, and Krina Alizond-114 has this to say about a not-very-helpful piece of interactive software:
[T]hese things bore only a thin veneer of intelligence: Once you crack the ice and tumble into the howling void of thoughtlessness beneath, the illusion ceases to be comforting and becomes a major source of irritation.
Tyler Schnoebelen, "The Weirdest Languages", Idibon blog 6/21/2013:
The World Atlas of Language Structures evaluates 2,676 different languages in terms of a bunch of different language features. These features include word order, types of sounds, ways of doing negation, and a lot of other things—192 different language features in total. [...]
The data in WALS is fairly sparse, so we restrict ourselves to the 165 features that have at least 100 languages in them (at this stage we also knock out languages that have fewer than 10 of these—dropping us down to 1,693 languages).
Now, one problem is that if you just stop there you have a huge amount of collinearity. Part of this is just the nature of the features listed in WALS—there’s one for overall subject/object/verb order and then separate ones for object/verb and subject/verb. Ideally, we’d like to judge weirdness based on unrelated features. We can focus in on features that aren’t strongly correlated with each other (between two correlated features, we pick the one that has more languages coded for it). We end up with 21 features in total.
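For readers who want to see the shape of that filtering, here is a minimal sketch of the steps the quoted passage describes. It is not Idibon's code. The function names, the toy data layout (a dict from language to its coded features), the association scores, and the 0.5 cutoff are all assumptions made for illustration; the passage does not say which correlation measure or threshold was used.

```python
from collections import defaultdict


def filter_sparse(data, min_langs_per_feature=100, min_features_per_lang=10):
    """Drop sparsely coded features, then sparsely coded languages.

    `data` maps language name -> {feature name: value}, listing only the
    features actually coded for that language (hypothetical layout).
    """
    # Count how many languages are coded for each feature.
    counts = defaultdict(int)
    for feats in data.values():
        for f in feats:
            counts[f] += 1
    kept_features = {f for f, n in counts.items() if n >= min_langs_per_feature}

    # Restrict each language to the kept features; drop languages with too few.
    result = {}
    for lang, feats in data.items():
        restricted = {f: v for f, v in feats.items() if f in kept_features}
        if len(restricted) >= min_features_per_lang:
            result[lang] = restricted
    return result


def prune_correlated(coverage, association, threshold=0.5):
    """Greedily keep features that are not strongly associated with one
    already kept, preferring features coded for more languages.

    `coverage` maps feature -> number of languages coded.
    `association` maps frozenset({f, g}) -> score in [0, 1]; the measure
    and the threshold are assumptions, not taken from the quoted post.
    """
    kept = []
    for f in sorted(coverage, key=lambda f: (-coverage[f], f)):
        if all(association.get(frozenset((f, g)), 0.0) < threshold for g in kept):
            kept.append(f)
    return kept
```

With the thresholds in the quote, the first function would be called with its defaults (100 and 10). Visiting features in order of coverage is one simple way to honor "we pick the one that has more languages coded for it"; other pruning orders are possible and could yield a different final set.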
Continuing in another connection with the exploration of real-estate listings that I discussed earlier ("Long is good, good is bad, nice is worse, and ! is questionable", 6/12/2013; "Significant (?) relationships everywhere", 6/14/2013), I stumbled on this curious factoid about the use of and and or in trulia.com's listings for the ten cities I've harvested so far:
The judge in the Zimmerman case has recently decided to let the jury decide for themselves about the source of the screams in the 911 tape ("Jury to decide whose voice on 911 call in Zimmerman case"). This decision is a stinging rebuke to the "expert" testimony of Tom Owen and Alan Reich, and supports the testimony of Peter French, George Doddington, and Hirotaka Nakasone. For a summary of the dueling experts, see Andrew Branca, "Zimmerman Case: Dr. Hirotaka Nakasone, FBI, and the low-quality 3-second audio file", Legal Insurrection 6/7/2013, "Zimmerman Prosecution’s Voice Expert admits: 'This is not really good evidence'", 6/8/2013, and "Zimmerman Case: Experts Call State’s Scream Claims 'Absurd' 'Ridiculous' and 'Imaginary Stuff'", 6/9/2013.
I don't have time this morning to discuss the issues at greater length, but it's clear that the judge's evaluation of the situation was correct.
Sanette Tanaka, "Fancy Real-Estate Listing, Fancier Verbiage", WSJ 6/6/2013:
Savvy real-estate agents know it's not just what you say. It's how long it takes you to say it.
More-expensive homes go hand-in-hand with longer real-estate agents' remarks—the language written by the agent that supplements the house description and photos in a listing. Agents use a median 250 characters for homes listed under $100,000, according to an analysis for The Wall Street Journal by real-estate listings company Zillow. For homes priced over $1 million, they go nearly twice as long, with a median 487 characters. (That's about the length of this paragraph.)
"Generally, what you find is that regardless of the region, the more expensive the home is, the more characters are used to describe that home," says Stan Humphries, chief economist at Zillow.
That's not from the chorus of a postmodern country song — it's the title of a National Geographic piece discussing Morgan R. Frank, Kameron Decker Harris, Peter Sheridan Dodds, and Christopher M. Danforth, "The Geography of Happiness: Connecting Twitter Sentiment and Expression, Demographics, and Objective Characteristics of Place", PLoS ONE 5/29/2013.
David Brooks has found a congenial story in Google ngrams — or rather, in three papers about ngrammatical history, which he interprets to show that virtue, discipline, and concern for the common good have been declining, while subjectivity and concern for self-esteem have increased ("What Our Words Tell Us", NYT 5/20/2013).
Brooks doesn't cite or link to the papers, which in my opinion is a form of journalistic malpractice, so here they are:
Jean M. Twenge, W. Keith Campbell, and Brittany Gentile, "Increases in Individualistic Words and Phrases in American Books, 1960–2008", PLoS One 7/10/2012
Pelin Kesebir and Selin Kesebir, "The Cultural Salience of Moral Character and Virtue Declined in Twentieth Century America", Journal of Positive Psychology, Forthcoming
Daniel B. Klein, "Ngrams of the Great Transformations", GMU Working Paper in Economics, 2013
From Simon King:
I am pleased to announce that the English section of this year's Blizzard Challenge listening test is now live. Please help us out by taking part, and encouraging your colleagues, students, friends, contacts, etc. to take part too. It's your chance to hear a range of speech synthesisers, including some really good ones. Please circulate this message widely – for example, on mailing lists, forums and using social media – we need to reach as many people as possible in the coming month or so.