Archive for Computational linguistics

More on the statistics of real-estate listings

Early last summer, an inquiry from Sanette Tanaka at the WSJ led me to do a Breakfast Experiment™ on the relationship between the language of real-estate listings and the price of the associated properties ("Long is good, good is bad, nice is worse, and ! is questionable", 6/12/2013; "Significant (?) relationships everywhere", 6/14/2013; "City of the big disjunctions", 6/20/2013).


Speaker-change offsets

In Meg Wilson's post on marmoset vs. human conversational turn-taking, I learned about Tanya Stivers et al., "Universals and cultural variation in turn-taking in conversation", PNAS 2009, which compared response offsets to polar ("yes-no") questions in 10 languages. Here's their plot of the data for English:

Based on examination of a Dutch corpus, they argue that "the use of question–answer sequences is a reasonable proxy for turn-taking more generally"; and in their cross-language data, they found that "the response timings for each language, although slightly skewed to the right, have a unimodal distribution with a mode offset for each language between 0 and +200 ms, and an overall mode of 0 ms. The medians are also quite uniform, ranging from 0 ms (English, Japanese, Tzeltal, and Yélî-Dnye) to +300 ms (Danish, ‡Ākhoe Hai‖om, Lao) (overall cross-linguistic median +100 ms)."


Non-projective flavor

From a current Starbucks ad, a nice example of a non-projective English sentence:


On Interdisciplinary Collaboration and "Latent Personas"

This is a guest post by David Bamman, in response to the post by Dan Garrette ("Computational linguistics and literary scholarship", 9/12/2013).


The critique by Hannah Alpert-Abrams and Dan Garrette of our recent ACL paper ("Learning Latent Personas of Film Characters") and the ensuing discussion are raising interesting questions on the nature of interdisciplinary research, specifically between computer science and literary studies. Garrette frames our paper as "attempting to … answer questions in literary theory" and Alpert-Abrams argues that for a given work of this kind to be truly interdisciplinary, it "must be cutting edge in the field of literary scholarship too." To do truly meaningful work at the intersection of computer science and literary studies, they argue, parties from both sides need to be involved.

While I disagree with how Garrette and Alpert-Abrams have characterized our paper (as attempting to address literary theory), I fundamentally agree with their underlying point. I have a different understanding of how we get to that point, however; to illustrate this, let me offer here a different framing of our paper.


Computational linguistics and literary scholarship

Email from Dan Garrette:

I am a Computer Science PhD student at UT-Austin working with Jason Baldridge, but I've recently been collaborating with my colleague Hannah Alpert-Abrams in the Comparative Literature department here at UT.  We've been talking about the intersection of NLP and literary study and we are interested in looking at ways in which researchers can collaborate to do work that is valid scholarship in both fields.

There has been a flurry of writing recently about the relationship between the sciences and the humanities (see: Ted Underwood, Steven Pinker, Ross Douthat's response to Steven Pinker, etc), and a particularly interesting paper at ACL (David Bamman, Brendan O’Connor, & Noah A. Smith, "Learning Latent Personas of Film Characters") that attempts to use modern NLP techniques to answer questions in literary theory.  Unfortunately, much of this discussion has failed to actually understand or recognize the scholarship that is really happening in the humanities, and, instead, seems to assume that people in the sciences are able to simply walk in and provide answers for another field.

We would like to see truly interdisciplinary work that combines contemporary ideas from both fields, and we see the ACL paper as the perfect point of entry for a public conversation about this kind of work. Because Language Log attracts readers from many different disciplines, and because computational linguistics has played an important part of the developing field of 'digital humanities,' we thought it might be a good forum for this conversation.

We have written a short response to the ACL paper which we think might make an interesting Language Log post, and Jason suggested I send it to you to see if you were interested.  We'd be very interested to hear your thoughts and the thoughts of the greater Language Log readership. Perhaps it could even spark a conversation.


Reassuring parables

The most recent xkcd:

Mouseover title:

'At least humans are better at quietly amusing ourselves, oblivious to our pending obsolescence' thought the human, as a nearby Dell Inspiron contentedly displayed the same bouncing geometric shape screensaver it had been running for years.


Proportion of adjectives and adverbs: Some facts

Adam Okulicz-Kozaryn, "Cluttered writing: adjectives and adverbs in academia", Scientometrics 2013:

[H]ow do we produce readable and clean scientific writing? One of the good elements of style is to avoid adverbs and adjectives (Zinsser 2006). Adjectives and adverbs sprinkle paper with unnecessary clutter. This clutter does not convey information but distracts and has no point especially in academic writing, say, as opposed to literary prose or poetry.

If you've seen my earlier discussion of this paper ("'Clutter' in (writing about) science writing", 8/30/2013), you'll recall that Dr. O-K goes on to count adjectives and adverbs in some word lists from samples of scientific writing. He asserts that "social science" writing uses about 15% more adjectives and adverbs than "natural science" writing — although he doesn't tell us enough about his methods to dispel concerns about several likely sources of artifact — and he concludes by asking "Is there a reason that a social scientist cannot write as clearly as a natural scientist?"

In the interests of science of all kinds, I decided to devote this morning's Breakfast Experiment™ to the relations between text quality and the proportion of adjectives and adverbs. I wrote a python script using NLTK to calculate the proportions of various parts of speech in a document; and then I tried this script out on samples of various sorts of writing. Here's some of what I found.
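A rough idea of what such a count involves, as a minimal sketch rather than the script actually used for this post: it assumes Penn Treebank tags (the kind `nltk.pos_tag` returns), groups them into coarse classes by tag prefix, and leaves punctuation out of the denominator. The names `coarse_class`, `pos_proportions`, and `tag_text` are made up for illustration, and the grouping of tags is an assumption.

```python
from collections import Counter

# Coarse classes keyed by Penn Treebank tag prefix (an assumed grouping).
PREFIX_TO_CLASS = {"JJ": "ADJ", "RB": "ADV", "NN": "NOUN", "VB": "VERB"}
CLASSES = ("ADJ", "ADV", "NOUN", "VERB", "OTHER")


def coarse_class(tag):
    """Map a Penn Treebank tag such as 'JJR' or 'VBZ' to a coarse class."""
    return PREFIX_TO_CLASS.get(tag[:2], "OTHER")


def pos_proportions(tagged_tokens):
    """Return each coarse class's share of the word tokens.

    tagged_tokens is a list of (word, tag) pairs. Tokens with no
    letters or digits (punctuation) are left out of the denominator.
    """
    words = [(w, t) for w, t in tagged_tokens if any(c.isalnum() for c in w)]
    counts = Counter(coarse_class(t) for _, t in words)
    total = len(words)
    return {c: (counts[c] / total if total else 0.0) for c in CLASSES}


def tag_text(text):
    """Tag raw text with NLTK (needs nltk plus its tokenizer and tagger data)."""
    import nltk  # imported lazily so the counting code runs without NLTK
    return nltk.pos_tag(nltk.word_tokenize(text))


# Hand-tagged sample, so this runs without NLTK installed.
sample = [("The", "DT"), ("very", "RB"), ("quick", "JJ"),
          ("fox", "NN"), ("jumps", "VBZ"), (".", ".")]
props = pos_proportions(sample)
print(props["ADJ"] + props["ADV"])
```

With NLTK and its data installed, `pos_proportions(tag_text(document))` would give the same breakdown for a real document. Whether punctuation, numbers, or wh-adverbs belong in the counts is a judgment call, and different choices will shift the proportions somewhat.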


"Did you mean: 艺轩国"

Searching for a Chinese name in my gmail archive this morning, I was interested to see that Helpful Google is now transliterating between pinyin and hanzi:

I didn't mean 艺轩国, as it happens, but it's nice to know that if that's what I had wanted, gmail would have been ready to help.

The message

This year's Penn Reading Project book is Adam Bradley's Book of Rhymes: The Poetics of Hip Hop.  In my discussion group yesterday afternoon, several participants complained that some important things about the "poetics" of rap are lost in a purely textual presentation of the lyrics. One student observed that in pieces he knows, the rhythm is there in the written form — but the lyrics for pieces that he doesn't know seem flat and lifeless in comparison.

There are good reasons that this is more true for the works of Melle Mel or Jay Z than for Elizabeth Barrett Browning or W.H. Auden, I think.

One of the advantages of the weblog format is the combination of text, images, and audio or video clips, so for this morning's Breakfast Experiment™ I decided to present a small exploration of the "poetics of hip hop" in a multimedia — and somewhat quantitative — framework.

This exercise will clarify why transcriptions of the lyrics, even with bold-face indications of stress, are missing an important dimension. The lines' scansion depends not only on the syllable sequence and on where the performer puts phrasal stresses, but also on the alignment of the syllables with the musical meter. This alignment is not automatic or always obvious — it has artistically-relevant degrees of freedom beyond those available in most other genres of text setting.

For those whose appraisal of Bradley's book was (interpreting freely) "not enough vampires and car chases", this will probably make things worse — you have been warned.


The culturomic psychology of urbanization

Patricia Greenfield, "The Changing Psychology of Culture From 1800 Through 2000", Psychological Science 2013 (pdf):

The Google Books Ngram Viewer allows researchers to quantify culture across centuries by searching millions of books. This tool was used to test theory-based predictions about implications of an urbanizing population for the psychology of culture. Adaptation to rural environments prioritizes social obligation and duty, giving to other people, social belonging, religion in everyday life, authority relations, and physical activity. Adaptation to urban environments requires more individualistic and materialistic values; such adaptation prioritizes choice, personal possessions, and child-centered socialization in order to foster the development of psychological mindedness and the unique self. The Google Ngram Viewer generated relative frequencies of words indexing these values from the years 1800 to 2000 in American English books. As urban populations increased and rural populations declined, word frequencies moved in the predicted directions. Books published in the United Kingdom replicated this pattern. The analysis established long-term relationships between ecological change and cultural change, as predicted by the theory of social change and human development (Greenfield, 2009).


Linguistic Diversity and Traffic Accidents

An important new paper (Sean Roberts & James Winters, "Linguistic Diversity and Traffic Accidents: Lessons from Statistical Studies of Cultural Traits", PLOS ONE 2013) is explained clearly in a blog post by one of the authors, "Uncovering spurious correlations between language and culture", a replicated typo, 8/15/2013:

James and I have a new paper out in PLOS ONE where we demonstrate a whole host of unexpected correlations between cultural features. These include acacia trees and linguistic tone, morphology and siestas, and traffic accidents and linguistic diversity.

We hope it will be a touchstone for discussing the problems with analysing cross-cultural statistics, and a warning not to take all correlations at face value.  It’s becoming increasingly important to understand these issues, both for researchers as more data becomes available, and for the general public as they read more about these kinds of study in the media (e.g. recent coverage in National Geographic, the BBC and TED).


Words, letters, and an unusual Scrabble turn

Last month, I taught a short course on "Corpus-based Linguistic Research" at the LSA Institute in Ann Arbor, in which the participants were asked to do individual projects. One of the undergraduates in the class, Alex R., undertook to examine the time-course of variability in English spelling, starting with the Paston Letters, which are "a collection of letters and papers consisting of the correspondence of members of the Paston family of Norfolk gentry, and others connected with them in England, between the years 1422 and 1509".


Narcissism in Emerging Adulthood

Yesterday the NYT had a feature on Jean Twenge's work — Douglas Quenqua, "Seeing Narcissists Everywhere", 8/5/2013 — and also "A Back and Forth About Narcissism":

The social-science journal Emerging Adulthood recently invited Jean M. Twenge and one of her most prominent critics, Jeffrey Arnett, to debate “whether today’s emerging adults are excessively ‘narcissistic,’ ” as Dr. Twenge asserts. Both wrote papers outlining their positions, then each wrote a reply to the other.
