Language Log

Archive for Computational linguistics

"Did you mean: 艺轩国"

September 3, 2013 @ 10:13 am· Filed by Mark Liberman under Computational linguistics

Searching for a Chinese name in my gmail archive this morning, I was interested to see that Helpful Google is now transliterating between pinyin and hanzi:

I didn't mean 艺轩国, as it happens, but it's nice to know that if that's what I had wanted, gmail would have been ready to help.

Permalink Comments (5)

The message

August 27, 2013 @ 7:14 am· Filed by Mark Liberman under Computational linguistics, Language and culture, Prosody

This year's Penn Reading Project book is Adam Bradley's Book of Rhymes: The Poetics of Hip Hop. In my discussion group yesterday afternoon, several participants complained that some important things about the "poetics" of rap are lost in a purely textual presentation of the lyrics. One student observed that in pieces he knows, the rhythm is there in the written form — but the lyrics for pieces that he doesn't know seem flat and lifeless in comparison.

There are good reasons that this is more true for the works of Melle Mel or Jay Z than for Elizabeth Barrett Browning or W.H. Auden, I think.

One of the advantages of the weblog format is the combination of text, images, and audio or video clips, so for this morning's Breakfast Experiment™ I decided to present a small exploration of the "poetics of hip hop" in a multimedia — and somewhat quantitative — framework.

This exercise will clarify why transcriptions of the lyrics, even with bold-face indications of stress, are missing an important dimension. The lines' scansion depends not only on the syllable sequence and on where the performer puts phrasal stresses, but also on the alignment of the syllables with the musical meter. This alignment is not automatic or always obvious — it has artistically-relevant degrees of freedom beyond those available in most other genres of text setting.

For those whose appraisal of Bradley's book was (interpreting freely) "not enough vampires and car chases", this will probably make things worse — you have been warned.

Permalink Comments (20)

The culturomic psychology of urbanization

August 18, 2013 @ 8:50 am· Filed by Mark Liberman under Computational linguistics, Language and culture

Patricia Greenfield, "The Changing Psychology of Culture From 1800 Through 2000", Psychological Science 2013 (pdf):

The Google Books Ngram Viewer allows researchers to quantify culture across centuries by searching millions of books. This tool was used to test theory-based predictions about implications of an urbanizing population for the psychology of culture. Adaptation to rural environments prioritizes social obligation and duty, giving to other people, social belonging, religion in everyday life, authority relations, and physical activity. Adaptation to urban environments requires more individualistic and materialistic values; such adaptation prioritizes choice, personal possessions, and child-centered socialization in order to foster the development of psychological mindedness and the unique self. The Google Ngram Viewer generated relative frequencies of words indexing these values from the years 1800 to 2000 in American English books. As urban populations increased and rural populations declined, word frequencies moved in the predicted directions. Books published in the United Kingdom replicated this pattern. The analysis established long-term relationships between ecological change and cultural change, as predicted by the theory of social change and human development (Greenfield, 2009).

Permalink Comments (19)

Linguistic Diversity and Traffic Accidents

August 15, 2013 @ 2:02 pm· Filed by Mark Liberman under Computational linguistics, Language and culture

An important new paper (Sean Roberts & James Winters, "Linguistic Diversity and Traffic Accidents: Lessons from Statistical Studies of Cultural Traits", PLOS ONE 2013) is explained clearly in a blog post by one of the authors, "Uncovering spurious correlations between language and culture", a replicated typo 8/15/2013:

James and I have a new paper out in PLOS ONE where we demonstrate a whole host of unexpected correlations between cultural features. These include acacia trees and linguistic tone, morphology and siestas, and traffic accidents and linguistic diversity.

We hope it will be a touchstone for discussing the problems with analysing cross-cultural statistics, and a warning not to take all correlations at face value. It’s becoming increasingly important to understand these issues, both for researchers as more data becomes available, and for the general public as they read more about these kinds of study in the media (e.g. recent coverage in National Geographic, the BBC and TED).

Permalink Comments (19)

Words, letters, and an unusual Scrabble turn

August 8, 2013 @ 9:37 am· Filed by Mark Liberman under Computational linguistics

Last month, I taught a short course on "Corpus-based Linguistic Research" at the LSA Institute in Ann Arbor, in which the participants were asked to do individual projects. One of the undergraduates in the class, Alex R., undertook to examine the time-course of variability in English spelling, starting with the Paston Letters, which are "a collection of letters and papers consisting of the correspondence of members of the Paston family of Norfolk gentry, and others connected with them in England, between the years 1422 and 1509".

Permalink Comments (35)

Narcissism in Emerging Adulthood

August 6, 2013 @ 4:37 am· Filed by Mark Liberman under Computational linguistics

Yesterday the NYT had a feature on Jean Twenge's work — Douglas Quenqua, "Seeing Narcissists Everywhere", 8/5/2013 — and also "A Back and Forth About Narcissism":

The social-science journal Emerging Adulthood recently invited Jean M. Twenge and one of her most prominent critics, Jeffrey Arnett, to debate “whether today’s emerging adults are excessively ‘narcissistic,’ ” as Dr. Twenge asserts. Both wrote papers outlining their positions, then each wrote a reply to the other.

Permalink Comments (8)

More on Juola's stylometry

July 29, 2013 @ 6:22 am· Filed by Geoffrey K. Pullum under Computational linguistics, Language and technology, Style and register, Writing

Worth reading if you were interested in the computational stylometric analysis by Patrick Juola that helped to unmask J. K. Rowling as the author of The Cuckoo's Calling: an article in The Chronicle of Higher Education about Juola's work.

Permalink Comments off

The fruits of your labors

July 26, 2013 @ 6:01 am· Filed by Mark Liberman under Computational linguistics

At the recent Language Diversity Congress in Groningen, one of many interesting presentations was Martijn Wieling and John Nerbonne's "Inducing and using phonetic similarity". More than a thousand LL readers played a role in the creation of this work, by responding to a request back in May ("Rating American English Accents", 5/19/2012) to participate in an online experiment.

Permalink Comments (7)

Rowling and "Galbraith": an authorial analysis

July 16, 2013 @ 7:35 am· Filed by Ben Zimmer under Computational linguistics, Language and technology, Linguistics in the news

The Sunday (UK) Times recently revealed that J.K. Rowling wrote the detective novel The Cuckoo's Calling under the pen name Robert Galbraith. The newspaper explained that, as part of their investigation, they sought the assistance of two scholars who have developed software to help with authorship attribution: Peter Millican of Oxford University and Patrick Juola of Duquesne University. Given the public interest in the Rowling revelation, I asked Patrick to write a guest post describing the authorial analysis that he conducted. (For more on the story, see my post on the Wall Street Journal's Speakeasy blog.)

Permalink Comments (17)

American Passivity

July 15, 2013 @ 8:12 am· Filed by Mark Liberman under Computational linguistics, Linguistic history

This is an illustrative Breakfast Experiment™ for my course at the LSA Institute (on "Corpus-Based Linguistic Research"). It starts from an earlier LL post, "When men were men, and verbs were passive", 8/4/2006, where I observed that Winston Churchill, often cited as a model of forceful eloquence, used the passive voice for 30-50% of his verbs in various passages from his 1899 memoir The River War — several times the rate noted in statistical usage studies from the 1960s and later.

So I thought I'd do a quick historical survey of passive-voice rates, as an example of what can be done with Mark Davies' COHA corpus.

Permalink Comments (11)

That howling void of thoughtlessness beneath

July 10, 2013 @ 7:49 am· Filed by Mark Liberman under Computational linguistics

From Charles Stross, Neptune's Brood. It's 7000 AD, and Krina Alizond-114 has this to say about a not-very-helpful piece of interactive software:

[T]hese things bore only a thin veneer of intelligence: Once you crack the ice and tumble into the howling void of thoughtlessness beneath, the illusion ceases to be comforting and becomes a major source of irritation.

Permalink Comments (18)

Weird languages?

July 2, 2013 @ 1:49 pm· Filed by Mark Liberman under Computational linguistics

Tyler Schnoebelen, "The Weirdest Languages", Idibon blog 6/21/2013:

The World Atlas of Language Structures evaluates 2,676 different languages in terms of a bunch of different language features. These features include word order, types of sounds, ways of doing negation, and a lot of other things—192 different language features in total. […]

The data in WALS is fairly sparse, so we restrict ourselves to the 165 features that have at least 100 languages in them (at this stage we also knock out languages that have fewer than 10 of these—dropping us down to 1,693 languages).

Now, one problem is that if you just stop there you have a huge amount of collinearity. Part of this is just the nature of the features listed in WALS—there’s one for overall subject/object/verb order and then separate ones for object/verb and subject/verb. Ideally, we’d like to judge weirdness based on unrelated features. We can focus in on features that aren’t strongly correlated with each other (between two correlated features, we pick the one that has more languages coded for it). We end up with 21 features in total.
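For readers who want to see the shape of that filtering procedure, here is a minimal sketch in Python. It is not Schnoebelen's code. The data layout (a dict mapping each language to its coded features), the function names, the use of Cramér's V as the measure of association, and the 0.5 cutoff are all assumptions made for illustration; the quoted passage gives only the thresholds of 100 languages per feature and 10 features per language, and says that the better-covered feature of a correlated pair is the one kept.

```python
from collections import Counter
from itertools import combinations
import math


def cramers_v(pairs):
    """Cramér's V for a list of (value_a, value_b) categorical pairs.

    Assumption: the source does not name its correlation measure;
    Cramér's V is used here only as a common choice for categorical data.
    """
    n = len(pairs)
    rows = Counter(a for a, _ in pairs)
    cols = Counter(b for _, b in pairs)
    k = min(len(rows), len(cols))
    if n == 0 or k < 2:
        return 0.0  # no variation to compare
    joint = Counter(pairs)
    chi2 = 0.0
    for a, row_total in rows.items():
        for b, col_total in cols.items():
            expected = row_total * col_total / n
            observed = joint.get((a, b), 0)
            chi2 += (observed - expected) ** 2 / expected
    return math.sqrt(chi2 / (n * (k - 1)))


def filter_features(data, min_langs=100, min_feats=10, max_assoc=0.5):
    """Apply the three filtering steps described in the quoted passage.

    data: hypothetical layout {language: {feature: value}}.
    max_assoc: hypothetical cutoff for "strongly correlated".
    Returns (filtered languages, set of retained feature names).
    """
    # Step 1: keep features coded for at least min_langs languages.
    coverage = Counter(f for feats in data.values() for f in feats)
    features = {f for f, c in coverage.items() if c >= min_langs}

    # Step 2: drop languages coded for fewer than min_feats of those features.
    langs = {}
    for lang, feats in data.items():
        kept_feats = {f: v for f, v in feats.items() if f in features}
        if len(kept_feats) >= min_feats:
            langs[lang] = kept_feats

    # Step 3: for each strongly associated pair, drop the feature
    # that has fewer languages coded for it.
    coverage = Counter(f for feats in langs.values() for f in feats)
    kept = set(features)
    for f, g in combinations(sorted(features), 2):
        if f not in kept or g not in kept:
            continue
        pairs = [(feats[f], feats[g]) for feats in langs.values()
                 if f in feats and g in feats]
        if cramers_v(pairs) > max_assoc:
            kept.discard(f if coverage[f] < coverage[g] else g)
    return langs, kept
```

Called as `filter_features(data)` on a WALS-like table, this returns the surviving languages and features. The pruning in step 3 is greedy and depends on the order in which pairs are visited, so a different ordering or association measure could leave a somewhat different feature set.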

Permalink Comments (57)

City of the big disjunctions

June 29, 2013 @ 5:36 am· Filed by Mark Liberman under Computational linguistics

Continuing in another connection with the exploration of real-estate listings that I discussed earlier ("Long is good, good is bad, nice is worse, and ! is questionable", 6/12/2013; "Significant (?) relationships everywhere", 6/14/2013), I stumbled on this curious factoid about the use of and and or in trulia.com's listings for the ten cities I've harvested so far: