Archive for Computational linguistics

"… repeated violations of an act"

Brian Mahoney, "NBA Sets Flopping Penalties; Players May Be Fined", AP 10/3/2012:

Stop the flop.

The NBA will penalize floppers this season, fining players for repeated violations of an act a league official said Wednesday has "no place in our game."

Those exaggerated falls to the floor may fool the referees and fans during the game, but officials at league headquarters plan to take a look for themselves afterward.

Read the rest of this entry »

Comments (21)

Lexical loops

David Levary, Jean-Pierre Eckmann, Elisha Moses, and Tsvi Tlusty, "Loops and Self-Reference in the Construction of Dictionaries", Phys. Rev. X 2, 031018 (2012):

ABSTRACT: Dictionaries link a given word to a set of alternative words (the definition) which in turn point to further descendants. Iterating through definitions in this way, one typically finds that definitions loop back upon themselves. We demonstrate that such definitional loops are created in order to introduce new concepts into a language. In contrast to the expectations for a random lexical network, in graphs of the dictionary, meaningful loops are quite short, although they are often linked to form larger, strongly connected components. These components are found to represent distinct semantic ideas. This observation can be quantified by a singular value decomposition, which uncovers a set of conceptual relationships arising in the global structure of the dictionary. Finally, we use etymological data to show that elements of loops tend to be added to the English lexicon simultaneously and incorporate our results into a simple model for language evolution that falls within the “rich-get-richer” class of network growth.
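The paper's loop-finding can be illustrated schematically: build a directed graph from each headword to the words in its definition, then extract the strongly connected components; the multi-word components are the definitional loops. The mini-dictionary below is invented for the example, and this is of course not the authors' code:

```python
# Toy illustration of definitional loops: headword -> words in its definition.
# The mini-dictionary is invented; real dictionaries have ~10^5 headwords.
mini_dict = {
    "big":    ["large"],
    "large":  ["big"],
    "small":  ["little"],
    "little": ["small"],
    "dog":    ["animal"],
    "animal": ["creature"],
}

def strongly_connected_components(graph):
    """Tarjan's algorithm over a dict-of-lists digraph."""
    index_counter = [0]
    stack, lowlink, index, on_stack = [], {}, {}, set()
    result = []

    def strongconnect(node):
        index[node] = lowlink[node] = index_counter[0]
        index_counter[0] += 1
        stack.append(node)
        on_stack.add(node)
        for succ in graph.get(node, []):
            if succ not in index:
                strongconnect(succ)
                lowlink[node] = min(lowlink[node], lowlink[succ])
            elif succ in on_stack:
                lowlink[node] = min(lowlink[node], index[succ])
        if lowlink[node] == index[node]:   # root of a component
            component = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                component.append(w)
                if w == node:
                    break
            result.append(component)

    for node in list(graph):
        if node not in index:
            strongconnect(node)
    return result

# Components with more than one word are definitional loops.
loops = [c for c in strongly_connected_components(mini_dict) if len(c) > 1]
print(sorted(sorted(c) for c in loops))  # [['big', 'large'], ['little', 'small']]
```

On this toy graph the loops are exactly the mutually-defined pairs; in the paper, such short loops link up into the larger strongly connected components that the authors interpret as semantic ideas.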

Read the rest of this entry »

Comments (22)

Historical culturomics of pronoun frequencies

Jean M. Twenge, W. Keith Campbell, and Brittany Gentile, "Male and Female Pronoun Use in U.S. Books Reflects Women’s Status, 1900–2008", Sex Roles, published online 8/7/2012. The abstract:

The status of women in the United States varied considerably during the 20th century, with increases 1900–1945, decreases 1946–1967, and considerable increases after 1968. We examined whether changes in written language, especially the ratio of male to female pronouns, reflected these trends in status in the full text of nearly 1.2 million U.S. books 1900–2008 from the Google Books database. Male pronouns included he, him, his, himself and female pronouns included she, her, hers, and herself. Between 1900 and 1945, 3.5 male pronouns appeared for every female pronoun, increasing to 4.5 male pronouns during the postwar era of the 1950s and early 1960s. After 1968, the ratio dropped precipitously, reaching 2 male pronouns per female pronoun by the 2000s. From 1968 to 2008, the use of male pronouns decreased as female pronouns increased. The gender pronoun ratio was significantly correlated with indicators of U.S. women’s status such as educational attainment, labor force participation, and age at first marriage as well as women’s assertiveness, a personality trait linked to status. Books used relatively more female pronouns when women’s status was high and fewer when it was low. The results suggest that cultural products such as books mirror U.S. women’s status and changing trends in gender equality over the generations.
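The quantity the authors track can be sketched in a few lines: total male-pronoun tokens divided by total female-pronoun tokens in a given year's text. The yearly counts below are invented for illustration (chosen only to echo the magnitudes quoted in the abstract) and are not the study's data:

```python
# Pronoun sets from the abstract.
MALE = {"he", "him", "his", "himself"}
FEMALE = {"she", "her", "hers", "herself"}

def pronoun_ratio(counts):
    """counts: dict mapping a token to its frequency in one year's text."""
    male = sum(counts.get(w, 0) for w in MALE)
    female = sum(counts.get(w, 0) for w in FEMALE)
    return male / female if female else float("inf")

# Hypothetical yearly token counts (invented for the example).
year_counts = {
    1950: {"he": 900, "his": 800, "him": 300, "she": 250, "her": 190},
    2005: {"he": 700, "his": 600, "him": 250, "she": 420, "her": 350},
}
for year, counts in sorted(year_counts.items()):
    print(year, round(pronoun_ratio(counts), 2))  # ~4.55 in 1950, ~2.01 in 2005
```

With real data the counts would come from the Google Books ngram files, summed per year, rather than from hand-entered dictionaries.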

Read the rest of this entry »

Comments (20)

Noisily channeling Claude Shannon

There's a passage in James Gleick's "Auto Crrect Ths!", NYT 8/4/2012, that's properly spelled but in need of some content correction:

If you type “kofee” into a search box, Google would like to save a few milliseconds by guessing whether you’ve misspelled the caffeinated beverage or the former United Nations secretary-general. It uses a probabilistic algorithm with roots in work done at AT&T Bell Laboratories in the early 1990s. The probabilities are based on a “noisy channel” model, a fundamental concept of information theory. The model envisions a message source — an idealized user with clear intentions — passing through a noisy channel that introduces typos by omitting letters, reversing letters or inserting letters.

“We’re trying to find the most likely intended word, given the word that we see,” Mr. [Mark] Paskin says. “Coffee” is a fairly common word, so with the vast corpus of text the algorithm can assign it a far higher probability than “Kofi.” On the other hand, the data show that spelling “coffee” with a K is a relatively low-probability error. The algorithm combines these probabilities. It also learns from experience and gathers further clues from the context.

The same probabilistic model is powering advances in translation and speech recognition, comparable problems in artificial intelligence. In a way, to achieve anything like perfection in one of these areas would mean solving them all; it would require a complete model of human language. But perfection will surely be impossible. We’re individuals. We’re fickle; we make up words and acronyms on the fly, and sometimes we scarcely even know what we’re trying to say.
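The noisy-channel computation Paskin describes can be sketched very compactly: choose the candidate word that maximizes the prior probability of the word times the probability that the channel would corrupt it into the observed string. The probabilities below are invented for illustration; this is a textbook sketch, not Google's actual algorithm:

```python
# Minimal noisy-channel spelling correction:
#   argmax over candidates w of  P(w) * P(observed | w)
def correct(observed, language_model, error_model):
    return max(
        language_model,
        key=lambda w: language_model[w] * error_model.get((observed, w), 0.0),
    )

# Hypothetical unigram priors: "coffee" is far more common than "Kofi".
language_model = {"coffee": 1e-4, "Kofi": 1e-7}
# Hypothetical channel model P(typed "kofee" | intended word): spelling
# "coffee" with a K is a low-probability error, but the prior still wins.
error_model = {("kofee", "coffee"): 1e-3, ("kofee", "Kofi"): 1e-2}

print(correct("kofee", language_model, error_model))  # coffee
```

Here the channel actually favors "Kofi" (1e-2 vs. 1e-3), but the thousand-fold difference in the priors overwhelms it, which is exactly the trade-off described in the quoted passage.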

Read the rest of this entry »

Comments (7)

It's all about who?

Sharon Jayson, "What's on Americans' mind? Increasingly, 'me'", USA Today 7/10/2012:

An analysis of words and phrases in more than 750,000 American books published in the past 50 years finds an emphasis on "I" before "we" — showing growing attention to the individual over the group.

This is actually true as stated. If we take the counts from the "American English" unigram dataset in the Google Books ngram collection, and extract the year-by-year counts for the letter strings in question, the frequency of "I" has increased relative to the frequency of "we" over the period since 1960 — to the point where the ratio of frequencies is almost as high as it was in 1900:

Read the rest of this entry »

Comments (11)

Geo-political agency

In a couple of earlier posts, I noted a gradual change in the tendency of American newspapers and U.S. Supreme Court opinions to use the phrase "the United States" as a syntactic subject ("The United States as a subject", 10/6/2009; "'The United States' as a subject at the Supreme Court", 10/20/2009). Thus in a small sample of instances of "the United States" in SCOTUS opinions from each of six years between 1800 and 2000, the percentage of instances in subject position increased from 1.8% to 19%:

YEAR   Rate per 100
1800   1.8
1810   3.5
1850   7
1900   7
1950   12
2000   19

It's now possible to parse unrestricted text automatically but fairly accurately, and I expect to see large collections of automatically-parsed text become generally available soon (see e.g. Courtney Napoles, Matthew Gormley, and Benjamin Van Durme, "Annotated Gigaword",  Proc. of the Joint Workshop on Automatic Knowledge Base Construction & Web-scale Knowledge Extraction, ACL-HLT 2012).  And I was recently trying to persuade some colleagues that parsing a large historical books collection would be a Good Thing, even for people who aren't interested in syntactic structure per se. So for this morning's Breakfast Experiment™, I decided to take a look at the proportion of subject positioning for three country names in three geographically diverse news sources.
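The tabulation behind rates like these can be sketched schematically, assuming the dependency parses are already in hand: count what fraction of the mentions of a phrase stand in a subject (nsubj) relation to their verb. The toy "parsed" records below are invented, loosely echoing the 1800 and 2000 rates in the table above:

```python
# Given the grammatical relation of each occurrence of a phrase,
# compute the percentage of occurrences in subject position.
def subject_rate(mentions):
    """mentions: list of dependency relations, one per occurrence."""
    subjects = sum(1 for rel in mentions if rel == "nsubj")
    return 100.0 * subjects / len(mentions)

# Hypothetical relations for mentions of "the United States" in two years:
# mostly "of/in the United States" (prepositional object) early on.
parsed = {
    1800: ["pobj"] * 54 + ["nsubj"] * 1,
    2000: ["pobj"] * 42 + ["nsubj"] * 10,
}
for year, mentions in sorted(parsed.items()):
    print(year, round(subject_rate(mentions), 1))  # 1800 ~1.8, 2000 ~19.2
```

In practice the relation labels would come from an automatic parser run over the historical collection, which is exactly why parsed corpora would be useful even to people who don't care about syntax for its own sake.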

Read the rest of this entry »

Comments (2)

Textual narcissism, replication 2

Yesterday, I tried replicating one of the experiments in Jean M. Twenge et al., "Increases in Individualistic Words and Phrases in American Books, 1960–2008", PLoS One 7/10/2012, and got results that seem to be significantly at variance with their conclusions ("Textual narcissism", 7/13/2012).

This morning, I thought I'd try getting a replication with word counts from a different source of historical data.  I used the Corpus of Historical American English (Mark Davies, The Corpus of Historical American English: 400 million words, 1810–2009, 2010). Some of the problems with the Google Books source are removed here: the COHA collection is balanced by genre, and a detailed list of its 107,000 sources is available.

And the results remain hard to square with Twenge et al.'s main conclusion, which they expressed like this:

This study demonstrates that language use in books reflects increasing individualism in the U.S. since 1960. Language use in books reflects the larger cultural ethos, and that ethos has been increasingly characterized by a focus on the self and uniqueness.

Read the rest of this entry »

Comments (7)

Textual narcissism

Tyler Cowen, "I wonder if this is actually true", Marginal Revolution 7/12/2012:

Researchers who have scanned books published over the past 50 years report an increasing use of words and phrases that reflect an ethos of self-absorption and self-satisfaction.

"Language in American books has become increasingly focused on the self and uniqueness in the decades since 1960,” a research team led by San Diego State University psychologist Jean Twenge writes in the online journal PLoS One. “We believe these data provide further evidence that American culture has become increasingly focused on individualistic concerns.”

Their results are consistent with those of a 2011 study which found that lyrics of best-selling pop songs have grown increasingly narcissistic since 1980. Twenge’s study encompasses a longer period of time—1960 through 2008—and a much larger set of data.

That 2011 study was not very convincing — for details, see "Lyrical Narcissism?", 4/9/2011; "'Vampirical' hypotheses", 4/28/2011; "Pop-culture narcissism again", 4/30/2011;  "Let me count the ways", 6/9/2011.

On the face of it, however, the new study (Jean M. Twenge, W. Keith Campbell, and Brittany Gentile, "Increases in Individualistic Words and Phrases in American Books, 1960–2008", PLoS One 7/10/2012) looks more plausible. But I thought  that for this morning's Breakfast Experiment™ I'd take a closer look. And what I found diverges pretty seriously from the conclusions of the cited paper.

Read the rest of this entry »

Comments (22)

Not raising hogs

Following on from Barbara Partee's example of Khrushchev not banging his shoe, I just came across a great example of chained hypothetical negative events. It was during Bonnie Webber's plenary address here in Austin yesterday, at the NASSLLI Summer School. (BTW, if you'll be in the Austin area on Saturday, I have an announcement for you: NASSLLI is hosting a big event commemorating the centenary of Turing's birth, and it's free and open to the public.) But without further ado, here's the "Not raising hogs" text, a good Texas story of how to get something from nothing:

THE NOT RAISING HOGS BUSINESS

To: Mr. Clayton Yeutter
Secretary of Agriculture
Washington, D.C.

Dear Sir,
My friends, Wayne and Janelle, over at Wichita Falls, Texas, received a check the other day for $1,000 from the government for not raising hogs. So, I want to go into the "not raising hogs" business myself next year.


Read the rest of this entry »

Comments (24)

Your typical sentence

Today's xkcd:

Mouseover title: Although the Markov chain-style text model is still rudimentary; it recently gave me "Massachusetts Institute of America". Although I have to admit it sounds prestigious.
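For readers wondering what a "Markov chain-style text model" of the sort in the mouseover looks like, here is a minimal bigram sketch: from each word, emit one of the words observed to follow it in the training text. The training sentence is invented for the example:

```python
import random

def train(text):
    """Build a bigram model: word -> list of observed next words."""
    words = text.split()
    model = {}
    for a, b in zip(words, words[1:]):
        model.setdefault(a, []).append(b)
    return model

def generate(model, start, length, seed=0):
    """Random-walk the model from `start` for up to `length` words."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length - 1):
        successors = model.get(out[-1])
        if not successors:
            break
        out.append(rng.choice(successors))
    return " ".join(out)

model = train("the institute of technology the institute of america")
print(generate(model, "the", 4, seed=1))
```

Because the model only knows which word can follow which, it happily splices "institute of" onto either continuation, which is how phrases like "Massachusetts Institute of America" come about.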

Read the rest of this entry »

Comments (11)

Big Inaccessible Data

John Markoff, "Troves of Personal Data, Forbidden to Researchers", NYT 5/21/2012:

When scientists publish their research, they also make the underlying data available so the results can be verified by other scientists.

(I wish this were generally true…)

At least that is how the system is supposed to work. But lately social scientists have come up against an exception that is, true to its name, huge.

It is “big data,” the vast sets of information gathered by researchers at companies like Facebook, Google and Microsoft from patterns of cellphone calls, text messages and Internet clicks by millions of users around the world. Companies often refuse to make such information public, sometimes for competitive reasons and sometimes to protect customers’ privacy. But to many scientists, the practice is an invitation to bad science, secrecy and even potential fraud.

For those who don't care much about science, and oppose data publication on the basis of some combination of beliefs in corporate secrecy, personal privacy, and researchers' "sweat equity", here's a stronger argument: lack of broad access to representative data is also a recipe for bad engineering.  Or rather, it's a recipe for slow to non-existent development of workable solutions to the technical problems of turning recorded data into useful information.

At the recent DataEDGE workshop in Berkeley, as well as at the recent LREC 2012 conference in Istanbul, I was unpleasantly surprised by the widespread lack of awareness of this (in my opinion evident) fact.

Read the rest of this entry »

Comments (7)

Big Data in the humanities and social sciences

I'm in Berkeley for the DataEDGE Conference, where I'm due to participate in a "living room chat" advertised as follows:

Size Matters: Big Data, New Vistas in the Humanities and Social Sciences
Mark Liberman, Geoffrey Nunberg, Matthew Salganik
Vast archives of digital text, speech, and video, along with new analysis technology and inexpensive computation, are the modern equivalent of the 17th-century invention of the telescope and microscope. We can now observe social and linguistic patterns in space, time, and cultural context, on a scale many orders of magnitude greater than in the recent past, and in much greater detail than before. This transforms not just the study of speech, language, and communication but fields ranging from sociology and empirical economics to education, history, and medicine — with major implications for both scholarship and technology development.

Read the rest of this entry »

Comments (22)

Help Wanted: Sharing Data for Research on Reading and Writing

On Friday, July 20, at the 2012 meeting of the Council of Writing Program Administrators in Albuquerque NM, there will be a session called "Help Wanted: Sharing Data for Research on Reading and Writing".  Here's the proposal that was submitted for this session:

Read the rest of this entry »

Comments (5)