Archive for Computational linguistics

Literary moist aversion

Over the years, we've viewed the phenomenon of word aversion from several angles — a recent discussion, with links to earlier posts, can be found here. What we're calling word aversion is a feeling of intense, irrational distaste for the sound or sight of a particular word or phrase, not because its use is regarded as etymologically or logically or grammatically wrong, nor because it's felt to be over-used or redundant or trendy or non-standard, but simply because the word itself somehow feels unpleasant or even disgusting.

Some people react in this way to words whose offense seems to be entirely phonetic: cornucopia, hardscrabble, pugilist, wedge, whimsy. In other cases, it's plausible that some meaning-related associations play a role: creamy, panties, ointment, tweak. Overall, the commonest object of word aversion in English, judging from many discussions in web forums and comments sections, is moist.

One problem with web forums and comments sections as sources of evidence is that they don't tell us what fraction of the population experiences the phenomenon of word aversion, either in general or with respect to some particular word like moist. Dozens of commenters may join the discussion in a forum that has at most thousands of readers, but we can't tell whether they represent one person in five or one person in a hundred; nor do we know how representative of the general population a given forum or comments section is.

Pending other approaches, it occurred to me that we might be able to learn something from looking at usage in literary works. Authors who are squicked by moist, for example, will plausibly tend to find alternatives. (Well, in some cases the effect might motivate over-use; but never mind that for now…)

So for this morning's Breakfast Experiment™, I downloaded the April 2010 Project Gutenberg DVD, and took a quick look.
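The "quick look" amounts to computing how often a target word occurs per million running words in each author's texts. A minimal sketch of that kind of rate check (the sample sentence is illustrative, not from the Gutenberg data):

```python
import re

def word_rate(text, target="moist", per=1_000_000):
    """Occurrences of `target` per `per` running words of text."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    return per * sum(1 for w in words if w == target) / len(words)

sample = "The moist earth smelled of rain. Nothing else was moist."
rate = word_rate(sample)   # 2 hits in 10 words -> 200,000 per million
```

Comparing such rates across authors (against a baseline rate for the whole corpus) is one way to spot writers who seem to be avoiding the word.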

Read the rest of this entry »

Comments (27)

Translation as cryptography as translation

Warren Weaver, 1947 letter to Norbert Wiener, quoted in "Translation", 1949:

[K]nowing nothing official about, but having guessed and inferred considerable about, powerful new mechanized methods in cryptography – methods which I believe succeed even when one does not know what language has been coded – one naturally wonders if the problem of translation could conceivably be treated as a problem in cryptography.

Mark Brown, "Modern Algorithms Crack 18th Century Secret Code", Wired UK 10/26/2011:

Computer scientists from Sweden and the United States have applied modern-day, statistical translation techniques — the sort of which are used in Google Translate — to decode a 250-year-old secret message.

The original document, nicknamed the Copiale Cipher, was written in the late 18th century and found in the East Berlin Academy after the Cold War. It’s since been kept in a private collection, and the 105-page, slightly yellowed tome has withheld its secrets ever since.

But this year, University of Southern California Viterbi School of Engineering computer scientist Kevin Knight — an expert in translation, not so much in cryptography — and colleagues Beáta Megyesi and Christiane Schaefer of Uppsala University in Sweden tracked down the document, transcribed a machine-readable version and set to work cracking the centuries-old code.
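The Copiale Cipher is a homophonic substitution cipher, and Knight's team attacked it with statistical machine-translation machinery; that is well beyond a blog snippet, but the family resemblance to cryptanalysis starts with something as simple as symbol-frequency counting. A toy illustration on a monoalphabetic cipher (a Caesar shift, not the Copiale system):

```python
from collections import Counter

# Toy ciphertext: "the quick brown fox jumps over the lazy dog" shifted by 4.
ciphertext = "XLI UYMGO FVSAR JSB NYQTW SZIV XLI PEDC HSK"

def frequency_table(text):
    """Relative frequency of each letter -- the first clue in any substitution attack."""
    letters = [c for c in text if c.isalpha()]
    n = len(letters)
    return {c: count / n for c, count in Counter(letters).most_common()}

freqs = frequency_table(ciphertext)
```

Matching the most frequent cipher symbols against the expected frequencies of plaintext letters is the classical entry point; the statistical-translation approach generalizes this by modeling whole sequences rather than isolated symbols.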

Read the rest of this entry »

Comments (22)

In favor of the microlex

Bruce Schneier quotes Stubborn Mule citing R.A. Howard:

Shopping for coffee you would not ask for 0.00025 tons (unless you were naturally irritating), you would ask for 250 grams. In the same way, talking about a 1/125,000 or 0.000008 risk of death associated with a hang-gliding flight is rather awkward. With that in mind, Howard coined the term “microprobability” (μp) to refer to an event with a chance of 1 in 1 million, and a 1 in 1 million chance of death he calls a “micromort” (μmt). We can now describe the risk of hang-gliding as 8 micromorts, and you would have to drive around 3,000km in a car before accumulating a risk of 8μmt, which helps compare these two remote risks.

This reminds me of the Google Ngram Viewer's habit of citing word frequencies as percentages, with uninterpretably large numbers of leading zeros after the decimal point:
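The conversion in both cases is the same trivial rescaling: express a tiny probability or percentage in events per million, so the leading zeros disappear. A sketch (the Ngram-style percentage is an invented example value):

```python
def to_micro(p):
    """Express a probability as events per million (micro-units)."""
    return p * 1_000_000

# Hang-gliding risk from the quote: 1/125,000 per flight -> 8 micromorts.
hang_gliding = to_micro(1 / 125_000)

def percent_to_per_million(pct):
    """Turn an Ngram-Viewer-style percentage into occurrences per million words."""
    return pct / 100 * 1_000_000

# e.g. a word at 0.0000732% of all tokens is 0.732 per million words.
word_freq = percent_to_per_million(0.0000732)
```

A "microlex" unit of one occurrence per million words would do for word frequencies what the micromort does for risks.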

Read the rest of this entry »

Comments (27)

Speech-to-speech translation

Rick Rashid, "Microsoft Research shows a promising new breakthrough in speech translation technology", 11/8/2012:

A demonstration I gave in Tianjin, China at Microsoft Research Asia’s 21st Century Computing event has started to generate a bit of attention, and so I wanted to share a little background on the history of speech-to-speech technology and the advances we’re seeing today.

In the realm of natural user interfaces, the single most important one – yet also one of the most difficult for computers – is that of human speech.

Read the rest of this entry »

Comments (29)

Pundits were confused and inaccurate

Also, the sky turns out to have been blue much of the time, and early returns are strongly suggesting that water is often wet. John Sides, "2012 Was the Moneyball Election", The Monkey Cage 11/7/2012:

Barack Obama’s victory tonight is also a victory for the Moneyball approach to politics.  It shows us that we can use systematic data—economic data, polling data—to separate momentum from no-mentum, to dispense with the gaseous emanations of pundits’ “guts,” and ultimately to forecast the winner.

Read the rest of this entry »

Comments (25)

The he's and she's of Twitter

My latest column for the Boston Globe is about some fascinating new research presented by Tyler Schnoebelen at the recent NWAV 41 conference at Indiana University Bloomington. Schnoebelen's paper, co-authored with Jacob Eisenstein and David Bamman, is entitled "Gender, styles, and social networks in Twitter" (abstract, full paper, presentation).

Read the rest of this entry »

Comments (6)

'lololololol' ≠ Tagalog

Ed Manley, "Detecting Languages in London's Twittersphere", UrbanMovements 10/22/2012:

Over the last couple of weeks, and as a bit of a distraction from finishing off my PhD, I've been working with James Cheshire looking at the use of different languages within my aforementioned dataset of London tweets.

I've been handling the data generation side, and the method really is quite simple.  Just like some similar work carried out by Eric Fischer, I've employed the Chromium Compact Language Detector – an open-source Python library adapted from the Google Chrome algorithm to detect a website's language – in detecting the predominant language contained within around 3.3 million geolocated tweets, captured in London over the course of this summer. […]

One issue with this approach that I did note was the surprising popularity of Tagalog, a language of the Philippines, which initially was identified as the 7th most tweeted language.  On further investigation, I found that many of these classifications included just uses of English terms such as 'hahahahaha', 'ahhhhhhh' and 'lololololol'.  I don't know much about Tagalog but it sounds like a fun language.  Nevertheless, Tagalog was excluded from our analysis.
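The misclassified tokens all share a structure — a short motif repeated many times — so one plausible pre-filter (not what Manley actually did; he simply excluded Tagalog) is to screen out laughter-like tokens before handing tweets to the detector. A stdlib-only sketch:

```python
import re

# Tokens like 'hahahahaha', 'lololololol' and 'ahhhhhhh' are a short motif
# (1-3 characters) repeated three or more times, possibly with a little
# leading or trailing debris. Flag them before language detection.
REPEAT = re.compile(r"^\w{0,3}?(\w{1,3})\1{2,}\w{0,3}$", re.IGNORECASE)

def looks_like_laughter(token):
    """True for repetitive interjection-style tokens, False for ordinary words."""
    return bool(REPEAT.match(token))
```

Dropping such tokens (or whole tweets dominated by them) before classification would remove much of the spurious "Tagalog" without excluding the language outright.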

Read the rest of this entry »

Comments (10)

Nurbling

Comments (30)

A new chapter for Google Ngrams

When Google's Ngram Viewer was launched in December 2010 it encouraged everyone to be an amateur computational linguist, an amateur historical lexicographer, or a little of both. Today, the public interface that allows users to plumb the Google Books megacorpus has been relaunched, and the new version makes it even more enticing to researchers, both scholarly and nonscholarly. You can read all about it in my online piece for The Atlantic, as well as Jon Orwant's official introduction on the Google Research blog.

Read the rest of this entry »

Comments (13)

"… repeated violations of an act"

Brian Mahoney, "NBA Sets Flopping Penalties; Players May Be Fined", AP 10/3/2012:

Stop the flop.

The NBA will penalize floppers this season, fining players for repeated violations of an act a league official said Wednesday has "no place in our game."

Those exaggerated falls to the floor may fool the referees and fans during the game, but officials at league headquarters plan to take a look for themselves afterward.

Read the rest of this entry »

Comments (21)

Lexical loops

David Levary, Jean-Pierre Eckmann, Elisha Moses, and Tsvi Tlusty, "Loops and Self-Reference in the Construction of Dictionaries", Phys. Rev. X 2, 031018 (2012):

ABSTRACT: Dictionaries link a given word to a set of alternative words (the definition) which in turn point to further descendants. Iterating through definitions in this way, one typically finds that definitions loop back upon themselves. We demonstrate that such definitional loops are created in order to introduce new concepts into a language. In contrast to the expectations for a random lexical network, in graphs of the dictionary, meaningful loops are quite short, although they are often linked to form larger, strongly connected components. These components are found to represent distinct semantic ideas. This observation can be quantified by a singular value decomposition, which uncovers a set of conceptual relationships arising in the global structure of the dictionary. Finally, we use etymological data to show that elements of loops tend to be added to the English lexicon simultaneously and incorporate our results into a simple model for language evolution that falls within the “rich-get-richer” class of network growth.
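In graph terms, a dictionary is a directed graph from each headword to the words in its definition, and the "loops" and "strongly connected components" of the abstract are exactly the SCCs of that graph. A toy sketch, with an invented six-word dictionary and Kosaraju's algorithm (the paper's actual lexical network is of course vastly larger):

```python
from collections import defaultdict

# Toy dictionary graph: each word points to the words used to define it.
defs = {
    "big":   ["large"],
    "large": ["great"],
    "great": ["big"],        # definitional loop: big -> large -> great -> big
    "red":   ["color"],
    "color": ["hue"],
    "hue":   ["color"],      # second loop: color <-> hue
}

def sccs(graph):
    """Kosaraju's algorithm: strongly connected components of a directed graph."""
    order, seen = [], set()
    def dfs1(u):                       # first pass: record finish order
        seen.add(u)
        for v in graph.get(u, []):
            if v not in seen:
                dfs1(v)
        order.append(u)
    for u in graph:
        if u not in seen:
            dfs1(u)
    rev = defaultdict(list)            # reverse all edges
    for u in graph:
        for v in graph.get(u, []):
            rev[v].append(u)
    comps, assigned = [], set()
    def dfs2(u, comp):                 # second pass: sweep the reversed graph
        assigned.add(u)
        comp.append(u)
        for v in rev[u]:
            if v not in assigned:
                dfs2(v, comp)
    for u in reversed(order):
        if u not in assigned:
            comp = []
            dfs2(u, comp)
            comps.append(comp)
    return comps

loops = [c for c in sccs(defs) if len(c) > 1]   # the definitional loops
```

Here the nontrivial components recover the two planted loops; in the real data, the paper's finding is that such loops are short and cluster into larger semantically coherent components.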

Read the rest of this entry »

Comments (22)

Historical culturomics of pronoun frequencies

Jean M. Twenge, W. Keith Campbell and Brittany Gentile, "Male and Female Pronoun Use in U.S. Books Reflects Women’s Status, 1900–2008", Sex Roles published online 8/7/2012. The abstract:

The status of women in the United States varied considerably during the 20th century, with increases 1900–1945, decreases 1946–1967, and considerable increases after 1968. We examined whether changes in written language, especially the ratio of male to female pronouns, reflected these trends in status in the full text of nearly 1.2 million U.S. books 1900–2008 from the Google Books database. Male pronouns included he, him, his, himself and female pronouns included she, her, hers, and herself. Between 1900 and 1945, 3.5 male pronouns appeared for every female pronoun, increasing to 4.5 male pronouns during the postwar era of the 1950s and early 1960s. After 1968, the ratio dropped precipitously, reaching 2 male pronouns per female pronoun by the 2000s. From 1968 to 2008, the use of male pronouns decreased as female pronouns increased. The gender pronoun ratio was significantly correlated with indicators of U.S. women’s status such as educational attainment, labor force participation, and age at first marriage as well as women’s assertiveness, a personality trait linked to status. Books used relatively more female pronouns when women’s status was high and fewer when it was low. The results suggest that cultural products such as books mirror U.S. women’s status and changing trends in gender equality over the generations.
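The measure behind all those numbers is just a count of two closed word classes. A minimal sketch of the ratio computation on a made-up sample sentence:

```python
import re

MALE = {"he", "him", "his", "himself"}
FEMALE = {"she", "her", "hers", "herself"}

def pronoun_ratio(text):
    """Male-to-female pronoun ratio, in the spirit of the Twenge et al. measure."""
    words = re.findall(r"[a-z]+", text.lower())
    m = sum(w in MALE for w in words)
    f = sum(w in FEMALE for w in words)
    return m / f if f else float("inf")

sample = "He said his plan was ready. She told him she agreed with her sister."
ratio = pronoun_ratio(sample)   # 3 male, 3 female -> 1.0
```

Note that "her" is ambiguous between object and possessive, and "his" between determiner and nominal, so the raw counts conflate several grammatical roles — one of several reasons to treat the trend lines with care.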

Read the rest of this entry »

Comments (20)

Noisily channeling Claude Shannon

There's a passage in James Gleick's "Auto Crrect Ths!", NYT 8/4/2012, that's properly spelled but in need of some content correction:

If you type “kofee” into a search box, Google would like to save a few milliseconds by guessing whether you’ve misspelled the caffeinated beverage or the former United Nations secretary-general. It uses a probabilistic algorithm with roots in work done at AT&T Bell Laboratories in the early 1990s. The probabilities are based on a “noisy channel” model, a fundamental concept of information theory. The model envisions a message source — an idealized user with clear intentions — passing through a noisy channel that introduces typos by omitting letters, reversing letters or inserting letters.

“We’re trying to find the most likely intended word, given the word that we see,” Mr. [Mark] Paskin says. “Coffee” is a fairly common word, so with the vast corpus of text the algorithm can assign it a far higher probability than “Kofi.” On the other hand, the data show that spelling “coffee” with a K is a relatively low-probability error. The algorithm combines these probabilities. It also learns from experience and gathers further clues from the context.

The same probabilistic model is powering advances in translation and speech recognition, comparable problems in artificial intelligence. In a way, to achieve anything like perfection in one of these areas would mean solving them all; it would require a complete model of human language. But perfection will surely be impossible. We’re individuals. We’re fickle; we make up words and acronyms on the fly, and sometimes we scarcely even know what we’re trying to say.
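The combination Paskin describes — a language model for how likely each candidate word is, times a channel model for how likely the observed typo is given that word — can be sketched in a few lines. The probabilities below are invented for illustration; a real system estimates both distributions from data:

```python
# Toy noisy-channel corrector: pick the candidate w maximizing P(w) * P(typo | w).

prior = {"coffee": 1e-4, "kofi": 1e-7}        # language model: how common is the word?
error = {("kofee", "coffee"): 1e-3,           # channel model: probability of this typo
         ("kofee", "kofi"):   1e-2}           # given the intended word

def correct(typo, candidates):
    """Return the candidate with the highest posterior score for the observed typo."""
    return max(candidates, key=lambda w: prior[w] * error[(typo, w)])

best = correct("kofee", ["coffee", "kofi"])
```

With these numbers "coffee" wins (1e-7 against 1e-9) even though "kofee" is closer in spelling to "Kofi" — the prior dominates, which is exactly the point of the quoted passage.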

Read the rest of this entry »

Comments (7)