Archive for Computational linguistics

The 2016 Blizzard Challenge

The Blizzard Challenge needs you!

Every year since 2005, an ad hoc group of speech technology researchers has held a "Blizzard Challenge", under the aegis of the Speech Synthesis Special Interest Group (SYNSIG) of the International Speech Communication Association.

The general idea is simple: competitors take a released speech database, build a synthetic voice from the data, and synthesize a prescribed set of test sentences. The sentences from each synthesizer are then evaluated through listening tests.

Why "Blizzard"? Because the early competitions used the CMU ARCTIC datasets, which began with a set of sentences read from James Oliver Curwood's novel Flower of the North.

Anyhow, if you have an hour of your time to donate towards making speech synthesis better, sign up and be a listener!

Q. Pheevr's Law

In a comment on one of yesterday's posts ("Adjectives and Adverbs"), Q. Pheevr wrote:

It's hard to tell with just four speakers to go on, but it looks as if there could be some kind of correlation between the ADV:ADJ ratio and the V:N ratio (as might be expected given that adjectives canonically modify nouns and adverbs canonically modify verbs). Of course, there are all sorts of other factors that could come into this, but to the extent that speakers are choosing between alternatives like "caused prices to increase dramatically" and "caused a dramatic increase in prices," I'd expect some sort of connection between these two ratios.

So since I have a relatively efficient POS tagging script, and an ad hoc collection of texts lying around, I thought I'd devote this morning's Breakfast Experiment™ to checking the idea out.
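For readers who want to see what the two ratios look like in practice, here is a minimal sketch in Python. It is not the tagging script mentioned above: it assumes the text has already been tagged with Universal Dependencies-style labels (ADJ, ADV, NOUN, VERB), and the function name pos_ratios and the hand-tagged example phrases are purely illustrative.

```python
from collections import Counter


def pos_ratios(tagged_tokens):
    """Return the ADV:ADJ and V:N ratios for a list of (word, tag) pairs.

    A ratio is None when its denominator count is zero.
    """
    counts = Counter(tag for _, tag in tagged_tokens)

    def ratio(numerator, denominator):
        if counts[denominator] == 0:
            return None
        return counts[numerator] / counts[denominator]

    return {"ADV:ADJ": ratio("ADV", "ADJ"), "V:N": ratio("VERB", "NOUN")}


# Hand-tagged versions of the two alternatives quoted in the comment
# (tags assigned by hand for illustration, not produced by a tagger).
verbal = [("caused", "VERB"), ("prices", "NOUN"), ("to", "PART"),
          ("increase", "VERB"), ("dramatically", "ADV")]
nominal = [("caused", "VERB"), ("a", "DET"), ("dramatic", "ADJ"),
           ("increase", "NOUN"), ("in", "ADP"), ("prices", "NOUN")]

verbal_ratios = pos_ratios(verbal)
nominal_ratios = pos_ratios(nominal)
print("verbal: ", verbal_ratios)
print("nominal:", nominal_ratios)
```

On these two toy phrases the verbal variant has a V:N ratio of 2.0 and the nominal one 0.5, which is the sort of trade-off the comment has in mind. Real texts would need an actual tagger and far more data.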

Read the rest of this entry »

Scientific prescriptivism: Garner Pullumizes?

The publisher's blurb for the fourth edition of Garner's Modern English Usage introduces a new feature:

With more than a thousand new entries and more than 2,300 word-frequency ratios, the magisterial fourth edition of this book, now renamed Garner's Modern English Usage (GMEU), reflects usage lexicography at its finest. […]

The judgments here are backed up not just by a lifetime of study but also by an empirical grounding in the largest linguistic corpus ever available. In this fourth edition, Garner has made extensive use of corpus linguistics to include ratios of standard terms as compared against variants in modern print sources.

The largest linguistic corpus ever available, of course, is the Google Books ngram collection. And "word-frequency ratio" means, for example, the observation that, in pluralizing corpus, corpora outnumbers corpuses by 69:1.

Read the rest of this entry »

Data journalism and film dialogue

Hannah Anderson and Matt Daniels, "Film Dialogue from 2,000 screenplays, Broken Down by Gender and Age", A Polygraph Joint 2016:

Lately, Hollywood has been taking so much shit for rampant sexism and racism. The prevailing theme: white men dominate movie roles.

But it’s all rhetoric and no data, which gets us nowhere in terms of having an informed discussion. How many movies are actually about men? What changes by genre, era, or box-office revenue? What circumstances generate more diversity?

To begin answering these questions, we Googled our way to 8,000 screenplays and matched each character’s lines to an actor. From there, we compiled the number of lines for male and female characters across roughly 2,000 films, arguably the largest undertaking of script analysis, ever.

Read the rest of this entry »

Some phonetic dimensions of speech style

My posts have been thin recently, mostly because over the past ten days or so I've been involved in the preparation and submission of five conference papers, on top of my usual commitments to teaching and meetings and visitors. Nobody's fault but mine, of course. Anyhow, this gives me some raw material that I'll try to present in a way that's comprehensible and interesting to non-specialists.

One of the papers, with Neville Ryant as first author, was an attempt to take advantage of a large collection of audiobook recordings to explore some dimensions of speaking style. The paper is still under review, so I'll wait to post a copy until its fate is decided — but there are some interesting ideas and suggestive results that I can share. And to motivate you to read the somewhat wonkish explanation that follows, I'll start off with a picture:

Read the rest of this entry »

I'm learning… something?

Google Translate renders the Hungarian "Tanulok Magyarul" ("I'm learning Hungarian") as "I'm learning English":


Read the rest of this entry »

Poetic sound and silence

Following up on "Political sound and silence", 2/8/2016, here's a level plot of speech segment durations and immediately-following silence segment durations from William Carlos Williams' poetry reading at the Library of Congress in May of 1945:


Read the rest of this entry »

Political sound and silence

As part of an exercise/demonstration for a course, last night I ran Neville Ryant's second-best speech activity detector (SAD) on Barack Obama's Weekly Radio Addresses for 2010 (50 of them), and George W. Bush's Weekly Radio Addresses for 2008 (48 of them). The distributions of speech and silence durations, via R's kernel density estimation function, look like this:

Read the rest of this entry »
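As an aside, the kind of kernel density estimate mentioned above can be sketched in a few lines of Python. This is an illustration rather than the code behind the plots: it uses a Gaussian kernel with a Silverman-style rule-of-thumb bandwidth (which is, as far as I can tell, also the default rule in R's density()), and the duration values are invented placeholders, not output from the speech activity detector.

```python
import math
import statistics


def silverman_bandwidth(xs):
    """Rule-of-thumb bandwidth: 0.9 * min(sd, IQR / 1.34) * n ** (-1/5)."""
    sd = statistics.stdev(xs)
    q1, _, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    spread = min(sd, (q3 - q1) / 1.34)
    return 0.9 * spread * len(xs) ** -0.2


def gaussian_kde(xs, grid, bandwidth=None):
    """Evaluate a Gaussian kernel density estimate of xs at each grid point."""
    h = bandwidth if bandwidth is not None else silverman_bandwidth(xs)
    norm = 1.0 / (len(xs) * h * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((g - x) / h) ** 2) for x in xs)
        for g in grid
    ]


# Invented segment durations in seconds (placeholders, not real measurements).
durations = [0.2, 0.35, 0.5, 0.8, 1.1, 1.6, 2.4]

step = 0.01
grid = [i * step for i in range(-500, 1000)]  # -5 s to 10 s
density = gaussian_kde(durations, grid)

# A density should integrate to roughly 1 over a wide enough grid.
area = sum(density) * step
print(f"bandwidth = {silverman_bandwidth(durations):.3f}, area = {area:.3f}")
```

Running the same function separately on speech and silence durations for each speaker would give curves that can be compared directly.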

Totally Word Mapper

Jack Grieve's Twitter-based Word Mapper (see "Geolexicography", 1/27/2016) is now available as a web app, like totally:

Read the rest of this entry »

Style or artefact or both?

In "Correlated lexicometrical decay", I commented on some unexpectedly strong correlations over time of the ratios of word and phrase frequencies in the Google Books English 1gram dataset:

I'm sure that these patterns mean something. But it seems a little weird that OF as a proportion of all prepositions should correlate r=0.953 with the proportion of instances of OF immediately followed by THE, and it seems weirder that OF as a proportion of all prepositions should correlate r=0.913 with the proportion of adjective-noun sequences immediately preceded by THE.

So let's hope that what these patterns mean is that the secular decay of THE has somehow seeped into some but not all of the other counts, or that some other hidden cause is governing all of the correlated decays. The alternative hypothesis is that there's a problem with the way the underlying data was collected and processed, which would be annoying.

And in a comment on a comment, I noted that the corresponding data from the Corpus of Historical American English, which is a balanced corpus collected from sources largely or entirely distinct from the Google Books dataset, shows similar unexpected correlations.

So today I'd like to point out that much simpler data (frequencies of a few of the commonest words) shows some equally strong correlations over time in these same datasets.
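For anyone who wants to run the same kind of check on their own counts, a correlation over time is just a Pearson r computed across matched time slices. The sketch below is a plain-Python illustration. The two series are invented numbers standing in for per-decade proportions, not values taken from Google Books or COHA, and pearson_r is a hypothetical helper name.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Invented per-decade proportions, for illustration only.
series_a = [0.300, 0.290, 0.285, 0.270, 0.260]
series_b = [0.070, 0.068, 0.066, 0.063, 0.061]

r = pearson_r(series_a, series_b)
print(f"r = {r:.3f}")
```

Note that any two series that both drift steadily in the same direction will score a high r on this measure, which is part of why such correlations are hard to interpret.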

Read the rest of this entry »

Correlated lexicometrical decay

This is a brief progress report on "The case of the disappearing determiners", which I've continued to poke at in my spare time.

As the red line in the plot below shows, the proportion of nouns immediately preceded by THE decreased over the course of the 20th century, from an average of 18.9% for books published in 1900-1910 to 13.5% for books published in 1990-2000.  The blue line shows that the proportion of adjective+noun sequences immediately preceded by THE was higher, overall, but followed a remarkably similar falling trajectory, from 29.1% in 1900-1910 to 21.2% in 1990-2000:

Read the rest of this entry »
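To put the two trajectories on a common footing, the endpoint figures quoted above can be turned into relative declines. A small Python sketch, using only the four percentages given in the text (the helper name relative_decline is made up for this example):

```python
def relative_decline(start_pct, end_pct):
    """Fraction of the starting value lost between two measurements."""
    return (start_pct - end_pct) / start_pct


# Endpoint averages quoted in the text: 1900-1910 versus 1990-2000.
noun_decline = relative_decline(18.9, 13.5)      # nouns preceded by THE
adj_noun_decline = relative_decline(29.1, 21.2)  # adjective+noun preceded by THE

print(f"noun:           {noun_decline:.1%}")      # about 28.6%
print(f"adjective+noun: {adj_noun_decline:.1%}")  # about 27.1%
```

Both come out a little under 30% over the period, which is one way of seeing why the two lines look so alike despite their different starting levels.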

Dutch DE

Following up on yesterday's post "The case of the disappearing determiners", Gosse Bouma sent me some data from the CGN ("Corpus Gesproken Nederlands"), about determiner use in spoken Dutch by people born between 1914 and 1987. According to the CGN website,

The Spoken Dutch Corpus project was aimed at the construction of a database of contemporary standard Dutch as spoken by adults in The Netherlands and Flanders. […] In version 1.0, the results are presented that have emerged from the project. The total number of words available here is nearly 9 million (800 hours of speech). Some 3.3 million words were collected in Flanders, well over 5.6 million in The Netherlands.

It's not clear to me exactly when the recordings were made, but the project ran from 1998 to 2004.

Gosse sent data focused on the word de, which is the definite article for masculine and feminine ("common") nouns in Dutch, cognate with English the.  (The definite article for neuter nouns, het, is less frequent and also can be used as a pronoun.)

The results are similar to those that I reported earlier for English: older people use the definite article more frequently than younger people (at least among people born from the 1950s onwards), and at every age, men use the definite article more than women do.

Read the rest of this entry »

The case of the disappearing determiners

For the past century or so, the commonest word in English has gradually been getting less common. Depending on data source and counting method, the frequency of the definite article THE has fallen substantially — in some cases at a rate as high as 50% per 100 years.

At every stage, writing that's less formal has fewer THEs, and speech generally has fewer still, so to some extent the decline of THE is part of a more general long-term trend towards greater informality. But THE is apparently getting rarer even in speech, so the change is more than just the (normal) shift of writing style towards the norms of speech.

There appear to be weaker trends in the same direction, at overall lower rates, in German, Italian, Spanish, and French.

I'll lay out some of the evidence for this phenomenon, mostly collected from earlier LLOG posts. And then I'll ask a few questions about what's really going on, and why and how it's happening. [Warning: long and rather wonky.]

Read the rest of this entry »
