The fruits of your labors


At the recent Language Diversity Congress in Groningen, one of many interesting presentations was Martijn Wieling and John Nerbonne’s “Inducing and using phonetic similarity”. More than a thousand LL readers played a role in the creation of this work, by responding to a request back in May (“Rating American English Accents”, 5/19/2012) to participate in an online experiment.

A longer explanation of the experiment and its outcome can be found in Martijn Wieling et al., “Automatically measuring the strength of foreign accents in English”:

We measure the differences between the pronunciations of native and non-native American English speakers using a modified version of the Levenshtein (or string edit) distance applied to phonetic transcriptions. Although this measure is well understood theoretically and variants of it have been used successfully to study dialect pronunciations, the comprehensibility of related varieties, and the atypicalness of the speech of the bearers of cochlear implants, it has not been applied to study foreign accents. We briefly present an appropriate version of the Levenshtein distance in this paper and apply it to compare the pronunciation of non-native English speakers to native American English speech. We show that the computational measurements correlate strongly with the average “native-like” judgments given by more than 1000 native U.S. English raters (r = -0.8, p < 0.001). This means that the Levenshtein distance is qualified to function as a measurement of “native-likeness” in studies of foreign accent.
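For readers who haven't met the string edit distance before, here is a minimal sketch of the plain, unweighted Levenshtein distance applied to sequences of phonetic segments. It is not the modified version used in the paper, which weights segment substitutions; the function name and the two transcriptions are made up for illustration.

```python
def levenshtein(a, b):
    """Plain (unweighted) edit distance between two sequences of segments.

    Every insertion, deletion and substitution costs 1. The measure in the
    paper is a modified version with graded substitution costs; this sketch
    shows only the textbook algorithm.
    """
    # prev[j] holds the distance between the first i-1 items of a
    # and the first j items of b.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty sequence
        for j, y in enumerate(b, start=1):
            substitution = prev[j - 1] + (0 if x == y else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


# Hypothetical transcriptions of "Wednesday", one segment per list item.
native = ["w", "ɛ", "n", "z", "d", "eɪ"]
non_native = ["v", "ɛ", "n", "s", "d", "eɪ"]
distance = levenshtein(native, non_native)  # two substitutions
```

Working on lists of segments rather than raw strings keeps a diphthong such as "eɪ" as a single unit, which is closer in spirit to comparing transcriptions than comparing characters would be.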

One thing that remains to be done is to compare these results to the distribution of correlations among the human judges. Given the diversity of opinion on the nature of the task, it would not surprise me to find that the automatic algorithm agreed with average human judgments as well as or better than individual human subjects agreed with the average. A related question is how much of the unexplained variance (1.0-0.8^2 = 36%) is noise, and how much is due to systematic effects that are missing from the phonetic transcriptions that are input to their PMI-weighted string-edit distance, such as sub-IPA phonetic variation, or prosodic differences, or … It might be true that their algorithm agrees with average human judgments better than individual human judges do, and at the same time be true that there are factors influencing human judgments that their algorithm doesn’t pay attention to.
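The arithmetic behind the 36% figure can be spelled out in a few lines. This is only a sketch: the function names are mine, and the Pearson correlation is computed from scratch so that the snippet needs nothing beyond the standard library.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length number lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def unexplained_variance(r):
    """Share of variance not accounted for by a linear fit: 1 - r^2."""
    return 1.0 - r * r


# With the reported correlation of r = -0.8, r^2 is 0.64, so roughly
# 36% of the variance in the average ratings is left unexplained.
share = unexplained_variance(-0.8)
```

The same `pearson_r` function could be applied to each individual rater's scores against the average, which is the comparison suggested above, but that needs the per-rater data.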

Martijn is good at finding creative ways to recruit experimental subjects, as this video indicates:

Martijn plays the role of the subject.



7 Comments

  1. peter said,

    July 26, 2013 @ 6:24 am

    Sorry to nitpick, but progress in science requires it.

    “More than a thousand LL readers played a role . . ”. How is it known how many distinct readers there were who played a role? IIRC, the survey was optionally anonymous, and even non-anonymous responders could have responded multiple times with different false email addresses. So, as far as I can tell, the number of responders could have been any integer number between 1 and the number of responses received. It would have even been possible for someone to write a script to respond automatically, something possibly – but only possibly, not necessarily – detectable from the server logs, or perhaps from subtle analysis of the results obtained.

    [(myl) If money or politically-fraught issues were involved, I would be more concerned about this question. There’s no obvious motivation in this case for multiple participation, much less construction and deployment of a bot. The respondents are obviously not a demographically random sample, but participants in psychological experiments almost never are.]

  2. Ted said,

    July 26, 2013 @ 11:58 am

    Fascinating.

    As a participant, my subjective impression was that “super-segmental” information (i.e., prosody) was a very important aspect of the foreignness of many of the samples. I suspect that accounts for a high fraction of the unexplained variance. But to test this hypothesis, someone will have to come up with a quantifiable measure of prosodic difference analogous to the Levenshtein procedure.

  3. M.N. said,

    July 26, 2013 @ 12:15 pm

    “More than a thousand LL readers played a role…”
    Also, four thousand ships passed through the lock. :)

  4. peter said,

    July 26, 2013 @ 3:35 pm

    Mark — Just to be clear, my comment was not about the degree of randomness of the sample of respondents, and hence about the statistical validity of the study results for making claims about some population. That is another issue to the one I was raising, which was about the factual accuracy of the statements describing the size of the sample.

    The paper you link to states (p. 13): “A total of 1143 native U.S. English participants filled in the questionnaire.” As far as I can tell, the authors have no way to know whether this statement is true or false. As I said, the true number may be any integer between 1 and 1143. That uncertainty may, as it happens, have no impact at all on what wider conclusions one may draw from the paper’s results, and may be no bar to the validity of the paper’s claims.

    It always fascinates me that issues ignored – or discussed only loosely – in one discipline may be the central core of entire other disciplines. Traditional pure mathematicians ignore the constructability of mathematical entities they believe to exist, a feature which could be seen to define computer science. Mainstream economists ignore the imperfections and limitations of human decision-makers, aspects which marketing theorists focus their attention on. And here some people interested in language talk sloppily about numbers in a way that would not be accepted if they were talking about words.

  5. Theo Vosse said,

    July 27, 2013 @ 2:10 pm

    @peter: yes, it’s odd that they write that instead of the more factual “we received 1143 responses”. I guess the reviewers accepted it because it was close enough and –to put it simply– nobody cares about it. Nor did anyone care about the fact that 1143 LL readers is obviously not a representative sample, although they go out of their way to describe geographical distribution. For the study itself, that doesn’t really matter. Levenshtein predicts human judgement pretty well. Period.

    The part that does worry me a bit is this from the conclusions: “allowing us to study this phenomenon in a replicable and easily analyzable way without incurring the expense of human judgments.” That’s obviously not proven, but something that could (and should) be done in a replication study.

  6. Zubon said,

    July 30, 2013 @ 7:18 am

    Mainstream economists ignore the imperfections and limitations of human decision-makers, aspects which marketing theorists focus their attention on. [citation needed]

    Many mainstream economists talk of little else. To take the first common economic example that comes to mind, see search costs. If you can get more mainstream than the Nobel laureates, one of the first notes on the Coase Theorem is Ronald Coase addressing the problems of imperfect information and transaction costs.

    Which perhaps reinforces your point: “people interested in [this topic] talk sloppily about [other disciplines] in a way that would not be accepted if they were talking about [their own].”

  7. peter said,

    July 30, 2013 @ 5:23 pm

    Thanks, Zubon. Yes, some (not all) mainstream economists do now talk of such issues, but only after more than a century of prodding from marketers and other social scientists. As the grudging response from mainstream economists to Kelvin Lancaster’s theory of product attributes in the 1960s showed, they rarely come willingly to this table.
