Asian Speech and Italian Text


One of the most interesting papers at an interesting conference was Michael Newman, "Identifying native English speaking Pacific Asian Americans by voice".

His abstract begins:

Research on racial identification by voice in the US has concentrated on determining rates of success with speakers from groups with well-known dialectal differences. These mainly involve distinguishing African Americans and European Americans but some work examined Latinos and Native Americans (Thomas 2004). By contrast, I report the results of a dialect identification experiment of a differentiation that is rarely discussed, that involving native English speaking Asian Americans. This group has only been found to show subtle quantitative differentiation from other groups in use of features shared by those groups (e.g. Wong 2009). In the experiment, 116 judges, all raised in New York City listened to eight men, and 111 judges listened to eight women all, also all raised in New York. Each set included two Chinese Americans, two Korean Americans, two European Americans, a Latino, and an African American, reading the same short passage. Judges identified speakers’ race, and if Asian, whether the speaker was of Korean or Chinese heritage. Finally, judges answered six subjective response items focused on stereotypes of Asian Americans. Judges’ successfully identified speakers’ race at rates highly significantly above chance for all racial groups

His handout quotes a since-deleted entry on the Chinese Facebook Group Board:

"Do you sound Asian when you speak English?"

I don't mean an accent like when FOBs try to speak English. I've just noticed that Asian Americans tend to have a certain quality to their talking. It's kind of how you can tell when an white person is talking or when an African American or Latino American person is talking.

They might even be using the same vocabulary, I don't mean slang or anything, but their voice inflections and vowel pronunciations. (…) For instance, listen to Daniel Dae Kim speak English (not on LOST, but in real life lol) when he talks and compare to like … (insert famous white actor).

The "short passage" that Michael's speakers read was this:

A wily coyote led sharpshooters armed with tranquilizer guns on a merry chase through Central Park before being captured on Wednesday. At one point, authorities tried to corner the animal in the southeast corner of the park, by Wollman Rink. The clever creature jumped into the water, ducked under a bridge, then scampered through the rink ground and ran off.

The listeners were 227 CUNY students, whose responses were collected via Blackboard's survey module. As his abstract says, the judges "successfully identified speakers’ race at rates highly significantly above chance for all racial groups". The Asian American judges were especially successful in recognizing the Asian American speakers, scoring 65.3% for the two male speakers and 78.1% for the two female speakers (out of four possible response categories).

Michael reports instrumental tests for a variety of possible cues (breathiness, "syllable timing", vowel quality, voice onset time).  The results were suggestive, especially given the small number of speakers.

However, none of the cues tested were determinative — at best it seems that the Asian speakers have distributions of certain linguistic behaviors that are somewhat different from those in other identified groups, while still overlapping substantially with the other distributions. This is the normal situation in such cases.

Many people assume that cues to category membership should be individually determinative, or at least should form steps in a chain of determinate logical steps, and are puzzled when subjects are able to classify stimuli reliably (or at least significantly better than chance) in the absence of clear cues. So I thought it might be instructive to explain a general method for making decisions by combining multiple pieces of weak evidence.

Suppose, for example, we pick a letter at random from the newspaper, and find that it's an 'R'. Was the newspaper in English or in Italian? Well, we obviously can't know for sure, since there are plenty of common words in both languages that contain this letter. We can determine the odds by taking samples of English and Italian newspaper text, and estimating for each language the probability of each of the 26 letters of the standard Latin alphabet (ignoring case differences, spaces, punctuation, accented characters and so on).

[This is not a plausible method for doing language identification from text, since we'd surely want to look at letter sequences and common words, not individual letters; but it'll do as a thought experiment.]
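For readers who want to try the estimation step, here is a minimal Python sketch. The function names and the two tiny sample strings are my own illustrations, not the newspaper samples behind the numbers in this post, and the add-one smoothing is an assumption added so that a letter missing from one sample doesn't produce a division by zero.

```python
from collections import Counter
import string

def letter_probs(text):
    """Estimate P(letter) for a-z, ignoring case and all other characters.

    Add-one smoothing is an assumption of this sketch: every letter gets
    one extra count so that no estimated probability is exactly zero.
    """
    letters = [c for c in text.lower() if c in string.ascii_lowercase]
    counts = Counter(letters)
    total = len(letters) + 26  # 26 extra counts from the smoothing
    return {c: (counts[c] + 1) / total for c in string.ascii_lowercase}

def likelihood_ratio(letter, probs_a, probs_b):
    """P(letter | language A) / P(letter | language B)."""
    return probs_a[letter] / probs_b[letter]

# Toy samples for illustration only; real estimates need far more text.
italian_sample = "La volpe astuta corre nel parco della citta"
english_sample = "A wily coyote led sharpshooters on a merry chase"

p_it = letter_probs(italian_sample)
p_en = letter_probs(english_sample)
print(likelihood_ratio("r", p_it, p_en))
```

With samples this small the ratios are mostly noise; the point is only the shape of the calculation.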

On the basis of this exercise, we'd conclude that our 'R' is about 6% more likely in Italian than in English. More exactly,

P(R|Italian)/P(R|English) ≅1.0555

If Italian and English are the only options, so that P(English) = (1 – P(Italian)), a little algebra then yields

P(Italian) = 1.0555/(1+1.0555) ≅ .514

So if we guess "Italian" at this point, we've got about a 51% chance of being correct.
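The "little algebra" can be spelled out. Write p for the probability of Italian given the letter and r for the ratio above. With equal priors, p/(1 – p) = r, so p = r/(1 + r). As a small Python sketch (the function name is mine):

```python
def odds_to_probability(odds):
    """Convert odds in favor of a hypothesis into a probability.

    Solves p / (1 - p) = odds for p, which assumes the two hypotheses
    are the only options.
    """
    return odds / (1.0 + odds)

# The ratio for 'R' quoted above, treated as odds with equal priors.
print(round(odds_to_probability(1.0555), 3))
```

Odds of 1 (no evidence either way) map to a probability of 0.5, as they should.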

We draw another random letter, and this time it's an 'L'. By the same method,

P(L|Italian)/P(L|English) ≅1.4988

This is a bit better, since

P(Italian) = 1.4988/(1+1.4988) ≅ .60

But in fact, we can combine these two independent pieces of evidence. One way to do it is to multiply the odds: 1.0555*1.4988 ≅ 1.582, and so

P(Italian) = 1.582/2.582 = .613

So let's keep going. The next eight randomly-selected letters are 'O', 'A', 'E', 'T', 'L', 'R', 'O', 'A'. The corresponding ratios are 1.2756, 1.3545, 0.9334, 0.7159, 1.4988, 1.0555, 1.2756, 1.3545. The product of all ten ratios is 4.9915, meaning that given these ten randomly-selected letters, the odds of the language being Italian are about 5 to 1, and now

P(Italian) = 4.9915/(1+4.9915) ≅ .833

Now, none of the letters that we drew gave us, on its own, any especially solid evidence for this conclusion. The strongest evidence came from 'L', and guessing on that basis alone we would expect to be correct only a bit less than 60% of the time — and two of the ten letters — 'E' and 'T' — count individually as evidence in favor of English. Nevertheless, we've gotten our probability of guessing correctly up to about 83%. If we keep drawing letters, the chances are that the odds will get even better.
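The whole ten-letter calculation fits in a few lines of Python. This sketch uses only the ratios quoted in this post; because those ratios are rounded to four decimal places, the product it prints can differ from the figure above in the last digit or two.

```python
import math

# Italian-to-English likelihood ratios for the six distinct letters drawn,
# as quoted in the text.
ratios = {
    "R": 1.0555, "L": 1.4988, "O": 1.2756,
    "A": 1.3545, "E": 0.9334, "T": 0.7159,
}

# The ten letters drawn: 'R', then 'L', then the next eight.
draws = ["R", "L", "O", "A", "E", "T", "L", "R", "O", "A"]

# Independent pieces of evidence combine by multiplying their ratios.
combined_odds = math.prod(ratios[letter] for letter in draws)

# With equal priors, odds convert to a probability as odds / (1 + odds).
p_italian = combined_odds / (1.0 + combined_odds)

print(f"combined odds: {combined_odds:.4f}")
print(f"P(Italian):    {p_italian:.3f}")
```

Since multiplication is commutative, the order in which the letters arrive makes no difference to the final odds.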

A generalization of this method was explored by Alan Turing during WW II, and is now standard in many technical applications. More recently, it's been argued that (much) animal (and human) decision-making works in a similar way, as discussed in Joshua Gold and Michael Shadlen, "Banburismus and the Brain: Decoding the Relationship between Sensory Stimuli, Decisions, and Reward", Neuron 36:299-308, 2002:

In the early 1940’s, Alan Turing and his colleagues at Bletchley Park broke the supposedly unbreakable Enigma code used by the German navy. They succeeded by finding in the encoded messages the barest hints of evidence to support or refute various hypotheses about the encoding scheme that they could exploit to determine the contents of the message. Their success rested, in part, on a mathematical framework with three critical components: a method of quantifying the weight of evidence provided by individual clues toward the alternative hypotheses under consideration, a method of up-dating this quantity given multiple pieces of evidence, and a decision rule to determine when the evidence was sufficient to render a judgment on the most likely hypothesis (Good, 1979).

If there are two alternative hypotheses h1 and h2, and we have a piece of evidence ei, and we can estimate the probability of ei given that h1 is true and the probability of ei given that h2 is true, P(ei|h1) and P(ei|h2), then Turing defined the "weight of evidence" provided by ei in favor of h1 as the "log odds": log(P(ei|h1)/P(ei|h2)).

[The role of the logarithm here is just to turn products into sums, so that we can calculate the combined weight of evidence by adding up the individual weights contributed by multiple independent pieces of (perhaps weak) evidence e1, e2, … , eN. We can then place our bet when the (log) odds (or the number of tries) reach some pre-determined threshold.

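Mathematically-inclined readers will want to remember that by Bayes' Rule,

P(Italian|e)/P(English|e) = [P(e|Italian)/P(e|English)] × [P(Italian)/P(English)]

If we assume that Italian and English are equally likely, then

P(Italian|e)/P(English|e) = P(e|Italian)/P(e|English)

since the prior odds P(Italian)/P(English) and P(English)/P(Italian) are both equal to 1.]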
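The three components (a weight for each clue, an update rule, a stopping rule) can be sketched in Python as follows. The function names and the particular threshold are illustrative assumptions of mine, not anything taken from Turing or from Gold and Shadlen. I use natural logarithms here; any base works, since changing the base only rescales the weights and the threshold.

```python
import math

def weight_of_evidence(p_given_h1, p_given_h2):
    """Log likelihood ratio in favor of h1 for one piece of evidence."""
    return math.log(p_given_h1 / p_given_h2)

def sequential_decision(ratios, threshold):
    """Add up log likelihood ratios until the total crosses +/- threshold.

    Returns (decision, number of clues used, accumulated log odds).
    The decision is "undecided" if the evidence runs out first.
    """
    total = 0.0
    for n, ratio in enumerate(ratios, start=1):
        total += math.log(ratio)
        if total >= threshold:
            return "h1", n, total
        if total <= -threshold:
            return "h2", n, total
    return "undecided", len(ratios), total

# The ten Italian-to-English ratios from the example, in draw order.
letter_ratios = [1.0555, 1.4988, 1.2756, 1.3545, 0.9334,
                 0.7159, 1.4988, 1.0555, 1.2756, 1.3545]

# Hypothetical stopping rule: bet once the odds reach 4 to 1 either way.
decision, n_used, log_odds = sequential_decision(letter_ratios, math.log(4))
print(decision, n_used, round(math.exp(log_odds), 2))
```

A lower threshold gives faster but less reliable bets, and a higher one the reverse. That is the trade-off the decision rule is there to manage.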

Here's a graph of the accumulated odds resulting from running this experiment 1,000 times with 1 to 30 random letter-selections:

This shows that our random experiment was a bit optimistic — on average, the probability of a correct guess (in this set-up) after 10 draws from the urn is only about 73%:

Number of selections    Odds       Probability of guessing right
 5                       1.6171    0.6179
10                       2.7169    0.7310
15                       4.1119    0.8044
20                       7.2068    0.8782
25                      12.5071    0.9260
30                      18.5046    0.9487
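A simulation along these lines is easy to sketch in Python. I don't reproduce the letter frequencies behind the table here, so the two distributions below are made-up toy values over a four-letter alphabet, and the summary (the fraction of runs in which the accumulated odds favor the true language) is my choice and not necessarily how the table's figures were computed. The numbers it prints will therefore not match the table; only the upward trend should.

```python
import math
import random

# Hypothetical toy distributions (NOT real letter frequencies).
P_ITALIAN = {"a": 0.35, "e": 0.30, "l": 0.20, "t": 0.15}
P_ENGLISH = {"a": 0.25, "e": 0.35, "l": 0.15, "t": 0.25}

def fraction_correct(n_draws, n_runs=1000, seed=0):
    """Draw n_draws letters from the 'Italian' distribution, n_runs times,
    and report how often the summed log likelihood ratios favor Italian."""
    rng = random.Random(seed)
    letters = list(P_ITALIAN)
    weights = [P_ITALIAN[c] for c in letters]
    correct = 0
    for _ in range(n_runs):
        sample = rng.choices(letters, weights=weights, k=n_draws)
        log_odds = sum(math.log(P_ITALIAN[c] / P_ENGLISH[c]) for c in sample)
        if log_odds > 0:
            correct += 1
    return correct / n_runs

for n in (1, 5, 10, 20, 30):
    print(n, fraction_correct(n))
```

Swapping in real letter frequencies for the two dictionaries is all it would take to rerun the actual experiment.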

As I observed earlier, this is not a good way to do language identification. (By which I mean that relative unigram frequencies are an impoverished source of evidence, not that Turing's method for combining pieces of evidence is inappropriate.) The point is just that separate pieces of individually-weak evidence can combine to yield a judgment that has a high probability of being correct.

In the same way, we can make a reliable linguistic judgment even if there is no cue that provides strong evidence on its own. It's even easier to explain how someone might be able to make a judgment that's unreliable — but still much better than chance guessing — by combining a number of weak sources of evidence in a case like the perception of "Asian" speech.


  1. MattF said,

    November 15, 2010 @ 12:09 pm

    I can see how the argument works in the (weak evidence -> strong conclusion) direction, but how does it work in the other direction? How do you decide what collection of weak factors leads to the observed strong discrimination? And how are the factors weighted?

  2. JM said,

    November 15, 2010 @ 12:16 pm

    With the Asian-American "voice" I believe the ability to distinguish is akin to being able to "hear" a musical passage & recognize the composer. Or, in art identifying a painting as "from the school of" ______.

    Clearly, with voice a distinguishing feature is intonation. However, as a psychologist, not a linguist, I would be interested in learning how one could measure more specific variables.

    I know in the Japanese-American community in California one could distinguish Nisei vs Sansei intonation and possibly mainland vs Hawaiian intonation. I don't recall if anyone has studied this.

  3. JM said,

    November 15, 2010 @ 12:57 pm

    Are you sure you got Bayes' Rule right there? I think you want to leave off the extra divisions in the numerator and denominator. Otherwise you seem to be squaring your priors.

    [(myl) Oops. That's what I get for wrestling with Google Docs' equation editor while jet lagged. Fixed now, I think.]

  4. Outis said,

    November 15, 2010 @ 1:08 pm

    As an Asian American, I've always felt that there was an "Asian American voice" that is quite independent of the accent. In fact, on the few occasions I could not detect this, I've always found it to be slightly unsettling. I've felt the same to be true in native Asian French, but less so with Asian British and Asian German. The latter perhaps due to the stronger dialectal variation (to my ears anyway) that possibly overwhelms the "Asianness".

    I wonder if bilingualism on either the part of the speaker or the listener has any effect in this.

  5. Spell Me Jeff said,

    November 15, 2010 @ 1:42 pm

    I should think that controlling for bilingualism would be essential to further research of this kind, and would be easy enough to do.

  6. J. W. Brewer said,

    November 15, 2010 @ 2:12 pm

    The other things I think you would want to control for are: (a) whether or not the parents or other immediate household/family members of the U.S.-raised speaker were themselves native English speakers (which will to some extent correlate with bilingualism but not perfectly); and (b) the extent to which the U.S.-raised speaker's childhood peers (neighborhood + school, if not the same group) were or were not members of the same ethnic group.

    NYC is also full of U.S-raised speakers (and potential judges) of South Asian ancestry, so you could put them into the mix, as well.

  7. Bayes, lingustics, and Alan Turing « The Home for Wayward Statisticians said,

    November 15, 2010 @ 2:13 pm

    […] Log has  an interesting post on using Bayesian statistics to combine several items of weak evidence to arrive at a strong conclusion.  Better yet, it […]

  8. Jarek Weckwerth said,

    November 15, 2010 @ 4:20 pm

    Thanks for this post. Extremely interesting. (Also because I was supposed to be at that conference but had to cancel. More insider reports please?)

    One other thing to control (or look)* for in this kind of study, which would be sort-of absent from this version of the Italian/English thought experiment, would be the effect of the listeners' familiarity with the accents, or indeed of their linguistic experience in general. I bet the results would have been different if the survey had been done in a community more "homogenous" than NYC…

    * Nice coordination, no? ;)

  9. John Swindle said,

    November 15, 2010 @ 4:24 pm

    The work was done in New York, so I wonder whether the term "Pacific Asian Americans" is local to New York. In context it looks like it refers to Americans of East Asian (specifically CJK) ancestry, to distinguish them from those with ancestry in some other part of Asia.

    [(myl) Yes, in this case it is meant especially to distinguish Chinese and Korean from (east) Indian, Filipino, etc.]

  10. Dan T. said,

    November 15, 2010 @ 4:25 pm

    Of course, the prior probability of the newspaper being a particular language also needs to be factored in. Your examples apparently assume you started out with a paper equally likely to be English or Italian. If, however, the paper is selected by randomly picking a paper from a newsstand in New York City, then the prior probability of it being English would be highest, with the remainder of the possibility space being divided among lots of other languages that are widely spoken in that area. A newsstand in Rome would get a very different distribution.

  11. Jonathan Mayhew said,

    November 15, 2010 @ 4:42 pm

    Some letters will be more saliently non-Italian: j, k, w, x, y. At that statistical extreme, only a few of these letters would provide the give-away. Or, in a 100-letter random sample, the absence of these five.

    [(myl) You're right. But in this case, for pedagogical reasons, I looked at the Italian side; and avoided drawing any inferences from "the dogs that didn't bark".]

  12. J. W. Brewer said,

    November 15, 2010 @ 5:07 pm

    I thought "Pacific Asian Americans" was odd-sounding, as well. But some google hits suggest to me that it may be a variant way of saying "Asian/Pacific Islander," which is a metacategory in U.S. affirmative-action/diversity classification jargon (in contexts where it is not thought necessary or desirable to break out persons of "Pacific Islander" race/ethnicity, such as e.g. Samoans, separately). See, e.g., Not that using that metacategory in this context makes any particular sense. And here's a usage that seemingly contrasts "Pacific Asian Americans" with "Southeast Asian Americans," which frankly confuses me:

    In any event, I note that "Pacific" is not used at all in the body of the abstract as opposed to the title, although I suppose academics are not quite like journalists who should be presumed innocent of having themselves written the headlines appearing above their bylines.

  13. Matt C said,

    November 15, 2010 @ 5:32 pm

    An interesting statistical measure of texts is the index of coincidence – I think first used by Friedman in the 20s, but I'm not entirely certain. It is essentially a measure of how uneven the distribution of letters in a text is – and is surprisingly good at separating different languages, although it was originally devised for cryptanalysis (i.e. separating plaintext from ciphertext). is helpful, although I think it underestimates the usefulness for more complicated pen and paper ciphers.

  14. Rubrick said,

    November 15, 2010 @ 7:18 pm

    Excellent post, as usual. I was struck by the first line, though: "One of the most interesting papers at an interesting conference was Michael Newman…"

    This bit of metonymy is completely unexceptional within academia, but must sound quite peculiar to those outside it. (I'm not an academic myself, but I hang out with them a lot….) It reminds me of the previously-posted-about "Who are you wearing?", and one which must be familiar to all conference-goers: "How many Portlands ago did I meet you?"

  15. Ran Ari-Gur said,

    November 15, 2010 @ 7:36 pm

    @Rubrick: I'm familiar with the metonymic usage you're referring to, but in this case, I believe Dr. Liberman is referring to the paper thusly:

    > Michael Newman, "Identifying native English speaking Pacific Asian Americans by voice"

    such that "Michael Newman" is only part of the identifier.

  16. john riemann soong said,

    November 16, 2010 @ 12:09 am

    a fascinating work, but I'm not really content. (plus I really see no reason why CJK should be arbitrarily segregated from other Asian ethnic groups.)

    although I guess I can't expect a lot from a preliminary study. I've always been curious about L1 "ethnically-influenced language enclaves", e.g. AAVE speakers were likely to be highly exposed to standard English environments as children, but Asian immigrants adopt seemingly standard English rather than the speech style of their parents.

    but maybe they are adopting a speech style that is subtly more different?

  17. Patrick Hall said,

    November 16, 2010 @ 4:20 am

    Since the topic is in the air, I'll mention something I believe I've noticed (or have convinced myself that I've noticed): a tendency among Asian-heritage native speakers not to use schwa in certain unstressed syllables.

    I have just two examples in mind:

    I believe I've heard "o'clock" pronounced more than once by members of this group as /ˌoʊˈklak/ as opposed to /ˌəˈklak/.

    The other example is at 0:23 of this interview with Korean American actor Daniel Dae Kim:

    I hear a fully articulated but unstressed diphthong in "you know" at that point: /ˌjuwˈnow/, neither /ˌjəˈnow/ nor /ˌjuˈnow/.

    I'm curious what others' reactions are to this.

  18. GeorgeW said,

    November 16, 2010 @ 8:03 am

    @Patrick Hall: I have noticed a similar tendency among some African-American speakers. I hear schwas given full value in words like the last syllable of 'government.' I have assumed that hypercorrection is the reason.

  19. Amy Stoller said,

    November 16, 2010 @ 9:21 pm

    I wonder whether one of the characteristics of "voice" in this case is resonance.

  20. Rodger C said,

    November 17, 2010 @ 9:41 am

    @GeorgeW: Some African-Americans (and some white Southerners) put secondary stress on the last syllable of words like "government," "Washington" not as a hypercorrection as such but because the syllable otherwise tends to be almost completely swallowed. It's a way of pronouncing the word clearly in a dialect whose stress patterns differ from those of standard American, and also in which most three-syllable utterances with initial stress are of the type "stalking horse."

  21. john riemann soong said,

    November 17, 2010 @ 6:07 pm

    Primarily because I picked up much of my vocabulary through reading, for a long time I thought "walk" and "talk" and "could" and "would" had "swallowed" dark l's.

    though when I ask other white people … sometimes they get surprised too.

  22. Michael Newman said,

    November 17, 2010 @ 10:23 pm

    I didn't know Mark had posted this about my paper. Thanks. I am intrigued by the idea of probability in this because I had assumed that different listeners were picking up one or more cues depending on their attunement to cues and their attunement to the category. I need to rethink this. I'll also check out the schwas for sure. It's possible but not something I noticed. As for other questions, everyone asks about the bilingualism of the speakers, and all were to some extent, but I don't see it as terribly interesting or relevant. There are features of Latino English that come from Spanish and appear in speakers with very limited Spanish. For an even clearer example, think of Irish English. The voice was breathiness as determined by spectral slope. I'll put the handout up on my website as soon as I catch up on my class preps and other tasks I'm behind on. Pacific Asians is NOT used in NYC. It is used by Angela Reyes in a book to clearly distinguish the population from South Asians. Since in Britain the default Asians are South Asians, I thought I would include it in the title to make it clear that I wasn't discussing that group. Also, the speakers were just Koreans and Chinese.

  23. Tyler P said,

    November 19, 2010 @ 12:25 am

    My experience looking at the relationship between listeners' perceptions of talker race and dialectal distribution of phonetic/phonological features mirrors what is discussed here. Individual talkers may exhibit only a subset of the speech features 'typical' of a given dialect, and moreover may not exhibit them consistently.

    For instance, in our paper, we likewise saw how it is the combination of several features, not merely the presence, absence, or frequency of any particular feature, that is predictive of listeners' racial categorization of voices.

    It's also worth considering whether other acoustic properties of the signal besides the dynamic phonetic/phonological features of dialect may play a role in the perception of race from voice. For example, although differences in vocal tract anatomy between Americans of African or European ancestry have not been demonstrated in the literature, there is some evidence that the vocal anatomy of Asian-Americans may indeed (on average) differ significantly from the former two groups.

    For instance, see the citation below, as well as related work by these authors:
    Xue, S. A., Hao, G. J. P., & Mayo, R. (2006). Volumetric measurements of vocal tracts for male speakers from different races. Clinical Linguistics and Phonetics, 20, 691–702.
