Another little Chinese v. English experiment


With respect to yesterday's little perception experiment ("Can you tell the difference between English and Chinese?", 12/20/2013), Edward Lindon asked, semi-rhetorically:

Could the putative perceived similarities have any connection with the rhythms and inflections of the "broadcast voice"? Would the results be the same if the sample were composed of daily or conversational speech?

And Cygil responded, taking him literally:

Exactly. Newsreaderese is a bizarre dialect of English that, if you used it in regular conversation, would immediately mark you as a madman.

This is absolutely all true, though incomplete — there are at least four or five quite distinct dialects of newsreaderese in English, and probably in other languages/cultures as well. See "Celebrity Voices", 3/26/2011, for some discussion.

So this morning, I've selected eight phrases at random from published corpora of conversational telephone speech in English and in Mandarin, and you can try the same experiment again.

There are two small changes. First, lowpass filtering, never a wonderful strategy for "delexicalizing" speech recordings, works especially badly for telephone speech, where the nominal passband is around 300-3400 Hz, so that lowpass filtering at 300 Hz (as I did for yesterday's experiment) leaves very little of the signal. So I've pitch-tracked the original audio clips, and then synthesized them as amplitude-and-frequency-modulated complex tones. As a result, this time you don't need to worry about the frequency response of laptop or tablet speakers. (Code for creating stimuli of this type is available on request, though I warn you that it's not pretty: a thrown-together amalgam of C, shell, and matlab programs/scripts.)
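The resynthesis step can be sketched in a few lines. This is not the code used for these stimuli (which, as noted, is a C/shell/matlab amalgam); it's a hypothetical Python/NumPy reconstruction of the general technique: interpolate a frame-level pitch track and amplitude envelope up to the audio sample rate, then use them to drive a harmonic complex tone.

```python
import numpy as np

def synthesize_modulated_tone(f0_track, amp_track, frame_rate=100,
                              sample_rate=16000, n_harmonics=10):
    """Resynthesize a frame-level pitch/amplitude contour as an
    amplitude-and-frequency-modulated harmonic complex tone."""
    n_frames = len(f0_track)
    n_samples = int(round(n_frames * sample_rate / frame_rate))
    t_frames = np.arange(n_frames) / frame_rate
    t_samples = np.arange(n_samples) / sample_rate
    # Upsample the frame-level tracks to audio rate.
    f0 = np.interp(t_samples, t_frames, f0_track)
    amp = np.interp(t_samples, t_frames, amp_track)
    # Integrate instantaneous frequency to get phase.
    phase = 2.0 * np.pi * np.cumsum(f0) / sample_rate
    signal = np.zeros(n_samples)
    for h in range(1, n_harmonics + 1):
        signal += np.sin(h * phase) / h   # 1/h spectral rolloff
    # Apply the amplitude envelope and normalize to [-1, 1].
    return amp * signal / np.max(np.abs(signal) + 1e-12)
```

With a flat 120 Hz pitch track and a constant envelope this produces a steady buzz; fed a real pitch track and energy contour, it keeps only the prosody, which is the point of the manipulation.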

And second, you could enter your responses on a Google Form here — until I turned data collection off: after a couple of hours we had enough responses, so I've given out the answers below and shut down the collection. (I still haven't found a convenient way to randomize the order of stimulus presentation on a per-subject basis while automatically keeping track of the stimulus/response relationships…)
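For what it's worth, the per-subject randomization is easy once you control the presentation software yourself (the difficulty is doing it inside Google Forms). A hypothetical sketch: seed a random generator on the subject ID, shuffle the clip list, and record responses as (position, clip, response) triples so the stimulus/response mapping survives the shuffle.

```python
import random

def presentation_order(clip_ids, subject_id):
    # Seeding on the subject ID makes each subject's order
    # reproducible, so responses can be re-aligned with stimuli later.
    order = list(clip_ids)
    random.Random(subject_id).shuffle(order)
    return order

def record_responses(clip_ids, subject_id, answers):
    # answers[i] is the response to the i-th clip as presented;
    # return (presentation position, clip id, response) triples.
    order = presentation_order(clip_ids, subject_id)
    return [(i, clip, ans)
            for i, (clip, ans) in enumerate(zip(order, answers))]
```

Each subject sees a different order, but because the order is a deterministic function of the subject ID, nothing beyond the answers themselves needs to be stored.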

Again, I'll leave comments closed until we have an adequate number of responses. [As has now happened…]


Results.  The overall percent correct is 60%. Clip by clip:

Clip                  1      2      3      4      5      6      7      8
Responses (Ch/En)   13/27  23/17  35/5   10/30  23/17  23/17  7/33   22/18
Truth                Ch     En     Ch     En     Ch     Ch     En     En
Percent correct     32.5%  42.5%  87.5%  75.0%  57.5%  57.5%  82.5%  45.0%
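The scoring arithmetic is simple enough to reproduce. A minimal sketch, with the response counts and ground truth transcribed from the table (40 responses per clip):

```python
# Per-clip (Chinese, English) response counts and ground truth.
counts = {1: (13, 27), 2: (23, 17), 3: (35, 5), 4: (10, 30),
          5: (23, 17), 6: (23, 17), 7: (7, 33), 8: (22, 18)}
truth = {1: 'Ch', 2: 'En', 3: 'Ch', 4: 'En',
         5: 'Ch', 6: 'Ch', 7: 'En', 8: 'En'}

def score(counts, truth):
    """Return per-clip percent correct and the overall percent correct."""
    per_clip = {}
    n_correct = n_total = 0
    for clip, (ch, en) in counts.items():
        correct = ch if truth[clip] == 'Ch' else en
        per_clip[clip] = 100.0 * correct / (ch + en)
        n_correct += correct
        n_total += ch + en
    return per_clip, 100.0 * n_correct / n_total
```

Running this gives the 60% overall figure: 192 correct responses out of 320.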

The original clips: [audio players embedded in the original post]


Why was the performance so much worse this time?  There are two obvious differences:

  • Conversational speech rather than broadcast news;
  • Frequency-and-amplitude-modulated complex tones rather than radical low-pass filtering.

I'm inclined to think that the second one is more important: the low-pass speech retains some residual cues to the identity of segments, which makes some rhythmic patterns more salient.
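It helps to compare what the two delexicalization methods leave behind. A lowpass filter keeps everything in the signal below the cutoff, residual segmental energy included, whereas the tone resynthesis keeps only the pitch and amplitude contours. A minimal FFT-based lowpass sketch (a brick-wall filter, for illustration only, not the filter actually used for yesterday's stimuli):

```python
import numpy as np

def lowpass_delexicalize(x, sample_rate=8000, cutoff=300.0):
    # Brick-wall lowpass via the FFT: zero all bins above the cutoff.
    # Whatever speech energy sits below the cutoff survives intact,
    # which is why residual segmental cues can leak through.
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sample_rate)
    spectrum[freqs > cutoff] = 0.0
    return np.fft.irfft(spectrum, n=len(x))
```

A 100 Hz component passes through untouched while a 1000 Hz component is removed entirely; anything a listener can use below 300 Hz (voicing, some nasality and rhythm cues) is still there.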


  1. ND said,

    December 21, 2013 @ 1:08 pm

This was fascinating; I hit the average almost every time. It's fascinating to think there is spoken Chinese that sounds more identifiably Chinese, and spoken Chinese that is more likely to fool English speakers into thinking it's English (and also that this does not track with the use of English words in Chinese, as in #3).

    Derail observation: the content of the speakers' clips makes the Americans sound, perhaps rightly, like total jerks. Chinese: "If you guys are ok, then I can relax. I think about you every day." English: "…and, uh, and also because I wanted a weapon."

    [(myl) In fairness to the Americans, these are arranged conversations among strangers who have agreed to talk about an assigned topic — the "weapon" quote comes from a discussion of gun control. In contrast, this particular set of Chinese conversations mostly involves calls between family members talking about whatever they want to.]

  2. Yuval said,

    December 21, 2013 @ 1:24 pm

    I was absolutely certain the last one is singing "Happy Birthday to You".

  3. JS said,

    December 21, 2013 @ 6:19 pm

    (3) had me convinced it was English because of the three distinctive moments at which, it turns out, it is English; this one shouldn't count towards totals.

  4. Rick said,

    December 21, 2013 @ 6:26 pm

I thought number 4 was southern (American). Her intonation at the end (which I later learned was "recycling drive") sounded very southern to me. I live in Georgia, and my doctor (a local, I think) called me recently, and he used a pattern like that at the end of seemingly every sentence.

  5. JS said,

    December 21, 2013 @ 6:37 pm

    Oops; overlooked ND's comment to this point… but participants' guessing "Chinese" for (3) is as much a wrong answer as a right one, no? what with topic + verb + complement (whatever they are) all in English…

  6. Tom V said,

    December 21, 2013 @ 8:08 pm

    I haven't worked through the samples (slipped up and hit a couple of the clear speech ones by mistake), but I think this technique must be the one used for adult voices in the Snoopy cartoon specials.

  7. PaulB said,

    December 22, 2013 @ 2:04 am

    I scored one better than yesterday, but with 5 out of 8, I guess I'm still guessing.

    1. C 1
    2. E 1
    3. C 1
    4. C 1
    5. E 2
    6. C 2
    7. E 1
    8. C 1

  8. david said,

    December 22, 2013 @ 8:27 am

On the performance difference: perhaps our hearing/analysis apparatus is better designed for, or has more experience with, lowpass-filtered sources than with the other synthetic processing.

  9. Neil Ren said,

    December 23, 2013 @ 12:46 am

Interesting experiment. What's even more interesting are my results: I was 100% correct on the earlier experiment but far below average on this one. It appears that accent could be an important variable in this kind of decision task; for example, the first audio clip features a quickly spoken, high-pitched Chinese dialect (different from Putonghua) which hardly differs from audio clip 4 in English once the segmental information is removed. In addition, the lowpass filtering technique is comparatively problematic: I was able to detect very clear features of Chinese phrases in the earlier test.

    I am also wondering whether it is necessary to group participants into Chi/Eng bilinguals, Chi/Eng mono-linguals, Chi/Non-Eng bilinguals, Non-chi/Eng bilinguals and Non-chi/non-eng bilinguals.

    [(myl) I think you're right on both counts: the lowpass-filtered speech retains a certain amount of relevant segmental information; and no doubt the linguistic background of subjects will affect their performance. These are among several things that would need to be handled in a proper experiment.]

  10. Philip Lawton said,

    December 28, 2013 @ 8:29 pm

    Two things:

    a) What does everyone mean when they say (3) has English in it? I can maybe hear the word "CDs" near the beginning, but that's it.

    b) Why do we think some clips got such a high correct rate compared to others within this test?

  11. Wentao said,

    December 31, 2013 @ 12:41 am

This is so much harder than the previous experiment… there are only 3 clips that I was absolutely sure about: #3, #4 and #6. The ending of #4 sounds like American English, and the dramatic high-pitched staccato of #6 is also pretty typical of Chinese. However, I can't tell there's any English in #3 by listening to the muffled clip alone.

    As a native speaker from Beijing, I'm embarrassed that I can't fully understand #1 – what exactly is she keeping 留着? I'm also curious what dialect this is, Hebei? Shandong?

    @Philip Lawton
#3 says: “你的CD-ROM(?)是…是,那个,install在这个seven上的,对吧?” ["Your CD-ROM(?) is… is, um, install[ed] on this seven, right?"]

  12. tuncay said,

    December 31, 2013 @ 3:29 am

    I got 100% but I must admit I wasn't fully sure about 5, but I guessed the correct one.

    I don't know what the formula is but I think I listen to rising staccato mostly.

  13. Andre B said,

    January 1, 2014 @ 7:01 am

I think the difference in performance is actually due more to the conversational v. broadcast speech distinction, because in broadcast speech you were less likely to have interjections/self-repair, which we could say falls outside what we know about intonation in each language. In the first experiment it was easier because we could predict with more accuracy that, say, stress patterns belonged to actual speech and were consistent with one language or the other… unless of course this is affected by the audio processing as well (I don't really know the details), e.g. neutralizing intensity in English etc.
