With respect to yesterday's little perception experiment ("Can you tell the difference between English and Chinese?", 12/20/2013), Edward Lindon asked, semi-rhetorically:
Could the putative perceived similarities have any connection with the rhythms and inflections of the "broadcast voice"? Would the results be the same if the sample were composed of daily or conversational speech?
And Cygil responded, taking him literally:
Exactly. Newsreaderese is a bizarre dialect of English that, if you used in regular conversation, would immediately signal you as a madman.
This is absolutely all true, though incomplete — there are at least four or five quite distinct dialects of newsreaderese in English, and probably in other languages/cultures as well. See "Celebrity Voices", 3/26/2011, for some discussion.
This time, there are two small changes from yesterday's experiment. First, lowpass filtering, never a wonderful strategy for "delexicalizing" speech recordings, works especially badly for telephone speech: the nominal passband there is around 300-3400 Hz, so lowpass filtering at 300 Hz (as I did for yesterday's experiment) is problematic. So I've pitch-tracked the original audio clips, and then synthesized them as amplitude-and-frequency-modulated complex tones. As a result, this time you don't need to worry about the frequency response of laptop or tablet speakers. (Code for creating stimuli of this type is available on request, though I warn you that it's not pretty: a thrown-together amalgam of C, shell, and MATLAB programs/scripts.)
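For readers who want a feel for what "amplitude-and-frequency-modulated complex tones" means in practice, here is a minimal Python sketch. It is not the C/shell/MATLAB code mentioned above, and every name and parameter in it (`synthesize_complex_tone`, the 100-frames-per-second track, the 16 kHz sample rate, five harmonics with 1/k weights) is an assumption chosen for illustration. It takes a frame-by-frame pitch track and amplitude track, which some pitch tracker would have to supply, and renders them as a sum of harmonics.

```python
import math

def synthesize_complex_tone(f0_track, amp_track, frame_rate=100,
                            sample_rate=16000, n_harmonics=5):
    """Render per-frame pitch and amplitude values as a harmonic complex tone.

    f0_track:  fundamental frequency in Hz for each frame (0 = unvoiced)
    amp_track: amplitude in [0, 1] for each frame
    All defaults are illustrative assumptions, not values from the post.
    """
    samples_per_frame = sample_rate // frame_rate
    # Dividing by the sum of the 1/k weights keeps each sample within [-amp, amp].
    norm = sum(1.0 / k for k in range(1, n_harmonics + 1))
    phase = 0.0  # running phase of the fundamental, in radians
    out = []
    for f0, amp in zip(f0_track, amp_track):
        for _ in range(samples_per_frame):
            if f0 <= 0:
                out.append(0.0)  # silence in unvoiced frames
                continue
            # Accumulating phase keeps the waveform continuous when f0 changes.
            phase = (phase + 2.0 * math.pi * f0 / sample_rate) % (2.0 * math.pi)
            # Sum harmonics with 1/k weights, skipping any above the Nyquist limit.
            s = sum(math.sin(k * phase) / k
                    for k in range(1, n_harmonics + 1)
                    if k * f0 < sample_rate / 2)
            out.append(amp * s / norm)
    return out

# A toy track: 0.1 s rising from 150 to 240 Hz, then 0.05 s of silence.
f0 = [150.0 + 10.0 * i for i in range(10)] + [0.0] * 5
amp = [0.5] * 15
signal = synthesize_complex_tone(f0, amp)
```

The sketch holds pitch and amplitude constant within each frame; interpolating between frames would give a smoother result. The point is only that the output carries the pitch and loudness contours of the original clip and nothing of its segmental content.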
And second, you could enter your responses on a Google Form here, until I turned data collection off: after a couple of hours, we had enough responses to give out the answers. (I still haven't found a convenient way to randomize the order of stimulus presentation on a per-subject basis, while automatically keeping track of the stimulus/response relationships…)
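The bookkeeping side of per-subject randomization is simple enough to sketch in Python, though this does nothing to solve the problem of doing it inside a form tool. The function names, the subject ID, and the clip labels below are all hypothetical. The idea is to seed a random generator with the subject ID, so that each subject's order is reproducible, and to store each response alongside the stimulus that was actually presented at that position.

```python
import random

def presentation_order(stimulus_ids, subject_id):
    """Return a shuffled copy of stimulus_ids, reproducible for a given subject."""
    rng = random.Random(subject_id)  # seeding by subject ID fixes that subject's order
    order = list(stimulus_ids)
    rng.shuffle(order)
    return order

def record_responses(order, responses, subject_id):
    """Pair each response with the stimulus shown at the same position."""
    if len(order) != len(responses):
        raise ValueError("need exactly one response per stimulus")
    return [
        {"subject": subject_id, "position": i + 1,
         "stimulus": stim, "response": resp}
        for i, (stim, resp) in enumerate(zip(order, responses))
    ]

# Hypothetical clip labels and a hypothetical subject ID.
clips = ["clip1", "clip2", "clip3", "clip4"]
order = presentation_order(clips, "subject-01")
rows = record_responses(order, ["English", "Chinese", "Chinese", "English"],
                        "subject-01")
```

Because every row names its stimulus, responses can be scored clip by clip regardless of the order in which a given subject heard the clips.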
Again, I'll leave comments closed until we have an adequate number of responses. [As has now happened...]
Results. The overall percent correct is 60%. Clip by clip:
The original clips:
Why was the performance so much worse this time? There are two obvious differences:
- Conversational speech rather than broadcast news;
- Frequency-and-amplitude-modulated complex tones rather than radical low-pass filtering.
I'm inclined to think that the second one is more important: in the low-pass speech, there remain some residual cues to the identity of segments, and these leave some rhythmic patterns more salient.