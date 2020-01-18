

In "On beyond the (International Phonetic) Alphabet", 4/19/2018, I discussed the gradual lenition of /t/ in /sts/ clusters, as in the ending of words like "motorists" and "artists". At one end of the spectrum we have a clear, fully-articulated [t] sound separating two clear [s] sounds, and at the other end we have something that's indistinguishable from a single [s] in the same context. I ended that post with these thoughts:

My own guess is that the /sts/ variation discussed above, like most forms of allophonic variation, is not symbolically mediated, and therefore should not be treated by inventing new phonetic symbols (or adapting old ones). Rather, it's part of the process of phonetic interpretation, whereby symbolic (i.e. digital) phonological representations are related to (continuous, analog) patterns of articulation and sound.

It would be a mistake to think of such variation as the result of universal physiological and physical processes: though the effects are generally in some sense natural, there remain considerable differences across languages, language varieties, and speaking styles. And of course the results tend to become "lexicalized" and/or "phonologized" over time — this is one of the key drivers of linguistic change.

Similar phenomena are seriously understudied, even in well-documented languages like English. Examine a few tens of seconds of (even relatively careful and formal) speech, and you'll come across some examples. To introduce another case, listen to these eight audio clips, and ask yourself what sequences of phonetic segments they represent:

[Audio: eight clips]

In IPA-ese, I hear the first two of these as something like

ˈgɪɾɚ ˈilɚ

roughly as if they had been spelled "gitter" and "eeler" — and the rest are variations on those themes. Listen for yourself and see what you think.

In fact, the first seven clips present the final two syllables of the word "regular" and the first syllable of the word "attendance", in the seven available performances of TIMIT sentence SX64 "Regular attendance is seldom required". The eighth clip is the final two syllables of the Wiktionary pronunciation of the word "regular". You can listen to the full contexts below:

[Audio: the same eight clips in their full contexts]

These examples illustrate three general features of American English phonetics:

Intervocalic post-stress consonants are often lenited, in this case /g/ sometimes becoming so weak that (if played in initial position) it sounds like a glide.

The reduced vowel in the unstressed second syllable of "regular" assimilates to the /j/ onset, forming a high front vowel [i] or [ɪ]; this can happen in other similar contexts.

A word-initial unstressed vowel can often assimilate to a preceding word-final unstressed vowel — as here with the first syllable of "attendance" and the last syllable of "regular" — so that the residue of the original pair of syllables is a somewhat longer unmodulated merger.

Thus when the TIMIT developers gave the first four syllables of MRLD0_SX64

[Audio: the first four syllables of MRLD0_SX64]

the ARPAbet transcription

r eh g y ix l axr ix

so that what they transcribe as

g y ix l axr ix

sounds like this

[Audio: the portion transcribed as g y ix l axr ix]

they were surely affected by the phoneme restoration effect.
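For readers who don't read ARPAbet, here is a minimal Python sketch that converts the TIMIT transcription quoted above into IPA. The mapping covers only the seven symbols that occur in this one transcription and follows the usual TIMIT phone-code conventions as I understand them; the names `ARPABET_TO_IPA` and `arpabet_to_ipa` are mine, invented for illustration, and are not part of any TIMIT tooling.

```python
# Partial TIMIT/ARPAbet-to-IPA table: only the symbols used in
# "r eh g y ix l axr ix". This is an illustrative subset, not the full phone set.
ARPABET_TO_IPA = {
    "r": "ɹ",
    "eh": "ɛ",
    "g": "ɡ",
    "y": "j",
    "ix": "ɨ",   # reduced high central vowel
    "l": "l",
    "axr": "ɚ",  # r-colored schwa
}


def arpabet_to_ipa(transcription: str) -> str:
    """Convert a space-separated ARPAbet string to an IPA string.

    Raises KeyError for any symbol outside the partial table above.
    """
    return "".join(ARPABET_TO_IPA[symbol] for symbol in transcription.split())


if __name__ == "__main__":
    # The TIMIT transcription of the first four syllables of MRLD0_SX64.
    print(arpabet_to_ipa("r eh g y ix l axr ix"))
    # The portion covering the end of "regular" plus the start of "attendance".
    print(arpabet_to_ipa("g y ix l axr ix"))
```

Setting the output next to what you hear in the clip shows how much of the transcription reflects the dictionary form: a full [ɡ], a separate [j], and two distinct reduced vowels at the word boundary.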

In this context, the following paper is not only plausible but also overdue — Jialu Li and Mark Hasegawa-Johnson, "A Comparable Phone Set for the TIMIT Dataset Discovered in Clustering of Listen, Attend and Spell", NIPS 2018:

Listen, Attend and Spell (LAS) maps a sequence of acoustic spectra directly to a sequence of graphemes, with no explicit internal representation of phones. This paper asks whether LAS can be used as a scientific tool, to discover the phone set of a language whose phone set may be controversial or unknown. Phonemes have a precise linguistic definition, but phones may be defined in any manner that is convenient for speech technology: we propose that a practical phone set is one that can be inferred from speech following certain procedures, but that is also highly predictive of the word sequence. We demonstrate that such a phone set can be inferred by clustering the hidden nodes activation vectors of an LAS model during training, thus encouraging the model to learn a hidden representation characterized by acoustically compact clusters that are nevertheless predictive of the word sequence. We further define a metric for the quality of a phone set (the sum of conditional entropy of the graphemes given the phone set and the phones given the acoustics), and demonstrate that according to this metric, the clustered LAS phone set is comparable to the original TIMIT phone set. Specifically, the clustered-LAS phone set is closer to the acoustics; the original TIMIT phone set is closer to the text.

As exemplified above, the TIMIT phonetic transcriptions often reflect expectations from the formal dictionary-based pronunciation standard, which is influenced by the spelling even before any continuous-speech reductions set in — so matching TIMIT's performance on this paper's "metric for the quality of a phone set (the sum of conditional entropy of the graphemes given the phone set and the phones given the acoustics)" should not be all that difficult. Still, no one has ever done it before, so this research is an important contribution.
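To make the grapheme half of that metric concrete, here is a minimal sketch of a conditional entropy estimate computed from aligned (phone, grapheme) pairs. This is not the authors' code: the function name `conditional_entropy` and the toy alignments are my own assumptions, and the second half of the metric (phones given acoustics) needs an acoustic model and is not attempted here.

```python
import math
from collections import Counter


def conditional_entropy(pairs):
    """Estimate H(Y | X) in bits from a list of (x, y) observations.

    Uses plain relative frequencies, with no smoothing.
    """
    pairs = list(pairs)
    n = len(pairs)
    joint = Counter(pairs)                      # counts of (x, y)
    marginal_x = Counter(x for x, _ in pairs)   # counts of x
    entropy = 0.0
    for (x, _y), count in joint.items():
        # p(x, y) * log2 p(y | x)
        entropy -= (count / n) * math.log2(count / marginal_x[x])
    return entropy


if __name__ == "__main__":
    # Hypothetical toy alignments: the phone "ix" is paired with two
    # different letters (the "u" of "regular", the "a" of "attendance"),
    # while "g" is always paired with the letter "g".
    toy = [("ix", "u"), ("ix", "a"), ("g", "g"), ("g", "g")]
    print(conditional_entropy(toy))  # 0.5
```

In this toy case all of the uncertainty comes from "ix", which is compatible with more than one spelling. A phone set that tracks the dictionary form more closely would score lower on this half of the metric, which is the sense in which the TIMIT phone set is "closer to the text".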

The relationship between phonetic variation and lexically-stable phonological categories remains an open theoretical question, in my opinion, but work like this is one very useful direction of inquiry.
