
From a physical point of view, syllables reflect the fact that speaking involves oscillatory opening and closing of the vocal tract at a frequency of about 5 Hz, with associated modulation of acoustic amplitude. From an abstract cognitive point of view, each language organizes phonological features into a sort of grammar of syllabic structures, with categories like onsets, nuclei and codas. And it's striking how directly and simply the physical oscillation is related to the units of the abstract syllabic grammar: there's no similarly direct and simple physical interpretation of phonological features and segments.

This direct and simple relationship has a psychological counterpart. Syllables seem to play a central role in child language acquisition, with words following a gradual development from very simple syllable patterns, through closer and closer approximations to adult phonological and phonetic norms. And as Lila Gleitman and Paul Rozin observed in 1973 ("Teaching reading by use of a syllabary", Reading Research Quarterly), "It is suggested on the basis of research in speech perception that syllables are more natural units than phonemes, because they are easily pronounceable in isolation and easy to recognize and to blend."

In 1975, Paul Mermelstein published an algorithm for "Automatic segmentation of speech into syllabic units", based on "assessment of the significance of a loudness minimum to be a potential syllabic boundary from the difference between the convex hull of the loudness function and the loudness function itself." Over the years, I've found that even simpler methods, based on selecting peaks in a smoothed amplitude contour, also work quite well (see e.g. Margaret Fleck and Mark Liberman, "Test of an automatic syllable peak detector", JASA 1982; and slides on Dinka tone alignment from EFL 2015).

In this post, I'll present a simple language-independent syllable detector, and show that it works pretty well. It's not a perfect algorithm or even an especially good one. The point is rather that "syllables" are close enough to being amplitude peaks that the results of a simple-minded, language-independent algorithm are surprisingly good, so that maybe self-supervised adaptation of a more sophisticated algorithm could lead in interesting directions.

Code (in GNU Octave) is here. The method is to

1. calculate the amplitude spectrum in a 16 msec window 100 times a second;
2. at each time step, subtract the sum of the frequencies above 3 kHz from the sum of the frequencies below 3 kHz;
3. set negative values to zero;
4. smooth the resulting time function by convolving with a 70-msec Hamming window;
5. select all peaks whose value is greater than 4% of the maximum peak value.
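For readers who want to see those steps spelled out, here is a minimal pure-Python sketch of the recipe. It is not the Octave code linked above. The function names, the Hamming taper applied to each analysis frame, the naive DFT, and the choice to report each peak at the center of its frame are all assumptions made for illustration.

```python
import cmath
import math

def hamming(n):
    """Symmetric n-point Hamming window."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def band_difference(frame, sample_rate, pivot_hz=3000.0):
    """Summed amplitude below the pivot minus summed amplitude above it, floored at 0."""
    n = len(frame)
    tapered = [x * w for x, w in zip(frame, hamming(n))]  # taper is an assumption
    low = high = 0.0
    for k in range(n // 2 + 1):
        # Naive DFT bin; a practical implementation would use an FFT.
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(tapered))
        if k * sample_rate / n < pivot_hz:
            low += abs(coeff)
        else:
            high += abs(coeff)
    return max(low - high, 0.0)

def smooth(values, kernel):
    """Same-length convolution with a normalized kernel (zero-padded at the edges)."""
    half = len(kernel) // 2
    total = sum(kernel)
    out = []
    for t in range(len(values)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = t + j - half
            if 0 <= idx < len(values):
                acc += w * values[idx]
        out.append(acc / total)
    return out

def pick_peaks(contour, threshold=0.04):
    """Indices of local maxima whose value exceeds threshold * (largest peak)."""
    peaks = [t for t in range(1, len(contour) - 1)
             if contour[t - 1] < contour[t] >= contour[t + 1]]
    if not peaks:
        return []
    top = max(contour[t] for t in peaks)
    return [t for t in peaks if contour[t] > threshold * top]

def syllable_peaks(samples, sample_rate, frame_rate=100,
                   window_s=0.016, smooth_s=0.070, threshold=0.04):
    """Return (time_in_seconds, relative_height) for each detected peak."""
    hop = sample_rate // frame_rate
    win = round(window_s * sample_rate)
    raw = [band_difference(samples[s:s + win], sample_rate)
           for s in range(0, len(samples) - win + 1, hop)]
    contour = smooth(raw, hamming(round(smooth_s * frame_rate)))
    peaks = pick_peaks(contour, threshold)
    top = max((contour[t] for t in peaks), default=1.0)
    return [((t * hop + win / 2) / sample_rate, contour[t] / top) for t in peaks]
```

Calling `syllable_peaks(samples, sample_rate)` on a list of mono samples returns time/height pairs in the same two-column shape as the peak listings in this post, with heights scaled so that the largest peak is 1.0.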

I set the three free parameters of the algorithm (the 3 kHz pivot frequency, the smoothing time constant, and the peak threshold) based on common-sense phonetics and experimentation with a half-dozen English sentences. The resulting algorithm was then applied without any tuning to a set of 6000 Chinese sentences (120 from each of 50 speakers), as described in Jiahong Yuan et al., "Chinese TIMIT: A TIMIT-like corpus of standard Chinese", O-COCOSDA 2017. (This dataset is soon to be published by the LDC.)

Here's the audio, waveform and spectrogram for a randomly-selected example:

[Audio clip, waveform and spectrogram]

Here's the forced alignment of a phone-level transcription, parsed into syllables, with each syllable preceded by its estimated starting and ending times in seconds:

0.2325 0.3925 j in
0.3925 0.5725 t ian
0.5725 0.8425 sh ang
0.8425 1.0225 u
1.3425 1.5525 s ai
1.5525 1.7325 q v
1.7325 1.8925 z u
1.8925 2.0225 ui
2.0225 2.3025 h ui
2.3025 2.4925 zh ao
2.4925 2.6625 k ai
2.6625 2.7425 l e
2.7425 2.9325 j i
2.9325 3.0725 sh u
3.0725 3.2425 h ui
3.2425 3.5425 i

Here's a plot showing the smoothed amplitude contour for the same recording, with the peaks indicated with vertical red lines:



Here are the times and relative values of amplitude peaks, as output by the program given above:

0.330 0.147
0.480 0.476
0.700 0.784
1.470 1.000
1.680 0.434
1.850 0.333
1.940 0.524
2.160 0.507
2.430 0.824
2.620 0.359
2.700 0.380
2.900 0.202
3.030 0.166
3.180 0.328
3.460 0.068

Lining those peaks up with the lexical-phonological syllables, we see that there is one miss and no false alarms:

0.2325 0.3925 j in 0.330
0.3925 0.5725 t ian 0.480
0.5725 0.8425 sh ang 0.700
0.8425 1.0225 u MISSING
1.3425 1.5525 s ai 1.470
1.5525 1.7325 q v 1.680
1.7325 1.8925 z u 1.850
1.8925 2.0225 ui 1.940
2.0225 2.3025 h ui 2.160
2.3025 2.4925 zh ao 2.430
2.4925 2.6625 k ai 2.620
2.6625 2.7425 l e 2.700
2.7425 2.9325 j i 2.900
2.9325 3.0725 sh u 3.030
3.0725 3.2425 h ui 3.180
3.2425 3.5425 i 3.460
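The lining-up step can be sketched in a few lines of Python. The post doesn't spell out the exact scoring rule, so this assumes one plausible convention: a syllable whose interval contains no peak is a miss, every peak beyond the first inside a syllable is a false alarm, and so is any peak that falls outside all syllable intervals. The function name `score_peaks` is hypothetical, and the data are the alignment and peak times listed above.

```python
def score_peaks(syllables, peak_times):
    """Return (labels of missed syllables, number of false alarms).

    syllables: list of (start, end, label); peak_times: list of seconds.
    Scoring convention is an assumption, not taken from the post.
    """
    misses, false_alarms, used = [], 0, 0
    for start, end, label in syllables:
        inside = [p for p in peak_times if start <= p < end]
        used += len(inside)
        if not inside:
            misses.append(label)
        else:
            false_alarms += len(inside) - 1  # extra peaks in one syllable
    false_alarms += len(peak_times) - used   # peaks outside every syllable
    return misses, false_alarms

syllables = [
    (0.2325, 0.3925, "j in"), (0.3925, 0.5725, "t ian"), (0.5725, 0.8425, "sh ang"),
    (0.8425, 1.0225, "u"), (1.3425, 1.5525, "s ai"), (1.5525, 1.7325, "q v"),
    (1.7325, 1.8925, "z u"), (1.8925, 2.0225, "ui"), (2.0225, 2.3025, "h ui"),
    (2.3025, 2.4925, "zh ao"), (2.4925, 2.6625, "k ai"), (2.6625, 2.7425, "l e"),
    (2.7425, 2.9325, "j i"), (2.9325, 3.0725, "sh u"), (3.0725, 3.2425, "h ui"),
    (3.2425, 3.5425, "i"),
]
peaks = [0.330, 0.480, 0.700, 1.470, 1.680, 1.850, 1.940, 2.160,
         2.430, 2.620, 2.700, 2.900, 3.030, 3.180, 3.460]

print(score_peaks(syllables, peaks))  # (['u'], 0)
```

Under this convention the first example comes out as one miss (the zero-onset /u/) and no false alarms, matching the hand alignment.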

The miss is due to failure to catch the zero-onset syllable /u/:



That disyllabic sequence /sh ang u/ might plausibly be a single syllable in some other language:

[Audio clip]

Here's another randomly-selected sentence from the same collection:

[Audio clip]

In this case, there are no misses but one false alarm — two peaks in the syllable /t ao/:

0.2125 0.4525 g ou 0.330
0.4525 0.7325 f ang 0.570
0.7325 0.9725 t an 0.860
0.9725 1.1725 p an 1.080
1.1725 1.2925 d e 1.230
1.2925 1.4925 g uo 1.430
1.4925 1.8825 ch eng 1.660
1.8825 2.0525 ie 1.980
2.0525 2.2225 j iu 2.160
2.2225 2.3725 sh iii 2.330
2.3725 2.6325 t ao 2.410 2.540
2.6325 2.8425 j ia 2.750
2.8425 3.0825 h uan 2.960
3.0825 3.2625 j ia 3.190
3.2625 3.3825 d e 3.290
3.3825 3.5725 g uo 3.490
3.5725 3.8725 ch eng 3.710

Looking at the waveform and spectrogram for the /t ao/ syllable in context, it seems that there may have been some microphone breath capture:

[Audio clip]

Anyhow, across the whole set of 6000 sentences and 88,724 "true" syllables, we get

N SYLLABLES   N MISSES   N FALSEALARMS
88724         6706       8262

Precision  0.915
Recall     0.924
F1         0.920

…where "precision" is the proportion of detected syllables that are valid; "recall" is the proportion of valid syllables that are detected; and F1 is the harmonic mean of precision and recall, i.e. 2*precision*recall/(precision+recall).
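As a worked check on those figures, here is a short Python sketch. The post gives the definitions in words but not the formulas in terms of the three counts, so the formulas below are assumptions: recall is taken as (syllables minus misses) over syllables, and precision as syllables over (syllables plus false alarms), which is the form that reproduces the reported 0.915. Using only the detected true syllables in the numerator of precision would give a somewhat lower value, about 0.908.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

n_syllables, n_misses, n_false_alarms = 88724, 6706, 8262

# Assumed formulas; they are inferred from the reported numbers.
recall = (n_syllables - n_misses) / n_syllables
precision = n_syllables / (n_syllables + n_false_alarms)

print(f"Precision {precision:.3f}")                 # Precision 0.915
print(f"Recall    {recall:.3f}")                    # Recall    0.924
print(f"F1        {f1_score(precision, recall):.3f}")  # F1        0.920
```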

Again, it would be easy to tune the algorithm to do better, and a more sophisticated modern deep-learning approach would doubtless do much better still. But the point is that the basic correspondence between phonological syllables and amplitude peaks in decent-quality speech is surprisingly good, and using this correspondence to focus the attention of (machine or human) learners is probably a good idea.
