Towards automated babble metrics


There are lots of good reasons to want to track the development of infant vocalizations — see e.g. Zwaigenbaum et al. "Clinical assessment and management of toddlers with suspected autism spectrum disorder" (2009). But existing methods are expensive and time-consuming — see e.g. Nyman and Lohmander, "Babbling in children with neurodevelopmental disability and validity of a simplified way of measuring canonical babbling ratio" (2018).  (It's also unfortunately true that there's not yet any available dataset documenting the normal development of infant vocalizations from cooing and gooing to "canonical babbling", but that's another issue…)

People are starting to make and share extensive recordings of infant vocal development — see e.g. Frank et al., "A collaborative approach to infant research: Promoting reproducibility, best practices, and theory‐building" (2017). But automatic detection and classification of vocalization sources and types is still imperfect at best. And if we had reliable detection and classification methods, that would open up a new set of questions: Are the standard categories (e.g. "canonical babbling") really well defined and well separated? Do infant vocalizations of whatever type have measurable properties that would help to characterize and quantify normal or abnormal development?

There are lots of possibilities, most of which will have to wait until the hoped-for future time when suitable datasets are available. But there's an old idea that might contribute something, namely the suggestion in Potter, Kopp and Green's 1947 book Visible Speech that "By recording speech in such a way that its energy envelope only is reproduced, it is possible to learn something about the effects of recurrences such as occur in the recital of rimes or poetry."

I discussed this work in a 12/18/2013 blog post "Speech rhythm in Visible Speech", where I noted that

This is the first attempt — and the first of many failures — to find evidence of literal isochronism in English speech. As Potter, Kopp & Green found, the syllable-scale spectrum of an individual phrase is in general "randomly mottled". Subsequent research on speech rhythms has mainly relied on time-domain measurements of inter-event intervals, where evidence of isochronism has been similarly "mottled" at best.

But even though rhythmic speech is not literally isochronous, it's still true that the proposed method — spectral analysis at frequencies of 1 to 10 Hz or so — can tell us "something about the effect of [syllable-scale] recurrences". And in particular it might tell us something about recurrences in babble-like vocalizations.

Here's the first 25 seconds of the audio track of this YouTube video:

I calculated the RMS amplitude of this signal in 5-msec frames, 200 times per second, and smoothed the result by convolving it with a gaussian window with standard deviation 70 msec (which crudely corresponds to the integration time-constant of human hearing). The results are shown in the top panel below. And I got a smoothed derivative by convolving with the derivative of the same gaussian, shown in the lower panel:
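For readers who want to try this, here is a minimal sketch of the envelope steps just described, in plain Python. It is an illustration under stated assumptions, not the code behind the figures: the function names, the non-overlapping framing, the truncation of the gaussian at 4 standard deviations, and the zero padding at the edges are all choices made for the example.

```python
import math

def rms_envelope(samples, sample_rate, frame_sec=0.005):
    """RMS amplitude in consecutive non-overlapping frames (5 msec gives 200 frames/sec)."""
    n = int(round(sample_rate * frame_sec))
    return [
        math.sqrt(sum(x * x for x in samples[i:i + n]) / n)
        for i in range(0, len(samples) - n + 1, n)
    ]

def gaussian_kernels(sd_sec=0.070, frame_rate=200.0, width_sd=4.0):
    """Return (smoothing kernel, derivative kernel), both sampled at the frame rate.

    width_sd (how far out the gaussian is kept) is an assumption of this sketch.
    """
    sd = sd_sec * frame_rate                 # standard deviation in frames (14 here)
    half = int(round(width_sd * sd))
    ts = range(-half, half + 1)
    g = [math.exp(-0.5 * (t / sd) ** 2) for t in ts]
    total = sum(g)
    g = [v / total for v in g]               # unit area, so a constant level is preserved
    # Derivative of the gaussian with respect to time, scaled to units per second.
    dg = [-(t / sd ** 2) * v * frame_rate for t, v in zip(ts, g)]
    return g, dg

def convolve_same(signal, kernel):
    """Centered convolution with zero padding; output has the same length as the input."""
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, k in enumerate(kernel):
            idx = i - (j - half)
            if 0 <= idx < len(signal):
                acc += signal[idx] * k
        out.append(acc)
    return out
```

Given mono audio as a list of floats, `env = rms_envelope(samples, sample_rate)` followed by `g, dg = gaussian_kernels()` gives the smoothed envelope as `convolve_same(env, g)` and the smoothed derivative as `convolve_same(env, dg)`. Values within about 4 standard deviations (0.28 seconds) of either end are affected by the zero padding.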

I then calculated amplitude spectra for the derivative signal, in 1.2-second windows, spaced every 50 msec. A spectrogram-style representation of the results for the same 25 seconds, showing frequencies from 0 to 6 Hz, is below. I've added horizontal lines at frequencies 2.0, 2.5, 3.0, 3.5, and 4.0 Hz:
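The windowed analysis can be sketched along the following lines. This is a hedged illustration rather than the procedure actually used for the plot: the function name, the Hann taper, and the 0.05 Hz frequency grid are assumptions, since the text specifies only the window length (1.2 seconds), the spacing (50 msec), and the displayed range (0 to 6 Hz). The sketch evaluates the Fourier sum directly at each frequency of interest, which has the same effect as zero-padding a short window before an FFT.

```python
import math

def sliding_amplitude_spectra(signal, frame_rate=200.0, win_sec=1.2,
                              hop_sec=0.05, fmax=6.0, fstep=0.05):
    """Amplitude spectra of `signal` in overlapping windows.

    Returns (freqs, spectra), where spectra[w][k] is the amplitude at
    freqs[k] Hz in window number w. The Hann taper and the frequency
    grid (fstep) are assumptions of this sketch.
    """
    win = int(round(win_sec * frame_rate))   # 240 frames at 200 frames/sec
    hop = int(round(hop_sec * frame_rate))   # 10 frames
    freqs = [k * fstep for k in range(int(round(fmax / fstep)) + 1)]
    taper = [0.5 - 0.5 * math.cos(2 * math.pi * n / (win - 1)) for n in range(win)]
    # Precompute the cosine and sine basis for every analysis frequency.
    basis = [
        ([math.cos(2 * math.pi * f * n / frame_rate) for n in range(win)],
         [math.sin(2 * math.pi * f * n / frame_rate) for n in range(win)])
        for f in freqs
    ]
    spectra = []
    for start in range(0, len(signal) - win + 1, hop):
        seg = [signal[start + n] * taper[n] for n in range(win)]
        row = []
        for cos_t, sin_t in basis:
            re = sum(s * c for s, c in zip(seg, cos_t))
            im = sum(s * c for s, c in zip(seg, sin_t))
            row.append(math.hypot(re, im))
        spectra.append(row)
    return freqs, spectra
```

Plotting `spectra` with time on the horizontal axis and `freqs` on the vertical axis gives a spectrogram-style picture of the kind described. Pure Python is slow for long recordings, so an array library would be the practical choice for anything beyond short clips.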

And here's the mean spectrum for those 25 seconds:

The peak value is at a frequency of 2.25 Hz, corresponding in the time domain to a period of about 445 msec, which is a reasonable approximation to the dominant quasi-syllable duration in this clip.
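As a small worked example of the last step, the sketch below averages a set of per-window spectra and converts the peak frequency to a period. The function name and the `fmin` cutoff (used here to keep the search away from 0 Hz) are assumptions for illustration, and the numbers in the usage lines are made up to show the arithmetic, not measurements from the clip.

```python
def mean_spectrum_peak(freqs, spectra, fmin=0.5):
    """Average the rows of `spectra`, then find the peak at or above fmin Hz.

    Returns (mean_spectrum, peak_frequency_hz, period_msec).
    fmin is an assumption of this sketch, not a value from the analysis.
    """
    n = len(spectra)
    mean = [sum(row[k] for row in spectra) / n for k in range(len(freqs))]
    candidates = [k for k, f in enumerate(freqs) if f >= fmin]
    best = max(candidates, key=lambda k: mean[k])
    return mean, freqs[best], 1000.0 / freqs[best]

# Toy input: two windows, five frequencies (illustrative values only).
toy_freqs = [0.0, 0.75, 1.5, 2.25, 3.0]
toy_spectra = [[9.0, 1.0, 2.0, 5.0, 3.0],
               [9.0, 1.0, 2.0, 7.0, 1.0]]
mean, peak_hz, period_msec = mean_spectrum_peak(toy_freqs, toy_spectra)
# peak_hz is 2.25, and 1000 / 2.25 is roughly 444.4 msec
```

The conversion itself is just the reciprocal: a frequency of f Hz corresponds to a period of 1000/f msec.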

Here's another audio clip, from the first 25 seconds of this YouTube video:

Here are the RMS time functions:

The spectrogram:

And the mean spectrum:

A final example, from the last 25 seconds of this video:

This child's peak mean frequency is a bit higher, though you can see in the spectrogram that there's also some action in the 2-2.5 Hz range.

This is only a very crude beginning, and of course there are lots of more modern possibilities, especially forms of modulation-spectrum analysis. But this simple exercise suggests to me that such analyses are likely to be fruitful, once we have a systematic body of data to apply them to.

For those who are still reading, I can't resist adding some personal background.

As an undergraduate, more than 50 years ago, I spent a frustrating semester working as a part-time research assistant for Dr. Margaret Bullowa, a psychiatrist who was interested in early life experience from a Freudian perspective. My task was collecting examples for a study of the development of infant vocalization. The input was a large archive of recordings made with a clever but impractical apparatus, described in Margaret Bullowa, Lawrence Gaylord Jones, and Thomas Bever, "The development from vocal to verbal behavior in children" (1964), and Margaret Bullowa, Lawrence G. Jones, and Audrey R. Duckert, "The acquisition of a word" (1964). From the second reference:

The field technique utilized in gathering the data is one newly devised for the purpose of this study. It seeks to gather information on normal child-mother interactions with minimal disturbance of the subject's customary environment. Each observation lasts one half-hour and consists of: (1) a continuous tape recording from a microphone open to the room at large; (2) descriptions of the behaviour and interactions of the subjects by the observer psychiatrist whispered into a shielded microphone which records on a parallel track of the tape ; (3) photographic film (black and white stop-frame at two frames per second) taken without added illumination. A timing signal which occurs every five seconds is imposed in the field and permits resynchronization of the entire observation. Even if the child is asleep, or making no vocal sound at all, the observation is continued ; thus the diachronic character of the observation is preserved.

The recordings for each child began at birth and continued weekly for 30 months. When I joined the project, the archive consisted of a climate-controlled storeroom with floor-to-ceiling shelves piled with hundreds of pairs of tape boxes and film cans. There were of course no transcripts, so my task was to listen to the tapes, and whenever the infant cooed or babbled or whatever — anything other than crying — I was supposed to stop the tape, and make a copy on a second tape recorder, along with notes about the source and location of the original.

Why did I say that the apparatus was impractical? First, the multi-media synchronization didn't really work. There was a machine that in principle made it possible to view the film synchronized with the audio; but this required a lot of complicated set-up to work, which it by no means always did; and it was even harder to re-synchronize everything after stopping. So I quickly learned to ignore the film and just listen to the tapes. And second, because the whole thing was analog, there was no real alternative to listening to everything sequentially: each 30-minute session took a minimum of 30 minutes to scan, since the four-track tape recorder was already running at its fastest speed.

And why did I say that the experience was frustrating? The idea was to find the developmental distribution of relevant vocalizations, which in principle ought to occur roughly over the period from two to nine months or so. Starting from birth, that's 9*4 = 36 tapes per infant, comprising 36*30 = 1080 minutes = 18 hours of audio. The process of copying any vocalizations that I found added substantially to that time. Since the recordings came from a slice-of-real-life field study, a significant fraction of the tapes recorded the child sleeping or involved in activities with no recorded vocalizations. To make a long story short, at the end of the term I had found and copied a depressingly small number of vocalizations. Worse, it was not really clear what to do with those copies.

So I cordially parted ways with Dr. Bullowa, and took a job at a local ice-cream cone factory, where my tasks included loading and cleaning the batter mixers and chocolate vats, spelling line workers for their bathroom breaks, and sweeping the floors. Then life intervened, and I lost touch with her. When I looked for her again, a decade later, her lab had been turned into offices, and no one could tell me what had happened to those stacks of tapes and films.



  1. Victor Mair said,

    May 27, 2019 @ 11:31 am

    I wonder what Dr. Margaret Bullowa might have done with modern instrumentation and material like that in the following:

    "The babbling phase: ranting toddler speaks out" (9/3/10)

    "Twin talk" (3/31/11)

    [(myl) But she was interested in scientific documentation, not viral videos…]

    See also:

    "Baby talk" (12/21/10)

    "Baby talk, part 2" (8/19/18)

    "Ask LL: parents' beliefs or infants' abilities?" (10/29/09)

  2. Trogluddite said,

    May 28, 2019 @ 10:34 am

    I note that there is no mention of any attention given to the mothers' utterances during the task of transcribing Dr. Bullowa's field recordings (not that I would have wanted to give the OP even more odious work to do!) Given that the acquisition of language skills is, presumably, heavily influenced by the child's closest care-givers, I wonder how she defined "normal mother-child interactions"?

    There is considerable debate among autistic people about this "cultural" aspect of language acquisition (among many other traits), due to autism being a highly heritable condition. My parents saw nothing unusual about my lack of verbal interaction in infancy, nor my transition from this to "little professor"/"walking dictionary" verbosity around kindergarten age. This is particularly true of my mother, with whom I share a great many of my autistic traits, and who was raised alongside a younger brother strikingly similar to myself. There is much room here for "nature" and "nurture" to interact, I believe, despite growing awareness of autism since my childhood.

    The sensory sensitivities and processing traits commonly experienced by autistic people are a further confounding factor. Even in adulthood, my ability to form contextually relevant utterances is notably impaired by factors as seemingly trivial (to most people) as an uncomfortable item of clothing, the almost imperceptible noise of an air-conditioning system, or a break in regular routines. At times of extreme stress, I can even lose my language abilities completely, to the point of not even being able to use my "inner voice" for reasoning. Deficits in "theory of mind" may confound this further, by impeding comprehension and production of the pragmatic aspects of language.

    Infants and autistic people with intellectual disabilities may have great difficulty in reflecting upon and reporting such confounding factors, which I believe poses a considerable challenge for research into autism – it is very difficult to research how significant they might be during the earliest stages of language acquisition, and certainly requires an inter-disciplinary approach. The input of autistic adults may also be of some assistance; but we often cannot reflect upon our infant experiences, and not without viewing them from an adult frame of reference.

    [(myl) Dr. Bullowa's overall research program was very much oriented to the study of parent-child interactions, since her motivation included questions about the origins of schizophrenia. And my current interest in ways to quantify babbling behavior comes from the work of a current psychology grad student at Penn, Lisa Yankowitz, who has been studying data from a multi-site longitudinal study of infants at high- and low-familial risk for autism spectrum disorder, and whose research includes an important focus on "social directedness". Because this is 2019, the data is of course in digital form, and a team of undergrad annotators have segmented and labelled over 42,000 child vocalization clips.

    Also relevant here is a recent publication by my colleague Julia Parish-Morris, "It takes two to tango: Multi-directional, dynamic influences on parenting behavior."]

  3. Michèle Sharik Pituley said,

    May 28, 2019 @ 10:56 am

    @Trogluddite — “The sensory sensitivities and processing traits commonly experienced by autistic people are a further confounding factor. Even in adulthood, my ability to form contextually relevant utterances is notably impaired by factors as seemingly trivial (to most people) as an uncomfortable item of clothing, the almost imperceptible noise of an air-conditioning system, or a break in regular routines. At times of extreme stress, I can even lose my language abilities completely, to the point of not even being able to use my "inner voice" for reasoning. Deficits in "theory of mind" may confound this further, by impeding comprehension and production of the pragmatic aspects of language.”

    Omg I thought it was just me! Happened to me just the other day and I was unable to explain why — I finally settled on “my buffer is full” when I was finally able to get a word out. Thank you for making me feel not so alone.

  4. Trogluddite said,

    May 29, 2019 @ 2:57 am

    @Michèle Sharik Pituley
    You're welcome. The metaphor of a "buffer overflow" is indeed very common among autistic people who struggle with sensory or social overload. I'd encourage you to search out a few of the autism community forums online. On many, sharing common experiences wherever they're found is much more important than any specific diagnosis. It can be very liberating to talk openly about such things (or even just to lurk) without fear of ridicule, and maybe even learn some tips to help manage them.

    Many thanks for your response; it's always heartening to hear about research in areas which are so often talking points among those of us affected by developmental conditions, yet rarely receive the limelight.

    Research into the aetiology and neurology of these conditions certainly has its place, and can be very fascinating. But I have spoken to many autistic people and their carers who feel that for too long those areas have overshadowed research into psychology, language, and learning which might lead to better day to day interventions or teachable coping strategies useful when the experts aren't around. However one feels about the controversies surrounding any potential cure or screening tests for autism, speedier progress in other areas could improve a lot of lives during what, for those who do seek a cure, might be a very long wait.

    The inclusion of primary carers in such programs is certainly necessary, IMHO, and I have spoken to many autistic people and carers who agree. A study a few years back of families in the UK supporting a disabled child found that those with an autistic child were the most likely to be having problems accessing support, within family relationships, with finances/employment, and with their mental health. I have read many posts by distraught parents who love their autistic children dearly and are clearly doing the best they can, but are barely coping physically and mentally. No program for an autistic child should be complete without measures to ensure that families and other carers are best equipped to support both the child and the therapies.

    And yes, it "takes two to tango" – just as autistic people can find dealing with non-autistic people frustrating and painful sometimes, there is certainly also traffic in the other direction. While allowing that the traits of autism or unfortunate circumstances may make the tango much less easy for some of us to dance, I do think that the "neuro-diversity" movement can idealise people with developmental conditions a little too much sometimes (poor social awareness can be such an annoying hindrance to my nefarious plots sometimes, darn it!)

    Your posts touching on such subjects, and LL in general, are a haven from the hyperbolic "Gene for X"/"Imminent Golden Bullet" stories which receive so much of the media hullaballoo. And, since linguistics and audio DSP are among my "autistic special interests" (a.k.a. "hobbies"), I get very good value for money! ;-)
