Gesture at 8:00 a.m.


Here at AAAS 2012 in Vancouver, this morning's 8:00 a.m. Section-Z symposium is "Gesture, Language, and Performance: Aspects of Embodiment", organized by Philip Rubin. The abstract:

Communication, language, performance, and cognition are all shaped in varying ways by our embodiment (our physicality, including brain and body) and our embeddedness (our place in the world: physical, social, and cultural). The real-time production of spoken and signed language involves the dynamic control of speech articulators, limbs, face, and body, and the coordination of movement and gesture, by and between individuals. Increases in computing power and the recent emergence of ubiquitous and flexible sensing and measurement technologies, from inexpensive digital video and other devices to higher-end tools, are beginning to make it possible to capture these complex activities more easily and in greater detail than ever before. We are on the cusp of a revolution in sign, gesture, and interactive communication studies. New computational and statistical tools and visualization techniques are also helping us to quantify and characterize these behaviors and, in certain instances, use them to control and synthesize speech, gesture, and musical performance. This symposium brings together experts spanning linguistics, computer science, engineering, and psychology to describe new developments in related areas of inquiry. These include coordination and synchrony during spoken and signed language, gestural control of musical performance, physiologically and acoustically realistic articulatory speech synthesis, and cognitive and linguistic development.

There'll be three presentations: Sidney Fels, "Talking with your mouth full: Performing with a gesture-to-voice synthesizer"; Martha Tyrone, "Capturing the structure of American Sign Language"; Eric Vatikiotis-Bateson, "Coordination: The plaything of expressive performance".

The individual abstracts:

Sidney Fels, University of British Columbia, Vancouver

We create DIgital Ventriloquized Actors (DIVAs) that use hand gestures to synthesize speech and song by means of an intermediate conversion of these gestures to articulatory parameters of a voice synthesizer. This requires overcoming technical challenges related to tracking gestures, synthesis quality, and the complex mapping between performance and meaningful, expressive vocal sounds that are easy to learn. We discuss these components, contrasting a frequency-based and an articulatory-based approach for speech synthesis. The relationship between gesture and voice production embodied in a new type of musical instrument provides a rich means for human expression. We show the latest performance work composed using these DIVAs.

Martha Tyrone, Long Island University, Brooklyn, NY

American Sign Language (ASL) is a natural, signed language used by Deaf people in the United States and Canada. (The term ‘Deaf’ refers to the community of ASL users rather than to clinical hearing loss.) Unlike spoken languages, signed languages use the hands and arms as primary articulators, and signs are perceived visually rather than auditorily. While researchers have been studying the linguistic structure of ASL for several decades, investigation of the physical/articulatory structure of the language has been extremely limited. This study examines ASL using the theoretical framework of articulatory phonology, which proposes that the basic units of speech are articulatory gestures. Thus, according to this theory, the articulatory and linguistic structures of spoken language are inter-related. We hypothesize that articulatory gestures are also the structural primitives of signed language, and we are investigating what the gestures are and how they are timed. For this study, sign production data were collected using an optical motion capture system that tracked the positions of the arms, head, and body over time as Deaf signers produced ASL phrases. The signers were asked to produce specific target signs occurring in various phrase positions. The target signs included movements either toward or away from the body, allowing us to compare superficially similar but linguistically distinct movement phases: as the arm moves toward a location on the body, spends some time at that location, and then moves away from the body. Our findings suggest that signs, like spoken words, are lengthened at phrase boundaries in a manner consistent with the predictions of a task-dynamic model of prosodically induced slowing. In the long run, these findings could assist with the automatic parsing of American Sign Language.

Eric Vatikiotis-Bateson, University of British Columbia, Vancouver

Spoken communication and musical performance are arguably our most highly skilled activities. In order to analyze speech and music behaviorally, we must find tractable ways to associate complex signal arrays with events of interest. Unfortunately, both sides of this association are problematic. Signals may occur simultaneously within and across multiple channels and modalities, at multiple physical locations, and with the potential for signal correspondence at multiple levels of spatial and temporal coordination (that is, patterns within patterns). Determining what events to measure is limited by technology and by a predisposition to seek familiar structures that may not accommodate the context-specific event structures that emerge in ephemeral behavior. The problem is that if, as we believe, communicative expression is predominantly context-dependent, then identifying emergent events is of fundamental importance. In this talk, we describe an approach to the measurement and analysis of expressive language and musical performance that allows both emergent and familiar events to be quantified in the instantaneous correlation patterns between signals. Our method demonstrates that spatial and temporal coordination within and between performing individuals is ubiquitous and can be accurately assessed so long as temporal fluctuations in the pattern structure are incorporated into the analysis. We also demonstrate the value of optical flow analysis as a non-invasive and labor-saving means of recovering two-dimensional motion from video recordings. What was previously thought to be a crude method of motion capture, when pooled for defined regions of interest, provides sensitive measures of performance behavior. We exemplify the motion capture and correlation analysis techniques using conversational data from English, Shona (Zimbabwe), and Plains Cree (Western Canada); the integration of posture, respiration, and vocalization in speech and song; and the expressive coordination between pianist and vocalist in Lieder/Art Song.
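
For readers wondering why "temporal fluctuations in the pattern structure" matter, here is a rough sketch of the general idea. It is not the method described in the abstract, which doesn't spell out its algorithm; a plain sliding-window Pearson correlation stands in for the "instantaneous correlation", and the function names, the window size, and the toy `speaker` and `listener` signals are all made up for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    dx = [v - mean_x for v in x]
    dy = [v - mean_y for v in y]
    var_x = sum(d * d for d in dx)
    var_y = sum(d * d for d in dy)
    if var_x == 0 or var_y == 0:
        return 0.0
    cov = sum(a * b for a, b in zip(dx, dy))
    return cov / math.sqrt(var_x * var_y)


def windowed_correlation(a, b, window):
    """Correlation of a and b within each sliding window of `window` samples."""
    if len(a) != len(b):
        raise ValueError("signals must have the same length")
    if window < 2 or window > len(a):
        raise ValueError("window must be between 2 and the signal length")
    return [
        pearson(a[i:i + window], b[i:i + window])
        for i in range(len(a) - window + 1)
    ]


# Hypothetical data: two motion signals that move together for the first
# half of the recording and in opposition for the second half.
t = [i / 10 for i in range(200)]
speaker = [math.sin(x) for x in t]
listener = [math.sin(x) if i < 100 else -math.sin(x) for i, x in enumerate(t)]

# One correlation value per window position, tracking coordination over time.
corr = windowed_correlation(speaker, listener, window=30)
```

In this toy case the early windows show the two signals moving in step and the late windows show them moving in opposition, a change that a single correlation computed over the whole recording would blur together.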


  1. maidhc said,

    February 20, 2012 @ 3:52 am

    It used to be that only crazy people talked to themselves while walking down the street. Now everybody walks around holding conversations with thin air at the top of their lungs. But you can still spot the crazy people because they are waving their arms.

    But as soon as we get gesture-based tech toys, we will all be waving our arms like the semaphore version of Wuthering Heights, and who will there be to tell us how ridiculous we look? Our grandchildren will think that the Ministry of Silly Walks sketch was a documentary.

  2. Bob Krauss said,

    February 20, 2012 @ 2:04 pm

    @maidhc: In my observation, apparently sane cellphone users do plenty of gesturing as they talk. Indeed, the fact that speakers make gestures that can't be seen by a conversational partner raises interesting questions about the functions these gestures serve.

  3. un malpaso said,

    February 20, 2012 @ 7:14 pm

    Sometimes I notice myself doing this (gesturing while on a cell phone call), although not really often in public. But every time I catch myself making a gesture that can't be seen, I go through a little mental cycle of embarrassment, my brain scolding myself for running an automatic program like that, and then genial acceptance: "well, the gestural part of language can't be suppressed…" all in the span of a couple of seconds during the conversation. Mainly, though, I am just embarrassed.
