Language Log

Androids, electric sheep, plastic tongues…

November 12, 2008 @ 5:16 pm · Filed by Heidi Harley under Computational linguistics, Phonetics and phonology

For your edification and amusement: An articulator-based, rather than acoustic, speech synthesis device.

The original context, here on Botjunkie, says that the ultimate goal is a voice compression system for cellphones. I'm a bit confused about this — I *think* that the idea is that representing speech articulatorily will be less data-intensive than representing it acoustically, but that seems wildly improbable to me.

Here's the description of the system on the Takanishi Labs page. Amazingly, they even have a rubber set of vocal cords at work! (scroll down to see them in action).

30 Comments

Marinus said,

November 12, 2008 @ 6:10 pm

Yes, the data for an articulatory system would be a great deal smaller and simpler than just playing the sounds. You can encode every IPA letter and whatever stresses you like in an 8-bit snippet of data, with room to spare. In comparison, you measure the data used in a sound recording in kilobytes per second. Too bad we don't have a working artificial articulatory system quite yet.
Brandon said,

November 12, 2008 @ 6:18 pm

I think the benefit of analyzing artificial articulatory systems isn't in the amount of data needed to record the sounds, but in the ability to classify sounds based on how they are spoken rather than on their exact acoustic properties. This should prove especially useful for computers' ability to "listen" to and interpret human speech.

Of course, as was discussed in the latest parrot post, whether there is a foolproof way to determine articulators based solely on acoustics is probably the subject of much debate. (That is, parrots mimic speech quite well with a quite different set of sound organs.)
Don said,

November 12, 2008 @ 6:31 pm

I think it has a good chance of providing a lot of compression. The article says it has 19 degrees of freedom. I'm not sure exactly what they mean by that, but let's assume it means 19 moving parts that can move in 3 dimensions with 256 positions in each dimension. That's probably overestimating a bit, and it's still only 57 bytes per sample. You'd only have to sample the positions maybe up to 100 times per second (being a physical device, the machine would produce intermediate positions automatically), so that's about 45kbps. Add some optimizations, like only storing positions for parts when they actually move, and you could get that figure quite a bit lower. That's at least on par with the free Speex voice compression codec, according to Wikipedia.
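
The arithmetic in this estimate can be checked with a short Python sketch. The figures (19 parts, 3 axes, one byte per axis, 100 samples per second) are the assumptions stated above, not measurements of the Waseda robot, and the function name is made up for illustration.

```python
def articulatory_bitrate(parts=19, axes=3, bytes_per_axis=1, samples_per_second=100):
    """Return (bytes per sample, bits per second) for a naive position stream."""
    bytes_per_sample = parts * axes * bytes_per_axis
    bits_per_second = bytes_per_sample * 8 * samples_per_second
    return bytes_per_sample, bits_per_second


bytes_per_sample, bps = articulatory_bitrate()
print(bytes_per_sample, "bytes per sample")  # 57 bytes per sample
print(bps / 1000, "kbit/s")                  # 45.6 kbit/s
```

Changing `samples_per_second` or `axes` shows how sensitive the result is to the guesses that went into it.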

Using this in cell phones is the hard part. It reminds me of a technology that some old Sound Blaster card was supposed to have. I want to say acoustic modeling, but that appears to be something else. Anyway, it was supposed to be like ray tracing, but for sound. I haven't heard anything about that technology since.
Will said,

November 12, 2008 @ 6:51 pm

Articulatory representation of signal… wow. Considering how theoretically complex motor-theory-based speech perception is even when running on something as impressive as the human brain, I can't imagine how they plan to pull it off on a Nokia. Add in the requisite speaker and vocal tract normalization, and, well, good luck guys.

Also, if they got it working, I have to wonder whether the voices on the other side would be completely normalized (i.e., would everybody sound the same?). I don't think people would buy a phone which couldn't transmit their actual voices.
Ray Girvan said,

November 12, 2008 @ 6:51 pm

"I *think* that the idea is that representing speech articulatorily will be less data-intensive than representing it acoustically, but that seems wildly improbable to me."
Actually, it's been done way back: Linear Predictive Coding, which uses a vocal tract model.
Rubrick said,

November 12, 2008 @ 7:02 pm

This is very cool research. I wonder if the cell-phone compression bit is more of a funding pitch than the actual goal in the minds of the researchers.
GAC said,

November 12, 2008 @ 7:02 pm

Seems kinda cool, even if the robot doesn't quite have the quality of a human voice. Maybe some decent voice recognition and synthesis could come out of it?

But the application is cellphones? I'd guess it couldn't really be just articulatory sounds. Otherwise everyone would end up with roughly the same voice over the telephone — which is really not a desirable outcome.
dr pepper said,

November 12, 2008 @ 7:04 pm

Sounds like a good system to test backward speech claims.
Marinus said,

November 12, 2008 @ 7:04 pm

@ Brandon:
Unfortunately, there is no immediate way of moving from a technology for producing something to a technology for perceiving something. This system is only a way of manipulating channels of air, and can't advance the 'listening' capabilities of any device at all. As for a study of the sounds themselves, we already have that independently of the model: the science of phonetics. If I were to guess from the material available, I'd say that they're making an artificial model of human speech so they can study that model and see if they can make something functionally similar without being a model of the human speech system, the way parrot mimicry is.

@ Don:
The compression would be a great deal more than that, unless I've missed something. What you say is entirely correct, of course, but a device can get away with encoding all of that data you describe into a series of mechanisms that are set off by the relevant codes. In the same way that you don't need to describe what a button does to press the button, you wouldn't need to describe all the movement along the 19 degrees of freedom, etc, in order to tell the machine to go /ʊ/. It has that data stored on it, you just activate it with the correct code. The process for making each sound is stored only once on the machine, and only a code to activate that process needs to be stored for each file of speech sounds. Accordingly, you would be able to compress speech sounds into something many orders of magnitude smaller than currently possible.
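
The idea of storing each gesture once and transmitting only its code can be sketched in Python. Everything specific here is hypothetical: the five-symbol inventory, the pose values, and the function names are invented for illustration, and a real articulatory synthesizer would also need timing, pitch, and coarticulation data that this toy ignores.

```python
# Hypothetical codebook: one byte per symbol goes over the wire, while the
# receiver holds the full articulator pose (19 values here, to match the
# robot's 19 degrees of freedom). The pose numbers are placeholders.
DOF = 19
INVENTORY = ["a", "i", "ʊ", "e", "o"]
CODEBOOK = {symbol: [(code * 7 + k) % 256 for k in range(DOF)]
            for code, symbol in enumerate(INVENTORY)}


def encode(symbols):
    """Sender side: map each symbol to its one-byte code."""
    return bytes(INVENTORY.index(s) for s in symbols)


def decode(payload):
    """Receiver side: expand each code into the stored pose."""
    return [CODEBOOK[INVENTORY[code]] for code in payload]


message = ["ʊ", "a", "i"]
payload = encode(message)
poses = decode(payload)
print(len(payload), "bytes sent")                      # 3 bytes sent
print(sum(len(p) for p in poses), "pose values used")  # 57 pose values used
```

At an assumed 10 to 15 symbols per second, one byte per symbol works out to roughly 80 to 120 bit/s, before any voice-quality information is added.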
David Eddyshaw said,

November 12, 2008 @ 7:25 pm

Seems analogous to midi files for music, as opposed to mp3s or whatever.
Don said,

November 12, 2008 @ 7:50 pm

@ Marinus:

There's lots of room for optimization, of course, but I was just trying to show that it wouldn't take much data to encode speech with a device like that, even before the optimizers went crazy with it.

I should mention that I'm not making any claims about how accurately this device could reproduce speech; I'm only estimating how much data it would take to move all those parts around.
Nathan Myers said,

November 12, 2008 @ 10:47 pm

@Don: For the record, the number you cited sounds to me like a hell of a lot of data. Phones already do way, way better than that.
Rick S said,

November 12, 2008 @ 10:56 pm

I don't see why people think "everyone would end up with roughly the same voice". 19 degrees of freedom (essentially, 19 variables defining the configuration of the vocal tract) sounds like it could be enough that a couple of them could be used for vocal cord tension ("pitch") and approximate harmonic complement ("timbre"), which seems like it would go a long way toward making a voice recognizable. If I'm not mistaken, most of the other characteristics we use to recognize someone's voice are continuous rather than instantaneous, so they would be encoded as modulations over the sequence of sound samples rather than as discrete variables within the individual samples.

Bear in mind that we have little trouble recognizing our friends over the telephone even now, even though its limited bandwidth does a very poor job of reproducing the source waveform. I'll bet that with digital compression, this new technology would make people's voices much more easily recognizable.
Rick S said,

November 12, 2008 @ 11:00 pm

Oops! That last statement should have read "digital encoding", not "digital compression"–though no doubt compression would allow even more fidelity.
Freddy Hill said,

November 12, 2008 @ 11:39 pm

@Nathan Myers:

"For the record, the number you cited sounds to me like a hell of a lot of data. Phones already do way, way better than that."

Really? The quote above that I reproduced from you consists of 124 bytes including spaces, and that is without compression. I imagine that IPA is less compressible, but not by much. If you know of any phone system that can convey that phonetic information (except for texting!), I'd really like to know. If one must add significant information to modify pitch, timbre, amplitude and pauses, then maybe we would approach the cost of compressed voice, but I think that the research is worthwhile.
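
The 124-byte figure is easy to verify, and the comparison with Don's position stream can be made concrete. In this Python sketch the 8-second speaking time is a guess, and the 45,600 bit/s figure is Don's rough estimate from earlier in the thread.

```python
quote = ("For the record, the number you cited sounds to me like a hell of "
         "a lot of data. Phones already do way, way better than that.")

text_bytes = len(quote.encode("ascii"))  # plain ASCII, one byte per character
speaking_seconds = 8                     # assumed, not measured
text_bps = text_bytes * 8 / speaking_seconds
stream_bps = 57 * 8 * 100                # Don's estimate: 57 bytes, 100 samples/s

print(text_bytes)                        # 124
print(text_bps)                          # 124.0
print(round(stream_bps / text_bps))      # 368
```

The ratio only compares bare text with raw articulator positions; it says nothing about how much extra data pitch, timbre, and timing would need.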

Additionally, I'd love to have a phone that actually mouthed words into my ear! It would take just a little more to have it sexily bite the lobe, or… Maybe there could be a set of password protected functions… Hmmm… But I digress.
GAC said,

November 13, 2008 @ 12:56 am

@Rick S.

Thanks for that. I recognize that voices already sound different over the phone. Maybe I was over-thinking this a bit in imagining a version of the apparatus in the video stuck in my phone (I know, that's kinda ridiculous). Anyway, if it actually led to higher fidelity then I'm all for it — especially if it helps me understand Mandarin Chinese over the phone (I don't know why but it's so much harder than in person).

BTW, did somebody else mention that? I'm not "people".
Paul said,

November 13, 2008 @ 5:18 am

Seems like a nice hi-tech version of von Kempelen's 18th century speaking machine. But, having observed the videos on the labs' web site, I'd have thought they'd have noticed that bilabial plosives involve closing the lips – they look and sound more like bilabial fricatives to me. The vowel identified as [u] sounds more like [ɯ] to me, though perhaps this machine is speaking Japanese phonetics rather than interpreting strictly the values of IPA symbols.
Anyhow, an impressive bit of kit. I know next to nothing about data compression, but I'd have thought we need to avoid the trap of thinking all we need in order to reproduce speech is a string of segments shown as bare IPA symbols with which we can program the machine. There's a lot more than that in the speech signal. Thinking of speech as a string of simple segments is sometimes an inconvenient fiction.
Oskar said,

November 13, 2008 @ 6:34 am

Regardless of how good a job of compressing human speech this method will do, it'll never, ever be used for anything real. Using this compression scheme, one would only be able to compress, or even encode, human speech, or sounds that a human can produce. And it wouldn't reproduce them exactly: the apparatus I use for speaking is very different from somebody else's. If I compressed my voice with this, the playback would (maybe) sound sort-of similar, but it wouldn't be my voice.

There are many, many fine audio codecs which are able to record any sound whatsoever and compress it quite efficiently to however small a size we want, depending on how much loss in quality we are willing to accept. And storage space is increasing exponentially (well, I'm not 100% sure about exponentially exactly, it's rising fast is what I mean) and so is transmission bandwidth. This is a bad solution to a problem that doesn't exist.
Ray Girvan said,

November 13, 2008 @ 12:04 pm

Pardon the snarkiness, but did anyone actually read the freakin' link on Linear Predictive Coding?

It's been doing for over three decades exactly what people here are speculating about: high-compression high-quality speech encoding based on a largely articulatory model. It models voiced components by treating the vocal tract as a tube driven by a buzzer, and unvoiced components as pops and hisses without bothering about what structures produce them. LPC encoding extracts the (fairly compact) parameters for this model from the audio speech, storing them as a succession of "frames"; decoding converts them back into audio. It's used in the GSM codecs for mobile phones.
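
A rough Python sketch of the analysis step may help here. It estimates the predictor coefficients of one frame with the autocorrelation method and the Levinson-Durbin recursion, which is a common textbook way to do LPC but not necessarily what any particular codec does. The test signal is a synthetic second-order filter driven by noise rather than a buzzer, and all function names are illustrative.

```python
import random


def autocorrelation(frame, max_lag):
    """Autocorrelation of the frame at lags 0..max_lag."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the LPC normal equations; returns (coefficients, residual energy).

    coefficients[k - 1] multiplies x[n - k] in the prediction of x[n].
    """
    a = [0.0] * (order + 1)
    error = r[0]
    for i in range(1, order + 1):
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / error
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        error *= 1.0 - k * k
    return a[1:], error


# Synthetic "vocal tract": x[n] = 1.3*x[n-1] - 0.6*x[n-2] + noise.
random.seed(0)
true_coeffs = [1.3, -0.6]
signal = [0.0, 0.0]
for _ in range(5000):
    signal.append(true_coeffs[0] * signal[-1] + true_coeffs[1] * signal[-2]
                  + random.gauss(0.0, 1.0))

estimated, residual = levinson_durbin(autocorrelation(signal, 2), 2)
print([round(c, 2) for c in estimated])  # should be close to [1.3, -0.6]
```

In a classical LPC vocoder the decoder then feeds a buzz or hiss through the filter defined by those coefficients to rebuild the audio, so only a handful of coefficients plus pitch and gain need to be sent per frame.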
Bloix said,

November 13, 2008 @ 12:26 pm

I know that guy. He's a lot more articulate when he's attached to his neck.
Tim Silverman said,

November 13, 2008 @ 12:35 pm

And as somebody upthread mentioned, if what we want to do is transmit purely verbal information, we already have texting and email. If we want to transmit sounds, the obvious solution is to use a system that encodes, compresses, and reproduces, like, sounds. And we already have such a system … it's called a telephone. (And how would conference calls work? And how … ? But it seems clear this has nothing to do with telephone calls, in reality.) It's not clear to me that this project serves any purpose whatever, except perhaps as a teaching tool.
Ray Girvan said,

November 13, 2008 @ 1:45 pm

"If we want to transmit sounds, the obvious solution is to use a system that encodes, compresses, and reproduces, like, sounds. And we already have such a system … it's called a telephone."

But the point is that if you only want to send speech, this kind of technigives radically higher compression.
Ray Girvan said,

November 13, 2008 @ 1:51 pm

Sorry, typo – pressed Return too soon.

But the point is that if you only want to send speech, this kind of encoding technique gives radically higher data compression for a given quality (with correspondingly reduced bandwidth demands) than techniques for generic sounds.
Tim Silverman said,

November 13, 2008 @ 6:35 pm

But I don't see what niche this would fill that isn't already covered by either sound transmission or text transmission. What value is there in transmitting, specifically, speech sounds, but not including either the sound of the specific speaker or any other sounds? What interesting content is there in this that wouldn't be present in the text? And how could that content be worth so much as to warrant the purchase of a complicated, heavy, delicate piece of equipment? And how would the articulatory features actually get extracted on the transmit side? It's all very well to be able to produce them, but you have to know what they are.

Theoretically, this sort of thing might help with speech synthesis from text, but again, it would seem on the face of it more efficient to go straight to sound, and that seems to be the consensus of researchers in the field, too.

Maybe I'm just being dense, but I'm just having difficulty imagining how this would work in a real application.
John said,

November 13, 2008 @ 7:07 pm

Anybody know which vowel sounds the robot is supposed to be making on this clip? The second seems to be close-mid back rounded (IPA [o]) and the third seems to be close front rounded (IPA [y]). The fourth — an open-mid central rounded vowel?? As for the first and the fifth — they seem to be varieties of open-mid or near-open back rounded vowels, or am I crazy?

Why did they pick such unusual vowel sounds to demonstrate their robot, I wonder?
John said,

November 13, 2008 @ 7:09 pm

^^ Sorry, I meant the *second* seems to be [y], the *third* seems to be [o].
Ray Girvan said,

November 13, 2008 @ 8:26 pm

"What value is there in transmitting, specifically, speech sounds, but not including either the sound of the specific speaker or any other sounds?"

I think we're talking at cross-purposes. I agree: the robot as it stands is pretty cumbersome. But

"And how would the articulatory features actually get extracted on the transmit side? It's all very well to be able to produce them, but you have to know what they are."

Almost certainly LPC. I don't pretend to understand the mathematics, but it's sketched out here. Standard LPC vocoders extract the "formants" (main resonant frequencies) and back-engineer those in terms of parameters for the human vocal tract model (which, because it's a universal model, isn't limited to any particular speaker). Presumably they're doing very similar analysis with the Waseda Talker, except using the known characteristics of its artificial vocal tract. (And their description of WT-4

"the trajectory of each robot parameter was controlled so that the acoustic parameters (pitch, sound power, formant frequencies that are resonant frequencies of the vocal tract and have the peak of the output spectrum, and the timing of the switch between voiced and voiceless sounds)"

is talking about the very parameters used in LPC).
Heidi Harley said,

November 13, 2008 @ 8:42 pm

Thanks for the interesting discussion, all — nifty thread! In answer to John's question about what the vowels are intended to be, the only clue I can offer is this: The name of the video clip on the Takanishi website is "wt7_aiueo.mpg", which I assume stands for "waseda talker 7 a-i-u-e-o". So it may be that those are supposed to be the cardinal vowels — the tongue profile is consistent with what I would expect for those vowels, anyway, I think — in which case they're a looooong way from having anything remotely workable.

http://www.takanishi.mech.waseda.ac.jp/research/voice/movie/wt7_aiueo.mpg
Tim Silverman said,

November 14, 2008 @ 6:40 am

@Ray—sorry, missed the link to LPC (twice! :-( …). I suppose what they're doing might be used to tweak LPC parameters, but I'm kind of sceptical. I think either the benefits they describe have been garbled in transmission from project plan to website, or they're doing the project for some other reason (e.g. it's a difficult but fun technical challenge … ) and the alleged application is just something they feel they have to say. (Kind of like the way all research in biochemistry is supposedly going to find a cure for cancer (or, in the case of the human genome project, all known diseases); or the way seventeenth century scientists always used to claim their research would promote true religion and thereby make everybody better and happier.) But maybe I'm just too cynical for my own good.
Anna Phor said,

November 17, 2008 @ 4:08 pm

But nobody's addressed the really burning question: Why did they build a robot that needs glasses?
