Language Log

Weird synthesis

March 2, 2010 @ 7:48 am · Filed by Mark Liberman under Awesomeness, Speech technology

I wouldn't have predicted that this would work as well as it does:

Sinewave synthesis (Robert Remez et al. "Speech perception without traditional speech cues", Science 1981) pointed the way, but this is pretty far down the road. Now I want to hear the string orchestra and brass band versions.

Of course, the apparent intelligibility is mostly a function of knowing what was "said". But still.

[Hat tip: Aengus.]

26 Comments

Richard Howland-Bolton said,

March 2, 2010 @ 8:15 am

Ahh! That takes me back to the days of Sparky's Magic Piano, though that sounded much better :-)

http://en.wikipedia.org/wiki/Sparky%27s_Magic_Piano
Richard Howland-Bolton said,

March 2, 2010 @ 8:17 am

Oh! Oh! Oh!!
Here it is!!
http://www.youtube.com/watch?v=s3etiNLAFi0
Jon Weinberg said,

March 2, 2010 @ 9:05 am

I had expected the link from "the apparent intelligibility is mostly a function of knowing what was 'said'" to point somewhere like this.
John Cowan said,

March 2, 2010 @ 10:59 am

After the word "politicians", I shut my eyes to see if I could understand the rest. Not a word. Without the subtitles, the pseudo-speech was completely unintelligible to me. I stopped when the child began to read the text out loud, and restarted the video with my eyes shut. Again, I could understand only what I had seen the subtitles for, so the effect is robust.
couk said,

March 2, 2010 @ 11:46 am

In the screenshot @1:52 it seems that the input signal is overdriven, which introduces extra noise. And I wonder what the composer means when he says that a "fairly high resolution" is only possible with a mechanical piano. Obviously the effect relies on an artificially low resolution, and an electronic instrument wouldn't be so limited by its nature. Unless that electronic instrument is a bunch of hard drives.
Faith said,

March 2, 2010 @ 11:50 am

I covered the screen before I started watching (since I'd been given the hint that there would be subtitles). I could understand "we declare" and "responsibility" and "world." I haven't watched it with the subtitles to check if those words are what the piano was actually trying to say.
majolo said,

March 2, 2010 @ 12:01 pm

I had a different experience from John Cowan's. I tried looking away after reading some subtitles, and the last phrase "protecting our mother earth" was pretty clear with no visual cues. In fact, I would rather say that the apparent intelligibility (for me) was mostly a function of knowing that something was being said, not what was being said.
Victor Mair said,

March 2, 2010 @ 2:27 pm

I kept my eyes closed throughout and could easily hear phrases like "we are responsible" and "we proclaim that another world is possible."
Will said,

March 2, 2010 @ 3:05 pm

I unfortunately watched the video before reading the comments, and I never covered the screen because I really expected them to do a segment without subtitles specifically for a demonstration of understandability, but they never did that. Oh well.
Spell Me Jeff said,

March 2, 2010 @ 3:08 pm

I think perhaps resolution is a misleading word to describe what is going on, though perhaps correct from a strictly psychoacoustic point of view. I dislike it because in most contexts it suggests a kind of precision. What a mechanical piano would introduce is quite the opposite: all kinds of harmonics and resonance, which would be especially noticeable in a live context. The absence of such effects is what makes early tone generators so annoying to listen to. But in this situation, it might be reasonable to suggest that the string resonance and the entire structure of a boxy, wooden piano substitute for the complex structure of human vocal apparatus, from diaphragm to vocal cords made of tissue to sinus resonance. Sophisticated samplers and tone generators can capture and/or reproduce most, but not all, of this. (You can always tell if it's live or Memorex.)

[(myl) The organic complexity of acoustic instruments is probably not relevant here, as the success of sinewave synthesis shows. I'm skeptical, frankly, that the outcome would be in general less intelligible or less speech-like if synthetic tone-generators were used instead of this electromechanical rig. And if there's any difference, it would have to do with things that would be easy to change in the synthetic version, like (say) excessive phase coherence or something.]
James C. said,

March 2, 2010 @ 3:34 pm

I believe that he meant resolution in the sense of having enough frequencies available to coarticulate. For a non-electronic instrument the only things which would provide a large frequency spectrum with the possibility of simultaneous frequencies being active are pianos or something like them, e.g. xylophone, marimba, harpsichord, harp. A piano has the added ability to automatically damp strings, and lots of resonance would have destroyed the effect.

I think if they’d chosen a speaker with a slightly lower range they might have been more successful. I note that the lower keys weren’t being used at all. Frankly they did a remarkable job at rendering the high frequency consonants like fricatives.
Mike Albaugh said,

March 2, 2010 @ 3:42 pm

Having done a fair bit of sound synthesis "back in the day" (70's and 80's), I heard the "only a piano…" comment as "There are very few mechanical instruments that can play up to 88 notes at once". Most of the synthesizers I could afford had far fewer (< 8) available "voices". If your goal is to use addition of (sorta) sine waves, you probably do need such independence.

[(myl) The electromechanical device here can play up to 88 notes at once (I guess), with good temporal precision. A human pianist wouldn't have the needed control either in note selection or in timing. Some "player pianos" might not either, I guess.

Synthesizers that need to keep up with real time may have a limit on how many notes they can simultaneously generate, though these days I don't think it would be too hard to find a processor capable of generating 88 simultaneous tones via (say) wavetable synthesis. But there's nothing about this discussion that limits us to real-time synthesis. And I bet that the same recipe, applied to create a digital audio file at whatever speed the software happens to work, would have about the same results as the mechanical system does.]
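A minimal sketch of the offline version of this idea, for illustration only: it is not the composer's method, and it stands in pure sine waves for piano strings. It assumes a standard 88-key equal-tempered tuning (key 49 = A4 = 440 Hz); the names `key_frequency`, `render_frame`, and `render`, the 16 kHz sample rate, and the 20 ms frame length are all arbitrary choices made here, and the per-key amplitudes would have to come from some analysis of a recorded voice that is not shown.

```python
import math

SAMPLE_RATE = 16000  # samples per second; an arbitrary choice for this sketch


def key_frequency(key):
    """Frequency in Hz of piano key 1..88 in equal temperament (key 49 = A4 = 440 Hz)."""
    return 440.0 * 2.0 ** ((key - 49) / 12.0)


def render_frame(amplitudes, duration=0.02, start=0.0):
    """Sum one sine wave per key, weighted by amplitudes[key - 1], for one short frame.

    Using absolute time `start + i / SAMPLE_RATE` keeps each key's phase
    continuous from one frame to the next.
    """
    n = int(round(duration * SAMPLE_RATE))
    out = []
    for i in range(n):
        t = start + i / SAMPLE_RATE
        out.append(sum(a * math.sin(2 * math.pi * key_frequency(k + 1) * t)
                       for k, a in enumerate(amplitudes) if a))
    return out


def render(frames, duration=0.02):
    """Concatenate frames; each frame is a list of 88 amplitudes (hypothetical input)."""
    samples = []
    for idx, amps in enumerate(frames):
        samples.extend(render_frame(amps, duration, start=idx * duration))
    return samples
```

Nothing here has to run in real time: all 88 tones are simply summed sample by sample, however long that takes, and the result could then be written to an audio file.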
Kenny Easwaran said,

March 2, 2010 @ 4:56 pm

I had assumed the "only a mechanical piano" meant that he'd never be able to get this sort of effect with a piano played by a human. (Although the fact that some of Conlon Nancarrow's etudes for player piano can be played now by some live instrumentalists suggests that maybe some day some human pianist might be able to do this with a simple phrase.)
Jarek Weckwerth said,

March 2, 2010 @ 5:39 pm

Yes, I agree with Kenny Easwaran. What he means is probably temporal resolution going well below (or beyond, if you like) traditional note lengths (half notes, quarter notes etc.), and the very precise synchronisation between the notes, unachievable for a human player. The visual illustration there is the scrolling midi-like control sheet (usually called the "piano roll" in sequencer applications).
David L said,

March 2, 2010 @ 7:10 pm

myl said: "And I bet that the same recipe, applied to create a digital audio file at whatever speed the software happens to work, would have about the same results as the mechanical system does."

But a major limitation of the particular mechanical transducer used here is that you have little if any control over the temporal profile of each note. You bang the key with the mechanical finger and the resulting sound has a certain loudness and duration. Whereas with a purely digital system (if that's what you mean; I'm not sure) you could add together "notes" with arbitrary intensity vs. time profiles. It would produce, I'd guess, a more accurate version of the voice. Wouldn't be nearly so cool, though.
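The contrast drawn above can be sketched roughly as follows. This is an illustration under loud assumptions, not a model of the actual rig: `struck_envelope` treats a struck note as an instant onset followed by exponential decay (a crude simplification, with a made-up `decay_rate`), while `breakpoint_envelope` lets a digital "note" follow any piecewise-linear intensity-versus-time profile. All names and default values are hypothetical.

```python
import math


def struck_envelope(t, loudness=1.0, decay_rate=6.0):
    """Crude stand-in for a struck string: instant onset, then exponential decay."""
    return loudness * math.exp(-decay_rate * t) if t >= 0 else 0.0


def breakpoint_envelope(t, points):
    """Piecewise-linear amplitude through (time, amplitude) points sorted by time."""
    if t <= points[0][0]:
        return points[0][1]
    for (t0, a0), (t1, a1) in zip(points, points[1:]):
        if t <= t1:
            return a0 + (a1 - a0) * (t - t0) / (t1 - t0)
    return points[-1][1]


def tone(freq, envelope, duration, sample_rate=16000):
    """Sine wave at `freq` Hz shaped by `envelope`, a function of time in seconds."""
    n = int(round(duration * sample_rate))
    return [envelope(i / sample_rate) * math.sin(2 * math.pi * freq * i / sample_rate)
            for i in range(n)]


# A struck note can only vary in loudness; a synthetic one can swell and fade at will.
struck = tone(440.0, struck_envelope, 0.2)
shaped = tone(440.0, lambda t: breakpoint_envelope(t, [(0.0, 0.0), (0.1, 1.0), (0.2, 0.0)]), 0.2)
```

With the fixed profile, the only per-note choices are which key to hit, when, and how hard; the breakpoint version adds the free-form profile the comment describes.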
Dan Lufkin said,

March 2, 2010 @ 10:19 pm

Am I the only one here who remembers the Bell Labs Voder at the 1939 NY World's Fair? (110,000 hits) That had a keyboard with an operator who could play requests at the demo. It made a lasting impression on me.

Of course that was parametric rather than synthetic. Didn't A.G. Bell have a crude version?
Mark F. said,

March 3, 2010 @ 12:18 am

I wish there were a video that consisted only of the recitation by the piano. Just as I think I'm starting to pick out the words, the narration cuts in and I lose it.
Graeme said,

March 3, 2010 @ 8:54 am

Amusing.

But all I could think was how eerily tinny the 'voice' was; and how incongruous both the timbre (or lack of it) and the whole project (cheap computer synthesising something as natural as the human voice) were, given the environmental message.

At least Kraftwerk focused on pocket calculators, trains and showroom dummies.
Aviatrix said,

March 3, 2010 @ 5:16 pm

I suspect that the effect is similar to listening to a heavily accented speaker. Some people find him unintelligible, and others can pick out the words. I imagine that after listening to this for a while it wouldn't be much more impenetrable than any other accent. After all, isn't that what's going on here? The "speaker" is unable to form the English phonemes normally and is approximating them with best match sounds from its own "language."
Frans said,

March 3, 2010 @ 5:53 pm

"I unfortunately watched the video before reading the comments, and I never covered the screen because I really expected them to do a segment without subtitles specifically for a demonstration of understandability, but they never did that. Oh well."

Same here.
(b)logophile › Voice of the robot revolution said,

March 4, 2010 @ 6:48 am

[…] [via] This was written by tikitu. Posted on Thursday, March 4, 2010, at 12:45 pm. Tagged awe, delight, language, perception. Bookmark the permalink. Follow comments here with the RSS feed. Post a comment or leave a trackback. […]
Dmajor said,

March 5, 2010 @ 3:11 am

I've noticed a similar effect while printing pages on my Epson cx5400 printer. Sometimes the sound seems like a short phrase just on the edge of intelligible speech. Although why the printer should repeat, say, "Bob Mulroney, Bob Mulroney" or "macaroni hat, macaroni hat" I don't know.
цarьchitect said,

March 6, 2010 @ 6:35 pm

Musicians have been using the human voice as an instrument for a while now; it's about time someone tried to create a voice from an instrument.

Also, this reminds me of the harmonic telegraph.

@ Kenny Easwaran – I thought the human-powered Nancarrow pieces were arrangements and not the pieces themselves.
Interesting Stuff: Early March 2010 « The Outer Hoard said,

March 7, 2010 @ 9:08 am

[…] Via Language Log, a talking piano. […]
Dennis Des Chene said,

March 31, 2010 @ 8:35 pm

I'm not surprised. Hold down the damper pedal and shout a vowel into the strings of a piano. You will hear the timbre of the vowel, faintly, resonating in the strings. Notice that the "synthesizer" is much better with vowels than with consonants. You can’t really do noise very well.
The piano, it talks! « CJ Record, The Vermont Version said,

November 23, 2012 @ 9:55 pm

[…] Language Log: Weird Synthesis. From a German news broadcast that someone posted on YouTube. […]
