Under the subject line "Things you never thought you'd get to say", Bob Ladd sent me this note yesterday:
You are among the few people I know who will appreciate this anecdote:
It's been unusually cool, wet, and windy in many parts of the Mediterranean this summer, including our part of Sardinia. On our last full day there last week, our local beach was still unpleasantly rough and windy, so we decided to go to a place called La Licciola about 10 miles away, on the other side of the headland and therefore protected from the wind. The last time we went there a couple of years ago, the final access was a long downhill stretch of dirt road with what amounted to a field to park in at the bottom. It was fairly chaotic in a typically Italian way, with people managing to park along the edges of the dirt road when the field got full, but with everyone always leaving just enough room to get through. Anyway, the other day we got to the top of the downhill road to discover that it has been properly paved, with an actual sidewalk along one side and no-parking signs on the other (though everyone was parking there anyway). The parking field has been improved with clearly delineated spaces and there was a chain across the entrance because it was already full. People were having a hard time turning around because the sidewalk has narrowed the driveable part of the downhill road, and new people kept coming in at the top of the hill looking for a space to park, creating more chaos. We decided to give up and go somewhere else, but it took us the better part of 15 minutes to extract ourselves from the mess. It was only on the way back out to the main road that it occurred to me that, in trying to improve things, they had managed to, well, wreck a nice beach.
It was my misfortune to be sharing the car with someone who wouldn't have understood why I was giggling.
In the interests of increasing technological literacy, I'll fill our readers in on the background of Bob's chortles. Much of this history is unknown even to (most of) those in the speech technology field who are quite familiar with the clichéd homophony "recognize speech" ≅ "wreck a nice beach".
The story starts with a passage from Aleksandr Solzhenitsyn's autobiographical novel In The First Circle, which was first published in an abridged English translation in 1968 (though the quote below is from a more recent translation of the original version by Harry T. Willetts, 2009). Tiny but relevant glimpses of this book can be found in "The world in a grain of sand", 9/29/2008, and "Speech-based lie detection in Russia", 6/8/2011. But if you haven't read (In) The First Circle, you should definitely put it on your reading list!
The context is the Marfino sharashka in the late 1940s. Sharashkas were a uniquely Soviet combination of research laboratory and prison, where political prisoners were made to work on projects of interest to the authorities.
JUST AS ORDINARY SOLDIERS KNOW, without being shown the battle orders from headquarters, whether they are part of the main offensive or a supporting action, so the three hundred zeks in the Marfino sharashka had correctly deduced that Number Seven was the decisive sector.
No one was supposed to know Number Seven’s real name, but everybody in the institute did. It was the “Clipped Speech Laboratory.” “Clipped” was an English word. Not only the engineers and translators in the institute but the fitters, the turners, the grinders, perhaps even the deaf and slow-witted carpenter, knew that the original models for the installation were American, though officially they were “ours.” This was why American journals with diagrams and theoretical articles about clipping, which were on sale on newsstands in New York, were here given serial numbers, stapled, classified, and, to frustrate American spies, sealed up in fireproof safes.
Clipping, damping, amplitude compression, electronic differentiation, and integration of random speech—it produced an engineer’s parody of human speech. It was as if someone had taken it into his head to dismantle New Athos or Gurzuf, put the material in little cubes in matchboxes, mix them all up, fly the lot to Nerchinsk, sort them out, and reassemble them precisely as they were before, reproducing the subtropics, the sound of the surf, the southern air, and moonlight.
This was just what had to be done with the speech reduced to little packages of electrical impulses, and what is more, it had to be reproduced not only so that it could be understood but so that the Boss could recognize the voice at the other end.
In this context, I believe that "Clipped Speech Laboratory" might better be translated for modern readers as "Digital Speech Laboratory", for reasons that are suggested by this figure from Manfred Schroeder's 2004 book Computer Speech: Recognition, Compression, Synthesis:
Some vocoders of the First Circle era used a filter bank to decompose the input into a set of bandpass signals, each of which could be turned by "infinite peak clipping" into a digital signal, preserving only two (positive vs. negative) of the original signal values. The resulting digital filter bank outputs — each a binary stream — can then in principle be reconstituted and recombined with relatively good fidelity, thanks to underlying mathematics proved as a theorem in 1977 by Ben "Tex" Logan. "Electronic differentiation and integration" comes into play because differentiation (= high frequency emphasis), infinite peak clipping, and integration (= low frequency emphasis) can produce a better result than infinite peak clipping alone. And I suspect that the stuff about "little cubes in matchboxes" probably refers to encryption techniques permuting the order of the frequency bands in different time frames.
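To make the signal chain concrete, here is a minimal sketch in Python of what happens to a single band. It is an illustration under stated assumptions, not a reconstruction of any historical vocoder: the test signal, the sample rate, the first-difference "differentiator", and the leaky-sum "integrator" (with its `leak` parameter) are all hypothetical stand-ins, and the filter bank and the reconstruction step are left out entirely.

```python
import math

def infinite_peak_clip(signal):
    """Keep only the sign of each sample: +1 if non-negative, -1 if negative."""
    return [1 if x >= 0 else -1 for x in signal]

def differentiate(signal):
    """First difference: a crude discrete stand-in for differentiation
    (high-frequency emphasis)."""
    return [signal[0]] + [signal[i] - signal[i - 1] for i in range(1, len(signal))]

def integrate(signal, leak=0.95):
    """Leaky running sum: a crude discrete stand-in for integration
    (low-frequency emphasis). The leak value is an arbitrary assumption."""
    out, acc = [], 0.0
    for x in signal:
        acc = leak * acc + x
        out.append(acc)
    return out

def zero_crossings(signal):
    """Indices i where the sign changes between sample i-1 and sample i."""
    signs = infinite_peak_clip(signal)
    return [i for i in range(1, len(signs)) if signs[i] != signs[i - 1]]

# Hypothetical test signal: two sinusoids standing in for one band of speech.
rate = 8000
signal = [
    math.sin(2 * math.pi * 300 * n / rate) + 0.5 * math.sin(2 * math.pi * 1100 * n / rate)
    for n in range(400)
]

# Infinite peak clipping alone: a two-valued (binary) stream.
clipped = infinite_peak_clip(signal)

# Differentiate, clip, then integrate, as in the passage quoted from the novel.
processed = integrate(infinite_peak_clip(differentiate(signal)))
```

The clipped stream takes only the values +1 and -1, yet its sign changes fall at exactly the same sample indices as those of the input. That is the information the binary stream retains, and it is why a result about zero crossings of bandpass signals is relevant to how much of the original can be recovered.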
The next step in the beach-wrecking saga involves my former colleague Manfred Schroeder. I'll let Dave Tompkins tell the story, in a passage from his 2010 book How to Wreck a Nice Beach: The Vocoder from World War II to Hip-Hop, The Machine Speaks:
During the Big Bug Fifties, when movies depicted Communists as giant ants, the Kremlin denounced any Soviet praise of American teknik. By then the United States had fallen suspect to another paranoid Joe, this one a senator from Wisconsin. One of Senator McCarthy’s more ardent subscribers was Homer Dudley, inventor of the vocoder. Dudley’s protégé, Manfred Schroeder, learned this when he was hired by Bell Labs after immigrating to New York from Germany in 1954. “There were two things Homer Dudley liked to talk about,” says Schroeder. “The Communists and the vocoder. I didn’t have the words at my fingertips in those days, but today I would call him a right-wing nut. He thought the State Department was infested by Communists.”
Manfred Schroeder served on a German artillery target acquisition unit during the war, identifying blips in the fog. At times, Russian POWs manned the guns in exchange for food. “They were Communists; they were nice people,” he says. “When Sputnik went up in 1957, Dudley came to my desk with the following idea. He said the Russians could not put up a satellite like that and the beep-beep-beep that people heard around the world, coming from Sputnik, was just an electronic fakery.”
Working with Dudley in the acoustics department, Schroeder would consult The First Circle while developing his own voice-excited vocoder— the first of these machines to actually sound human. Demonstrating for his associates, Schroeder assumed that his vocoder could be understood, only because he’d been listening to it all day, the same pratfall that occurs in The First Circle. Struggling between intelligibility and just hearing things, he noted its annoying habit of turning a phrase. “How to recognize speech” sounded like “How to wreck a nice beach.”
“People will go to any length (and width) to be unintelligible,” wrote Schroeder in his book Computer Speech: Recognition, Compression, and Synthesis. So much for the Language of Maximum Clarity.
In The First Circle, Solzhenitsyn compared speech encoding to disassembling a beach and then re-synthesizing it at another location— essentially transposing a summer getaway as if it were a Soviet munitions factory on the run. He called it “an engineering desecration,” the equivalent of pulverizing a southern resort into grits, sticking them into a billion matchboxes, shaking them up and then flying them to a different sector for reconstruction. “A re-creation of the subtropics, the sound of the waves on the shore, the southern air and moonlight.”
The sand in your shorts, the bad radio reception, the copper tonality, the jellyfish parachute squishing between your toes, the effervescent fizz of unvoiced surf. The burning red sun. For the zeks at Marfino the vocoder could make getaways out of sentences, if only inside their heads. A gulag prison term, an imagined escape. The last re-sort, a desperate scramble. As if Solzhenitsyn had burst from his lab table in a flock of schemata, his beard tangled with headphones, denouncing the artificial beach. Somebody had to say something.
In passing, I should note that the "pratfall" — synthetic speech that sounds fine to its creator but is unintelligible to others — is generally not caused by "listening to it all day", but rather by knowing in advance what it's supposed to be saying, which brings to bear the top-down perceptual effects that in extremis produce the Phoneme Restoration Effect. My personal favorite example of this phenomenon is described in "The dogs of speech technology", 3/1/2005.
I can vouch for Dave Tompkins' suggestion that Manfred saw The First Circle as a text with strong personal as well as technical resonances, and his "recognize speech" → "wreck a nice beach" example is without doubt one of the echoes.
By 1980 if not earlier, Manfred's phrase had become part of the culture of speech technology, used in dozens if not hundreds of presentations and papers. Thus J.S. Bridle et al., "Continuous connected word recognition using whole word templates", Radio and Electronic Engineer, 1983:
Furthermore, there can be very small acoustic differences between some word sequences, so that even humans have to rely on the context to deduce the identity of the words (eg 'recognize speech' and 'wreck a nice beach').
Or J. Picone et al., "Automatic text alignment for speech system evaluation", IEEE Transactions on Acoustics, Speech and Signal Processing, 1986:
The basic problem associated with text alignment is the definition of a meaningful distance metric between two text units, such as words or phonemes, such that the degree of similarity between the two strings can be maximized. Any similarity measure used in an automated scoring algorithm must be a perceptually based measure. It is important that the output of the algorithm accurately interpret the listeners’ impressions of the stimulus data. For instance, two homophones differ in spelling, yet are identical phonemically (for example, “scent” and “cent”). Puns play on the similarity of sounds in words while having radically different spellings and meanings (“to wreck a nice beach” and “to recognize speech”). It is not clear how text strings can be aligned using only the raw text. Our approach is to accommodate these problems by performing matching at the phoneme level. Phoneme-to-phoneme distances can then be computed in a perceptually meaningful way based on experimentally derived phoneme-to-phoneme distances which have been collected through various listening experiments.
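The idea of matching at the phoneme level can be sketched as a weighted edit distance over phoneme sequences. What follows is a minimal illustration, not the algorithm from the paper quoted above: the ARPAbet-style transcriptions are my own rough assumptions, and the `voicing_aware_cost` function with its 0.5 cost for voiced/voiceless pairs is a hypothetical stand-in for the experimentally derived phoneme-to-phoneme distances that the authors describe.

```python
def phoneme_distance(a, b, sub_cost, indel_cost=1.0):
    """Weighted edit distance between two phoneme sequences (dynamic programming)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = i * indel_cost
    for j in range(1, cols):
        d[0][j] = j * indel_cost
    for i in range(1, rows):
        for j in range(1, cols):
            d[i][j] = min(
                d[i - 1][j] + indel_cost,                       # delete a[i-1]
                d[i][j - 1] + indel_cost,                       # insert b[j-1]
                d[i - 1][j - 1] + sub_cost(a[i - 1], b[j - 1]),  # substitute or match
            )
    return d[-1][-1]

def unit_cost(p, q):
    """Every mismatch costs the same."""
    return 0.0 if p == q else 1.0

# Hypothetical similarity table: pairs that differ only in voicing are "close".
VOICING_PAIRS = {frozenset(("P", "B")), frozenset(("S", "Z"))}

def voicing_aware_cost(p, q):
    """Cheaper substitution for voiced/voiceless pairs (an illustrative assumption)."""
    if p == q:
        return 0.0
    return 0.5 if frozenset((p, q)) in VOICING_PAIRS else 1.0

# Assumed rough transcriptions, for illustration only.
recognize_speech = ["R", "EH", "K", "AH", "G", "N", "AY", "Z", "S", "P", "IY", "CH"]
wreck_a_nice_beach = ["R", "EH", "K", "AH", "N", "AY", "S", "B", "IY", "CH"]

plain = phoneme_distance(recognize_speech, wreck_a_nice_beach, unit_cost)
weighted = phoneme_distance(recognize_speech, wreck_a_nice_beach, voicing_aware_cost)
print(plain, weighted)  # 3.0 2.5
```

Under these assumptions the two phrases are only a few edits apart at the phoneme level, and treating P/B as a near-match brings them closer still, which is the sense in which a perceptually based measure captures a pun that raw spelling does not.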
Of course, the idea of "homophonic translation" is much older — an important reference is Howard L. Chace, Anguish Languish, 1956. But the "wreck a nice beach" example came into the field because Solzhenitsyn's metaphor inspired Schroeder's fragment of poetic homophony.