"Speech synthesis"

« previous post | next post »

Ordinary language and technical terminology often diverge. We've covered the "passive voice" case at length. I don't think we've discussed the fact that for botanists, cucumbers and tomatoes are berries but strawberries and raspberries aren't — but there are many examples of such terminological divergence in fields outside of linguistics. However, the technical terminology is itself sometimes vague or ambiguous in ways that lead to confusion among outsiders, and today I want to explore one case of this kind: "speech synthesis".

Andrew Liszewski, "My Favorite Childhood Gadget of the '80s, the Speak & Spell, Is Back", Gizmodo 2/18/2019:

By today’s standards, the Speak & Spell is beyond primitive, but when introduced by Texas Instruments at CES in 1978, it was one of the first handheld devices to incorporate an electronic display, expansion cartridges, and a speech synthesis engine that could say and spell over 200 words. It even ran on one of the first microprocessors, the TMS1000, which was a power hog that would quickly drain the toy’s four C-sized batteries.

The Speak & Spell’s computerized voice was its most impressive feature, and it was so fascinating to me as a kid that I can still clearly hear its raspy, slightly incomprehensible pronunciations in my head; when I’d be hard-pressed to remember the voices of any of my childhood friends. […]

Where the new Speak & Spell differs from the original—and this could be a deal-breaker for some nostalgia-seekers—is its voice. Instead of using a synthesizer that generates spoken words from a bunch of coded instructions, Basic Fun!’s Speak & Spell uses voice recordings that have been processed to sound like they’re being generated by a computer. The monotonous, stilted delivery sounds very close to the original version, but it’s definitely different.

The highlighted passage is somewhere between confused and false. The original 1978 TI Speak & Spell used naturally-spoken speech that was compressed via LPC ("linear predictive coding") so as to fit on a then-available inexpensive solid-state memory device, and then decompressed for playback using TI's then-new LPC chip. So it's true that it "generat(ed) spoken words from a bunch of coded instructions". But so (I assume) does the modern imitation. And so does your cell phone, and your .mp3 or .aac-encoded podcasts, and your Audible audiobooks, etc.

It's true that this process of reconstituting compressed or "encoded" speech is commonly called "(re-)synthesis". But the reason that the original Speak & Spell produced "raspy, slightly incomprehensible pronunciations" is partly that it used extreme compression — about 1200 bits per second, as opposed to 64000 or 128000 bps for typical .mp3 audio, or 4750 to 12200 bps for GSM cellular voice transmission — and partly that it used an early-generation encoding algorithm.
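
To put those compression ratios in concrete terms, here is a back-of-the-envelope check. The bit rates are the ones quoted above; the average spoken-word duration is my own assumed figure, for illustration only:

```python
# Back-of-the-envelope storage arithmetic for the bit rates quoted above.
# SECONDS_PER_WORD is an assumed figure, not a documented one.
LPC_BPS = 1200        # Speak & Spell's LPC stream, roughly
MP3_BPS = 64000       # low end of typical .mp3 audio
SECONDS_PER_WORD = 0.6
WORDS = 200

lpc_bits = WORDS * SECONDS_PER_WORD * LPC_BPS
mp3_bits = WORDS * SECONDS_PER_WORD * MP3_BPS

# At ~1200 bps, 200 words fit in the toy's pair of 128 Kbit ROMs
# (256 Kbit = 262,144 bits); at .mp3 rates they come nowhere close.
print(f"LPC: {lpc_bits / 1024:.0f} Kbit, mp3: {mp3_bits / 1024:.0f} Kbit")
```

Even with generous assumptions, the .mp3 version of the vocabulary would need more than fifty times the storage that the LPC stream did.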

The Wikipedia entry gives more details but is also misleading:

The Speak & Spell used the first single-chip voice synthesizer, the TMC0280, later called the TI TMS5100, which utilized a 10th-order linear predictive coding (LPC) model by using pipelined electronic DSP logic. A variant of this chip with a very similar voice would eventually be utilized in certain Chrysler vehicles in the 1980s as the Electronic Voice Alert.

Speech synthesis data (phoneme data) for the spoken words were stored on a pair of 128 Kbit metal gate PMOS ROMs.

As the Wikipedia article itself goes on to explain, it's not "phoneme data" that was stored — which would generally have been the case if the system had used a text-to-speech algorithm — but rather the time functions of linear prediction parameters, f0, amplitude, and so on, derived from human recordings of the specific words to be spoken:

The technique used to create the words was to have a professional speaker speak the words. The utterances were captured and processed. Originally all of the recording and processing was completed in Dallas. By 1982 when the British, French, Italian and German versions were being developed, the original voices were recorded in the TI facility near Nice in France and these full bit rate digital recordings were sent to Dallas for processing using a minicomputer.[35] Some weeks later the processed data was returned and required significant hand editing to fix the voicing errors which had occurred during the process. The data rate was so radically cut that all of the words needed some editing. In some cases this was fairly simple, but some words were unintelligible and required days of work and others had to be completely scrapped. The stored data were for the specific words and phrases used in the Speak & Spell. The data rate was about 1,000 bits per second.

For some background on LPC, see Bishnu Atal, "The History of Linear Prediction", IEEE Signal Processing Magazine 3/2006. Bishnu uses the older and less confusing term "vocoder" (= Voice Coder) to refer to the process of compressing and reconstituting speech — but he also writes about "LPC analysis and resynthesis":

LPC rapidly became a very popular topic in speech research. A large number of people contributed valuable ideas for the application of the basic theory of linear prediction to speech analysis and synthesis. The excitement was evident at practically every technical meeting. Research on LPC vocoders gained momentum partly due to increased funding from the U.S. government and its selection for the 2.4 kb/s secure-voice standard LPC10. LPC required a lot of computations when it started being applied to speech. Fortunately, computer technology was rapidly evolving. By 1973, the first compact real-time LPC vocoder had been implemented at Philco-Ford. In 1978, Texas Instruments introduced a popular LPC-based toy that was called “Speak and Spell.”

And Bishnu's seminal 1971 paper had the title "Speech analysis and synthesis by linear prediction of the speech wave" — and uses the word "synthesizer" to describe the subsystem for reconstituting the speech signal:

Abstract: We describe a procedure for efficient encoding of the speech wave by representing it in terms of time‐varying parameters related to the transfer function of the vocal tract and the characteristics of the excitation. The speech wave, sampled at 10 kHz, is analyzed by predicting the present speech sample as a linear combination of the 12 previous samples. The 12 predictor coefficients are determined by minimizing the mean‐squared error between the actual and the predicted values of the speech samples. Fifteen parameters—namely, the 12 predictor coefficients, the pitch period, a binary parameter indicating whether the speech is voiced or unvoiced, and the rms value of the speech samples—are derived by analysis of the speech wave, encoded and transmitted to the synthesizer. The speech wave is synthesized as the output of a linear recursive filter excited by either a sequence of quasiperiodic pulses or a white‐noise source. Application of this method for efficient transmission and storage of speech signals as well as procedures for determining other speech characteristics, such as formant frequencies and bandwidths, the spectral envelope, and the autocorrelation function, are discussed.
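
The analysis-synthesis loop the abstract describes can be sketched in a few lines of numpy: predictor coefficients chosen to minimize mean-squared prediction error, and an all-pole synthesis filter driven by an excitation signal. This is an illustrative sketch only, not TI's implementation — a real LPC coder adds quantization, pitch extraction, and the voiced/unvoiced decision, none of which is modeled here:

```python
import numpy as np

# Illustrative sketch of LPC analysis and synthesis: each sample is
# predicted as a linear combination of the previous p samples, with
# coefficients chosen to minimize mean-squared prediction error.

def lpc_coefficients(x, p):
    """Least-squares fit of a[1..p] in x[n] ~ sum_k a[k] * x[n-k]."""
    n = len(x)
    # Row for sample n holds the p preceding samples, most recent first.
    X = np.column_stack([x[p - k - 1 : n - k - 1] for k in range(p)])
    a, *_ = np.linalg.lstsq(X, x[p:], rcond=None)
    return a

def lpc_synthesize(a, excitation):
    """All-pole synthesis filter driven by an excitation signal
    (quasiperiodic pulses for voiced speech, white noise for unvoiced)."""
    p = len(a)
    out = np.zeros(len(excitation))
    for n in range(len(excitation)):
        past = out[max(0, n - p):n][::-1]  # previous outputs, newest first
        out[n] = excitation[n] + np.dot(a[:len(past)], past)
    return out
```

Fitting this to a signal generated by a known low-order recursion recovers that recursion's coefficients, and driving the synthesis filter with the same excitation reconstitutes the signal — which is all an LPC vocoder transmits: the coefficients plus a compact description of the excitation.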

In talking about food or fibers or drugs, we generally don't use the words synthetic, synthesis etc. to describe things created by processing natural substances. Synthetic fabrics are derived from petroleum or whatever, not by processing plant materials. But in the case of speech, synthesis is used to describe reconstituting a stream of natural speech (or other audio) that has been processed for more efficient transmission, as well as to describe the creation of entirely-synthetic audio as in text-to-speech applications. (Though in fairness, TTS these days mostly works by combination and modification of fragments of human speech from a large collection of naturally-spoken audio, or by a fuzzier "deep learning" creation of audio time-functions by analogy to a large body of natural training speech — this is something like fibers created by extraction and polymerization of molecular fragments from an organic source.)

22 Comments

  1. Amanda Adams said,

    February 16, 2019 @ 10:14 am

    We can still make each other laugh – the whole family – by saying "Wrong, try again" in the S&S's voice…

  2. David Cantor said,

    February 16, 2019 @ 11:10 am

    The question of synthetic fabrics is an interesting one. The starting material for Rayon is wood, which is then chemically treated. Wikipedia notes:

    Although rayon is manufactured from naturally occurring polymers, it is not considered to be synthetic. Technically, the term synthetic fiber is reserved for fully synthetic fibers. In manufacturing terms, rayon is classified as "a fiber formed by regenerating natural materials into a usable form".

  3. ===Dan said,

    February 16, 2019 @ 11:45 am

    I had the plug-in speech synthesizer for the TI 99/4A computer in the early 80s. My memory is hazy, but I think I played with text-to-speech using BASIC. I also seem to remember not liking the way some words sounded, and making different decisions somehow.

  4. Stephen Hart said,

    February 16, 2019 @ 1:07 pm

    "Speech synthesis data (phoneme data) for the spoken words were stored on a pair of 128 Kbit metal gate PMOS ROMs."

    Given that this is Wikipedia, which anyone can edit, it's possible the parenthetical phrase was added by someone. It can be edited out as well.

  5. Philip Taylor said,

    February 16, 2019 @ 1:42 pm

    "Phoneme data" has been there since 4th July 2007, possibly with breaks.
    https://en.wikipedia.org/w/index.php?title=Speak_%26_Spell_%28toy%29&oldid=142521561

  6. AntC said,

    February 16, 2019 @ 5:46 pm

    Was anybody thrown by the opening phrase of the Liszewski quote?

    By today’s standards, the Speak & Spell is beyond primitive, …

    "beyond" meaning 'less sophisticated than' in a comparison of levels of technology? I find it something of a mis-negation or mis-scaling of a continuum. At my first reading, I thought the "beyond" would mean 'more sophisticated than'.

    I suppose it means: from today's technology look back to primitive technologies; then look in time before that to the S&S. I remember technology before 1978 (in research labs rather than the consumer market), which was even less capable. So when was this putative 'primitive' phase that 1978 was before?

  7. JB said,

    February 16, 2019 @ 7:18 pm

    I was inevitably reminded of this song by Flight of the Conchords: https://www.youtube.com/watch?v=2IPAOxrH7Ro

  8. Stephen said,

    February 16, 2019 @ 11:57 pm

    The word “beyond” these days (and for a while now) is often used as an intensifier. “Are you sad?” “I’m beyond sad” = “it’s even worse than that.” Think “more extreme” rather than “forward.”

  9. Gregory Kusnick said,

    February 16, 2019 @ 11:59 pm

    What does "fully synthetic" even mean? The organic molecules in petroleum are (unless you're a Tom Gold partisan) ultimately of biological origin.

  10. Philip Taylor said,

    February 17, 2019 @ 4:35 am

    Ant C — for me, "beyond" is relative to the direction of travel, so I have no problem with something developed in 1978 being described as "beyond primitive by today's standards" (which is just a re-ordering of the original text), and for me this clarifies that in this context, "primitive" refers to "by today's standards", not by standards that obtained at some earlier point in time.

  11. Kristian said,

    February 17, 2019 @ 5:38 am

    I think the word "synthesis" is confusing in itself. Technically I suppose it is the opposite of analysis and it has a technical sense in chemistry and philosophy, and other fields. But in ordinary speech it just means "artificial", especially when something artificial is contrasted with a "natural" alternative (like synthetic fibers vs natural fibers).

  12. shubert said,

    February 17, 2019 @ 9:19 am

    The word synonym's second definition removed my doubt why those "synonyms" make sense.
    Most of the si/sy- words have secret Hanzi connection.

  13. shubert said,

    February 17, 2019 @ 9:28 am

    My last comment about the use of "alternative" might be improper after checking the full definitions of "synonym".

  14. MikeA said,

    February 17, 2019 @ 1:26 pm

    By odd coincidence, Bruce Schneier's blog dated the 15th

    https://www.schneier.com/blog/archives/2019/02/reconstructing_.html

    had pointers to some articles on SIGSALY (and comments apparently from an author), a "voice scrambler" that apparently started out with something like a VOCODER, VODER pair, but with encryption and various other "secret sauce" in between. Mid 40s.

    [(myl) Those are great links. For some other relevant background, see "Bletchley Park in the lateral interparietal corpus", 1/9/2004; "The world in a grain of sand", 1/29/2008; "Wrecking a nice beach", 8/5/2014; "Bishnu Atal", 9/2/2018.
    Also read The First Circle…]

  15. MonkeyBoy said,

    February 17, 2019 @ 3:26 pm

    "Although rayon is manufactured from naturally occurring polymers, it is not considered to be synthetic. Technically, the term synthetic fiber is reserved for fully synthetic fibers. In manufacturing terms, rayon is classified as "a fiber formed by regenerating natural materials into a usable form".

which led to the issue where socks made from rayon made from bamboo were prohibited from just being labeled as "made from bamboo"

    https://en.wikipedia.org/wiki/Rayon#Mislabelling

  16. Trogluddite said,

    February 17, 2019 @ 3:51 pm

    A quick Google search found several other articles about the same product, all of which contain strikingly similar assertions that a 'synthesiser' has been replaced by recordings modified to "sound like" the original. It seems likely, then, that the "somewhere between confused and false" explanation has come from the manufacturer's own press release, and that the journalists who copied it considered "gizmo that makes robot voices" a perfectly reasonable gloss.

    [Mark Liberman]: "Bishnu uses the older and less confusion [sic!] term 'vocoder'"
    Indeed, the meaning is much less ambiguous; it means "gizmo that makes robot voices", as any serious fan of Kraftwerk, Daft Punk, or any similar electronic music surely knows! ;-)

  17. AntC said,

    February 18, 2019 @ 5:23 am

Apparently it's now all a solved problem: not only synthesising speech but also recognising speech and translating the meaning: artificial speech translation is upon us.

    [background] "Noise, Alex Waibel tells me, is one of the major challenges that artificial speech translation has to meet."

    Oh really? So not speech production after translation? Not the translation itself?

Excuse me while I pass a few phrases through Google translate. 'is the era of artificial speech translation upon us' to Chinese (traditional) and back gives 'Is the era of artificial language translation'. Trying a few European languages, that 'upon us' is obviously too colloquial. (Prepositions are tricky. I wonder if speech production gets the cadence right?)

  18. James Wimberley said,

    February 18, 2019 @ 7:30 am

    Kristian: to add to the confusion, "synthetic" is undergoing a born-again conversion in renewable energy. "Natural" gas is the bad fossil stuff, while synthetic gas or syngas is made from hydrogen split from water by catalysis using clean wind and solar electricity. Four legs good!

  19. Ursa Major said,

    February 18, 2019 @ 7:45 am

    @Gregory Kusnick

    "What does "fully synthetic" even mean? The organic molecules in petroleum are (unless you're a Tom Gold partisan) ultimately of biological origin."

    Well you could always start from carbon dioxide and water, but why do the hard work when biology does it for you?

    In the field of natural products research there are in fact qualified syntheses. A natural product is a compound that has been extracted from a biological source (it is still called a natural product if synthesised in the lab). A "total synthesis" is a procedure that produces the compound from relatively simple starting materials. A "partial synthesis" produces the compound from a starting material (usually of biological origin) that already has most of the final complexity in place. A "formal synthesis" is a route that begins with different starting materials and stops once it reaches an intermediate product that occurs in a previously published synthesis.

  20. Daniel Barkalow said,

    February 19, 2019 @ 3:14 pm

    I suspect there's confounding jargon from music, where a "synthesizer" does approximately what the classic Speak and Spell did. And you can sample a person speaking and play it with a synthesizer, but that's not what "speech synthesis" means as a compound.

  21. eub said,

    February 19, 2019 @ 4:00 pm

    A "vocoder" after Bell Labs is a specific process of filtering one audio stream by the spectral shape of another
    (sometimes with a whitening filter or other doodads).

  22. Trogluddite said,

    February 19, 2019 @ 7:20 pm

    @Daniel Barkalow
    And to add further ambiguity; if I step through the sounds on my music 'synthesiser' (which might all be generated by the same algorithms), listeners will readily identify some sounds as "being" some other instrument, while others will be identified as "synthesiser" sounds.

    It seems that it's not just that the word is rather ambiguous, but also a problem of how people discriminate between categories. To a technician, 'synthesis' is marked by the way in which the end result is generated, and high fidelity is usually the aim. But what laypeople heard from early implementations had obvious artefacts which marked it as different from the common understanding of 'recordings', which had already attained high fidelity reproduction.

    The boffins called this new thing 'synthesis', and the difference in sound seemed characteristic of it – and a nice shortcut for spotting it without having to know anything technical. "Talking computer/robot" tropes from science-fiction no doubt will have reinforced this impression. To the layman, speech is 'synthesised' when a computer is "deciding how to say the words", and it is easily identified because computers always talk a bit funny.

    In the early days, when limitations of the technology meant that artefacts were unavoidable, and 'recordings' were usually analogue, the technician's and the layperson's views of what constitutes 'synthesised speech' may have referred to nearly the same set of concrete (or fictional) examples, making it less clear that they were talking about two different conceptual categories. As the artefacts become less apparent, and the distinction between 'recording' and 'synthesis' gets blurred by digital technology, it becomes clearer that they are not talking about quite the same thing.

    From the layman's point of view, the technicians have moved the goalposts, so to speak.

RSS feed for comments on this post