Ordinary language and technical terminology often diverge. We've covered the "passive voice" case at length. I don't think we've discussed the fact that for botanists, cucumbers and tomatoes are berries but strawberries and raspberries aren't, but there are many examples of such terminological divergence in fields outside of linguistics. However, the technical terminology is itself sometimes vague or ambiguous in ways that lead to confusion among outsiders, and today I want to explore one case of this kind: "speech synthesis".

Andrew Liszewski, "My Favorite Childhood Gadget of the '80s, the Speak & Spell, Is Back", Gizmodo 2/18/2019:

By today’s standards, the Speak & Spell is beyond primitive, but when introduced by Texas Instruments at CES in 1978, it was one of the first handheld devices to incorporate an electronic display, expansion cartridges, and a speech synthesis engine that could say and spell over 200 words. It even ran on one of the first microprocessors, the TMS1000, which was a power hog that would quickly drain the toy’s four C-sized batteries.

The Speak & Spell’s computerized voice was its most impressive feature, and it was so fascinating to me as a kid that I can still clearly hear its raspy, slightly incomprehensible pronunciations in my head; when I’d be hard-pressed to remember the voices of any of my childhood friends. […]

Where the new Speak & Spell differs from the original—and this could be a deal-breaker for some nostalgia-seekers—is its voice. Instead of using a synthesizer that generates spoken words from a bunch of coded instructions, Basic Fun!’s Speak & Spell uses voice recordings that have been processed to sound like they’re being generated by a computer. The monotonous, stilted delivery sounds very close to the original version, but it’s definitely different.

The contrast drawn in that last paragraph, between a "synthesizer that generates spoken words from a bunch of coded instructions" and "voice recordings", is somewhere between confused and false. The original 1978 TI Speak & Spell used naturally-spoken speech that was compressed via LPC ("linear predictive coding") so as to fit on a then-available inexpensive solid-state memory device, and then uncompressed for playback using TI's then-new LPC chip. So it's true that it "generat(ed) spoken words from a bunch of coded instructions". But so (I assume) does the modern imitation. And so do your cell phone, your .mp3- or .aac-encoded podcasts, your Audible audiobooks, etc.

It's true that this process of reconstituting compressed or "encoded" speech is commonly called "(re-)synthesis". But the reason that the original Speak & Spell produced "raspy, slightly incomprehensible pronunciations" is partly that it used extreme compression — about 1200 bits per second, as opposed to 64000 or 128000 bps for typical .mp3 audio, or 4750 to 12200 bps for GSM cellular voice transmission — and partly that it used an early-generation encoding algorithm.

The Wikipedia entry gives more details but is also misleading:

The Speak & Spell used the first single-chip voice synthesizer, the TMC0280, later called the TI TMS5100, which utilized a 10th-order linear predictive coding (LPC) model by using pipelined electronic DSP logic. A variant of this chip with a very similar voice would eventually be utilized in certain Chrysler vehicles in the 1980s as the Electronic Voice Alert.

Speech synthesis data (phoneme data) for the spoken words were stored on a pair of 128 Kbit metal gate PMOS ROMs.

As the Wikipedia article itself goes on to explain, it's not "phoneme data" that was stored — which would generally have been the case if the system had used a text-to-speech algorithm — but rather the time functions of linear prediction parameters, f0, amplitude, and so on, derived from human recordings of the specific words to be spoken:

The technique used to create the words was to have a professional speaker speak the words. The utterances were captured and processed. Originally all of the recording and processing was completed in Dallas. By 1982 when the British, French, Italian and German versions were being developed, the original voices were recorded in the TI facility near Nice in France and these full bit rate digital recordings were sent to Dallas for processing using a minicomputer.[35] Some weeks later the processed data was returned and required significant hand editing to fix the voicing errors which had occurred during the process. The data rate was so radically cut that all of the words needed some editing. In some cases this was fairly simple, but some words were unintelligible and required days of work and others had to be completely scrapped. The stored data were for the specific words and phrases used in the Speak & Spell. The data rate was about 1,000 bits per second.
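To put the data rates in perspective, here is a rough back-of-the-envelope sketch of how many seconds of speech would fit in the pair of 128 Kbit ROMs at the various bit rates mentioned above. It assumes that "Kbit" means 1024 bits and ignores any overhead for word indexing or framing, so the results are only ballpark figures; the function and constant names are made up for illustration.

```python
# Assumption: 1 Kbit = 1024 bits, and the whole ROM pair holds speech data.
ROM_BITS = 2 * 128 * 1024

# Bit rates quoted in the text, in bits per second.
RATES_BPS = {
    "Speak & Spell LPC (about 1200 bps)": 1200,
    "Speak & Spell LPC (Wikipedia's about 1,000 bps)": 1000,
    "GSM voice, lowest rate": 4750,
    "GSM voice, highest rate": 12200,
    "mp3 at 64000 bps": 64000,
    "mp3 at 128000 bps": 128000,
}


def seconds_of_audio(capacity_bits, bits_per_second):
    """Seconds of audio that fit in capacity_bits at a constant bit rate."""
    return capacity_bits / bits_per_second


if __name__ == "__main__":
    for label, rate in RATES_BPS.items():
        print(f"{label}: {seconds_of_audio(ROM_BITS, rate):.1f} s")
```

At the LPC rates the ROMs hold a few minutes of speech; at typical .mp3 rates the same memory would hold only a few seconds, which is why such radical compression was needed.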

For some background on LPC, see Bishnu Atal, "The History of Linear Prediction", IEEE Signal Processing Magazine 3/2006. Bishnu uses the older and less confusing term "vocoder" (= Voice Coder) to refer to the process of compressing and reconstituting speech, but he also writes about "LPC analysis and resynthesis":

LPC rapidly became a very popular topic in speech research. A large number of people contributed valuable ideas for the application of the basic theory of linear prediction to speech analysis and synthesis. The excitement was evident at practically every technical meeting. Research on LPC vocoders gained momentum partly due to increased funding from the U.S. government and its selection for the 2.4 kb/s secure-voice standard LPC10. LPC required a lot of computations when it started being applied to speech. Fortunately, computer technology was rapidly evolving. By 1973, the first compact real-time LPC vocoder had been implemented at Philco-Ford. In 1978, Texas Instruments introduced a popular LPC-based toy that was called “Speak and Spell.”

And Bishnu's seminal 1971 paper had the title "Speech analysis and synthesis by linear prediction of the speech wave" — and uses the word "synthesizer" to describe the subsystem for reconstituting the speech signal:

Abstract: We describe a procedure for efficient encoding of the speech wave by representing it in terms of time‐varying parameters related to the transfer function of the vocal tract and the characteristics of the excitation. The speech wave, sampled at 10 kHz, is analyzed by predicting the present speech sample as a linear combination of the 12 previous samples. The 12 predictor coefficients are determined by minimizing the mean‐squared error between the actual and the predicted values of the speech samples. Fifteen parameters—namely, the 12 predictor coefficients, the pitch period, a binary parameter indicating whether the speech is voiced or unvoiced, and the rms value of the speech samples—are derived by analysis of the speech wave, encoded and transmitted to the synthesizer. The speech wave is synthesized as the output of a linear recursive filter excited by either a sequence of quasiperiodic pulses or a white‐noise source. Application of this method for efficient transmission and storage of speech signals as well as procedures for determining other speech characteristics, such as formant frequencies and bandwidths, the spectral envelope, and the autocorrelation function, are discussed.
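The analysis/synthesis loop described in that abstract can be sketched in a few lines of plain Python. This is only an illustration, not the paper's or TI's implementation: it uses the autocorrelation method with the Levinson-Durbin recursion (one common way to minimize the mean-squared prediction error), skips pitch and voicing detection and parameter quantization entirely, and returns the residual energy instead of an rms value. All function names are invented for the example.

```python
import math
import random


def autocorrelation(frame, max_lag):
    """Return r[0..max_lag], where r[k] = sum_n frame[n] * frame[n + k]."""
    n = len(frame)
    return [
        sum(frame[i] * frame[i + lag] for i in range(n - lag))
        for lag in range(max_lag + 1)
    ]


def lpc_analyze(frame, order=12):
    """Estimate predictor coefficients a[1..order] for one frame.

    Minimizes the squared error of s[n] ~ sum_k a[k] * s[n - k] using the
    Levinson-Durbin recursion; also returns the residual error energy.
    """
    r = autocorrelation(frame, order)
    a = [0.0] * (order + 1)  # a[0] is unused
    err = r[0]               # prediction error energy so far
    for i in range(1, order + 1):
        if err <= 0.0:       # silent or perfectly predictable frame
            break
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        prev = a[:]
        a[i] = k
        for j in range(1, i):
            a[j] = prev[j] - k * prev[i - j]
        err *= 1.0 - k * k
    return a[1:], err


def lpc_synthesize(coeffs, gain, n_samples, pitch_period=None, seed=0):
    """Drive the linear recursive (all-pole) filter with an excitation.

    pitch_period (in samples) selects a pulse train, as for voiced speech;
    None selects white noise, as for unvoiced speech.
    """
    rng = random.Random(seed)
    history = [0.0] * len(coeffs)  # previous outputs, most recent first
    out = []
    for n in range(n_samples):
        if pitch_period is not None:
            excitation = 1.0 if n % pitch_period == 0 else 0.0
        else:
            excitation = rng.gauss(0.0, 1.0)
        sample = gain * excitation + sum(c * h for c, h in zip(coeffs, history))
        out.append(sample)
        history = [sample] + history[:-1]
    return out


if __name__ == "__main__":
    # Toy "vocal tract": a single resonance near 500 Hz at 10 kHz sampling.
    radius, theta = 0.9, 2 * math.pi * 500 / 10000
    true_coeffs = [2 * radius * math.cos(theta), -radius * radius]
    # "Natural" input: pulses every 100 samples, i.e. a 100 Hz pitch.
    voiced = lpc_synthesize(true_coeffs, 1.0, 400, pitch_period=100)
    coeffs, err = lpc_analyze(voiced, order=12)
    # Assumption: spread the residual energy evenly over the four pulses.
    gain = math.sqrt(err / 4)
    resynth = lpc_synthesize(coeffs, gain, 400, pitch_period=100)
    print([round(c, 3) for c in coeffs[:2]], round(gain, 3))
```

The point of the sketch is that the "synthesizer" here does nothing but reconstitute a signal from parameters that were measured on an input signal. A real coder would run this frame by frame, estimate pitch and voicing from the recording, and quantize the parameters to reach a bit rate like those discussed above.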

In talking about food or fibers or drugs, we generally don't use the words synthetic, synthesis etc. to describe things created by processing natural substances. Synthetic fabrics are derived from petroleum or whatever, not by processing plant materials. But in the case of speech, synthesis is used to describe reconstituting a stream of natural speech (or other audio) that has been processed for more efficient transmission, as well as to describe the creation of entirely-synthetic audio as in text-to-speech applications. (Though in fairness, TTS these days mostly works by combination and modification of fragments of human speech from a large collection of naturally-spoken audio, or by a fuzzier "deep learning" creation of audio time-functions by analogy to a large body of natural training speech — this is something like fibers created by extraction and polymerization of molecular fragments from an organic source.)



