
Current text-to-speech systems are pretty good. Their output is almost always comprehensible, and often pretty natural-sounding. But there are still glitches.

This morning, Dick Margulis sent an example of one common problem: inconsistent (and often wrong) stressing of complex nominals:

We have a winding road that we drive with our Google Maps navigator on, to keep us from taking a wrong turn in the woods. We have noticed that "West Woods Road" is rendered with a few different stress patterns as we go from turn to turn, and we can't come up with a hypothesis explaining the variation. Attached is a recording. It's a few minutes long because that's how long the trip takes. The background hum is the car.

I've extracted and concatenated the 11 Google Maps instructions from the four minutes and five seconds of the attached recording:

[Audio clip: the 11 concatenated Google Maps instructions]

This case is especially puzzling since the voice does very well on the first instance, and then screws up pretty consistently thereafter.

But in general, it's not trivial to guess the correct accentuation of patterns like "ADJ NOUN road". If it's a three-word name, as in this case, then the system's first version is the right choice:

[Audio clip]

But with a different interpretation of the same three words, a human speaker might do something like the system's later versions:

[Audio clip]

At least, compare this clip from a sociolinguistic interview:

[Audio clip]

And the stress pattern of "Woods Road" would be more plausible if it were "Woods Street", since the English language has irrationally decided to assign primary stress to X in X Street, as opposed to what we do for Road, Avenue, Boulevard, Way, Alley, etc. Though the apparent phrase-final lengthening and pitch fall on "west" remain puzzling.
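For readers who like to see rules written down, here is a toy sketch of the "Street" exception described above. It is not how Google's system (or any real synthesizer) works. It assumes the name is a flat sequence of words whose last word is the head, and it ignores the bracketing question entirely. The names `LEFT_STRESS_HEADS`, `primary_stress_index`, and `mark_stress` are made up for this illustration.

```python
# Toy illustration only: a hand-written rule, not a real TTS component.
# Assumption: the last word is the head noun ("Road", "Street", ...).

# Heads that push primary stress back onto the preceding word (X in "X Street").
LEFT_STRESS_HEADS = {"street"}


def primary_stress_index(words):
    """Return the index of the word that gets primary stress."""
    if not words:
        raise ValueError("expected at least one word")
    if len(words) == 1:
        return 0
    if words[-1].lower() in LEFT_STRESS_HEADS:
        # "X Street": stress falls on the last word of X.
        return len(words) - 2
    # "X Road", "X Avenue", etc.: stress falls on the head itself.
    return len(words) - 1


def mark_stress(name):
    """Show the stressed word in capitals and the rest in lower case."""
    words = name.split()
    i = primary_stress_index(words)
    return " ".join(
        w.upper() if j == i else w.lower() for j, w in enumerate(words)
    )


print(mark_stress("West Woods Road"))    # west woods ROAD
print(mark_stress("West Woods Street"))  # west WOODS street
```

A rule this crude says nothing about the alternative reading in which "West" modifies "Woods Road", nor about the odd prosody on "west", which is part of why the problem is hard.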

We could try to make up a story about robot themes and rhemes gone wrong, but most likely this is just one of those weird and inexplicable things that "deep learning" systems sometimes do.

Some Googlers of my acquaintance are well aware of problems like this one — so maybe Dick's navigational narrative will become more natural, if less interesting, after a few more updates to Google Maps.

Here are a couple of relevant (if antique) references:

Richard Sproat and Mark Liberman, "Towards Treating English Nominals Correctly", ACL 1987.

Mark Liberman and Richard Sproat, "The Stress and Structure of Modified Noun Phrases in English", in Lexical Matters, Sag and Szabolcsi, Eds. 1992.

"Parsers that count", 11/25/2003

"Complex nominal of the year", 11/2/2018.
