Archive for Speech technology

Yanny vs. Laurel, pt. 2

Just when you thought you'd never have to worry about this vexing acoustic phenomenon again (after "Yanny vs. Laurel: an analysis by Benjamin Munson" [5/16/18] and the comments thereto had carried out such a probing, exhaustive investigation), a 3:44 video (5/15/18) has surfaced that attempts to explain it in a way that has not yet been mentioned:

Read the rest of this entry »

Comments (27)

Yanny vs. Laurel: an analysis by Benjamin Munson

A peculiar audio clip has turned into a viral sensation, the acoustic equivalent of "the dress" — which, you'll recall, was either white and gold or blue and black, depending on your point of view. This time around, the dividing line is between "Yanny" and "Laurel."

The Yanny vs. Laurel perceptual puzzle has been fiercely debated (see coverage in the New York Times, the Atlantic, Vox, and CNET, for starters). Various linguists have chimed in on social media (notably, Suzy J. Styles and Rory Turnbull on Twitter). On Facebook, the University of Minnesota's Benjamin Munson shared a cogent analysis that he provided to an inquiring reporter, and he has graciously agreed to have an expanded version of his explainer published here as a guest post.
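
The two percepts are commonly attributed to the clip's energy being ambiguous between a lower-frequency band (favoring "Laurel") and a higher one (favoring "Yanny"), so playback equipment that tilts the spectrum can flip what listeners hear. Here is a toy numeric sketch of that frequency-tilting effect, using synthetic tones rather than the actual clip; the 300 Hz / 3000 Hz bands and the one-pole filter are illustrative assumptions, not an analysis of the recording:

```python
import math

def tone(freq, n=4800, sr=48000):
    """A pure sine tone as a list of float samples (0.1 s by default)."""
    return [math.sin(2 * math.pi * freq * i / sr) for i in range(n)]

def one_pole_lowpass(x, alpha=0.05):
    """One-pole low-pass filter: y[i] = y[i-1] + alpha * (x[i] - y[i-1])."""
    y, prev = [], 0.0
    for s in x:
        prev += alpha * (s - prev)
        y.append(prev)
    return y

def rms(x):
    """Root-mean-square amplitude."""
    return math.sqrt(sum(s * s for s in x) / len(x))

sr = 48000
low = tone(300, sr=sr)    # stand-in for the lower "Laurel" band
high = tone(3000, sr=sr)  # stand-in for the higher "Yanny" band
mix = [a + b for a, b in zip(low, high)]

# Low-pass playback attenuates the high band far more than the low one,
# tilting the energy balance toward the "Laurel" side of the ambiguity.
filtered = one_pole_lowpass(mix)
```

Comparing `rms` of the filtered low and high components shows the tilt directly: the low band comes through largely intact while the high band is heavily attenuated.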

Read the rest of this entry »

Comments (117)

World disfluencies

Disfluency has been in the news recently, for two reasons: the deployment of filled pauses in an automated conversation by Google Duplex, and a cross-linguistic study of "slowing down" in speech production before nouns vs. verbs.

Lance Ulanoff, "Did Google Duplex just pass the Turing Test?", Medium 5/8/2018:

I think it was the first “Um.” That was the moment when I realized I was hearing something extraordinary: A computer carrying out a completely natural and very human-sounding conversation with a real person. And it wasn’t just a random talk. […]

Duplex made the call and, when someone at the salon picked up, the voice AI started the conversation with: “Hi, I’m calling to book a woman’s hair cut appointment for a client, um, I’m looking for something on May third?”
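
The filled pauses that so impressed listeners are a surface feature that is easy to script, whatever machinery lies behind Duplex itself. A hedged sketch of that surface effect (this is not Google's method, and the function name and parameters are invented for illustration):

```python
import random

def add_filled_pauses(text, rate=0.3, filler="um,", seed=0):
    """Insert a filler word after randomly chosen words (never after the
    last one). A toy imitation of Duplex-style disfluency, nothing more."""
    rng = random.Random(seed)
    words = text.split()
    out = []
    for i, w in enumerate(words):
        out.append(w)
        if i < len(words) - 1 and rng.random() < rate:
            out.append(filler)
    return " ".join(out)
```

The seeded generator makes the disfluency placement reproducible; a rate of 0 returns the text untouched, while a rate of 1 puts a filler after every non-final word.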

Frank Seifart et al., "Nouns slow down speech: evidence from structurally and culturally diverse languages", PNAS 2018:

When we speak, we unconsciously pronounce some words more slowly than others and sometimes pause. Such slowdown effects provide key evidence for human cognitive processes, reflecting increased planning load in speech production. Here, we study naturalistic speech from linguistically and culturally diverse populations from around the world. We show a robust tendency for slower speech before nouns as compared with verbs. Even though verbs may be more complex than nouns, nouns thus appear to require more planning, probably due to the new information they usually represent. This finding points to strong universals in how humans process language and manage referential information when communicating linguistically.
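
The paper's core measurement can be miniaturized: given tokens annotated with part of speech and duration, compare the average duration of the word immediately preceding nouns against the word immediately preceding verbs. A toy sketch on invented data (the token list and durations below are made up to mirror the reported effect; they are not the authors' corpus):

```python
# Toy POS-annotated tokens: (word, part_of_speech, duration_in_seconds).
tokens = [
    ("I", "PRON", 0.10), ("saw", "VERB", 0.20),
    ("the", "DET", 0.22), ("dog", "NOUN", 0.30),
    ("and", "CONJ", 0.10), ("it", "PRON", 0.09),
    ("ran", "VERB", 0.21), ("to", "ADP", 0.11),
    ("the", "DET", 0.24), ("river", "NOUN", 0.33),
]

def mean_preceding_duration(tokens, pos):
    """Average duration of the word immediately before each token with
    the given part of speech: a crude proxy for pre-word slowdown."""
    durs = [tokens[i - 1][2]
            for i in range(1, len(tokens))
            if tokens[i][1] == pos]
    return sum(durs) / len(durs) if durs else 0.0

pre_noun = mean_preceding_duration(tokens, "NOUN")
pre_verb = mean_preceding_duration(tokens, "VERB")
```

In this contrived data, as in the study's naturalistic recordings, the context before nouns is slower than the context before verbs (`pre_noun > pre_verb`).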

Read the rest of this entry »

Comments (12)

An experiment with echoing Echos

Henry Cooke (aka "prehensile" on GitHub) has hatched a fascinating techno-artistic experiment. He set up two Amazon Echos to talk back and forth, each repeating a text to the other, with every iteration introducing new errors. His initial inspiration was "I Am Sitting in a Room," a 1969 work of acoustic art by Alvin Lucier, in which a text is recorded and re-recorded until all that is left is the hum of resonant frequencies in the room. (You can watch a 2014 performance with Lucier here.) Rather than replicate Lucier's text, Cooke created new ones for the two Echos to vocalize, with an added wrinkle: iterations of the texts follow the Oulipo S+7 constraint, in which each noun is replaced by another noun appearing seven steps away in the dictionary. You can see the first ten iterations (using Amazon Polly to synthesize different voices) in this video.
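
The S+7 constraint itself is mechanical enough to sketch in a few lines: look each noun up in a sorted word list and replace it with the noun seven entries later, wrapping at the end. The word list and noun inventory below are small toy stand-ins, not Cooke's actual dictionary:

```python
# A miniature noun dictionary for demonstrating the Oulipo S+7 constraint.
DICTIONARY_NOUNS = sorted([
    "apple", "arm", "badge", "bell", "boat", "book", "bottle", "box",
    "cake", "cat", "chair", "cloud", "coat", "desk", "dog", "door",
    "river", "road", "room", "rope", "sand", "ship", "shoe", "sky",
])

def s_plus_7(word, nouns=DICTIONARY_NOUNS, offset=7):
    """Apply the S+7 constraint to one word; non-nouns pass through."""
    if word not in nouns:
        return word
    return nouns[(nouns.index(word) + offset) % len(nouns)]

sentence = "the cat sat in the room".split()
shifted = " ".join(s_plus_7(w) for w in sentence)
```

With this list, "cat" shifts to "river" and "room" wraps around to "arm", giving "the river sat in the arm"; iterating the substitution (as the Echos do) drifts the text further with each pass.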

Read the rest of this entry »

Comments (4)

Skin motion

Nia Wesley, "Girl's tattoo of late grandma's voicemail can be played with iPhone":

A Chicago singer honored her late grandmother in a unique way.

Sakyrah Morris held on to a voicemail her grandmother left her just a month before she passed away. She got a tattoo of the voicemail's waveform.

Using technology from a company called Skin Motion, Morris can hold her iPhone camera over the tattoo and hear her grandmother's voice at any moment.
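
As best I can tell from Skin Motion's description, the tattoo serves as a visual key that the app recognizes before playing back stored audio; the ink itself is a simplified waveform outline rather than the audio data. Reducing a clip to such an outline is straightforward. A sketch, using a made-up decaying tone in place of the voicemail:

```python
import math

def waveform_envelope(samples, n_bars=40):
    """Reduce audio to per-chunk peak amplitudes: the simplified outline
    a waveform graphic (or tattoo) is drawn from."""
    chunk = max(1, len(samples) // n_bars)
    return [max(abs(s) for s in samples[i:i + chunk])
            for i in range(0, chunk * n_bars, chunk)]

# A one-second decaying 220 Hz tone standing in for a short voicemail.
sr = 8000
clip = [math.exp(-3 * t / sr) * math.sin(2 * math.pi * 220 * t / sr)
        for t in range(sr)]
bars = waveform_envelope(clip)
```

The resulting `bars` list is the kind of peak-per-segment silhouette familiar from audio players; for the decaying tone, the bars shrink from left to right.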

Read the rest of this entry »

Comments (7)

News program presenter meets robot avatar

Yesterday BBC's Radio 4 program "Today", the cultural counterpart of NPR's "Morning Edition", invited into the studio a robot from the University of Sheffield, the Mishalbot, which had been trained to conduct interviews by exposure to the on-air speech of co-presenter Mishal Husain. They let it talk for three minutes with the real Mishal (video clip here, at least for UK readers; it may not be available in the US). Once again I was appalled at the credulity of journalists when confronted with AI. Despite all the evidence that the robot was just parroting Mishalesque phrases, Ms Husain continued with the absurd charade, politely pretending that her robotic alter ego was really conversing. Afterward there was half-serious on-air discussion of the possibility that some day the jobs of the Today program presenters and interviewers might be taken over by robots.

The main thing differentiating the Sheffield robot from Joseph Weizenbaum's ELIZA program of 1966 (apart from a babyish plastic face and movable fingers and eyes, which didn't work well on radio) was that the Mishalbot is voice-driven (with ELIZA you had to type on a terminal). So the main technological development has been in speech recognition engineering. On interaction, the Mishalbot seemed to me to be at sub-ELIZA level. "What do you mean? Can you give an example?" it said repeatedly, at various inappropriate points.
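
For comparison, the pattern-matching trick behind ELIZA fits in a dozen lines: match the input against reflection templates, and fall back on stock prompts when nothing matches (exactly the "What do you mean? Can you give an example?" register). The rules below are illustrative, not Weizenbaum's 1966 script:

```python
import re

# Minimal ELIZA-style responder: no understanding, just pattern matching
# and canned reflection of the user's own words.
RULES = [
    (re.compile(r"\bi am (.+)", re.I), "Why do you say you are {0}?"),
    (re.compile(r"\bi feel (.+)", re.I), "How long have you felt {0}?"),
    (re.compile(r"\bbecause\b", re.I), "Is that the real reason?"),
]
FALLBACKS = ["What do you mean?", "Can you give an example?"]

def respond(utterance, fallback_index=0):
    """Return the first matching canned response, else a stock prompt."""
    for pattern, template in RULES:
        m = pattern.search(utterance)
        if m:
            return template.format(*m.groups())
    return FALLBACKS[fallback_index % len(FALLBACKS)]
```

The fallback prompts fire whenever no rule matches, which is why such systems repeat them at inappropriate points, just as the Mishalbot did.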

Read the rest of this entry »

Comments off

Voice recognition for inputting

When I'm with my sister Heidi, whether it be in Seattle or northeast Ohio or anywhere else in the world, she's often talking to Siri.  She asks Siri to look up information about trees, about food, about traditional medicines, about Yoga, about genealogy, and anything else she wants to investigate.  Above all, when we're driving around, she asks Siri for directions about how to get where we're going.

To me, who doesn't even own a cell phone, this is all quite miraculous.  A few days ago, at the conclusion of my "Language, Script, and Society in China" class, however, a new (for me) dimension of voice recognition was demonstrated by one of the students.

Read the rest of this entry »

Comments (21)

Awesome / sugoi すごい!

Comments (7)

Siri and flatulence

An acquaintance of mine has a new iPhone, which he carries in a pocket that is (relevantly) below waist level. He has discovered something that dramatically illustrates the difference between (i) responding to speech and (ii) responding to speech as humans do, on the basis of knowing that it is speech.

Read the rest of this entry »

Comments off

Another fake AI failure?

The "silly AI doing something stupidly funny" trope is a powerful one, partly because people like to see the mighty cast down, and partly because the "silly stupid AI" stereotype is often valid.

But as with stereotypes of human groups, the most viral examples are often fakes. Take the "Voice Recognition Elevator" skit from a few years ago, which showed an ASR system that was baffled by a Scottish accent, epitomizing the plight of Scots trapped in a dehumanized world that doesn't understand them. But back in the real world, I found that when I played the YouTube version of the skit to the 2010 version of Google Voice on my cell phone, it correctly transcribed the whole thing.

And I suspect that the recent viral "tuba-to-text conversion" meme is another artful fraud.

Read the rest of this entry »

Comments (14)

Advances in tuba-to-text conversion

My dad accidentally texted me with voice recognition…while playing the tuba

(h/t Chris Waigl)

[Update: Mark Liberman suggests this might be some artful fakery. See: "Another fake AI failure?"]

Comments (11)

Voice recognition for English and Mandarin typing

In All Tech Considered (8/24/16), Aarti Shahani has an article titled "Voice Recognition Software Finally Beats Humans At Typing, Study Finds".

Turns out voice recognition software has improved to the point where it is significantly faster and more accurate at producing text on a mobile device than we are at typing on its keyboard. That's according to a new study by Stanford University, the University of Washington and Baidu, the Chinese Internet giant. The study ran tests in English and Mandarin Chinese.

Baidu chief scientist Andrew Ng says this should not feel like defeat. "Humanity was never designed to communicate by using our fingers to poke at a tiny little keyboard on a mobile phone. Speech has always been a much more natural way for humans to communicate with each other," he says.

Read the rest of this entry »

Comments (18)

The swazzle: a simple device for voice modulation

Until two days ago, I had never heard of this word — even though I knew about Punch and Judy shows.

From Wikipedia:

A swazzle is a device made of two strips of metal bound around a cotton tape reed. The device is used to produce the distinctive harsh, rasping voice of Punch and is held in the mouth by the Professor (performer) in a Punch and Judy show.

Swazzle can also be pronounced or spelled Schwazzle or swatchel.

I like the fact that the performer is called "Professor"!

Read the rest of this entry »

Comments (17)