Death or birth?
The most recent IEEE Signal Processing Society Newsletter has an interesting article by David Suendermann, "Speech scientists are dead. Interaction designers are dead. Who is next?".
His argument is that "Commercial spoken dialog systems can process millions of calls per week", and therefore "one can implement a variety of changes at different points in the application and randomly choose one competitor every time the point is hit in the course of a call", using techniques like reinforcement learning to adaptively optimize the design. As a result, "the contender approach can change the life of interaction designers and speech scientists in that best practices and experience-based decisions can be replaced by straight-forward implementation of every alternative one can think of".
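To make the idea concrete, here is a minimal sketch of one "contender" decision point. It is not taken from the article: it uses an epsilon-greedy bandit, which is only one simple form of reinforcement learning, and the class name, prompt names, success rates and reward definition (1 for a successful call, 0 otherwise) are all assumptions made up for illustration.

```python
import random


class ContenderPoint:
    """One decision point in a dialog with several competing designs.

    Hypothetical illustration: an epsilon-greedy bandit, not the
    method described in the article.
    """

    def __init__(self, contenders, epsilon=0.1, rng=None):
        self.contenders = list(contenders)
        self.epsilon = epsilon
        self.rng = rng or random.Random()
        self.calls = {c: 0 for c in self.contenders}
        self.mean_reward = {c: 0.0 for c in self.contenders}

    def choose(self):
        # With probability epsilon, explore a random contender;
        # otherwise exploit the one with the best observed mean reward.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.contenders)
        return max(self.contenders, key=lambda c: self.mean_reward[c])

    def update(self, contender, reward):
        # Incremental update of the running mean reward.
        self.calls[contender] += 1
        n = self.calls[contender]
        self.mean_reward[contender] += (reward - self.mean_reward[contender]) / n


def simulate(true_success, n_calls=20000, seed=0):
    """Route simulated calls through one decision point.

    true_success maps each contender to an assumed success probability.
    """
    rng = random.Random(seed)
    point = ContenderPoint(true_success, epsilon=0.1, rng=rng)
    for _ in range(n_calls):
        contender = point.choose()
        reward = 1.0 if rng.random() < true_success[contender] else 0.0
        point.update(contender, reward)
    return point


if __name__ == "__main__":
    point = simulate({"prompt_a": 0.60, "prompt_b": 0.75, "prompt_c": 0.40})
    for name in point.contenders:
        print(name, point.calls[name], round(point.mean_reward[name], 3))
```

With enough calls, most traffic drifts toward the contender with the highest observed success rate. Note that someone still has to think up the contenders and decide what counts as a reward.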
I yield to no one in my appreciation for Big Data, eScience (or in this case, I guess, eEngineering…), the Fourth Paradigm, and all that. And reinforcement learning is a fine technique, with interesting roots in the extraordinary Rescorla-Wagner model of classical conditioning (though there are other ideas around). Everyone should know about this stuff, and apply it where it works, in spoken dialog system optimization as elsewhere.
But I think that David goes (or implies going) way too far: IMHO, the massive accumulation of data from the digital networking of the whole world is going to increase, not decrease, the demand for scientists and engineers. I don't have time to say more about it for now, so feel free to discuss the question among yourselves.
[Update — Fernando Pereira, who is in a good position to know, has a quick list of contrary bullet points:
- Even with all that data — and I agree it's growing fast as voice interaction on smart phones becomes ubiquitous — randomized search will be swamped by the combinatorial possibilities of interface design.
- The only way to manage the combinatorics is to impose intelligent biases on the search process. That is, we need engineers and designers who understand and know how to apply the relevant computer science and statistics.
- Automated tools do not achieve good designs by themselves, because we do not know how to quantify good design as a mathematical objective function, even if the combinatorial problem could be tamed. We need human designers to steer the tools, evaluate the results, and recognize potential disasters. Their training may be different, but they are not 'dead'.
]
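A back-of-the-envelope sketch of the combinatorial point, under assumptions that are mine rather than Fernando's: 20 decision points with 3 alternatives each, 1,000 calls needed to evaluate one complete design, and one million calls per week (the low end of the "millions of calls per week" quoted above). The function names are hypothetical.

```python
import math


def design_count(alternatives_per_point):
    """Number of complete designs if every combination is a candidate."""
    return math.prod(alternatives_per_point)


def weeks_to_test_all(alternatives_per_point, calls_per_week, calls_per_design):
    """Weeks of traffic needed to give every complete design its share of calls."""
    total_calls = design_count(alternatives_per_point) * calls_per_design
    return total_calls / calls_per_week


if __name__ == "__main__":
    points = [3] * 20  # assumed: 20 decision points, 3 alternatives each
    print(design_count(points))
    print(weeks_to_test_all(points, calls_per_week=1_000_000, calls_per_design=1_000))
```

Under these assumptions an exhaustive search would take millions of weeks. Treating each decision point as independent would make the cost additive instead of multiplicative, but deciding when that independence assumption is safe is exactly the kind of intelligent bias the second bullet asks for.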
Mark P said,
April 23, 2010 @ 12:23 pm
I imagine there are lots of practical problems that "implementation of every alternative one can think of" can solve. Isn't that the way computer programs beat chess masters? On the other hand, it may not lead to much understanding. In the history of science, the analysis of large (for the time) quantities of astronomical observations led to a relatively good predictive model involving epicycles. I suppose continued implementation of every alternative one can think of would have led to a Sun-centric model of the solar system, but if the Earth-centric model could predict as well (or as well as necessary), why choose one over the other? Aside from the fact that an Earth-centric model would have made sending probes to Mars harder.
Nick Lamb said,
April 23, 2010 @ 12:29 pm
"Rich Phone Applications" is what David's company calls their business sector. I think by this they mean (Rich (Phone Applications)). But this sector is dying, and the reason isn't anything to do with Big Data or the Fourth Paradigm, it is the increasing dominance of ((Rich Phone) Applications). Instead of literally talking to a machine (which remains awkward) people are happy to communicate with it by pointing, tapping and gesturing at a touch screen interface they carry in their pockets. The telephone got smarter.
Nobody wants to fight their way through voice prompt menus, no matter how "optimal". Telephone helplines should be reserved for situations where a (human) representative of the company needs to talk to the (also human) customer. They're a lousy alternative to the GUI for interacting with machines.
Nick Lamb said,
April 23, 2010 @ 12:40 pm
Mark P, actually your instinct is correct, get the maths right and you can get the same answers with a Heliocentric or a Geocentric model. Or you can put a small rock near Jupiter in the middle, it doesn't matter. Check out the 1905 paper "Zur Elektrodynamik bewegter Körper" by a certain Albert Einstein and his later 1915 work which makes it all work properly with gravity. It turns out there is no privileged frame of reference, so you can just pick whichever is convenient.
peter said,
April 23, 2010 @ 1:08 pm
When SPSS and similar statistical analysis software programs began appearing in the mid-1960s, many statisticians thought that the widespread use of these programs would result in unemployment for statisticians. In fact, lowering the expertise-threshold necessary for analysis of statistical data increased the demand for people with high levels of statistical expertise — to advise (and to rectify the work of) those without adequate expertise.
Jens Fiederer said,
April 23, 2010 @ 2:27 pm
"Commercial spoken dialog systems can process millions of calls per week" reminds me of that skit they used to have so many versions of on Saturday Night Live.
It was about Toonces, the cat who could drive a car. Pretty much every episode ended with stock footage of a car falling off a cliff.
Yes, Toonces could drive a car. "But not very WELL."
Peter Taylor said,
April 23, 2010 @ 6:44 pm
Does that Fourth Paradigm link work for anyone? It's the second time this week I've followed a link to that page from LL, and both times the server has been unresponsive.
John Roth said,
April 23, 2010 @ 7:04 pm
What I got out of this is something a bit different. It's an approach that's used in a number of large web sites, and recommended by quite a few leading designers: get feedback on what works from how the customers who are trying to use your site actually behave, dammit!
On a large enough site, you don't have to try one approach at a time. You can try several, or several dozen, and keep track of how people react to each alternative.
Beyond that, I'm not sure what he's recommending. Unless he's recommending automatically generating the alternatives, it's hardly new.
MD said,
April 23, 2010 @ 7:17 pm
I do research in spoken dialogue systems, and I don't see my work dying any time soon ;-) The main rule of statistics/machine learning is "Garbage In, Garbage Out". There is no "just" in annotating for semantic representations. An annotation project that can provide reliable data for machine learning will cost a lot of money to run, and requires supervision of people who actually understand how systems work (= Interaction Designers).
And there is your basic chicken-and-egg problem: you can optimize by recording lots of calls – but only if you have a reasonably working system first to provide contending choices. Otherwise you are going to alienate lots and lots of customers with bad choices to start with. You can theoretically "pre-optimize" by building simulated users, but your optimization will still be only as good as your simulated user is, which again requires someone who understands how systems work.
So, tools may change and the type of work will change, but I think the news of my death is premature.
Mel Nicholson said,
April 23, 2010 @ 10:18 pm
I've heard this joke before with other punchlines…
"Because of computers, we will have a paperless office."
"Email will eliminate all that useless junk mail you've been bothered by."
"Internet related technology should pan out in about ten years, after which we won't need so many programmers."
"With the invention of the Atomic Bomb, war will become unthinkable."
"With the invention of dynamite, war will be impossible."
"The crossbow should spell an end to war because armor is now useless."
I can personally attest to hearing statements with the same intent as all but the last two. They all have the same blind spot for the fact that new solutions breed new problems.
Okko said,
April 24, 2010 @ 6:38 am
Article reads: "Instead of carefully tweaking rule-based grammars, user dictionaries, and confidence thresholds, there is a lazy but high-performing recipe. One needs to systematically collect large numbers of utterances from all the contexts of a spoken dialog system, transcribe these utterances, annotate them for their semantic meaning, and train statistical language models and classifiers to replace grammars that have been used in these recognition contexts before."
But this recipe is precisely what speech scientists do, and it's usually more, not less time-consuming than "carefully tweaking rule-based grammars"…
Aaron Davies said,
April 25, 2010 @ 12:31 am
i think they're talking about googlish-engineering, where alternative/new features are automatically tested on random subsets of users to breed the best possible combination.
Okko said,
April 25, 2010 @ 11:14 am
Sure, but a. the necessary annotation process is far costlier than a bit of rule tweaking (automated or not), and b. speech scientists/interaction designers have (for years) been employing statistical methods for improving speech applications, with varying degrees of (un-)supervision.
Most importantly, the data can't speak for itself without a model, the creation of which has, in essence, been at the core of speech science and user interface job descriptions.
Maybe the automatic generation and evaluation of tuning parameters provides some novel tools to the speech science/user interface toolkit (replacing some of the "gut feeling" approach criticized), but I don't see any mass firings coming up. The sorry state of some speech apps out there makes me think we'll need more of them.
ella said,
April 30, 2010 @ 2:42 pm
wait – speech scientists are dead? All of them? Must begin writing pages of condolence letters to the families of my friends and former colleagues! Wait a minute, I used to be a speech scientist….dodged a bullet there, I'd say…
Sean Crist said,
May 6, 2010 @ 9:38 am
To expand on what Okko said:
Statistical grammars can certainly give better recognition accuracy than rule-based grammars built by hand. They are not a time-saver, though, despite Suendermann’s suggestion. Here are some realistic time estimates:
1. Basic rule-based grammars, made by hand: I can knock out dozens of those in a day. They are also easy and cheap to maintain.
2. A grammar with a fixed set of strings gathered from actual caller utterances, with weights based on observed frequencies: this takes around an afternoon for one grammar. If you’re doing a lot of them, you could build tools to make the process somewhat more efficient, but transcribed caller utterances are an inherently dirty kind of data, and I wouldn’t advise making this kind of grammar without some human judgment about what to omit. Having someone semantically tag the utterances is not free. Also, you need to start with a made-by-hand grammar for your initial deployment so that you can collect the data, and most clients don’t want to pay to replace that existing grammar with a statistical grammar unless there’s some obvious benefit. There are cases where this kind of grammar is worth the effort, but it often gives only a marginal uplift in accuracy over a plain old made-by-hand grammar, and it’s harder to maintain.
3. A statistical grammar of the sort that allows the caller to answer an open-ended question such as “Please tell me in a few words why you are calling today”: around two weeks of effort for one grammar, and it takes a very large collection of tagged utterances.
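For readers who have not built one, the second kind of grammar above can be sketched roughly as follows. This is a toy illustration, not a production recipe: the function name, the `min_count` cutoff and the sample utterances are assumptions, and the cutoff is only a crude stand-in for the human judgment about what to omit.

```python
from collections import Counter


def weighted_phrase_list(transcriptions, min_count=2):
    """Build a fixed phrase list with weights from observed frequencies.

    Phrases seen fewer than min_count times are dropped (assumed cutoff);
    the weights of the remaining phrases sum to 1.
    """
    counts = Counter(t.strip().lower() for t in transcriptions if t.strip())
    kept = {phrase: n for phrase, n in counts.items() if n >= min_count}
    total = sum(kept.values())
    if total == 0:
        return {}
    return {phrase: n / total for phrase, n in kept.items()}


if __name__ == "__main__":
    sample = ["yes", "Yes", "yeah", "yes", "no", "no", "uh yes please"]
    print(weighted_phrase_list(sample))
```

The resulting phrase-to-weight table would still have to be converted into whatever grammar format the recognizer expects, and each phrase still needs a semantic tag, which is where the real cost lies.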
David Suendermann said,
May 6, 2010 @ 5:33 pm
Certainly, statistical grammars are not 100% free. Building them consistently for all recognition contexts of large-scale dialog systems (as I have been doing for the past two years) requires an immense infrastructure to collect, transcribe, semantically annotate, train, test, and deploy them in a continuous cycle. Some of these tasks are completely automatic, some of them only partially. Transcription can be automated using ASR for between 40 and 60% of the utterances, depending on the recognition context, when human-like transcription accuracy (i.e., less than 2% WER) is required. Semantic annotation can be highly automated based on the simple principle of never touching an utterance that has already been annotated. This leads to automation rates of >98% in simple contexts (such as y/n) and >80% in large-vocabulary contexts (HMIHY). Taking into consideration that both transcription and annotation can be outsourced to people with basic English language skills to keep costs as low as possible (which is indeed what I am doing; not even speaking of crowdsourcing yet), my task, i.e., that of the speech scientist, is to keep the beast running. Honestly, it takes me only a couple of minutes a day to do so, unless brand-new grammars are requested, and these can be quickly hand-crafted because I do not have to worry about performance. The statistical update grammars coming out of the above-described process always and everywhere perform better.
Yours,
David
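The "never touch an utterance twice" principle can be pictured as a lookup table in front of the human annotators. The sketch below is a guess at the general shape, not David's actual infrastructure: the class name, the text normalization and the toy labelling function are all hypothetical.

```python
class AnnotationStore:
    """Reuse earlier annotations; ask a human only for unseen utterances."""

    def __init__(self):
        self._labels = {}
        self.automatic = 0
        self.manual = 0

    def annotate(self, transcription, ask_human):
        # Normalize case and whitespace so trivial variants share one entry
        # (an assumed normalization, chosen for illustration).
        key = " ".join(transcription.lower().split())
        if key in self._labels:
            self.automatic += 1
        else:
            self._labels[key] = ask_human(key)
            self.manual += 1
        return self._labels[key]

    def automation_rate(self):
        """Fraction of utterances labelled without human effort."""
        total = self.automatic + self.manual
        return self.automatic / total if total else 0.0


if __name__ == "__main__":
    store = AnnotationStore()

    def toy_human(utterance):
        # Stand-in for a human annotator in a yes/no context.
        return "YES" if "yes" in utterance else "NO"

    for utterance in ["yes", "no", "Yes", "yes ", "nope", "no"]:
        store.annotate(utterance, toy_human)
    print(store.automation_rate())
```

In a small-vocabulary context most calls repeat a handful of utterances, so the lookup hits often; in open-ended contexts novel utterances are more common and more of the work falls back on people.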
A more optimistic Outlook on the Future of Speech | Okko in Speech said,
June 30, 2010 @ 5:48 am
[…] speech application industry got some critical press in recent months (here are some spirited responses, […]
Daniel.S said,
March 17, 2011 @ 9:19 am
"Automated tools do not achieve good designs by themselves, because we do not know how to quantify good design as a mathematical objective function"
Very true statement.