This started out to be a short report on some cool, socially relevant crowdsourcing for Egyptian Arabic. Somehow it morphed into a set of musings about the
A statistical revolution in natural language processing (henceforth NLP) took place from the late 1980s through the mid-1990s or so. Knowledge-based methods of the previous several decades were overtaken by data-driven statistical techniques, thanks to increases in computing power, better availability of data, and, perhaps most of all, the (largely DARPA-imposed) re-introduction of the natural language processing community to their colleagues doing speech recognition and machine learning.
There was another revolution that took place around the same time, though. When I started out in NLP, the big dream for language technology was centered on human-computer interaction: we'd be able to speak to our machines, in order to ask them questions and tell them what we wanted them to do. (My first job out of college involved a project where the goal was to take natural language queries, turn them into SQL, and pull the answers out of databases.) This idea has retained its appeal for some people, e.g., Bill Gates, but in the mid 1990s something truly changed the landscape, pushing that particular dream into the background: the Web made text important again. If the statistical revolution was about the methods, the Internet revolution was about the needs. All of a sudden there was a world of information out there, and we needed ways to locate relevant Web pages, to summarize, to translate, to ask questions and pinpoint the answers.
Fifteen years or so later, the next revolution is already well underway.
This really achieved clarity for me in a couple of recent conversations with Chris Callison-Burch and JHU student Scott Novotney. Chris is one of the leaders in the use of crowdsourcing, particularly using Amazon Mechanical Turk, to acquire data language technologists care about. When I heard that Google and its new acquisition, SayNow, were working with Twitter to make tweeting from Egypt possible via voicemail, I asked Chris whether he was thinking of perhaps crowdsourcing translations from Arabic to English, much as Rob Munro of Stanford did for Haitian Creole SMS messages in the wake of the January 2010 earthquake. It turned out that he and Scott Novotney were already on it.
The first step is getting from speech to text. Scott reports that they grabbed an initial batch of about 10 hours' worth of voicemails from the first week, and they are collecting more once per day. They're applying their previous Mechanical Turk-based speech transcription framework to Egyptian Arabic, and feeding the resulting text to the folks at Alive in Egypt, where it will join the transcriptions produced by volunteers. One can expect their natural next step to be crowdsourcing the translations on Mechanical Turk as well.
What's happening here is very cool, just for its social relevance. Two forms of crowdsourcing (volunteer-based and market-based) are being applied in order to connect the Egyptian revolution with the rest of the world, via the spoken word. Restoration of Internet access hasn't slowed it down, either: this Washington Post article reports, "Some of the heaviest volume came after access to both Twitter and the Internet was restored in Egypt earlier this week. The alternative method of tweeting has turned into a forum for longer-form expression because the voice recordings aren't confined to Twitter's 140-character limit."
But what's happening here is more than socially relevant; it also points to the next revolution in the language technology world. The speech recognition community has been putting plenty of effort into automatic speech recognition (ASR) for Arabic over the last ten years, with a great deal of progress having been made. So why isn't ASR a part of Novotney and Callison-Burch's picture here?
It's because the systems that exist are just not ready for the kind of language they're being faced with in this use case. Training data for this scenario is paltry: 20 hours of transcribed telephone speech are available for Egyptian Arabic. People have tried applying models trained on the more readily available Modern Standard Arabic (MSA) to dialects, but the word error rates are atrociously high; Scott comments that for, say, Levantine Arabic, you see word error rates of 70 percent or higher. So, apply ASR to Egyptian voice tweets? Don't even bother.
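For anyone who hasn't run into the metric: word error rate is conventionally the number of word substitutions, deletions, and insertions needed to turn the recognizer's output into the reference transcript, divided by the number of words in the reference. Here's a minimal sketch of that computation in Python. The function name and the toy sentences are mine, made up for illustration; this is not Scott and Chris's evaluation code, and real scoring tools also normalize the text before comparing.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Toy example (invented): 10 reference words, 7 of them misrecognized.
reference = "a b c d e f g h i j"
hypothesis = "a b c x x x x x x x"
print(word_error_rate(reference, hypothesis))  # 0.7
```

At a rate like that, roughly seven out of every ten reference words are wrong or missing, which is why the output is unusable. Note that because insertions count as errors, the rate can even exceed 100 percent.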
We're seeing the same thing everywhere. NLP for social media is the Wild West: lots of people are getting in on the action (how many different companies are doing sentiment analysis of Twitter right now?), but most of what's out there is really shallow, because the data just don't look like the language the community has been focused on. Apply the usual off-the-shelf analysis tools? For tweets, at least, don't even bother. (Noah Smith noted recently that he and his team are currently brainstorming on ways to get decent part-of-speech tagging for tweets. Deeper analysis is surely going to take a while.)
So, the next revolution in language technology needs is the social media revolution. The trickle that's started showing up as individual papers and at specialized workshops is, I think, shortly going to become a flood. Beginning around 1990 or so, we saw the rise and eventual ubiquity of statistical papers in NLP (illustrated nicely on slide 58 of this 2004 presentation by Ken Church). I suspect that 2010 or thereabouts will eventually be a similar marker, the point where social media datasets started to become unavoidable, and, with them, corresponding need-driven technological approaches ranging from fully unsupervised (when there's enough data) to semi-supervised (bootstrapping from those nice crowdsourced annotations we've been talking about).
Here's what had Scott Novotney most jazzed about the Egyptian Arabic crowdsourcing he's doing, when we spoke: the fact that he's now got a "nice little corpus of 10 hours of dialect data" from the real world, which is going to allow him to bootstrap into ASR for real-world Egyptian dialect using semi-supervised techniques. That's the future, right there.