Four revolutions

This started out to be a short report on some cool, socially relevant crowdsourcing for Egyptian Arabic. Somehow it morphed into a set of musings about the (near-) future of natural language processing…

A statistical revolution in natural language processing (henceforth NLP) took place from the late 1980s up to the mid-1990s or so. Knowledge-based methods of the previous several decades were overtaken by data-driven statistical techniques, thanks to increases in computing power, better availability of data, and, perhaps most of all, the (largely DARPA-imposed) re-introduction of the natural language processing community to their colleagues doing speech recognition and machine learning.

There was another revolution that took place around the same time, though. When I started out in NLP, the big dream for language technology was centered on human-computer interaction: we'd be able to speak to our machines, in order to ask them questions and tell them what we wanted them to do. (My first job out of college involved a project where the goal was to take natural language queries, turn them into SQL, and pull the answers out of databases.) This idea has retained its appeal for some people, e.g., Bill Gates, but in the mid 1990s something truly changed the landscape, pushing that particular dream into the background: the Web made text important again. If the statistical revolution was about the methods, the Internet revolution was about the needs. All of a sudden there was a world of information out there, and we needed ways to locate relevant Web pages, to summarize, to translate, to ask questions and pinpoint the answers.
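For flavor, here's a caricature of that kind of natural-language-to-SQL system, reduced to template matching against a tiny grammar. The pattern, table, and column names are invented for illustration; the real systems of that era were far more elaborate.

```python
# A toy NL-to-SQL pipeline: match a question template, emit a query,
# run it against an in-memory database. Purely illustrative.
import re
import sqlite3

def nl_to_sql(question: str) -> str:
    # "How many employees are in <dept>?" -> a COUNT query.
    m = re.match(r"how many employees are in (\w+)\?", question.lower())
    if m:
        return "SELECT COUNT(*) FROM employees WHERE dept = '%s'" % m.group(1)
    raise ValueError("question not covered by the grammar")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, dept TEXT)")
conn.executemany("INSERT INTO employees VALUES (?, ?)",
                 [("ada", "research"), ("bob", "research"), ("cid", "sales")])

sql = nl_to_sql("How many employees are in research?")
print(conn.execute(sql).fetchone()[0])  # → 2
```

The brittleness is the point: anything outside the hand-written grammar simply fails, which is a big part of why the dream stayed a dream for so long.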

Fifteen years or so later, the next revolution is already well underway.

This really came into focus for me in a couple of recent conversations with Chris Callison-Burch and JHU student Scott Novotney. Chris is one of the leaders in the use of crowdsourcing, particularly using Amazon Mechanical Turk, to acquire data language technologists care about. When I heard that Google and its new acquisition, SayNow, were working with Twitter to make tweeting from Egypt possible via voicemail, I asked Chris whether he was thinking of perhaps crowdsourcing translations from Arabic to English, much as Rob Munro of Stanford did for Haitian Creole SMS messages in the wake of the January 2010 earthquake. It turned out that he and Scott Novotney were already on it.

The first step is getting from speech to text. Scott reports that they grabbed an initial batch of about 10 hours' worth of voicemails from the first week, and they are doing further collection once per day. They're applying their previous Mechanical Turk-based speech transcription framework to Egyptian Arabic, and feeding the resulting text to the folks at Alive in Egypt, where it will join the results of transcriptions by volunteers. One can expect their natural next step to be crowdsourcing the translations on Mechanical Turk also.

What's happening here is very cool, just for its social relevance. Two forms of crowdsourcing (volunteer-based and market-based) are being applied in order to connect the Egyptian revolution with the rest of the world, via the spoken word. Restoration of Internet access hasn't slowed it down, either: this Washington Post article reports, "Some of the heaviest volume came after access to both Twitter and the Internet was restored in Egypt earlier this week. The alternative method of tweeting has turned into a forum for longer-form expression because the voice recordings aren't confined to Twitter's 140-character limit."

But what's happening here is more than socially relevant; it also points to the next revolution in the language technology world. The speech recognition community has been putting plenty of effort into automatic speech recognition (ASR) for Arabic over the last ten years, with a great deal of progress having been made. So why isn't ASR a part of Novotney and Callison-Burch's picture here?

It's because the systems that exist are just not ready for the kind of language they're being faced with in this use case. Training data for this scenario is in paltry supply: only about 20 hours of transcribed telephone speech are available for Egyptian Arabic. People have tried applying models trained on the more readily available Modern Standard Arabic (MSA) to dialects, but the word error rates are atrociously high; Scott comments that for, say, Levantine Arabic, you see word error rates of 70 percent or higher. So, apply ASR to Egyptian voice tweets? Don't even bother.
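To make that metric concrete, here's a minimal word error rate (WER) computation: word-level edit distance (substitutions, insertions, deletions) divided by the number of words in the reference transcript. The example sentences are invented; real ASR scoring also handles alignment details this sketch ignores.

```python
# Minimal word error rate: Levenshtein distance over words, normalized
# by reference length. A 70% WER means roughly 7 errors per 10
# reference words.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table for word-level edit distance.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat on mat"))  # ≈ 0.33: 2 deletions / 6 words
```

At 70% WER, the transcript is mostly noise, which is why hand transcription via crowdsourcing beats the automatic systems here.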

We're seeing the same thing everywhere. NLP for social media is the wild west: lots of people are getting into the action (how many different companies are doing sentiment analysis of Twitter right now?), but most of what's out there is really shallow, because the data just don't look like the language the community has been focused on. Apply the usual off-the-shelf analysis tools? For tweets, at least, don't even bother. (Noah Smith noted recently that he and his team are currently brainstorming on ways to get decent part-of-speech tagging for tweets. Deeper analysis is surely going to take a while.)

So, the next revolution in language technology needs is the social media revolution. The trickle that's started showing up as individual papers and at specialized workshops is, I think, shortly going to become a flood. Beginning around 1990 or so, we saw the rise and eventual ubiquity of statistical papers in NLP (illustrated nicely on slide 58 of this 2004 presentation by Ken Church). I suspect that 2010 or hereabouts will eventually be a similar marker: the point where social media datasets started to become unavoidable and, with them, corresponding need-driven technological approaches ranging from fully unsupervised (when there's enough data) to semi-supervised (bootstrapping from those nice crowdsourced annotations we've been talking about).

Here's what had Scott Novotney most jazzed about the Egyptian Arabic crowdsourcing he's doing, when we spoke: the fact that he's now got a "nice little corpus of 10 hours of dialect data" from the real world, which is going to allow him to bootstrap into ASR for real-world Egyptian dialect using semi-supervised techniques. That's the future, right there.
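For readers who haven't met the idea, here's a toy sketch of that kind of semi-supervised bootstrapping (self-training): start from a small labeled seed set (think: the crowdsourced transcriptions), label the unlabeled pool with the current model, and fold back in only the confident predictions. The 1-D "features" and the nearest-centroid "model" are invented stand-ins for real acoustic features and a real recognizer.

```python
# Self-training sketch: iteratively promote confident automatic labels
# into the training set. Toy data and model, for illustration only.

def centroids(labeled):
    # Mean feature value per class label.
    sums, counts = {}, {}
    for x, y in labeled:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def self_train(seed, unlabeled, rounds=3, margin=1.0):
    labeled, pool = list(seed), list(unlabeled)
    for _ in range(rounds):
        c = centroids(labeled)
        keep = []
        for x in pool:
            # Confidence = gap between the best and second-best class.
            dists = sorted((abs(x - mu), y) for y, mu in c.items())
            if dists[1][0] - dists[0][0] >= margin:
                labeled.append((x, dists[0][1]))  # confident: promote
            else:
                keep.append(x)                    # uncertain: leave in pool
        pool = keep
    return centroids(labeled), pool

seed = [(0.0, "a"), (0.2, "a"), (5.0, "b"), (5.2, "b")]
pool = [0.4, 0.5, 4.8, 2.6]          # 2.6 sits between the classes
model, leftovers = self_train(seed, pool)
print(leftovers)  # → [2.6] (the ambiguous point stays unlabeled)
```

The design choice that matters is the confidence threshold: set it too low and early mistakes snowball into the training set, which is exactly the failure mode that makes a clean seed corpus like Scott's so valuable.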

9 Comments

  1. Yao Ziyuan said,

    February 5, 2011 @ 11:25 pm

    If I understand correctly, the "four revolutions" mentioned in the title are:

    Two revolutions in NLP methods:
    (1) Statistics;
    (2) Human intelligence (I strongly agree!).

    Two revolutions in NLP playgrounds and data sources:
    (3) The Internet;
    (4) Social media.

    But I doubt there should be a fork of NLP specifically for "tweets." The reason why tweets are sometimes different from normal language is largely the 140-char limit. Is it easier for Twitter Inc. to lift that size limit, or is it easier to develop a new NLP field specifically for tweets? The answer depends on whether NLP researchers want real research or just more papers published.

  2. GeorgeW said,

    February 6, 2011 @ 7:47 am

    Very interesting and I wish you all the best.

    Maybe I am wrong, but I would think that the diglossia in Arabic would pose additional problems. I have seen signs and heard speech (chants, etc.) that are a mixture of MSA and colloquial.

  3. D said,

    February 6, 2011 @ 10:56 am

    You said, "I suspect that 2010 or hereabouts will eventually be…"

    I think you meant 2011. I read that line and went back to look at the article date, asking myself if I had somehow happened upon an old article. I thought maybe you were a fortune teller who had foreseen the Egyptian Revolution ;)

  4. wally said,

    February 6, 2011 @ 1:10 pm

    And just yesterday I noticed that my android cell phone is automatically transcribing voice mail messages that are left on it. For now a free trial, but I can subscribe for 2 bucks a month.

  5. Jacob said,

    February 6, 2011 @ 2:12 pm

    @Yao The main reason tweets are different is language use, not number of characters. People use slang that doesn't exist in any dictionary, and honestly may not have even been seen a year ago (the character limit simply makes the slang more common since it's shorter). And different slang words may be used to mean the same thing depending on the writer's geographic region. So the main issue with NLP for social media is that language is changing too fast, and we don't have enough data or tools to keep up with it.

  6. Reading Up On Bayesian Methods | Spence Green said,

    February 8, 2011 @ 7:50 pm

    [...] may be. What interests me is the pairing of Bayesian updating with data collection from the web. Philip Resnik recently covered efforts to translate voicemails during the revolution in Egypt as one method of re-connecting that country with the world. This [...]

  7. Recent Linkage 6 « Signifying Media said,

    February 11, 2011 @ 4:09 am

    [...] Philip Resnik at the Language Log, which I seem to be linking to all the time now, describes projects connecting the Egyptian revolution to the rest of the world in the context of state-of-the-art natu…. [...]

  8. ENKI-][ said,

    February 12, 2011 @ 6:31 pm

    @yao: To play devil's advocate, translation in which the translated text must obey particular unusual constraints can be argued to be, in and of itself, a technical problem separate from the problems of automatic translation. That the translated text must simultaneously meet (for instance) length constraints and retain more or less the same meaning as the original implies that significant weight must be put on decisions made when determining structure and word choice, which would be an intermediary step. Such a system would necessarily be more complex than existing translation systems, since it must be able to translate the same thing in several different ways depending upon the amount of space left.

    That said, the typical cheap-hack fix is fairly sufficient here: translate the text, host it elsewhere, and post a hyperlink.

  9. Michael Gamon said,

    March 1, 2011 @ 12:38 pm

    Fyi: At ACL this year, there will be a workshop on Language in Social Media (http://research.microsoft.com/en-us/events/lsm2011/default.aspx) where we are hoping to get the conversation going on some of these topics. A good place to submit :)
