So many languages, so much technology…

Suppose you had 100 digital recorders and 800 small languages, all in a country the size of California, but in one of the remotest parts of the planet. What would you do? What would it take to identify and train a small army of language workers? How could the recordings they collect be made accessible to people who don't speak the language? My answer to these questions is linked below, but spend a moment thinking about how you might do this before looking.

One inspiration for this work was Mark Liberman's talk "The problems of scale in language documentation" at the Texas Linguistics Society meeting in 2006, in a workshop on Computational Linguistics for Less-Studied Languages. Another inspiration was observing the enthusiasm of the remaining speakers of the Usarufa language to maintain their language (see this earlier post).

About 9 months ago, I decided to ask Olympus if they would give me 100 of their latest model digital voice recorders. They did, and the BOLD:PNG Project starts next week. Please sign the guestbook on that site, or post a comment here, if you'd like to encourage the speakers of these languages who are getting involved in this new project.


  1. Carl said,

    February 5, 2010 @ 10:40 am

    Fascinating! The university at which I work in Colombia, South America, is just in the early stages of getting involved with some projects on indigenous languages — so this is very interesting stuff! :)

  2. Mark Liberman said,

    February 5, 2010 @ 12:01 pm

    I should note that there's an embarrassing order-of-magnitude error in the talk abstract that Steven links to — a plausible estimate for the gross volume of recording needed to get 10,000,000 (Latin- or Greek- or) English-sized words would be about 2,000 hours. (The number in the abstract came from a careless collapsing of this minimal classical-Latin-sized corpus with an estimate of an adult native speaker's linguistic experience, which might be something like 20 years * 365 days * 8 hours = 58,400 hours.)

    But the point is that genuine corpus-based documentation of a language requires orders of magnitude more recording (and transcription) than most language-documentation projects have dared to contemplate. And it's really exciting that Steven and others have now started to experiment with techniques (community-based recording, oral annotation, etc.) that have a chance to work at the scale that's needed.

  3. Andrew Dowd said,

    February 5, 2010 @ 12:50 pm

    This looks like a great project for preserving endangered languages.

    Notice that the FAQ page contains a language-log-worthy sentence:

    'We cannot underestimate the potential for yet-to-be-uninvented methods for enhancing the signal and interpreting its content.'

  4. Matthew Walenski said,

    February 5, 2010 @ 2:55 pm

    This is really exciting and fascinating! I am really glad to hear that the native speakers are so involved and excited about this! The more documentation and preservation of all languages the better, but particularly so for those at risk.

  5. Aviatrix said,

    February 5, 2010 @ 3:35 pm

    The line in the instructions, "Try not to spend more than an hour transcribing a minute of text," reveals just how vast a project this is.

    100 recorders times 32 hours per recorder times 60 minutes per hour: if each recorder were filled once, that would produce 92 man-years of transcription work.

    [(myl) On the other hand, for some underdocumented languages, there are thousands or tens of thousands of people actually or potentially involved in literacy programs.

    And while 100 X real time is reasonable for people in the process of developing a writing system, 10 X or less (with reasonable software support) is a more appropriate goal for transcription time for people who are already literate. But it's absolutely true that making N hours of recordings is only the start of a process whose other stages will be significantly more time-consuming.]
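
The arithmetic in this exchange can be checked with a rough sketch. The recorder count, the 32-hour capacity, and the transcription ratios come from the comments above; the 2,080-hour work year (40 hours a week for 52 weeks) is an assumption, chosen because it reproduces the 92 man-year figure, and the function name is purely illustrative.

```python
# Back-of-the-envelope transcription workload.
# From the thread: 100 recorders, 32 hours of audio each.
# ASSUMPTION: one person-year = 2,080 working hours (40 h/week * 52 weeks).
RECORDERS = 100
HOURS_PER_RECORDER = 32
WORK_HOURS_PER_YEAR = 2080


def transcription_person_years(ratio,
                               recorders=RECORDERS,
                               hours_each=HOURS_PER_RECORDER,
                               work_hours_per_year=WORK_HOURS_PER_YEAR):
    """Person-years needed to transcribe every recorder filled once.

    `ratio` is transcription time divided by recording time, so
    "an hour per minute of text" corresponds to a ratio of 60.
    """
    recorded_hours = recorders * hours_each
    return recorded_hours * ratio / work_hours_per_year


if __name__ == "__main__":
    # 60x: the limit in the project instructions; 100x and 10x: the
    # ratios mentioned in the reply above.
    for ratio in (60, 100, 10):
        years = transcription_person_years(ratio)
        print(f"{ratio:>3} x real time: about {years:.0f} person-years")
```

Under these assumptions, the one-hour-per-minute limit gives roughly 92 person-years, 100 X real time gives roughly 154, and 10 X real time gives roughly 15.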

  6. Carl said,

    February 5, 2010 @ 4:06 pm

    BTW, when I went to sign the guestbook, it coughed up a whole pile of PHP error messages. I don't know whether to just blame IE (though I may as well), but I am not sure that site-visitors' non-oral discourse is being documented!

  7. Busyhands said,

    February 6, 2010 @ 3:40 am

    This is a terrific idea. Kudos for initiating it, and kudos to Olympus for seeing the benefit and donating the equipment.

  8. Lou Hevly said,

    February 6, 2010 @ 11:35 am

    I've been needing to buy a new digital camera and will now go for an Olympus.

  9. scott said,

    February 6, 2010 @ 3:00 pm

    This is a great step in the right direction.

  10. Clarissa at Talk to the Clouds said,

    February 6, 2010 @ 3:30 pm

    The guestbook worked fine for me in Firefox.

    Excellent initiative and great project.

  11. Sili said,

    February 6, 2010 @ 5:40 pm

    Indeed. Big props to Olympus for supporting a project like this.

  12. Jarek Weckwerth said,

    February 7, 2010 @ 8:15 am

    Mark, the 58,400 estimate: Is it yours, or are there sources for this? What does the 8hrs/day estimate mean? I've been looking for these kinds of figures in published literature but without too much success, so I'm interested.

    (If this is too OT, I'm happy to switch to email if you are OK to reply. But it is a point of general interest, isn't it?)

    [(myl) That's my own very crude back-of-the-envelope calculation. Obviously the experience of different individuals will be very different. It wouldn't surprise me if the inter-quartile range spanned an order of magnitude.

    For average contemporary Americans, A.C. Nielsen estimates four hours a day of television watching. There are wordless sequences, but most of that has some talk going on. Add in four more hours of conversation, radio, reading and web surfing and whatnot, and an average of eight hours a day of some kind of linguistic experience seems plausible.

    There are some diary studies that could be brought to bear, though I don't know any that looked carefully at a suitable sample of individuals in a pre-literate culture — and now we're into a suitable subject for another post or two — but it's plausible that normal adult members of a speech community will generally have had something in the range of 10,000 to 100,000 hours of linguistic experience.]

  13. Beijing Sounds said,

    February 10, 2010 @ 8:09 pm

    appears to be down, not available, or whatever the terminology is. Any idea when that might be fixed?

    [Complaint 2]
    If LL would install the "subscribe to comments" plugin, it would solve the problem that we forgetful people have: leaving comments and then forgetting about the thread, even though we're very interested in it.

    [Enthusiasm to counterbalance complaints]
    As some other folks and I discuss similar dying language/dialect issues here in China, we are very interested in following your approach. Hoping for more updates!
