- Website: http://estive.net/
Posts by Steven Bird:
The village of Akazu’yw lies in the rainforest, a day’s drive from the state capital of Belém, deep in the Brazilian Amazon. Last week I traveled there, carrying a dozen Android phones with a specialized app for recording speech. It wasn't all plain sailing…
Read the full story here.
It has just become a whole lot easier to search the world's language archives. The new OLAC Language Resource Catalog contains descriptions of over 100,000 language resources from over 40 language archives worldwide.
This catalog, developed by the Open Language Archives Community (OLAC), provides access to a wealth of information about thousands of languages, including details of text collections, audio recordings, dictionaries, and software, sourced from dozens of digital and traditional archives.
OLAC is an international partnership of institutions and individuals who are creating a worldwide virtual library of language resources by: (i) developing consensus on best current practice for the digital archiving of language resources, and (ii) developing a network of interoperating repositories and services for housing and accessing such resources. The OLAC Language Resource Catalog was developed by staff at the Linguistic Data Consortium, the University of Pennsylvania Libraries, the Graduate Institute of Applied Linguistics, and the University of Melbourne. The primary sponsor is the National Science Foundation.
Suppose you had 100 digital recorders and 800 small languages, all in a country the size of California, but in one of the remotest parts of the planet. What would you do? What would it take to identify and train a small army of language workers? How could the recordings they collect be made accessible to people who don't speak the language? My answer to these questions is linked below, but spend a moment thinking about how you might do this before looking.
One inspiration for this work was Mark Liberman's talk "The problems of scale in language documentation" at the Texas Linguistics Society meeting in 2006, in a workshop on Computational Linguistics for Less-Studied Languages. Another inspiration was observing the enthusiasm of the remaining speakers of the Usarufa language to maintain their language (see this earlier post).
About 9 months ago, I decided to ask Olympus if they would give me 100 of their latest model digital voice recorders. They did, and the BOLD:PNG Project starts next week. Please sign the guestbook on that site, or post a comment here, if you'd like to encourage the speakers of these languages who are getting involved in this new project.
Usarufa is a language of Papua New Guinea with just 1200 speakers (ISO-639 code "usa"). There are no fluent speakers under the age of 25, so the language must be considered moribund. Before posting recordings of this language online, I needed to get informed consent, so I introduced some speakers to the World Wide Web. We poked around for a while, finding useful sites about insecticides for dealing with the taro beetle. Then we turned our attention to audio.
I played them a recording of the "last words" of the Jiwarli language of Western Australia. After some questioning looks I explained that this language is now dead, and that we were listening to its last speaker before he died. As one, they all looked down, shaking their heads in disbelief and saying sorry, sorry, sorry…. It was as if I had told them a mutual friend had died. They urged me to put that recording on a cassette tape so they could take it back to their village. That way, everyone would surely understand what will happen to the Usarufa language unless there are serious attempts to revitalize it.
I wasn't prepared for the intensity of their response. Now I'm wondering if a collection of such recordings might be a useful tool in promoting language revitalization, and also in explaining the concept of language archiving. (Thanks to Ima'o Ta'asata, James Warebu, Sivini Ikilele, and Waks Mark for their dedication to the preservation of Usarufa oral culture, and to Aaron Willems and SIL-PNG for facilitating this work.)
Powerset is a search engine that allows users to express their queries as phrases, rather than as a few keywords. It uses natural language processing (NLP) technologies to analyze the verb-argument structure of a query and deliver more focused search results, initially just from Wikipedia. Powerset has attracted interest from the NLP community, as its services promise to demonstrate the value of NLP – and of language analysis more generally – in extracting information from the trillion or more words of text on the web. On Tuesday, Microsoft announced that it had acquired Powerset, and that Powerset will become part of Microsoft's Search Relevance team. I hope this takeover means that natural language search will become mainstream, scaled up to the entire web, and used far more widely than before [Powerset blog|Microsoft Live Search blog].