Archive for Resources

The sparseness of linguistic data

Gary Marcus and Ernest Davis say in a New York Times piece on why we shouldn't buy all the hype about the Big Data revolution in science:

Big data is at its best when analyzing things that are extremely common, but often falls short when analyzing things that are less common. For instance, programs that use big data to deal with text, such as search engines and translation programs, often rely heavily on something called trigrams: sequences of three words in a row (like "in a row"). Reliable statistical information can be compiled about common trigrams, precisely because they appear frequently. But no existing body of data will ever be large enough to include all the trigrams that people might use, because of the continuing inventiveness of language.

To select an example more or less at random, a book review that the actor Rob Lowe recently wrote for this newspaper contained nine trigrams such as "dumbed-down escapist fare" that had never before appeared anywhere in all the petabytes of text indexed by Google. To witness the limitations that big data can have with novelty, Google-translate "dumbed-down escapist fare" into German and then back into English: out comes the incoherent "scaled-flight fare." That is a long way from what Mr. Lowe intended — and from big data's aspirations for translation.
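
To make the sparseness point concrete, here is a minimal sketch in Python of what counting trigrams looks like. It is only an illustration: the toy corpus, the helper names `trigrams` and `unseen`, and the naive whitespace tokenization are all assumptions of this sketch, not anything described in the Times piece or used by Google.

```python
from collections import Counter


def trigrams(text):
    """Return every run of three consecutive words in `text`.

    Tokenization is deliberately naive (lowercase, split on whitespace);
    this is an assumption of the sketch, not how production systems work.
    """
    words = text.lower().split()
    return [tuple(words[i:i + 3]) for i in range(len(words) - 2)]


# A toy "corpus" (hypothetical). Real systems count over billions of words.
corpus = (
    "three words in a row make a trigram and "
    "many trigrams in a row appear again and again"
)
counts = Counter(trigrams(corpus))


def unseen(phrase, counts):
    """Return the trigrams of `phrase` that never occur in the counted corpus."""
    return [t for t in trigrams(phrase) if t not in counts]


# A common trigram has been seen more than once; a novel one has no count at all.
print(counts[("in", "a", "row")])
print(unseen("dumbed-down escapist fare", counts))
```

However large the corpus gets, a newly coined phrase comes back from `unseen` with nothing to base a statistic on, which is the limitation the quoted passage describes.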


A reprieve for DARE

A month ago, I posted an "SOS for DARE," detailing the impending financial threat faced by the Dictionary of American Regional English, a national treasure of lexicography. At the time it appeared that the College of Letters and Science at the University of Wisconsin, where DARE is based, would be unable to provide support to offset the loss of federal and private grant money. But now there's finally some good news out of Madison, in the form of new funds from the University and external gifts.


SOS for DARE

Many Language Log readers are no doubt familiar with the Dictionary of American Regional English, which I hailed in a Boston Globe column last year as "a great project on how Americans speak — make that the great project on how Americans speak." At the time, I was previewing DARE's fifth volume, which completed the alphabetical run all the way to zydeco.  Since then, a sixth volume of supplemental materials has also been published, and plans are underway to launch the digital version of DARE, which would serve as an online home for future expansions and revisions. But now DARE editor Joan Hall passes along some troubling news about the dictionary's financial fate.


Universal alphabet

Not that I think this is any sort of panacea, but our good friends at the BBC have seen fit to ask: "Could a new phonetic alphabet promote world peace?"

Although backers of this supposed universal alphabet claim that "it will make pronunciation easy and foster international understanding", I have doubts that SaypU (Spell As You Pronounce Universal project) constitutes a viable route to world peace.


The American Heritage Dictionary of the English Language, 5th edition

As soon as I heard that the 5th edition of The American Heritage Dictionary of the English Language (AHD) had come out, I rushed to the nearest Barnes & Noble bookstore (yes, they still exist — that was Borders that closed) and plunked down two Bens (hundred dollar bills) to buy three copies at $60 each:  one for my office at Penn, one for my study at home, and one for a friend.  The 5th ed. was actually published in November, 2011, but I was in China then, and didn't get a chance to buy my own copies until the day I arrived back on American soil.


Burlesques, parodies, playful allusions

On my personal blog, here, an inventory of postings on these topics — at the moment, only postings on my blog.


i2speak

There's a free web-based tool for IPA entry at i2speak.com.

Inventory of libfix postings

(and related material), assembled on my blog, here.


Inventory on nucularity

Over on my blog, an assemblage of postings (almost all from Language Log) on the pronunciation of nuclear: here.


New search service for language resources

It has just become a whole lot easier to search the world's language archives.  The new OLAC Language Resource Catalog contains descriptions of over 100,000 language resources from over 40 language archives worldwide.

This catalog, developed by the Open Language Archives Community (OLAC), provides access to a wealth of information about thousands of languages, including details of text collections, audio recordings, dictionaries, and software, sourced from dozens of digital and traditional archives.

OLAC is an international partnership of institutions and individuals who are creating a worldwide virtual library of language resources by: (i) developing consensus on best current practice for the digital archiving of language resources, and (ii) developing a network of interoperating repositories and services for housing and accessing such resources.  The OLAC Language Resource Catalog was developed by staff at the Linguistic Data Consortium, the University of Pennsylvania Libraries, the Graduate Institute of Applied Linguistics, and the University of Melbourne.  The primary sponsor is the National Science Foundation.


LINGUIST List (2010)!

It's that time of the year again: the LINGUIST List's annual fund drive is under way, for the month of March; the drive is about halfway (about $32,000) to its goal of $65,000 (the money goes to support the student staff). From the list's site:

The LINGUIST List is dedicated to providing information on language and language analysis, and to providing the discipline of linguistics with the infrastructure necessary to function in the digital world.


Google Demotes Literary Stars

My post about Google's metadata problems, along with a similar piece in the Chronicle of Higher Education, got a lot of people talking about the problem in the press and the blogs. (I even ran into an allusion to it in a La Repubblica piece on the Google Book Settlement when I arrived in Rome yesterday morning.) A number of people passed along their own experiences with flaky metadata. Others criticized me on grounds that could be broadly summed up as "Don't look a gift horse in the server," "It's better than nothing," "Who needs metadata anyway?," "Just give them time," and "Why concentrate on trivialities like metadata while ignoring the real perils of corporate monopoly" (as in "serving as a consultant for monitoring the proper temperatures of the pitchforks in hell").

This is all to the good, if it helps move up the metadata issues in Google's queue. I do think this will get a lot better as Google puts its considerable mind to it. But there was one other aspect of the metadata problem which I hadn't noticed or even thought about, but which in its own small way was the unkindest cut of all. It was noticed by the children's book author Ace Bauer, who was prompted by my account of the metadata problems to check his Google Books listing:

Turns out my review rating ranked only one star out of 5. That's dim. But see, the review upon which they based this ranking was Kirkus's. Kirkus loved the book. They gave it a star. One star. That's all they give folks. It's considered a major honor.

Indeed it is, and the falling-star glitch actually affects a number of writers, for example Roy Blount, Jr., the president of the Authors Guild, who has been an enthusiastic backer of the settlement. Google Books assigns a one-out-of-five star rating to at least two of Blount's books, Crackers and First Hubby, on the basis of their starred Kirkus reviews, and visits similar review rating downgrades on books by Guild vice-president Judy Blume and Guild board members Nick Lemann, James Gleick, and Oscar Hijuelos, among others.

I don't know exactly what the Google people will say when they cotton to this one, but it's a good guess the first sentence will begin with "oy."


Some little inventories
