Archive for Research tools

The sparseness of linguistic data

Gary Marcus and Ernest Davis say in a New York Times piece on why we shouldn't buy all the hype about the Big Data revolution in science:

Big data is at its best when analyzing things that are extremely common, but often falls short when analyzing things that are less common. For instance, programs that use big data to deal with text, such as search engines and translation programs, often rely heavily on something called trigrams: sequences of three words in a row (like "in a row"). Reliable statistical information can be compiled about common trigrams, precisely because they appear frequently. But no existing body of data will ever be large enough to include all the trigrams that people might use, because of the continuing inventiveness of language.

To select an example more or less at random, a book review that the actor Rob Lowe recently wrote for this newspaper contained nine trigrams such as "dumbed-down escapist fare" that had never before appeared anywhere in all the petabytes of text indexed by Google. To witness the limitations that big data can have with novelty, Google-translate "dumbed-down escapist fare" into German and then back into English: out comes the incoherent "scaled-flight fare." That is a long way from what Mr. Lowe intended — and from big data's aspirations for translation.
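
The sparseness point in the quoted passage can be made concrete with a toy sketch. This is only an illustration under stated assumptions: the one-sentence `corpus` and the helper names `trigrams` and `seen` are hypothetical, and whitespace tokenization stands in for whatever a real search engine or translation system does.

```python
from collections import Counter


def trigrams(text):
    """Return every sequence of three consecutive words in text."""
    words = text.lower().split()
    return [tuple(words[i:i + 3]) for i in range(len(words) - 2)]


# Hypothetical stand-in for a large indexed body of text.
corpus = "the cat sat in a row and the dog sat in a row"
counts = Counter(trigrams(corpus))


def seen(phrase):
    """True if every trigram in phrase occurs at least once in the corpus."""
    return all(counts[t] > 0 for t in trigrams(phrase))


print(counts[("in", "a", "row")])         # frequent trigram, count is reliable
print(seen("dumbed-down escapist fare"))  # novel trigram, no statistics at all
```

However large the corpus grows, a phrase that nobody has written before still has a count of zero, which is the limitation the passage describes.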

A reprieve for DARE

A month ago, I posted an "SOS for DARE," detailing the impending financial threat faced by the Dictionary of American Regional English, a national treasure of lexicography. At the time it appeared that the College of Letters and Sciences at the University of Wisconsin, where DARE is based, would be unable to provide support to offset the loss of federal and private grant money. But now there's finally some good news out of Madison, in the form of new funds from the University and external gifts.

SOS for DARE

Many Language Log readers are no doubt familiar with the Dictionary of American Regional English, which I hailed in a Boston Globe column last year as "a great project on how Americans speak — make that the great project on how Americans speak." At the time, I was previewing DARE's fifth volume, which completed the alphabetical run all the way to zydeco.  Since then, a sixth volume of supplemental materials has also been published, and plans are underway to launch the digital version of DARE, which would serve as an online home for future expansions and revisions. But now DARE editor Joan Hall passes along some troubling news about the dictionary's financial fate.

The American Heritage Dictionary of the English Language, 5th edition

As soon as I heard that the 5th edition of The American Heritage Dictionary of the English Language (AHD) had come out, I rushed to the nearest Barnes & Noble bookstore (yes, they still exist — that was Borders that closed) and plunked down two Bens (hundred dollar bills) to buy three copies at $60 each:  one for my office at Penn, one for my study at home, and one for a friend.  The 5th ed. was actually published in November, 2011, but I was in China then, and didn't get a chance to buy my own copies until the day I arrived back on American soil.

A new chapter for Google Ngrams

When Google's Ngram Viewer was launched in December 2010 it encouraged everyone to be an amateur computational linguist, an amateur historical lexicographer, or a little of both. Today, the public interface that allows users to plumb the Google Books megacorpus has been relaunched, and the new version makes it even more enticing to researchers, both scholarly and nonscholarly. You can read all about it in my online piece for The Atlantic, as well as Jon Orwant's official introduction on the Google Research blog.

Soundex and Metaphone

One of the earliest and best photographers in China was called John Zumbrun, but I have also seen his surname spelled in various ways, including Zumbrum.  Some of his pictures may be seen here (this site is run by Thomas H. Hahn, digital archivist of old photographs).

As soon as I saw his surname, I suspected that it might be a variant of the Zumbrunnen among my own maternal relatives who were of Swiss German extraction.  When I mentioned to my sister Heidi (who does intense genealogical research on our family) that I thought Zumbrun might be a variant of Zumbrunnen, she replied, "Oh man, the variant spellings of Zumbrunnen are driving me batty.  I have even seen Zum Pwunnen.  Have you heard of the soundex?  It is a way to index names & deal with all of the variant spellings."
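
For readers who have not met Soundex, here is a minimal sketch of the commonly described American Soundex rules: keep the first letter, map the remaining consonants to digits, collapse adjacent duplicates, and pad to four characters. The function name `soundex` and the handling of spaces are assumptions made for illustration, not a reference implementation of any particular archive's indexing scheme.

```python
# Digit classes of American Soundex; vowels, h, w and y carry no digit.
CODES = {}
for digit, letters in enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1):
    for ch in letters:
        CODES[ch] = str(digit)


def soundex(name):
    """Return the four-character Soundex code for name ('' if no letters)."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code is kept
    return (result + "000")[:4]


for variant in ["Zumbrun", "Zumbrum", "Zumbrunnen", "Zum Pwunnen"]:
    print(variant, soundex(variant))
```

Under these rules Zumbrun, Zumbrum, and Zumbrunnen all come out as Z516, so an index keyed on the code files them together. Zum Pwunnen comes out as Z515, because dropping the r changes the third digit, which shows that Soundex absorbs many spelling variants but not every one.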

New search service for language resources

It has just become a whole lot easier to search the world's language archives.  The new OLAC Language Resource Catalog contains descriptions of over 100,000 language resources from over 40 language archives worldwide.

This catalog, developed by the Open Language Archives Community (OLAC), provides access to a wealth of information about thousands of languages, including details of text collections, audio recordings, dictionaries, and software, sourced from dozens of digital and traditional archives.

OLAC is an international partnership of institutions and individuals who are creating a worldwide virtual library of language resources by: (i) developing consensus on best current practice for the digital archiving of language resources, and (ii) developing a network of interoperating repositories and services for housing and accessing such resources.  The OLAC Language Resource Catalog was developed by staff at the Linguistic Data Consortium, the University of Pennsylvania Libraries, the Graduate Institute of Applied Linguistics, and the University of Melbourne.  The primary sponsor is the National Science Foundation.

Oxford Chinese Dictionary

Well, my copy of the new English-Chinese Chinese-English (hereafter ECCE) Oxford Chinese Dictionary (hereafter OCD) from Oxford University Press has arrived, and I must admit that it is very big and very impressive.  There has been a lot of buzz about this dictionary in the last couple of weeks, most of it generated by their own publicity department, working with the media.

Embuggerance & Feisty

Problems with Google's metadata are a recurrent theme here on Language Log. Now on his blog Stephen Chrisomalis reports a stunning cascade of screw-ups that led to Google Scholar producing the following citation:

Embuggerance, E., and H. Feisty. 2008. The linguistics of laughter. English Today 1, no. 04: 47-47.

Google Demotes Literary Stars

My post about Google's metadata problems, along with a similar piece in the Chronicle of Higher Education, got a lot of people talking about the problem in the press and the blogs. (I even ran into an allusion to it in a La Repubblica piece on the Google Book Settlement when I arrived in Rome yesterday morning.) A number of people passed along their own experiences with flaky metadata. Others criticized me on grounds that could be broadly summed up as "Don't look a gift horse in the server," "It's better than nothing," "Who needs metadata anyway?," "Just give them time," and "Why concentrate on trivialities like metadata while ignoring the real perils of corporate monopoly" (as in "serving as a consultant for monitoring the proper temperatures of the pitchforks in hell").

This is all to the good, if it helps move up the metadata issues in Google's queue. I do think this will get a lot better as Google puts its considerable mind to it. But there was one other aspect of the metadata problem which I hadn't noticed or even thought about, but which in its own small way was the unkindest cut of all. It was noticed by the children's book author Ace Bauer, who was prompted by my account of the metadata problems to check his Google Books listing:

Turns out my review rating ranked only one star out of 5. That's dim. But see, the review upon which they based this ranking was Kirkus's. Kirkus loved the book. They gave it a star. One star. That's all they give folks. It's considered a major honor.

Indeed it is, and actually the falling-star glitch affects a number of writers, for example Roy Blount, Jr., the president of the Authors Guild, who has been an enthusiastic backer of the settlement. Google Books assigns a one-out-of-five star rating to at least two of Blount's books, Crackers and First Hubby, on the basis of their starred Kirkus reviews, and visits similar review rating downgrades on books by Guild vice-president Judy Blume and Guild board members Nick Lemann, James Gleick, and Oscar Hijuelos, among others.

I don't know exactly what the Google people will say when they cotton to this one, but it's a good guess the first sentence will begin with "oy."

Wordnik

From Language Hat:

A couple of years ago, lexicographer Erin McKean … gave a TED talk about the evolution of language and the shortcomings of traditional dictionaries (an hour long, well worth your while). Since then she has been working on an entirely new sort of online dictionary to address some of those shortcomings, and it's now gone live (in beta) as Wordnik (great name). In the words of Maria Popova at Brain Pickings, "A crowdsourced toolkit for tracking and recording the evolution of language as it occurs, its goal is to gather as much information about a word as possible — not its mere definition, but also in-sentence examples, semantic “neighborhoods” of related words, images, statistics about usage, and more."

Check it out.

In defense of Amazon's Mechanical Turk

I can find no better description of Amazon's Mechanical Turk than in the "description" tag at the site itself:

The online market place for work. We give businesses and developers access to an on-demand scalable workforce. Workers can work at home and make money by choosing from thousands of tasks and jobs.

This is followed by a "keywords" meta tag:

make money, make money at home, make money from home, make money on the internet, make extra money, make money …

This makes the site sound a bit like the next stop on Dave Chappelle's tour of his imagined Internet as physical place, and indeed it does have its seamy side. But I come to defend Mechanical Turk as a useful tool for linguistic research: a quick and inexpensive way to gather data and conduct simple experiments.

The return of "the boss of me"

When I jotted off a Language Log post in October 2007 about searching for early occurrences of the expression "You're not the boss of me," little did I know that I'd eventually be supplying fodder for a New York Times article about Google Book Search. In today's Times, Motoko Rich uses my 1883 antedating of "You're not the boss of me" as the anecdotal lead for a piece on how Google Book Search is being used by researchers, and the prospects for even greater access to out-of-print material now that those pesky lawsuits have been settled.

Google lawsuits settled

Rumors had been percolating for a while now, and today it was finally announced: Google has reached a settlement with U.S. authors and publishers who had filed lawsuits challenging the massive digitization project of Google Book Search. According to Google's press release, the settlement resolves lawsuits from the Authors Guild and five major publishers (McGraw-Hill, Pearson Education, Penguin, Wiley, and Simon & Schuster). Google will shell out $125 million, much of which will be used to establish the Book Rights Registry, a system for locating and representing copyright holders (a way of dealing with so-called "orphan works").

All hail the Hathi Trust

Anyone who has ever tried to use Google Book Search for serious historical research has had to grapple with its highly frustrating limitations. I've griped about the situation on several occasions (here, here, here, here). The problem is twofold: GBS is plagued by inaccurate or misleading dating, particularly for serial publications, and it does not offer full page images even for many works that are clearly in the public domain (namely, pre-1923 US works and noncopyrightable government publications). Many of us have been patiently waiting for Google to ease up on its viewing restrictions, which would simultaneously ameliorate the dating problem: if you can skim through page images, then you can determine if the year that Google gives you in the metadata is actually correct.

Help is on the way — but not from Google, exactly. Rather, several of Google's partners in its library scanning project are stepping up to the plate. Jesse Sheidlower of the Oxford English Dictionary passes on the news that the Hathi Trust has been established by the thirteen university libraries that make up the Committee on Institutional Cooperation. This includes the University of Michigan, which has contributed a major portion of Google's scanned material thus far. The Hathi Trust is not nearly as wary as Google in providing page images and fully searchable text for public domain materials. What this means is that if you find something on GBS that only gives you "snippet view," "limited preview," or "no preview available," you may be able to find the full page images by going to a CIC library site. The University of Michigan has already implemented this as part of its Mirlyn Library Catalog, with links to public domain material provided under the name "HathiTrust Digital Library." (Roy Tennant of Library Journal has also mocked up a prototype search service, but it still needs some work.)

Below the jump, an example of Hathi goodness in action.
