Language Log

Breakfast experiments in THE

May 29, 2014 @ 7:19 pm · Filed by Mark Liberman under Linguistics in the news

Matthew Reisz, "Big data serves up linguistics insights", Times Higher Education 5/29/2014:

Meaningful research into linguistics can now be conducted in the time it takes to have breakfast, thanks to the “transformative” impact of “big data” on the field.

That is the view of Mark Liberman, Christopher H. Browne distinguished professor of linguistics at the University of Pennsylvania, who told a panel discussion that “datasets are no longer the exclusive preserve of the scientific hierarchy” and that “any bright undergraduate with an internet connection can access and interpret the primary data”.

The event Reisz is reporting on was the joint British Academy/Philological Society panel "Language, Linguistics and the Data Explosion", held 5/9/2014 in London. The slides for my presentation are here.

I'm in Iceland at the moment for LREC 2014, and then going on to a tone workshop at UMass, and I have some grant proposals with looming deadlines, so the next breakfast experiment is not likely to appear for a week or so.

4 Comments

Mark Stephenson said,

May 29, 2014 @ 7:58 pm

The link leads to a page with the audio of the session, but not, I think, the slides. I regret I don't have time to listen to the 1hr 30min audio, but I'd be interested in viewing the slides.

[(myl) Sorry, cut-and-paste error on my part. My slides are here.]

Big Dave said,

May 30, 2014 @ 10:44 am

Is there concern about sampling bias in these large data sets? That is, just because the sample is large, it doesn't necessarily follow that it is also representative. It's just large.

Are the sampling/mining methodologies published with the data? Certainly the professional researcher takes this all into consideration as a matter of course, but the amateur may not even realize they are passing on an error.

The last thing I want to do is to stifle exploration, growth, or creativity, but I think it is important that responsibility is also preached to the budding big data researchers.

[(myl) You mean, as opposed to the representative and unbiased data that we get from undergraduate subject pools, as tested in highly artificial laboratory experiments that prime them elaborately into a very unusual mindset?

More seriously, researchers certainly need to learn to be cautious about generalizing beyond the samples that they have, whatever the size of the sample; and to use statistical methods to estimate, as well as they can, the effects of the many correlated factors that influence the phenomena they're studying.

And one of the consequences of the "data reformation" is an enormously increased probability that someone somewhere will try to replicate and extend any interesting result, and thereby perhaps discover bias, limitations, or even fraud.]

Sybil said,

May 30, 2014 @ 2:45 pm

"And one of the consequences of the "data reformation" is an enormously increased probability that someone somewhere will try to replicate and extend any interesting result, and thereby perhaps discover bias, limitations, or even fraud."

Perish forfend! as Isaac Asimov used to say.

Nevertheless, I take your various points, and raise you a version of the commenter. Plus ça change, plus c'est la même chose, and all that.

MattF said,

June 1, 2014 @ 7:30 am

On the sampling bias question: I'll note that it can be quite a subtle problem, particularly when sampling is multi-dimensional. To see this, one need only look at the multi-generational struggle to produce a reliable random number generator.
