Size Matters: Big Data, New Vistas in the Humanities and Social Sciences
Mark Liberman, Geoffrey Nunberg, Matthew Salganik
Vast archives of digital text, speech, and video, along with new analysis technology and inexpensive computation, are the modern equivalent of the 17th-century invention of the telescope and microscope. We can now observe social and linguistic patterns in space, time, and cultural context, on a scale many orders of magnitude greater than in the recent past, and in much greater detail than before. This transforms not just the study of speech, language, and communication but fields ranging from sociology and empirical economics to education, history, and medicine — with major implications for both scholarship and technology development.
We've got until tomorrow afternoon to figure out what we're going to talk about. Here are a few of my own current thoughts. If you're pressed for time, the slogan-sized version is "Big Data is not necessarily Big Science" and "Preserve Endangered Data".
1) The shifting spectrum of size. Or maybe this should be called "Towards the Data-Analysis Singularity". As a result of Moore's Law, along with whoever's law it is that expands accessible digital content, the whole spectrum of analytic scale is shifting rapidly. Yesterday's Borgesian Fantasy turns into today's Heroic Project; yesterday's Heroic Project turns into today's Breakfast Experiment™. Thus the first Bible concordance took thousands of monk-years to compile; today, any bright high school student with a laptop can do better in a few hours. In the 1960s, a million-word corpus was a big deal; today, … well, you get the idea. Projecting this trend into the future tells us that today's Heroic Projects, like creating the Google Ngram Viewer, will be tomorrow's undergraduate problem sets.
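Just to make the point concrete, here's roughly what the bright-high-school-student version looks like today: a keyword-in-context concordance in a couple of dozen lines of Python. (This is a minimal sketch, assuming a plain-text file such as a Project Gutenberg download; a serious concordance project would want smarter tokenization and indexing.)

    import re
    import sys
    from collections import defaultdict

    def build_concordance(path, window=5):
        """Map each word form to a list of (position, left context, right context)."""
        with open(path, encoding="utf-8") as f:
            words = re.findall(r"[A-Za-z']+", f.read().lower())
        conc = defaultdict(list)
        for i, w in enumerate(words):
            left = " ".join(words[max(0, i - window):i])
            right = " ".join(words[i + 1:i + 1 + window])
            conc[w].append((i, left, right))
        return conc

    if __name__ == "__main__":
        conc = build_concordance(sys.argv[1])
        # print the first ten hits for an (arbitrarily chosen) keyword
        for pos, left, right in conc.get("begat", [])[:10]:
            print(f"{pos:>8}  {left:>40} | begat | {right}")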
2) There's room for many a-more. Or maybe this should be "You ain't seen nothing yet". Most academic disciplines and sub-disciplines haven't really gotten on board this train yet. In my own field, phoneticians still mostly measure formant frequencies and voice-onset times by hand, even if they use computer programs rather than specialized electro-mechanical devices to do it. People who do large social surveys still mostly transcribe open-ended responses by hand and code them (also by hand) as if they were multiple-choice answers, ignoring the rest of the information in the recordings and transcripts. "Digital humanities" is still mostly a controversial gleam in a minority of humanists' eyes.
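For what it's worth, the tools needed to automate the hand measurements are mostly sitting there waiting to be used. Here's a sketch of what a machine-generated formant track looks like, using the parselmouth interface to Praat; the file name and the 10 ms sampling grid are illustrative assumptions, and a real study would add segmentation and sanity checks.

    import parselmouth  # Python interface to Praat (pip install praat-parselmouth)

    snd = parselmouth.Sound("vowel.wav")        # hypothetical recording
    formants = snd.to_formant_burg()            # Praat's standard Burg formant tracker
    t, step = 0.0, 0.01                         # sample the track every 10 ms
    while t < snd.duration:
        f1 = formants.get_value_at_time(1, t)   # F1 in Hz (NaN in unvoiced stretches)
        f2 = formants.get_value_at_time(2, t)   # F2 in Hz
        print(f"{t:5.2f}s  F1={f1:7.1f}  F2={f2:7.1f}")
        t += step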
3) It's good to be able to fail. Or maybe, "evolution needs variation and selection". The thing about Heroic Projects is that you can't do very many of them, and it's a big deal if they fail. As it gets easier to ask and answer a certain kind of empirical question, you can afford to ask more questions. As a result, more researchers with a wider range of goals and beliefs can explore a bigger space of more detailed hypotheses about a broader range of problems. This is a Good Thing, in my opinion, even if most of the explorations wind up in blind alleys. Thus the most important thing about Big Data in the humanities and social sciences, in my opinion, is that today's Big Data rapidly turns into tomorrow's No Big Deal.
4) Save Endangered Data! More and more of our lives are carried out digitally and preserved in the Shadow Universe of digital archives. But most human activity is still ephemeral; and much of the small fraction that is recorded is still in danger of vanishing into the entropic mists. Future generations will have reason to wish that we paid more attention to aspects of this problem.
I'll pick two culturally important examples at random: audiotape archives and court records.
Audiotape archives: Museums, libraries, county historical societies, radio station archives, and individual researchers' closets are full of millions of hours of audio tapes. These voices from the past will be of significant interest and value in the future — if they survive. Many are falling apart; others end up in landfills when storage space or money runs out. There are major efforts underway to digitize the world's books. We need a similar effort to digitize and preserve the world's tapes — and unlike the books, the tapes are unlikely to survive much longer unless something is done soon.
Court data: (Thanks to Jerry Goldman of oyez.org for background information.) If properly collected and archived, the activities of the American judicial system represent a massive collection of formalized social interactions, with great potential for social scientists interested in the activities of American courts and for computer scientists seeking a large, highly structured language corpus. Moreover, the hierarchical structure of the American judiciary represents an opportunity for technologists interested in modeling consequential interactions among institutions in a large system.
However, the American judiciary has been reluctant, at least in practice, to provide access to its data.
Most courts provide access via their websites to recent opinions, but few courts provide access to archival opinions. There is no consistency in the number of opinions available: some courts provide all opinions from 2005 forward, others provide only the last term's worth. There is no consistency in the means of delivering the data: some courts use RSS feeds, others offer only a list of links on a web page, and others require the user to fill out a search form to reach any data at all.
Only half of the federal circuit courts make recordings of oral arguments available, and an even smaller fraction of other courts make them available electronically. No court other than the Supreme Court appears to make an official transcript of oral arguments available electronically. Almost all of the written data is contained in PDFs, and what audio is available is of inconsistent quality and comes in a variety of formats.
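To give a sense of what "access" means even in the friendlier cases, here is a sketch of harvesting new opinions from a court that does publish an RSS feed. The feed URL is invented for illustration; for the courts that offer only a search form, there is no comparably simple recipe, which is part of the point.

    import feedparser   # pip install feedparser
    import requests

    FEED_URL = "https://example-court.gov/opinions.rss"   # hypothetical feed

    feed = feedparser.parse(FEED_URL)
    for entry in feed.entries:
        if entry.link.lower().endswith(".pdf"):            # opinions delivered as PDFs
            resp = requests.get(entry.link, timeout=30)
            fname = entry.link.rsplit("/", 1)[-1]
            with open(fname, "wb") as f:
                f.write(resp.content)
            print(f"saved {fname}: {entry.title}")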
Details aside, a sociologist or political scientist today would find it very difficult at best to assemble a complete collection of briefs, oral arguments, and opinions in cases at various levels dealing with a given topic: it would be a lot of work, and there would still be many gaps. And in 20 or 30 years, the situation (with respect to cases now and in the past) may well be worse, because much of what is available now may well have vanished.
The scale of the problem is fairly large. State courts of last resort decide about 90,000 cases a year. Intermediate federal courts of appeals decide about 60,000 cases a year. It's not clear how (or even whether) digital versions of oral-argument recordings, collections of briefs, etc., are being preserved by the various courts. It seems possible that much of this material, although now almost invariably prepared in digital form, is not being digitally archived in any effective way.
[Jiahong Yuan and I helped Jerry Goldman in an NSF-funded project to rescue about 9,000 hours of U.S. Supreme Court oral arguments from analog tapes on the shelves of the National Archives, to transcribe them, and to make them available online at oyez.org. When complete, the whole collection will be available to researchers in corpus form.]