Corpus-based Linguistic Research: From Phonetics to Pragmatics

INTRODUCTION TO THE COURSE

Big, fast, cheap computers; ubiquitous digital networks; huge and growing archives of text and speech; good and improving algorithms for automatic analysis of text and speech: all of this creates a cornucopia of research opportunities, at every level of linguistic analysis from phonetics to pragmatics. This course will survey the history and prospects of corpus-based research on speech, language, and communication, in the context of class participation in a series of representative projects. Programming ability, though helpful, is not required.

This course will cover:

* How to find or create resources for empirical research in linguistics
* How to turn abstract issues in linguistic theory into concrete questions about linguistic data
* Problems of task definition and inter-annotator agreement
* Exploratory data analysis versus hypothesis testing
* Programs and programming: practical methods for searching, classifying, counting, and measuring
* A survey of relevant machine-learning algorithms and applications

We will explore these topics through a series of empirical research exercises, some planned in advance and some developed in response to the interests of participants.

There may be some connections to the ICPSR Summer Program in Quantitative Methods of Social Research, especially Bob Stine's Herman M. Blalock Memorial Lectures on Data Mining.

Participant Projects

During the course, you will do (or at least plan) a piece of corpus-based linguistic research, and will submit a report at the end (due Friday 7/19). At the end of each of the first three weeks (Mondays 7/1, 7/8, 7/15), you should submit one of a series of short reports, working towards the final submission. You should submit all four assignments (including the final report) via your CTOOLS dropbox for the class.

7/1 -- Briefly answer four questions:

(1) What linguistic question are you trying to answer? This could be a hypothesis that you want to support or refute, a phenomenon that you want to understand, a technique that you want to (in)validate, or a practical problem that you want to solve.

(2) What important and relevant previous work is there, if any?

(3) What (kind of) linguistic data do you need in order to answer (or at least address) the question that you've laid out in (1)? And how much of it will you need?

(4) Where will you get (record, download, collect, create...) this data? What measurements or annotations will you need? Are these already available, or will you need to create them?

7/8 -- Provide references and specific examples to argue that

(1) You can find the basic data you need, with adequate instances of the phenomena you're interested in.

(2) The needed counts, measurements or annotations already exist, or you can produce them quickly enough in adequate numbers.

(3) The results are likely to connect with the question you started with -- or some other question.

7/15 -- Submit a draft of the final report. Are there serious roadblocks, wrong assumptions, or other problems? Are there any last-minute course corrections that would fix such problems (if any)?

7/21 -- Submit the final report. And explain what you'll do next with this line of work: Write it up and publish it? Finish it or extend it? Give it up?

Office Hours / "Recitations"

This course is larger than I anticipated it would be -- more than 80 students are enrolled. So I'll schedule 12 hours per week (probably three 4-hour stretches) for small-group discussions with class participants. If everyone enrolled in the class comes to one uniformly-distributed hour, that would be something like 82/12 ≈ 7 people per hour. The distribution will not be uniform, but on the other hand not everyone will take advantage of the opportunity...

As an experiment, rather than assign people to specific times, we'll try letting class participants choose whatever times suit their current schedule. The initial plan is for the following times and places:

TUESDAY   10:00-13:00, 14:00-17:00   Mason 2455
FRIDAY    9:00-12:00, 14:00-17:00    Mason 2455

 

The End

Due to my participation in the "Language Diversity Congress" in Groningen 7/18-7/20, I'm unfortunately going to have to leave for Detroit airport before my last lecture on 7/16, so we will have to truncate the course at 7 lectures rather than 8.