
Corpus-based Linguistic Research: From Phonetics to Pragmatics

Technical Foundations

A sample of the mathematical and computational topics covered by courses at this Institute:

| System | Description | Languages | Math |
|---|---|---|---|
| TaDa | "[A] software implementation of the Task Dynamic model of inter-articulator speech coordination, incorporating also a coupled-oscillator model of inter-gestural planning, and a gestural-coupling model." | Matlab/Octave | Differential equations |
|  | "With an eye towards Box's dictum (all models are wrong, but some are useful), this course asks: how can computational models be useful for understanding why phonetic and phonological change occurs? Students will study the growing and varied literature on computational and mathematical modeling of sound change that has emerged over the past decade and a half, including models of phonetic change in individuals over the lifespan, phonological change in speech communities in historical time, and lexical diffusion." | R, Python, ? | Probability theory, single-variable calculus, linear algebra |
| Dyna | "This class presents fundamental methods of computational linguistics. We will develop probabilistic models to describe what structures are likely in a language. After estimating the parameters of such models, it is possible to recover underlying structure from surface observations. We will examine algorithms to accomplish these tasks." |  | Formal language theory, stochastic grammars, probability theory |
|  | "Admissible methods that have successfully met the scientific rigor required for legal evidence combine analysis based on universal principles of linguistic structure with statistical analysis of linguistic variability." | R, Python |  |
|  | "The topics to be touched on include classification methods (Naive Bayes, the perceptron, support vector machines, boosting, decision trees, maximum entropy classifiers) and clustering (hierarchical clustering, k-means clustering, the EM algorithm, latent semantic indexing), sequential models (Hidden Markov Models, conditional random fields) and grammatical inference (probabilistic context-free grammars, distributional learning), semisupervised learning (self-training, co-training, spectral methods) and reinforcement learning." |  | Probability, linear algebra, information theory, stochastic grammars, ... |
|  | "We will focus on the Generalized Linear Model (GLM) and Generalized Linear Mixed Model (GLMM) – what they are, how to fit them, what common 'traps' to be aware of, how to interpret them, and how to report and visualize results obtained from these models. GLMs and GLMMs are a powerful tool to understand complex data, including not only whether effects are significant but also what direction and shape they have. GLMs have been used in corpus and sociolinguistics since at least the 60s. GLMMs have recently been introduced to language research through corpus- and psycholinguistics." |  | Probability, linear algebra, numerical optimization, ... |
| Praat | "This course introduces basic automation and scripting skills for linguists using Praat. The course will expand upon a basic familiarity with Praat and explore how scripting can help you automate mundane tasks, ensure consistency in your analyses, and provide implicit (and richly-detailed) methodological documentation of your research." |  |  |
|  | "The course covers in eight sessions the interaction with the Python programming environment, an introduction to programming, and an introduction to linguistically relevant text and data processing algorithms, including quantitative and statistical analyses, as well as qualitative and symbolic methods." | Python |  |
|  | "Participants will use software tools that embody the theories at hand and will examine and model data from a variety of digital corpora. The course will not cover computational phonology per se, but it will cover enough computation to give participants a good understanding of the tools they are using." | ??? | ??? |
|  | "This course is aimed at beginners in statistics and will cover (1) the theoretical foundations of statistical reasoning as well as (2) selected practical applications." | R | Probability, linear algebra, ??? |

There are lots of other computer programs/systems that are relevant to the things we've been reading and talking about, e.g.

* The Natural Language Toolkit (NLTK)
* The Unix command-line environment, including grep, sed, tr, gawk, perl, etc.
* Programming languages such as Java, C/C++, Ruby, Scala, etc.
* "Literate Programming" systems such as knitr
* Special-purpose packages like OpenFst

What do you really need to know?

Basic linear algebra:

"Today, most scientific mathematics is applied linear algebra, in whole or in part. This is true of most inferential and exploratory statistics, most data mining, most model building and testing, most analysis and synthesis of sounds and images, and so on. This is just as true for research on speech, language and communication as it is for every other area of science."

And best of all, it's easy! Really, it's just arithmetic (adding, subtracting, multiplying, dividing) and sometimes a few other operations (powers, logs, sines and cosines, ...), plus some bookkeeping. It's a lot easier than calculus, and a lot more useful.

And R and Matlab/Octave are basically linear-algebra machines. (Python can easily pretend to be one...)
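To see the "arithmetic plus bookkeeping" point concretely, here is a matrix-vector product written out as the sums of products it really is, next to the one-line NumPy version (a toy sketch with made-up numbers):

```python
import numpy as np

A = [[1.0, 2.0],
     [3.0, 4.0]]
x = [5.0, 6.0]

# "By hand": each output entry is just a sum of products.
y_by_hand = [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]

# The same computation as a single linear-algebra operation.
y_numpy = np.array(A) @ np.array(x)

print(y_by_hand)         # [17.0, 39.0]
print(y_numpy.tolist())  # [17.0, 39.0]
```

The bookkeeping (which row goes with which entry of x) is exactly what the `@` operator handles for you.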

An object lesson: The semantics and pragmatics of inner products.
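One way to make that object lesson concrete (an illustrative sketch, with invented word counts): the inner product of two word-count vectors measures their overlap, and normalizing by the vector lengths turns it into cosine similarity, a standard "how alike are these two documents?" score.

```python
import numpy as np

# Made-up word counts for two tiny "documents" over the same 4-word vocabulary.
doc1 = np.array([3.0, 0.0, 1.0, 2.0])
doc2 = np.array([2.0, 1.0, 0.0, 2.0])

# The raw inner product: mass the two documents share, dimension by dimension.
dot = doc1 @ doc2   # 3*2 + 0*1 + 1*0 + 2*2 = 10.0

# Dividing by the vector lengths gives cosine similarity, a scale-free
# measure of how much the two count vectors "point the same way".
cosine = dot / (np.linalg.norm(doc1) * np.linalg.norm(doc2))
print(dot, round(cosine, 3))
```

The same inner product, differently normalized and interpreted, shows up as correlation, projection, and matched filtering -- hence the "semantics and pragmatics".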

Basic probability and statistics: Not just for significance testing, but for exploratory data analysis and machine learning.

Most of the concepts are stated in terms of linear algebra, i.e. simple operations on lists and tables of numbers. And luckily, once someone has figured out how to implement a concept, it turns into a function call in R or Matlab/Octave or Python.
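For instance (a small sketch with made-up numbers): Pearson correlation is nothing but a centered, normalized inner product, and once that recipe is implemented it collapses into a single function call.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 6.0])

# The concept, spelled out as linear algebra:
# center each vector, then take a length-normalized inner product.
xc, yc = x - x.mean(), y - y.mean()
r_by_hand = (xc @ yc) / (np.linalg.norm(xc) * np.linalg.norm(yc))

# The same concept as a library function call.
r_builtin = np.corrcoef(x, y)[0, 1]
print(round(r_by_hand, 6), round(r_builtin, 6))
```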

Basic information theory: What's the right way to think quantitatively about concepts like ambiguity?
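The standard quantitative answer is Shannon entropy. As a sketch (with hypothetical sense frequencies): a word whose two senses are equally likely carries a full bit of uncertainty, while a word almost always used in one sense is barely ambiguous at all.

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy in bits of a probability vector (zero entries contribute 0)."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(-(nz * np.log2(nz)).sum())

# A coin-flip ambiguity: maximal uncertainty over two senses.
print(entropy_bits([0.5, 0.5]))              # 1.0
# A word used in one sense 90% of the time: much less ambiguous.
print(round(entropy_bits([0.9, 0.1]), 3))
```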

Basic machine learning: In the olden days of Classical AI, people thought that artificial intelligence was applied logic. Now most people think that artificial intelligence is applied statistics.

Wherever in that space of concepts the truth turns out to lie, there's an explosion of neat (old and new) techniques for classification, prediction, clustering, parsing, and so on: Subspace Methods, Hidden Markov Models, Conditional Random Fields, Support Vector Machines, Deep Belief Nets, ...

Again, most of these are laid out in terms of simple operations on lists and tables of numbers.
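As a minimal illustration of that point (toy data, invented for the example): a nearest-centroid classifier, where "training" is averaging the rows of a table and "prediction" is comparing distances -- nothing beyond arithmetic on arrays.

```python
import numpy as np

# Two invented classes of 2-dimensional points.
class_a = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0]])
class_b = np.array([[5.0, 5.0], [6.0, 5.0], [5.0, 6.0]])

# "Training": each class is summarized by the mean of its rows.
centroids = np.stack([class_a.mean(axis=0), class_b.mean(axis=0)])

def predict(point):
    # "Prediction": pick the class whose centroid is closest.
    dists = np.linalg.norm(centroids - point, axis=1)
    return int(dists.argmin())   # 0 = class A, 1 = class B

print(predict(np.array([1.5, 1.5])))  # 0
print(predict(np.array([5.5, 5.0])))  # 1
```

The fancier methods in the list above differ mainly in what gets averaged, compared, or optimized -- the underlying operations are still sums, products, and lookups over tables of numbers.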

How can you learn it?

You could spend several lifetimes learning this stuff. But don't.

A default recommendation:

* Learn by doing -- learn enough of a topic or system X to apply it to some aspect of a problem you're working on. It helps if you can find a collaborator who already knows X. If an effective collaborator insists on using something a bit out of the mainstream, go with the flow, but try not to get trapped in a computational or mathematical sect.

* Take it one step at a time: Try to find problems whose solution is just a bit beyond what you already know.

* The best way to understand something -- and to test your understanding -- is to play with a simple implementation. It's often a good idea to start with a fake problem that's a radically simplified form of the problem you want to solve. And if you generate fake data, your analysis had better pull out the rabbit that you carefully stuck into the hat. If it doesn't, then either the method is suspect, or you did it wrong.

* Learn a little more than you need to solve your current problem, aiming at some broader competences. Plausible goals include basic linear algebra, basic probability and statistics, and R. If you're interested in signal analysis (e.g. in audio or video processing), learn Matlab/Octave. Maybe learn Python for general-purpose programming.

* Be prepared to dig in. You're likely to learn 4 times as much in 2 hours as you do in 1 hour. But don't beat your head against a stone wall. If you're stuck, let someone who already knows the stuff show you how to back off and walk through the open door that's just a little ways to the side.

* Use on-line courses if they work for you -- or just use web search on an ad hoc basis.
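The "rabbit in the hat" check above can be sketched in a few lines (the slope, intercept, and noise level here are invented for the example): simulate data with known parameters, fit the model, and confirm the fit recovers what you put in.

```python
import numpy as np

# Stick a rabbit in the hat: fake data with a known slope and intercept.
rng = np.random.default_rng(0)
true_slope, true_intercept = 2.0, 0.5

x = np.linspace(0, 10, 200)
y = true_intercept + true_slope * x + rng.normal(scale=0.1, size=x.size)

# Ordinary least squares via the design matrix [1, x].
X = np.column_stack([np.ones_like(x), x])
intercept_hat, slope_hat = np.linalg.lstsq(X, y, rcond=None)[0]

# The estimates should land very close to the values we planted.
print(round(slope_hat, 2), round(intercept_hat, 2))
```

If the recovered values were far from 2.0 and 0.5, that would mean either the method is unsuitable for this kind of data or the implementation is wrong -- exactly the diagnostic the fake-data exercise is for.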