Interview with Charles Yang
Charles Yang* is perhaps best known for the development of the Tolerance Principle, a way to quantify and predict (given some input) whether a rule will become productive. He is currently Professor of Linguistics at the University of Pennsylvania, where he collaborates with various researchers around the world to test and extend the Tolerance Principle and gain greater insight into the mechanisms underlying language acquisition.
How did you get into Computational Linguistics?
I’ve always been a computer scientist; I never really took any linguistics classes, and I was interested in compilers. I was doing AI, so it was kind of natural to think about how human languages were parsed. I remember going to the library looking for stuff like this and I stumbled onto the book “Principle-Based Parsing,” which was an edited volume, and it was incomprehensible. It was fascinating, actually. I wrote [Noam] Chomsky a physical letter way back in the day when I was a kid in Ohio, and he kindly replied and said things like there’s recent work in syntax and so on. That was one of the reasons I applied to MIT to do computer science: I was attracted to the work of Bob Berwick, who was the initiator of principle-based parsing at the time. While doing that, I also ran across Mitch Marcus’s book. I don’t think I quite understood everything he was saying there, but his idea of deriving syntactic constraints from parsing strategies was very interesting. I started reading Lectures on Government and Binding, among other things. I applied to MIT, I got in. I had some marginal interests in vision; I was very attracted to Shimon Ullman’s work on the psychophysical constraints of vision. [It was] very much out of the Marrian program, as opposed to what was beginning to become common, which was this image-processing-based approach to vision that was just applied data analysis and didn’t quite interest me as much.
When I got there, I started doing this for real. I was also interested in the book that Berwick and [Amy] Weinberg wrote, which was basically about how to map a theory of syntax to a theory of parsing. That was in the aftermath of the failure of the derivational theory of complexity, which essentially assumed that all the transformations in the theory of grammar at the time would transparently map to parsing difficulties and so on. That was viewed as a failure, but they were trying to salvage what was left. That made a strong impression on me to this day. I committed, in my own work, to believing linguistic representations and computations are actually real, literally real. But, I was in a Computer Science department, I was doing various things. I was thinking about parsing in the standard psycholinguistics way. One of my first academic publications in linguistics was actually a demo at the CUNY Sentence Processing Conference way back, ’95 or ’96. I wrote what might have been the first minimalist parser. It was really a minimalist deriver. You take a sentence and try to come up with the derivations that could produce the surface form. It was not a full parser, but I was faithfully implementing the minimalist engine at the time.
From the conference and from taking several classes, it was pretty clear that people in sentence processing just weren’t interested in the computational mechanisms of parsing in the mid-to-late ’90s. So, I didn’t see a market or even a peer group interested in that, and then I was doing your standard Computer Science stuff. I was doing machine learning, including reinforcement learning, and I took a class on language acquisition, and it just dawned on me to combine some of my interests. I came up with what was, at the time, a pretty novel probabilistic model of parameter setting, and that was sort of how it got started. I realized, probably in grad school, that I was much more of a cognitive scientist and engineer even though I was in the Computer Science department. That was also the beginning of statistical NLP, where NLP was drifting further and further from cognitive science and linguistics. I realized the cognitive science of language is what I was more interested in. But, having said that, to this day I consider myself more of a computational person and am very comfortable thinking about concrete mechanics of representation and computation.
You always knew you were going to be a computer scientist?
Computer science was my thing. I started programming very early. I had very little talent for language, really. Language became interesting from the side, because there’s a computational, mechanical process there that was very natural for a computer scientist. Then I began to realize the implications of the study of language for the general issues of cognition and the mind, and also for AI.
Do you feel like you have more peers now?
Absolutely. When I was doing probabilistic approaches for language learning, about the only people that were supportive of that work were Chomsky and [Morris] Halle. Nobody else in linguistics or cognitive science. I took a lot of stink for it. Now, of course, that’s completely standard, so times have certainly changed. Although I would say, at the time you still had the sense that people in that general realm of things were doing cognitive science. There was a commitment to computation. [It] could be in a very implicit way, [like] when a syntactician gives you a theory of how to derive a structure, that’s a computational theory. [It involved] paying attention to mechanics. Now, there’s a lot more computational work [and] a lot more probabilistic work, of course, but a lot of them are cases where the computational tools have really become what I would call a curve-fitting device. Where you have a computational model but there are so many tunable free parameters that it’s no longer different from a regression analysis, right? [But] just dressed up in a computationally more sophisticated way. In that sense, that’s a little bit different from back in the day. Back in the day, regardless of methodology, you were interested in the mechanics, and I thought that was the great thing about cognitive science. But, these days I think mere curve fitting is much more dominant than before. Progress or regression, who knows? I think it’s just a changing landscape. I would not be surprised if it changed again in the future.
Now with the commercial success of NLP, it is very different from the “Other Tradition” at MIT. Is there an important distinction here that we need to make in terms of either goals or the questions being asked?
I think so, I think the goals are different and responsible people would not say otherwise. That is not to say they can’t mutually benefit from each other’s research directions, but I think I’m still enough of an engineer that I would refrain from criticizing either direction unless I can beat them. Beat them in the F-score scheme, because ultimately I think that’s how real progress is made. If you want to show your theory or your approach is valuable, you have to demonstrate its value to people who don’t necessarily hold that view. That’s how all progress is made; this is how cognitive science arose out of the ashes of behaviorism. It offered a positive solution that tackled previously intractable problems. In order for us to show the science of language would be relevant to engineering problems, we’ve gotta show it works. You need some applications. That’s a long game. You know, I’m beginning to do a little bit of that. The reason I did that is I thought I’d better figure out how language is actually learned by humans first before I can apply those ideas to engineering problems. I would certainly think these two areas have different goals, different methods. Problems only arise, of course, when one pretends to do another person’s job and doesn’t do it right.
Can you explain the Tolerance Principle?
That’s a long story. That’s when I was in grad school even, and it was sort of the heyday of the past-tense debates. What was interesting was that even though the two sides disagreed about the rule, they agreed about the exceptions. They just thought it was memorization, association of pairs of words, and that to me was highly unintuitive. Because, if I were a computer scientist and wrote a program to encode all the irregular past-tense verbs of English, would I literally make a list of 150 if-then statements? Instead, I think most programmers have this intuition: if the verb is “think,” “catch,” or “buy,” then change the form to this; if it’s “sing” or “ring,” then change it to that. I dug quite deep into the corpus of child language and found evidence for these kinds of mini-rules that organize the irregular verbs, at a time when no linguists were engaging with that kind of quantitative data. But, then of course, the problem arises that those rules don’t generalize. On analogy with “catch”–“caught”, you don’t say “hatch”–“haught”, but “-ed” does generalize. If I wanna say they’re all rules, how come some rules wake up in the morning to say “I’m productive, I can generalize,” and some rules don’t?
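As a rough sketch of that programmer’s intuition, here is how a few mini-rules might be organized, with “-ed” as the elsewhere case; the verb classes and string manipulations are invented purely for illustration and are not Yang’s actual analysis.

```python
def onset(verb):
    """Return the consonants before the first vowel letter (crude, for illustration)."""
    for i, ch in enumerate(verb):
        if ch in "aeiou":
            return verb[:i]
    return verb

# Each mini-rule pairs a small class of verbs with one structural change.
MINI_RULES = [
    ({"sing", "ring", "spring"}, lambda v: v.replace("i", "a")),  # i -> a: sang, rang, sprang
    ({"think", "bring", "buy"},  lambda v: onset(v) + "ought"),   # rime -> "ought": thought, brought, bought
]

def past_tense(verb):
    for verbs, change in MINI_RULES:
        if verb in verbs:          # exceptions are checked first ...
            return change(verb)
    return verb + "ed"             # ... and "-ed" is the elsewhere case

print(past_tense("ring"), past_tense("buy"), past_tense("walk"))  # rang bought walked
```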
I pretty much set out immediately to do that, because when I was giving talks on this work, people were on the whole persuaded by the evidence that the irregulars are done by rules, but then how does a rule become productive? The tricky thing is to come up with a metric for it. It has to be some sort of cost-benefit analysis: when you have enough evidence going for it, you generalize; otherwise you don’t. Thinking as a computer scientist, you think of the computational currencies of trade-off. It’s either space or time. Space was popular. This was at a time when something called the minimum description length principle was very much in vogue. You want to see how much you can compress the morphology into words and rules and so on. There are lots of people who have done it, but I feel like in a lot of cases you’re making guesses about the compression capacity of the human learner. You can almost cook it to compress as deep as possible. Maybe so deep that every word has a super-abstract underlying form and you just squeeze everything into the rules. That seems ad hoc. Then, if not space, it must be time, so I thought about how I would write a program to do the past tense. I would have to check a list of exceptions before applying the rule, which also happens to be how linguists treat rules and exceptions – going back to Pāṇini. You formulate this abstract data structure that has the property that the regular rule has to wait for the irregular processes to be checked off. That enabled me to formulate a metric, and I think I guessed the answer but I couldn’t prove it. I sent this off to my friend Sam Gutmann, a mathematician who specializes in probability theory. Now, he’s a very big deal in quantum computing. I asked him to prove it for me, and he did. I still have the fax he sent.
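The metric that came out of that way of thinking can be stated compactly: a rule applying to N items stays productive only if its exceptions do not exceed N divided by the natural log of N. A minimal sketch of that threshold, with vocabulary sizes and exception counts that are made up purely for illustration:

```python
from math import log

def tolerance_threshold(n_items):
    """theta_N = N / ln N: the most exceptions a rule over N items can tolerate."""
    return n_items / log(n_items)

def is_productive(n_items, n_exceptions):
    """The rule generalizes iff its exceptions stay at or below the threshold."""
    return n_exceptions <= tolerance_threshold(n_items)

# Hypothetical numbers only:
print(round(tolerance_threshold(300), 1))  # 52.6: a 300-word rule tolerates ~52 exceptions
print(is_productive(300, 40))              # True: 40 exceptions, the rule becomes productive
print(is_productive(120, 35))              # False: 35 > 120 / ln(120) ~ 25, the rule stays listed
```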
That was done when I was quite young, I think a couple years out of my PhD and that was basically the Tolerance Principle, and I knew it was, I wouldn’t say right, but definitely onto something. But I also knew it would take a long time for this to get anywhere. It took a really long time, because who’s going to believe that? You just have to make a very strong case with lots of case studies to show this one trick can really do them all, and that really took a long time.
This is actually where not being good with languages really delayed the work, because I don’t know any languages. When it comes to German, Spanish, Icelandic, and Russian, there are lots of challenging cases. You have rules that don’t generalize even though they cover a super-majority of words, you have rules that generalize even though they cover very few words, and you also have cases with gaps, where no rule can generalize at all. Those tend to be found in other languages, so I just had to wait for students to appear on the scene to do it, but it took a very long time. So much so that I thought, they say you do your best work before you’re 30, and that was probably right. I realized it really would have to be a very long time before I could publish this. I remember submitting to two journals and people were saying, “What are you talking about?” In that light, I don’t know if it’s possible now because of this publishing pressure and culture. That was a long time ago; do young people still have the luxury to sit down and work out something long-term? I mean, this pressure seems to be on people now from the first day they’re in grad school, to pump out two papers a year and then you may have a job. It’s worrying.
Are there any applications of the Tolerance Principle that have surprised you?
I would have to say the most surprising thing is numerical cognition. That was literally a chance event. I happened to be at a workshop in the Netherlands. Frankly, I don’t quite know why they invited me. The workshop was “Language and Number,” and I thought it was just some quantitative study of language, but it turned out to be literally about language, and I remember trying to come up with something to say. Some of the very best people in the business were there. It was very stimulating to hear what they were saying. I remember [Randy] Gallistel and [Rochelle] Gelman talking about counting, and then David Barner was there talking about Sue Carey’s work, and it seems like English kids need to be quite advanced with the cardinality principle to get the successor function. I remember sitting there and saying, you know, if you think about counting as a morphological problem, then I can give you a threshold for how high you need to count. Then someone told me to look at the work by Karen Fuson, who studied counting and figured out these “tipping points,” which had to be there because no one can count forever. I was very surprised that they were very close to what the equation predicted, almost literally.
Even better, there’s this work, now completed with my colleagues in Hong Kong, looking at Cantonese, which is a different language with a simpler counting structure. You can predict when the Cantonese kids will be able to count, and seeing how well that correlates with the successor function is surprising. I would have to say, that’s not something I would ever have dreamt of. Another thing that’s quite surprising, in some ways more surprising, is how well this works at all. For instance, Katie Schuler and Elissa Newport found some very good evidence for the principle with artificial language studies. The interesting thing in doing a study like that is that you’re dealing with a vocabulary that’s not huge. Maybe 10 or 20 or 100 or so. The principle, as you may remember, makes use of an approximation; that’s where log of n appears. Log of n approximates the so-called harmonic number, which comes from Zipf’s law, but everybody knows that approximation is only good when n is huge. It’s not supposed to work when n equals 9. I have no idea why that’s the case. You see n over log n play a big role in lots of natural phenomena, but I have no idea how these could conceivably be related.
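For concreteness, here is what n over log n gives at the small vocabulary sizes mentioned above; the back-of-the-envelope reading is mine, not taken from the studies themselves.

```python
from math import log

for n in (9, 20, 100):
    print(n, round(n / log(n), 1))  # 9 -> 4.1, 20 -> 6.7, 100 -> 21.7 tolerated exceptions
```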
What’s made for positive collaborations?
I have to say I’ve been extraordinarily lucky. I think every collaboration I’ve had has been a very positive experience that led to something concrete. When you’re engaging with someone else whose work is obviously different from yours, you can see where your work is relevant to them and hopefully, if they are good, they can see the other way around as well. Those are the best collaborations. Some collaborations are almost ready-made. At a workshop, I presented some formal work on how to quantify that child language is productive whereas the chimpanzee’s is not, and Susan Goldin-Meadow presented her work on home signs, which is evidence that suggests that home-signers do have a compositional system. I remember Liz Spelke saying, “Why don’t you guys work together?” and so we worked together; that was almost ready-made. In some cases it really takes quite a lot of work because you’re coming from different backgrounds. There, I would say one of the most rewarding collaborations has been with John [Trueswell] and Lila [Gleitman] on word learning stuff. That was a lot of work; it took several years for it to be put together and published. But, it really taught me how, in a collaborative relationship, you really need to understand where the person is coming from, and that process is not trivial. You need to be immersing yourself in the other person’s world to get an appreciation of that. I would definitely think that was a very educational experience for me. To this day, I think that paper we wrote together is a piece of work I really take pride in, because it was hard work but it was the best of all the possibilities.
The other thing I want to add, is sometimes when something doesn’t work, you should just say “This is not going anywhere” and just cut your losses early. Fortunately I haven’t had to deal with that, but there have been cases where I thought it wouldn’t go well, so I just didn’t go there.
Are there any books you’d suggest?
For language it’s easy to list all the classics; I think everybody knows about those. I did a lot of random reading in college and grad school. I read more than my fair share in philosophy of science, I was very interested in evolutionary biology, I was interested in history and philosophy of mathematics. These were just pure hobbies, right? But sometimes they become relevant. It’s hard to make any specific recommendations, but I think if you look at a successful field which can be quite remote from yours, you can see how those greats were able to take very complex phenomena and distill them into simple models and develop nice, formal bodies of knowledge. I think that consciously or unconsciously I treated evolutionary biology as a model science for cognitive science. I got into evolution partially because of its socio-political implications. Evolutionary psychology was huge at the time. I still remember reading stuff in the ’80s going back to the earlier debates over sociobiology, social Darwinism, and, even earlier, eugenics. It started as a political interest and became more of a scientific one, and as I learned more about it I began to see how it was almost like witnessing the beginning of physics, when you had the simple models that worked. I think people might share this experience if they read some serious technical work, but it doesn’t have to be technical evolutionary biology. It’s a good model for the study of cognition.
How has cognitive science changed?
The field is certainly a lot more specialized. The conferences were much smaller at the time, and I could read vision papers and sort of understand what was going on. That was 25 years ago, and that’s changed a lot. Unless you’re plugged into the literature, you have no idea what’s going on. That’s even true for linguistics; you just can’t keep up. When we teach Intro to Cognitive Science, it can be viewed as a first exposure to the field, presented as the instructors think is appropriate. I still feel that what makes cognitive science is computation. It’s fundamentally about mechanics, so when I teach Intro CogSci I try to give people a unifying mode of thinking. How do you get the idea of computation into the heads of freshman and sophomore students who have no background in it? It’s trial and error. I teach them finite-state machines because I feel it’s the simplest computational device possible and it’s extremely powerful. You have to try hard to find cases where it doesn’t work, but it also gives people a way of thinking about the process of computation in a purely mechanical fashion. Moreover, the formalism is flexible enough that you can easily add probabilities to the transitions. A simple machine like that, with probabilities added to it, may be the most basic computational model one can come up with for cognition, and it’s highly useful for introducing students to the idea of computation.
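A minimal sketch of that kind of classroom example, with states, words, and transition probabilities invented purely for illustration:

```python
import random

# Toy probabilistic finite-state machine: each state maps to weighted
# (word, next state) transitions. States, words, and numbers are made up.
TRANSITIONS = {
    "START": [("the", "NOUN", 1.0)],
    "NOUN":  [("cat", "VERB", 0.6), ("dog", "VERB", 0.4)],
    "VERB":  [("sleeps", "END", 0.7), ("runs", "END", 0.3)],
}

def generate(state="START"):
    """Walk from START to END, sampling one weighted transition per state."""
    words = []
    while state != "END":
        options = TRANSITIONS[state]
        word, state = random.choices(
            [(w, s) for w, s, _ in options],
            weights=[p for _, _, p in options],
        )[0]
        words.append(word)
    return " ".join(words)

print(generate())   # e.g. "the cat sleeps"
```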
Do you have any unlikely influences?
Certainly, in the work I did for my dissertation on the reinforcement learning approach to parameter setting. That didn’t come from reinforcement learning; it came from evolution. I was taking a class with two of the greatest biologists around, Dick Lewontin and Stephen Jay Gould. Two similar but also very different professors. They gave out a packet of readings for the course, which was of course only their own papers. I took it on a field trip where Gould promised to give a guided tour of the Museum of Natural History in NYC. My girlfriend at the time, Julie, was there too. We were waiting for Gould to show up, and I didn’t bring anything, so I just pulled out the reading packet. Lewontin, who’s a giant not just for his technical contributions but for his overall insight as a scientist and as an intellectual, mentions that species change is not original to Darwin. Lamarck had a view of change where the giraffe would have a short neck, get a longer neck, then have children with longer necks that would do the same. That was something that Lewontin called a transformational process, since you have entities that change in their own character over time, but Darwin’s view was different. It was basically that you have a bunch of giraffes, some have longer necks and some of them have shorter ones, and over time, because of selection, the longer ones got better represented. It’s really not the individuals that change, it’s the distribution of individuals that changes. I remember immediately writing down on that same piece of paper “Toward a variational theory of language acquisition,” which became my thesis. It goes back to the idea of triggering: the child is viewed as having one grammar, one hypothesis, and then switching to another one, a transformational process. I thought another way of looking at it would be to treat it as a variational process, where you have the two hypotheses in your mind to begin with and they’re just undergoing this kind of competition. At the time, I even cooked up some Hebbian learning competition scheme, which was interesting, but it turned out that mathematical biologists had done this work way back, with much better behavioral support, so I just adopted those models in the end. That was completely from left field, but it gave me a dissertation and got me started in this business. That was actually not from reinforcement learning, even though in the end it’s obvious, once you state it that way, that it’s reinforcement learning.
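A minimal sketch of the variational idea described here, assuming two toy grammars, a made-up 80/20 mix of input sentences, and a simple linear reward-penalty update standing in for the models mentioned above:

```python
import random

GAMMA = 0.02    # learning rate, an arbitrary choice for this sketch
p_g1 = 0.5      # weight of grammar G1; grammar G2 has weight 1 - p_g1

def parses(grammar, sentence):
    """Stand-in for a real parser: G1 handles verb-object orders, G2 object-verb."""
    return sentence.endswith("O") if grammar == "G1" else sentence.endswith("V")

# Hypothetical input: 80% verb-object sentences, 20% object-verb.
inputs = ["S V O"] * 800 + ["S O V"] * 200
random.shuffle(inputs)

for sentence in inputs:
    g = "G1" if random.random() < p_g1 else "G2"    # sample a grammar by its weight
    success = parses(g, sentence)
    reward_g1 = (g == "G1") == success              # G1 gains weight if it succeeded, or if G2 failed
    if reward_g1:
        p_g1 = p_g1 + GAMMA * (1 - p_g1)            # linear reward
    else:
        p_g1 = (1 - GAMMA) * p_g1                   # linear penalty

print(round(p_g1, 2))   # drifts toward the grammar that fits more of the input
```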
What do you see as the relationship between language and thought?
If I can avoid the label Whorfian, I think I’m becoming more and more of a Whorfian as time goes on. The number thing made an impression on me; that was clearly a case, I think, where having a simple language enables you to develop a conceptual ability earlier. I think it’s pointless to go into the debate of whether you have the ability. The other thing is what Chomsky has been saying recently (I think very reasonably), that you couldn’t have evolved all those individual components for language like theta theory or islands. It has to be one thing, and that one thing has to be Merge or something to that effect. Given the similarities between us and other species, it would have to be language that takes us to where we are. That’s not a new idea, that’s an idea that everybody has. Language is what makes us smart. Lila and I had this discussion recently. In the last book that Jerry Fodor wrote with [Zenon] Pylyshyn, there was a passage¹ that struck Lila: roughly speaking, how come only we are able to move past the perceptual circle of direct experience? All the other animals are bound by it, but we can talk about whatever. There’s a passage in that book where they said it’s about learning the rules of the language. They emphasize in brackets that it is not the language of thought, but the language of Hindi, French, and so on. Literally the grammar of a particular language. That’s music to my ears.
The reason I say this is because, I think, as in Lila’s work, a lot of things are learned from experience (observation) very slowly, because the environment is noisy. Everyone knows you can’t learn everything from that, so you have to generalize. One thing I’ve found exciting for a long time is how kids generalize. We relied on constraining the search space: syntax, semantics, mapping, linking rules, all that stuff. That was not satisfactory, but that was what we had, and it occurred to me that I have something that might be able to do the job. We can generalize from small amounts of data to discover systematic mappings. If that’s the case, then that invites the thinking that the seeds were directly based on perception. Kind of what Sue Carey refers to as Quinean bootstrapping. You have this input analyzer for the sensory data, innate or otherwise, but it can only get you this far. But not very far. Once you have language, you learn how these are put together, in a way that she didn’t quite specify, but what’s implied is a learning theory that gets you there. Maybe the generalization problem is something the Tolerance Principle can contribute to. If that’s the case, then that’s necessarily a Whorfian view, almost, so, there. This is easier said than done. In principle, for event perception, for social development, for pre-linguistic cognitive development, you wanna see how they are encoded in your particular grammar, and that might be the basis of generalizing beyond the perceptual circle. If the perceptual circle is not so different from individual to individual, from culture to culture, or even from humans to animals, then it would have to be Fodor and Pylyshyn’s later statement. I think Chomsky saying it’s only Merge forces everyone to be a Whorfian.
Any advice for people entering the field?
Maybe having fun would be the most important step. When I started out (well, I knew I was a computer scientist, so I knew I wouldn’t be jobless), I really enjoyed the freedom. I explored. I was supported by very wise advisors who never told me what to do. I wasn’t under particular pressure to publish, so I just did what I did. It’s this experience that I try to pass on to my graduate students. You look at what’s interesting, and you have to be ambitious and have some sort of ideal picture of the work you want to do, and then you gradually build towards it. One thing that reminds me of is something I read in [Richard] Feynman’s autobiography, this Feynman story of how to be a genius. You should always keep the ten most important problems in your head. You don’t need to think about them all the time, but every time you learn a new trick you do verification or testing to see if it’s gonna help you. That’s probably right: you gotta have a big idea in your mind to go after, and then how that comes about probably depends on a lot of things, including luck. I certainly think for people going into grad school, obviously, we’re not doing it for the money, so might as well try to make a splash. I think having some long-term goals in mind is probably good, rather than going for the immediate payoff. Although the culture might have changed so much that that’s probably suicidal advice.
*Disclosure: After this interview was conducted, Yang joined the interviewer’s advisory committee.
¹The passage from Minds Without Meaning (p. 136) reads “In fact, our inspiration is ethology, not epistemology. It is a constant refrain of the ethologists that every creature except humans lives in an ‘umwelt’ that travels with it and is bounded on all sides by what the creature’s sensory mechanisms can detect. Excepting us, nobody can respond to, or refer to, or remember, or even think about, anything that isn’t, or hasn’t been, within its umwelt. We’re exempted; we can refer to Genghis Khan (as it might be) because there is an intricate system of linguistic contrivances (not the Language of Thought but real languages: English, French, Urdu, and the like) that causally connects us to conspecifics within whose PCs Genghis Khan once stood. Pace Piaget and many others, language is not what made it possible for us to think; but it is, unsurprisingly, what made it possible for us to communicate our thoughts.” Note from Charles: “Except, I do think that there are cases where language enables us to think, at least in the case of symbolic number.”
Philip Taylor said,
November 12, 2020 @ 4:15 pm
Fascinating. Would I be correct in thinking that the 'PCs' in "conspecifics within whose PCs Genghis Khan once stood" are "Principal Components" ?