Will vs. going to: a recount


Yesterday, I took a quick poll of a few small English-language texts, to see how often future-time meanings were expressed in various tensed-verb forms ("Alternative futures", 12/11/2008). My conclusion was that by far the commonest method in written American English is to use forms of the modal auxiliary will; but that in spoken American English, other alternatives are closer to even with it. However, my sample was too small to draw any very reliable quantitative conclusions.

So this morning, I'm doing another Breakfast Experiment™ to try to get better numbers, at least for some of the alternatives in the spoken language.

The data comes from the transcripts of two published conversational speech corpora, "Fisher English Training Speech Part 1 Transcripts" and "Fisher English Training Speech Part 2 Transcripts", together comprising 11,699 conversations of up to ten minutes each, recorded in 2003. The speakers come from a broad sample of regions, ages, and socioeconomic levels — full demographic details are available in the associated publications.

Since I don't have time during breakfast to read all these conversations and look for future-time references in tensed verbs, I'm going to come at the question from a different angle.

First, I'll scan the transcripts for instances of the strings "will", "won't", "'ll", "going to", and "gonna", taking appropriate precautions to avoid cases where these strings form part of other words (e.g. "willing" or "Williamsburg").
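A minimal sketch of this kind of scan, in Python. The patterns and the `count_forms` helper are illustrative assumptions, not the script actually used for this post. They rely on word boundaries so that "willing" and "Williamsburg" are not counted as hits for "will".

```python
import re
from collections import Counter

# One pattern per surface form; \b keeps "will" from matching inside
# longer words such as "willing" or "williamsburg".
PATTERNS = {
    "will": re.compile(r"\bwill\b"),
    "'ll": re.compile(r"'ll\b"),
    "won't": re.compile(r"\bwon't\b"),
    "going to": re.compile(r"\bgoing\s+to\b"),
    "gonna": re.compile(r"\bgonna\b"),
}

def count_forms(lines):
    """Count raw hits for each form across an iterable of transcript lines."""
    counts = Counter()
    for line in lines:
        text = line.lower()
        for form, pattern in PATTERNS.items():
            counts[form] += len(pattern.findall(text))
    return counts

# Made-up lines, only to exercise the patterns.
sample_lines = [
    "i'm willing to bet he will go",
    "we'll see if they won't",
    "i'm gonna say i hate going to the doctor",
    "Williamsburg is nice",
]
counts = count_forms(sample_lines)
```

Note that these are raw string hits: "going to the doctor" still counts here, which is why the next step checks a sample by hand.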

Second, I'll check a sample of the hits in order to estimate what percentage are genuine modals or semi-modals, as opposed to things like "I believe in absolute free will" or "I hate going to the doctor". (In the case of "won't", I'll also exclude things like "they won't let you smoke anywhere" and "I won't watch that stuff".)

And finally, I'll use those estimated proportions to get corrected counts.
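The sample-and-correct step might look roughly like this. It is a sketch under assumptions: `draw_sample` and `keep_rate` are hypothetical helpers, the hits and judgments are made up, and the judging itself still has to be done by a human reader.

```python
import random

def draw_sample(hits, k=100, seed=0):
    """Pick k hits at random without replacement (all of them if fewer than k)."""
    rng = random.Random(seed)  # fixed seed so the sample can be re-drawn
    return rng.sample(hits, min(k, len(hits)))

def keep_rate(judgments):
    """Fraction of hand-checked hits judged to be genuine (semi-)modals."""
    return sum(judgments) / len(judgments)

# Hypothetical hits; in practice each would be a transcript line containing the form.
hits = [f"hit {i}" for i in range(1000)]
sample = draw_sample(hits)

# Hypothetical hand judgments for that sample: 90 genuine, 10 not.
judgments = [True] * 90 + [False] * 10
rate = keep_rate(judgments)

# Corrected count = raw count scaled by the estimated proportion of genuine hits.
corrected = rate * len(hits)
```

With a sample of 100 the estimated rate is only good to a few percentage points, so the corrected counts should be read as rough.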

The raw results:

will      'll       won't     going to    gonna
15,219    32,515    4,975     29,331      22,176

This yields 15219+32515+4975 = 52,709 for the various forms of will, versus 29331+22176 = 51,507 for the various forms of going to.

I checked a random sample of 100 instances of will, and found that 3 were things like "then it's their own free will to read it".

Checking a random sample of 100 instances of going to, I found that 13 were things like

all the money's going to transportation and food
instead of going to the drive through
i'm going to a ear nose and throat specialist
i'd be interested in going to see it

And checking a random sample of 100 instances of won't, I found that 24 were things like

they won't really let you stop even for a minute
my husband won't fly
my body won't take it no more

(This is a bit unfair, since I didn't try to exclude modal but non-future instances of the other word-types, but still…)

I didn't try to exclude non-standard stuff like "and she died and oh i know you won't gonna believe this i had her stuffed".

As far as I can tell, "'ll" and "gonna" in the transcripts are essentially always the modal auxiliary or the semi-modal, respectively.

Anyhow, adding everything up with the estimated adjustment factors, we get

0.97*15219 + 0.76*4975 + 32515 = 51,058

for forms of "will", and

0.87*29331 + 22176 = 47,694

for forms of "going to". Or retaining a more reasonable number of significant digits, about 51 to 48 in favor of "will".
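For anyone who wants to re-run the arithmetic, here is a short Python sketch. The raw counts and adjustment factors are the ones given above; the dictionary layout and variable names are just one illustrative way to organize them.

```python
# Raw hit counts from the transcripts.
raw = {"will": 15219, "'ll": 32515, "won't": 4975,
       "going to": 29331, "gonna": 22176}

# Estimated fraction of hits to keep, from the samples of 100;
# "'ll" and "gonna" are treated as essentially always genuine.
keep = {"will": 0.97, "'ll": 1.0, "won't": 0.76,
        "going to": 0.87, "gonna": 1.0}

groups = {
    "will": ["will", "'ll", "won't"],
    "going to": ["going to", "gonna"],
}

corrected = {
    group: sum(raw[form] * keep[form] for form in forms)
    for group, forms in groups.items()
}

total = sum(corrected.values())
shares = {group: value / total for group, value in corrected.items()}

for group in groups:
    print(f"{group}: {corrected[group]:,.0f} ({shares[group]:.1%})")
```

Rounded to whole numbers, the two corrected totals come out to the 51,058 and 47,694 given above.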

Note that I haven't tried to calculate how many of these, in any of the lists, are actually references to future time. Some of them are (for example) predications of generic propensity like these:

only once in a while will i watch the show with my parents

seven six and he plays like six one you know what i mean you go through him shaq will go through you and through somebody else

being a nurse i'm gonna listen to the health more than i wanna listen to say uh financial news

Others are "past future" uses of going to like:

i was gonna say what's it like now

i heard originally when they were gonna make that they wanted to do all three

And of course we've completely ignored all the other ways, standard and non-standard, that English has for expressing future time reference in tensed clauses.

Still, I think it's fair to conclude that in contemporary American speech, forms of will and forms of going to fulfill this function about equally often, with will perhaps slightly in the lead.



24 Comments

  1. language hat said,

    December 12, 2008 @ 9:23 am

    Fascinating! This is the kind of thing linguists should be doing, dammit. Facts, give us facts!

  2. Chris said,

    December 12, 2008 @ 9:52 am

    Some of them are (for example) predications of generic propensity like these:

    only once in a while will i watch the show with my parents

    I don't see how this is different from the excluded uses of won't in constructions like "My car won't start" and "That dog won't hunt", aside from not being negative.

    [(myl) As I explained, I used a different criterion in the case of won't. The reason? I did it later. Why didn't I go back and re-do the others (since going to has a similar usage, in fact)? Because breakfast time was over.]

  3. Brett said,

    December 12, 2008 @ 10:29 am

    I don't have it handy, but the Longman Grammar of Spoken and Written English is full of facts like this, and might even address this particular question.

  4. Joe said,

    December 12, 2008 @ 11:30 am

    Very interesting results! Thanks for looking into this.

  5. g d gustafsson said,

    December 12, 2008 @ 2:01 pm

    This is similar to Swedish "kommer att" vs. "ska", with "ska" (similar to "shall") being more intentional and "kommer att" similar to English "going to".

  6. N said,

    December 12, 2008 @ 2:03 pm

    I have to leave a comment where it's possible:

    [(myl) Please don't. If you have something to say about a post that doesn't have open comments, the right thing to do is to send email to the author. In this case, since the author was just passing on an ad that appeared on a mailing list, you might address your remarks to the filmmakers, or to the National Science Foundation, who funded the project. ]

    As a now-graduated linguistics undergrad and a future linguistics grad student, I am outraged that it would cost $300 to obtain a copy of The Linguists. When we first heard about the movie, everyone in the psycholinguistics lab where I work was massively excited, and wondering where we can see it or get a copy of it. I would love to be able to share the movie with my family and let them see what some intrepid linguists do, but it is priced completely out of a regular person's budget. How does _that_ promote linguistics?

    [(myl) Obviously, neither Eric (who wrote the post in question) nor any of the rest of us know anything about this. Speculating wildly, I'd guess that we're looking at a phase of documentary distribution that is more like the initial theater run of a movie — the filmmakers hope to make some money from a relatively small number of purchases by schools and the like. Their goal, of course, is not to promote linguistics. I agree that the goal of promoting the field would be better served by selling the DVD at the more usual price of $15 or so; but I doubt that NSF or David Harrison have any real leverage in deciding what the pricing should be.

    For educational use, the cited price is in fact not so preposterous. If the linguistics department at Penn, for example, planned to show this movie in Linguistics 001, where we get a couple of hundred students a year, we could show this film to three years of classes for about $0.50 per viewer, which seems reasonable, especially in a world where a lot of textbooks cost $100 or $150 each.

    Continuing to speculate without any real basis for doing so, I imagine that there might at some later time be a cheaper version, and/or a release through rental outlets. ]

  7. Molly said,

    December 12, 2008 @ 3:19 pm

    I was under the impression that the present progressive/continuous (example: I'm doing my laundry tomorrow) was the most common form, statistically, used for expressing future time in English. I must have read it years ago and believed it. Are there no statistics available on this question?

  8. Lance said,

    December 12, 2008 @ 3:48 pm

    Still, I think it's fair to conclude that in contemporary American speech, forms of will and forms of going to fulfill this function [i.e., expressing future time reference in tensed clauses] about equally often, with will perhaps slightly in the lead.

    I admit I haven't been following carefully, but (as Stephen Jones suggests) it's not really the case that the various future-oriented expressions only express future time reference. So—examples based on Bridget Copley's dissertation—the present progressive can't be used for prediction or mere strong belief:

    (1) The Red Sox {will play / are going to play / are playing} the Yankees tonight.

    (2) The Red Sox {will beat / are going to beat / #are beating} the Yankees tonight.

    And, as Stephen said, "be going to" doesn't work for offers; Copley imagines two variants of a billboard on a highway approaching Madera:

    (3) We'll change your oil in Madera.
    (4) We're going to change your oil in Madera.

    The version in (3) is an offer; the version in (4) is a threat. (Sentences like (3) can be preceded by "If you like"; (4) is more suited to "like it or not".)

    The upshot is that, while the experiment isn't comparing apples and oranges, it's comparing cooking apples and eating apples. At best, it looks like we can perhaps conclude that speakers discuss future possibility and future inevitability with equal frequency.

  9. Amy said,

    December 12, 2008 @ 4:03 pm

    N–
    Try your local library. If their collection doesn't have The Linguists perhaps you can talk them into ordering it.

  10. xtopher said,

    December 12, 2008 @ 4:43 pm

    My mom's always fixing to do something.

  11. Mark Liberman said,

    December 12, 2008 @ 6:33 pm

    @Lance: There appear to be many senses and sub-senses for each of the form-classes under discussion, and not a lot of consensus about exactly what they are. Some of these senses seem to be shared (at least there are many contexts where different alternative forms could be used with little clear difference in meaning), while others are clearly not shared, again in the sense that in some contexts one alternative could be used while another could not, at least without an obvious change in meaning.

    Sense disambiguation is typically hard to do in an intersubjectively valid way, and this case seems to me to be worse than usual in that respect.

    The place to start would be a simple but complete taxonomy of senses for each of the forms involved, and also perhaps a simple but complete taxonomy of types of circumstances of use. Then a published sense-analysis or context-analysis of a published corpus would give us an idea of the relative frequency of various alternative form/meaning/context associations; and this would also make it possible to compare alternative analyses over the same body of material. A problem with a lot of research on semantics and pragmatics in this area is that it doesn't offer such a taxonomy, but instead just analyzes certain (perhaps rare) classes of cases.

  12. Mark Liberman said,

    December 12, 2008 @ 6:51 pm

    Lance: .."be going to" doesn't work for offers…

    There's something to this, but it's not true as stated. It's easy to find counterexamples on the web, e.g.

    For everyone you refer, we're going to give you 20% of their earnings in FREE POINTS! AND, as if that's not enough, we're also going to give you an additional 5% for everyone THEY sign up!

    Buy one of these collector's editions, help save the sharks and save the world, and we're going to throw in a free tee for whichever one you get.

    Lance: …the present progressive can't be used for prediction or mere strong belief…

    Again, the specific example seems persuasive, but I'm not sure about the claim in general. Isn't the apodosis of a conditional an example of "prediction or strong belief"? But that's a context where the present progressive can often be found, as in these web examples:

    Buy it anywhere else and you're getting ripped off.
    So that means, if you buy Cypress you're getting $1.75 billion of SunPower, which means you get the semiconductor business for substantially less than a billion dollars.

  13. Mark Liberman said,

    December 12, 2008 @ 9:53 pm

    Molly: I was under the impression that the present progressive/continuous (example: I'm doing my laundry tomorrow) was the most common form, statistically, used for expressing future time in English.

    It's not easy to check this directly, but here's an indirect check that makes it seem almost certain that your impression is false, at least in these transcribed conversations.

    I scanned the conversations in the two collections cited above, for instances of am|is|are or 'm|'re|'s followed by some character string ending in -ing, and found 93,755 phrases containing at least one of these.

    This underestimates the count of present progressive verb forms, since it omits e.g. cases with an interpolated adverb; but it also overestimates, since it includes things like "that's interesting" or "someone's parking space".
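A rough reconstruction of that scan in Python, to make the over- and under-counting concrete. The regular expression and the `count_phrases` helper are my guess at the kind of pattern described, not the actual one used.

```python
import re

# A full or contracted present-tense form of "be", immediately followed
# by a word ending in -ing.
PROGRESSIVE = re.compile(r"(?:\b(?:am|is|are)|'(?:m|re|s))\s+\w+ing\b")

def count_phrases(phrases):
    """Number of phrases containing at least one candidate match."""
    return sum(1 for p in phrases if PROGRESSIVE.search(p.lower()))

examples = {
    "i'm closing tomorrow": True,            # genuine present progressive
    "that's interesting": True,              # over-count: adjective
    "someone's parking space": True,         # over-count: possessive
    "they are really trying": False,         # under-count: interpolated adverb
    "this thing works": False,               # "is" inside "this" is not a hit
}
matched = {text: bool(PROGRESSIVE.search(text)) for text in examples}
```

The `examples` dictionary pairs each phrase with whether this particular pattern flags it, which is not the same as whether it is a real present progressive.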

    So I checked a random sample of 100 phrases, and found that 5 of them were things like "he's boring" or "that place is bring your own"; 15 of them were instances of "be going to"; and 80 were other present progressives. Of those 80, none (0) had a future time reference — all were either specifically tied to the current time, or were hypothetical or generic/habitual, for example

    … just [because] they have the technology doesn't mean they're willing to use it …
    … the thing is at first you are trying to think well what is this what is the catch …
    … the different marriages that i've seen um the ones that are communicating even if they're joking you know …

    I checked a second random sample of 100 phrases, and found 2 out of 100 with the kind of "future scheduled event" meaning of the example you cited:

    … i have an internship that i'm leaving for in a couple of weeks …
    … i'm closing tomorrow …

    But there were no other examples of future-time reference in that sample, other than instances of "be going to". This suggests that only about 1% of the 93K hits, or around 900, will be future-time reference for present progressive forms other than the semi-modal "be going to". Whatever the true number is, it has little or no chance to be greater than the roughly 50K forms of "will" and the roughly 50K forms of "be going to" in the same collection.
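The extrapolation can be written out as below. Pooling the two samples into 2 hits out of 200 is my reading of the "about 1%" figure, and the Wilson score upper bound is an added rough check that the comment itself does not compute.

```python
import math

def wilson_upper(successes, n, z=1.96):
    """Upper end of the Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre + half

total_hits = 93755             # phrases matched by the am|is|are ... -ing scan
future_hits, sampled = 2, 200  # 0 in the first sample of 100, 2 in the second

# Point estimate: sample proportion scaled up to all hits.
point_estimate = future_hits / sampled * total_hits

# Rough upper bound on the same quantity (z = 1.96, about 95%).
upper_estimate = wilson_upper(future_hits, sampled) * total_hits

print(f"point estimate: {point_estimate:,.0f}")
print(f"rough upper bound: {upper_estimate:,.0f}")
```

Even the upper bound stays far below the roughly 50K hits for each of "will" and "be going to", which is the comparison that matters here.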

  14. Freddy said,

    December 13, 2008 @ 1:12 am

    Not to veer too off-topic, but around these parts (St. Louis), AAVE speakers use "fixing to" which reduces to "fiittin' tuh" (tt=some kind of glottal stop) which reduces yet again to "funna": "He funna go." It is very common here.

  15. Lance said,

    December 13, 2008 @ 3:44 am

    Mark:

    The thing about your be-going-to offer examples is that, in both cases, the offerer is "going to" do something (give you earnings, throw in a T-shirt) once you've already accepted the offer (by referring someone, by buying a collector's edition). Naturally, Copley's dissertation is more nuanced than what I was expressing; her proposal for "be going to" has no trouble with these examples.

    [Specifically, if I recall it correctly, she says that "be going to X" means that all future worlds from this point are worlds in which X, which means that essentially those sentences mean "you getting earnings is inevitable and unavoidable, not from the current state of affairs, but in those worlds where you've referred someone". A standard offer, like the Madera billboard, doesn't have that "in those worlds in which" context, so the result is an assertion that oil-changing is inevitable, period.]

    As for the present progressive examples, I meant specifically "the present progressive as a future marker". I mean, obviously I can say "The Red Sox are probably losing right now", and I'm talking about a strong belief; just not one that involves the future.

    Anyway, I'm mostly saying these things to stress that I, and not Bridget, should be held responsible for any miscategorization I've made of her work.

    But as to the first comment in reply to me…I think that that's more or less exactly what I'm saying, i.e. that the nuances of sense make it hard if not impossible to get meaningful results from the kind of corpus counting you're doing. I mean, are there "sense-analysis" corpora? It doesn't seem to me that such a thing would even be feasible. A lack of taxonomy may be a "problem with a lot of research on semantics and pragmatics" in terms of getting a semantically-tagged corpus, but I think it's not so much a present failing as an inherently nigh-impossible task.

  16. Mark Liberman said,

    December 13, 2008 @ 5:39 am

    Lance: the nuances of sense make it hard if not impossible to get meaningful results from the kind of corpus counting you're doing. I mean, are there "sense-analysis" corpora? It doesn't seem to me that such a thing would even be feasible.

    Yes, there are plenty. You'll find references here and here, for a start. An especially clear summary of the state of the field ten years ago can be found in Adam Kilgarriff, "Gold Standard Datasets for Evaluating Word Sense Disambiguation Programs", 1998.

    It's true, as I said, that it's notoriously difficult to get intersubjective agreement about what the senses are and how to assign them to particular cases. This led Adam Kilgarriff to write a famous paper "I Don't Believe in Word Senses", Computers and the Humanities 31(2), 1997. From his abstract:

    An analysis is presented in which word senses are abstractions from clusters of corpus citations, in accordance with current lexicographic practice. The corpus citations, not the word senses, are the basic objects in the ontology. The corpus citations will be clustered into senses according to the purposes of whoever or whatever does the clustering. In the absence of such purposes, word senses do not exist.

    On the other hand, one of Adam's own notable contributions has been software that helps lexicographers to cluster citations for the purpose of dictionary-making.

  17. Mark Liberman said,

    December 13, 2008 @ 5:40 am

    Lance: A lack of taxonomy may be a "problem with a lot of research on semantics and pragmatics" in terms of getting a semantically-tagged corpus, but I think it's not so much a present failing as an inherently nigh-impossible task.

    I'm not sure that I understand. Are you really saying that the current methods for analysis of meaning are such that they can't, as a matter of principle, be applied to specific examples that are not chosen by the investigator? Or that if these methods are applied in an attempt to understand a series of such examples, e.g. those involving the time-reference of the present progressive in English, they don't converge on a set of descriptive types that would cover a monotonically increasing proportion of unseen examples? Or that we shouldn't expect, even in principle, that different investigators would be able to agree about how to do this?

    All these things might be true; but if they are, it's a depressing picture of the state of the field.

    It's one thing to say that word meanings don't fall naturally into a simple tree-structured taxonomy — the history of semantics since Wilkins gives us plenty of reason to believe this, as I have often argued. But that's different from saying that semantics and pragmatics, in principle, can't give us any intersubjectively valid way to describe and categorize the ways that words are used.

    And in fact, things are not so bad. With careful iterative design of the categories and training of annotators, sense-tagging of corpus examples can be done with good levels of inter-annotator agreement.

  18. Mark Liberman said,

    December 13, 2008 @ 8:44 am

    Lance: I mean, obviously I can say "The Red Sox are probably losing right now", and I'm talking about a strong belief; just not one that involves the future.

    Well, the season is over, and spring training doesn't start for another two months, so at this point any belief about the fortunes of the Red Sox is unavoidably about the future (or the past)…

    But more seriously, the examples that I cited seem clearly to state beliefs about possible future events, e.g. "if you buy Cypress you're getting $1.75 billion of SunPower", which (I think) was addressed to an audience of readers who had not previously even considered the option of buying Cypress.

  19. John Niekrasz said,

    December 13, 2008 @ 11:10 am

    Mark said: 'This underestimates the count of present progressive verb forms, since it omits e.g. cases with an interpolated adverb; but it also overestimates, since it includes things like "that's interesting" or "someone's parking space".'

    If you want a more robust way of automatically picking out and (syntactically) classifying the verb phrases for your manual semantic/pragmatic analysis, I can suggest the LT-TTT2 software. The output of the "chunking" stage in their out-of-the-box processing pipeline will be much more accurate than your regular expression approach. It will give you values for tense, voice, aspect, and modality of verb phrase chunks. And the software is a breeze to use. I use it on transcripts of conversational speech to do something quite similar to what you are doing, and it works quite well. (If you actually get around to trying it, use the -D 128 flag on the call to verbg.gr to get verbose output about the verb chunk rules that were used).

    http://www.ltg.ed.ac.uk/software/lt-ttt2

    Here's a paper which uses this software for IR on legal texts:

    http://homepages.inf.ed.ac.uk/bhachey/PUBS/ailaw-egov-preprint.pdf

    And why not add other modals/semi-modals to your list? These often express future time too, don't they?

  20. DM said,

    December 13, 2008 @ 11:21 am

    Interesting as always. Posted at 0854. It's always a source of amazement to me how long Mark Liberman has for breakfast. Is the average linguist post-grad going to have this much free time? Or will they be under constant pressure to prove their cost effectiveness?

    The results of my own breakfast experiments, involving how quickly I can eat a piece of toast and drink a cup of tea, have so far been inconclusive.

  21. Lance said,

    December 13, 2008 @ 9:31 pm

    I'll take a look at the semantic corpus links. As for the present-progressive-future, though: perhaps I wasn't clear when I said "the present progressive can't be used for prediction or mere strong belief". I didn't mean "the present progressive can't be used when embedded in predictive/belief contexts" (the sentence "I think the Red Sox are playing the Yankees tonight" is a pretty clear indication that the present progressive is usable in a sentence that, overall, expresses belief). I meant "the present progressive can't be used to itself indicate prediction or mere strong belief". So in:

    But more seriously, the examples that I cited seem clearly to state beliefs about possible future events, e.g. "if you buy Cypress you're getting $1.75 billion of SunPower", which (I think) was addressed to an audience of readers who had not previously even considered the option of buying Cypress.

    I'm still not seeing anything future-marked here. Non-past, yes, but basically it's saying: At all world/times where [you buy Cypress], [(simultaneously) you get $1.75 billion of SunPower]: the present progressive just marks that X happens at the same time as some other Y, not that X happens in the future. And the present progressive is certainly not itself expressing a belief: there's no doubt in the speaker's mind that getting SunPower happens in those circumstances. The future possibility comes from the conditional, not from the use of the present progressive.

  22. Steve Tripp said,

    December 14, 2008 @ 12:19 am

    Seven futures
    1. The ship is about to leave.
    2. The ship will leave at 7 o’clock.
    3. The ship is going to leave at 7 o’clock.
    4. The ship leaves at 7 o’clock.
    5. The ship is leaving at 7 o’clock.
    6. The ship will be leaving at 7 o’clock.
    7. The ship is to leave at 7 o’clock.
    These all have different nuances, which are clearer if we change the subject to “I.”
    1. I am about to leave. (immediate future, without delay)
    2. I will leave at 7 o’clock. (an offer or prediction; often if some condition is met)
    3. I am going to leave at 7 o’clock. (my intention is to)
    4. I leave at 7 o’clock. (according to a schedule, not under my control)
    5. I am leaving at 7 o’clock. (according to a schedule, under my control)
    6. I will be leaving at 7 o’clock. (after an interval)
    7. I am to leave at 7 o’clock. (a formal announcement)

    [(myl) See the previously cited post "The Lord which was and is" for a list of other options as well. However, at least according to the counts here, the options other than "will" and "be going to" are rather rare in current spoken American English.]

  23. Adrian said,

    December 14, 2008 @ 4:38 pm

    Here's a good example of non-future "will": http://quotation-marks.blogspot.com/2008/12/those-giant-prize-checks-are-no-good.html

  24. Merri said,

    December 19, 2008 @ 8:16 am

    But isn't there simply a difference in meaning?

    "Will" and "Shall" are auxiliaries for unlinked future.
    "Go to" is an auxiliary for linked future, the exact symmetric to linked past, aka past perfect.

    The difference between "I shall do" and "I'm going to do" is the same as between "I did" and "I have done", only it's less visible because of the auxiliaries.
