Comparing phrase lengths in French and English


In a comment on "Trends in French sentence length" (5/26/2022), AntC raised the issue of cross-language differences in word counts: "I was under the impression French needed ~20% more words to express the same idea as an English text." And in response, I promised to "check letter-count and word-count relationships in some English/French parallel text corpora, when I have a few minutes".

I found a few minutes yesterday, and ran (a crude version of) this check on the data in Alex Franz, Shankar Kumar & Thorsten Brants, "1993-2007 United Nations Parallel Text", LDC2013T06.

The question of what a "word" is, within and across languages, is a notoriously tangled one in general — for a peek at some of the issues, see "Ask Language Log: Comparing the vocabularies of different languages", 3/31/2008; "Comparing communication efficiency across languages", 4/4/2008; "Laden on word counting", 6/1/2010; and "Lexical limits", 12/5/2015.

But comparing phrase lengths doesn't run into as many issues as attempts to estimate vocabulary size or vocabulary knowledge do: we're counting word tokens rather than word types, and basing our counts on standard orthographic tokenization in the languages in question. Still, some issues come up even in counting orthographic tokens in languages like English and French. For example, should strings be split at apostrophes and hyphens, e.g. can't, dog's, n'est-ce, l'été ?

For this exercise, I decided to accept the tokenization done by the authors (at Google) of the cited U.N. dataset, where the goal was aligning "words" across languages in order to train a (somewhat old-fashioned) automatic translation system. For example:

I also have the honor to refer to my letter to the President of the Security Council dated 29 March 2005
J'ai également l' honneur de me référer à la lettre que j'ai adressée au Président du Conseil de sécurité en date du 29 mars 2005
0 1 0 2 3 4 6 7 11 12 13 -1 14 15 -1 18 16 20 22 23 24
0 1 3 4 5 6 6 7 -1 9 -1 8 9 10 12 13 16 -1 15 -1 17 17 18 19 20

The digit strings represent the dataset's calculated cross-language token-to-token alignment. And as you can see, the algorithm has split l'honneur, but not j'ai. In this particular case, the result is 21 English tokens compared to 25 French tokens, for a French/English ratio of 1.19, which is not far from AntC's 20%.
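The count for this example can be checked with a few lines of Python. This is only a sketch: it assumes the dataset's tokens are separated by single spaces, as they appear in the excerpt above, and the function names are made up for illustration.

```python
def token_count(line: str) -> int:
    """Count whitespace-separated tokens in an already-tokenized line."""
    return len(line.split())


def length_ratio(numerator: str, denominator: str) -> float:
    """Token count of the first line divided by that of the second."""
    return token_count(numerator) / token_count(denominator)


english = ("I also have the honor to refer to my letter to the President "
           "of the Security Council dated 29 March 2005")
french = ("J'ai également l' honneur de me référer à la lettre que j'ai "
          "adressée au Président du Conseil de sécurité en date du 29 mars 2005")

print(token_count(english), token_count(french))  # 21 25
print(round(length_ratio(french, english), 2))    # 1.19
```

Because the split is on whitespace only, "l'" counts as its own token here while "J'ai" stays whole, matching the dataset's tokenization of this pair.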

On the other hand, there are examples like this:

Of this amount , $ 0.44 million related to projects that had been inactive since January 2000 .
Un montant de 440 000 dollars concernait des projets inactifs depuis janvier 2000 .
-1 -1 1 2 5 3 5 6 7 8 -1 -1 10 9 10 11 12 13
0 2 -1 5 6 6 7 8 9 13 14 15 16 17

In that case, the (human UN) translators produced 18 English tokens versus 14 French tokens, for an English/French ratio of 1.29, in the opposite direction. (And note that the algorithmic tokenization splits punctuation out as separate tokens, in both languages…)

Some might object to the translator's choices in this case, as in others. And there were occasional more serious problems with the source documents, and occasional errors in the algorithmic division into phrases. But since the dataset has 32,815,891 English-French phrase pairs, I chose to deal with this, crudely, by eliminating phrase pairs where the English/French ratio was less than 0.3 or greater than 1.6.
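A filter of this kind might look like the following Python sketch. The function name, the treatment of empty sides, and the sample lengths are assumptions for illustration; only the 0.3 and 1.6 cutoffs come from the text.

```python
def keep_pair(english_len: int, french_len: int,
              low: float = 0.3, high: float = 1.6) -> bool:
    """Return True if the English/French length ratio lies within [low, high]."""
    if english_len <= 0 or french_len <= 0:
        return False  # assumption: pairs with an empty side are dropped
    ratio = english_len / french_len
    return low <= ratio <= high


# Hypothetical (English length, French length) pairs.
pairs = [(21, 25), (18, 14), (2, 40), (30, 10)]
kept = [p for p in pairs if keep_pair(*p)]
print(kept)  # [(21, 25), (18, 14)]
```

The lengths can be character counts or token counts; the same cutoffs apply to whichever ratio is being examined.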

Comparing character counts, this leaves 32,525,755 phrase pairs, and a histogram of the result looks like this:

The distribution's shape evokes the Central Limit Theorem, and its mode is at 0.9, where I've drawn a vertical line.

(The glitch at an English/French ratio of 1 is caused by phrase pairs consisting of things like numerical document references and other cases where no real translation is involved.)
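One way to locate the mode of such a distribution is to bin the ratios and pick the fullest bin, as in the Python sketch below. The bin width of 0.05 and the function name are assumptions for illustration; the post does not say how the histogram was binned.

```python
from collections import Counter


def histogram_mode(ratios, bin_width=0.05):
    """Return the center of the most populated bin of width bin_width."""
    # Map each ratio to the index of its nearest bin center.
    counts = Counter(round(r / bin_width) for r in ratios)
    best_bin, _ = counts.most_common(1)[0]
    # Round to suppress floating-point noise in the bin center.
    return round(best_bin * bin_width, 10)


# Hypothetical English/French ratios.
sample = [0.88, 0.9, 0.91, 0.92, 1.0, 0.7, 1.2]
print(histogram_mode(sample))  # 0.9
```

With real data, the spike of untranslated pairs at a ratio of exactly 1 could win a narrow bin, so it may be worth excluding those pairs before looking for the mode.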

The same approach applied to the ratios of space-separated tokens yields a slightly messier but similar histogram, again with a mode at an English/French ratio of about 0.9:

If we eliminate all cases with ratio==1.0, the mean token-count ratio of the remaining 27,827,988 English/French phrase pairs is 0.881. So at least in the world of United Nations documents and United Nations translators, English seems to use about 11.9% fewer words (really, orthographic tokens) than French does, on average, with a wide range of variation around this value. (Depending on which language you take as the baseline for "X percent more/fewer", this could instead be stated as 1/0.881 ≈ 1.135, i.e. French using about 13.5% more…)
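The arithmetic behind the two percentages can be sketched in Python as follows. The function names and the sample pairs are hypothetical; only the 0.881 figure comes from the text.

```python
def mean_ratio_excluding_ties(pairs):
    """Mean English/French ratio over pairs whose two counts differ."""
    ratios = [en / fr for en, fr in pairs if fr > 0 and en != fr]
    if not ratios:
        raise ValueError("no pairs left after excluding ratio == 1.0")
    return sum(ratios) / len(ratios)


def percent_fewer(ratio: float) -> float:
    """How much smaller the numerator is, with the denominator as baseline."""
    return (1 - ratio) * 100


def percent_more(ratio: float) -> float:
    """How much larger the denominator is, with the numerator as baseline."""
    return (1 / ratio - 1) * 100


print(round(percent_fewer(0.881), 1))  # 11.9
print(round(percent_more(0.881), 1))   # 13.5
```

The two numbers describe the same ratio from opposite baselines: English has about 11.9% fewer tokens than French, and French has about 13.5% more tokens than English.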

This is relevant to my speculations on "Trends in French sentence length", but not crucial, since the point there can be expressed as a within-language comparison across time.

 



10 Comments

  1. mollymooly said,

    May 28, 2022 @ 7:32 am

    It would be interesting to separate data by whether the original language was English, French, or a third UN language. I would guess a translation is inherently likely to be more prolix than an original text.

    [(myl) That's certainly true — somewhere there's a LLOG post that compares relative text sizes as a function of translation direction, but my attempts to find it are coming up empty. Unfortunately the relevant translation-direction metadata is not trivially available for the U.N. corpus used in this post.]

  2. Cervantes said,

    May 28, 2022 @ 7:39 am

    I have translated from Spanish to English, and also purchased Spanish to English and English to Spanish translation services. We normally expect the word count in Spanish to be about 10% higher, not from systematic research but from real world experience. I think there are a couple of reasons for this. English has a larger vocabulary (a double vocabulary of original English and French) so there may be one word available where Spanish needs a phrase. (It takes three words to say "toes" in Spanish.)
    While Spanish verbs don't necessarily need a subject, this is counteracted by the frequent need for a reflexive pronoun which is not needed in English. Finally, I suspect there is less tendency over time to collapse phrases into compounds. No need for the clitic "to" in Spanish but this is apparently outweighed by other factors.

  3. J.W. Brewer said,

    May 28, 2022 @ 8:02 am

    This makes me curious as to whether there is any difference in average word "length" (not measured by character count, which is an artifact of orthographical conventions, but as measured in phonemes or syllables or morae or whatever) between English and French, and if so whether it offsets this difference to at least some extent. This I guess would be an indirect way of getting at cross-linguistic differences in the average amount of information-content-per-word, since it seems plausible that if you have more information-per-word you will ceteris paribus need fewer words to convey a given amount of substantive content.

    My other curiosity, which could be addressed by similar analyses of other corpora (if they exist …) stems from the notion that UN texts are likely to be heavily concentrated in a specific style/register/genre, so it would be interesting to know if the same ratio holds for English v. French parallel texts in other styles/registers/genres.

  4. Joe said,

    May 28, 2022 @ 9:32 am

    This seems to be a result of specific grammar rules, not necessarily any difference in semantics or style.

    For example, French usually requires an article for every noun, rarely stacks up multiple nouns in a compound without a preposition to explain how they're related, and generally doesn't do ellipsis with prepositions. The title of this post might be translated as "Comparer la longueur des phrases en français et en anglais", literally "Comparing the length of (the) phrases in French and in English", English:French ratio 0.7 with no change in meaning or flourish.

  5. Thomas Lee Hutcheson said,

    May 28, 2022 @ 10:07 am

    I often compose formal/business letters in English, Google-translate them into Spanish before editing the Spanish. I constantly find the Spanish text takes up fewer column inches. Similarly surprising.

  6. Julian said,

    May 28, 2022 @ 6:41 pm

    @Joe
    And (correct me if I'm wrong) French insists on a relative clause in some places where English is happy with a participle. 'la vache qui rit' = 'the laughing cow'.

  7. phanmo said,

    May 29, 2022 @ 4:52 am

    As you mentioned, I disagree with the choices of the translator for the second example; there is no mention of the initial amount ("Of this amount") in the French sentence.

    It's hard to say without knowing the sentence that came before, but I probably would not have used a second sentence, instead I would have added ", dont 440 000 dollars qui concernaient* des projets inactifs depuis janvier 2000" to the previous sentence.

    Of course this would have increased the word count for the previous sentence but would have eliminated this one. My feeling is that, generally speaking, French sentence word counts are higher than English ones, but this might be partially mitigated by having fewer sentences overall, although this would probably only be noticeable in more complex writing.

    *Or "concernant"

  8. Pau Amma said,

    May 29, 2022 @ 7:46 am

    Of this amount , $ 0.44 million related to projects that had been inactive since January 2000 .
    Un montant de 440 000 dollars concernait des projets inactifs depuis janvier 2000

    Yeah, the English here implies that the $440K is part of a larger amount discussed previously when the French doesn't.

  9. Philip Anderson said,

    May 29, 2022 @ 3:12 pm

    @Cervantes
    I doubt that English uses fewer words just because English vocabulary is larger. While English has added many French and Latin words to an Old English stock, very many of these are doublets, with the same basic meaning. The (sometimes subtle) differences may be exploited by a writer or speaker, but are unlikely to need to be preserved in a translation.
    Of course there are examples where one English word calls for a phrase in translation, but the opposite also occurs (despite the size of English vocabulary).

  10. ohwilleke said,

    June 6, 2022 @ 2:02 pm

    A similar study (by French researchers) finds an inverse relationship between information density per syllable and speech speed. http://content.time.com/time/health/article/0,8599,2091477,00.html
