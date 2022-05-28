

In a comment on "Trends in French sentence length" (5/26/2022), AntC raised the issue of cross-language differences in word counts: "I was under the impression French needed ~20% more words to express the same idea as an English text." And in response, I promised to "check letter-count and word-count relationships in some English/French parallel text corpora, when I have a few minutes".

I found a few minutes yesterday, and ran (a crude version of) this check on the data in Alex Franz, Shankar Kumar & Thorsten Brants, "1993-2007 United Nations Parallel Text", LDC2013T06.

The question of what a "word" is, within and across languages, is a notoriously tangled one in general — for a peek at some of the issues, see "Ask Language Log: Comparing the vocabularies of different languages", 3/31/2008; "Comparing communication efficiency across languages", 4/4/2008; "Laden on word counting", 6/1/2010; and "Lexical limits", 12/5/2015.

But comparing phrase lengths doesn't run into as many issues as attempts to estimate vocabulary size or vocabulary knowledge do: we're counting word tokens rather than word types, and basing our counts on standard orthographic tokenization in the languages in question. Still, some issues come up even in counting orthographic tokens in languages like English and French. For example, should strings be split at apostrophes and hyphens, e.g. can't, dog's, n'est-ce, l'été ?

For this exercise, I decided to accept the tokenization done by the authors (at Google) of the cited U.N. dataset, where the goal was aligning "words" across languages in order to train a (somewhat old-fashioned) automatic translation system. For example:

I also have the honor to refer to my letter to the President of the Security Council dated 29 March 2005

J'ai également l' honneur de me référer à la lettre que j'ai adressée au Président du Conseil de sécurité en date du 29 mars 2005

0 1 0 2 3 4 6 7 11 12 13 -1 14 15 -1 18 16 20 22 23 24

0 1 3 4 5 6 6 7 -1 9 -1 8 9 10 12 13 16 -1 15 -1 17 17 18 19 20

The digit strings represent the dataset's calculated cross-language token-to-token alignment. And as you can see, the algorithm has split l'honneur, but not j'ai. In this particular case, the result is 21 English tokens compared to 25 French tokens, for a French/English ratio of 1.19, which is not far from AntC's 20%.
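The count for this example can be reproduced with a few lines of Python. This is only a sketch: it assumes the dataset's tokens are separated by single spaces, as they appear in the example above, and the helper name `count_tokens` is made up for illustration.

```python
def count_tokens(line: str) -> int:
    """Count whitespace-separated tokens in an already-tokenized line."""
    return len(line.split())


# The example pair quoted above, with the dataset's tokenization
# (note that "l'" is split off, while "J'ai" and "j'ai" are not).
english = ("I also have the honor to refer to my letter to the President "
           "of the Security Council dated 29 March 2005")
french = ("J'ai également l' honneur de me référer à la lettre que j'ai "
          "adressée au Président du Conseil de sécurité en date du 29 mars 2005")

n_en = count_tokens(english)
n_fr = count_tokens(french)

# French/English ratio, as reported in the text.
print(n_en, n_fr, round(n_fr / n_en, 2))  # 21 25 1.19
```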

On the other hand, there are examples like this:

Of this amount , $ 0.44 million related to projects that had been inactive since January 2000 .

Un montant de 440 000 dollars concernait des projets inactifs depuis janvier 2000 .

-1 -1 1 2 5 3 5 6 7 8 -1 -1 10 9 10 11 12 13

0 2 -1 5 6 6 7 8 9 13 14 15 16 17

In that case, the (human UN) translators produced 18 English tokens versus 14 French tokens, for a ratio in the opposite direction of about 1.29. (And note that the algorithmic tokenization splits punctuation out as separate tokens, in both languages.)

Some might object to the translators' choices in this case, as in others. There were also occasional more serious problems with the source documents, and occasional errors in the algorithmic division into phrases. But since the dataset has 32,815,891 English-French phrase pairs, I chose to deal with this, crudely, by eliminating phrase pairs where the English/French ratio was less than 0.3 or greater than 1.6.
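A minimal sketch of that crude filter, assuming each phrase pair has already been reduced to a pair of lengths (character counts or token counts). The names `keep_pair`, `LOW` and `HIGH` are hypothetical, and the toy pairs are invented for illustration, apart from the two examples quoted earlier.

```python
LOW, HIGH = 0.3, 1.6  # thresholds stated in the text


def keep_pair(len_en: int, len_fr: int, low: float = LOW, high: float = HIGH) -> bool:
    """Keep a pair unless its English/French length ratio is < low or > high."""
    if len_en <= 0 or len_fr <= 0:
        # Assumption: empty sides are treated as bad data and dropped.
        return False
    ratio = len_en / len_fr
    return low <= ratio <= high


# (English length, French length); the first two are the token counts
# of the examples above, the last two are made-up outliers.
pairs = [(21, 25), (18, 14), (2, 10), (17, 10)]
kept = [p for p in pairs if keep_pair(*p)]
print(kept)  # [(21, 25), (18, 14)]
```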

Comparing character counts, this leaves 32,525,755 phrase pairs, and a histogram of the result looks like this:

The distribution's shape evokes the Central Limit Theorem, and its mode is at 0.9, where I've drawn a vertical line.

(The glitch at an English/French ratio of 1 is caused by phrase pairs consisting of things like numerical document references and other cases where no real translation is involved.)

The same approach applied to the ratios of space-separated tokens yields a slightly messier but similar histogram, again with a mode at an English/French ratio of about 0.9:
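For readers who want to locate the mode of such a histogram themselves, here is a rough sketch using only the standard library. The bin width is an assumption (the post does not say what binning was used), the function name `histogram_mode` is hypothetical, and the ratios are toy values rather than data from the corpus.

```python
import math
from collections import Counter


def histogram_mode(ratios, bin_width=0.1):
    """Return the center of the most populated fixed-width bin."""
    # Assign each ratio to the bin [k * bin_width, (k + 1) * bin_width).
    counts = Counter(math.floor(r / bin_width) for r in ratios)
    best_bin, _ = max(counts.items(), key=lambda kv: kv[1])
    return (best_bin + 0.5) * bin_width


# Toy English/French ratios, invented for illustration.
toy_ratios = [0.62, 0.78, 0.84, 0.88, 0.91, 0.92, 0.93, 1.29]
print(histogram_mode(toy_ratios))  # center of the fullest bin, near 0.95
```

With tens of millions of pairs a much narrower bin would be reasonable. The coarse width here just keeps the toy example readable.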

If we eliminate all cases with ratio==1.0, the mean ratio of the remaining 27,827,988 English/French token pairs is 0.881. So at least in the world of United Nations documents and United Nations translators, English seems to use about 11.9% fewer words (really, orthographic tokens) than French does, on average, with a wide range of variation around this value. (Depending on what you think "X percent more/fewer" means, this could instead be stated from the French side: 1/0.881 ≈ 1.135, so French uses about 13.5% more tokens than English.)
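The two ways of reading the 0.881 figure can be checked with a little arithmetic. The sketch below also shows one way to compute a mean ratio while skipping pairs whose ratio is exactly 1. The function names are hypothetical, and comparing the integer counts directly (rather than testing a floating-point ratio against 1.0) is my own choice, not necessarily what was done for the post.

```python
from statistics import mean


def mean_ratio_excluding_ties(pairs):
    """Mean English/French ratio over pairs whose two counts differ."""
    return mean(n_en / n_fr for n_en, n_fr in pairs if n_en != n_fr)


def percent_fewer(ratio: float) -> float:
    """How many percent fewer tokens English uses, relative to French."""
    return (1 - ratio) * 100


def percent_more(ratio: float) -> float:
    """How many percent more tokens French uses, relative to English."""
    return (1 / ratio - 1) * 100


r = 0.881  # mean English/French token ratio reported in the text
print(round(percent_fewer(r), 1))  # 11.9
print(round(percent_more(r), 1))   # 13.5
```

Both numbers describe the same ratio. The difference is only in which language's count is taken as the 100% baseline.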

This is relevant to my speculations on "Trends in French sentence length", but not crucial, since the point there can be expressed as a within-language comparison across time.
