Is English more efficient than Chinese after all?


[Executive summary: Who knows?]

This follows up on a series of earlier posts about the comparative efficiency — in terms of text size — of different languages ("One world, how many bytes?", 8/5/2005; "Comparing communication efficiency across languages", 4/4/2008; "Mailbag: comparative communication efficiency", 4/5/2008). Hinrich Schütze wrote:

I'm not sure we have interacted since you taught your class at the 1991 linguistics institute in Santa Cruz — I fondly remember that class, which got me started in StatNLP.

I'm writing because I was intrigued by your posts on compression ratios of different languages.

As somebody else remarked, gzip can't really be used to judge the informativeness of a piece of text. I did the following simple experiment.

I read the first 10^9 or so characters from the xml Wikipedia dump and wrote them to a file (which I called wiki). I wrote the same characters to a second file (wikispace), but inserted a space after each character. Then I compressed the two files. Here is what I got:

1012930723 wiki
2025861446 wikispace
314377664 wiki.gz
385264415 wikispace.gz
385264415/314377664 approx 1.225

The two files contain the same information, but gzip's model does not handle this type of encoding well.

In this example we know what the generating process of the data was. In the case of Chinese and English we don't. So I think that until there is a more persuasive argument we should stick with the null hypothesis: the two texts of a Chinese-English bitext are equally informative, but the processes transforming the information into text are different in that the output of one can be more efficiently compressed by gzip than the other. I don't see how we can conclude anything about deep cultural differences.

Note that a word-based language model also would produce very different numbers for the two files.

Does this make sense or is there a flaw in this argument?

The flaw, clearly, was in *my* argument. I asserted that

modern compression techniques should be able to remove most of the obvious and simple reasons for differences in document size among translations in different languages, like different spacing or spelling conventions. If there are residual differences among languages, this either relates to redundancies that are not being modeled [e.g. marking of number and tense, or omission of pronouns] or it reflects a different sort of difference between languages and cultures [such as differing habits of explicitness].

But Hinrich's simple experiment shows that the first part of this assertion is simply false. At least, gzip compression can't entirely remove even such a simple manipulation as the insertion of a space after every letter of the original. In principle, I believe, coders like gzip, based on accumulating a "dictionary" of previously-seen strings, should be asymptotically oblivious to such manipulations; but in the case at hand, we're clearly a long way from the asymptote. (Or perhaps the principle has been lost due to practical compromises.)
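For readers who want to try a small-scale version of Hinrich's experiment, here is a minimal Python sketch. It relies on assumptions that are not part of the original experiment: it uses Python's zlib module (the same DEFLATE method that gzip uses, but not the gzip program itself), and it runs on a synthetic sample built from a made-up word list rather than on the Wikipedia dump. The function names are hypothetical, and the sizes it prints should not be expected to match the figures above.

```python
import random
import zlib


def add_spaces(data: bytes) -> bytes:
    """Return a copy of data with a space byte inserted after every byte."""
    out = bytearray(2 * len(data))
    out[0::2] = data                 # original bytes go in the even positions
    out[1::2] = b" " * len(data)     # spaces go in the odd positions
    return bytes(out)


def deflate_size(data: bytes) -> int:
    """Size in bytes after DEFLATE compression at the highest level."""
    return len(zlib.compress(data, 9))


def make_sample(n_words: int = 20000, seed: int = 0) -> bytes:
    """Build a reproducible pseudo-text from a small, made-up vocabulary."""
    rng = random.Random(seed)
    vocab = ["the", "of", "language", "text", "compress", "space",
             "byte", "model", "entropy", "corpus", "and", "in"]
    words = [rng.choice(vocab) for _ in range(n_words)]
    return " ".join(words).encode("ascii")


if __name__ == "__main__":
    original = make_sample()
    spaced = add_spaces(original)
    for name, data in (("original", original), ("spaced", spaced)):
        print(name, len(data), deflate_size(data))
    print("compressed ratio:", deflate_size(spaced) / deflate_size(original))
```

The two inputs carry the same information by construction, so any difference between the two compressed sizes reflects the compressor's model rather than the content.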

Hinrich's note also prodded me to do something that I promised in one of the earlier posts, namely to try a better compression program on some Chinese/English comparisons. A few simple experiments of this type showed that I was even more wrong than Hinrich thought.

First, I replicated Hinrich's experiment on English. I took the New York Times newswire for October of 2000 (from English Gigaword Third Edition, LDC2007T07). I created two derived versions, one by adding a space after each character of the original, as Hinrich did, and another by removing all spaces, tabs and newlines from the original.

I then compressed the three texts with gzip and with sbc, a compression program based on the Burrows-Wheeler Transform, which seems to be among the better recent text-file compressors. The results:

                         | Original   | Spaces added | Space, tab, nl removed
No compression           | 61,287,671 | 122,575,342  | 51,121,392
gzip -9                  | 21,467,564 | 26,678,868   | 19,329,166
gzip bpB (bits per byte) | 2.802      | 1.741        | 3.025
sbc -m3                  | 11,881,320 | 12,702,780   | 11,632,941
sbc bpB                  | 1.551      | 0.829        | 1.820

This replicates Hinrich's result: the spaces-added text is about 24% larger after gzip compression, and about 7% larger after sbc compression. Better compression is reducing the effect, but not eliminating it.

In the other direction, removing white space makes the original file about 17% smaller, and this difference is reduced but not eliminated by compression (10% smaller after gzip, 2.1% smaller after sbc).
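The "bpB" rows and the percentages quoted above follow from simple arithmetic on the byte counts. As a sketch (the helper names are mine, not part of any tool used here), bits per byte is eight times the compressed size divided by the uncompressed size, and the percentages are relative size changes:

```python
def bits_per_byte(compressed_bytes: int, original_bytes: int) -> float:
    """Compressed bits spent per byte of the uncompressed input."""
    return 8 * compressed_bytes / original_bytes


def percent_change(new_size: int, old_size: int) -> float:
    """Relative change from old_size to new_size, in percent."""
    return 100 * (new_size - old_size) / old_size


if __name__ == "__main__":
    # Byte counts taken from the New York Times table above.
    raw_original, raw_spaced, raw_stripped = 61_287_671, 122_575_342, 51_121_392
    gz_original, gz_spaced = 21_467_564, 26_678_868
    sbc_original, sbc_spaced = 11_881_320, 12_702_780

    print(round(bits_per_byte(gz_original, raw_original), 3))  # gzip bpB, original
    print(round(bits_per_byte(gz_spaced, raw_spaced), 3))      # gzip bpB, spaces added
    print(round(percent_change(gz_spaced, gz_original), 1))    # gzip: spaces-added vs original
    print(round(percent_change(sbc_spaced, sbc_original), 1))  # sbc: spaces-added vs original
    print(round(percent_change(raw_stripped, raw_original), 1))  # uncompressed: stripped vs original
```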

Next, I thought I'd try a recently-released Chinese/English parallel text corpus, created by translating Chinese blogs into English (Chinese Blog Parallel Text, LDC2008T06). I processed the corpus to extract just the text sentences.

               | Chinese | English   | English/Chinese ratio
No compression | 814,286 | 1,034,746 | 1.271
gzip -9        | 362,565 | 366,322   | 1.010
gzip bpB       | 3.562   | 2.832     |
sbc -m3        | 263,073 | 254,543   | 0.968
sbc bpB        | 2.585   | 1.968     |

In the originals, the English translations are about 27% larger than the (UTF-8) Chinese originals, which is similar to the ratios seen before. However, even with gzip, the difference is essentially eliminated by compression. With sbc, the compressed English is actually slightly smaller than the compressed Chinese.

So I went back and tried one of the corpora whose compressed size was discussed in my earlier post (Chinese English News Magazine Parallel Text, LDC2005T10). Again, I processed the corpus to extract only the (Big-5 encoded) Chinese or English text, eliminating formatting, alignment markers, etc. To my surprise, in this case, the English versions come out smaller under both gzip and sbc compression:

               | Chinese    | English    | English/Chinese ratio
No compression | 37,399,738 | 54,336,642 | 1.453
gzip -9        | 22,310,891 | 19,803,723 | 0.888
gzip bpB       | 4.77       | 2.916      |
sbc -m3        | 16,708,712 | 12,458,354 | 0.746
sbc bpB        | 3.57       | 1.834      |

This is the same corpus as the one called "Sinorama" in the table in my first post on this subject ("One world, how many bytes?", 8/5/2005), where the English/Chinese ratio before compression was given as 1.95, and after gzip compression as 1.19.

(Why the difference? Well, the numbers in my 2005 post reflected the results of compressing the whole file hierarchy for each language, without any processing to distinguish the text from other things; and the Chinese files were encoded as Big5 characters, meaning that even the Latin-alphabet characters in the sgml formatting codes were 16 bits each.)

My conclusions:

1. Hinrich is right — current compression techniques, from a practical point of view, reduce but don't eliminate the effects of superficial differences in orthographic practices.

2. It's a good idea to be explicit and specific about the sources of experimental numbers, so that others can replicate (or fail to replicate) the process. So what I did to get the Chinese/English numbers is specified below, for those who care.


For the Sinorama corpus (LDC2005T10), in the data/Chinese and data/English directories, I extracted the text via this /bin/sh command:

for f in *.sgm
do
  egrep '^<seg' $f | sed 's/^<seg id=[0-9]*> //; s/<.seg> *$//'
done >alltext

and then compressed (using gzip 1.3.3 with the -9 flag, and sbc 0.970r3 with the -m3 flag).

For the Chinese blog corpus (LDC2008T06), in the data/source and data/translation directories, I extracted the text via

for f in *.tdf
do
  gawk -F '\t' '{print $8}' $f
done >alltext

and then compressed as above.



16 Comments

  1. Pekka Karjalainen said,

    April 28, 2008 @ 10:45 am

    Minor spelling issue: Isn't it the Burrows-Wheeler Transform?

    [myl: Oops. Fixed now.]

  2. Jeff Berry said,

    April 28, 2008 @ 1:34 pm

    You can compare efficiency of a writing system by calculating the redundancy in that system.

    Redundancy = (Max Entropy – Actual Entropy) / Max Entropy

    find max entropy by using log n, where n is the number of graphemes in the system (for English n=27 (including space), max ent = 4.75)

    find actual entropy by finding the unigram frequency p of each grapheme, then sum – p log p for each grapheme. For the text of the Wall Street Journal from the Penn Treebank, this comes out as follows:

    grapheme | p        | -log p    | -p log p
    (space)  | 0.170038 | 2.556071  | 0.434629
    a        | 0.069675 | 3.843214  | 0.267776
    b        | 0.012724 | 6.296331  | 0.080113
    c        | 0.029773 | 5.069847  | 0.150945
    d        | 0.031845 | 4.972778  | 0.158359
    e        | 0.098468 | 3.344202  | 0.329297
    f        | 0.017829 | 5.809607  | 0.103581
    g        | 0.016908 | 5.886142  | 0.099523
    h        | 0.034021 | 4.877444  | 0.165934
    i        | 0.062508 | 3.999818  | 0.250020
    j        | 0.001773 | 9.139746  | 0.016203
    k        | 0.006419 | 7.283481  | 0.046751
    l        | 0.034082 | 4.874854  | 0.166144
    m        | 0.022562 | 5.469980  | 0.123412
    n        | 0.060598 | 4.044584  | 0.245094
    o        | 0.060929 | 4.036726  | 0.245954
    p        | 0.019223 | 5.701048  | 0.109589
    q        | 0.000898 | 10.120998 | 0.009089
    r        | 0.056578 | 4.143611  | 0.234438
    s        | 0.059687 | 4.066429  | 0.242715
    t        | 0.073351 | 3.769033  | 0.276464
    u        | 0.022772 | 5.456564  | 0.124260
    v        | 0.008375 | 6.899695  | 0.057785
    w        | 0.012001 | 6.380735  | 0.076573
    x        | 0.002336 | 8.741966  | 0.020418
    y        | 0.013965 | 6.162003  | 0.086055
    z        | 0.000662 | 10.561087 | 0.006990

    Actual Entropy = 4.12811140687
    Max Entropy = 4.75488750216
    Redundancy = 0.131817229115
    types = 27
    tokens = 474388

    The PH corpus for Mandarin Chinese:

    Actual Entropy = 9.571563
    Max Entropy = 12.187042
    Redundancy = 0.214611
    types = 4663
    tokens = 3252625

    So in this sense, English is more efficient

    [myl: But this is all beside the point. We were never interested in the redundancy of the orthographies -- the (perhaps forlorn) hope was that good compression would wash that out. (And in any case, all of the compression methods under consideration do better than unigram probabilities at removing redundancy: they are all compressing English text way better than 4.128 bits per character!) The question at issue is whether a given message content (in some language-independent sense) might normally be expressed more or less redundantly in one language rather than another (in some orthography-independent sense).]
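The unigram calculation described in this comment can be sketched in a few lines of Python. This is only an illustration of the formula as stated there (base-2 logarithms, redundancy as the relative gap between maximum and actual entropy); the function names are hypothetical, and the tokenization and case-folding behind the commenter's counts are not specified, so the sketch simply counts whatever characters it is given.

```python
import math
from collections import Counter


def unigram_entropy(text: str) -> float:
    """Entropy in bits per symbol from unigram symbol frequencies."""
    counts = Counter(text)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def max_entropy(n_types: int) -> float:
    """Entropy in bits per symbol if all n_types symbols were equally likely."""
    return math.log2(n_types)


def redundancy(actual: float, maximum: float) -> float:
    """(Max Entropy - Actual Entropy) / Max Entropy."""
    return (maximum - actual) / maximum


if __name__ == "__main__":
    # Figures quoted in the comment for the Wall Street Journal text.
    print(max_entropy(27))
    print(redundancy(4.12811140687, 4.75488750216))
```

As the reply notes, this measures only single-character frequencies, which every compressor discussed in the post improves on considerably.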

  3. Andrew Rodland said,

    April 28, 2008 @ 3:27 pm

    I would suggest using PPMd or LZMA; both of them are heavyweight "brute force"-ish compressors that compress general input as well as anything you can get for free. :)

    [myl: Both appear to have done worse on English text than sbc in the ACT compression test. The point here is not to do a general compression bake-off, or trade compressor preferences. Do you have any empirical evidence, or any theoretical argument, that we'd learn something new about the matter in question by taking the time to try two additional compression methods?]

  4. Bob Hall said,

    April 28, 2008 @ 4:24 pm

    "…the Chinese files were encoded as Big5 characters, meaning that even the Latin-alphabet characters in the sgml formatting codes were 16 bits each." Actually, Big 5 uses the same 8-bit codes for the lower 128 characters of ASCII as UTF-8 or ISO Latin-1. You can see this by setting the display of a page of English in your browser to Big 5 (with my browser set to Big 5 by default, this happens all the time). All the lower ASCII characters (letters, numbers, common punctuation) show up just fine. The main issue is with so-called "smart quotes" which are upper ASCII characters.
    Moreover, Big 5 is far more efficient for Chinese characters since they take up two bytes in Big 5, but 3 bytes in UTF-8, which has to deal with a far larger number of possible characters.
    So, I'd expect Big 5-encoded Chinese to be much smaller than UTF-8-encoded Chinese.

    [myl: Oops. Actually, I knew that, now that I think about it, which makes my mistake just that much dumber. Thanks for the correction.

    But this means that the difference between the cited ratios is again a mystery — and likely to remain one for a while, since I don't have time to investigate it.]
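The byte counts in this correction are easy to check with Python's built-in codecs. In this small sketch the two sample strings are illustrative choices of mine, not text from the corpora: markup-style ASCII takes one byte per character in both encodings, while common Chinese characters take two bytes each in Big5 and three in UTF-8.

```python
def encoded_sizes(text: str) -> dict:
    """Number of bytes needed to store text in Big5 and in UTF-8."""
    return {encoding: len(text.encode(encoding)) for encoding in ("big5", "utf-8")}


if __name__ == "__main__":
    print(encoded_sizes("<seg id=1>"))  # ASCII-only markup
    print(encoded_sizes("中文"))         # two common Chinese characters
```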

  5. Dave Costa said,

    April 28, 2008 @ 4:40 pm

    "Better compression is reducing the effect [of adding spaces], but not eliminating it."

    If the compression tools acted as you seem to think they should, it would be disastrous for computing!

    gzip is a lossless compressor, meaning that the original input file can be completely reconstructed from the compressed file. (I would assume that sbc is as well.) If two different files produced the same compressed file, how would we know which version to reconstruct when decompressing?

    Your assumption is that adding spaces adds no information. This may be true to you as a human reader of a text, but gzip operates on a binary data stream and has no way of knowing the intended interpretation of that data. I believe the expectation is that the data is primarily ASCII-encoded text, and the algorithm is optimized for that case; but it must also account for the possibility that the data is something else entirely.

    From a computing perspective, all of those spaces are information, and gzip cannot choose to discard them.

  6. Dave Costa said,

    April 28, 2008 @ 5:07 pm

    If gzip behaved as Mark and Hinrich seem to be expecting, it would be disastrous! The purpose of data compression is to be able to recover the input data. Note I say "data" not "text". When gzip is invoked on your file, it has no knowledge of the intended interpretation of the bits it contains. Its guiding principle is that it must compress the data in such a way that decompression will reproduce the input data EXACTLY.

    You seem to be expecting that compressing the files "wiki" and "wikispace" in the example should produce identical compressed files. From a computing perspective, this would be throwing away information. While under your interpretation of the data, this information is irrelevant, gzip cannot know that, and must preserve it.

    [Neither Hinrich nor I have the obviously stupid expectation attributed to us. What we expect is that an ideal lossless compression algorithm would not waste bits encoding redundant (i.e. predictable) aspects of its input. If every other byte of a file is a space, this fact can be noted (and preserved in the uncompressed form) without doubling the size of the compressed file. Similarly, an ideal compression algorithm would be able to deal with any arbitrary character-encoding scheme, without changing the size of the compressed file (other than perhaps by a fixed amount), since by hypothesis the information encoded does not change.

    Exactly the same point holds for data other than text -- different ways of encoding the color of pixels in an image, for example.

    Hinrich's point was that gzip is very far from ideal in this respect, as his little experiment shows. My experiments show that sbc is considerably better at abstracting away from such local redundancies, but still far from the ideal; as a result, the effects of character-encoding and other trivial orthographical modulations can't be ignored in a discussion of this sort.]
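To make the "noted without doubling" point concrete, here is a toy Python sketch of a lossless scheme that spends one extra byte to record whether every other byte of the input is a space. It is an illustration only, not how gzip or sbc works: the one-byte flag format and the function names are inventions for this example, and zlib stands in for the underlying compressor.

```python
import zlib

PLAIN, INTERLEAVED = b"\x00", b"\x01"  # hypothetical one-byte format flags


def compress_noting_spaces(data: bytes) -> bytes:
    """Compress data, recording an 'every other byte is a space' pattern in a flag."""
    half = len(data) // 2
    if data and len(data) % 2 == 0 and data[1::2] == b" " * half:
        # The spaces are fully predictable, so only the other bytes are stored.
        return INTERLEAVED + zlib.compress(data[0::2], 9)
    return PLAIN + zlib.compress(data, 9)


def decompress_noting_spaces(blob: bytes) -> bytes:
    """Exactly invert compress_noting_spaces."""
    body = zlib.decompress(blob[1:])
    if blob[:1] == INTERLEAVED:
        out = bytearray(2 * len(body))
        out[0::2] = body
        out[1::2] = b" " * len(body)
        return bytes(out)
    return body


if __name__ == "__main__":
    original = b"the quick brown fox jumps over the lazy dog " * 50
    spaced = b"".join(bytes([b]) + b" " for b in original)
    print(len(compress_noting_spaces(original)), len(compress_noting_spaces(spaced)))
    assert decompress_noting_spaces(compress_noting_spaces(spaced)) == spaced
```

Both inputs are reconstructed exactly, and the spaced file costs no more to store than the original. A general-purpose compressor would have to discover such regularities on its own, which is where practical coders fall short of the ideal.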

  7. Gwillim Law said,

    April 28, 2008 @ 7:20 pm

    Have you taken into account the variability of the translation process? Surely two different translators could produce English texts that were accurate translations of the Chinese blogs, but differed in length by ten or twenty percent.

    You can translate "Il n'y a pas de quoi" as "You're welcome" or "Don't mention it" and the compression ratios will be a little different. As an experiment, I took three paragraphs from a French Wikipedia article and translated them into English, twice. The first time, I stuck to a word-for-word translation as much as possible. The second time, I rephrased the sentences somewhat, so that they read more like my natural English writing. Both translations have the same degree of formality and present the same facts. Here are the statistics:

    Version | Words | Characters
    French | 307 | 1793
    English-1 | 291 | 1756
    English-2 | 260 | 1587

    The second English translation is about 10.7% shorter in words and 9.6% shorter in characters than the first. It seems to me that this kind of stylistic disparity would overshadow any difference due to the characteristics of the languages.

  8. john riemann soong said,

    April 28, 2008 @ 8:26 pm

    What if someone took the painstaking task of converting the texts to IPA? It doesn't make sense to be analysing artificial orthography when what we want is to the measuring the entropy in the sounds of natural language.

  9. john riemann soong said,

    April 28, 2008 @ 8:38 pm

    *is the measuring

    Furthermore, there are times it seems, when one might be removing informationally-salient whitespace, so any salient information contained in prosody like stress would all be kept.

    Translating the Chinese blogs into IPA wouldn't be impossible, just a bit tedious. A wiki project could even work.

  10. john riemann soong said,

    April 28, 2008 @ 9:32 pm

    Lastly, are we comparing languages or writing systems here? If only the latter, then maybe I misunderstood the aim of the idea. (Potentially far more interesting as an idea perhaps is the "efficiency" of natural spoken language.)

    [myl: Mr Soong, have you considered reading the sequence of posts that you're commenting on? A radical suggestion, I know, but believe it or not, some people do it.]

  11. Anders Ringström said,

    April 29, 2008 @ 9:09 am

    Forgetting compression algorithms, wouldn't Chinese be more efficient when writing a message for, say, a mobile phone, where physical space for what's displayed counts more than the internal representation?

  12. Nick Lamb said,

    April 29, 2008 @ 1:07 pm

    I meant to send email about this, but then Language Log experienced an unscheduled hiatus and I forgot about it.

    This post reminded me, and the new comment system gives me the opportunity to just write "off the cuff".

    Compression algorithms need to operate on a string of symbols. Choosing the minimum symbol size (1 bit) makes things very difficult, so this is rarely attempted. In a compressor with the goal of compressing 16-bit PCM audio (such as FLAC), these symbols are usually 16-bit PCM samples, or (stereo) pairs of such samples. In a general purpose algorithm like GNU zip aka deflate the most obvious choice of symbol size is the octet (8 bits, often called "a byte").

    Now this biases your analysis because you're comparing English text in ASCII (one octet corresponds exactly to the language's own symbols, the glyphs of the Latin alphabet) to Chinese in either Big5 or UTF-8, where some variable number of octets correspond to the language's own symbols. Inevitably a general purpose, "octet-oriented" compressor will do less well in the latter case. To make this fairer you might try converting both to UTF-16 (where most symbols from either system will correspond to a single 16-bit code unit) and then, to remove a further bias, add say 0x2000 to every 16-bit value in the English text, thus making the actual numerical values more similar to those in the Chinese, while admittedly making their meaning a bit opaque to a human.

    In computer systems the deflate algorithm is used because it's cheap. In places where deflate is used on data that isn't just a stream of bytes, you can usually improve things a lot by pre-processing the data. PNG for example, specifies a number of alternative pre-processing steps for any scanline, such as "replace all but the first pixel with the difference from the previous pixel". These pre-processing steps correspond to the authors' knowledge about how pixels are related to one another in meaningful images. Most good implementations use a heuristic which "guesses" an appropriate type of pre-processing for each scanline; the type of pre-processing, plus the processed scanline, are then encoded together and sent to deflate for compression. This improvement over just using gzip/deflate on raw pixel data accounts for a significant decrease in file size compared to earlier lossless file formats.

    If linguists come up with suitable pre-processing steps for specific languages, which used their higher level knowledge about the meaning of the symbols involved, I have no doubt that it would be useful in conjunction with deflate for compressing human language text, and it would probably help in your investigation.
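The UTF-16 re-encoding suggested in this comment might look roughly like the following Python sketch. It is one possible reading of the suggestion, under assumptions the comment does not spell out: little-endian byte order, text limited to characters whose shifted value still fits in 16 bits, and hypothetical function names. The shifted units are treated as raw 16-bit numbers and are not meant to be decoded as Unicode text.

```python
def to_shifted_utf16(text: str, offset: int = 0x2000) -> bytes:
    """Encode each character as a 16-bit little-endian unit with offset added."""
    units = []
    for ch in text:
        value = ord(ch) + offset
        if value > 0xFFFF:
            raise ValueError(f"character {ch!r} does not fit in 16 bits after shifting")
        units.append(value.to_bytes(2, "little"))
    return b"".join(units)


def from_shifted_utf16(data: bytes, offset: int = 0x2000) -> str:
    """Invert to_shifted_utf16."""
    return "".join(
        chr(int.from_bytes(data[i:i + 2], "little") - offset)
        for i in range(0, len(data), 2)
    )


if __name__ == "__main__":
    encoded = to_shifted_utf16("Hello, world")
    print(len(encoded), from_shifted_utf16(encoded))
```

The shifted English could then be compressed alongside UTF-16 Chinese, so that both texts present the compressor with two-byte symbols in a similar numerical range.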

  13. Ran Ari-Gur said,

    April 29, 2008 @ 8:23 pm

    The results may not be meaningful for technical reasons, but the idea and the process are still very thought-provoking. It's too bad it didn't work out: we could have empirically tested the various claims about certain languages being "better for poetry," others for philosophy, etc. :-)

    I wonder: even in an ideal world where the size of a compressed file really accurately represented the bitwise information content of the file, mightn't the difference in writing systems still have a large effect? English writing conveniently breaks speech down to the individual phoneme (albeit very inconsistently), so an AI compressor could theoretically learn and make use of rules like the "word initial phonological /fn/ is almost unheard-of" rule mentioned here a while back. Chinese writing, by contrast, is a lot less informative in this regard; the AI compressor could only learn rules at the word/syllable level and up. (Or is it that Chinese writing already incorporates the lower-level rules, since it only has logograms for phonologically possible words/syllables? Maybe Chinese writing already does part of an AI compressor's work? I can't tell.)

  14. Phil Hand said,

    April 30, 2008 @ 7:28 am

    I don't have anything valuable to contribute to the discussion, but I reckon I must have been involved in the creation of the parallel corpora you used here (or their successors – I worked on this over the last two years). The agency I did the translation for never actually told me the name of the final client, but the coincidence (corpora of blog posts and news, translation procedures) would be too much for this not to be the project I worked on. I did it for a pittance, but I don't really mind, it's always good to see my work getting some use.
    To Anders – Chinese texting seems both faster and more efficient than English texting to me these days, but then, I never did much texting in Britain, whereas I do a lot here. It could just be a practice thing. But I rarely send a "two page" text in Chinese, while my English texts often run to two or three pages.

  15. john riemann soong said,

    April 30, 2008 @ 7:20 pm

    "myl: Mr Soong, have you considered reading the sequence of posts that you’re commenting on? A radical suggestion, I know, but believe it or not, some people do it."

    You kept on talking about languages (as opposed to writing systems). I was rather misled.

    Noting that for example Chinese can be written perfectly well in xiao'erjing (a sort of Arabic script), and that other languages can be written in the Chinese writing system (as kanji), I really thought at first (and hence my comment that was posted before checking out the rest of the links) you were attempting to compare natural languages, not orthographic systems.

  16. john riemann soong said,

    April 30, 2008 @ 7:29 pm

    Furthermore, how would you define a superficial difference between writing systems, and what is a non-superficial difference?

    If an ideal test for information entropy was applied on a Mandarin text that is say, converted to xiao'erjing, shouldn't we expect similar results?

    This post and the posts it cites talks about languages, and superficial orthographic differences, which makes me think you're comparing the efficiency of natural languages, then you talk about the efficiency of writing systems, which makes me think the other way round. Do pardon me for my confusion.
