Language Log

Information content of text in English and Chinese

October 9, 2017 @ 10:17 am · Filed by Mark Liberman under Information technology, Orthography

Terms and concepts related to "letters" and "characters" were used at spectacularly crossed purposes in many of the comments on Victor Mair's recent post "Twitter length restrictions in English, Chinese, Japanese, and Korean". I'm not going to intervene in the tangled substance of that discussion, except to reference some long-ago LLOG posts on the relative information content of different languages/writing systems. The point of those posts was to abstract away from the varied, complex, and (here) irrelevant details of character sets, orthographic conventions, and digital encoding systems, and to look instead at the size ratios of parallel (translated) texts in compressed form. The idea is that compression schemes try precisely to get rid of those irrelevant details, leaving a better estimate of the actual information content.
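A minimal sketch of that kind of comparison in Python, assuming the parallel texts are already available as strings. The function names are hypothetical, the two sample strings (Article 1 of the Universal Declaration of Human Rights in English and Chinese) are only placeholders, and the three compressors are simply the ones in the Python standard library, not necessarily those used in the earlier posts.

```python
import bz2
import lzma
import zlib


def compressed_sizes(text: str, encoding: str = "utf-8") -> dict:
    """Return the raw and compressed byte counts of `text` in `encoding`."""
    raw = text.encode(encoding)
    return {
        "raw": len(raw),
        "zlib": len(zlib.compress(raw, 9)),
        "bz2": len(bz2.compress(raw, 9)),
        "lzma": len(lzma.compress(raw)),
    }


def size_ratios(text_a: str, text_b: str, encoding: str = "utf-8") -> dict:
    """Size of text_a divided by size of text_b, per compression method."""
    a = compressed_sizes(text_a, encoding)
    b = compressed_sizes(text_b, encoding)
    return {name: a[name] / b[name] for name in a}


# Placeholder parallel texts; a real comparison needs long documents.
english = ("All human beings are born free and equal in dignity and rights. "
           "They are endowed with reason and conscience and should act "
           "towards one another in a spirit of brotherhood.")
chinese = "人人生而自由,在尊严和权利上一律平等。他们赋有理性和良心,并应以兄弟关系的精神相对待。"

for method, ratio in size_ratios(english, chinese).items():
    print(f"{method}: English/Chinese = {ratio:.2f}")
```

On texts this short the fixed overhead of each compressor dominates, so the printed ratios say little; the comparison only becomes meaningful on long parallel documents.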

My conclusions from those exercises are two:

The differences among languages in information-theoretic efficiency appear to be quite small.
The direction of the differences is unclear — it depends on the texts chosen, the direction of translation, and the method of compression used.

See "One world, how many bytes?", 8/5/2005; "Comparing communication efficiency across languages", 4/4/2008; "Mailbag: comparative communication efficiency", 4/5/2008; "Is English more efficient than Chinese after all?", 4/28/2008.


7 Comments

~flow said,

October 10, 2017 @ 4:57 am

My suspicion matches your 'forlorn hope that good compression would wash out [differences due to encoding details and addition of repetitive junk data]'. Thanks for the data. One could construct an extreme but not altogether artificial sample by mechanically re-encoding Korean Hangeul texts using precomposed syllables versus single conjoining Jamo (and using different encodings, that is, bit-pattern rules, like UTF-8 vs UTF-16 and so on); the text would remain the same to the human reader, but I guess it would result in dramatic differences in raw and compressed text size. Now that I've said that, I'd volunteer if asked to.

Chris Button said,

October 10, 2017 @ 12:17 pm

Slightly off-topic, but when things are flipped around, there is also the question about efficiency in terms of reception as opposed to production – i.e. the notion that standard written Chinese is read faster than written English which in turn is read faster than something like written Spanish by native speakers. Apparently the difference isn't great (so of limited functional consequence), but where it is interesting is how it seems to tie into the idea of the primacy of the syllable in terms of cognition, rather than the phonetically very real but phonologically rather arbitrary notion of a consonant-vowel contrast…

xylo said,

October 10, 2017 @ 4:24 pm

@~flow
I would love to see such a sample. But because simple arithmetic connects the encoding of hangeul syllables and hangeul jamo (i.e. code point of Hangul syllable = (initial) × 588 + (medial) × 28 + (final) + 44032, as explained on Wikipedia), I think what we would be testing is the compression algorithm and not the encoding. The same goes for comparing UTF-8 vs UTF-16.
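The arithmetic can be sketched in a few lines of Python. The function names are made up for illustration; the constants 588 (= 21 medials × 28 finals) and 44032 (U+AC00) come from the formula above, and the jamo base code points U+1100, U+1161 and U+11A7 are those of the conjoining jamo block in Unicode.

```python
import unicodedata

HANGUL_BASE = 44032   # U+AC00, the first precomposed syllable
SYLLABLE_COUNT = 19 * 588  # 19 initials x 21 medials x 28 finals


def compose(initial: int, medial: int, final: int = 0) -> str:
    """Build a precomposed syllable from zero-based jamo indices."""
    return chr(initial * 588 + medial * 28 + final + HANGUL_BASE)


def decompose(syllable: str) -> tuple:
    """Invert compose(): return (initial, medial, final) indices."""
    offset = ord(syllable) - HANGUL_BASE
    if not 0 <= offset < SYLLABLE_COUNT:
        raise ValueError(f"not a precomposed Hangul syllable: {syllable!r}")
    initial, rest = divmod(offset, 588)
    medial, final = divmod(rest, 28)
    return initial, medial, final


def to_conjoining_jamo(syllable: str) -> str:
    """Map a syllable to its conjoining jamo; final index 0 means 'no final'."""
    initial, medial, final = decompose(syllable)
    jamo = chr(0x1100 + initial) + chr(0x1161 + medial)
    if final:
        jamo += chr(0x11A7 + final)
    return jamo


print(decompose("한"))  # (18, 0, 4)
# The hand-rolled mapping agrees with Unicode canonical decomposition (NFD).
print(to_conjoining_jamo("한") == unicodedata.normalize("NFD", "한"))  # True
```

Because the mapping is a fixed, invertible calculation, the two forms carry the same information; what a compressor makes of each form is a separate question.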

If it is just to compare the results to the variance of the other data, such an experiment with Korean text would be interesting nonetheless. I would add the same text in romanised form as a third artificial sample, though. But who is gonna do the job? I think we need a volunteer.

GMan003 said,

October 10, 2017 @ 5:26 pm

I feel like "compressing the text" is sort of like factoring out the difference between writing systems. Latin text compresses far better than CJK logograms, in pretty much exact proportion to the difference in information density. I'm sure I could come up with some way to encode Chinese logograms as sequences of individual strokes – and I expect that it would compress to about the same number of bytes as the others.

While such analysis is really interesting, it's often irrelevant in practice. Twitter, the original example, doesn't care about number of bytes, it cares about normalized Unicode codepoints – and CJK logograms are single codepoints. It doesn't matter if you encode it as a three-byte UTF-8 sequence or a two-byte UTF-16 character or a two-nonet UTF-9 sequence, it's considered a single codepoint. (Twitter probably doesn't accept UTF-9 but that's beside the point)
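The distinction between code points and bytes is easy to check in Python. This is only an illustration: the function name is hypothetical, and the use of NFC normalization is an assumption about what "normalized" means here, not a statement of Twitter's actual counting rules.

```python
import unicodedata


def text_sizes(text: str) -> dict:
    """Count code points and encoded bytes after NFC normalization."""
    normalized = unicodedata.normalize("NFC", text)
    return {
        "code_points": len(normalized),
        "utf8_bytes": len(normalized.encode("utf-8")),
        # "utf-16-le" avoids counting the 2-byte byte-order mark.
        "utf16_bytes": len(normalized.encode("utf-16-le")),
    }


print(text_sizes("汉字"))
# {'code_points': 2, 'utf8_bytes': 6, 'utf16_bytes': 4}
print(text_sizes("word"))
# {'code_points': 4, 'utf8_bytes': 4, 'utf16_bytes': 8}
print(text_sizes("e\u0301"))  # 'e' + combining acute accent becomes one code point
# {'code_points': 1, 'utf8_bytes': 2, 'utf16_bytes': 2}
```

The code point count stays the same whichever encoding is chosen, while the byte counts differ.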

This has relevance beyond just Twitter. Computers handle text at a character level, so all kinds of things are impacted by what they consider a character.

One practical effect is in video game speedruns – attempts to beat a given game as fast as possible. Players often import Japanese or Chinese copies of games because, when a game puts text onto the screen one character at a time, getting fewer characters means you can get through it faster. I even know of one game where the "fastest" setting is Japanese text with English audio – apparently English can verbally relay ideas faster, although I think a lot of that is cultural. I'm sure if games started being translated into Ithkuil, speedrunners would start using it, as long as it ran faster. This all has no bearing on comprehension, obviously – nobody's learning Chinese in order to play games faster, they just memorize which menu item does what, and mash through text as fast as they can.

Bathrobe said,

October 10, 2017 @ 7:42 pm

It's good to be holistic, but how much difference would it make if the Chinese were converted to pinyin?

~flow said,

October 11, 2017 @ 6:37 am

@GMan003 "I'm sure I could come up with some way to encode Chinese logograms as sequences of individual strokes": I'd challenge you on this one. The difficulty lies not in the analysis of the characters into component parts, be they single strokes or more complex compounds, but in the *reassembly* of those parts. There have been numerous attempts to do this; an early one (though maybe not the earliest) is mentioned in Carl Faulmann's "Buch der Schrift, enthaltend die Schriftzeichen und Alphabete aller Zeiten…", Wien 1880, p49ff. https://archive.org/stream/dasbuchderschri01faulgoog#page/n64/mode/2up. Like all other attempts I'm aware of, including ones done in software, this attempt failed to produce a truly workable system with typographically pleasant results.

That said, I think the (by now) widespread existence of systems that accept conjoining Jamo as input and display them 'on the fly' using fonts with pre-assembled Hangeul syllables tells us that something very similar should be feasible for CJK characters. The decomposition is then used only for encoding and transmission; for display, you'd still need ready-made fonts.

As for the CJK disassembling part, I happen to have a database for that. As a first step, I've uploaded a quick-and-dirty piece of NodeJS demo code that disassembles Hangeul syllables into their constituent parts with configurable levels to https://github.com/loveencounterflow/script-compression.

The first Article of the Human Rights Declaration gets treated as follows:

Original text (using precomposed syllables): 모든 인간은 태어날 때부터 자유로우며 그 존엄과 권리에 있어 동등하다. 인간은 천부적으로 이성과 양심을 부여받았으며 서로 형제애의 정신으로 행동하여야 한다.

Almost maximally-decomposed equivalent: ㅁㅗㄷㅡㄴ ㅇㅣㄴㄱㅏㄴㅇㅡㄴ ㅌㅐㅇㅓㄴㅏㄹ ㄸㅐㅂㅜㅌㅓ ㅈㅏㅇㅠㄹㅗㅇㅜㅁㅕ ㄱㅡ ㅈㅗㄴㅇㅓㅁㄱㅘ ㄱㅝㄴㄹㅣㅇㅔ ㅇㅣㅆㅇㅓ ㄷㅗㅇㄷㅡㅇㅎㅏㄷㅏ. ㅇㅣㄴㄱㅏㄴㅇㅡㄴ ㅊㅓㄴㅂㅜㅈㅓㄱㅇㅡㄹㅗ ㅇㅣㅅㅓㅇㄱㅘ ㅇㅑㅇㅅㅣㅁㅇㅡㄹ ㅂㅜㅇㅕㅂㅏㄷㅇㅏㅆㅇㅡㅁㅕ ㅅㅓㄹㅗ ㅎㅕㅇㅈㅔㅇㅐㅇㅢ ㅈㅓㅇㅅㅣㄴㅇㅡㄹㅗ ㅎㅐㅇㄷㅗㅇㅎㅏㅇㅕㅇㅑ ㅎㅏㄴㄷㅏ. (This uses the non-joining Unicode Jamo versions; the conjoining versions could have been used with little difference, except that the browser would render syllable blocks instead).

Some statistics: Number of characters went up from 87 to 181; number of distinct code points went down from 51 to 30. When you choose to decompose things like ㅔ and ㅆ as well, it looks more like 87 -> 194 for the absolute number of characters and 51 -> 23 for distinct code points.
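For readers who want to reproduce this kind of count, here is a rough Python sketch. It is not the NodeJS code from the repository linked above; the function names are hypothetical, and it implements only the first level of decomposition (compound jamo such as ㅘ and ㅆ are left intact), using the non-joining compatibility jamo in standard Unicode order.

```python
# Compatibility jamo in the index order used by precomposed Hangul syllables.
INITIALS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"            # 19 initials
MEDIALS = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"          # 21 medials
FINALS = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")  # 28 finals

HANGUL_BASE = 0xAC00  # 44032
HANGUL_LAST = 0xD7A3  # last precomposed syllable


def decompose_text(text: str) -> str:
    """Replace each precomposed syllable with its jamo; keep other characters."""
    out = []
    for ch in text:
        cp = ord(ch)
        if HANGUL_BASE <= cp <= HANGUL_LAST:
            initial, rest = divmod(cp - HANGUL_BASE, 588)
            medial, final = divmod(rest, 28)
            out.append(INITIALS[initial] + MEDIALS[medial] + FINALS[final])
        else:
            out.append(ch)  # spaces, punctuation, non-Hangul text
    return "".join(out)


def stats(text: str) -> tuple:
    """Return (total characters, distinct code points)."""
    return len(text), len(set(text))


sample = "모든 인간은 태어날 때부터 자유로우며"
decomposed = decompose_text(sample)
print(decomposed)
print("precomposed:", stats(sample))
print("decomposed: ", stats(decomposed))
```

Whether spaces and punctuation are included in the totals is a choice this sketch makes on its own, so its counts need not match the figures quoted above exactly.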

Of course, a single short text tells little; given a sufficiently long run, the number of distinct precomposed syllables will skyrocket into the thousands, but the number of distinct Jamo will stay below 50.

Eidolon said,

October 12, 2017 @ 5:46 pm

"It's good to be holistic, but how much difference would it make if the Chinese were converted to pinyin?"

A tremendous difference at the unit code level, as should be obvious, since pinyin unit codes are phonemes while hanzi unit codes are syllables. The comment thread of the original post was a mess because of a failure to distinguish between different measuring standards. Jim Breen and Victor Mair measured memory / bandwidth, expressed in bytes; Horwitz was talking about unit code length, expressed in number of characters; this post refers to compression ratio, which is close to Breen and Mair's bytes standard but controls for the "unnecessary" orthographic features in writing systems that inflate or deflate their byte values in common use. I put that in quotes because while those orthographic features – such as white spaces – may be useless for a computer, they could be highly significant for human comprehension.

The amount of controversy generated doesn't seem warranted. It should be trivial to see that a writing system whose character set consists of syllables and whose underlying language tends to have fewer syllables per word has a much higher density of information per *character*. This does not indicate that the writing system has a much higher density of information per *byte*, because a syllabary necessarily has a much larger character set than an alphabet and therefore must require more bytes to represent each individual character. The trade-off is clear, and the only fault of the original article was in failing to formally define what it meant by "efficiency", even though from the context it's easy to see that it meant "character count".
