Twitter length restrictions in English, Chinese, Japanese, and Korean
Josh Horwitz has a provocative article in Quartz (9/27/17): "SAY MORE WITH LESS: In Japanese, Chinese, and Korean, 140 characters for Twitter is plenty, thank you"
The thinking here is muddled and the analysis is misplaced. There's a huge difference between "characters" in English and in Chinese. We also have to keep in mind the difference between "word" and "character", both in English and in Chinese. A more appropriate measure for comparing the two types of script would be their relative "density", the amount of memory / code space required to store and transmit comparable information in the two scripts.
Just looking at the 279 "characters" of English and the 280 characters of Chinese, the latter take up more than twice the space of the former.
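One rough way to check this kind of "density" claim is to count code points and UTF-8 bytes directly. Here is a minimal Python sketch; the two sentences are illustrative stand-ins, not the actual tweets discussed in the article:

```python
# Compare "density": code points vs. UTF-8 bytes for short sample strings.
# The sample sentences are placeholders chosen for illustration only.
english = "The quick brown fox jumps over the lazy dog."
chinese = "敏捷的棕色狐狸跳过了懒狗。"

for label, text in [("English", english), ("Chinese", chinese)]:
    print(label, len(text), "code points,", len(text.encode("utf-8")), "UTF-8 bytes")
```

Run against the actual tweet texts, the same two lines of counting give the byte figures behind the comparison above.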
Comment by Jim Unger:
All in all, I agree with your criticism.
My understanding is that Twitter (which I do not use myself) set its 140 ASCII 7-bit character limit because, at the time, that was the maximum number of characters that could be displayed on popular cell phone screens. (Bytes are 8 bits each; ASCII uses the 8th bit for data checking. IBM EBCDIC encoding used all 8 available bits.) Chinese characters require two bytes each in UNICODE, and a legible Chinese character with many strokes requires a rectangle of more pixels than any legible ASCII character, so a 140-character limit per tweet on Chinese characters is pretty meaningless. Japanese and Korean standards also use double-byte encoding and require more pixels per displayed character (in Korean, each syllable is treated as a single display character).
Just one more instance of the massive confusion that surrounds the nature of the Chinese script.
Elizabeth Yew said,
October 7, 2017 @ 7:38 am
I am reading the Hongloumeng using the Hawkes translation as a crib. I notice that 3-4 pages of the English translation are needed for each page of the original.
John Roth said,
October 7, 2017 @ 8:29 am
The issue is how much can be fit in a rigid computer format for different languages. The example shows that you can fit a lot more meaning into Twitter's format, which means that "character" almost has to mean "unicode code point."
Graham Neubig said,
October 7, 2017 @ 8:31 am
A few years back we did a bit of analysis of the "information content" of tweets in different languages: http://www.phontron.com/paper/neubig13sam.pdf
Perhaps the most interesting thing we found is that just because Japanese/Chinese/Korean tweets *can* write more, didn't necessarily mean that they *did* write more (on average).
random lurker said,
October 7, 2017 @ 8:32 am
Cell phone *screens* hardly had room for one line of text when GSM (and therefore SMS) was first being standardized. The limitation comes from the fact that SMS message length was specified as 140 bytes, which with the "GSM 7-bit alphabet" comes out to 160 characters, 140 characters with 8-bit and 70 characters with Unicode.
Your cell phone automatically determines which encoding the message is sent with. You could see this in action in real time when composing a message on any Nokia phone and selecting a special character that is not part of the GSM 7-bit alphabet – the number of available characters would drop remarkably as soon as you entered a character that was not common.
Twitter first started out with support for tweeting through SMS messages; why they chose 140 as the limit although 160 character messages were most common is unclear.
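For anyone who wants to check the arithmetic, here is a small sketch in Python; the 140-octet payload is the SMS constant described above, and the divisors are simply the bits each encoding spends per character:

```python
# A single SMS payload is 140 octets; the character limits follow from
# how many bits each encoding spends per character.
payload_bits = 140 * 8           # 1120 bits per message

print(payload_bits // 7)         # GSM 7-bit alphabet -> 160 characters
print(payload_bits // 8)         # 8-bit data         -> 140 characters
print(payload_bits // 16)        # UCS-2 ("Unicode")  -> 70 characters
```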
Elonkareon said,
October 7, 2017 @ 8:33 am
You seem to be reading an argument into the article that isn't there. There's no attempt to compare the densities or efficiencies of alphabetic vs. more logographic writing systems, only the plain and practical reality that up to this point, Chinese and Japanese users of Twitter have been able to express far more in a single tweet than English or Spanish users.
Also, the 140 character limit has nothing to do with… whatever Jim Unger is describing. It's a consequence of the space allocated for SMS messages by cell network standards–exactly 140 bytes, enough for 160 7-bit characters, 140 8-bit, or 70 16-bit (required for Japanese, Korean, Chinese). I don't think display capabilities ever had anything to do with it, and I can't imagine 140 characters ever fitting on a pre-smart phone screen. Twitter's old character limit allowed room for the @username, hence why it's only 140 instead of the full 160.
Nowadays that SMS limit is basically irrelevant due to smart phones and extended tweets, and East Asian languages were already exceeding it. The only reason the character limit still exists is resistance from users to anything longer. (There were talks about increasing it to a thousand or more.)
leoboiko said,
October 7, 2017 @ 8:37 am
> at the time, that was the maximum number of characters that could be displayed on popular cell phone screens.
This is incorrect; 140 bytes (not characters or screen real estate) was the size of one SMS message, and Twitter got its head start by partnering with SMS providers worldwide, back before mobile data was widespread. The relevant character set wasn't ASCII but GSM. If you limited yourself to the 7-bit set of GSM 03.38, you could fit 160 characters in a tweet (because 160 × 7 = 1120 bits, and 1120 / 8 = 140 bytes); I used to make a game out of that. Add an ISO-style character not in the 7-bit set, and your limit would drop to 140 8-bit characters. And if you added a kanji or other non-03.38 character, the SMS encoding would change to Unicode UCS-2, and then you'd have only 70 characters (from the classic BMP, at least). Most cellphones would show those limits instantly in the UI, as soon as you added or removed the offending characters; the limits didn't come from Twitter but from SMS, and Twitter just followed suit. Unicode combining characters, non-characters etc. would also count towards the message limit, which was in bytes.
> Chinese characters require two bytes each in UNICODE
This is also incorrect. Unicode isn't an encoding, so it doesn't deal with bytes; it doesn't make sense to speak of how many bytes "Unicode" requires per character. Unicode has a number of different compatible encodings, like UTF-8, UTF-16, -32, Punycode etc., each of which requires a different number of bytes per character. The misconception might be from a confusion of Unicode with one of its encodings, UTF-16, which does require 2 bytes for most Chinese characters; but even then, it's not 2 bytes for all of them, for the simple reason that there are more Han characters than 2^16 so it's impossible to handle them in 2 bytes. Rarer characters require 4 bytes, not 2 (meaning that software designed to handle UTF-16 has to take into account the possibility that each character may need up to 4 bytes).
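To make this concrete, a minimal Python illustration; U+4E2D and U+20000 are arbitrary examples of a common BMP hanzi and a "rarer" supplementary-plane one:

```python
# UTF-16 sizes: BMP hanzi take 2 bytes, supplementary-plane hanzi take 4
# (a surrogate pair).
common = "中"         # U+4E2D, Basic Multilingual Plane
rare = "\U00020000"   # U+20000, first character of CJK Extension B

for ch in (common, rare):
    print(f"U+{ord(ch):04X}: {len(ch.encode('utf-16-be'))} bytes in UTF-16")
```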
Screen space was never an issue for Twitter, and they disentangled themselves from the byte/character confusion quite early on. For a long time now, the 140-character limit has been linguistic (really about characters, not about bytes), intended to shape conversation by limiting each utterance to a short form. Observers have long noted that the expressive power of a tweet depends on the writing system, due to differences in the morphemes-per-character ratio; the recent change in limits is intended to accommodate this difference to an extent.
Matt said,
October 7, 2017 @ 8:58 am
If you look at the blog post from Twitter, there's an intriguing graph showing that tweets in English are much more likely to be at or near the 140-char limit than those in Japanese.
Twitter uses this graph to reassure its users that even once the limit is changed to 280 for English, there won't suddenly be a huge increase in super-long tweets—after all, the Japanese already have double-length tweets (in terms of expressive power within the character limit relative to English), and they don't abuse the privilege. It'll be interesting to see how that works out.
Matt said,
October 7, 2017 @ 9:05 am
And here's some more follow-up via one of that blog post's authors. By this measurement (which is based on automatic translation, so, grain of salt), languages like Danish and French are roughly half as "dense" as English, implying that they are currently languishing under the equivalent of a 90-character limit in English, yearning to tweet free.
Matt said,
October 7, 2017 @ 9:07 am
(Er, 70-character limit.)
John said,
October 7, 2017 @ 9:10 am
But for the purpose of "how much you can express in a tweet", character count *is* the right measure, isn't it? See https://developer.twitter.com/en/docs/basics/counting-characters
I agree that the analysis is confused, but is the conclusion wrong that you can fit more into a 140-character tweet in Chinese, Japanese, or Korean than in English?
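For the curious, a rough approximation of the counting rule in that documentation; this is a sketch only, and the real twitter-text library also has special handling (e.g. for URLs) that is ignored here:

```python
# Approximate Twitter-style character counting: normalize to NFC, then count
# code points.  Not the official library; URLs and other special cases omitted.
import unicodedata

def tweet_length(text: str) -> int:
    return len(unicodedata.normalize("NFC", text))

print(tweet_length("cafe\u0301"))      # 4: the combining accent folds into é
print(tweet_length("不到长城非好汉"))    # 7: each hanzi counts as one character
```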
Mara K said,
October 7, 2017 @ 10:16 am
As I understand it, the choice of 140 characters over 160 was made to leave some characters for the handle of the person tweeting. Is that not the case?
Rodger C said,
October 7, 2017 @ 11:07 am
@Mara K: That's buried in Elonkareon's comment.
Christian Weisgerber said,
October 7, 2017 @ 3:21 pm
What counts as a character for Korean?
Jim Breen said,
October 7, 2017 @ 4:24 pm
These days the text portion of Twitter messaging is in Unicode in its UTF-8 transfer format. Since kanji, hanzi, hangul and kana all take 3 bytes in that format and diacritic-free Latin alphabetics take 1 byte, the languages which seem to be more efficient at the character level actually use more bandwidth.
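A quick check of those byte counts in Python (the sample characters are arbitrary):

```python
# UTF-8 bytes per character for the scripts mentioned above.
samples = {"Latin": "a", "kana": "か", "kanji/hanzi": "漢", "hangul": "한"}
for name, ch in samples.items():
    print(f"{name}: U+{ord(ch):04X} -> {len(ch.encode('utf-8'))} bytes")
# Latin: 1 byte; kana, kanji/hanzi, and hangul: 3 bytes each.
```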
Victor Mair said,
October 7, 2017 @ 4:47 pm
@Jim Breen
Hallelujah!
I was waiting for an informed commenter to say precisely what you did.
~flow said,
October 7, 2017 @ 5:37 pm
I think overall the density of muddledness and misplacedness in Mr Unger's comment is appreciably higher than that in Mr Horwitz's piece.
TL;DR: Below I will point out problematic points in Mr Unger's statement, and argue in the second part that Mr Horwitz's article is reasonable, well informed, and linguistically and mathematically sound.
Mr. Unger: "My understanding is that Twitter (which I do not use myself) set its 140 ASCII 7-bit character limit because, at the time, that was the maximum number of characters that could be displayed on popular cell phone screens."
—refuted above
Mr. Unger: "(Bytes are 8 bits each; ASCII uses the 8th bit for data checking. IBM EBCDIC encoding used all 8 available bits.)"
—It's true that the (now) so-called US ASCII encoding standard—created in the 1960s—did not use the 8th (i.e. highest) bit of each byte for encoding written signs, a.k.a. 'characters' and/or (depending on the usage of a particular encoding standard) 'codepoints', but for 'data checking' (more properly, 'error checking'). More specifically, the 8th bit of each byte was to be set to 0 or 1 depending on the count of 0s and 1s in the remaining 7 bits; this is called parity checking and was thought to be a suitable measure to counteract message corruption in flaky networks and on failing magnetic tape. Technology quickly found better means to ensure flawless transmission and storage, and the 8th bit was thus freed from that task (I'd guess during the 70s, maybe 80s). Crucially, what is missing here (and what @leoboiko kindly shared with us) is that the absence of US ASCII characters with the 8th bit set means you can cram the remaining 7 bits of each character into a contiguous space of 140 bytes, ignoring byte boundaries, which gives you a modest but stable space saving of 1/8 for your basic-English-only SMS.
However, the approximate EBCDIC equivalent to US ASCII is shown on that encoding's Wikipedia entry as leaving 90 codepoints out of 256 undefined. All 8 bits are used, but not in all combinations, so there must mathematically be a simple way to achieve a constant compression for that encoding that is almost as good as for US ASCII.
I only tell this story to show that there's quite a bit (pun intended) of half-bakedness in this single sentence alone. Add to that the fact that it is largely irrelevant when it comes to SMS and GSM cellphones, because while the relevant standard—cf https://en.wikipedia.org/wiki/GSM_03.38—takes some inspiration from ASCII (like virtually *all* modern encoding schemes, including Unicode), it is all different in detail. Importantly, the standard mandates implementation of 3 compression ('packing') modes: 'CBS' (packs 93 7-bit codepoints into 82 8-bit bytes), 'SMS' (160 codepoints in 140 bytes) and 'USSD', which packs up to 182 characters into 160 bytes. Not sure where the last one was used; GSM SMS has a hard limit of 140 8-bit bytes.
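For the curious, here is a simplified sketch of the SMS-style septet packing (LSB-first, with padding and escape-sequence handling omitted), which shows where the 160-characters-in-140-bytes figure comes from:

```python
def pack_septets(values):
    """Pack 7-bit values into a continuous bit stream, LSB-first, roughly as
    GSM 03.38 SMS packing does (padding and escape handling omitted)."""
    acc, nbits, out = 0, 0, bytearray()
    for v in values:
        acc |= (v & 0x7F) << nbits
        nbits += 7
        while nbits >= 8:
            out.append(acc & 0xFF)
            acc >>= 8
            nbits -= 8
    if nbits:
        out.append(acc & 0xFF)
    return bytes(out)

packed = pack_septets([ord(c) for c in "x" * 160])
print(len(packed))   # 140: 160 characters * 7 bits = 1120 bits = 140 octets
```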
Mr. Unger: "Chinese characters require two bytes each in UNICODE, and a legible Chinese character with many strokes requires a rectangle of more pixels than any legible ASCII character, so a 140-character limit per tweet on Chinese characters is pretty meaningless."
—refuted above. Unicode is a coded character set, not a byte encoding scheme, and it is not written in all-caps. The two-bytes story is not even wrong.
I try to parse the writer's intent and repeat to myself 'a 140-character limit is meaningless in the case of Chinese, because some Chinese characters need more pixels on the display than English letters and digits do'. I don't get it. Apples and pears.
Mr. Unger: "Japanese and Korean standards also use double-byte encoding and require more pixels per displayed character (in Korean, each syllable is treated as a single display character)."
—Yeah, well. Yes, some encoding standards have become known as double-byte encodings; Microsoft in particular likes to call them DBCSs (double-byte character sets, an MS misnomerism), which are a special form of MBCSs (multibyte character sets, another MS misnomerism). Hm. Not quite clear how relevant that is. Then there's this pixels-per-character thing again. Twitter messages are NOT measured in pixels spent, but in characters used; SMS counts bits used. And yes and no: Korean *can* be encoded using 1 codepoint for the entire orthographic syllable, or, equivalently, with 1 codepoint per Hangeul letter (2 to 4 per syllable). "Treated as a single display character" probably means 'treated as a single character by Twitter', which is true (Twitter counts 'visual characters', not 'codepoints' or 'bytes' per se).
Sorry for the long post so far but I went to a computer exhibition today so I'm sort of primed.
Now I will defend Mr Horwitz, who IMHO wrote a legible, informative and well-researched, yet non-technical article that dealt quite sensibly with the difficult task of explaining the matter at hand to a general western audience with little background in CS or East Asian languages.
Mr Horwitz uses 'letter' in one sentence and 'character' in the next to denote 'script entity for writing English'. What he's talking about is not the 'character' as in 'the characters of the Chinese written language' or 'characters used in printing', but the 'character' in 'coded character set'. Each English letter, each digit, and each Russian, Greek and Arabic letter is one 'character' in that technical sense, and that is surely a legitimate and relevant way (if not the ultimate one, and certainly not the only one) to compare information-to-space ratios across languages and writing systems.
Prof Mair: "We also have to keep in mind the difference between "word" and "character", both in English and in Chinese."
—Yes, and this is exactly what Mr Horwitz does. Quote: "Words in English consist of an average of 5.1 letters each […] Typically, a “word” in Chinese consists of between just one and two characters.", both claims with links to relevant pages.
Prof Mair: "Just looking at the 279 "characters" of English and the 280 characters of Chinese, the latter take up more than twice the space of the former."
—Chinese must definitely be typeset at a larger size than English; in mixed-script texts, a rough measure is typically given by making Chinese characters take up the entire height needed by Latin letters, including descenders, ascenders and upper-case letters (whichever takes more space). This is not always done, but then too much mixed-script printing has teeny Chinese characters. Even with Chinese set at that large a size, complicated characters will be blurry in, say, 10 or 11 pt documents.
But as for Twitter, that is irrelevant, of course. Twitter does not count pixels used, and Mr Horwitz does not want to discuss pixels used. Mr Horwitz did the right thing: he shows examples for both scripts side by side, in sizes with comparable legibility (actually, the Chinese is a bit on the smallish side). If there's one problem, it's his not giving us a translation of the Chinese piece, so the intended English-speaking audience might obtain a better understanding of how much is really being said in those 280 Chinese characters (but readers can still click through to read that tweet author's comment: "@tianyuf In English it's a news abstract, in Chinese it's an entire piece of news").
May I add that there's no need to call out Mr Horwitz for his use of the term 'character' when talking about typeset English. The scarequotes wrongly suggest that his usage isn't the technically correct one; but it is, and it is exactly the sense as used by Twitter (i.e. equivalent to 'code point', and, where multiple codepoints coalesce to form a single visual entity, that entity).
Prof Mair: "A more appropriate measure for comparing the two types of script would be their relative "density", the amount of memory / code space required to store and transmit comparable information in the two scripts."
—I take the second half of the statement to be an explanation for the first, not a juxtaposition, so 'density as understood or measured as amount of memory / code space required to store and transmit'.
I am not sure why this measure is more appropriate (in what way?) to compare written English and Chinese (than counting characters). I think counting bytes used vs characters used is a totally appropriate measure to compare English and Chinese as those languages are tweeted, and characters per word is also an appropriate measure.
Leaving aside for the moment the question of what is meant by 'code space' in the above, 'amount of memory required to store' is clear enough. Now, if we measure 'amount of memory for storage' in bits, we still have to stipulate one or more coded character sets and an algorithm to translate codepoints into bit patterns. In this case, it is a little unfair to use Unicode and UTF-8 (a variable-length encoding), because the US-based Unicode consortium made sure that US ASCII is mapped 1:1 onto the first 128 code points of Unicode, which makes the letters a to z and A to Z, the digits 0 to 9 and so on the only writing system in the world that can be written 100% with one byte per character using UTF-8. Bingo. Greek, Cyrillic, Arabic, all those scripts need more bytes per character, without any substantial fact about the respective writing system being expressed in terms of bits needed for storage.
So when we turn to UTF-16 (an almost constant-length encoding), then English needs 16 bits or 2 bytes per letter, just as is the case for the vast majority of all Chinese, Japanese and Korean characters in actual daily use (a number that includes all Kana and all precomposed Hangeul syllables).
What *is* vastly different is the *number of code points* needed to encode the respective scripts, and here, of course, Chinese is the champ in the ring, weighing in at roughly 85,000 code points (80 to 90 out of 100 Unicode code points belong to CJK alone). Now compare that to the English alphabet with its 26 letters.
No wait. In fact Unicode has over 1,000 (!) codepoints dedicated to the Latin script. Oops. And the above figure of 85,000 includes roughly 70,000 rather obscure characters, some of which have no known meanings, no known readings, and no known occurrences in any text, ancient or modern; moreover, almost 1,400 codepoints are intentional or accidental duplicates.
So that scales the comparative number of code points needed somewhat; Chinese is still vastly bigger than any other encoded script, and therefore needs inherently more bits per unit of writing. I take it that 'code space' could conceivably mean this, so let's do the math.
We can estimate that when all languages written in the Latin script are taken care of and all precomposed characters are eliminated, Latin would need in the ballpark of 200 to 500 codepoints at most, so 9 bits of data "should be enough for everyone", if I may channel Bill Gates. For a thousand codepoints, 10 bits suffice.
Ten or fifteen thousand Hanzi codepoints should be almost sufficient for Chinese (except for the long tail); that's about 14 bits (fifteen thousand is just shy of 2^14 = 16,384). This fits in well with what many computer users have observed over the decades with various character encoding standards.
Seen that way, Chinese is not quite twice as weighty, bit-wise, as the Latin script; the ratio is somewhere between 16b/7b (2.3) and 14b/10b (1.4); call this measure absolute (relative when compared) bits per script (bps), equivalent to bits per script character (bpc) (because you can't just leave out unneeded bits in uncompressed storage; all the 0s and 1s have to be spelled out).
Now, the higher end of that interval happens to be almost exactly the inverse of the estimate cited by Mr Horwitz *that, he suggests, holds for character lengths of equivalent numbers of words* (characters per word, cpw) between Chinese and English (somewhere around 1.8 cpw / 5.1 cpw = 1/2.8), which is a quite different measure altogether, one that is highly language-dependent (and hard to pin down, because Chinese largely lacks overtly marked orthographic words). It must be stressed that this is coincidental; the average cpw for a given language is not directly linked to the number of characters available in a script. Still, it's good enough for back-of-the-envelope calculations.
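The back-of-the-envelope figures above can be reproduced mechanically; the inventory sizes below are the rough numbers used in this comment, not authoritative counts:

```python
import math

# Minimum bits per character for a fixed-width code, given an inventory size.
inventories = {
    "basic Latin letters": 52,
    "extended Latin (rough)": 500,
    "everyday hanzi (rough)": 15_000,
    "CJK in Unicode (rough)": 85_000,
}
for name, n in inventories.items():
    print(f"{name}: {math.ceil(math.log2(n))} bits")
# 6, 9, 14, and 17 bits respectively.
```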
Sorry again for the long post.
John Roth said,
October 7, 2017 @ 6:33 pm
Sigh. This is what happens when you try to mix academically correct language analysis with a short article intended to make a simple point in a way that's understandable to a layman. The typical member of the target audience does not care about the difference between a character and a grapheme and a code point. To that person, a character is what shows up on the screen (that is, a grapheme), and Twitter understands that even if some people want to make it a bit more complex.
It's not even all that relevant that the UTF-8 encoding uses 1 byte for the 96 code points that represent ASCII characters, and Chinese uses either three or four (depending on whether it's on the basic plane or one of the extension planes). Certain graphemes, such as an emoji with a skin tone variation, can take a dozen or more bytes.
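A concrete example of that last point; both strings are arbitrary illustrations:

```python
# A single displayed "character" (grapheme) can span several code points and
# many more bytes.
thumbs_up = "\U0001F44D\U0001F3FD"   # thumbs-up + medium skin tone modifier
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"  # family ZWJ sequence

for s in (thumbs_up, family):
    print(len(s), "code points,", len(s.encode("utf-8")), "UTF-8 bytes")
# 2 code points / 8 bytes, and 7 code points / 25 bytes.
```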
liuyao said,
October 7, 2017 @ 6:55 pm
From the title of the article I thought it was clear and straightforward. They should have stopped at saying that on Twitter one could express more in Chinese (and Korean and Japanese) than in English or other alphabetic languages. It was not by design, but an artefact of the plain fact that our computers regard a Chinese character as a single indivisible unit. Not sure about Korean Hangul.
~flow said,
October 7, 2017 @ 6:59 pm
OK, here we go again.
Prof Breen: "These days the text portion of Twitter messaging is in Unicode in its UTF-8 transfer format. Since kanji, hanzi, hangul and kana all take 3 bytes in that format and diacritic-free Latin alphabetics take 1 byte, the languages which seem to be more efficient at the character level actually use more bandwidth."
Prof Mair: "I was waiting for an informed commenter to say precisely what you did."
Sorry, but this is wrong from the CS side as well as, I'm afraid, from the linguistic side.
As I showed above, you can indeed encode letters a-z and A-Z within the limits of a single byte. But on the one hand, you can encode *those* 52 characters using only 6 bits; more is not needed for these diacritic-free Latin alphabetics. However those are far from being the only diacritic-free Latin alphabetics, there are for example ßÞÆŋƆƋƧƩ (Latin Capital Letter Esh, distinct from Greek Capital Letter Sigma) and so on that the Unicode consortium defines as non-composed Latin alphabetics; each of these characters happens to take up 2 bytes in UTF-8.
Second, that UTF-8 needs only a single byte for a-z, A-Z is *not* a fact about the Latin alphabet; it is a technical artifact, created by the historical accident that the US has, in the 20th c., enjoyed a position in the world that helped to make its 1960s national standard—US ASCII—a nearly universally observed precedent for defining coded character sets for any written language. Had the Russians been the ones that everyone defers and refers to, maybe a Soviet Юникод Committee would have put the Latin letters at the upper end of the BMP; in that region, each letter needs three bytes in формат Преобразования Юникода 8-битный (фПЮ-8, i.e. UTF-8). Or they could have put them above codepoint 0x10000, where each letter needs four bytes.
Conversely, and I think Prof Breen should be well aware of it, there have been encodings where all of US ASCII was crammed together with a full set of Katakana into an 8-bit code page (example: http://www.sljfaq.org/afaq/encodings.html#encodings-JIS-X-0201). Using that arrangement and UTF-8 as an encoding, each Kana would necessitate only 2 bytes, and just by swapping the Latin and the Kana halves of the code table, each Kana needs only a single byte. Since around 100 codepoints are needed for Hiragana and Katakana together, you can do that within the bounds of a 7-bit encoding if you want to. Yes, both Japanese syllabaries fit into 7 bits. And because JIS X 0201 actually wastes up to 94 codepoints on unneeded control characters and undefined positions, you can do all of US ASCII, Katakana, and Hiragana within the limits of a classical MS-DOS 8-bit codepage. Dang. No three bytes needed, Japan happy.
The languages which seem to be more efficient at the character level today use more bandwidth per character than English because of UTF-8.
True, because of the size of the inventory, CJK does inherently and unavoidably need more bits per character than the Latin, the Greek or the Cyrillic script, each script being dealt with separately.
As for the UTF-8 encoded bytes needed for those alphabets, however, that disadvantage could easily be reverted by repositioning these alphabets to the upper end of the encoding space. And, since each CJK code point potentially represents more linguistic content than, say, an English code point, and each English word uses maybe 2 or 3 times more code points than a Chinese word, you can easily imagine a world where 280 letters and spaces of English text need up to 280×4=1120 bytes (фПЮ-8 FTW!), but the Chinese text with about the same linguistic content only needs 280/2.8=100 characters that you can encode in 100×2=200 bytes.
I have just shown that Chinese is 1120/200 = 5.6 times (more than five times) as bandwidth-efficient as English.
You're welcome.
Jim Breen said,
October 7, 2017 @ 7:23 pm
When I wrote "Twitter messaging is in Unicode in its UTF-8 transfer format" I chose my words carefully. In that context I do not believe anything I wrote was "wrong". I was not referring to JIS X 0201 or ASCII or any other coding system.
~flow said,
October 7, 2017 @ 7:50 pm
Prof Breen: "hangul […] take 3 bytes in [UTF-8]"
—Only when you submit precomposed syllables. If you encode each Jamo separately (which you are allowed to, see http://www.gernot-katzers-spice-pages.com/var/korean_hangul_unicode.html), you need 2 or 3 code points per syllable, each of which needs up to 3 bytes in UTF-8; that's up to 9 bytes per syllable, so 3 bytes is only the lower boundary. (Above I wrote that you need up to 4 Jamo, which is apparently not correct; for syllable endings, precomposed Jamo pairs must be used.)
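A quick demonstration of the two options, using Python's Unicode normalization; '한' is an arbitrary example syllable:

```python
import unicodedata

syllable = "한"                                # precomposed syllable, U+D55C
jamo = unicodedata.normalize("NFD", syllable)  # decomposed into conjoining jamo

for label, s in (("precomposed", syllable), ("decomposed", jamo)):
    print(label, len(s), "code points,", len(s.encode("utf-8")), "UTF-8 bytes")
# precomposed: 1 code point, 3 bytes; decomposed: 3 code points, 9 bytes.
```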
OTOH the way that Unicode encodes the Jamo is highly idiosyncratic (and probably originated from a Korean standard; I'd have to look into that). If you were willing to make full use of modern font technology, you'd only need around 25 Jamo codepoints or so, namely ㄱㄴㄷㄹㅁㅂㅅㅇㅈㅊㅋㅌㅍㅎㅏㅑㅓㅕㅗㅛㅜㅠㅡㅣ plus a few obsolete and special-interest letters as well as punctuation (as opposed to over 300 Unicode Jamo codepoints and over 11,000 for syllables). That much fits into five bits; as for data transmission, a 1950s 5-channel teletype perforated tape suffices. Each Hangeul syllable needs between 2 and 4 code positions in that scheme, or up to 20 bits; you'd probably have to encode syllable boundaries explicitly to get unambiguous syllabifications; in that case, 25 bits are needed.
That is not necessarily a better way of doing things than what Unicode is doing, but IMHO definitely a better reflection of what that writing system minimally necessitates using present-day computing equipment.
~flow said,
October 7, 2017 @ 8:36 pm
@Prof Breen
I have to apologize if what I said came across as brash. I re-read your statement, and yes, if I read it as "What Twitter uses is UTF-8; therefore, a CJK message needs 3 bytes per character where English needs only typically 1 byte per letter; that encoding is therefore less efficient per code point for CJK" then I can see nothing wrong with it.
I'm still struggling with the second portion, though: "the languages which seem to be more efficient at the character level actually use more bandwidth". Because: In what way do CJK languages seem to be more efficient at the character level? It is not the Hanzi / Kanji / Hanja character inventory, which is vast when compared to the English alphabet. It won't be the comparative graphical complexity of the individual Hanzi when compared to English letters, either. So I can only understand that statement to be about the number of characters needed to transport a given content when compared to the number of letters used for a similar English message. But then the statement cannot be technically true, not 'bitwise' true, because it is a statement about linguistic content, something that neither Unicode nor UTF-8 deal with. Hence, I can only understand the statement in a linguistic context; but then, the 'actually' seems to indicate that yes, indeed, we know for sure that although superficially, CJK looks more efficient than English, where in fact the bytes needed to transmit a given message are fewer for English, more for CJK: here 1 byte per letter, there 3 bytes per character.
But we don't know this: what we see seems to indicate that where English uses 5.1 codepoints / 5.1 bytes, CJK needs around 1.8 codepoints / 5.4 bytes (for one typical word). That is, we suspect it to be so, from what we have seen; change the subject to, say, small towns in Hungary, and Chinese will need vastly more characters and bytes than English; change the language to, say, one that relies a lot on 4-character sayings, and English will need long paraphrases to do the same as a 12-byte-long Chinese message. Or swap English for Russian: Cyrillic is very similar to Latin, but it needs 2 bytes per character in UTF-8, so that's 2 bytes per code point in a moderately-sized alphabet vs 3 bytes per code point for a sizable portion of CJK. Is that a big difference? My impression is that Russian words tend to be rather long, so when you count the bytes of equivalent Russian and Chinese UTF-8 texts, I guess the difference will vanish; maybe Chinese actually uses fewer bytes.
Keith Ivey said,
October 7, 2017 @ 8:45 pm
The graph in the linked article shows Spanish as much more compact than English for the first five verses of Genesis. Isn't the Spanish version of a text normally somewhat longer than the English version? Certainly being less than two-thirds the length seems surprising.
Stephen Reeves said,
October 8, 2017 @ 1:57 am
Has anyone actually tried tweeting a full-length tweet in Chinese and then tweeting the same in English? It is fairly obvious one will run out of space before finishing the English translation.
Victor Mair said,
October 8, 2017 @ 8:44 am
@~flow
Once again, you've fallen into your old pattern of self-indulgent, incoherent, highly suppositional logorrhea. When you do that, even when you lamely apologize, as you did here to Jim Breen, you ruin the chances for others to have an intelligent, productive discussion of the issues at hand.
In future, please resist the impulse to engage in such rambling monologues and adhere to our Language Log comments policy before hitting the submit button.
~flow said,
October 8, 2017 @ 9:38 am
My apologies were and are sincere. I can see I did violate point #1 of the rules, be brief. Guess I'm a bit compulsory when it comes to encodings and some other things, https://www.xkcd.com/386/. I probably shouldn't have commented at all, because @Elonkareon and @leoboiko had said it all already.
Victor Mair said,
October 8, 2017 @ 10:02 am
~flow
compulsory –> compulsive
That's another problem with your writing — you often misuse words and have faulty grammar. If you write much less, there's a greater chance of better quality.
Elonkareon and leoboiko said a lot, but they did not, as you claim, say it all. The one who said the most is Jim Breen, and he said it concisely, cogently, and civilly.
Nelson said,
October 8, 2017 @ 10:24 am
'The one who said the most is Jim Breen, and he said it concisely, cogently, and civilly.'
Except that Jim's point, while being accurate as far as it goes, seems to assume that bandwidth is the most relevant, or even a relevant, constraint. For most users, the _only_ relevant constraint is the number of characters, which has been limited to 140 regardless of encoding or anything else. This has been pointed out already, but it's worth emphasizing again that as far as responses to the original article go (and the article was really fine on the whole, and correct in its main points), that is the _only_ thing that matters.
Anything else is simply an anticipatory response to potential further conclusions one might try to draw from this, such as the idea that needing fewer 'characters' is somehow a sign of superiority in a script. But while there are a couple of real errors in the article, this particular point isn't one of them, so this criticism looks a bit misdirected: it's responding to an argument no one made.
Elonkareon said,
October 8, 2017 @ 12:29 pm
Tweet metadata far exceeds the bandwidth consumed by the actual characters in the tweet, regardless of language. As others have said, the only reason the character limit still exists is to preserve the "feel" of Twitter as opposed to other, long-form social media outlets.
Jim Breen said,
October 8, 2017 @ 1:21 pm
@Elonkareon. Quite correct; the metadata is voluminous. Getting back to the original article I hope that in the long term Twitter doesn't attempt to make arbitrary language or script-based differentiation of permitted tweet lengths.
amy said,
October 8, 2017 @ 3:27 pm
This reminds me of a related blog post from several years ago on this subject: http://pugs.blogs.com/audrey/2009/10/our-paroqial-fermament-one-tide-on-another.html
B.Ma said,
October 9, 2017 @ 4:34 am
@amy,
Your linked blog post (and the mess of comments above, which for the first time on LL I have just skipped over as there was nothing of interest there) leads me to an interesting observation:
In Chinese, going into "literary mode" causes your text to become shorter, yet you reveal more about your education; in English, shortening your text makes it seem more casual and informal and even rude (I use "txtspk" to communicate with colleagues at the same level on my internal email, but once one of them accidentally CCd our boss who is a stickler for correct grammar and punctuation….)
Silas S. Brown said,
October 9, 2017 @ 9:45 am
I was surprised that Twitter (and Weibo) didn't limit posts to 50 characters when dealing with CJK scripts.
The origin of the English 140-character limit was the SMS (Short Message Service) used by GSM-standard mobile phones. An alphanumeric SMS (i.e. one with no special punctuation apart from basic commas and things, and with Latin-only text with no accents) has a limit of 160 characters before you have to start worrying about it being split into multiple fragments and reassembled (not always successfully), and Twitter deducted 20 characters to leave room for a username, giving 140 characters.
But these 160 (and hence 140) characters were limited to those specified by the GSM 03.38 code (basically ASCII with some currency symbols added, although some of the less-common ASCII punctuation characters like braces will count as two units when coded as GSM).
If you start writing Chinese, Japanese or Korean over SMS, your SMS will be sent using the SMS Unicode protocol, which is very much like UTF-16, and the "one message limit" of that is 70 characters (assuming you don't include any of the rare characters outside the Basic Multilingual Plane, which will each count as two because they have to be coded using surrogate-pairs). If Twitter still wants to reserve 20 for user IDs, 70 minus 20 is 50, so that was my prediction for the maximum length of a CJK "tweet".
So I was really surprised when I first learned Twitter sets the limit to 140 characters regardless of script. Perhaps the team that made that decision wasn't aware of the original technical reason for the English limit, and/or no longer cared about the original requirement to be able to fit a "tweet" into a single-fragment SMS message but regarded the number 140 as some kind of "branding" that now had to be maintained for its own sake.
But then, I'm afraid I never really did fully understand the point of Twitter.
liuyao said,
October 9, 2017 @ 10:40 am
Weibo has long lifted the limit of 140 characters (it's more like Facebook posts in that regard), and I don't think it has lost its Twitter-like appeal as most posts are still of paragraph length.
It's amusing to imagine that Kim Jong Un and Trump actually had a Twitter war. It'd be interesting to assemble the recent exchanges to illustrate the difference between the scripts.
Victor Mair said,
October 9, 2017 @ 12:32 pm
See now "Information content of text in English and Chinese" (10/9/17)
CPC said,
October 10, 2017 @ 3:19 am
> Chinese characters require two bytes each in UNICODE
Just one more instance of the massive confusion that surrounds the nature of the UNICODE standard.
ProcessorHalt said,
October 11, 2017 @ 9:24 pm
Possibly the best article on Unicode and character encoding:
https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
Jim Breen said,
October 12, 2017 @ 5:37 pm
>> Possibly the best article on Unicode and character encoding…
Not a bad article, although it shows its age. In matters of handling code-sets and languages the software industry has come a long way since 2003. As for "best", well, it depends on your criteria. I have major problems with statements like "Asian alphabets have thousands of letters".