Is there a practical limit to how much can fit in Unicode?


A lengthy, important article by Michael Erard recently appeared in the New York Times Magazine:

"How the Appetite for Emojis Complicates the Effort to Standardize the World’s Alphabets:  Do the volunteers behind Unicode, whose mission is to bring all human languages into the digital sphere, have enough bandwidth to deal with emojis too?" (10/18/17)

The article brought back many vivid memories.  It reminded me of my old friend, Joe Becker, who was the seminal designer of the phenomenal Xerox Star's multilingual capabilities in the mid-80s and instrumental in the organization and foundation of the Unicode Consortium in the late 80s and early 90s.  Indeed, it was Becker who coined the word "Unicode" to designate the project.

Erard's article also made me think of the International Symposium on East Asian Information Processing held at the University of Pennsylvania in 1990, about half of the papers of which were collected in this volume:  Mair, Victor H., and Yongquan Liu, ed., Characters and Computers (Amsterdam, Oxford, Washington, Tokyo:  IOS Press, 1991), including James T. Caldwell's "Unicode: A Standard International Character Code for Multilingual Information Processing", which was the first presentation of Unicode to the broader public after it was up and running.  What struck me most powerfully about Caldwell's chapter was a graph showing the huge proportion of code points in Unicode that were taken up by Chinese characters.  All the other writing systems and symbols in the world that were covered by Unicode occupied only a small amount of the total.  It made me feel as though Unicode had been devised primarily to accommodate the enormous number of Chinese characters in existence.  I will return to this aspect of Unicode momentarily.

Another familiar name from the past mentioned by Erard is that of Jennifer 8. Lee, whom he refers to as "an entrepreneur and a film producer whose advocacy on behalf of a dumpling emoji inspired her to organize Emojicon."  I remember her as the author of a remarkable article in the New York Times for February 1, 2001 titled "In China, Computer Use Erodes Traditional Handwriting, Stirring a Cultural Debate".  She was one of the first observers to call attention to the deleterious effect of computer usage on the ability of Chinese to recall the characters and to write them by hand.

Erard's article is well documented and includes numerous captivating case studies, starting with the discovery, digitization, and incorporation of an alphabet for the Rohingya of Myanmar that has come to be called Hanifi Rohingya.

He also gives a lively account of how emojis got mixed up in Unicode, and the problems they presented:

By 2009, 974 emojis had been assigned numerical identifiers, which were released the following year.

As the demand for new emojis surged, so, too, did the criticisms. White human figures didn’t reflect the diversity of real skin colors. Many emojis for specific professions (like police officer and construction worker) had only male figures, while icons for foods didn’t represent what people around the world actually ate. Millions of users wanted to communicate using the language of emoji, and as consumers, they expected change to be swift. One thing appeared to be slowing things down: the Unicode Consortium.

Unicode committees have been overwhelmed with some 500 submissions in the last three years.

Not everyone thinks that Unicode should be in the emoji business at all. I met several people at Emojicon promoting apps that treat emojis like pictures, not text, and I heard an idea floated for a separate standards body for emojis run by people with nontechnical backgrounds.

The issue isn’t space — Unicode has about 800,000 unused numerical identifiers — but about whose expertise and worldview shapes the standard and prioritizes its projects.

Somewhat ironically, China is barely mentioned in the article.  One interesting note Erard offers is that "In the early 1990s, the Chinese government objected to the encoding of Tibetan."  (Wouldn't you know it?)  Erard quotes Mark Bramhill, a student from Rice University:  "'Working on characters used in a small province of China, even if it’s 20,000 people who are going to use it, that’s a more important use of their time than deliberating over whether the hand of my yoga emoji is in the right position.'"  Judging from the way the word "character" is used elsewhere in the article, the "characters" Bramhill is referring to are not Hànzì 汉字 ("Sinograms; Chinese characters"), but previously undigitized elements of some non-Sinitic script.

That's it.  Nothing else about writing in Chinese.  Yet surely Chinese characters Hànzì / Kanji / Hanja are the elephant in the system, the gorilla in the works.  According to the Wikipedia article on Unicode, "The latest version [of Unicode] contains a repertoire of 136,755 characters covering 139 modern and historic scripts, as well as multiple symbol sets."  Of these, according to the Wikipedia article on CJK Unified Ideographs (Hànzì / Kanji / Hanja are not "ideographs", so that term needs to be changed), the number of CJK characters, including "Extension F", is currently 87,882, or 64% of the total characters, letters, and symbols in Unicode.  This means that one script, Chinese, accounts for nearly two thirds of all the characters, letters, and symbols in Unicode.  Is that a cost effective use of bandwidth?  Does that slow the system down?  Especially considering that over half of the CJK characters are almost never used (we know neither the sound nor the meaning of many of them), so they are really just taking up space.

Is the problem with the hundreds of emojis that are creeping into Unicode comparable to the tens of thousands of Hànzì / Kanji / Hanja that are already in it?  Is there any other solution for how to deal with the vast numbers of Hànzì / Kanji / Hanja that are never consulted year after year?  What is the point or purpose of keeping them active in Unicode when they are almost never or never invoked?  My questions may sound like heresy to a Hànzìphile, but what do they sound like to a pragmatic computer scientist or information scientist?



34 Comments

  1. DroppedOutOfAPhdInCS said,

    October 27, 2017 @ 5:41 pm

    My first instinct is that space is cheap and it doesn't really matter that much? Naively, if we removed the Chinese characters we would save ~1.5 bits per character, which seems not especially relevant.

    Reading the article, Unicode is stored in multiples of 8 bits, so that argument needs some refinement; but compared to how much information is needed to store video, and given that video content is almost everywhere, I'd be shocked if the space needed for these characters mattered in practice.
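
    To spell out that back-of-the-envelope estimate, here is a rough sketch in Python, using the approximate character counts quoted in the post above (nothing else is assumed):

        # Information needed to pick one character out of the repertoire,
        # with and without the CJK block, per the ~2017 counts cited above.
        import math

        total_chars = 136_755   # assigned characters in Unicode 10.0 (per the post)
        cjk_chars = 87_882      # CJK Unified Ideographs incl. Extension F (per the post)

        bits_with_cjk = math.log2(total_chars)
        bits_without_cjk = math.log2(total_chars - cjk_chars)

        print(f"with CJK:    {bits_with_cjk:.2f} bits/char")                 # ~17.06
        print(f"without CJK: {bits_without_cjk:.2f} bits/char")              # ~15.58
        print(f"difference:  {bits_with_cjk - bits_without_cjk:.2f} bits")   # ~1.48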

  2. microtherion said,

    October 27, 2017 @ 6:24 pm

    Encoding-wise, it makes practically no difference. Only about 27,000 of the CJK characters are in the basic plane (which can be encoded with 3 UTF-8 bytes). If some of these had been omitted, certain Emojis or Hieroglyphs could have been encoded in 3 instead of 4 bytes.

    The other characters are outside the basic plane, and their presence is not detrimental to the encoding of other characters at all.
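
    To make those byte counts concrete, a minimal sketch (the specific code points below are just convenient examples of each case):

        # A common CJK character in the Basic Multilingual Plane takes 3 UTF-8
        # bytes; characters outside the BMP (emoji, hieroglyphs, rare CJK
        # extension characters) take 4.
        samples = {
            "U+4E2D (common CJK, BMP)": "\u4e2d",
            "U+1F602 (emoji, plane 1)": "\U0001f602",
            "U+2A700 (CJK Ext. C, plane 2)": "\U0002a700",
        }
        for label, ch in samples.items():
            print(f"{label}: {len(ch.encode('utf-8'))} UTF-8 bytes")
        # U+4E2D -> 3 bytes; the two supplementary-plane characters -> 4 bytes each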

    The big potential cost, I think, is having Font coverage for these characters, but in practice, mainstream fonts don't seem to cover them.

    Weighed against this is the big advantage of completeness: that even the most obscure manuscripts can be encoded with Unicode, and in particular, that any character that can be encoded in one of the national standards can be encoded with Unicode.

    It seems to me that the more problematic aspect of the CJK unification is whether there should have been a unification at all (somebody compared this to having a "Latin/Greek/Cyrillic Unified Character Set").

  3. MattF said,

    October 27, 2017 @ 7:01 pm

    Unicode has largely cleaned up what was a proprietary, unstandardized mess. Thanks to Unicode, software on different systems now agrees about how to do text encoding. And that's a good thing.

    It's fair to note that there's a possibility that the end result of the Unicode project will be replacing the old mess with a new, standardized mess. But… why, exactly, would that be a bad thing? Maybe text encoding is just an inherently messy notion.

  4. Jim Breen said,

    October 27, 2017 @ 7:04 pm

    Not a bad article, but I'm not surprised it doesn't mention hanzi/kanji/hanja because it's really about emoji. I'm a bit annoyed about emoji getting into Unicode, but I guess I'd rather have them inside the tent, as LBJ would have said. The article, BTW, doesn't mention ISO at all. Many people forget that it's effectively a joint standard.

    FWIW I'm a long-time fan of the Han Unification, and indeed of the whole coding system. After all the struggles with a mish-mash of multiple codesets and coding systems (Microsoft's shocking Shi[f]t-JIS comes to mind), it was a breath of fresh air, and most of those horrible "mojibake" problems are now history. Sure, the unification had its problems, and we are stuck with oddities like hankaku kana, but it is far, far better than the myriad of predecessor systems.

    Victor, the "ideograph" debate has long been lost. I tend to say something like "incorrectly labelled `ideographs' by Unicode…".

  5. Eidolon said,

    October 27, 2017 @ 9:15 pm

    The problem with emojis isn't the size of the emoji set, but disagreement on what the standard look of each emoji should be, and which emojis should be included, since there is potentially an infinite number of them, given their *actual* ideographic nature. It's comparable to a font issue, but for the fact that emojis aren't neatly divided into fonts and there aren't any "official" authorities, as there would be for countries' writing systems.

    "Is there any other solution for how to deal with the vast numbers of Hànzì / Kanji / Hanja that are never consulted year after year? What is the point or purpose of keeping them active in Unicode when they are almost never or never invoked?"

    They are kept, I suppose, because the purpose of the Unicode project is to unify the encoding of all writing systems, so the burden of inclusion, no matter the complexity, is built into the problem. To a computer scientist, the problem is presumed to be justified, and the only question is how to solve the problem. In the case of Unicode, there is enough space in the code space for several Chinese character sets, so that's not much of a problem. However, a more efficient solution, should space be at a premium, might be to create a reduced set of commonly used characters and then to supplement it with optional codes as needed to take advantage of sparsity. So, instead of transferring texts entirely through Unicode, you would transfer them through a reduced Simple Unicode, and then, should rare characters arise, you would send along only the graph for that particular character, increasing general efficiency.
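
    Something like the following toy sketch makes the idea concrete. It is entirely hypothetical, not any real proposal: the ten-character "common set" and the escape format are made up purely for illustration.

        # Toy "reduced set plus escapes" codec: common characters get 1-byte
        # codes from a small table; anything else is escaped and sent as a
        # full 3-byte code point.
        COMMON = "的一是不了人我在有他"   # stand-in for a frequency-ranked common set
        ESCAPE = 0xFF                    # marker byte: "full code point follows"

        def toy_encode(text: str) -> bytes:
            out = bytearray()
            for ch in text:
                idx = COMMON.find(ch)
                if idx >= 0:
                    out.append(idx)                      # short code for a common character
                else:
                    out.append(ESCAPE)
                    out += ord(ch).to_bytes(3, "big")    # rare character: ship the code point
            return bytes(out)

        def toy_decode(data: bytes) -> str:
            out, i = [], 0
            while i < len(data):
                if data[i] == ESCAPE:
                    out.append(chr(int.from_bytes(data[i+1:i+4], "big")))
                    i += 4
                else:
                    out.append(COMMON[data[i]])
                    i += 1
            return "".join(out)

        sample = "我有一个龘"   # mostly common characters, one rare one
        assert toy_decode(toy_encode(sample)) == sample
        print(len(toy_encode(sample)), "bytes vs", len(sample.encode("utf-8")), "UTF-8 bytes")

    (In practice, running ordinary compression over plain UTF-8 tends to capture much of the same saving, which is one reason schemes like this have not caught on.)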

  6. Mark Liberman said,

    October 27, 2017 @ 9:30 pm

    "Is there a practical limit to how much can fit in Unicode?"

    Short answer: "no". At least, we're many orders of magnitude away from any practical constraints imposed by the sheer number of distinct character types.

    Storage and networking bandwidth are cheap enough for audio and video — a (compressed) digital movie or TV show is a gigabyte or so, which would store 250 million 32-bit characters — and 32 bits would be enough for 2^32 = 4,294,967,296 different code points. Even without any compression, a gigabyte would store about a thousand Chinese novels encoded in such 32-bit characters (which are more than twice as large as needed for the purpose). With simple forms of compression, the number of texts equivalent to a movie would increase by an order of magnitude or so.
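
    Spelling out that arithmetic in a few lines (the novel length is an assumed round figure, chosen only to match the "about a thousand novels" estimate):

        gigabyte = 10**9                 # bytes
        chars_32bit = gigabyte // 4      # 4-byte (32-bit) characters per gigabyte
        code_points_32bit = 2**32        # distinct values a 32-bit code could name

        print(f"{chars_32bit:,} 32-bit characters per gigabyte")      # 250,000,000
        print(f"{code_points_32bit:,} possible 32-bit code points")   # 4,294,967,296

        novel_len = 250_000              # assumed characters per Chinese novel
        print(f"~{chars_32bit // novel_len:,} novels per gigabyte")   # ~1,000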

    As I understand it, there are basically two reasons to be careful about extending Unicode: the strain on font designers, and concerns about creating additional coding ambiguity.

    Running out of bits is not a matter for any concern at all.

  7. microtherion said,

    October 27, 2017 @ 10:56 pm

    Unicode is actually restricted to a bit more than 2^20 code points, due to the UTF-16 encoding. Still, that should be more than enough for terrestrial needs.

    [(myl) It's true that in the current Unicode definition there are only 1,114,112 valid code points, of which only about 120,000 are given a standard value. But if there were a good reason to want billions of code points, it would be easy to modify the spec to allow this — and the effects on storage and processing would not be dire, at least in terms of bit-usage alone.]
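
    A quick sketch of where the 1,114,112 figure comes from:

        # UTF-16's surrogate mechanism reaches the BMP plus 16 supplementary
        # planes of 65,536 code points each; 2,048 of those code points are
        # reserved for the surrogates themselves.
        planes = 17
        per_plane = 0x10000              # 65,536
        total = planes * per_plane
        surrogates = 0x800               # 2,048

        print(f"{total:,} code points (U+0000..U+{total - 1:X})")   # 1,114,112, up to U+10FFFF
        print(f"{total - surrogates:,} usable scalar values")       # 1,112,064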

  8. Victor Mair said,

    October 27, 2017 @ 11:17 pm

    @Mark Liberman

    "As I understand it, there are basically two reasons to be careful about extending unicode: the strain on font designers, and concerns about creating addition coding ambiguity."

    I would be very grateful if these two aspects could be taken up for further, more detailed, discussion.

    When I initially wrote the title for this post, it was "Is there a limit to how much can fit in Unicode?" Upon reflection, however, it was because of such concerns as those raised by Mark that I added the word "practical". All along (and from previous discussions) I was aware that mathematically the sheer number of available code points was not a fundamental issue.

  9. Peter Taylor said,

    October 28, 2017 @ 2:26 am

    Especially considering that over half of the CJK characters are almost never used (we know neither the sound nor the meaning of many of them), so they are really just taking up space.

    How often is Linear A used? Enabling scholars to digitise classical texts accurately is a sufficient reason to support rare characters.

    An article worth reading on these issues in general is "I Can Text You A Pile of Poo, But I Can’t Write My Name".

    My opinion, for what it's worth, is that the Unicode consortium should tell emoji fans to form their own consortium and publish a standard which uses one of the private use areas of Unicode. That way characters from the two systems could be intermixed and the Unicode people could get back to working on existing writing systems rather than inventing one on the fly.

  10. Sean M said,

    October 28, 2017 @ 3:29 am

    Victor: One thing which I notice is that when the Unicode folks try to save space by treating the sinograms in Japanese scripts as the same characters as their Chinese equivalents, they get accused of being sinister imperialists (see the Aditya Mukerjee article on modelviewculture above). In my world, it's taken for granted that RA is the same character whether it is written in Gudea's more pictographic script and pronounced phonetically in Sumerian, or used as part of a logogram in a Hittite text written concisely with wedges a thousand years later, although I am sure that specialists in those scripts can explain the practical issues.

    Surely you don't suggest that your library throw out the 50% of its collection which is seldom or never checked out? That would save space and money too, but it is not what a scholarly library is for.

  11. David Marjanović said,

    October 28, 2017 @ 3:31 am

    The Tangut script, quite possibly the most wrong-headed idea in human history, is now in Unicode, and one font has been developed for it. The job of scientists working on Tangut, or on comparative Sino-Tibetan, just got a whole lot easier: for one thing, they can now publish their research without having to resort to squeezing images into the text.

    Scholars of Old Sinitic have been enjoying this advantage for longer, and as you can imagine they routinely have to use characters that nobody else has used in two thousand years.

  12. Michael said,

    October 28, 2017 @ 5:16 am

    I think the question is not how many characters can fit, but how many independent requests can be processed. If some rarely-used characters are codified once and never touched, there is no problem. Submissions of codepages for entire scripts are rate-limited by the effort to prepare a hundred or so character descriptions. Individual requests for single emojis can be a problem just because of their number.

    I guess it would be better if the list of Emojis was prepared by people who care about emojis specifically. On the other hand, I doubt they should be confined to a private use area — if a United Emoji Lobby collected the requests and prepared the descriptions in the Unicode Standard format, one request for inclusion of some thousands more emojis every six months wouldn't be that much of a problem. New kinds of combining emoji characters (like skin tone modifiers) would probably still have to be considered by the Unicode Consortium individually.

  13. Carl said,

    October 28, 2017 @ 9:24 am

    To me, the problem with emoji is that Unicode is forever. Mankind will never get around to replacing Unicode. Unicode’s goal is to encode all human writing systems. Hence there is a letter from Egyptian hieroglyphics. The letter is completely obscene, but it’s part of an ancient writing system, so it’s in Unicode. I think in a few hundred years the emoji for “taco” and for a medium-dark-skinned woman holding hands with a boy will seem like similarly bizarre reflections of culturally particular obsessions.

  14. Carl said,

    October 28, 2017 @ 9:26 am

    The comment system didn’t display the hieroglyphic I had in my comment. It was this one:
    http://www.isthisthingon.org/unicode/index.phtml?glyph=130BA

  15. John Roth said,

    October 28, 2017 @ 9:31 am

    As Professor Mair points out, there aren't major limitations on the number of code points: UTF-8 can be extended to 6 bytes easily (early versions showed the coding) and UTF-16 can also be extended by adding a third extension. UTF-32 can already accommodate upwards of 4 billion code points. The problem there is all of the software that would have to be updated to accommodate it. There are also hardware instructions that are used to handle Unicode on a number of architectures. (IBM's z-system comes to mind.) These all have the current limitations built in.
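
    For reference, the ceilings of the original UTF-8 design can be tabulated like this (a sketch of historical capacity; the 5- and 6-byte forms were dropped from the standard in 2003, so current decoders reject them):

        # A single byte carries 7 payload bits; an n-byte sequence (n >= 2)
        # carries (7 - n) bits in the lead byte plus 6 per continuation byte.
        for n_bytes in range(1, 7):
            payload_bits = 7 if n_bytes == 1 else (7 - n_bytes) + 6 * (n_bytes - 1)
            print(f"{n_bytes} byte(s): up to {2**payload_bits:,} code points "
                  f"(max U+{2**payload_bits - 1:X})")
        # The 6-byte form reaches 2,147,483,648 code points (U+7FFFFFFF),
        # far beyond the current U+10FFFF ceiling.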

    On top of that, you have fonts, which have already been mentioned. Those take up space on systems. Then there are potential changes to rendering algorithms (has Egyptian Hieroglyphic been added to freetype yet?) Then you need font pickers for all systems.

    Extending Unicode is possible, but it would be a horrendous task.

  16. 艾力·黑膠(Eric) said,

    October 28, 2017 @ 12:06 pm

    @ microtherion:

    I don’t see anything inherently wrong with a unified Latin/Greek/Cyrillic character set. For one thing, it would eliminate this kind of URL spoofing, wherein e.g. miсrοsοft.сοm looks identical to microsoft.com, using a Cyrillic S for the C and a Greek omicron for the O, etc. I'm sure it was much debated, and there's a good reason that Α (capital Greek alpha) is a "different letter" than A (capital Latin ay), but I wouldn’t be surprised if combining those solves as many problems as it causes.
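
    A quick way to see the trick (a minimal sketch; the spoofed string below mirrors the example above):

        # The spoofed name looks identical but is a different sequence of
        # code points, which is exactly why it fools the eye and not the software.
        import unicodedata

        real = "microsoft.com"
        fake = "mi\u0441r\u03bfs\u03bfft.\u0441\u03bfm"   # Cyrillic es, Greek omicrons

        print(real == fake)   # False
        for ch in fake:
            if ord(ch) > 0x7F:
                print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
        # U+0441 CYRILLIC SMALL LETTER ES, U+03BF GREEK SMALL LETTER OMICRON, ...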

  17. 艾力·黑膠(Eric) said,

    October 28, 2017 @ 12:23 pm

    David Marjanović:

    for one thing, they can now publish their research without having to resort to squeezing images into the text.

    And that text is searchable! in a way that images are not.

  18. MikeA said,

    October 28, 2017 @ 1:24 pm

    I suspect that I will be dust before people stop sending email with MSFT-specific character codes (Windows-1252 bytes) rather than the Unicode equivalents for e.g. the "smart quotes", often mislabeling the character set as UTF-8.
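
    The failure mode looks something like this (a small sketch, assuming the usual Windows-1252 case):

        # Windows-1252 "smart quote" bytes shipped under a UTF-8 label are not
        # valid UTF-8: strict decoders error out, lenient ones produce mojibake.
        text = "\u201csmart quotes\u201d"          # curly quotes
        as_cp1252 = text.encode("cp1252")          # b'\x93smart quotes\x94'

        try:
            as_cp1252.decode("utf-8")
        except UnicodeDecodeError as e:
            print("strict UTF-8 decode fails:", e)

        print(as_cp1252.decode("utf-8", errors="replace"))   # replacement characters
        print(as_cp1252.decode("cp1252"))                    # correct: curly quotes back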

    The dire consequences of mislabeling character sets and letting just _anybody_ propose characters is covered in

    https://www.xkcd.com/380/ (mind the mouseover)

  19. James Wimberley said,

    October 28, 2017 @ 6:39 pm

    Unicode may be helpful for natural-language cryptosystems, exemplified by the Navajo code talkers of the US Marines in WWII in the Pacific. You just have to be quite sure that your adversary – ISIS, say – does NOT have your access to a pool of fluent speakers of Hittite.

  20. Andrew Usher said,

    October 29, 2017 @ 12:45 am

    It should be remembered that many programs/users only deal with characters in the basic plane (16 bits), but I don't imagine they're intending to put emoji there. The objection is not to the concept of encoding emoji, which is not inherently worse than encoding all the other historical scripts etc., but to the politics of doing so, since there is no one to set a standard list of them and no limits need be placed on their number – anyone could use a new one at any time and be understood (so long as their graphic rendered properly for the reader). Ultimately, considering emoji as graphics makes more sense.

    I can't believe no one else thinks anything strange of the plural 'emojis' used here; I don't even like the word or (very much) the concept, but I know that it's supposed to be indeclinable: 'emoji' not 'emojis'. I must add that there was no reason whatever to prefer the hybrid 'emoji' over the genuinely English 'emoticon', except perhaps uncertainty as to the pronunciation of the latter.

    k_over_hbarc at yahoo.com

  21. Victor Mair said,

    October 29, 2017 @ 7:26 am

    Erard's subtitle is "Do the volunteers behind Unicode, whose mission is to bring all human languages into the digital sphere, have enough bandwidth to deal with emojis too?" It seems that no one wants to answer his rhetorical question about bandwidth in the negative.

  22. Andrew Usher said,

    October 29, 2017 @ 10:15 am

    If it's a 'rhetorical question' it should not be answered, by definition. But maybe the reason no one wants to answer it is that it's so odd, even if one interprets the phrasing 'the volunteers … have enough bandwidth' correctly, as I did not at first.

  23. Peter Taylor said,

    October 29, 2017 @ 11:34 am

    @Victor Mair, it seems that no-one wants to answer it at all, which makes sense: the only people whose answer is actually worth anything are the volunteers in question.

  24. Sean M said,

    October 29, 2017 @ 1:49 pm

    Professor Mair: I see several people on this comment thread who suggest that the Unicode Consortium should give up defining new Emojis and focus on describing existing scripts, thus answering the rhetorical question with a "not a good use of resources."

    In terms of the tens of thousands of obscure CJK characters, I assume that there are already philologists and paleographers who catalogue them, and the main work is integrating their results with the Unicode approach to typologies. Without a lot of information on the internal structure of the Unicode Consortium, I have no idea whether the CJK scripts pull in more money and brainpower than they cost … but as long as people are interested in reading old CJK texts on a computer, there will need to be some system to digitize their traditional scripts, and for good or bad, Unicode's goal is to be that standard for every script.

  25. Victor Mair said,

    October 29, 2017 @ 3:19 pm

    @Peter Taylor

    "the only people whose answer is actually worth anything are the volunteers in question"

    And some of them are readers of LLog.

  26. David Marjanović said,

    October 29, 2017 @ 5:23 pm

    And that text is searchable! in a way that images are not.

    Yes! That's very important; it's one of the main advantages of computers over paper!

  27. John Cowan said,

    October 29, 2017 @ 9:22 pm

    The Mukerjee article linked above is nonsense. See my rebuttal at Languagehat and the comments following it.

    I would add for people interested in the structural aspects of standardization (everyone else, skip to the next comment) that the Unicode Consortium's Technical Committee (UTC) and Working Group 2 (formally ISO-IEC/JTC1/SC2/WG2, or just WG2) work closely together, that requests for characters or scripts can be sent to either, and that nothing gets into Unicode (aka ISO/IEC International Standard 10646) until both groups approve it.

    The Emoji Subcommittee (ESC) is a subset of the Unicode Technical Committee that filters emoji requests, which mostly come from individuals; the Ideographic Rapporteur Group (IRG) is a subset of WG2 that processes all ideographic requests, which mostly come from governments and standards bodies. The organizations represented on the IRG are national standards bodies interested in ideographs (not just CJKV but others as well). Everything else goes directly to UTC or WG2.

    To make things even more complicated, it also just so happens that the UTC meets jointly with INCITS L2 (or just L2), a committee of INCITS, the U.S. information technology standards organization, which is overseen by ANSI, the U.S. national standards body that participates in ISO. So Unicode looks like the U.S. to ISO, even though some of its members are not U.S. entities.

  28. John Cowan said,

    October 29, 2017 @ 9:36 pm

    By the way, emoticons are constructed from other characters, whereas emoji are characters in themselves (though there are "emoji ligatures" that contain two emoji and a zero-width joiner, or an emoji and an emoji modifier).
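
    Concretely, a small sketch (the sequences are just examples; how they render depends on the platform and font):

        # A zero-width joiner (U+200D) gluing two emoji together, and a
        # skin-tone modifier following a base emoji.
        ZWJ = "\u200d"
        couple = "\U0001F469" + ZWJ + "\U0001F467"   # woman + ZWJ + girl
        waving = "\U0001F44B" + "\U0001F3FD"         # waving hand + medium skin tone

        for seq in (couple, waving):
            print(seq, [f"U+{ord(c):04X}" for c in seq], f"len = {len(seq)}")
        # Each renders (on supporting systems) as a single glyph, but is
        # several code points long as far as Unicode is concerned.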

  29. Chas Belov said,

    October 30, 2017 @ 12:23 am

    I question whether the pile of poo Unicode character is unambiguous, as it exists in both expressionless and smiling glyphs.

  30. Jeroen Mostert said,

    October 30, 2017 @ 3:20 am

    Twelve years ago I had a discussion with someone on Usenet about whether or not the present system of Unicode would be sufficient for all the world's characters. They vehemently insisted it would not be, and that, like 32-bit systems, 640K of main memory and all the other seemingly reasonable-at-the-time constraints ever imposed by computers, it would prove to be insufficient and require extension before long. I argued the opposite: that Unicode would not escape its bounds for a long time and likely not ever, because, unlike the trend of ever-increasing hardware capacities that had to be addressable by software, Unicode was encoding characters in human languages, something not undergoing rapid expansion by any measure and subject to reasonable estimates of an upper bound.

    The reason I remember this is that, at the time, the argument ended in the usual mature fashion of "we'll see who's right in 10 years", and I turned out to be right. I briefly considered posting a smug confirmation of this in the Usenet group where it originally started, but decided against it. :-P (Note the emoticon, which is all we had in those days…)

  31. DWalker07 said,

    October 30, 2017 @ 3:19 pm

    It's interesting that you mentioned Jennifer 8. Lee. I was browsing the Web this past weekend for information on naming people, and the restrictions that some countries, states, and local jurisdictions have on naming. I had the impression that made-up or unusual names were much more rare in the past, with hospitals and vital-records-registrars likely controlling what names were "allowed". That impression might have been wrong, but I can't tell.

    Jennifer's middle name was mentioned as a non-standard name. The state of California, for example, restricts names to the 26 letters of the U.S. alphabet with no diacritical marks (no accented letters).

    Some U.S. states have no laws controlling what names are allowed, except there are practical limits imposed by registrars and by common sense — you would not want to be given a name that's 10,000 characters long.

  32. Tom Bishop said,

    October 30, 2017 @ 3:43 pm

    One problem is that the people in charge are experts at encoding preexisting conventional written characters, but when it comes to inventing a new language of emoji, they really don't know what they're doing. There's no reasonable set of criteria anymore for what is, or is not, eligible for encoding as a UCS character. There are currently more than seven billion people on this planet. Most of them would be capable of inventing characters more valuable than many recent additions. It will be a great improvement when the people of Earth are all empowered to add their own characters to the Universal Character Set, without spending years getting approval from the supposed experts or authorities. Maybe this is the problem some people are describing as insufficient "bandwidth". I see it more as a problem of power. The corporations (with a little help from governments) are deciding on the limited set of characters the people can use. The kind of emoji they're giving us are like junk food.

    A related problem is the arbitrary limitation of about one million code points (maximum U+10FFFF). A truly universal character set can't impose a permanent upper limit on the number of code points. Part of the solution to that problem is an extension of the UCS encodings to support an unlimited number of code points, as proposed at ucsx.org. However, that transition will be hard, since so much software now has the U+10FFFF limit built in. The transition from maximum U+FFFF to maximum U+10FFFF has been chaotic, and is still incomplete after about twenty years. There is software still in widespread use that not only fails to display code points beyond U+FFFF (like most emoji), but destroys data when such code points are input. These limitations demonstrate lack of imagination and failure to plan ahead for contingencies or learn from experience.
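
    For the mechanics behind that U+FFFF boundary, a small sketch of how a supplementary-plane character becomes a surrogate pair in UTF-16:

        # Code points beyond U+FFFF need two 16-bit code units ("surrogates"),
        # which is exactly what older 16-bit-only software mishandles.
        poo = "\U0001F4A9"   # PILE OF POO, outside the BMP

        utf16 = poo.encode("utf-16-be")
        units = [utf16[i:i+2].hex() for i in range(0, len(utf16), 2)]
        print(f"U+{ord(poo):04X} -> UTF-16 code units: {units}")   # ['d83d', 'dca9']
        # Software that counts 16-bit units sees two "characters" here (and may
        # truncate or drop one), even though it is a single code point.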

    In principle there might be nothing wrong with assigning code points for things like "butterfly" (U+1F98B), but then the logical, consistent, and responsible thing to do is to anticipate similar encoding for millions of other species of life. The same goes for other categories of things: foods, occupations, religions, skin colors, and so forth.

    Unfortunately, the people in charge are dead set against planning for encoding beyond U+10FFFF, even while they keep blundering ahead and filling the remaining code points with an arbitrary mix of emoji, no matter how childish, banal, racist, sexist, etc., and above all, limited.

  33. (Eric) said,

    October 30, 2017 @ 4:09 pm

    Andrew Usher:

    Emoji is not a “hybrid word”; it is a Japanese word, 絵文字, meaning “picture character(s).”

    Interesting tidbit: the word has been around since at least the Edo period.

    They used emoji-like pictures to teach prayers to those who could not read or write. They also used pictures in quizzes… It is e (pictures) + moji (letters).

    (I, too, initially assumed it was derived from English emoticon or similar, based on the model of e.g. kanji, romaji.)

    Also, I, personally, use emoji as a plural, but I’m not certain emojis is incorrect; cf. ninjas, tsunamis.

  34. Michael Erard said,

    October 31, 2017 @ 1:29 pm

    Thanks, Victor, for the attention to my article and for everyone's comments. As to the question of why Han Unification wasn't mentioned, it's because there just wasn't space. As for the rhetorical question of the subhead, writers don't write these. And as for decisions the magazine made about emoji vs. emojis, the secret is there was no pre-existing style rule to follow, so whatever you see was, I believe, ad hoc.
