Agbègbè ìpàkíyèsí


According to a recently-released glossary, that's the official Yoruba translation of "notification area", which is "the area on the right side of the Windows taskbar [that] contains shortcuts to programs and important status information".

About four years ago, I discussed an article in the NYT that dealt (in a confused and confusing way) with issues of endangered language preservation, mother-tongue literacy, and computer access in Africa ("African language computer farrago", 11/13/2004). The featured project was Tunde Adegbola's work with the African Languages Technology Initiative (ALT-i).

A post on the Yoruba Affairs newsgroup, which I subscribe to, recently announced that (a draft of?) the Yoruba Glossary for Microsoft's Language Interface Pack has just been released, as a partnership between ALT-i and Microsoft Unlimited Potential (whose acronym is, of course, "UP", not "MUP"). At 196 pages and 2000-3000 terms, this is a substantial document.

In response to my 2004 post about the confused NYT article, Bill Poser added some background about localization efforts in general, and registered a complaint about Microsoft "not localizing their software when they didn't see enough profit in it". But in fairness to Microsoft, they've had a large and effective localization effort for many years. They've certainly done much more than other computer companies have done, and in this case, perhaps more than the free software community has done.

Wazobia Linux is a distribution with (some programs?) localized in Yoruba, Hausa and Igbo. But it is apparently not actually free — only a demo version can be downloaded from the company's site, and those interested in the full version are invited to contact the company by email to discuss prices. The "where to buy" link is "currently under construction", and the Wazobia page at DistroWatch.com characterized this distribution as "dormant". I don't know of any other Linux distributions with a significant amount of localization in Yoruba — for example, the Yoruba pages for KDE localization and for Mandriva Tools localization don't show very much progress.

Joe Wilcox at eWeek explains why Microsoft might want to get out ahead of the OSS movement in serving less developed regions, quoting Roger Kay of Endpoint Technologies:

It's worth noting that selling software at $3 a pop to the next billion is pretty good business …

And of course, not all software sold in the developing world will be (or is) so inexpensive. On this analysis, Microsoft Unlimited Potential is a response to the possibility that the open-source software movement and initiatives like OLPC might otherwise succeed in locking up large potential markets.

My own opinion is that this competition is very much to the benefit of the people whose future software and hardware use is in play — though this opinion depends on the judgment that none of the competitors is likely to succeed in driving all the others out of the game.

[And while we're discussing software support for the world's languages, let me note that the latest Firefox on the latest Ubuntu Linux — which I'm using at the moment to enter this post — still does the Wrong Thing in rendering Unicode 0323 "COMBINING DOT BELOW", as I discussed long ago ("Them old diacritical blues again", 3/21/2004).

This is one of the tiniest of the many problems in text rendering caused by the Unicode Consortium's policy of "Convenience for the Wealthy, Virtue for the Poor". Microsoft solved this simple problem years ago, and has since dealt with most of the long tail of more difficult issues in complex rendering. It's nothing short of a scandal that software providers can't get their act together to deal with the relatively simple problem of rendering diacritics on Latin-alphabet characters. (Yes, I know that some software packages do the right thing. But it's discouraging that so many software/OS/font combinations don't.) ]

[Update: Stephen Norris has written to me to point out that the wrong treatment of Unicode 0323 "COMBINING DOT BELOW" by some browser/font combinations, in my post "Them old diacritical blues again", 3/21/2004, is in fact mostly the fault of the fonts — some fonts insist on putting the dot in the wrong place in ways that the rendering program is pretty much powerless to fix, probably as a result of years of accommodating rendering programs that do nothing at all. But this just confirms my opinion that we are still seeing the trail of damage from the Unicode consortium's hypocritical policies about combining diacritics on Latin characters — to provide precomposed versions for sufficiently powerful constituencies, while insisting that others must be virtuous and wait for rendering engines to get around to doing the combinations on the fly, in the Right Way. Since the powerful constituencies don't need such rendering, software and font developers have little incentive to get around to doing things the Right Way. For a user's perspective on this situation, see the section under the heading "Why Nanos uses precomposite Greek characters in preference over composite ones" in the documentation for the Greek text editor Nanos (towards the bottom of the cited page).]
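
By way of illustration, here's a quick check with Python's unicodedata module (a minimal sketch) of what the character-level situation looks like: even after NFC normalization, a Yoruba vowel carrying both an underdot and a tone mark still ends up with a combining accent, because no fully precomposed codepoint for that combination exists.

    import unicodedata

    def nfc_codepoints(s):
        # Return the NFC form of s as a list of U+XXXX codepoint labels.
        return [f"U+{ord(c):04X}" for c in unicodedata.normalize("NFC", s)]

    # "o" plus COMBINING DOT BELOW composes to a single precomposed character.
    print(nfc_codepoints("o\u0323"))        # ['U+1ECD']

    # Add an acute tone mark and NFC can only compose the dot below; the
    # acute is left as a combining character that the renderer and the
    # font must still position correctly.
    print(nfc_codepoints("o\u0323\u0301"))  # ['U+1ECD', 'U+0301']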



32 Comments

  1. Licia Corbolante said,

    December 15, 2008 @ 11:36 am

    An additional resource: the Yoruba Style Guide can be downloaded from the Microsoft Language Portal.
    It's a document containing guidelines for the localization of Microsoft products into Yoruba.
    Terminology and style guides for most other languages in the Microsoft Unlimited Potential Program are also available.

  2. Jakob said,

    December 15, 2008 @ 12:15 pm

    I actually get the dot where it belongs. See http://img148.imageshack.us/my.php?image=yorubagq0.jpg

  3. Tim said,

    December 15, 2008 @ 1:13 pm

    Clicking through to the 2004 Atilla the Hun post here on Firefox 3.0.4 on Windows, I see the underdots rendered in the right places. Which, since they were intentionally typed incorrectly in the first place, means that the problem still hasn't been fixed for Windows Firefox, either.

  4. Mark Liberman said,

    December 15, 2008 @ 2:22 pm

    @Jakob, @Tim: Trying this on a variety of browsers on a variety of machines in a variety of operating systems with a variety of fonts, I get a wide range of correct and incorrect behaviors.

  5. Tom Vinson said,

    December 15, 2008 @ 2:45 pm

    I'm running the same browser as Tim (Firefox 3.0.4 on Windows XP). The sequence "sạr" produces a dot centered under the "a". "saṛ" puts the dot under the right-hand edge of the "a". Substituting "á" for "a" in the first example moves the dot to the left-hand edge of the "a" (the accent is over the "a" as expected).
    Internet Explorer 7 gives the same results for the first example. But adding the acute accent superimposes the "a" and the "r". The second example, "saṛ" superimposes them with or without an accent. (If I use the Firefox IE Tab option, the result looks like Firefox rather than IE.)

  6. Tom Vinson said,

    December 15, 2008 @ 3:02 pm

    Erratum: the behavior I reported for IE7 happens only when you are zoomed in. If I use the default font size setting (-0) my test page looks the same in Firefox, IE7 and Google Chrome. So it looks like the different browsers are at least trying to do the same thing.

  7. mollymooly said,

    December 15, 2008 @ 3:15 pm

    There are a couple of Mozilla bugs that seem to be relevant to the erratic Firefox behavior: 197649 and 85373. Apparently Mozilla uses Pango.

  8. ᛏᚦ » Blog Archive » I think we should have an Igbo translation said,

    December 15, 2008 @ 3:36 pm

    […] Language Log claims it's not free but I assume this means "you have to pay for it" rather than "it's breaking the GPL". So, maybe we can find someone who's bought a copy and get the .po files from them and merge them upstream? LL claims it's dormant, so we should make an effort to rescue the translations before they vanish forever. […]

  9. Mark Liberman said,

    December 15, 2008 @ 3:50 pm

    Since the results apparently depend on font selection and various other parameters as well, and pages on the old Language Log site are embedded in some Movable Type stuff that might affect things, I thought I'd set up a simpler test page for simple combinations of a latin base character with a couple of combining diacritics (here).

    The results that I get from a few browser/OS combinations around the house are now a bit different: Firefox (3.0.4) works (!) on Ubuntu Linux (8.04); Firefox (3.0.4) works on Mac OS X (10.4.11); Firefox (3.0.4) works on Windows Vista; Internet Explorer (7.0.6001.18000) works on Windows Vista; Safari (3.2) fails (!) on Mac OS X (10.4.11).

    Here's a screenshot of the bad rendering by Safari on the Mac:

    Here's the better rendering by Firefox on the Mac — the acute accents are still not placed very well, but it's not too horrible:

    Here's a screenshot of Firefox in Ubuntu:

    In all cases, as far as I know, the font and other relevant browser settings are "out of the box", i.e. not changed by me.

    This is a good deal better than my experience in earlier years — Firefox 3.0.4 works on this test page in all three OS contexts. IE works. Only Safari fails. Why Firefox still fails on the old LL pages is not clear to me.

    The thing that's consistent with my earlier experience is that I can't simply depend on simple character-combination working across browser/OS/phase-of-the-moon combinations; and I can't easily predict when it's going to work and when it isn't.
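
    For anyone who wants to try the same thing locally, a few lines of Python along these lines will generate a bare-bones test page (a sketch only; the character choices and the file name here are illustrative, not the ones on the actual test page):

        import unicodedata

        # A few Latin base letters with combining marks, shown both fully
        # decomposed and NFC-composed, so the two spellings can be compared
        # side by side in any browser.
        samples = ["o\u0323\u0301", "e\u0323\u0300", "a\u0301\u0323"]

        rows = []
        for s in samples:
            nfc = unicodedata.normalize("NFC", s)
            cps = " ".join(f"U+{ord(c):04X}" for c in nfc)
            rows.append(f"<tr><td>{s}</td><td>{nfc}</td><td>{cps}</td></tr>")

        html = ("<!DOCTYPE html><html><head><meta charset='utf-8'>"
                "<title>Combining diacritics test</title></head><body>"
                "<table border='1'><tr><th>decomposed</th><th>NFC</th>"
                "<th>codepoints</th></tr>" + "".join(rows) + "</table></body></html>")

        with open("diacritics-test.html", "w", encoding="utf-8") as f:
            f.write(html)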

  10. Doug said,

    December 15, 2008 @ 4:49 pm

    I'm using Safari 3.2.1 under Mac OS X 10.5.5 and I see your test page rendered quite well. The 1st and 3rd examples have the dot slightly off-centre but the acute is spot-on while in the middle example the dot is placed correctly but the acute appears to touch the 'o'.

    [(myl) Perhaps it's the difference in Safari and OS releases; perhaps you're using a different default font; the depressing thing is that everything seems to matter, and nothing seems to work consistently.]

  11. Bruce Cowan said,

    December 15, 2008 @ 6:44 pm

    AFAIK, Firefox doesn't officially support Pango. I think a patch is applied by Debian and Ubuntu to do it. Perhaps things are different with Gecko 1.9 though.

    If you use programs that use Pango natively (GNOME ones such as gedit), dọ́t renders fine.

  12. Matthew Stuckwisch said,

    December 15, 2008 @ 7:16 pm

    I couldn't copy and paste the HTML sample into InDesign though I could reconstruct it without problem by manually inserting the glyphs. In most fonts I tried, the underdot diacritic was missing, but it was properly positioned with the boxed X glyph. In the font I'm developing, it appears as expected.

    What happens is that Mac OS X's text engine (of which there are many versions; TextEdit, oddly, uses the most advanced one on a default install of OS X) prefers precomposed diacritics. Therefore in TextEdit the last one will properly have the o-acute displayed, and the underdot is then placed more or less randomly (horizontally speaking). Vertically it is placed as specified in the font used, which may create differences in vertical placement depending on how the font metrics are done. However, any combination that does not exist precomposed in the font (be that as a unicode entity or as a precomposed ligature) seems to barf more often, and since Safari just flat out doesn't do mark positioning, you get the correctly placed o with acute (because it exists precomposed) but the not-well-placed dot (because of no mark positioning). Safari has a very rudimentary text layout engine; this is probably to keep rendering snappy, but it comes with a reduction in quality.

  13. Mark Liberman said,

    December 15, 2008 @ 8:08 pm

    Matthew Stuckwisch: … any combination that does not exist precomposed in the font (be that as a unicode entity or as a precomposed ligature) seems to barf more often, and since Safari just flat out doesn't do mark positioning, you get the correctly placed o with acute (because it exists precomposed) but the not-well-placed dot (because of no mark positioning).

    If this is the correct explanation, and is still true of many mainstream programs in 2008, then I'm even more appalled at the Unicode Consortium's decades-long policy of refusing to admit even a small number of additional precomposed characters for the convenience of countries like Nigeria.

  14. Matthew Stuckwisch said,

    December 15, 2008 @ 9:25 pm

    Mark Liberman: If this is the correct explanation, and is still true of many mainstream programs in 2008, then I'm even more appalled at the Unicode Consortium's decades-long policy of refusing to admit even a small number of additional precomposed characters for the convenience of countries like Nigeria.

    This would still require font developers to provide a glyph for the precomposed characters, which doesn't always happen. If you make a font with combining marks and all other font metrics but without the precomposed glyphs explicitly included in the font, Safari will not even be able to do an acute-a. The fact that TextEdit provides almost-correct behavior and other programs like InDesign provide correct behavior when using fonts that are designed for such use goes to show that it's not a problem with Unicode, but rather a problem with font rendering systems.
    Almost all of the precomposed glyphs are there only for compatibility's sake. Korean could have taken up MUCH less space (in the standard, I mean) had they just stuck with jamo and left composition to rendering systems/font developers, but to provide 1-1 conversion to previous standards, they brought in the precomposed version. (note that not all possible jamo combinations are actually found in the precomposed section). The same with presentation forms like the fi ligatures, or Arabic positional forms. I don't think we'd have these problems if there were no precomposed forms, but, I don't think Unicode would be the standard it is if it weren't for its backwards compatibility.
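
    To make the jamo point concrete, here is a quick illustration with Python's unicodedata module (a minimal sketch): NFC composes a decomposed jamo sequence into a precomposed Hangul syllable algorithmically.

        import unicodedata

        # Decomposed Hangul jamo: initial kiyeok (U+1100) + vowel a (U+1161).
        jamo = "\u1100\u1161"

        # NFC composes them into the precomposed syllable U+AC00, the first
        # character of the Hangul Syllables block.
        syllable = unicodedata.normalize("NFC", jamo)
        print(f"U+{ord(syllable):04X}", unicodedata.name(syllable))
        # U+AC00 HANGUL SYLLABLE GA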

  15. James Chittleborough said,

    December 15, 2008 @ 9:25 pm

    "Mainstream"? Hell, Apple sells its stuff as being the top of the market. And as getting little things like presentation and style right.

  16. Philip Newton said,

    December 16, 2008 @ 3:55 am

    Korean could have taken up MUCH less space (in the standard, I mean) had they just stuck with jamo and left composition to rendering systems/font developers, but to provide 1-1 conversion to previous standards, they brought in the precomposed version. (note that not all possible jamo combinations are actually found in the precomposed section).

    The number of combinations used to be even smaller in Unicode 1.0; as I understand it, the precombined characters (now that they've been shuffled over to a different block in the "Great Korean Unicode Fiasco", or whatever it was called) now include all possible combinations of modern Korean Jamo, even syllables that aren't attested.

    (You're right that it doesn't include all possible combinations of all Jamo, just those which exist as beginnings or middles or endings in modern Korean.)

  17. Mark Liberman said,

    December 16, 2008 @ 5:37 am

    Matthew Stuckwisch: I don't think we'd have these problems if there were no precomposed forms, but, I don't think Unicode would be the standard it is if it weren't for its backwards compatibility.

    Right, those are the standard arguments for insisting on generative typography except for "backwards compatibility" — but the result, all the same, is that there are dozens of languages and hundreds of millions of people whose native orthographies are still being randomly and routinely screwed up by mainstream computer programs, including web browsers.

    The "backwards compatibility" loophole ensures that the languages with economic clout get what they need, even if that requires hundreds or thousands of extra precomposed characters; it means that software writers have little incentive to do rendering right, since the big markets are covered; and a language like Yoruba, which would need a half a dozen or so additional code points, remains out in the cold.

    Cover stories aside, the principle is clear: convenience for the wealthy, and virtue for the poor.

  18. Nick Lamb said,

    December 16, 2008 @ 8:01 am

    Mark, I don't think you're being very fair here. Not least you've arbitrarily blamed a technical committee for the commercial decisions made by a number of huge American corporations. The committee's decision makes sense technically, and your alternative solution is, frankly, bogus. From the outset this wasn't about politics except in the very loose sense that a workable solution had to be acceptable to enough people to get deployed. There are perhaps a dozen alternatives to ISO 10646 which don't have that buy in, and so were stillborn or are seen only in obscure niche products. Your approach would undoubtedly have resulted in such a stillborn standard.

    Adding "a half dozen or so" codepoints for yet more pre-composed characters doesn't solve your problem, you still need to get software to render these characters, which are used only in Yoruba which they apparently don't care about. It's the same work needed, and the same reasons will apply for it not getting done, none of which have anything to do with the Unicode consortium or the technical committee controlling 10646.

    Let me be quite clear: From a software engineer's point of view, rendering three Unicode codepoints as a single composed element is no different from rendering a single Unicode codepoint as a single composed element using three separate font glyphs. So if you want rendering Yoruba to be as easy and thus as likely to be correct as French (it's never going to be as easy as unaccented American English) then you're asking everyone not to just accept a "half dozen or so" (per language/ writing system/ whatever) extra codepoints, but also to pay their typeface design crew to design a "half dozen or so" extra custom pre-composed glyphs. Unless you legislate (and I don't think you're quite foaming at the mouth enough to think that's a good idea) it won't happen, no matter what the Unicode consortium or the 10646 committee decides to do.

    So, with that clear, let me say that I expect the underlying situation (OS built-in support for rendering text that's not American English) to continue to improve, and that it's not about market size in dollar revenue, it's about politics. If Africans make a big deal about this software that does a poor job of supporting African writing systems and volunteer to help it'll get fixed, if they put up with it, no-one will raise a finger to help them.

    Want proof? Microsoft Windows and Office are massive sellers in the UK. As you well know, American English and British English are mutually understandable but far from identical, particularly in spelling. But Microsoft doesn't make a British English translation of their products, and doesn't support 3rd party translations. You can get it in Welsh though, Microsoft hired people specially to work on that. Because the Welsh politicians kick up a fuss, and their English counterparts don't care. Scarcely anybody uses Windows in Welsh, it's not worth a dime to Microsoft to do this, but it makes PR sense.

    You might want to correct your main article to point the finger at Apple, a large proprietary software company, rather than at "OSS", since as many readers have pointed out, this stuff works for them in e.g. Firefox and doesn't work properly in Apple's products like Safari. But the truth is it's complicated. This is about language, you ought to know by now that the truth is always "I think you'll find it's a bit more complicated than that" (there's a T-shirt with that slogan – good for wearing when those around you are waving banners which propose unrealistic solutions to hard problems).

  19. Mark Liberman said,

    December 16, 2008 @ 8:19 am

    Nick Lamb: "I think you'll find it's a bit more complicated than that".

    I know that the problem is a complex one; and I know from long experience that I'm not going to persuade any Unicode partisans; and I know that things are gradually getting better, in that a gradually increasing proportion of software packages render a gradually increasing proportion of combining characters correctly, a gradually increasing proportion of the time.

    But I also know the simple rendering of basic diacritics on Latin characters still fails to work in a significant fraction of mainstream software products; and I know that the Unicode principle of refusing to add precomposed characters has been applied almost exclusively against those without the economic or political clout to override it; and I can see what the obvious result has been, and continues to be.

    There's a word for that kind of "principle": hypocrisy.

  20. Nick Lamb said,

    December 16, 2008 @ 8:24 am

    Oh, and in my spare time I maintain software explicitly for testing this type of bug in Pango (the most comprehensive text rendering engine presently available in Free Software AFAIK). In my experience there is tremendous variation by font. There are plenty of fonts available that you can't use to write French properly, let alone any language with multiple diacritics on a single character. If the rendering engine uses "fallback" strategies (which Microsoft's i18n people used to be strongly against) then you can read French in these fonts, but it looks like a cartoon ransom note because glyphs are borrowed from other fonts you have installed which do support French but are of a different style. For day-to-day use this would be just as unacceptable as having all your diacritics misplaced. But the font's author probably thinks they did a pretty comprehensive job, because hey, they did put an ampersand in it plus a variety of punctuation and all ten digits. Such is life.

  21. Mark Liberman said,

    December 16, 2008 @ 8:35 am

    Nick Lamb: In my experience there is tremendous variation by font. There are plenty of fonts available that you can't use to write French properly, let alone any language with multiple diacritics on a single character.

    True enough. But for French, and Vietnamese, and Hungarian, and so on, everything is guaranteed to work as long as (1) you use Unicode; (2) you specify a font that contains all the needed precomposed characters; and (3) you use only precomposed characters in your texts — because for all of these languages, the Unicode consortium has added all the precomposed characters that are needed.

    This is not ideal, since things may still fail if your readers lack the needed font. But you can tell them where to get it, and how to use it; and life goes on.

    For the standard orthography of a language like Yoruba — established in the middle of the 19th century, and in widespread use by 25 million people or so — there's still no solution, because there's no way to write Yoruba in Unicode that doesn't require rendering of combining diacritics, and this still doesn't work in lots of mainstream software.

    People find ways to adapt, of course, by limiting their software use, or misspelling their texts, or using more ad hoc solutions like fonts with non-standard code-point assignments. But let's not pretend that the system is working well for them.
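
    One rough way to see the asymmetry is to check whether any combining marks survive NFC normalization for a given piece of text; the sketch below (the word list is just illustrative) returns False for French and Vietnamese examples and True for a Yoruba word with an underdotted vowel plus a tone mark.

        import unicodedata

        def needs_combining_marks(text):
            # True if the NFC form of text still contains combining characters,
            # i.e. if correct display depends on the renderer stacking marks.
            nfc = unicodedata.normalize("NFC", text)
            return any(unicodedata.combining(c) for c in nfc)

        print(needs_combining_marks("élève"))                  # False
        print(needs_combining_marks("Việt"))                   # False
        print(needs_combining_marks("O\u0323wo\u0323\u0301"))  # True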

  22. Thomas said,

    December 16, 2008 @ 8:35 am

    As an aside: has anyone else had the issue that this site is displayed in "Arial Stupid… eh, non-Unicode"?

    I read the comments in NetNewsWire on a Mac, and stumbled over boxes instead of the dot below.
    I checked back with Firefox 3.1b2 and WebKit's newest version (i.e. Safari, developer's build). All three had the same issue.
    And I'm pretty sure my fonts and preferences are fine. I could override this (well, not in Safari, tz) but I wonder how that happens.

    Btw, the test suite looks fine in all three browsers for me.
    I've entered a wide variety of Latin Unicode combinations in Safari for our typological project, and most of them look quite fine – well, using designated fonts like Charis. A recurring issue is i + diacritics above. They look more or less fine, but that 'happens to' major publishers, too – just search for ´ fusing with the dot.
    (Back then, people just drew their ´ on ɔ and ŋ by hand… cheaters!)

    Thomas

  23. Aaron Davies said,

    December 16, 2008 @ 8:48 am

    It's definitely a font thing in Safari. Safari 3.2.1, OS X 10.5.5 (yes I know .6 is out, I'll update just as soon as I'm done with this tab row of Language Log posts :) renders the test page perfectly in Verdana, sloppily in Times New Roman (the acute's placed poorly), and completely wrong in Times (only one diacritic goes on the letter, the other one goes as a separate "letter"). Unfortunately, I think Times is the default font. Luckily, Verdana seems to be the preferred font for (new site, at any rate) Language Log posts, so Safari should get things right, here, going forward.

  24. Thomas said,

    December 16, 2008 @ 9:00 am

    Ah, Verdana. Ok, my immediate problem is solved: I deactivated several fonts some months ago, this is the first time I encounter something needing Verdana.
    For some reason Mac OS considers "Arial Normal" the appropriate substitute – bad luck.

    Thomas

  25. Nick Lamb said,

    December 16, 2008 @ 9:54 am

    I think it's generally a bad idea to accuse people of hypocrisy unless you can pin down just what it is they did which is inconsistent with their stated goals and standards. It's perfectly possible for someone to do something which is entirely consistent with their stated goals and standards and you don't like the results, but that's not hypocrisy.

    The "life goes on" outcome where everybody you want to communicate with has to download and install special fonts doesn't sound like an improvement over the present situation to me, it's scarcely even an improvement over the situation with no universal encoding at all.

    I am intrigued though, 25 million people using Yoruba, for the entire period of modern computers. And no-one standardised an encoding (which would now be a "legacy encoding" qualified for the compatibility rule) to use this popular language with computers? How did this happen? Were the vast majority of these 25 million people illiterate?

  26. mollymooly said,

    December 16, 2008 @ 5:16 pm

    the Unicode principle of refusing to add precomposed characters has been applied almost exclusively against those without the economic or political clout to override it

    I know nothing about Unicode politics, but if the principle is "no precomposed characters except for backwards compatibility", then a consequence is that languages which did not have bespoke digital encodings in the pre-Unicode era will not get precomposed characters. It is plausible if regrettable that Yoruba and other languages widely used in print in the developing world would not have had such bespoke digital encodings, and so will be left out.

    So relevant evidence would be EITHER (1) Unicode adding a precomposed character which never existed in any earlier digital encoding, OR (2) Unicode refusing to add a precomposed character which did exist in some earlier digital encoding.

    If most/all instances of (1) are for rich-world languages, or if most/all instances of (2) are for developing-world languages, then Prof Liberman has a point. Otherwise, though one may oppose the principle, one cannot denounce the motives of those supporting it.

  27. Matthew Stuckwisch said,

    December 16, 2008 @ 11:17 pm

    Mark, one feature that Safari DOES support is ligatures (but not in remote fonts, only local ones), so even if Safari can't do mark positioning correctly, a font developer can take up the slack and create a required ligature in the font that precomposes what you need. This does not require Unicode to precompose anything, simply the font developer to decide that that combination is needed.

    This is closer to what Nick's point and mine were about. Even if the software supports mark positioning, you still have to rely on the font developer providing the hooks for it (and many developers only include above-marks, plus a few below-marks such as the cedilla, and even those mainly on vowels). There are very few fonts which support correct placement of any accent from the Unicode table over any letter.

    There are tons of precombined letters, but not all fonts support them all, so I don't think anything would change just because Unicode added more precomposed characters.

    This is something where you need a concerted effort between font developers and software developers, but whether Unicode changes its stance or not, I don't foresee that support for double- and triple-accented letters will improve. The two solutions are that either (a) software developers support combining marks (which is easier on font developers), including combination guessing for when positioning information isn't available, or (b) font developers spend extra time adding in glyphs for individual combinations.
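
    For what it's worth, a rough way to see what a particular font actually offers is to inspect it with the fontTools library. This is only a sketch: the font path is a placeholder, and the presence of a GPOS table doesn't by itself guarantee mark-positioning rules for these particular marks.

        from fontTools.ttLib import TTFont

        # Placeholder path: substitute any .ttf or .otf file on your system.
        font = TTFont("SomeFont.ttf")
        cmap = font.getBestCmap()  # maps codepoints to glyph names

        # Does the font ship glyphs for precomposed dot-below vowels,
        # and for the bare combining marks?
        for cp in (0x1ECD, 0x1EB9, 0x0323, 0x0301):
            print(f"U+{cp:04X} in font:", cp in cmap)

        # Does it carry the OpenType tables a renderer would need for
        # ligature substitution (GSUB) or mark positioning (GPOS)?
        print("GSUB:", "GSUB" in font, " GPOS:", "GPOS" in font)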

  28. Mark Liberman said,

    December 17, 2008 @ 7:40 am

    I think that there is a great deal of technical posturing on this point that ignores the historical realities of Unicode politics. With apologies for creating an overlong comment (at some point I'll move this to a separate post), I quote from the documentation for the Greek text editor Nanos, under the heading "Why Nanos uses precomposite Greek characters in preference over composite ones":

    In the early days of Unicode (long before it was adopted by the International Standards Organization) there was a variety of Unicode Greek which made use of so-called "composite characters". These are Greek characters consisting of two or more components, such as lowercase alpha with acute accent. Initially, operating systems and software packages used two or more Unicode characters to represent such composite characters (for example, one Unicode character for the lowercase alpha, and one more Unicode character for the acute accent). This approach (which was the one of Unicode version 1.0) was rejected by the Greek government on the grounds that it resulted in uncontrolled accents and other diacritics floating around on the computer screen and on printed paper, because the same Unicode accent character was placed over all vowel characters. For example, the acute accent that went over the lowercase alpha, was also placed over the lowercase iota, upsilon, and so forth. The results were ugly, and often completely unacceptable. Worse, they were unpredictable, because each operating system and software package was free to invent its own rules for placing the two (or more) components in relation to each other. After the intervention of the Greek national authorities, the Unicode organization abandoned the composite approach for Greek and replaced it with the precomposite approach that had long before been universally adopted for Latin-alphabet varieties such as French, Spanish or Czech. In the precomposite approach (since Unicode version 2.0), each Greek Unicode character is an independent unit, including all diacritics. For example, there is a character that contains both the lowercase alpha and the acute accent. This precomposite approach makes it possible for font designers to craft each character-and-diacritic(s) combination individually. Both in terms of shape as well as in terms of relative positioning, overall character width, and all other character and overall font parameters. So the precomposite approach makes it possible to design Greek fonts with the same professional quality as, say, fonts for French or Spanish, for which the precomposite approach has never been in question. It is only natural that Nanos should therefore follow national Greek requirements and international recommendations to only use precomposite Unicode characters for Greek. In fact, we use the precomposite approach rigorously for all scripts and languages, not just for Greek. The reasons are the same as those which led the Greek government to reject the composite approach in Unicode version 1.0: composite characters are aesthetically unacceptable and lead to technically unpredictable data. They are therefore in direct conflict with the aims of the International Standards Organization. Precomposite characters are not only perfectly and completely designed. They are also unambiguous, individual and fully documented units of writing that fulfil all requirements of research environments, digital archives and libraries.

    The plain historical fact is that the Greek government intervened to overrule the Unicode geeks' principled but entirely impractical insistence on composite characters, as did many other governments, NGOs and individuals. These interventions succeeded in some cases, and failed in others. The difference between success and failure was (mostly) not principle, but power. Hence my statement that the policy has turned out to be "Convenience for the wealthy, virtue for the poor"; and my assertion that technical arguments to the contrary are either hypocrisy or ignorance.

  29. Nick Lamb said,

    December 17, 2008 @ 8:07 pm

    The existence of ill-informed (or simply deceitful) rants about Unicode isn't anything new, Mark.

    It so happens that Unicode is a published standard, even back in the Unicode 1.0 days when the character set wasn't an ISO standard. So we can take the central claim from this rant, and check it against reality.

    There is a clear assertion that U+03AC simply didn't exist in Unicode 1.0 (despite having appeared in legacy encodings) and that it was introduced in Unicode 2.0 as a result of intervention from the Greek government. Yet the published Unicode documents record U+03AC from the first version of Unicode. It's there because of ISO 8859-7, a legacy standard for encoding Greek, which includes the same pre-composed character. Both versions of Unicode also include the equivalent non-pre-composed codepoints and (in some now obsolete form) the appropriate decomposition rule.
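
    The character-level part of this is easy to verify, for instance with Python's unicodedata module (a minimal sketch):

        import unicodedata

        ch = "\u03ac"
        print(unicodedata.name(ch))           # GREEK SMALL LETTER ALPHA WITH TONOS
        print(unicodedata.decomposition(ch))  # 03B1 0301 (alpha + combining acute)

        # The precomposed character round-trips through the legacy Greek
        # encoding ISO 8859-7, which is why it qualified for inclusion.
        print(ch.encode("iso-8859-7"))        # a single byte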

    The author isn't trying to tell you that the Greek government leaned on some geeks; he's actually excusing a misfeature of his software. A briefer summary would be "This software only works with NFC-normalised Unicode and you'll have to lump it". You can see on the same page a rant about OpenType, which claims falsely that unimplemented features will crash your computer – no doubt Nanos also doesn't work properly with OpenType features (it figures; OpenType features are very useful for producing good Unicode text rendering, e.g. supporting Japanese and Chinese variants of CJK unified Han characters in the same document).

    Standards bodies aren't above corruption, look at the underhanded methods used to push OOXML. But that's not how U+03AC got there, and if you're searching for hypocrisy I think you'll need to look much harder.

  30. Mark Liberman said,

    December 18, 2008 @ 8:46 am

    @ Nick Lamb: I agree that it's complicated to allocate blame for particular cases among Unicode code-point choices, font developers, software developers, and others; and the history is complex and not known to outsiders — though whatever the detailed history and the diplomatic excuses, it's surely a pattern that the character sets used by developed nations are included with precombined characters, while the precombined characters needed by less-developed nations have generally been refused.

    But the fact remains: MANY SIMPLE CASES OF COMBINING DIACRITICS WITH LATIN LETTERS STILL DON'T WORK, IN MAINSTREAM SOFTWARE/OS/FONT COMBINATIONS.

    And the fact remains: AS A RESULT, SENSIBLE PEOPLE STILL PREFER TO AVOID COMBINING CHARACTERS, FOR GOOD PRACTICAL REASONS.

    And similarly: LANGUAGES THAT NEED COMPOSITE CHARACTERS NOT IN UNICODE ARE STILL SCREWED.

    Yes, this is (very) gradually getting better; yes, someday it will all be fixed; but the decision to refuse the half dozen extra code points needed for Yoruba, or the hundred or so extra code points needed to solve this problem around the world, still stinks.

  31. Extra buttons - …for the adult in you said,

    December 22, 2008 @ 4:05 pm

    […] holes in their words– so the system will use a glyph from another font as a substitute.  This often leads to ugly rendering, though, even for names from a language as common and well-known as […]

  32. Thomas Thurman said,

    December 22, 2008 @ 4:52 pm

    Something I don't understand here, which perhaps someone will explain, is why there's a connection between the representation of, let's say, alpha-acute as either one or two Unicode codepoints and the way it's rendered. If it's represented as one codepoint, either the font supports it or it doesn't and you get a box or some ugly fallback character. If it's represented as two, and the font contains support for "alpha" and "combining acute accent", the combined result may be ugly. But surely the font can represent "alpha, combining acute accent" as one glyph– essentially a ligature, the way "f i" can be represented as one glyph. So using combining characters gives us the options of having the combined character in the font, having the character built up from parts in the font, or using a fallback, whereas having a single character gives us only the options of having the character or using the fallback. I don't see how this squares with the Greek government's objection.
