وزارة-الأتصالات.مصر leads the non-Latin charge


The first Internet domain names using non-Latin characters are being rolled out, a plan put into motion after approval from the Internet Corporation for Assigned Names and Numbers (ICANN). Arabic-speaking nations are the first to reap the orthographic benefits, with new country codes available for Egypt (مصر), Saudi Arabia (السعودية), and the United Arab Emirates (امارات). The Egyptian Ministry of Communications and Information Technology, previously online at <http://www.mcit.gov.eg/>, is blazing the trail with its new URL:

<وزارة-الأتصالات.مصر>

Not everything is fully worked out with the new system, though. Browsers that aren't yet up to speed on non-Latin domain names will render the addresses as Latinized gobbledygook. The Egyptian Communication Ministry's Arabic-script URL, for instance, currently resolves to <http://xn----rmckbbajlc6dj7bxne2c.xn--wgbh1c/>. That's not very communicative.

[Update: See the very helpful comments below for an explanation of the Latinized encoding.]
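
[For readers who want to reproduce the Latinized form themselves, here is a minimal sketch using Python's built-in `idna` codec. It assumes the older IDNA 2003 rules that the codec implements, which may not match every registry's current policy, and the variable names are purely illustrative.]

```python
# The ministry's domain, spelled with escapes so that no invisible
# directional marks sneak in: وزارة-الأتصالات.مصر
domain = (
    "\u0648\u0632\u0627\u0631\u0629-"
    "\u0627\u0644\u0623\u062a\u0635\u0627\u0644\u0627\u062a"
    ".\u0645\u0635\u0631"
)

# Each dot-separated label is Punycode-encoded and given the "xn--" prefix.
ascii_form = domain.encode("idna")
print(ascii_form.decode("ascii"))

# Decoding the ASCII form recovers the Arabic spelling.
round_trip = ascii_form.decode("idna")
print(round_trip == domain)
```

The first label contains a literal hyphen, which Punycode copies through ahead of its own delimiter hyphen, so the encoded label begins with four hyphens after "xn" rather than two.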



20 Comments

  1. Dan T. said,

    May 6, 2010 @ 1:24 pm

    Interesting… in Firefox, some internationalized URLs, such as Wikipedia pages, actually do show up in the address bar as foreign characters, though these are in the path portion rather than the domain name.

    Both domains you reference actually still work, with the Latin-character version going to the English version of the site and the Arabic-character version going to the Arabic one.

    The internationalized top level domain, when rendered in ASCII characters as encoded for DNS use, happens to contain "WGBH", which is a TV and radio station in Boston.

  2. Jonathan Badger said,

    May 6, 2010 @ 1:28 pm

    What I thought was odd is that in the BBC screenshot of an Arabic domain, the http:// is still on the left and written left-to-right. Wouldn't it be awkward that way? That would be at the *end* of the domain name and in reverse layout from the Arabic perspective.

  3. Josh said,

    May 6, 2010 @ 2:09 pm

    The xn-- is called the "ASCII Compatible Encoding (ACE) prefix". It marks what follows as Punycode, a method of encoding Unicode text as pure ASCII suitable for DNS lookup.

    Some relevant links:
    http://en.wikipedia.org/wiki/Internationalized_domain_name
    http://en.wikipedia.org/wiki/Punycode

  4. mp said,

    May 6, 2010 @ 2:28 pm

    Firefox at least is up to speed on the non-Latin domain names, but for fear of domain name spoofing, each top-level domain has to be whitelisted in the configuration before such names are displayed: http://kb.mozillazine.org/Network.IDN.whitelist.*

  5. naddy said,

    May 6, 2010 @ 2:36 pm

    Showing the "Latinized gobbledygook" is actually a security measure. There is substantial concern over people mimicking popular domains by replacing, say, Latin characters with similar looking Cyrillic ones. While domain registrars are supposed to deny things like "micrоsoft.com", it's not clear we can (or want) to rely on that.

    Oh, and the above expands to "xn--micrsoft-qbh.com"—would you have noticed?

  6. Bob Ladd said,

    May 6, 2010 @ 2:37 pm

    @Jonathan Badger: There are analogous reversals to the out-of-sequence placement of http: in both the Arabic and Roman alphabets. In English we write e.g. $422 but say four hundred and twenty-two dollars. And in Arabic writing, numbers written in digits (like the equivalent of 422) are written L-R even when they occur in the middle of an otherwise R-L line of text.

  7. John Cowan said,

    May 6, 2010 @ 2:37 pm

    Dan T.: The host name and pathname parts of a URL are internationalized using entirely different conventions. The pathname is represented as UTF-8, and then all non-ASCII bytes are converted to triplets of the form %xx, where xx are hex digits. This won't work for host names, where only letters, digits, and hyphen are allowed, so the special-purpose Punycode encoding is used, preceded by the unlikely prefix "xn--". So it's not surprising that existing browsers support the first but not the second.

    Jonathan Badger: The URL is not pure Arabic, but mixed Arabic/Latin, and it's always possible in principle to read such mixed text in two different ways. For example, "الَارَب = the Arabs" could be an Arabic equation with embedded English or an English one with embedded Arabic, depending on the surrounding context. In default of further information, the Unicode bidirectional rendering algorithm uses the first strongly directional letter to set the overall direction of the text. In this case the first letter is "h" and overall left-to-right reading is assumed, so "http://" appears at the left end. In any case, the Latin will always remain locally left-to-right and the Arabic will remain locally right-to-left, which means your eyes have to jump around when reading such text.

    I foresee interesting times when Mongolia begins to register internationalized domains.

  8. mp said,

    May 6, 2010 @ 2:43 pm

    Firefox has an option in the Edit menu to switch the default text direction in input boxes. For the URL bar with the mixed Latin/Arabic URL it gives me something that looks like ar/default.aspx/موقع.وزارة-الأتصالات.مصر//:http

  9. James C said,

    May 6, 2010 @ 10:56 pm

    So, it looks like the domain for Egypt is the full Arabic word for the country, for Saudi Arabia it's "Saudi", and for the UAE it's "emirate". Are abbreviations and acronyms not the done thing in Arabic?

  10. Will Steed said,

    May 7, 2010 @ 2:29 am

    Are abbreviations and acronyms not the done thing in Arabic?

    Not really, no. Acronyms aren't uncommon, but I've never seen much in the way of abbreviation in Arabic.

  11. Philip Newton said,

    May 7, 2010 @ 3:48 am

    James: The domain for UAE is "emirates", plural.

  12. Nick Lamb said,

    May 7, 2010 @ 5:35 am

    “While domain registrars are supposed to deny things like "micrоsoft.com", it's not clear we can (or want) to rely on that.”

    Specifically, what happened is that the standards body said that registries which are properly managed should have a specific policy on which characters are allowed, and have mechanisms in place to ensure that users aren't confused. Many important TLD registries soberly created such policies and put such mechanisms in place; some simply declined to register IDNs. The .com registry instead decided that a free-for-all would maximise their profits at zero cost (well, zero cost to them; obviously for everyone else it means a marked increase in fraud, but why should they care?)

    Rather than say outright "the .com registry is rogue" several browser vendors hit upon the idea of checking whether registries obey the standard and then whitelisting those that do. Domains from these registries are shown "correctly" and the rest aren't, thus preventing fraud and "punishing" non-compliant registries. If Egypt's registry is well-managed you'll see it added to the whitelist in a future update to your web browser.

    The funny thing about the .com registry is that it's basically an ill-reputed "place": famously ill-managed, overrun with criminal activity, and yet somehow it became established as de facto the right place to register big institutions, banks, retailers and so on. It's as if every bank suddenly closed their other branches and opened one in the dimly lit side street where pickpockets lurk.

    In theory the US government claims ownership of the .com TLD and it is managed on their behalf under contract so perhaps this is just good old American free enterprise.

  13. Bob Violence said,

    May 7, 2010 @ 7:48 am

    I foresee interesting times when Mongolia begins to register internationalized domains.

    Mongolia uses a left-to-right Cyrillic script — it shouldn't pose any more difficulties than Russian. The older vertical script is only official in Inner Mongolia and I'm guessing it'll be a while before ICANN approves domain names in that form.

  14. Frans said,

    May 7, 2010 @ 8:17 am

    The Arabic URIs work fine in Opera.

    @Bob Ladd:

    And in Arabic writing, numbers written in digits (like the equivalent of 422) are written L-R even when they occur in the middle of an otherwise R-L line of text.

    I have absolutely no knowledge of Arabic, but in Dutch, German, as well as older English novels (see writing by George Eliot for instance), 422 would be said as four hundred two-and-twenty. As such it would make perfect sense in Arabic if read, from right to left, as the rough equivalent of two and twenty and four hundred.

    Speaking as a native speaker of Dutch, while growing up I used to think that the way we write our numbers goes somewhat against our natural way of speaking precisely because they were written in an Arabic right to left construct.

    If you happen to know how this is actually said in Arabic I would love to hear it, of course. These are just my uninformed observations.

  15. Rodger C said,

    May 7, 2010 @ 9:34 am

    @Frans: I believe you're right that Arabic says "two and twenty and four hundred."

  16. Dan T. said,

    May 7, 2010 @ 10:06 am

    @Nick Lamb: Not only that, but even entities that don't particularly belong in .com in the first place, like nonprofit organizations, often idiotically put their websites in .com addresses (instead of more logical ones like .org).

  17. Terry Collmann said,

    May 7, 2010 @ 3:04 pm

    Bob Ladd/Frans/Rodger C: indeed, Arabic numbering only looks the "right way round" to Westerners because Arabic speakers say numbers in the order units/tens/hundreds but write them from right to left, which looks the same as saying them in the order hundreds/tens/units but writing them from left to right … telephone numbers, however, ARE written "first number on the right".

  18. Terry Collmann said,

    May 7, 2010 @ 6:34 pm

    Eh, I mean "first number on the left" (sorry – the left/right dyslexia coming through)

  19. John Burgess said,

    May 8, 2010 @ 10:46 am

    Arabic pronounces numbers in the same order as do western languages: 2010, for instance, would be alfain wa ashra, i.e., "two thousand and ten". (Dual form of alf, 1,000).

    But… units are generally given before tens: The number 422 would be arba' mia, ithnain wa ashreen, i.e., "four hundred, two and twenty".

    The year 1999 would be alf wa tis'a mia, tis'a wa tis'aeen, i.e. "one thousand and nine hundreds, nine and ninety".

  20. Frans said,

    May 8, 2010 @ 12:21 pm

    @John Burgess: That's very similar to Dutch, then. Flemish Dutch in particular.

    When I learned that Arabic numerals were actually Hindu-Arabic numerals, and also how complicated it is to say numbers over 70 in French (albeit not in Wallonian French), I realized that an efficient method of writing down numbers might very well need to be at least somewhat independent of the rest of the language; otherwise you might end up with something like Roman numerals (which mostly preserve the inefficiencies of the spoken word).
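
[A footnote to comments 5 and 7: the two encodings John Cowan describes, percent-escaped UTF-8 for the path and Punycode for the host, can be tried out with Python's standard library. What follows is a rough sketch, not a description of how any particular browser does it. The helper name `to_ascii_url` is made up for illustration, and the built-in `idna` codec follows the older IDNA 2003 rules.]

```python
from urllib.parse import quote


def to_ascii_url(host: str, path: str) -> str:
    """Hypothetical helper: build an ASCII-only http URL from Unicode parts."""
    # Host name: Punycode per label, marked with the "xn--" prefix.
    ascii_host = host.encode("idna").decode("ascii")
    # Path: UTF-8 bytes, with each non-ASCII byte written as %XX.
    ascii_path = quote(path, safe="/")
    return "http://" + ascii_host + ascii_path


# An Arabic path segment (مصر, "Egypt") on an ordinary Latin host.
wiki = to_ascii_url("ar.wikipedia.org", "/wiki/\u0645\u0635\u0631")
print(wiki)

# naddy's look-alike from comment 5: the first "o" is Cyrillic U+043E.
spoof = to_ascii_url("micr\u043esoft.com", "/")
print(spoof)
```

The Latin host passes through untouched while its Arabic path gains %XX triplets. The look-alike host is the reverse case: its ASCII letters are kept in order and the one Cyrillic letter is folded into the suffix after the final hyphen, which is why the encoded form gives the spoof away.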
