Copyright 1998 Jeroen Hellingman. May be freely copied and used, as long as I am informed of any corrections or additions you make, and of any software you produce using this information.
This is always work in progress. Comments, remarks and corrections are most welcome.
Unicode a.k.a. ISO 10646 covers all mayor scripts used in India today. However, the standard has several inconsitencies, short-comings and peculiarities, which need to be known to be handled correctly. This document pin-points the cave-ats, and goes into details on the convertion of ISCII into Unicode. Further, this document also includes a proposal for idioms to be used for rendering variants. These will be worked out to a complete conjunct table for each script, which will eventually appear as an appendix.
Note that because of the way Indic scripts are encoded in Unicode, you cannot change one Indic script to another by simply adding a constant to the character-code. You will need to get an acceptable transcription in another Indic script. To minimize the number of tables required, I propose to use the Devanagari block as a generic Indic block, to and from which all other scripts can be translated. I will give translation tables will in an appendix.
Cultural expected sorting also requires special algorithms. Merely sorting on code-point values will normally not be acceptable. The algorithms and sorting orders for each script will appear as an appendix.
Since many Indian typist may type letters "graphically" instead of phonetically, applications should recognise such graphically typed characters, and convert them to the correct character. A list of such graphically typed character sequences is included.
The Unicode standard explains rendering of Indic scripts with the examples of Devanagari and Hindi. However, an explanation of rendering should be included with each script, as each script has its own peculiarities, which sometimes require a specific 'idiom' for the specific script. Best would be to give a informative overview of each script, with a standarised set of conjuncts, alternative conjuncts and other aspects of rendering.
A default collating order needs to be defined for each script.
Sometimes vowel signs on vowels are used in primary school books, some dictionaries for minority languages, and for transliterating foreign languages into Indic scripts, and in historical usages. Hence, a rendering engine should support the application of vowel signs to full vowels.
In the Unicode standard, two part vowels can be decomposed into their constituent parts. I strongly disagree with this decision. For example in Tamil:
VOWEL SIGN O is declared equivalent with VOWEL SIGN E + VOWEL SIGN AA.
Although in Tamil this may look the same it is not logically the same, and the decomposition must be very much discouraged, as it may cause problems in searching, transliteration to other scripts where this is not the case, and sorting. For all these applications an extra step is needed to remove possible decomposed vowel signs. The problem arises in all scripts using vowel signs with parts before and after a consonant clusters.
This issue is related with graphical typing which is possible in Indic script processing, and causes the same type of problems. This is discussed below.
U+09CB BENGALI VOWEL SIGN O U+09CC BENGALI VOWEL SIGN AU U+0B48 ORIYA VOWEL SIGN AI U+0B4B ORIYA VOWEL SIGN O U+0B4C ORIYA VOWEL SIGN AU U+0B94 TAMIL LETTER AU U+0BCA TAMIL VOWEL SIGN O U+0BCB TAMIL VOWEL SIGN OO U+0BCC TAMIL VOWEL SIGN AU U+0C48 TELUGU VOWEL SIGN AI U+0CC0 KANNADA VOWEL SIGN II U+0CC7 KANNADA VOWEL SIGN EE U+0CC8 KANNADA VOWEL SIGN AI U+0CCA KANNADA VOWEL SIGN O U+0CCB KANNADA VOWEL SIGN OO U+0D4A MALAYALAM VOWEL SIGN O U+0D4B MALAYALAM VOWEL SIGN OO U+0D4C MALAYALAM VOWEL SIGN AU
Many Indian typist, especially those trained on Indian typewriters are used to type their texts in graphical order. Even on computers which support phonetic order typing and automatic vowel-reordering and ligating, it is often still possible to type in graphical order. For example Devanagari VOWEL SIGN O can be encoded as VOWEL SIGN AA + VOWEL SIGN E, and VOWEL AU as VOWEL A + VOWEL SIGN AA + VOWEL SIGN E, or VOWEL AA + VOWEL SIGN O, or VOWEL A + VOWEL SIGN AU. This are four possible spellings which will appear very much the same on screen, but complicate a text search or sort a lot. (In fact, most Devanagari fonts use such decomposition to render VOWEL SIGN AU, but that is a rendering issues which should not be confused with encoding.) Similar examples can be found in all scripts.
(I have once had 5 megabytes of Hindi text in Devanagari, typed by Gujarati typists, having all these artefacts of Gujarati script! Worse, they had also systematically confused the reph with candra e, giving me a headache to get something reasonable out of it)
The decompositions indicated here can best be removed from the representation before attempting to do such things as automatic transcription. The two-part vowels mentioned above are included in this table.
graphical character decomposition(s) name U+0906 0905 093E DEVANAGARI LETTER AA U+090D 090F 0945 DEVANAGARI LETTER CANDRA E U+090E 090F 0946 DEVANAGARI LETTER SHORT E U+0910 090F 0947 DEVANAGARI LETTER AI U+0911 0905 093E 0945 DEVANAGARI LETTER CANDRA O 0905 0949 0906 0945 U+0912 0905 093E 0946 DEVANAGARI LETTER SHORT O 0905 094A 0906 0946 U+0913 0905 093E 0947 DEVANAGARI LETTER O 0905 094B 0906 0947 U+0914 0905 093E 0948 DEVANAGARI LETTER AI 0905 094C 0906 0948 U+0949 093E 0945 DEVANAGARI VOWEL SIGN CANDRA O U+094A 093E 0946 DEVANAGARI VOWEL SIGN SHORT O U+094B 093E 0947 DEVANAGARI VOWEL SIGN O U+094C 093E 0948 DEVANAGARI VOWEL SIGN AU U+0986 0985 09BE BENGALI LETTER AA U+09CB 09C7 09BE BENGALI VOWEL SIGN O U+09CC 09C7 09D7 BENGALI VOWEL SIGN AU U+0A06 0A05 0A3E GURMUKHI LETTER AA U+0A07 0A72 0A3F GURMUKHI LETTER I U+0A08 0A72 0A40 GURMUKHI LETTER II U+0A09 0A73 0A41 GURMUKHI LETTER U U+0A0A 0A73 0A42 GURMUKHI LETTER UU U+0A0F 0A72 0A47 GURMUKHI LETTER EE U+0A10 OA05 0A48 GURMUKHI LETTER AI U+0A14 0A05 0A4C GURMUKHI LETTER AU U+0A42 0A41 0A41 GURMUKHI VOWEL SIGN UU (if stacking) U+0A86 0A85 0ABE GUJARATI LETTER AA U+0A8D 0A85 0AC5 GUJARATI VOWEL CANDRA E U+0A8F 0A85 0AC7 GUJARATI LETTER E U+0A90 0A85 0AC8 GUJARATI LETTER AI U+0A91 0A85 0ABE 0AC5 GUJARATI VOWEL CANDRA O 0A85 0AC9 0A86 0AC5 U+0A93 0A85 0ABE 0AC7 GUJARATI LETTER O 0A85 0ACB 0A86 0AC7 U+0A94 0A85 0ABE 0AC8 GUJARATI LETTER AU 0A85 0ACC 0A86 0AC8 U+0AC9 0ABE 0AC5 GUJARATI VOWEL SIGN CANDRA O U+0ACB 0ABE 0AC7 GUJARATI VOWEL SIGN O U+0ACC UABE 0AC8 GUJARATI VOWEL SIGN AU U+0B06 0B05 0B3E ORIYA LETTER AA U+0B4B 0B47 0B3E ORIYA VOWEL SIGN O U+0B4C 0B47 0B57 ORIYA VOWEL SIGN AU U+0B8A 0B89 0BD7 TAMIL LETTER UU (in some styles) U+0B94 0B93 0BD7 TAMIL LETTER AU U+0BCA 0BC6 0BBE TAMIL VOWEL SIGN O U+0BCB 0BC7 0BBE TAMIL VOWEL SIGN OO U+0BCC 0BC6 0BD7 TAMIL VOWEL SIGN AU U+0C0B 0C2C 0C41 0C41 TELUGU LETTER VOCALIC R U+0C13 0C12 0C55 TELUGU LETTER OO U+0C14 0C12 0C4C TELUGU LETTER AU (U+0C2E 0C35 0C41 TELUGU LETTER MA will not be confused, as the script uses a special rendering of 0C41 in this case. The same is done in several other appearant cases.) U+0C47 0C45 0C55 TELUGU VOWEL SIGN EE U+0C48 0C46 0C57 TELUGU VOWEL SIGN AI U+0C94 0C92 0CCC KANNADA LETTER AU (approximate) (same remark as for Telugu, many cases are disambiguated by rendering) U+0CC0 0CBF 0CD5 KANNADA VOWEL SIGN II U+0CC7 0CC6 0CD5 KANNADA VOWEL SIGN EE U+0CC8 0CC6 0CD6 KANNADA VOWEL SIGN AI U+0CCA 0CC6 0CC2 KANNADA VOWEL SIGN O U+0CCB 0CC6 0CC2 0CD5 KANNADA VOWEL SIGN OO 0CCA 0CD5 U+0D08 0D07 0D57 MALAYALAM LETTER II U+0D0A 0D09 0D57 MALAYALAM LETTER UU U+0D10 0D0E 0D46 MALAYALAM LETTER AI U+0D13 0D12 0D3E MALAYALAM LETTER OO U+0D14 0D12 0D57 MALAYALAM LETTER AU U+0D48 0D46 0D46 MALAYALAM VOWEL SIGN AI (if reordering works twice) U+0D4A 0D46 0D3E MALAYALAM VOWEL SIGN O U+0D4B 0D47 0D3E MALAYALAM VOWEL SIGN OO U+0D4C 0D46 0D57 MALAYALAM VOWEL SIGN AU
Unicode includes various ways to encode so called abstract characters, for example, the letter ü can be represented as a single character, U+00FC or as two, the letter u, U+0075 followed by a combining diaeresis, U+0308. To simplify searching and sorting operations, a single spelling is selected as the preferred, or 'canonical' representation. When a number of combining marks is applied to a single character, these combining marks also have to appear in a standard order, the canonical order. This is of special interest to Indic scripts, as these use a large number of combining marks.
Canonical ordering of combining characters is described in the Unicode standard, v2.0, sec 3.9 and table 4.3. However, the order prescribed here as the canonical order, gives the wrong orders of vowel signs and vowel modifiers. The Unicode standard prescribes on page 6-40, rule R10 (and also corresponding to ISCII) that the vowel modifiers (candrabindu, anusvara) follow the vowel sign. However, in case both the vowel sign and the vowel modifier are combining marks according to the Unicode standard, and the standard canonical ordering algorithm is applied, they will be reordered in the wrong order, for example in Devanagari:
Expected | Reordered |
<KA> <vs E> <anusvara> | <KA> <anusvara> <vs E> |
<KA> <vs I> <anusvara> | (not reordered) |
The same issue arises with the svaras (udatta and anudatta) in Devanagari.
The solution will be to change table 4.3 as follows such that vowel signs come first, then vowel modifiers, then svaras. As the latter may be applied to other Indic scripts as well, they should be put in a higher category than the vowel signs in all Indic scripts.
Related to this is the issue of graphical typing of vowel signs discussed above. This should be un-done for when converting Unicode text to the canonical representation.
Note that the Malayalam vowel signs u, uu, and vocaclic r are no longer combining characters, and should not be re-ordered.
Finally, TAMIL SIGN ANUSVARA is missing in the original list. (TAMIL ANUSVARA is in Unicode, although it is not in current use in Tamil. It can be found in some older Tamil books.)
Proposed is to create a new class, Indic Vowel Modifiers, before the diacritics, and after the fixed position classes. Putting the vowel modifiers in class 230, would result in ANUDATTA being ordered before them, which is incorrect as UDATTA and ANUDATTA are 'true' diacritics.
I would also put all non spacing vowel signs in single classes, depending on relative position, but this is quite arbitrary, as they are normally not combined. (in fact, the whole purpose of fixed position classes escapes me)
summarizing,
10--199 fixed position classes
will become:
40 vowel sign below 50 vowel sign above 60 length mark below 70 length mark above 80 tones 90 ? 100 vowel modifiers
The new proposed table is:
Code Class Name U+0D41 0 MALAYALAM VOWEL SIGN U U+0D42 0 MALAYALAM VOWEL SIGN UU U+0D43 0 MALAYALAM VOWEL SIGN VOCALIC R Indic Non-Spacing Vowel Signs Below U+0941 40 DEVANAGARI VOWEL SIGN U U+0942 40 DEVANAGARI VOWEL SIGN UU U+0943 40 DEVANAGARI VOWEL SIGN VOCALIC R U+0944 40 DEVANAGARI VOWEL SIGN VOCALIC RR U+0962 40 DEVANAGARI VOWEL SIGN VOCALIC L U+0963 40 DEVANAGARI VOWEL SIGN VOCALIC LL U+09C1 40 BENGALI VOWEL SIGN U U+09C2 40 BENGALI VOWEL SIGN UU U+09C3 40 BENGALI VOWEL SIGN VOCALIC R U+09C4 40 BENGALI VOWEL SIGN VOCALIC RR U+09E2 40 BENGALI VOWEL SIGN VOCALIC L U+09E3 40 BENGALI VOWEL SIGN VOCALIC LL U+0A41 40 GURMUKHI VOWEL SIGN U U+0A42 40 GURMUKHI VOWEL SIGN UU U+0AC1 40 GUJARATI VOWEL SIGN U U+0AC2 40 GUJARATI VOWEL SIGN UU U+0AC3 40 GUJARATI VOWEL SIGN VOCALIC R U+0AC4 40 GUJARATI VOWEL SIGN VOCALIC RR U+0B41 40 ORIYA VOWEL SIGN U U+0B42 40 ORIYA VOWEL SIGN UU U+0B43 40 ORIYA VOWEL SIGN VOCALIC R Above U+0945 50 DEVANAGARI VOWEL SIGN CANDRA E U+0946 50 DEVANAGARI VOWEL SIGN SHORT E U+0947 50 DEVANAGARI VOWEL SIGN E U+0948 50 DEVANAGARI VOWEL SIGN AI U+0A47 50 GURMUKHI VOWEL SIGN EE U+0A48 50 GURMUKHI VOWEL SIGN AI U+0A4B 50 GURMUKHI VOWEL SIGN OO U+0A4C 50 GURMUKHI VOWEL SIGN AU U+0AC5 50 GUJARATI VOWEL SIGN CANDRA E U+0AC7 50 GUJARATI VOWEL SIGN E U+0AC8 50 GUJARATI VOWEL SIGN AI U+0B3F 50 ORIYA VOWEL SIGN I (sometimes renders below) U+0BC0 50 TAMIL VOWEL SIGN II (often ligates) U+0C3E 50 TELUGU VOWEL SIGN AA U+0C3F 50 TELUGU VOWEL SIGN I U+0C40 50 TELUGU VOWEL SIGN II U+0C46 50 TELUGU VOWEL SIGN E U+0C47 50 TELUGU VOWEL SIGN EE U+0C4A 50 TELUGU VOWEL SIGN O U+0C4B 50 TELUGU VOWEL SIGN OO U+0C4C 50 TELUGU VOWEL SIGN AU U+0CBF 50 KANNADA VOWEL SIGN I U+0CC6 50 KANNADA VOWEL SIGN E U+0CCC 50 KANNADA VOWEL SIGN AU Indic Non-Spacing Length Marks Below U+0C56 60 TELUGU AI LENGTH MARK Above U+0B56 70 ORIYA AI LENGTH MARK (missing in orginal table) U+0C55 70 TELUGU LENGTH MARK Indic Non-Spacing Vowel Modifiers (all above) U+0902 100 DEVANAGARI SIGN ANUSVARA U+0901 100 DEVANAGARI SIGN CANDRABINDU U+0981 100 BENGALI SIGN ANUSVARA U+0A02 100 GURMUKHI SIGN BINDI U+0A70 100 GURMUKHI SIGN TIPPI U+0A71 100 GURMUKHI SIGN ADDAK U+0A82 100 GUJARATI SIGN ANUSVARA U+0A81 100 GUJARATI SIGN CANDRABINDU U+0B01 100 ORIYA SIGN CANDRABINDU U+0B82 100 TAMIL SIGN ANUSVARA (missing in orginal table) Indic Svaras U+0952 220 DEVANAGARI STRESS SIGN ANUDATTA U+0951 230 DEVANAGARI STRESS SIGN UDATTA The same for Thai and Lao (needs some more research) Thai U+0E31 50 THAI CHARACTER MAI HAN-AKAT U+0E34 50 THAI CHARACTER SARA I U+0E35 50 THAI CHARACTER SARA II U+0E36 50 THAI CHARACTER SARA UE U+0E37 50 THAI CHARACTER SARA UEE U+0E38 40 THAI CHARACTER SARA U U+0E39 40 THAI CHARACTER SARA UU U+0E3A 9 THAI CHARACTER PHINTHU (Pali virama) U+0E47 50 THAI CHARACTER MAITAIKHU (is this really a vowel?) U+0E48 80 THAI CHARACTER MAI EK U+0E49 80 THAI CHARACTER MAI THO U+0E4A 80 THAI CHARACTER MAI TRI U+0E4B 80 THAI CHARACTER MAI CHATTAWA U+0E4C 90 THAI CHARACTER THANTHAKHAT U+0E4D 90 THAI CHARACTER NIKHAHIT U+0E4E 100 THAI CHARACTER YAMAKKAN (what is this?) (Different semantics of Thai vowel signs needs to be taken care of when transliterating Pali from Devanagari to Thai or vice versa) U+0EB1 50 LAO VOWEL SIGN MAI KAN U+0EB4 50 LAO VOWEL SIGN I U+0EB5 50 LAO VOWEL SIGN II U+0EB6 50 LAO VOWEL SIGN Y U+0EB7 50 LAO VOWEL SIGN YY U+0EB8 40 LAO VOWEL SIGN U U+0EB9 40 LAO VOWEL SIGN UU U+0EBB 50 LAO VOWEL SIGN MAI KON U+0EBC 40 LAO SEMIVOWEL SIGN LO U+0EC8 80 LAO TONE MAI EK U+0EC9 80 LAO TONE MAI THO U+0ECA 80 LAO TONE MAI TI U+0ECB 80 LAO TONE MAI CATAWA U+0ECC 90 LAO CANCELLATION MARK U+0ECD 90 LAO NIGGAHITA The same for Tibetan (This still needs some research) TODO: The decompositions need to be checked. U+0F71 40 TIBETAN VOWEL SIGN AA U+0F72 50 TIBETAN VOWEL SIGN I U+0F74 40 TIBETAN VOWEL SIGN U U+0F75 40 TIBETAN VOWEL SIGN UU U+0F7A 50 TIBETAN VOWEL SIGN E U+0F7B 50 TIBETAN VOWEL SIGN EE U+0F7C 50 TIBETAN VOWEL SIGN O U+0F7D 50 TIBETAN VOWEL SIGN OO TODO: more.
(updated 20-DEC-1997)
The glyph for U+095E DEVANAGARI LETTER FA is wrong, shown is a Devanagari combination FRA.
There is also a doubled version of DEVANAGARI (LETTER | VOWEL SIGN) SHORT (E | O), used for transcribing short ai or au in some Indian languages, given in Grierson. However, some confusion is possible, because in this source, SHORT E is rendered with a reversed E, while short AI is represented with the SHORT E currently in Unicode. The reversed E cannot be handled as a mere glyph variation of SHORT E, as both shapes will be required in a document that uses this scheme. I propose to add:
0971 DEVANAGARI LETTER REVERSED E = short e 0972 DEVANGARI LETTER SHORT AU 0973 DEVANAGARI VOWEL SIGN SHORT AI 0974 DEVANAGARI VOWEL SIGN SHORT AU
duplicating SHORT E with the semantic SHORT AI will confuse people in data-entry, and is hence not proposed.
Source: Grierson: A Linguistic Survey of India (introduction, p. 7).
The ISCII-91 standard defines a large number of Vedic accents (which can be compared with Hebrew cantilation marks) These are not included in Unicode, but need to. These are included in a separate proposal. As long as this proposal is not accepted, private area characters will have to be used for this purpose.
Further, there are symbols for representing monetary amounts, as in Bengali. The symbols are no longer in current use, but may be of interest to people researching historical documents. These symbols can be combined, so only the following will be necessary.
0975 DEVANAGARI SYMBOL ONE ANNA 0976 DEVANAGARI SYMBOL TWO ANNAS 0977 DEVANAGARI SYMBOL THREE ANNAS 0978 DEVANAGARI SYMBOL FOUR ANNAS 0979 DEVANAGARI SYMBOL EIGHT ANNAS 097A DEVANAGARI SYMBOL TWELVE ANNAS
source: Grierson.
Why are the Gujarati vowels with candra called
GUJARATI VOWEL CHANDRA E GUJARATI VOWEL CHANDRA O
when all other vowels are called LETTER?
Danda is commonly used in Gujarati, but not encoded. It can be borrowed from the Devanagari block.
[v2.0: Gujarati avagraha: glyph printed upside-down?]
Gurmukhi is used for writting the Panjabi language in India, which is also written in the Arabic script in Pakistan.
In ISCII, the distinction between TIPPI and BINDI is made by context, and both are encoded by the same character.
Why are GURMUKHI LETTER EE and OO not not called E and O?
Given in ISCII-91, and not in Unicode:
0A58 GURMUKHI LETTER QA 0A5D GURMUKHI LETTER RHA
The Gurmukhi alphabetical order is different from the other Indic scripts.
sources: ISCII standard, C. Shackle, Punjabi (Teach yourself books)
ISCII-91 makes a distinction between VA and BA (and uses a slightly different glyph for VA), so I propose to include it for compatibility with ISCII-91 only. Note that the Bengali alphabet simply repeats BA in the place of Devanagari VA.
09B1 BENGALI LETTER VA
There is some confusion possible between YA and YYA.
Both letters are derived from Sanskrit ya, but Bengali YA is pronounced like JA /ja/. To indicated the that YA is to be pronounced as /ya/, a dot was added in the 19th century. The secondary forms of YA and YYA (YA-Phaala) seem to be the same, but sources differ on this issue. According to Chatterji, the secondary form is YYA, though many others maintain it is YA. The most sensible way of handling this seems to use the secondary form Ya-Phaala for both characters when they appear in secondary position.
Further confusion may arise from the ISCII-91 encoding of both characters, which is based on phonetic value. ISCII-91 encodes Bengali YYA parallel with YA in Devanagari, and Bengali YA with Devanagari YYA, while the Unicode encoding is based on the graphical appearance, and puts Bengali YA parallel with Devanagari YA, and Bengali YYA with Devanagari YYA. An ISCII implementation I have used (LEAP from C-DAC) only generates YA-Phaala when YYA is typed.
This issue should be taken care of when converting ISCII to Unicode.
sources: ISCII-91 standard, S.K. Chatterji, Bengali Self-taught.
In Bengali, the secondary YA sometimes follows the full vowels A and O. There is no indicated way of encoding the secondary YYA (ya-phala), following the vowels A and O. I propose to use the following idiom:
A + VIRAMA + YYA => A + secondary YYA O + VIRAMA + YYA => O + secondary YYA
source: S.K. Chatterji, Bengali Self-taught.
Danda is commonly used in Bengali, but not encoded. It can be borrowed from the Devanagari block.
ISCII-91 makes a distinction between VA and BA (and uses a different glyph for VA), so I suggest including
0B31 ORIYA LETTER VA
This character is graphically an O with a subscript BA, so it can also be represented as O + VIRAMA + BA; see the remark below.
Besides this character, an Oriya letter BA with a dot inside is also used for va, as shown in the following alphabet book I snatched somewhere from the net.
Also see under Bengali script. The same confusion is possible here. The correct mapping from ISCII is:
0B2F ORIYA LETTER YA <-> ISCII-91 CE Consonsant JYA 0B5F ORIYA LETTER YYA <-> ISCII-91 CD Consonant YA
In an ISCII implementation I've studied, YYA will appear as the secondary form, YA not.
The decomposition of YYA into YA NUKTA in the 2.014 table seems incorrect. No Oriyan will consider YYA a YA with a NUKTA (dot).
The secondary YYA may appear after the vowels E and O, and the secondary letter BA can appear after O. This can be encoded using VIRAMA.
E + VIRAMA + YYA => E + secondary YYA + ai length mark O + VIRAMA + YYA => O + secondary YYA + ai length mark O + VIRAMA + BA => O + secondary BA
The exact use of such vowels is still unclear. The letters AI and AU can be seen as graphical contractions of the two first combinations.
(see also the comments on Bengali)
I have not been able to find vowel signs for the vowels VOCALIC RR, VOCALIC L, VOCALIC LL in literature available to me. Still they might be required for printing Sanskrit in Oriya script. Can somebody please try to locate these symbols.
anusvara: anusvara visarga: bisarga candrabindu: candrabindu, anunasika avagraha: abagraha
Danda is commonly used in Oriya, but not encoded. It can be borrowed from the Devanagari block.
ORIYA SIGN CANDRABINDU is sometimes treated as a spacing mark.
(updated 29-OCT-1997)
Note that TAMIL AU LENGTH MARK looks like a TAMIL LETTER LLA.
The explanations in the Unicode standard, v 2.0 are based on traditional script as it is in use in Sri Lanka. India currently uses the reformed script, which drops a number of irregular glyphs.
virama: pulli
Tamil numerals do not combine like decimal digits, but rather like ideographic numbers, that is, there is no zero, but there are signs for 10, 100, 1000. To represent 1980, one writes
TAMIL DIGIT ONE TAMIL NUMBER ONE THOUSAND TAMIL DIGIT NINE TAMIL NUMBER ONE HUNDRED TAMIL DIGIT EIGHT TAMIL NUMBER TEN,
i.e 1*1000 + 9*100 + 8*10 = 1980
Nowadays, international numerals are used in Tamil in India and Sri Lanka, However genuine decimal Tamil numerals are in use in Mauritius. Modern Mauritian banknotes use Tamil numerals as decimal numerals, with a European style zero. [see Pick, World Paper Money, vol. 2, 7th ed., p. 841] This will require:
0BE6 TAMIL DIGIT ZERO
There are special symbols for year, month and day in Tamil. These can be added after the numerals
0BF3 TAMIL SYMBOL FOR YEAR 0BF4 TAMIL SYMBOL FOR MONTH 0BF5 TAMIL SYMBOL FOR DAY
Tamil Om sign, consisting of an O with inscribed MA with ANUSVAR.
0BF6 TAMIL OM SIGN
Tamil "Shri" is often written with a grantha conjunct, which cannot be produced with the ordinary composition rules.
0BF7 TAMIL SHRI SIGN
I can provide samples of the glyphs.
Tamil Grantha is not included Unicode, but can be encoded in parallel with Tamil. However, a separate encoding seems to be in the making already. I will work out a proposal for Tamil Grantha.
Most Telugu consonants have a little hacek like check on top of them. By some school books this is considered as the vowel sign for A, and rendered as a separate character. Why not add TELUGU SIGN CHECK MARK (or rendering rule to get it)?
candrabindu: ardhasunna anusvara: sunna visarga: visarga virama: valapalagilaka
Source: Lakshmi V.S. Mukkavilli, _TeluguTeX_.
@ dependent vowel signs 0C62 TELUGU VOWEL SIGN VOCALIC L 0C63 TELUGU VOWEL SIGN VOCALIC LL @ various signs TELUGU ARASUNNA : telugu sign candrabindu TELUGU SUNNA : telugu sign anusvara TELUGU VISARGA : telugu sign visarga ???? TELUGU SIGN ARDHAVISARGA visarga of which the circles are open below 0C3D TELUGU SIGN AVAGRAHA ???? TELUGU ? sign of which no name is given, looks like an upside down ???? TELUGU SIGN NAKARAPOLLU TELUGU SIGN VALAPALAGILAKA : telugu sign virama TELUGU STRESS SIGN UDATTA : devanagari stress sign udatta TELUGU STRESS SIGN DOUBLE UDATTA : vedic long svarita TELUGU STRESS SIGN ANUDATTA : devanagari stress sign anudatta
Nakanishi gives symbols for one quarter, one half, and three quarters.
TELUGU SYMBOL FOR ONE QUARTER TELUGU SYMBOL FOR ONE HALF TELUGU SYMBOL FOR THREE QUARTERS
ALA-LC Romanization Tables include:
???? TELUGU LETTER CCA ???? TELUGU LETTER JJA
which look like CA and JA respectively, but have an upside down vowel sign aa above them.
???? TELUGU SIGN UPSIDE DOWN MATRA AA (better name needed)
Danda is sometimes used in Telugu, but not encoded. It can be borrowed from the Devanagari block.
(Kannada rra and fa are not in ISCII-91)
anusvara: bindu, sonne
Why not add KANNADA SIGN CHECK MARK (or rendering rule to get it)?
Danda is sometimes used in Kannada, but not encoded. It can be borrowed from the Devanagari block.
(updated 27-OCT-1997)
Malayalam uses ligatures of the virama with some characters (so called cillu letters). These should be distinquished from the same letters with a virama. I suggest using <character><virama><zwj> to produce these letters, and <character><virama> for the explicit virama. This is in parallel with the ISCII standard and Unicode conventions (when cillu letters are seen as a kind of half letters). ISCII uses soft halant for this purpose as well as for forcing half-consonants in several North Indian scripts. In some cases, this cillu letter can also appear in a conjunct. For example N + RRA. The proposed idiom for this combination is: N<virama><zwj><virama>RRA. (this is admittedly ugly)
The description in the Unicode standard is based on traditional script. Since 1974 reformed script has been in use. In reformed Malayalam script, VOWEL SIGN U, VOWEL SIGN UU, and VOWEL SIGN VOCALIC R are no longer non-spacing marks, and that VOWEL SIGN VOCALIC RR, and VOCALIC RR, VOCALIC L, VOCALIC LL have been abolished.
Traditional Malayalam script allows placing a virama on a cluster already carrying the u matra, to notify a short u sound.
(Malayalam has also been written in Arabic script, using some extra letters, I will try to find out details)
@ dependent vowel signs 0D44 MALAYALAM VOWEL SIGN VOCALIC RR 0D62 MALAYALAM VOWEL SIGN VOCALIC L 0D63 MALAYALAM VOWEL SIGN VOCALIC LL
The vowel sign for VOCALIC RR is in Gundert, p, 290, entry kRR, and p. 697 pRR. The dependent vowel signs for VOCALIC L and VOCALIC LL are the the same glyphs subscribed. The sign for VOCALIC LL can be found in Gundert on p. 290, entry kLLptam.
In current use is a sign for ordinals, which can be found in newspapers and in many Malayalam fonts. Its use corresponds with -th in English.
@ additional symbols 0D70 MALAYALAM ORDINAL SIGN
looks like letter n with curl on right side.
Malayalam numerals are hardly ever used nowadays.
Also found a single reference to a sign for ONE HALF [Frohnmeyer]. Possibly it is something like the Bengali currency numerators.
Malayalam names for characters (TODO)
TODO (canonical ordering)
TODO (canonical ordering)
TODO (canonical ordering, composition, meaning of symbols in English)
Today, ISCII is, in its various incarnations, the most widely used character set for Indian languages. However, round trip convertion between the latest version of ISCII, from 1991, and Unicode is not possible, for the following reasons:
The first two problems are not within the scope of Unicode, the third is partly in scope of Unicode (as the variants are sometimes significant), and the last definitly is, and needs fixing.
Some further issues are described with the respective scripts above.
It should be noted that Unicode is based on an older version of ISCII. This is no problem, as translation tables/routines should be used in any case. It should also be noted that translation tables/routines should be used to translate from one Indic script to another. A mere addition of a constant to the character codes does not yield defined Unicode characters in all cases.
ISCII-91 uses the doubling of some characters, and nukta to indicate rendering variants. These idioms will have to be translated to their Unicode equivalents. The idioms proposed here are choosen such that they follow Unicode conventions already in place, and such that they have minimal impact on systems that do not use these conventions: both joiner and non-joiner will not print. Some of these conventions are not part of ISCII, but implemented by its major promoters, C-DAC in Pune in its products.
Character type ISCII Code(s) Unicode C - consonant H - halant/viram N - nukta J - joiner X - non-joiner V - vowel M - vowel sign D - vowel modifier S - non-spacing diacritic de facto proposed ISCII-91 Unicode Semantic idiom idiom C1 H C2 C1 H C2 Use standard conjunct of C1 and C2. C1 H N C2 C1 H J C2 Use half-form of C1, followed by full C2 C1 H H C2 C1 H X C2 Use C1 with halant, followed by C2 (no conjunct) C1 H H H C2 C1 H J J C2 Use variant conjunct of C1 and C2 (applications may use more J to indicate further variants) C1 H N C1 H J (Malayalam) Use cillu letter C1 H N N C2 C1 H J H C2 (Malayalam) Use secondary consonant under cillu letter C M C M Use optionally ligating vowel-sign C M M C X M Use standard vowel sign (no ligating) C M M M C M J Use ligating variant of vowel sign (applications may use more J to indicate further variants)
in the above three cases in ISCII this behaviour is only implemented for a few consonant--vowel sign combinations, for example in Devanagari: Ru, Ruu, Hr and Gujarati: Ru, Ruu.
Further idioms can be used for special usages.
V M apply vowel sign to a vowel. V H C apply secondary consonant to a vowel. C M H apply both vowel sign and halant (as is used in Malayalam)
Further, ISCII uses a number of code-extention techniques to access a number of lesser used characters. The following combinations should be noted:
anusvara + nukta -> om sign danda + nukta -> avagraha i + nukta -> vocalic l ii + nukta -> vocalic ll vocalic r + nukta -> vocalic rr vs i + nukta -> vs vocalic l vs ii + nukta -> vs vocalic ll vs vocalic r + nukta -> vs vocalic rr
word ::= initial-syllable syllables.
(note: in Bengali, word-initial vowel sign e, o, etc. are sometimes rendered with a different glyph, hence the difference in the syntax)
syllables ::= syllable syllables | empty syllable ::= base ending base ::= vowel | consonant | cluster cluster ::= consonant halant cluster consonant ::= C [ N ] halant ::= H [ J ] [ J | X ] ending ::= [ [ X ] M ] [ J ] [ D ] | [ [ M ] H ]
Many people in India know more than one language, but only one or two scripts. For example, somebody living in Orissa may have no problem understanding Bengali and Hindi, but may not be able to read their scripts. For this reason, script convertion will often be beneficial. For this purpose, tables have to be prepared to convert one script to another. I propose here a set of tables, which map to and from the Devanagari block, to allow such convertions. The Devanagari block is choosen, because it is the most complete block in the standard, and the most widely known script in India. It should be noted that these tables do not give round-trip convertions.
The tables or software using them should take into account the possibility of decomposition of NNNA into N + NUKTA, etc.
DEVANAGARI -> GUJARATI -> DEVANAGARI
DEVANAGARI -> ORIYA -> DEVANAGARI
DEVANAGARI -> BENGALI -> DEVANAGARI
DEVANAGARI -> ASSAMESE -> DEVANAGARI
DEVANAGARI -> GURMUKHI -> DEVANAGARI
DEVANAGARI -> MALAYALAM -> DEVANAGARI
DEVANAGARI -> TAMIL -> DEVANAGARI
DEVANAGARI -> KANNADA -> DEVANAGARI
DEVANAGARI -> TELUGU -> DEVANAGARI
In addition, the following tables will be produced:
DEVANAGARI -> ARABIC -> DEVANAGARI (DV->AR is not Hindi->Urdu, because
the Urdu orthography differs)
GURMUKHI -> ARABIC -> GURMUKHI
DEVANAGARI -> ROMAN -> DEVANAGARI (using diacritics)
DEVANAGARI -> ROMAN (without diacritics)
and all other scripts to Roman.
characters not given in the table should not be changed.
The transliteration given is the one most commonly seen in Indological publications. Where clear concensus exist on a certain transcription, common alternatives are also given.
Devanagari Transliteration Devanagari Transliteration Character Character(s) Name Name(s) ------------------------------------------------------------------------------ 0901 1E41 candrabindu m with dot above 006D 0310 m + candrabindu 0902 1E43 anusvara m with dot below 0903 1E25 visarga h with dot below 0905 0061 a a 0906 0101 aa a with macron 0907 0069 i i 0908 012B ii i with macron 0909 0075 u u 090A 016B uu u with macron 090B 1E5B vocalic r r with dot below 0072 0325 r with circle below 090C 1E37 vocalic l l with dot below 006C 0325 l with circle below 090D 00EA candra e e with circumflex 090E 0115 short e e with breve 090F 0065 e e 0910 0061 0069 ai a + i [see note 1] 0911 014F candra o o with circumflex 0912 014D short o o with macron 0913 006F o o 0914 0061 0075 au a + u [see note 2] 0915 006B k k [see note 3] 0916 006B 0068 kh k + h 0917 0067 g g 0918 0067 0068 gh g + h 0919 1E45 ng n with dot above 091A 0063 c c 091B 0063 0068 ch c + h 091C 006A j j 091D 006A 0068 jh j + h 091E 00F1 n n with tilde 091F 1E6D tt t with dot below 0920 1E6D 0068 tth t with dot below + h 0921 1E0D dd d with dot below 0922 1E0D 0068 ddh d with dot below + h 0923 1E47 nn n with dot below 0924 0074 t t 0925 0074 0068 th t + h 0926 0064 d d 0927 0064 0068 dh d + h 0928 006E n n 0929 1E49 n n with line below 092A 0070 p p 092B 0070 0068 ph p + h 092C 0062 b b 092D 0062 0068 bh b + h 092E 006D m m 092F 0079 y y 0930 0072 r r 0931 1E5B rr r with dot below 1E5F r with line below 0932 006C l l 0933 1E37 ll l with dot below 0934 1E3B lll l with line below 006C 0324 l + double dot below 0935 0076 v v 0936 015B s s with acute 0937 1E63 s s with dot below 0938 0073 s s 0939 0068 h h 093C nukta [see note 4] 093E avagraha [TODO vowel signs same as corresponing vowels] 0958 0071 q q 0959 1E35 0068 0331 kkh k with line below + h + macron below [TODO] 0960 1E5D vocalic rr r with dot below and macron 0072 0325 0304 r with circle below and macron 0961 vocalic ll
Indian names are normally transliterated in an ad-hoc fashion. Normally an Indian uses a single Roman spelling of his name, while another Indian carrying the same name may transliterate it in a different way. For example, the name Chauduri can be found in at least eight different spellings. One individual preferring Chowdury, the other Chaudhuri. Indian telephone directories normally sort all such names as if they had been spelled the same, and add cross-references with the other spellings.
[note 1] The sequence ai should be disambiguated with a diaeresis on the i if necessary.
[note 2] au likewise.
[note 3] Depending on context, a should be added after the consonants.
[note 4] Nukta following some letters modifies their sounds. The combination letter + nukta should be transliterated as a unit.
A comma separated table specifies how Devanagari can be converted into any of the other Brahmi derived scripts in use in India. Note that the translations are not one-to-one and are not reversible. Approximate translations are given in parentheses, alternatives are given in case of ambiguities. Characters borrowed from the Devanagari block are given in brackets. When converting character data from one script to another, the following should be noted:
[1] Gurmukhi: when a consonant is doubled, this is indicated with U+0A71 for example 0915 094D 0915 maps to 0A15 0A71.
01-FEB-1998 Added notes on Telugu. 20-DEC-1997 Added note on wrong glyph in Devanagari table, Worked on tables (JH) 30-OCT-1997 Converted to HTML, added illustrations (JH) 29-OCT-1997 Updated remarks on Tamil (JH) 27-OCT-1997 Updated remarks on Malayalam, Devanagari (JH) 08-SEP-1997 Added TAMIL SHRI SIGN, changed some remarks with Oriya (JH) 04-SEP-1997 First posting on unicode@unicode.org (JH) 29-AUG-1997 Removed comments on Unicode 1.1 now resolved (JH) Added remarks on Thai, Lao, and Tibetan. 26-AUG-1997 Changed canonical ordering table (JH) 25-AUG-1997 Added correction of canonical ordering table, and list of graphical decompositions (JH) 17-AUG-1997 Added note on canonical ordering of Indic characters Several minor modifications. (JH) 16-JUN-1997 Added more notes on convertion from ISCII to Unicode (JH) 19-MAY-1997 Added notes on convertion from ISCII to Unicode (JH) 11-MAY-1997 Revision with respect to Unicode 2.014 (JH) 22-JUN-1994 Revision (JH)
I like to express my thanks to the following persons, who provided feed-back on previous versions of this document: