Indian Scripts and Unicode

Indian scripts and Unicode

This is always work in progress. Comments, remarks and corrections are most welcome.

Introduction
General Remarks
Individual Scripts
Conversions
Transliteration

History
Acknowledgements

Introduction

Unicode a.k.a. ISO 10646 covers all mayor scripts used in India today. However, the standard has several inconsitencies, short-comings and peculiarities, which need to be known to be handled correctly. This document pin-points the cave-ats, and goes into details on the convertion of ISCII into Unicode. Further, this document also includes a proposal for idioms to be used for rendering variants. These will be worked out to a complete conjunct table for each script, which will eventually appear as an appendix.

Note that because of the way Indic scripts are encoded in Unicode, you cannot change one Indic script to another by simply adding a constant to the character-code. You will need to get an acceptable transcription in another Indic script. To minimize the number of tables required, I propose to use the Devanagari block as a generic Indic block, to and from which all other scripts can be translated. I will give translation tables will in an appendix.

Cultural expected sorting also requires special algorithms. Merely sorting on code-point values will normally not be acceptable. The algorithms and sorting orders for each script will appear as an appendix.

Since many Indian typist may type letters "graphically" instead of phonetically, applications should recognise such graphically typed characters, and convert them to the correct character. A list of such graphically typed character sequences is included.

General Remarks

The Unicode standard explains rendering of Indic scripts with the examples of Devanagari and Hindi. However, an explanation of rendering should be included with each script, as each script has its own peculiarities, which sometimes require a specific 'idiom' for the specific script. Best would be to give a informative overview of each script, with a standarised set of conjuncts, alternative conjuncts and other aspects of rendering.

A default collating order needs to be defined for each script.

Vowel signs on vowels

Sometimes vowel signs on vowels are used in primary school books, some dictionaries for minority languages, and for transliterating foreign languages into Indic scripts, and in historical usages. Hence, a rendering engine should support the application of vowel signs to full vowels.

Two part vowels

In the Unicode standard, two part vowels can be decomposed into their constituent parts. I strongly disagree with this decision. For example in Tamil:

VOWEL SIGN O is declared equivalent with VOWEL SIGN E + VOWEL SIGN AA.

Although in Tamil this may look the same it is not logically the same, and the decomposition must be very much discouraged, as it may cause problems in searching, transliteration to other scripts where this is not the case, and sorting. For all these applications an extra step is needed to remove possible decomposed vowel signs. The problem arises in all scripts using vowel signs with parts before and after a consonant clusters.

This issue is related with graphical typing which is possible in Indic script processing, and causes the same type of problems. This is discussed below.

List of decomposed two part vowel signs

U+09CB  BENGALI VOWEL SIGN O
U+09CC  BENGALI VOWEL SIGN AU
U+0B48  ORIYA VOWEL SIGN AI
U+0B4B  ORIYA VOWEL SIGN O
U+0B4C  ORIYA VOWEL SIGN AU
U+0B94  TAMIL LETTER AU
U+0BCA  TAMIL VOWEL SIGN O
U+0BCB  TAMIL VOWEL SIGN OO
U+0BCC  TAMIL VOWEL SIGN AU
U+0C48  TELUGU VOWEL SIGN AI
U+0CC0  KANNADA VOWEL SIGN II
U+0CC7  KANNADA VOWEL SIGN EE
U+0CC8  KANNADA VOWEL SIGN AI
U+0CCA  KANNADA VOWEL SIGN O
U+0CCB  KANNADA VOWEL SIGN OO
U+0D4A  MALAYALAM VOWEL SIGN O
U+0D4B  MALAYALAM VOWEL SIGN OO
U+0D4C  MALAYALAM VOWEL SIGN AU

Graphical Typing

Many Indian typist, especially those trained on Indian typewriters are used to type their texts in graphical order. Even on computers which support phonetic order typing and automatic vowel-reordering and ligating, it is often still possible to type in graphical order. For example Devanagari VOWEL SIGN O can be encoded as VOWEL SIGN AA + VOWEL SIGN E, and VOWEL AU as VOWEL A + VOWEL SIGN AA + VOWEL SIGN E, or VOWEL AA + VOWEL SIGN O, or VOWEL A + VOWEL SIGN AU. This are four possible spellings which will appear very much the same on screen, but complicate a text search or sort a lot. (In fact, most Devanagari fonts use such decomposition to render VOWEL SIGN AU, but that is a rendering issues which should not be confused with encoding.) Similar examples can be found in all scripts.

(I have once had 5 megabytes of Hindi text in Devanagari, typed by Gujarati typists, having all these artefacts of Gujarati script! Worse, they had also systematically confused the reph with candra e, giving me a headache to get something reasonable out of it)

List of graphically decomposed Indic characters

The decompositions indicated here can best be removed from the representation before attempting to do such things as automatic transcription. The two-part vowels mentioned above are included in this table.

            graphical
character   decomposition(s)    name

U+0906      0905 093E           DEVANAGARI LETTER AA
U+090D      090F 0945           DEVANAGARI LETTER CANDRA E
U+090E      090F 0946           DEVANAGARI LETTER SHORT E
U+0910      090F 0947           DEVANAGARI LETTER AI
U+0911      0905 093E 0945      DEVANAGARI LETTER CANDRA O
            0905 0949
            0906 0945
U+0912      0905 093E 0946      DEVANAGARI LETTER SHORT O
            0905 094A
            0906 0946
U+0913      0905 093E 0947      DEVANAGARI LETTER O
            0905 094B
            0906 0947
U+0914      0905 093E 0948      DEVANAGARI LETTER AI
            0905 094C
            0906 0948
U+0949      093E 0945           DEVANAGARI VOWEL SIGN CANDRA O
U+094A      093E 0946           DEVANAGARI VOWEL SIGN SHORT O
U+094B      093E 0947           DEVANAGARI VOWEL SIGN O
U+094C      093E 0948           DEVANAGARI VOWEL SIGN AU

U+0986      0985 09BE           BENGALI LETTER AA
U+09CB      09C7 09BE           BENGALI VOWEL SIGN O
U+09CC      09C7 09D7           BENGALI VOWEL SIGN AU

U+0A06      0A05 0A3E           GURMUKHI LETTER AA
U+0A07      0A72 0A3F           GURMUKHI LETTER I
U+0A08      0A72 0A40           GURMUKHI LETTER II
U+0A09      0A73 0A41           GURMUKHI LETTER U
U+0A0A      0A73 0A42           GURMUKHI LETTER UU
U+0A0F      0A72 0A47           GURMUKHI LETTER EE
U+0A10      OA05 0A48           GURMUKHI LETTER AI
U+0A14      0A05 0A4C           GURMUKHI LETTER AU
U+0A42      0A41 0A41           GURMUKHI VOWEL SIGN UU (if stacking)

U+0A86      0A85 0ABE           GUJARATI LETTER AA
U+0A8D      0A85 0AC5           GUJARATI VOWEL CANDRA E
U+0A8F      0A85 0AC7           GUJARATI LETTER E
U+0A90      0A85 0AC8           GUJARATI LETTER AI
U+0A91      0A85 0ABE 0AC5      GUJARATI VOWEL CANDRA O
            0A85 0AC9
            0A86 0AC5
U+0A93      0A85 0ABE 0AC7      GUJARATI LETTER O
            0A85 0ACB
            0A86 0AC7
U+0A94      0A85 0ABE 0AC8      GUJARATI LETTER AU
            0A85 0ACC
            0A86 0AC8
U+0AC9      0ABE 0AC5           GUJARATI VOWEL SIGN CANDRA O
U+0ACB      0ABE 0AC7           GUJARATI VOWEL SIGN O
U+0ACC      UABE 0AC8           GUJARATI VOWEL SIGN AU

U+0B06      0B05 0B3E           ORIYA LETTER AA
U+0B4B      0B47 0B3E           ORIYA VOWEL SIGN O
U+0B4C      0B47 0B57           ORIYA VOWEL SIGN AU

U+0B8A      0B89 0BD7           TAMIL LETTER UU (in some styles)
U+0B94      0B93 0BD7           TAMIL LETTER AU
U+0BCA      0BC6 0BBE           TAMIL VOWEL SIGN O
U+0BCB      0BC7 0BBE           TAMIL VOWEL SIGN OO
U+0BCC      0BC6 0BD7           TAMIL VOWEL SIGN AU

U+0C0B      0C2C 0C41 0C41      TELUGU LETTER VOCALIC R
U+0C13      0C12 0C55           TELUGU LETTER OO
U+0C14      0C12 0C4C           TELUGU LETTER AU
(U+0C2E     0C35 0C41           TELUGU LETTER MA will not be confused,
                                as the script uses a special rendering
                                of 0C41 in this case. The same is
                                done in several other appearant cases.)
U+0C47      0C45 0C55           TELUGU VOWEL SIGN EE
U+0C48      0C46 0C57           TELUGU VOWEL SIGN AI

U+0C94      0C92 0CCC           KANNADA LETTER AU (approximate)
(same remark as for Telugu, many cases are disambiguated by rendering)
U+0CC0      0CBF 0CD5           KANNADA VOWEL SIGN II
U+0CC7      0CC6 0CD5           KANNADA VOWEL SIGN EE
U+0CC8      0CC6 0CD6           KANNADA VOWEL SIGN AI
U+0CCA      0CC6 0CC2           KANNADA VOWEL SIGN O
U+0CCB      0CC6 0CC2 0CD5      KANNADA VOWEL SIGN OO
            0CCA 0CD5

U+0D08      0D07 0D57           MALAYALAM LETTER II
U+0D0A      0D09 0D57           MALAYALAM LETTER UU
U+0D10      0D0E 0D46           MALAYALAM LETTER AI
U+0D13      0D12 0D3E           MALAYALAM LETTER OO
U+0D14      0D12 0D57           MALAYALAM LETTER AU
U+0D48      0D46 0D46           MALAYALAM VOWEL SIGN AI (if reordering works twice)
U+0D4A      0D46 0D3E           MALAYALAM VOWEL SIGN O
U+0D4B      0D47 0D3E           MALAYALAM VOWEL SIGN OO
U+0D4C      0D46 0D57           MALAYALAM VOWEL SIGN AU

Canonical Representation

Unicode includes various ways to encode so called abstract characters, for example, the letter ü can be represented as a single character, U+00FC or as two, the letter u, U+0075 followed by a combining diaeresis, U+0308. To simplify searching and sorting operations, a single spelling is selected as the preferred, or 'canonical' representation. When a number of combining marks is applied to a single character, these combining marks also have to appear in a standard order, the canonical order. This is of special interest to Indic scripts, as these use a large number of combining marks.

Canonical ordering of combining characters is described in the Unicode standard, v2.0, sec 3.9 and table 4.3. However, the order prescribed here as the canonical order, gives the wrong orders of vowel signs and vowel modifiers. The Unicode standard prescribes on page 6-40, rule R10 (and also corresponding to ISCII) that the vowel modifiers (candrabindu, anusvara) follow the vowel sign. However, in case both the vowel sign and the vowel modifier are combining marks according to the Unicode standard, and the standard canonical ordering algorithm is applied, they will be reordered in the wrong order, for example in Devanagari:

Expected Reordered

<KA> <vs E> <anusvara> <KA> <anusvara> <vs E>

<KA> <vs I> <anusvara> (not reordered)

The same issue arises with the svaras (udatta and anudatta) in Devanagari.

The solution will be to change table 4.3 as follows such that vowel signs come first, then vowel modifiers, then svaras. As the latter may be applied to other Indic scripts as well, they should be put in a higher category than the vowel signs in all Indic scripts.

Related to this is the issue of graphical typing of vowel signs discussed above. This should be un-done for when converting Unicode text to the canonical representation.

Proposed changes to table 4-3.

Note that the Malayalam vowel signs u, uu, and vocaclic r are no longer combining characters, and should not be re-ordered.

Finally, TAMIL SIGN ANUSVARA is missing in the original list. (TAMIL ANUSVARA is in Unicode, although it is not in current use in Tamil. It can be found in some older Tamil books.)

Proposed is to create a new class, Indic Vowel Modifiers, before the diacritics, and after the fixed position classes. Putting the vowel modifiers in class 230, would result in ANUDATTA being ordered before them, which is incorrect as UDATTA and ANUDATTA are 'true' diacritics.

I would also put all non spacing vowel signs in single classes, depending on relative position, but this is quite arbitrary, as they are normally not combined. (in fact, the whole purpose of fixed position classes escapes me)

summarizing,

10--199 fixed position classes

will become:

 40 vowel sign below
 50 vowel sign above
 60 length mark below
 70 length mark above
 80 tones
 90 ?
100 vowel modifiers

The new proposed table is:

Code   Class   Name
U+0D41   0      MALAYALAM VOWEL SIGN U
U+0D42   0      MALAYALAM VOWEL SIGN UU
U+0D43   0      MALAYALAM VOWEL SIGN VOCALIC R

Indic Non-Spacing Vowel Signs

Below
U+0941  40      DEVANAGARI VOWEL SIGN U
U+0942  40      DEVANAGARI VOWEL SIGN UU
U+0943  40      DEVANAGARI VOWEL SIGN VOCALIC R
U+0944  40      DEVANAGARI VOWEL SIGN VOCALIC RR
U+0962  40      DEVANAGARI VOWEL SIGN VOCALIC L
U+0963  40      DEVANAGARI VOWEL SIGN VOCALIC LL
U+09C1  40      BENGALI VOWEL SIGN U
U+09C2  40      BENGALI VOWEL SIGN UU
U+09C3  40      BENGALI VOWEL SIGN VOCALIC R
U+09C4  40      BENGALI VOWEL SIGN VOCALIC RR
U+09E2  40      BENGALI VOWEL SIGN VOCALIC L
U+09E3  40      BENGALI VOWEL SIGN VOCALIC LL
U+0A41  40      GURMUKHI VOWEL SIGN U
U+0A42  40      GURMUKHI VOWEL SIGN UU
U+0AC1  40      GUJARATI VOWEL SIGN U
U+0AC2  40      GUJARATI VOWEL SIGN UU
U+0AC3  40      GUJARATI VOWEL SIGN VOCALIC R
U+0AC4  40      GUJARATI VOWEL SIGN VOCALIC RR
U+0B41  40      ORIYA VOWEL SIGN U
U+0B42  40      ORIYA VOWEL SIGN UU
U+0B43  40      ORIYA VOWEL SIGN VOCALIC R

Above
U+0945  50      DEVANAGARI VOWEL SIGN CANDRA E
U+0946  50      DEVANAGARI VOWEL SIGN SHORT E
U+0947  50      DEVANAGARI VOWEL SIGN E
U+0948  50      DEVANAGARI VOWEL SIGN AI
U+0A47  50      GURMUKHI VOWEL SIGN EE
U+0A48  50      GURMUKHI VOWEL SIGN AI
U+0A4B  50      GURMUKHI VOWEL SIGN OO
U+0A4C  50      GURMUKHI VOWEL SIGN AU
U+0AC5  50      GUJARATI VOWEL SIGN CANDRA E
U+0AC7  50      GUJARATI VOWEL SIGN E
U+0AC8  50      GUJARATI VOWEL SIGN AI
U+0B3F  50      ORIYA VOWEL SIGN I (sometimes renders below)
U+0BC0  50      TAMIL VOWEL SIGN II (often ligates)
U+0C3E  50      TELUGU VOWEL SIGN AA
U+0C3F  50      TELUGU VOWEL SIGN I
U+0C40  50      TELUGU VOWEL SIGN II
U+0C46  50      TELUGU VOWEL SIGN E
U+0C47  50      TELUGU VOWEL SIGN EE
U+0C4A  50      TELUGU VOWEL SIGN O
U+0C4B  50      TELUGU VOWEL SIGN OO
U+0C4C  50      TELUGU VOWEL SIGN AU
U+0CBF  50      KANNADA VOWEL SIGN I
U+0CC6  50      KANNADA VOWEL SIGN E
U+0CCC  50      KANNADA VOWEL SIGN AU

Indic Non-Spacing Length Marks
Below
U+0C56  60      TELUGU AI LENGTH MARK

Above
U+0B56  70      ORIYA AI LENGTH MARK (missing in orginal table)
U+0C55  70      TELUGU LENGTH MARK

Indic Non-Spacing Vowel Modifiers (all above)
U+0902  100     DEVANAGARI SIGN ANUSVARA
U+0901  100     DEVANAGARI SIGN CANDRABINDU
U+0981  100     BENGALI SIGN ANUSVARA
U+0A02  100     GURMUKHI SIGN BINDI
U+0A70  100     GURMUKHI SIGN TIPPI
U+0A71  100     GURMUKHI SIGN ADDAK
U+0A82  100     GUJARATI SIGN ANUSVARA
U+0A81  100     GUJARATI SIGN CANDRABINDU
U+0B01  100     ORIYA SIGN CANDRABINDU
U+0B82  100     TAMIL SIGN ANUSVARA (missing in orginal table)

Indic Svaras
U+0952  220     DEVANAGARI STRESS SIGN ANUDATTA

U+0951  230     DEVANAGARI STRESS SIGN UDATTA

The same for Thai and Lao (needs some more research)

Thai

U+0E31  50      THAI CHARACTER MAI HAN-AKAT
U+0E34  50      THAI CHARACTER SARA I
U+0E35  50      THAI CHARACTER SARA II
U+0E36  50      THAI CHARACTER SARA UE
U+0E37  50      THAI CHARACTER SARA UEE
U+0E38  40      THAI CHARACTER SARA U
U+0E39  40      THAI CHARACTER SARA UU
U+0E3A   9      THAI CHARACTER PHINTHU (Pali virama)
U+0E47  50      THAI CHARACTER MAITAIKHU (is this really a vowel?)
U+0E48  80      THAI CHARACTER MAI EK
U+0E49  80      THAI CHARACTER MAI THO
U+0E4A  80      THAI CHARACTER MAI TRI
U+0E4B  80      THAI CHARACTER MAI CHATTAWA
U+0E4C  90      THAI CHARACTER THANTHAKHAT
U+0E4D  90      THAI CHARACTER NIKHAHIT
U+0E4E  100     THAI CHARACTER YAMAKKAN (what is this?)

(Different semantics of Thai vowel signs needs to be taken care of when
transliterating Pali from Devanagari to Thai or vice versa)

U+0EB1  50      LAO VOWEL SIGN MAI KAN
U+0EB4  50      LAO VOWEL SIGN I
U+0EB5  50      LAO VOWEL SIGN II
U+0EB6  50      LAO VOWEL SIGN Y
U+0EB7  50      LAO VOWEL SIGN YY
U+0EB8  40      LAO VOWEL SIGN U
U+0EB9  40      LAO VOWEL SIGN UU
U+0EBB  50      LAO VOWEL SIGN MAI KON
U+0EBC  40      LAO SEMIVOWEL SIGN LO
U+0EC8  80      LAO TONE MAI EK
U+0EC9  80      LAO TONE MAI THO
U+0ECA  80      LAO TONE MAI TI
U+0ECB  80      LAO TONE MAI CATAWA
U+0ECC  90      LAO CANCELLATION MARK
U+0ECD  90      LAO NIGGAHITA

The same for Tibetan

(This still needs some research)

TODO: The decompositions need to be checked.

U+0F71  40      TIBETAN VOWEL SIGN AA
U+0F72  50      TIBETAN VOWEL SIGN I
U+0F74  40      TIBETAN VOWEL SIGN U
U+0F75  40      TIBETAN VOWEL SIGN UU
U+0F7A  50      TIBETAN VOWEL SIGN E
U+0F7B  50      TIBETAN VOWEL SIGN EE
U+0F7C  50      TIBETAN VOWEL SIGN O
U+0F7D  50      TIBETAN VOWEL SIGN OO

TODO: more.

Devanagari

(updated 20-DEC-1997)

wrong glyph in table

The glyph for U+095E DEVANAGARI LETTER FA is wrong, shown is a Devanagari combination FRA.

additional vowels

There is also a doubled version of DEVANAGARI (LETTER | VOWEL SIGN) SHORT (E | O), used for transcribing short ai or au in some Indian languages, given in Grierson. However, some confusion is possible, because in this source, SHORT E is rendered with a reversed E, while short AI is represented with the SHORT E currently in Unicode. The reversed E cannot be handled as a mere glyph variation of SHORT E, as both shapes will be required in a document that uses this scheme. I propose to add:

0971    DEVANAGARI LETTER REVERSED E
          = short e
0972    DEVANGARI LETTER SHORT AU
0973    DEVANAGARI VOWEL SIGN SHORT AI
0974    DEVANAGARI VOWEL SIGN SHORT AU

duplicating SHORT E with the semantic SHORT AI will confuse people in data-entry, and is hence not proposed.

Source: Grierson: A Linguistic Survey of India (introduction, p. 7).

Vedic accents

The ISCII-91 standard defines a large number of Vedic accents (which can be compared with Hebrew cantilation marks) These are not included in Unicode, but need to. These are included in a separate proposal. As long as this proposal is not accepted, private area characters will have to be used for this purpose.

Currency symbols

Further, there are symbols for representing monetary amounts, as in Bengali. The symbols are no longer in current use, but may be of interest to people researching historical documents. These symbols can be combined, so only the following will be necessary.

0975    DEVANAGARI SYMBOL ONE ANNA
0976    DEVANAGARI SYMBOL TWO ANNAS
0977    DEVANAGARI SYMBOL THREE ANNAS
0978    DEVANAGARI SYMBOL FOUR ANNAS
0979    DEVANAGARI SYMBOL EIGHT ANNAS
097A    DEVANAGARI SYMBOL TWELVE ANNAS

source: Grierson.

Gujarati

Why are the Gujarati vowels with candra called

GUJARATI VOWEL CHANDRA E
GUJARATI VOWEL CHANDRA O

when all other vowels are called LETTER?

Danda is commonly used in Gujarati, but not encoded. It can be borrowed from the Devanagari block.

[v2.0: Gujarati avagraha: glyph printed upside-down?]

Gurmukhi

Gurmukhi is used for writting the Panjabi language in India, which is also written in the Arabic script in Pakistan.

In ISCII, the distinction between TIPPI and BINDI is made by context, and both are encoded by the same character.

Why are GURMUKHI LETTER EE and OO not not called E and O?

Given in ISCII-91, and not in Unicode:

0A58 GURMUKHI LETTER QA
0A5D GURMUKHI LETTER RHA

The Gurmukhi alphabetical order is different from the other Indic scripts.

sources: ISCII standard, C. Shackle, Punjabi (Teach yourself books)

Bengali

BA and VA

ISCII-91 makes a distinction between VA and BA (and uses a slightly different glyph for VA), so I propose to include it for compatibility with ISCII-91 only. Note that the Bengali alphabet simply repeats BA in the place of Devanagari VA.

09B1   BENGALI LETTER VA

YA and YYA

There is some confusion possible between YA and YYA.

Both letters are derived from Sanskrit ya, but Bengali YA is pronounced like JA /ja/. To indicated the that YA is to be pronounced as /ya/, a dot was added in the 19th century. The secondary forms of YA and YYA (YA-Phaala) seem to be the same, but sources differ on this issue. According to Chatterji, the secondary form is YYA, though many others maintain it is YA. The most sensible way of handling this seems to use the secondary form Ya-Phaala for both characters when they appear in secondary position.

Further confusion may arise from the ISCII-91 encoding of both characters, which is based on phonetic value. ISCII-91 encodes Bengali YYA parallel with YA in Devanagari, and Bengali YA with Devanagari YYA, while the Unicode encoding is based on the graphical appearance, and puts Bengali YA parallel with Devanagari YA, and Bengali YYA with Devanagari YYA. An ISCII implementation I have used (LEAP from C-DAC) only generates YA-Phaala when YYA is typed.

This issue should be taken care of when converting ISCII to Unicode.

sources: ISCII-91 standard, S.K. Chatterji, Bengali Self-taught.

Secondary YA after vowels

In Bengali, the secondary YA sometimes follows the full vowels A and O. There is no indicated way of encoding the secondary YYA (ya-phala), following the vowels A and O. I propose to use the following idiom:

A + VIRAMA + YYA => A + secondary YYA
O + VIRAMA + YYA => O + secondary YYA

source: S.K. Chatterji, Bengali Self-taught.

Danda

Danda is commonly used in Bengali, but not encoded. It can be borrowed from the Devanagari block.

Oriya

BA and VA

ISCII-91 makes a distinction between VA and BA (and uses a different glyph for VA), so I suggest including

0B31 ORIYA LETTER VA

This character is graphically an O with a subscript BA, so it can also be represented as O + VIRAMA + BA; see the remark below.

Besides this character, an Oriya letter BA with a dot inside is also used for va, as shown in the following alphabet book I snatched somewhere from the net.

YA and YYA

Also see under Bengali script. The same confusion is possible here. The correct mapping from ISCII is:

0B2F ORIYA LETTER YA  <-> ISCII-91 CE Consonsant JYA
0B5F ORIYA LETTER YYA <-> ISCII-91 CD Consonant YA

In an ISCII implementation I've studied, YYA will appear as the secondary form, YA not.

The decomposition of YYA into YA NUKTA in the 2.014 table seems incorrect. No Oriyan will consider YYA a YA with a NUKTA (dot).

Secondary YA after vowels

The secondary YYA may appear after the vowels E and O, and the secondary letter BA can appear after O. This can be encoded using VIRAMA.

E + VIRAMA + YYA  => E + secondary YYA + ai length mark
O + VIRAMA + YYA  => O + secondary YYA + ai length mark
O + VIRAMA + BA   => O + secondary BA

The exact use of such vowels is still unclear. The letters AI and AU can be seen as graphical contractions of the two first combinations.

(see also the comments on Bengali)

Vowel signs for VOCALIC RR, L, and LL

I have not been able to find vowel signs for the vowels VOCALIC RR, VOCALIC L, VOCALIC LL in literature available to me. Still they might be required for printing Sanskrit in Oriya script. Can somebody please try to locate these symbols.

Oriya character names

anusvara: anusvara
visarga: bisarga
candrabindu: candrabindu, anunasika
avagraha: abagraha

Danda

Danda is commonly used in Oriya, but not encoded. It can be borrowed from the Devanagari block.

ORIYA SIGN CANDRABINDU is sometimes treated as a spacing mark.

Tamil

(updated 29-OCT-1997)

Note that TAMIL AU LENGTH MARK looks like a TAMIL LETTER LLA.

Traditional versus reformed script

The explanations in the Unicode standard, v 2.0 are based on traditional script as it is in use in Sri Lanka. India currently uses the reformed script, which drops a number of irregular glyphs.

Tamil characcter names

virama: pulli

Tamil numerals

Tamil numerals do not combine like decimal digits, but rather like ideographic numbers, that is, there is no zero, but there are signs for 10, 100, 1000. To represent 1980, one writes

TAMIL DIGIT ONE
TAMIL NUMBER ONE THOUSAND
TAMIL DIGIT NINE
TAMIL NUMBER ONE HUNDRED
TAMIL DIGIT EIGHT
TAMIL NUMBER TEN,

i.e 1*1000 + 9*100 + 8*10 = 1980

Nowadays, international numerals are used in Tamil in India and Sri Lanka, However genuine decimal Tamil numerals are in use in Mauritius. Modern Mauritian banknotes use Tamil numerals as decimal numerals, with a European style zero. [see Pick, World Paper Money, vol. 2, 7th ed., p. 841] This will require:

0BE6 TAMIL DIGIT ZERO

Additional symbols

There are special symbols for year, month and day in Tamil. These can be added after the numerals

0BF3 TAMIL SYMBOL FOR YEAR
0BF4 TAMIL SYMBOL FOR MONTH
0BF5 TAMIL SYMBOL FOR DAY

Tamil Om sign, consisting of an O with inscribed MA with ANUSVAR.

0BF6 TAMIL OM SIGN

Tamil "Shri" is often written with a grantha conjunct, which cannot be produced with the ordinary composition rules.

0BF7 TAMIL SHRI SIGN

I can provide samples of the glyphs.

Tamil Grantha

Tamil Grantha is not included Unicode, but can be encoded in parallel with Tamil. However, a separate encoding seems to be in the making already. I will work out a proposal for Tamil Grantha.

Telugu

Most Telugu consonants have a little hacek like check on top of them. By some school books this is considered as the vowel sign for A, and rendered as a separate character. Why not add TELUGU SIGN CHECK MARK (or rendering rule to get it)?

Telugu character names

candrabindu: ardhasunna
anusvara: sunna
visarga: visarga
virama: valapalagilaka

Additional characters

Source: Lakshmi V.S. Mukkavilli, _TeluguTeX_.

@ dependent vowel signs
0C62 TELUGU VOWEL SIGN VOCALIC L
0C63 TELUGU VOWEL SIGN VOCALIC LL

@ various signs
TELUGU ARASUNNA
: telugu sign candrabindu
TELUGU SUNNA
: telugu sign anusvara
TELUGU VISARGA
: telugu sign visarga
???? TELUGU SIGN ARDHAVISARGA
visarga of which the circles are open below
0C3D TELUGU SIGN AVAGRAHA
???? TELUGU ?
sign of which no name is given, looks like an upside down
???? TELUGU SIGN NAKARAPOLLU
TELUGU SIGN VALAPALAGILAKA
: telugu sign virama
TELUGU STRESS SIGN UDATTA
: devanagari stress sign udatta
TELUGU STRESS SIGN DOUBLE UDATTA
: vedic long svarita
TELUGU STRESS SIGN ANUDATTA
: devanagari stress sign anudatta

Nakanishi gives symbols for one quarter, one half, and three quarters.

TELUGU SYMBOL FOR ONE QUARTER
TELUGU SYMBOL FOR ONE HALF
TELUGU SYMBOL FOR THREE QUARTERS

ALA-LC Romanization Tables include:

???? TELUGU LETTER CCA
???? TELUGU LETTER JJA

which look like CA and JA respectively, but have an upside down vowel sign aa above them.

???? TELUGU SIGN UPSIDE DOWN MATRA AA (better name needed)

Danda is sometimes used in Telugu, but not encoded. It can be borrowed from the Devanagari block.

Kannada

(Kannada rra and fa are not in ISCII-91)

Kannada character names

anusvara: bindu, sonne

Why not add KANNADA SIGN CHECK MARK (or rendering rule to get it)?

Danda is sometimes used in Kannada, but not encoded. It can be borrowed from the Devanagari block.

Malayalam

(updated 27-OCT-1997)

cillu letters

Malayalam uses ligatures of the virama with some characters (so called cillu letters). These should be distinquished from the same letters with a virama. I suggest using <character><virama><zwj> to produce these letters, and <character><virama> for the explicit virama. This is in parallel with the ISCII standard and Unicode conventions (when cillu letters are seen as a kind of half letters). ISCII uses soft halant for this purpose as well as for forcing half-consonants in several North Indian scripts. In some cases, this cillu letter can also appear in a conjunct. For example N + RRA. The proposed idiom for this combination is: N<virama><zwj><virama>RRA. (this is admittedly ugly)

traditional versus reformed script

The description in the Unicode standard is based on traditional script. Since 1974 reformed script has been in use. In reformed Malayalam script, VOWEL SIGN U, VOWEL SIGN UU, and VOWEL SIGN VOCALIC R are no longer non-spacing marks, and that VOWEL SIGN VOCALIC RR, and VOCALIC RR, VOCALIC L, VOCALIC LL have been abolished.

Traditional Malayalam script allows placing a virama on a cluster already carrying the u matra, to notify a short u sound.

(Malayalam has also been written in Arabic script, using some extra letters, I will try to find out details)

Additional characters in Malayalam

@ dependent vowel signs
0D44 MALAYALAM VOWEL SIGN VOCALIC RR
0D62 MALAYALAM VOWEL SIGN VOCALIC L
0D63 MALAYALAM VOWEL SIGN VOCALIC LL

The vowel sign for VOCALIC RR is in Gundert, p, 290, entry kRR, and p. 697 pRR. The dependent vowel signs for VOCALIC L and VOCALIC LL are the the same glyphs subscribed. The sign for VOCALIC LL can be found in Gundert on p. 290, entry kLLptam.

In current use is a sign for ordinals, which can be found in newspapers and in many Malayalam fonts. Its use corresponds with -th in English.

@ additional symbols
0D70 MALAYALAM ORDINAL SIGN

looks like letter n with curl on right side.

Malayalam numerals are hardly ever used nowadays.

Also found a single reference to a sign for ONE HALF [Frohnmeyer]. Possibly it is something like the Bengali currency numerators.

Malayalam names for characters (TODO)

Thai

TODO (canonical ordering)

Lao

TODO (canonical ordering)

Tibetan

TODO (canonical ordering, composition, meaning of symbols in English)

Converting ISCII to Unicode

Today, ISCII is, in its various incarnations, the most widely used character set for Indian languages. However, round trip convertion between the latest version of ISCII, from 1991, and Unicode is not possible, for the following reasons:

1. ISCII makes a distinction between Bengali and Assamese,
2. ISCII encodes font and style changes,
3. ISCII uses various techniques to encode variant renderings.
4. ISCII encodes several characters not in Unicode.

The first two problems are not within the scope of Unicode, the third is partly in scope of Unicode (as the variants are sometimes significant), and the last definitly is, and needs fixing.

Some further issues are described with the respective scripts above.

It should be noted that Unicode is based on an older version of ISCII. This is no problem, as translation tables/routines should be used in any case. It should also be noted that translation tables/routines should be used to translate from one Indic script to another. A mere addition of a constant to the character codes does not yield defined Unicode characters in all cases.

Semantics of Joiner and Non-Joiner in Relation with ISCII Usage

ISCII-91 uses the doubling of some characters, and nukta to indicate rendering variants. These idioms will have to be translated to their Unicode equivalents. The idioms proposed here are choosen such that they follow Unicode conventions already in place, and such that they have minimal impact on systems that do not use these conventions: both joiner and non-joiner will not print. Some of these conventions are not part of ISCII, but implemented by its major promoters, C-DAC in Pune in its products.

Character type                   ISCII Code(s)           Unicode 

C - consonant
H - halant/viram
N - nukta
J - joiner
X - non-joiner
V - vowel
M - vowel sign
D - vowel modifier
S - non-spacing diacritic

de facto         proposed
ISCII-91         Unicode           Semantic
idiom            idiom

C1 H C2          C1 H C2           Use standard conjunct of C1 and C2.
C1 H N C2        C1 H J C2         Use half-form of C1, followed by full C2
C1 H H C2        C1 H X C2         Use C1 with halant, followed by C2 (no conjunct)
C1 H H H C2      C1 H J J C2       Use variant conjunct of C1 and C2
                                   (applications may use more J to indicate further variants)
C1 H N           C1 H J            (Malayalam) Use cillu letter
C1 H N N C2      C1 H J H C2       (Malayalam) Use secondary consonant under cillu letter

C M              C M               Use optionally ligating vowel-sign
C M M            C X M             Use standard vowel sign (no ligating)
C M M M          C M J             Use ligating variant of vowel sign
                                   (applications may use more J to indicate further variants)

in the above three cases in ISCII this behaviour is only implemented for a few consonant--vowel sign combinations, for example in Devanagari: Ru, Ruu, Hr and Gujarati: Ru, Ruu.

Further idioms can be used for special usages.

                V M               apply vowel sign to a vowel.
                V H C             apply secondary consonant to a vowel.
                C M H             apply both vowel sign and halant (as is used in Malayalam)

Further, ISCII uses a number of code-extention techniques to access a number of lesser used characters. The following combinations should be noted:

anusvara + nukta     -> om sign
danda + nukta        -> avagraha
i + nukta            -> vocalic l
ii + nukta           -> vocalic ll
vocalic r + nukta    -> vocalic rr
vs i + nukta         -> vs vocalic l
vs ii + nukta        -> vs vocalic ll
vs vocalic r + nukta -> vs vocalic rr

The syntax of an Indic syllable

word ::=
initial-syllable syllables.

(note: in Bengali, word-initial vowel sign e, o, etc. are sometimes rendered with a different glyph, hence the difference in the syntax)

syllables ::=
syllable syllables
| empty

syllable ::=
base ending

base ::=
vowel
| consonant
| cluster

cluster ::=
consonant halant cluster

consonant ::=
C [ N ]

halant ::=
H [ J ] [ J | X ]

ending ::=
[ [ X ] M ] [ J ] [ D ]
| [ [ M ] H ]

Converting from One Indic Script to Another

Many people in India know more than one language, but only one or two scripts. For example, somebody living in Orissa may have no problem understanding Bengali and Hindi, but may not be able to read their scripts. For this reason, script convertion will often be beneficial. For this purpose, tables have to be prepared to convert one script to another. I propose here a set of tables, which map to and from the Devanagari block, to allow such convertions. The Devanagari block is choosen, because it is the most complete block in the standard, and the most widely known script in India. It should be noted that these tables do not give round-trip convertions.

The tables or software using them should take into account the possibility of decomposition of NNNA into N + NUKTA, etc.

DEVANAGARI -> GUJARATI -> DEVANAGARI

DEVANAGARI -> ORIYA -> DEVANAGARI

DEVANAGARI -> BENGALI -> DEVANAGARI

DEVANAGARI -> ASSAMESE -> DEVANAGARI

DEVANAGARI -> GURMUKHI -> DEVANAGARI

DEVANAGARI -> MALAYALAM -> DEVANAGARI

DEVANAGARI -> TAMIL -> DEVANAGARI

DEVANAGARI -> KANNADA -> DEVANAGARI

DEVANAGARI -> TELUGU -> DEVANAGARI

In addition, the following tables will be produced:

DEVANAGARI -> ARABIC -> DEVANAGARI (DV->AR is not Hindi->Urdu, because

the Urdu orthography differs)

GURMUKHI -> ARABIC -> GURMUKHI

DEVANAGARI -> ROMAN -> DEVANAGARI (using diacritics)

DEVANAGARI -> ROMAN (without diacritics)

and all other scripts to Roman.

characters not given in the table should not be changed.

Appendix 1 Transliteration of Devanagari to Roman

The transliteration given is the one most commonly seen in Indological publications. Where clear concensus exist on a certain transcription, common alternatives are also given.

Devanagari    Transliteration  Devanagari	       Transliteration
Character 	    Character(s)     Name             Name(s)
------------------------------------------------------------------------------
0901          1E41             candrabindu      m with dot above
              006D 0310                         m + candrabindu
0902          1E43             anusvara         m with dot below
0903          1E25             visarga          h with dot below

0905          0061             a                a
0906          0101             aa               a with macron
0907          0069             i                i
0908          012B             ii               i with macron
0909          0075             u                u
090A          016B             uu               u with macron
090B          1E5B             vocalic r        r with dot below
              0072 0325                         r with circle below
090C          1E37             vocalic l        l with dot below
              006C 0325                         l with circle below
090D          00EA             candra e         e with circumflex
090E          0115             short e          e with breve
090F          0065             e                e
0910          0061 0069        ai               a + i [see note 1]
0911          014F             candra o         o with circumflex
0912          014D             short o          o with macron
0913          006F             o                o
0914          0061 0075        au               a + u [see note 2]
0915          006B             k                k [see note 3]
0916          006B 0068        kh               k + h
0917          0067             g                g
0918          0067 0068        gh               g + h
0919          1E45             ng               n with dot above
091A          0063             c                c
091B          0063 0068        ch               c + h
091C          006A             j                j
091D          006A 0068        jh               j + h
091E          00F1             n                n with tilde
091F          1E6D             tt               t with dot below
0920          1E6D 0068        tth              t with dot below + h
0921          1E0D             dd               d with dot below
0922          1E0D 0068        ddh              d with dot below + h
0923          1E47             nn               n with dot below
0924          0074             t                t
0925          0074 0068        th               t + h
0926          0064             d                d
0927          0064 0068        dh               d + h
0928          006E             n                n
0929          1E49             n                n with line below
092A          0070             p                p
092B          0070 0068        ph               p + h
092C          0062             b                b
092D          0062 0068        bh               b + h
092E          006D             m                m
092F          0079             y                y
0930          0072             r                r
0931          1E5B             rr               r with dot below
              1E5F                              r with line below
0932          006C             l                l
0933          1E37             ll               l with dot below
0934          1E3B             lll              l with line below
              006C 0324                         l + double dot below
0935          0076             v                v
0936          015B             s                s with acute
0937          1E63             s                s with dot below
0938          0073             s                s
0939          0068             h                h

093C                           nukta [see note 4]
093E                           avagraha

[TODO vowel signs same as corresponing vowels]

0958          0071             q                q
0959          1E35 0068 0331   kkh              k with line below + h + macron below

[TODO]

0960          1E5D             vocalic rr       r with dot below and macron
              0072 0325 0304                    r with circle below and macron
0961                           vocalic ll

Indian names are normally transliterated in an ad-hoc fashion. Normally an Indian uses a single Roman spelling of his name, while another Indian carrying the same name may transliterate it in a different way. For example, the name Chauduri can be found in at least eight different spellings. One individual preferring Chowdury, the other Chaudhuri. Indian telephone directories normally sort all such names as if they had been spelled the same, and add cross-references with the other spellings.

[note 1] The sequence ai should be disambiguated with a diaeresis on the i if necessary.

[note 2] au likewise.

[note 3] Depending on context, a should be added after the consonants.

[note 4] Nukta following some letters modifies their sounds. The combination letter + nukta should be transliterated as a unit.

Appendix 2 Transliteration of Devanagari to other Indic scripts

A comma separated table specifies how Devanagari can be converted into any of the other Brahmi derived scripts in use in India. Note that the translations are not one-to-one and are not reversible. Approximate translations are given in parentheses, alternatives are given in case of ambiguities. Characters borrowed from the Devanagari block are given in brackets. When converting character data from one script to another, the following should be noted:

[1] Gurmukhi: when a consonant is doubled, this is indicated with U+0A71 for example 0915 094D 0915 maps to 0A15 0A71.

History

    01-FEB-1998 Added notes on Telugu.
    20-DEC-1997 Added note on wrong glyph in Devanagari table,
                Worked on tables (JH)
    30-OCT-1997 Converted to HTML, added illustrations (JH)
    29-OCT-1997 Updated remarks on Tamil (JH)
    27-OCT-1997 Updated remarks on Malayalam, Devanagari (JH)
    08-SEP-1997 Added TAMIL SHRI SIGN, changed some remarks with Oriya (JH)
    04-SEP-1997 First posting on unicode@unicode.org (JH)
    29-AUG-1997 Removed comments on Unicode 1.1 now resolved (JH)
                Added remarks on Thai, Lao, and Tibetan.
    26-AUG-1997 Changed canonical ordering table (JH)
    25-AUG-1997 Added correction of canonical ordering table, and
                list of graphical decompositions (JH)
    17-AUG-1997 Added note on canonical ordering of Indic characters
                Several minor modifications. (JH)
    16-JUN-1997 Added more notes on convertion from ISCII to Unicode (JH)
    19-MAY-1997 Added notes on convertion from ISCII to Unicode (JH)
    11-MAY-1997 Revision with respect to Unicode 2.014 (JH)
    22-JUN-1994 Revision (JH)

Acknowledgements

I like to express my thanks to the following persons, who provided feed-back on previous versions of this document:

Anthony P Stone <stone_catend@compuserve.com> (Various scripts, especially Bengali and Oriya)
Cibu C.J. <cibu@miel.mot.com> (Malayalam)

Expected	Reordered
<KA> <vs E> <anusvara>	<KA> <anusvara> <vs E>
<KA> <vs I> <anusvara>	(not reordered)