Issues in translating Indic fonts

In mapping an Indic font/encoding to ISCII or Unicode, there are basically two steps:

1. identify the glyph at each of about 200 active 'code points' in the font (by looking at a font map such as http://ldc.upenn.edu/myl/Shree708pdf), and figure out what the corresponding ISCII or Unicode sequence is;
2. re-arrange the order of codes (in input or output) to correspond to ISCII or Unicode conventions ("logical" rather than "graphical").

There are two approaches to doing this -- one is to set up a simple code-to-string-of-codes mapping, and then use logic to accomplish the needed re-arrangements; the other is to compile it all into a big table of string-to-string mappings (which has to be several thousand entries long to really cover everything).
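As a rough sketch of the second approach, the C fragment below does a greedy longest-match lookup over a tiny string-to-string table. The struct and function names are made up for illustration. The three rows come from the Shree708 example worked through later in this note, with i_matra written as 0xDB, the position of vowel sign I in the ISCII-91 chart. A real table would need the several thousand rows mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One row of a hypothetical string-to-string table. */
struct rule {
    const char *from;  /* byte sequence in the font encoding */
    const char *to;    /* ISCII byte sequence, already in logical order */
};

/* Illustrative rows only. Row order does not matter, because find_rule()
   always picks the longest matching pattern. */
static const struct rule rules[] = {
    { "\xdd",         "\xC6\xE8\xE9" },                  /* half-consonant na */
    { "\xd0",         "\xC4\xE8\xCF" },                  /* conjunct da+ra */
    { "\x7b\xdd\xd0", "\xC6\xE8\xE9\xC4\xE8\xCF\xDB" }, /* whole /ndri/ trigram */
};

/* Return the rule with the longest pattern matching the start of in[0..n),
   or NULL if no pattern matches. */
static const struct rule *find_rule(const char *in, size_t n)
{
    const struct rule *best = NULL;
    size_t best_len = 0;
    for (size_t i = 0; i < sizeof rules / sizeof rules[0]; i++) {
        size_t len = strlen(rules[i].from);
        if (len <= n && len > best_len && memcmp(in, rules[i].from, len) == 0) {
            best = &rules[i];
            best_len = len;
        }
    }
    return best;
}

/* Convert n input bytes into a NUL-terminated ISCII string in out.
   Returns the number of bytes written (excluding the NUL), or -1 if some
   input has no matching rule or out is too small. */
static int convert(const char *in, size_t n, char *out, size_t outsz)
{
    size_t used = 0;
    assert(out != NULL && outsz > 0);
    while (n > 0) {
        const struct rule *r = find_rule(in, n);
        if (r == NULL) return -1;
        size_t flen = strlen(r->from), tlen = strlen(r->to);
        if (used + tlen + 1 > outsz) return -1;
        memcpy(out + used, r->to, tlen);
        used += tlen;
        in += flen;
        n -= flen;
    }
    out[used] = '\0';
    return (int)used;
}
```

The appeal of this design is that no re-ordering logic is needed at run time; the cost is that every out-of-order combination has to appear as its own row.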

Unfortunately, each font/encoding makes different choices about which devanagari "aksharas" to put in the font and which to create by overstriking, and also different choices about where in the code space to locate the glyphs that it chooses.

The following illustrative example from my current endeavor may be helpful. (I hope it is correct! my knowledge of devanagari is pretty shallow...)

The Shree708 sequence consisting of the three characters with hex values 7b dd d0 is reasonably common -- it's about the 1000th most common trigram, out of 44K distinct trigrams observed in a 2.7M-character sample, occurring about once per 7000 characters. It's an (especially complex) representative of a set of interacting issues (out-of-order vowels, dead consonants, conjunct consonants and half-consonants) that come up several times in nearly every sentence.

If we look at the glyphs for these three characters, we can identify them (very painfully for those of us who don't read devanagari!) as

i_matra half-consonant_na conjunct_da+ra

In the header file for my conversion program, these three codes
and their ISCII counterpart sequences are identified as:

/* 7b: vowel i (matra)   */  "\xDB",          /* i_matra */
/* dd: half-consonant na */  "\xC6\xE8\xE9",  /* na halant nukta */
/* d0: conjunct da+ra    */  "\xC4\xE8\xCF",  /* da halant ra */

I believe that this Shree708 sequence encodes the phonetic sequence /ndri/.

If we just concatenated the ISCII outputs, we'd get

i_matra na halant nukta da halant ra

But in fact we need to move the i_matra to the end of the string:

na halant nukta da halant ra i_matra

(or in actual values "\xC6\xE8\xE9\xC4\xE8\xCF\xDB")
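The re-ordering step could be sketched in C roughly as follows. This is a minimal illustration, not the code from my conversion program: the function names are invented, the cluster grammar is deliberately simplified, and the byte values assume the ISCII-91 chart (consonants at 0xB3 through 0xD8, vowel sign I at 0xDB, halant at 0xE8, nukta at 0xE9). It is meant to run on freshly concatenated output that is still in graphical order.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum {
    ISCII_CONS_FIRST = 0xB3,  /* ka */
    ISCII_CONS_LAST  = 0xD8,  /* ha */
    ISCII_I_MATRA    = 0xDB,  /* vowel sign I in the ISCII-91 chart */
    ISCII_HALANT     = 0xE8,
    ISCII_NUKTA      = 0xE9
};

static int is_consonant(unsigned char c)
{
    return c >= ISCII_CONS_FIRST && c <= ISCII_CONS_LAST;
}

/* Length in bytes of the consonant cluster starting at s[0], or 0 if s does
   not start with a consonant. Simplified grammar (an assumption):
     consonant [nukta] ( halant [nukta|halant] consonant [nukta] )*
   A halant that is not followed by a consonant is left outside the cluster. */
static size_t cluster_len(const unsigned char *s, size_t n)
{
    size_t j = 0;
    if (n == 0 || !is_consonant(s[0])) return 0;
    for (;;) {
        j++;                                        /* the consonant itself */
        if (j < n && s[j] == ISCII_NUKTA) j++;      /* underdot nukta */
        if (j >= n || s[j] != ISCII_HALANT) break;  /* cluster ends here */
        size_t k = j + 1;                           /* step past the halant */
        if (k < n && (s[k] == ISCII_NUKTA || s[k] == ISCII_HALANT)) k++;
        if (k >= n || !is_consonant(s[k])) break;   /* no consonant follows */
        j = k;                                      /* continue with it */
    }
    return j;
}

/* Move every i_matra that precedes a consonant cluster to just after that
   cluster, in place, over the n bytes of s. */
static void reorder_i_matra(unsigned char *s, size_t n)
{
    size_t i = 0;
    assert(s != NULL || n == 0);
    while (i < n) {
        if (s[i] == ISCII_I_MATRA) {
            size_t len = cluster_len(s + i + 1, n - i - 1);
            if (len > 0) {
                memmove(s + i, s + i + 1, len);  /* slide the cluster left */
                s[i + len] = ISCII_I_MATRA;      /* matra goes after it */
                i += len + 1;
                continue;
            }
        }
        i++;
    }
}
```

Applied to the seven concatenated bytes of the example, this should yield the logical-order string shown above.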

The corresponding Unicode sequence is also 7 characters, with the
same order but a slightly different idiom for the half-consonant:

na halant zero-width-joiner da halant ra i_matra
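For concreteness, here is the same sequence as Unicode code points, with a small UTF-8 encoder to show the bytes that would end up in a file. The array and function names are hypothetical; the code points are the standard Devanagari ones (NA U+0928, VIRAMA U+094D, DA U+0926, RA U+0930, VOWEL SIGN I U+093F) plus ZERO WIDTH JOINER U+200D.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* na halant ZWJ da halant ra i_matra */
static const uint32_t ndri_unicode[] = {
    0x0928, 0x094D, 0x200D, 0x0926, 0x094D, 0x0930, 0x093F
};

/* Encode one code point as UTF-8 into out (at least 4 bytes of room).
   Returns the number of bytes written, or 0 for an invalid code point. */
static size_t utf8_encode(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  /* surrogates */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Encode n code points into out; returns total bytes written, or -1 on an
   invalid code point or if out (outsz bytes) is too small. */
static int utf8_encode_all(const uint32_t *cps, size_t n,
                           unsigned char *out, size_t outsz)
{
    size_t used = 0;
    assert(cps != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        unsigned char tmp[4];
        size_t len = utf8_encode(cps[i], tmp);
        if (len == 0 || used + len > outsz) return -1;
        memcpy(out + used, tmp, len);
        used += len;
    }
    return (int)used;
}
```

All seven code points fall in the three-byte UTF-8 range, so the 7-byte ISCII string becomes 21 bytes of UTF-8.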

Explanation:

The short /i/ vowel is printed before the consonants in devanagari (and therefore appears in that order in typical fonts/encodings such as Shree708). In ISCII and in Unicode, it appears in "logical" or "phonetic" order, after the glyphs for the consonant or cluster of consonants that follow it graphically.

The regular consonant symbols (such as "da" and "na") carry an implicit short vowel /a/ with them. To cancel this vowel, we need to add a "halant". If explicitly rendered, the halant would be a stroke under the consonant symbol -- also sometimes called "virama" -- but here it is just a sort of control code, indicating that the two characters around it are to be rendered as a single glyph.

So in both ISCII and Unicode, "da halant ra" is the way to encode the da+ra conjunct glyph. In isolation this glyph has the phonetic value /dra/, i.e. carrying its own implicit short /a/, but in the current example, the short /i/ from the preceding i_matra substitutes for this, giving /dri/.

To make a "half-consonant" (a vowel-less combining form that is nevertheless an independent glyph), ISCII places "halant nukta" (= "soft halant") after the consonant letter, while Unicode places "halant ZWJ". The "nukta" is otherwise an underdot diacritic, but here ISCII is just using it as a control character. The "halant nukta" (or "halant ZWJ") sequences are not rendered, but just serve to signal that the preceding consonant (or consonant cluster) is to be rendered in the "half" form.
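The difference between the two idioms can be handled with one special case during ISCII-to-Unicode conversion, sketched below. The helper names are invented, and the lookup covers only the handful of ISCII bytes used in the /ndri/ example (with vowel sign I at 0xDB per the ISCII-91 chart); a real converter would need the full chart and the other ISCII control sequences.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: Unicode code point for a few ISCII Devanagari bytes.
   Returns 0 for any byte this sketch does not cover. */
static uint32_t iscii_to_cp(unsigned char b)
{
    switch (b) {
    case 0xC4: return 0x0926;  /* da */
    case 0xC6: return 0x0928;  /* na */
    case 0xCF: return 0x0930;  /* ra */
    case 0xDB: return 0x093F;  /* i_matra */
    case 0xE8: return 0x094D;  /* halant (virama) */
    case 0xE9: return 0x093C;  /* nukta, when it is a real underdot */
    default:   return 0;
    }
}

/* Convert n ISCII bytes to Unicode code points. A nukta directly after a
   halant is the ISCII "soft halant" idiom and becomes ZWJ (U+200D), so that
   "halant nukta" comes out as "halant ZWJ". Returns the number of code
   points written, or -1 on an uncovered byte or if out is too small. */
static int iscii_to_unicode(const unsigned char *in, size_t n,
                            uint32_t *out, size_t outcap)
{
    size_t used = 0;
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        uint32_t cp;
        if (in[i] == 0xE9 && i > 0 && in[i - 1] == 0xE8)
            cp = 0x200D;               /* control use of nukta */
        else
            cp = iscii_to_cp(in[i]);   /* ordinary one-to-one mapping */
        if (cp == 0 || used >= outcap) return -1;
        out[used++] = cp;
    }
    return (int)used;
}
```

Because the two sequences have the same length and order, the conversion stays a byte-by-byte pass with one byte of look-behind.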

Given that there are dozens if not hundreds of distinct fonts/encodings for online material in Hindi (and other Indic languages), it would be an interesting machine learning problem to try to automatically induce a mapping from the encoding(s) used on one web site to the encodings used on another...