Notes on mapping into ITRANS from ISCII and Unicode.

ITRANS offers a widely used 7-bit, phonetically mnemonic encoding for Indian languages. It appears to be losslessly interconvertible with the ISCII and Unicode encodings for these languages, provided one keeps careful track of which parts of the output are really ITRANS encodings and which are not. Without that bookkeeping the results will NOT be safely round-trippable, because a converter has no way to tell transliterated Hindi from ordinary roman-alphabet text.

If one wanted a safely round-trippable roman-alphabet encoding of (for instance) Hindi, one would need to use techniques like those employed by the (once?) widely-used HZ encoding scheme for Chinese.
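HZ marks its shifts into and out of Chinese text with the two-character sequences ~{ and ~}. A minimal sketch of the same idea for ITRANS follows. The delimiters and the function names frame and unframe are hypothetical, chosen only for illustration; because the tilde also occurs in ITRANS text, the sketch uses a backslash as its escape character instead.

```python
# Hypothetical HZ-style framing. These delimiters are illustrative only;
# they are not defined by ITRANS or by HZ.
ESC = "\\"           # assumed escape character
OPEN = ESC + "{"     # shift into ITRANS mode
CLOSE = ESC + "}"    # shift back to plain ASCII


def frame(segments):
    """Join (is_itrans, text) pairs into one string with explicit mode shifts."""
    out = []
    for is_itrans, text in segments:
        # Double any literal escape character so it is never read as a shift.
        safe = text.replace(ESC, ESC + ESC)
        out.append(OPEN + safe + CLOSE if is_itrans else safe)
    return "".join(out)


def unframe(framed):
    """Recover the (is_itrans, text) pairs from a framed string."""
    segments, buf, in_itrans, i = [], [], False, 0
    while i < len(framed):
        ch = framed[i]
        if ch == ESC and i + 1 < len(framed):
            nxt = framed[i + 1]
            if nxt == ESC:
                buf.append(ESC)              # doubled escape: literal backslash
            elif nxt in "{}":
                if buf:                      # close the segment collected so far
                    segments.append((in_itrans, "".join(buf)))
                buf = []
                in_itrans = (nxt == "{")
            else:
                raise ValueError("unknown escape sequence at offset %d" % i)
            i += 2
            continue
        buf.append(ch)
        i += 1
    if buf:
        segments.append((in_itrans, "".join(buf)))
    return segments


if __name__ == "__main__":
    sample = [(False, "abandonment N "), (True, "parityaAga")]
    print(frame(sample))
    print(unframe(frame(sample)) == sample)
```

With framing of this kind, a reverse converter only needs to touch the segments flagged as ITRANS and can pass everything else through unchanged.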

The ITRANS 5.3 software converts from ITRANS to UTF-8 (among other things). However, it doesn't do any mappings into ITRANS. For folks who can't read Devanagari easily, mapping into ITRANS from other common encodings of Hindi may be helpful.

1. Conversion from ISCII to ITRANS

This uses a simple Python script due to Arun Sharma (arun@sharma-home.net).

We start with a sample from the start of the IIIT English-Hindi dictionary, processed by Xiaoyi Ma to contain lines of the form

  ENGLISH-WORD   POS   HINDI-WORD

where the HINDI-WORD is encoded in ISCII.
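As a rough illustration of this line format, here is a sketch that splits one raw line and transliterates only the third field. It is not Arun Sharma's script: convert_entry and its convert_hindi argument are hypothetical names, the fields are assumed to be separated by runs of ASCII whitespace, and the stand-in converter in the demo just dumps the ISCII bytes as hex.

```python
def convert_entry(line, convert_hindi):
    """Split one raw dictionary line and transliterate only its Hindi field.

    `line` is a bytes object, since ISCII is an 8-bit encoding.
    `convert_hindi` is a caller-supplied function from ISCII bytes to text
    (hypothetical; any real ISCII-to-ITRANS mapper could be passed in).
    """
    # Split on the first two runs of ASCII whitespace only, so that any
    # whitespace inside the Hindi field stays with that field.
    english, pos, hindi = line.rstrip(b"\r\n").split(None, 2)
    return "\t".join(
        [english.decode("ascii"), pos.decode("ascii"), convert_hindi(hindi)]
    )


if __name__ == "__main__":
    # Stand-in converter for the demo: show the raw bytes as hex.
    print(convert_entry(b"abdomen   N   \xc8\xda\n", lambda raw: raw.hex()))
```

Keeping the English word and the part-of-speech tag out of the converter is one simple way of tracking which part of each line is ITRANS.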

Testing

$  ./iscii2itrans.py <hdictsample1

produces

a    Det    Eka
aback    Adv    pIChE
aback    Adv    -arka-taprabha
abacus    N    ginataAraA
abandon    V    ChODa??~dEnaA
abandoned    Adj    ChODa??A~-arkauA
abandonment    N    parityaAga
abase    V    avamaAnita~karanaA
abashed    Adj    lajjita
abate    V    kama~-arkaOnaA
abatement    N    kamI
abattoir    N    vadhashaAlaA
abbess    N    maThaAdhyakssaA
abbey    N    IsaAiyO~kaA~maTha
abbot    N    maThaAdhyakssa
abbreviate    VT    saamkssipta~karanaA
abbreviation    N    saamkssipti
abdicate    V    svEchChaA~sE~ChODa??naA
abdication    N    pada~tyaAga
abdomen    N    pETa

which looks OK.

2. Conversion from UTF-8 to ITRANS

We start with a sample from the Naidunia newswire (from 1999), translated to UTF-8 and SGML by Kevin Walker and Dave Graff.

The iconverter software from IIT Kanpur is used to convert from UTF-8 to ISCII, reducing the problem to the previously solved one:

$  utf2unicode naidunia_sample.utf8 naidunia_sample.uni
$  iconverter -e unicode_iscii_dev naidunia_sample.uni naidunia_sample.isc
$  ./iscii2itrans.py <naidunia_sample.isc

produces

<!DOCTYPE NEWSWDAY SYSTEM "cynewulf.dtd">
<NEWSWDAY>
<DOC>
<DOCID>NAI19990901.6707</DOCID>
<HEADER>
<DATE>19990901</DATE>
<LANG>Hindi</LANG>
<CATE>1</CATE>
<SRCE>Naidunia</SRCE>
</HEADER>
<BODY>
<HEADLINE>
kuCha laEgaEam nE raAjanIti kaE dhaamdhaA banaA liyaA
svayaam kE -arkaitaEam kE liE gaThabaamdhana aAIra maEchaE banE aha saEniyaAjI
</HEADLINE>
<TEXT>
<P>
 maamgalavaAra agastaa kaAamgarEsa adhyakssa shrImatI saEniyaA gaAandhI nE ka-arka-A -arkaAI ki kuCha laEgaEam nE raAjanIti kaE vyavasaAya banaA liyaA -arkaAIa EsE laEgaEam nE svayaam kE -arkaitaEam kE liE gaThabaamdhana aAIra maEchaE banaA liE -arkaAIama un-arka-EamnE pradhaAnamaamtrI shrI aTalabi-arka-ArI vaAjapEyI aAIra rakssaAmaamtrI shrI ja??rja phanaArraDIsa para kaAragila muddE kaE lEkara  5-arka-malaA 2 baElaAa un-arka-EamnE aAraEpa lagaAyaA ki paAkistaAna dvaAraA kaAragila mEam ghusapAITha kaA mukaAbalaA karanE mEam bhI sarakaAra asaphala ra-arkaI -arkaAIa un-arka-EamnE ka-arka-A ki paAka sE chInI aAyaAta muddE kaE vE janataA kE bIcha lE jaAEangIa
</P>
<P>


which looks OK.
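The three commands used for this conversion could be chained from Python roughly as follows. This is only a sketch: it assumes utf2unicode, iconverter and iscii2itrans.py are installed and accept exactly the arguments shown in these notes, and the function names pipeline_commands and utf8_to_itrans are hypothetical.

```python
import subprocess
from pathlib import Path


def pipeline_commands(utf8_path):
    """Return the two file-to-file commands and the path of the ISCII result."""
    src = Path(utf8_path)
    uni = src.with_suffix(".uni")   # intermediate file written by utf2unicode
    isc = src.with_suffix(".isc")   # ISCII file written by iconverter
    commands = [
        ["utf2unicode", str(src), str(uni)],
        ["iconverter", "-e", "unicode_iscii_dev", str(uni), str(isc)],
    ]
    return commands, isc


def utf8_to_itrans(utf8_path):
    """Run the whole chain and return the converter's output as bytes."""
    commands, isc = pipeline_commands(utf8_path)
    for cmd in commands:
        # check=True raises CalledProcessError if a step fails.
        subprocess.run(cmd, check=True)
    # The last step reads ISCII on standard input, as in the shell version.
    with open(isc, "rb") as iscii_in:
        result = subprocess.run(
            ["./iscii2itrans.py"],
            stdin=iscii_in,
            stdout=subprocess.PIPE,
            check=True,
        )
    return result.stdout


if __name__ == "__main__":
    # Only print the planned commands; nothing is executed here.
    for cmd in pipeline_commands("naidunia_sample.utf8")[0]:
        print(" ".join(cmd))
```

Calling utf8_to_itrans("naidunia_sample.utf8") would then run the same steps as the shell session, stopping at the first step that exits with an error.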

For more kinds of encoding and font interconversion for varieties of digital Hindi, see the relevant section of my notes on Hindi resources.