Language Log

Kazakh

December 12, 2014 @ 8:08 pm · Filed by Victor Mair under Language and computers, Translation

Google Translate just keeps getting bigger and bigger and better and better. As of today, it now includes Kazakh. And here's the first word that I typed in Google Translate + Kazakh:

Қазақ

You enter text with the official Kazakh Cyrillic alphabet, which is accessible through a keypad at the bottom left of the entry box. If you click on the "Ä" button, it will give a phonetic reading of the words in the box. You can also click on the speaker symbol to hear an audio reading of the words.

From Adam Grode:

Over the course of this past year, I, along with my translation partner, Ainelya Musina, was commissioned to complete Google Translate + Kazakh by translating and rating nearly 20,000 phrases.

By my count, there are 90 languages listed under the "Detect language" pull-down bar. But Google Translate includes both simplified and traditional Chinese characters, so that ups the total to 91. And Google also supports Cantonese inputting, as is demonstrated in this remarkable video.

I never cease to marvel at what Google Translate can do.


25 Comments

Chris C. said,

December 12, 2014 @ 10:06 pm

Well, doing the ol' round-trip test on the Kazakh translator, the opening sentence of "Pride and Prejudice" came back, "It is owned by one person, it is generally accepted truth a good fortune, must be under the age of his wife."

Which is better than I had expected, to be honest.
Leo E said,

December 12, 2014 @ 11:24 pm

This is fantastic. It's one of ten new languages added, the others being Chichewa, Malagasy, Sesotho, Malayalam, Myanmar, Sinhala, Sundanese, Tajik, and Uzbek. I can't remember when Azerbaijani and Mongolian were added, but it's great to see some more Central Asian representation (as well as Africa, India, and southeast Asia). I've been waiting for Uzbek for quite some time now, and according to a briefing by Google they are hoping to add Kyrghyz soon (fingers crossed for Uyghur and Tibetan next). Happy to see they incorporated Shavkat Butaev's dictionary into Uzbek, and I'll be interested to see what I find looking into the Kazakh.

Aside from its usefulness and better success rate at translating isolated words and phrases, as well as converting text, pronunciation, and its very useful input methods, Translate's more daunting task is syntax. I was curious as to how it would tackle the notoriously labyrinthine sentences of Turkic, so I looked at how it handled a sentence from the BBC's Uzbek news page.

Истанбулдаги Ўзбеклар Бирлиги ташкилотининг раҳбари ўзбекистонлик имом Абдуллоҳ Бухорийнинг ўлими ортида Ўзбекистон ва Россия махфий хизматларини гумон қилаётганини билдирди.
Istanbuldagi O'zbeklar Birligi tashkilotining rahbari o'zbekistonlik imom Abdulloh Buxoriyning o'limi ortida O'zbekiston va Rossiya maxfiy xizmatlarini gumon qilayotganini bildirdi.

which I translate as:
"The leader of the Uzbek Union organization in Istanbul has announced that he suspects the Uzbek and Russian secret services of being behind the death of the Uzbek Imam Abdulloh Bukhoriy."

Google gives: "Istanbul, the Uzbek Uzbek behind the death of Imam Abdullah Bukhari, head of the Union of Uzbekistan and the Russian secret services, said the suspects."

Alright, can't really blame them for the relative and possessive clauses which can stretch between phrases and be resolved pretty far down the line. So breaking it up into what I think are relatively isolated clauses:

Истанбулдаги Ўзбеклар Бирлиги ташкилотининг раҳбари
Istanbuldagi O'zbeklar Birligi tashkilotining rahbari
"the head of the Uzbeks' Unity organization in Istanbul"
Google: "Uzbek League head of the Istanbul"

ўзбекистонлик имом Абдуллоҳ Бухорийнинг ўлими ортида
o'zbekistonlik imom Abdulloh Buxoriyning o'limi ortida
"behind the death of the Uzbek Imam Abdulloh Bukhoriy"
Google: "Uzbek behind the death of Imam Abdullah Bukhari"

Ўзбекистон ва Россия махфий хизматларини гумон қилаётгани [-ни]
O'zbekiston va Rossiya maxfiy xizmatlarini gumon qilayotgani [-ni]
"his suspecting ["doing suspicion of"] the secret services of Uzbekistan and Russia"
Google: "Uzbekistan and the Russian secret services suspect"

билдирди.
bildirdi
"[has] announced, said"
Google: "said/expressed"

There are still some problems, especially with words beginning clauses, and the above is of course an extremely crude test drive of their software and not intended to be a doleful denunciation of their efforts. I'm excited to see how their Turkic syntax develops, and wish that I were at all able to look at how it handles Turkish or whether Kazakh, Uzbek, Azeri, and Turkish have any overlap in how they are programmed. At least it's a reliable dictionary and input method for now, and someday there will be cyborgs speaking perfectly accented Kazakh and we will have Google to thank for the initial development.
Leo E said,

December 13, 2014 @ 12:21 am

One odd thing I forgot to mention is that (unless I'm crazy and not seeing it) there is no input key for the letter h in Uzbek (there is in Kazakh) – it differentiates between ҳ (h) and х (kh). Makes input kind of awkward since it is a very common letter.
michael farris said,

December 13, 2014 @ 3:43 am

In other Turkic news, Erdogan wants to revive Ottoman Turkish. This should be interesting…..

http://www.washingtonpost.com/blogs/worldviews/wp/2014/12/12/why-turkeys-president-wants-to-revive-the-language-of-the-ottoman-empire/
Keith said,

December 13, 2014 @ 6:51 am

@Leo, and others

I barely ever use Google Translate's input method, because I'm almost always at computers that I truly own, or at my employer's computer that I have almost total control over; as a result, I can set up the keyboard layout exactly how I want in Linux, or almost how I want when I'm forced to use Windows 8 (though the keyboard handlers in 8 are a big step back from what they were in XP).

Even my new phone (Android 4.4) has input methods for more languages than you could shake a stick at (including, for example, Georgian, Kazakh and Armenian).

So I can type letters like қ, к, ң, ә, х, ү, ұ, ա, զ, ղ, ց, ր, ե etc, no problem.

I would think that Google Translate's input method will be less and less used, as people use smartphones more and more, and use cybercafés less and less.
Bart said,

December 13, 2014 @ 8:42 am

Prompted by Leo E's tip, I went straight to try out the Sundanese translation facility – something I've long been waiting for.
But, oh no! The Google Translate menu was offering Sudanese, not Sundanese.
I tried it anyway, and yes, it really was Sundanese.
Levantine said,

December 13, 2014 @ 9:45 am

michael farris, I don't think even Erdoğan is crazy enough to think that Ottoman Turkish could be revived in the sense of becoming a viable competitor to the modern iteration of the language. But I do agree with him (words I never thought I'd be saying!) that Turks should be given access to their own literary and linguistic heritage, and teaching Ottoman in schools is the only way to do that. Imagine kids in the anglophone world never being exposed to the English of Shakespeare or Chaucer.
michael farris said,

December 13, 2014 @ 10:21 am

I was under the impression that Ottoman lessons are in fact available for those interested (esp at universities).

I think Erdogan's less interested in young Turkish people than in attacking one more (and one of the most enduringly popular) western reforms of Ataturk.
Levantine said,

December 13, 2014 @ 10:40 am

I agree that his motivations are ideological, and I'm worried about how such lessons might be framed. But it is a great shame that the majority of modern Turks have no way of reading or understanding their language as it existed less than a hundred years ago. Istanbul is a city full of Ottoman inscriptions, which may as well be in Latin for all that they mean to most people now. Learning how to read Ottoman is not particularly difficult, and as for the existing availability of lessons in university contexts, do you think that young people in the English-speaking world should have to wait until entering higher education before being exposed to their literary heritage?
Ben Zimmer said,

December 13, 2014 @ 12:20 pm

I also tried out Sundanese, and it has the same growing pains with respect to place names that we saw back in 2008 with European languages ("Made in USA == Made in Austria|France|Italy", 3/23/2008; "Austria == Ireland?", 3/24/2008; "Why Austria is Ireland", 3/24/2008; "The (probable) truth about Austria and Ireland", 3/24/2008; etc.). So the Sundanese Wikipedia page for "Jawa Kulon" (West Java), when translated, is titled "Arkansas." And the list of regencies in West Java takes a Latin American tilt, including Ecuador, Paraguay, Venezuela, Belize, and Brazil.
Rod Johnson said,

December 13, 2014 @ 12:50 pm

It says "Sundanese" now.
Victor Mair said,

December 13, 2014 @ 1:34 pm

Which is harder — learning Latin or Middle English if you already know Modern English OR learning Ottoman Turkish if you already know Modern Turkish?

But there's another question here: how many speakers of Modern English can read Latin / Middle English without too much difficulty? And, finally, does it matter whether all / most speakers of Modern English are able to read Latin / Middle English or not?
Levantine said,

December 13, 2014 @ 2:48 pm

Professor Mair, I didn't suggest that speakers of Modern English should know or learn Latin, and though I did mention Chaucer, I did so only to make the point that most people in the anglophone world seem to agree that school children benefit from being taught something about their linguistic and literary heritage. I don't think my posts implied that speakers of Modern English are generally able to read either Latin or Middle English (though I remember being gratified as a schoolboy that, with a bit of effort, I could at least make some sense of The Miller's Tale).

In any case, Ottoman Turkish can hardly be equated with Latin or Middle English; it is how Turkish was spoken and written until 1928. The picture is complicated somewhat by the fact that Ottoman encompassed a variety of registers ranging from the everyday spoken language, which is not so far removed from today's Turkish, to a highfalutin literary style that would have been abstruse to all but the most educated speakers. Between these extremes were such texts as newspapers and personal letters.

My answer to your final question is no, it doesn't. But if it matters for speakers of English to be able to understand something written in their own language less than a hundred years ago, then I would say the same ability matters no less to speakers of Turkish. My mother recently showed me a mid-twentieth-century family photo that included a number of individuals whose identities she was unsure of. She asked if I could clarify things by reading the inscription on the back, which she took to be Arabic (some of her family are Arab). It was, in fact, written in Ottoman Turkish (several decades after official Latinisation), and not only did it identify all the people in the picture, but it also revealed that my uncle's birth name was different from his name as my mother knew it. This was fascinating and valuable information that did not pertain to a long-lost culture, but to my mother's immediate relatives. The inscription required nothing more to decipher than knowledge of the Ottoman script.

This brings me to your opening question. As a heritage speaker of Turkish, I was able to read printed Ottoman texts with the aid of a dictionary after only one class of Ottoman Turkish (and by class, I mean a single session rather than the entire course). Yes, handwritten documents can be a struggle (though this goes for any language), and yes, I still rely on my dictionary when reading more grandiloquent texts, but to learn the Ottoman writing system, which is the gateway to understanding all varieties of written Ottoman Turkish, is not difficult for anyone who knows the modern language.
Leo E said,

December 13, 2014 @ 3:14 pm

I have very limited exposure to Ottoman, and what I do know about it is by way of Chaghatay which was also a Turkic language of a highly Persified courtly culture, though not used as extensively in bureaucracy as Ottoman. One thing that I would suggest complicates learning Ottoman is that the kinds of documents it was used to write also contain a large amount of Arabic – not just loan words but full Arabic passages. Chaghatay is the same way in assuming the reader will be able to understand an Arabic sentence which spontaneously occurs in a passage, and part of the distancing of modern Turkish from Ottoman, as the WP article mentions, was finding Turkic words to replace a lot of the overtly Islamicate forms of Ottoman. So what does it mean for Ottoman to be revived? Are there documents in Ottoman that don't use Arabic, or would reviving Ottoman somehow also tacitly include reviving Arabic in some way? Any information on the specifics of the proposed Ottoman curriculum? As an aside, I know someone from Turkey who described being harassed on the bus for studying Ottoman in public and said that there was a good deal of popular antipathy for it years ago, but Falcon Man (Erdoğan) probably doesn't care too much about that.

@Keith – yeah, that's the smart way to do it. I'm stuck with Windows 8 and haven't ever tried using Linux or individually setting the characters of the input method, though apparently I should.
Simon P said,

December 13, 2014 @ 3:14 pm

The Cantonese input method is fascinating for several reasons. One is the "write sort of how it sounds to you, spelled in a Western language that you know" basis of the system. There are no children educated in/about Cantonese, so the vast majority of users will have little knowledge of the sound system of the language, nor will they know any of the romanization systems (Jyutping is the new standard, but the Google input system shows Yale as the "standard"). It also allows for sound changes, so you can write "ley" and the software will know that an 'l' initial can just as well be an 'n' in the romanization system.

Another aspect is the growing acceptance of Cantonese as an actual language. Google and Apple have always toed the Party line that (written) Cantonese doesn't exist, and it has previously not been possible to input it on phones except by using systems that were designed for Mandarin, nor has Cantonese been available on Google Translate. However, with the new trend of Cantonese novels, the pride in Cantonese shown by HK youth and the overwhelming evidence of HKers using Cantonese to write text messages and Facebook posts, these companies are beginning to see there's money to be made from catering to these customers. An underreported little revolution happened with the release of iOS8. Now, if you use the handwriting input to write a Cantonese character like 佢 (he/she/it), the suggestion box doesn't gape empty the way it did before. The software will now suggest a 地, making the plural compound 佢地 (though I prefer to write it 佢哋, but that's just a matter of habit), meaning "they". Apple has silently admitted the existence of written Cantonese, and now Google is doing the same.
Levantine said,

December 13, 2014 @ 3:55 pm

Leo E, Ottoman Turkish was used to write texts of all varieties, from the kind of Arabic-laden documents you describe to earthier texts that are quite easy for modern speakers to understand in transliteration. It's interesting to note that fifteenth-century written Ottoman is often closer to the modern idiom; it wasn't until somewhat later that a more bombastic register with a preponderance of foreign borrowings began to dominate in educated circles. Though certain Ottoman texts are undoubtedly challenging (and they would have been so even in their own time), a speaker of modern Turkish who knows the Ottoman writing system and is armed with a dictionary has instant access to many types of document. In some cases (as in the inscribed photo I mentioned in my last post), the difference really is one of script only.
Lane said,

December 13, 2014 @ 4:58 pm

Geoffrey Lewis called Ataturk's reform "A catastrophic success" in this absorbing lecture: http://www.turkishlanguage.co.uk/jarring.htm

As for the improvements on Google Translate, I applaud the Latin-character-input-to-Cyrillic-character-output now available for Russian. For those like me who haven't got the Cyrillic keyboard down pat but who know some Russian, inputting it is a snap – just type how it sounds in Latin letters and watch Google serve up the Cyrillic flawlessly.
leoboiko said,

December 13, 2014 @ 6:15 pm

Well, Google Translate does have Latin these days, for better or worse. I for one hope one day they also add Middle and Old English, Sanskrit, Classical Chinese, Attic Greek, Old Tupi…
Mark Stephenson said,

December 13, 2014 @ 10:05 pm

I see the Uzbek paragraph from the BBC is twice as long as the (human) translation into English. Is this typical for Uzbek?
Levantine said,

December 13, 2014 @ 10:19 pm

Mark Stephenson, the Uzbek quotation given above contains the same text twice, once in the Cyrillic alphabet and once in the Latin. Turkic languages (well, Turkish at least) actually tend to take up less room on the page than their English translations.
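
The correspondence between the Cyrillic and Latin lines of the quotation is close to letter-for-letter, which can be illustrated with a small sketch. This is a simplified table, not an official transliteration standard: it assumes a plain ASCII apostrophe in o' and g', it ignores the rule that word-initial е is written "ye", and it leaves ц and any other unlisted character unchanged. The names `CYRILLIC_TO_LATIN` and `to_latin` are made up for this illustration.

```python
# Simplified Uzbek Cyrillic -> Latin table (lowercase keys only).
# Assumption: ASCII apostrophe in o' and g'; "ц" and the word-initial
# "е" -> "ye" rule are deliberately left out of this sketch.
CYRILLIC_TO_LATIN = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e",
    "ё": "yo", "ж": "j", "з": "z", "и": "i", "й": "y", "к": "k",
    "л": "l", "м": "m", "н": "n", "о": "o", "п": "p", "р": "r",
    "с": "s", "т": "t", "у": "u", "ф": "f", "х": "x", "ч": "ch",
    "ш": "sh", "ъ": "'", "э": "e", "ю": "yu", "я": "ya",
    "ў": "o'", "қ": "q", "ғ": "g'", "ҳ": "h",
}


def to_latin(text):
    """Transliterate character by character; unknown characters pass through."""
    out = []
    for ch in text:
        latin = CYRILLIC_TO_LATIN.get(ch.lower())
        if latin is None:
            out.append(ch)                  # spaces, punctuation, unlisted letters
        elif ch.isupper():
            out.append(latin.capitalize())  # "Ў" -> "O'", "Ш" -> "Sh"
        else:
            out.append(latin)
    return "".join(out)


print(to_latin("Ўзбекистон ва Россия махфий хизматларини"))
```

Note that ҳ and х map to different Latin letters (h and x), which is the distinction Leo E mentions when discussing the Uzbek keypad.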
Darragh McCurragh said,

December 14, 2014 @ 8:48 am

To my knowledge "Ottoman" Turkish is no different to "modern" Turkish except that the alphabet, formerly Arabic, which did not fit the Turkic language(s) at all, was "Latinized". And because the shift happened in 1928 (if I'm not mistaken) and transliterations were made from the spoken word, there has been practically no "drift" since then, i.e. there are likely far fewer dyslexics in Turkey (and they also learn to count two years earlier, as they have an "orthogonal" number system – it's not "eleven" or "twelve", it's "ten-one", "ten-two", etc. – while the French, on the other hand, with their "quatre-vingt-dix" ["four-times-twenty-then-add-ten"] for 90, learn it two years later than average). That said, even if Erdowan (that's how it's "pronounced") goes overboard with some of his claims of mosques in Cuba at the time of Columbus, he seems to suggest nothing more than teaching the pupils the Arabic alphabet and how to read Turkish through that "lens". When I was at school, I greatly profited from being taught the Germanic letters (at the time most students would still have parents or at least grandparents whose letters they would otherwise not be able to read). And I am grateful I can read German literature from the 18th or 19th centuries and what was printed during the Third Reich …
Levantine said,

December 14, 2014 @ 11:35 am

Darragh McCurragh, it isn't that simple. Most registers of written Ottoman Turkish (and early Republican Turkish for that matter) are quite unlike their modern counterparts, particularly in lexical terms. The shift in script was accompanied by the abandonment of many terms derived from Arabic and Persian, and the creation of numerous neologisms to fill the void. Again, the effect of all these changes was more profound in written Turkish than in spoken. I recommend reading the excellent lecture by Geoffrey Lewis that Lane provided a link to above.

As for your assertion that the Arabic script is unsuited to the writing of Turkic languages, this does not square with the fact that it has successfully been used for that purpose for the past millennium. To be sure, the Latin alphabet fits the needs of Turkish especially well, but the Ottoman script worked just fine in its own right, and many of those who were educated in it before the language reforms continued using it in their personal writings after Latinisation.
michael farris said,

December 14, 2014 @ 11:51 am

"the Ottoman script worked just fine in its own right"

Then why the big jump in literacy after the shift to Latin? Or is that a myth or caused by something else?

Knowing the Arabic alphabet and some basic Turkish I really can't imagine a script working that doesn't use more vowels than is customary in Arabic.

That said, the Cyrillic alphabet also works very well.

My basic understanding/opinion is that the shift from Arabic to Latin was a major success while the "purification" of the vocabulary was a lot less so.

In effect it has created a new disconnect, not between a literate minority and illiterate masses but between generations (Turkish university students now have trouble reading textbooks from 50 years ago).

They also, when asked, expressed … dislike for the idea of reviving the Arabic alphabet for Turkish (this was a couple of years ago, maybe Erdogan's march into past glories is more popular with young people now).
Levantine said,

December 14, 2014 @ 12:18 pm

The Latinised Turkish script is, objectively speaking, easier to read, write, and learn, since it basically transcribes the sounds of the spoken language. Written Ottoman, by contrast, resembles English in that the orthography is not always transparent and has to be learnt. But just as literate anglophones are able to deal with such orthographic challenges, so too were literate Ottomans. The low levels of literacy before 1928 should be attributed to poor schooling more than anything else. This isn't to say that Latinisation didn't facilitate the subsequent jump in literacy, because it most certainly did. But such orthographic simplification would make literacy easier in any language, including English.

Regarding the lack of vowels in written Arabic, Ottoman found many ways of dealing with this issue, including the use of different consonants to indicate the intended vocalisation. Thus the suffix -mek is written with the Arabic letters m[im]-k[af], while -mak is written m[im]-q[af]. Had the script been as unsuitable as many seem to believe, we would not have the evidence we do for its continued use in personal writings after 1928.
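
The -mek/-mak example can be made concrete with a minimal sketch. It assumes lowercase modern Turkish verb stems and the usual front/back vowel-harmony rule for the infinitive suffix, and it pairs each suffix with the two Arabic letters named above. The function name `infinitive` is hypothetical, and real Ottoman spelling has many more conventions than this shows.

```python
BACK_VOWELS = set("aıou")
FRONT_VOWELS = set("eiöü")

MIM_KAF = "\u0645\u0643"  # mim + kaf, the letters used for -mek
MIM_QAF = "\u0645\u0642"  # mim + qaf, the letters used for -mak


def infinitive(stem):
    """Return (modern infinitive, Arabic-script letters of the suffix).

    The last vowel of the stem decides the suffix: back vowel -> -mak,
    front vowel -> -mek. The stem is assumed to be lowercase.
    """
    for ch in reversed(stem):
        if ch in BACK_VOWELS:
            return stem + "mak", MIM_QAF
        if ch in FRONT_VOWELS:
            return stem + "mek", MIM_KAF
    raise ValueError("stem contains no vowel: " + repr(stem))


for stem in ("gel", "yaz", "gör", "al"):
    print(stem, infinitive(stem))
```

The point of the sketch is that the choice of final consonant letter carries the front/back information that the unwritten vowel would otherwise carry.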

To be clear, I don't think that Ottoman should be revived as a working script. This is simply not going to happen, and neither should it. But to give Turks the ability to read hundreds of years' worth of their linguistic heritage is surely no bad thing.
Aaron Posehn said,

December 16, 2014 @ 5:43 am

This is fantastic! I've been waiting for Kazakh for about three years now, so I'm very happy it's finally here.
