Language Log

Scream cipher

March 3, 2025 @ 8:52 am · Filed by Mark Liberman under Linguistics in the comics, Orthography

Mouseover title: "AAAAAA A ÃA̧AȂA̦ ǍÅÂÃĀÁȂ AAAAAAA!"

It's easy to see that the exchange depicted in the cartoon is "hello" and "hi". I'll leave it to the readers to decode the Mouseover title…

I amused myself on a recent plane trip by trying to turn the strip's letter-correspondence list into a Unicode table:

a A
b Ȧ
c A̡
d A̱
e Á
f A̮
g A̋
h A̰
i Ả
j A̓
k Ạ
l Ă
m Ǎ
n Â
o Å
p A̯
q A̤
r Ȃ
s Ã
t Ā
u Ä
v À
w Ȁ
x A̽
y A̦
z Ⱥ

Some diacritics are similar enough, and Unicode is confusing enough, that I'm not sure I got exactly what Randall Munroe intended. But anyhow, such a table makes it easy to write encoding and decoding programs, so that (according to my table) "Language Log" (after mono-casing) becomes

ĂAÂA̋ÄAA̋Á ĂÅA̋
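A minimal Python sketch of such an encoder and decoder, assuming the table above is read as "A plus one combining mark" per letter (with z as the stroked Ⱥ). The code points are one reading of the table, not necessarily what the comic intends; in particular, b is taken to be A with dot above (U+0307), an assumption made so that b and o (ring above, U+030A) stay distinct. The names MARKS, encode and decode are made up for the illustration.

```python
import unicodedata

# Combining mark for each letter, as read off the table (an interpretation).
# "a" is the bare A; "z" has no combining form and is handled separately.
# Assumption: b = dot above (U+0307), so that it differs from o = ring above.
MARKS = {
    "a": "", "b": "\u0307", "c": "\u0321", "d": "\u0331", "e": "\u0301",
    "f": "\u032e", "g": "\u030b", "h": "\u0330", "i": "\u0309", "j": "\u0313",
    "k": "\u0323", "l": "\u0306", "m": "\u030c", "n": "\u0302", "o": "\u030a",
    "p": "\u032f", "q": "\u0324", "r": "\u0311", "s": "\u0303", "t": "\u0304",
    "u": "\u0308", "v": "\u0300", "w": "\u030f", "x": "\u033d", "y": "\u0326",
}

# Store every ciphertext form in NFC so lookups are consistent.
ENCODE = {ch: unicodedata.normalize("NFC", "A" + m) for ch, m in MARKS.items()}
ENCODE["z"] = "\u023a"  # LATIN CAPITAL LETTER A WITH STROKE

# The table is only decodable if no two letters share a form.
DECODE = {form: ch for ch, form in ENCODE.items()}
assert len(DECODE) == 26, "two letters map to the same ciphertext form"


def encode(text):
    """Mono-case the text and replace each letter a-z by its A-form."""
    return "".join(ENCODE.get(ch, ch) for ch in text.lower())


def decode(text):
    """Split into base-plus-marks clusters and look each one up."""
    text = unicodedata.normalize("NFC", text)
    out, i = [], 0
    while i < len(text):
        j = i + 1
        # Extend the cluster over any following combining marks.
        while j < len(text) and unicodedata.combining(text[j]):
            j += 1
        cluster = text[i:j]
        out.append(DECODE.get(cluster, cluster))
        i = j
    return "".join(out)


print(encode("Language Log"))
print(decode(encode("Language Log")))
```

Anything not in the table (punctuation, spaces, or an A with an unlisted mark) passes through unchanged, and decoding returns lower case because the cipher is mono-cased.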

This silly exercise led me to wonder which language's conventional orthography has the most alternative diacritizations of any single letter.

The champion among the orthographic systems I'm familiar with is Vietnamese Chữ Quốc ngữ. The Wikipedia article tells us that

The Vietnamese alphabet contains 29 letters, including 7 letters using four diacritics: ⟨ă⟩, ⟨â⟩, ⟨ê⟩, ⟨ô⟩, ⟨ơ⟩, ⟨ư⟩, and ⟨đ⟩. There are an additional 5 diacritics used to designate tone (as in ⟨à⟩, ⟨á⟩, ⟨ả⟩, ⟨ã⟩, and ⟨ạ⟩).

Adding the plain-vowel version, there are then three diacritically-distinguished versions of orthographic a, each of which can also bear any of five tonal diacritics, for a total of 15 diacritically-distinguished alternatives.
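The count can be checked by enumeration. A small Python sketch, assuming the three bases are a, ă, â and the five tone marks are grave, acute, hook above, tilde and dot below (the variable names are illustrative only):

```python
import unicodedata

BASES = ["a", "a\u0306", "a\u0302"]  # a, a-breve, a-circumflex
TONES = ["\u0300", "\u0301", "\u0309", "\u0303", "\u0323"]  # grave, acute, hook above, tilde, dot below

# Compose every base with every tone mark; NFC yields the precomposed letter.
forms = [unicodedata.normalize("NFC", base + tone)
         for base in BASES for tone in TONES]

print(len(forms))
print(" ".join(forms))
```

Each of the 15 combinations comes out as a single precomposed code point. Counting the tonally unmarked a, ă and â as well would add three more.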

Yoruba has two optionally-underdotted vowels (e,o) each of which can be any of three tones, high (acute accent), low (grave accent), and mid (plain). That yields six possible versions of e, and six versions of o.

Wikipedia says that

In addition to the underdots, three further diacritics are used on vowels and syllabic nasal consonants to indicate the language's tones: an acute accent ⟨´⟩ for the high tone, a grave accent ⟨`⟩ for the low tone, and an optional macron ⟨¯⟩ for the middle tone. These are used in addition to the underdots in ⟨ẹ⟩ and ⟨ọ⟩. When more than one tone is used in one syllable, the vowel can either be written once for each tone (for example, *⟨òó⟩ for a vowel [o] with tone rising from low to high) or, more rarely in current usage, combined into a single accent. In this case, a caron ⟨ˇ⟩ is used for the rising tone (so the previous example would be written ⟨ǒ⟩), and a circumflex ⟨ˆ⟩ for the falling tone.

That would raise the number of diacritically-distinguished tonal options to five. The rising and falling tones are the (reliable phonetic) result of the pitch transitions being delayed in low-to-high and high-to-low sequences on adjacent syllables. But I've never seen the caron or the circumflex used in Yoruba orthography, nor the macron for mid tone either; the standard orthography, as far as I know, just uses underdot, grave, and acute diacritics, e.g. here or here.

Pinyin for Mandarin Chinese has four tonal diacritics for vowels, plus no diacritic for "neutral tone", so five possible versions of eligible vowels.

Others?


16 Comments

Jonathan Smith said,

March 3, 2025 @ 9:29 am

wow and the diacritics resemble the target letter shapes where possible in a clever/systematic way, meaning purdy easy to learn to read — and you could use whatever/random base letters to really cipher it up.
Yuval said,

March 3, 2025 @ 9:49 am

If vocalization isn't cheating, then Hebrew Shin/Sin (ש) would have 2 (left/right) x 2 (dagesh or not) x 6 (vowels/schwa) for 24, and with Bible cantillation marks up to x 27 for 648…

And if you love diacritization, and happen upon NAACL later this year, come hear Kyle talk about them in the context of language models.
Jonathan Smith said,

March 3, 2025 @ 9:57 am

N̮OͦT̑H̏IN̑G̠ ŤÓ S̄ḚÉ H̡ER̃Ḛ
Magnus said,

March 3, 2025 @ 11:44 am

The good people over at Explain XKCD have also come up with a table: https://explainxkcd.com/wiki/index.php/3054:_Scream_Cipher
Ross Presser said,

March 3, 2025 @ 11:46 am

The mouseover text "AAAAAA A ÃA̧AȂA̦ ǍÅÂÃĀÁȂ AAAAAAA!" translates to "aaaaaa a s̃A̧ary m̌o̊n̂s̃t̄er aaaaaaa!" Apparently there is at least one misencoding. And how did I end up with diacritics on the output?

Using Mark's "unicode table" and the direct mouseover title text from the xkcd comic.

Try my C# decoder app online!
Ross Presser said,

March 3, 2025 @ 11:50 am

The translators linked from https://explainxkcd.com/wiki/index.php/3054:_Scream_Cipher do a better job than mine.
Andreas Johansson said,

March 3, 2025 @ 3:37 pm

Mandarin surely has ten versions of 'u', with and without the umlaut times five tonal versions?
Victor Mair said,

March 3, 2025 @ 6:34 pm

For the purposes of this exercise, Mandarin "u" and "ü" are not the same letter.

See also here:

https://www.archchinese.com/chinese_pinyin.html
JMGN said,

March 3, 2025 @ 6:05 pm

Invaluable resource:
https://shapecatcher.com/
Peter Cyrus said,

March 4, 2025 @ 5:20 am

According to my Yoruba translator, Yoruba actually petitioned the Unicode Consortium to add the precomposed letters e and o with underdot AND tone mark (acute or grave), but their request was denied. Without the precomposed versions, there are FOUR different encodings of the glyph: dotted vowel+accent, accented vowel+dot, vowel+accent+dot, and vowel+dot+accent. Without normalizing software, those four versions don't match each other when looking things up, like looking up your name in a database, although they are impossible to distinguish visually.
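The four encodings described here can be reproduced with Python's standard unicodedata module. This is only a sketch, using ẹ with an acute accent as the example; the helper name same_text is made up for the illustration.

```python
import unicodedata

ACUTE, DOT_BELOW = "\u0301", "\u0323"

variants = [
    "\u1eb9" + ACUTE,         # dotted vowel + accent
    "\u00e9" + DOT_BELOW,     # accented vowel + dot
    "e" + ACUTE + DOT_BELOW,  # vowel + accent + dot
    "e" + DOT_BELOW + ACUTE,  # vowel + dot + accent
]

# As raw strings the four are all different...
print(len(set(variants)))  # 4

# ...but they share a single normalized form.
normalized = {unicodedata.normalize("NFC", v) for v in variants}
print(len(normalized))  # 1


def same_text(a, b):
    """Compare two strings after canonical normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)
```

A lookup that compares with something like same_text, or that normalizes strings before storing them, treats all four spellings as the same name.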

Yoruba in Benin uses different letters for dotted e and o (ɛ ɔ), so there is less ambiguity. Unicode DOES offer an accented epsilon in the Greek page, but I don't believe it's used in Benin, so they always encode using vowel+accent.

In my fantasy universe where we fix Unicode, I hope they abolish precomposed letters completely (including the sixth of Unicode dedicated to precomposed Hangeul blocks, many of which don't even correspond to words).

I'm sure you all know that the original charter of the IPA included principle #6: "Diacritic marks should be avoided, being trying for the eyes and troublesome to write."
Peter Cyrus said,

March 4, 2025 @ 5:24 am

As an aside, it would be a more successful (secretive) cipher if the base letters spelled a distracting message. It would look like it said "the treasure is in the attic", when it really said "the treasure is in the basement", spelled using the diacritics.
KeithB said,

March 4, 2025 @ 9:17 am

"As an aside, it would be a more successful (secretive) cipher if the base letters spelled a distracting message."

But then you lose the scream cipher pun.
Gokul Madhavan said,

March 4, 2025 @ 12:37 pm

An exhaustive transliteration of Vedic Sanskrit into Roman script would require 18 symbols for each of the a, i, and u classes of vowels: each comes in three lengths and three accents, and can be either nasalized or not. (The hymns of the Sāma Veda, when written down, use the numerals 1–7 to indicate subtle gradations of tone, but at that point I would say we are closer to musical notation than to linguistic transliteration.)

Additionally, if we are being particularly fastidious, Pāṇinian grammar uses theoretical bimoraic and trimoraic schwas in place of the actually realized [aː], so we could potentially add a diacritic to indicate those 12 theoretical sounds as well.
Christian Weisgerber said,

March 4, 2025 @ 1:36 pm

The title is also a pun on the cryptography term "stream cipher".
https://en.wikipedia.org/wiki/Stream_cipher
David Morris said,

March 4, 2025 @ 2:21 pm

I would like to hear a spoken (or even screamed) rendition of this code (not that I would understand it).

Given that the letter A is common to all letters, why not drop the As and just use the diacritics (larger)?
David Marjanović said,

March 9, 2025 @ 5:36 pm

"In my fantasy universe where we fix Unicode, I hope they abolish precomposed letters completely"

The point of precomposed letters is backwards compatibility: every existing encoding can be losslessly transformed into Unicode, as far as I'm aware. So that's not going away.

In other words, Unicode was invented a few decades too late.

"For the purposes of this exercise, Mandarin "u" and "ü" are not the same letter."

For the purposes of this exercise, ü is u with two dots on, regardless of whether it's Mandarin. Compare the Vietnamese and Yorùbá examples above.
