Scream cipher


A recent xkcd:

Mouseover title: "AAAAAA A ÃA̧AȂA̦ ǍÅÂÃĀÁȂ AAAAAAA!"

It's easy to see that the exchange depicted in the cartoon is "hello" and "hi". I'll leave it to the readers to decode the mouseover title…

I amused myself on a recent plane trip by trying to turn the strip's letter-correspondence list into a Unicode table:

a A
b Å
c A̡
d A̱
e Á
f A̮
g A̋
h A̰
i Ả
j A̓
k Ạ
l Ă
m Ǎ
n Â
o Å
p A̯
q A̤
r Ȃ
s Ã
t Ā
u Ä
v À
w Ȁ
x A̽
y A̦
z Ⱥ

Some diacritics are similar enough, and Unicode is confusing enough, that I'm not sure I got exactly what Randall Munroe intended. But anyhow, such a table makes it easy to write encoding and decoding programs, so that (according to my table) "Language Log" (after mono-casing) becomes

ĂAÂA̋ÄAA̋Á ĂÅA̋
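For readers who want to try this, here is a minimal Python sketch of such an encoder and decoder. The names (MARKS, encode, decode) are mine, and the combining-mark code points are my best reading of the glyphs in the table above, so treat them as assumptions rather than as what the comic intends. In particular, b and o look identical as printed, so the sketch assumes a dot above (U+0307) for b to keep decoding unambiguous.

```python
import unicodedata

# Combining mark for each letter. ASSUMPTION: these code points are a reading
# of the glyphs in the table; "b" is assumed to be dot above (U+0307).
MARKS = {
    "a": "",       "b": "\u0307", "c": "\u0321", "d": "\u0331", "e": "\u0301",
    "f": "\u032E", "g": "\u030B", "h": "\u0330", "i": "\u0309", "j": "\u0313",
    "k": "\u0323", "l": "\u0306", "m": "\u030C", "n": "\u0302", "o": "\u030A",
    "p": "\u032F", "q": "\u0324", "r": "\u0311", "s": "\u0303", "t": "\u0304",
    "u": "\u0308", "v": "\u0300", "w": "\u030F", "x": "\u033D", "y": "\u0326",
}

# Keys of DECODE are kept in decomposed form: "A" plus at most one mark.
ENCODE = {letter: "A" + mark for letter, mark in MARKS.items()}
ENCODE["z"] = "\u023A"  # A with stroke: a single letter with no decomposition
DECODE = {cipher: letter for letter, cipher in ENCODE.items()}


def encode(text):
    """Lower-case the text, then replace each letter a-z with its scream form."""
    out = "".join(ENCODE.get(ch, ch) for ch in text.lower())
    # NFC gives precomposed characters where Unicode has them.
    return unicodedata.normalize("NFC", out)


def decode(text):
    """Split into base character plus combining marks, then look each up."""
    text = unicodedata.normalize("NFD", text)
    out = []
    i = 0
    while i < len(text):
        j = i + 1
        # Attach any following combining marks to the current base character.
        while j < len(text) and unicodedata.combining(text[j]):
            j += 1
        cluster = text[i:j]
        # Unknown clusters (spaces, punctuation, unlisted marks) pass through.
        out.append(DECODE.get(cluster, cluster))
        i = j
    return "".join(out)


print(encode("Language Log"))
print(decode(encode("Language Log")))
```

Normalizing to NFD before decoding means that precomposed and decomposed spellings of the same letter decode alike. A cluster whose mark is not in the table is left untouched, which makes any mismatch between the table and the comic easy to spot in the output.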

This silly exercise led me to wonder which language's conventional orthography has the most alternative diacritizations of any single letter.

The champion among the orthographic systems I'm familiar with is Vietnamese Chữ Quốc ngữ. The Wikipedia article tells us that

The Vietnamese alphabet contains 29 letters, including 7 letters using four diacritics: ⟨ă⟩, ⟨â⟩, ⟨ê⟩, ⟨ô⟩, ⟨ơ⟩, ⟨ư⟩, and ⟨đ⟩. There are an additional 5 diacritics used to designate tone (as in ⟨à⟩, ⟨á⟩, ⟨ả⟩, ⟨ã⟩, and ⟨ạ⟩).

Adding the plain-vowel version, there are then three diacritically-distinguished versions of orthographic a (a, ă, â), each of which can also bear any of five tonal diacritics, for a total of 15 diacritically-distinguished alternatives.
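The count is easy to check by enumeration. The following is a small Python sketch, with variable names of my own choosing, that attaches each of the five tone marks to each of the three versions of a and lets Unicode normalization compose the result. The three forms with no tone mark are deliberately left out.

```python
import unicodedata

# The three versions of orthographic a: plain, with breve, with circumflex.
BASES = ["a", "\u0103", "\u00E2"]

# The five tonal diacritics, as combining marks.
TONE_MARKS = {
    "grave": "\u0300",
    "acute": "\u0301",
    "hook above": "\u0309",
    "tilde": "\u0303",
    "dot below": "\u0323",
}

# NFC composes each base-plus-mark sequence into one precomposed character.
forms = [
    unicodedata.normalize("NFC", base + mark)
    for base in BASES
    for mark in TONE_MARKS.values()
]

print(len(forms), " ".join(forms))
```

Each of the resulting forms has its own precomposed code point in Unicode, so every entry in the list is a single character.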

Yoruba has two optionally-underdotted vowels (e,o) each of which can be any of three tones, high (acute accent), low (grave accent), and mid (plain). That yields six possible versions of e, and six versions of o.

Wikipedia says that

In addition to the underdots, three further diacritics are used on vowels and syllabic nasal consonants to indicate the language's tones: an acute accent ⟨´⟩ for the high tone, a grave accent ⟨`⟩ for the low tone, and an optional macron ⟨¯⟩ for the middle tone. These are used in addition to the underdots in ⟨ẹ⟩ and ⟨ọ⟩. When more than one tone is used in one syllable, the vowel can either be written once for each tone (for example, *⟨òó⟩ for a vowel [o] with tone rising from low to high) or, more rarely in current usage, combined into a single accent. In this case, a caron ⟨ˇ⟩ is used for the rising tone (so the previous example would be written ⟨ǒ⟩), and a circumflex ⟨ˆ⟩ for the falling tone.

That would raise the number of diacritically-distinguished tonal options to five. The rising and falling tones are the (reliable phonetic) result of the pitch transitions being delayed in low-to-high and high-to-low sequences on adjacent syllables. But I've never seen the caron or the circumflex used in Yoruba orthography, nor the macron for mid tone either; the standard orthography, as far as I know, just uses underdot, grave, and acute diacritics, e.g. here or here.

Pinyin for Mandarin Chinese has four tonal diacritics for vowels, plus no diacritic for "neutral tone", so five possible versions of eligible vowels.

Others?

 



6 Comments »

  1. Jonathan Smith said,

    March 3, 2025 @ 9:29 am

    wow and the diacritics resemble the target letter shapes where possible in a clever/systematic way, meaning purdy easy to learn to read — and you could use whatever/random base letters to really cipher it up.

  2. Yuval said,

    March 3, 2025 @ 9:49 am

    If vocalization isn't cheating, then Hebrew Shin/Sin (ש) would have 2 (left/right) x 2 (dagesh or not) x 6 (vowels/schwa) for 24, and with Bible cantillation marks up to x 27 for 648…

    And if you love diacritization, and happen upon NAACL later this year, come hear Kyle talk about them in the context of language models.

  3. Jonathan Smith said,

    March 3, 2025 @ 9:57 am

    N̮OͦT̑H̏IN̑G̠ ŤÓ S̄ḚÉ H̡ER̃Ḛ

  4. Magnus said,

    March 3, 2025 @ 11:44 am

    The good people over at Explain XKCD have also come up with a table: https://explainxkcd.com/wiki/index.php/3054:_Scream_Cipher

  5. Ross Presser said,

    March 3, 2025 @ 11:46 am

    The mouseover text "AAAAAA A ÃA̧AȂA̦ ǍÅÂÃĀÁȂ AAAAAAA!" translates to "aaaaaa a s̃A̧ary m̌o̊n̂s̃t̄er aaaaaaa!" Apparently there is at least one misencoding. And how did I end up with diacritics on the output?

    Using Mark's "unicode table" and the direct mouseover title text from the xkcd comic.

    Try my C# decoder app online!

  6. Ross Presser said,

    March 3, 2025 @ 11:50 am

    The translators linked from https://explainxkcd.com/wiki/index.php/3054:_Scream_Cipher do a better job than mine.
