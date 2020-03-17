

From Molly Des Jardin:

In the midst of our stressful times, I'm writing to share a distraction that is somehow still relevant. Given the kind of things you have noted on Language Log historically, I wondered if you observed this hashtag:

#COVIDー19

You will probably notice immediately that it contains a full-width dash, in other words a Unicode (probably Chinese-origin?) character. For some reason, this is all over Twitter in posts from Anglophone people I am almost completely sure have no input method installed that can actually produce it. But I was recommended this hashtag by Twitter itself when I went to post about how odd it is (which I won't share here because I now have a pseudonymous account, but it won't add much to this email anyway). I was able to choose it from a suggested list when I typed COVID without having to switch to Japanese or other input, which of course I have on my phone — but I am a pretty niche demographic here. Anyway, I'm guessing this is exactly how it spread throughout social media (and I don't have Facebook to check if it's the case there too, but who knows).

I saw someone posting last week about their irritation at wrangling with this hashtag, namely because this dash doesn't register as punctuation on whatever system they were using to analyze them (if I remember correctly; regardless, it caused some problem that made them look at the dash more closely and realize it is not what you'd expect in English). The researcher, who clearly had not worked with non-Western languages before, was doing some text analysis involving hashtags (probably using them to harvest tweets to then analyze). There were numerous replies, not all of them accurate, about what this dash is, both intellectually and as a piece of text in the computer. I am not going so far as to say there is much meaning in this, but it's yet another odd artifact of Unicode computing that I notice because I worked with (fought with?) CJK text for so long.

It's not a real dash at all but a "Katakana-Hiragana prolonged sound mark":

U+30FC

ー

(source)
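The remark in the email that this mark "doesn't register as punctuation" can be checked with Python's standard `unicodedata` module. The sketch below is only an illustration: the helper names and the simple `#\w+` pattern are assumptions made for this example, and the pattern is not Twitter's actual hashtag tokenizer. It compares the Unicode general category of an ordinary hyphen-minus with that of U+30FC and shows how a naive word-character pattern treats each spelling.

```python
import re
import unicodedata

HYPHEN_MINUS = "-"               # U+002D, the ordinary ASCII hyphen
PROLONGED_SOUND_MARK = "\u30fc"  # U+30FC, the character in the hashtag

# Hypothetical, deliberately naive hashtag pattern: '#' followed by
# "word" characters. Real platforms use more elaborate rules.
NAIVE_HASHTAG = re.compile(r"#\w+")


def is_dash_punctuation(ch):
    """True if Unicode classifies ch as dash punctuation (category Pd)."""
    return unicodedata.category(ch) == "Pd"


def naive_hashtags(text):
    """Return the hashtags the naive pattern finds in text."""
    return NAIVE_HASHTAG.findall(text)


print(unicodedata.category(HYPHEN_MINUS))          # Pd (dash punctuation)
print(unicodedata.category(PROLONGED_SOUND_MARK))  # Lm (modifier letter)
print(naive_hashtags("#COVID-19"))                 # the hyphen ends the match
print(naive_hashtags("#COVID\u30fc19"))            # the mark counts as a letter
```

Under these assumptions, the hyphenated spelling is cut off after "COVID", while the spelling with U+30FC survives as a single token, because the mark is classified as a letter rather than as punctuation.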

Here are the Unicode designations for each element of the hashtag under discussion:

#COVIDー19

| Character | Code point | Unicode name |
| # | 0x0023 | NUMBER SIGN |
| C | 0x0043 | LATIN CAPITAL LETTER C |
| O | 0x004F | LATIN CAPITAL LETTER O |
| V | 0x0056 | LATIN CAPITAL LETTER V |
| I | 0x0049 | LATIN CAPITAL LETTER I |
| D | 0x0044 | LATIN CAPITAL LETTER D |
| ー | 0x30FC | KATAKANA-HIRAGANA PROLONGED SOUND MARK |
| 1 | 0x0031 | DIGIT ONE |
| 9 | 0x0039 | DIGIT NINE |
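A listing like the one above can be reproduced for any string with Python's standard `unicodedata` module. This is a minimal sketch; the function name `describe` is made up for the example.

```python
import unicodedata


def describe(text):
    """Return (character, code point, Unicode name) for each character."""
    return [(ch, f"0x{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


# The hashtag under discussion, with U+30FC written as an escape so the
# character cannot be silently replaced by a look-alike when copied.
hashtag = "#COVID\u30fc19"

for ch, code_point, name in describe(hashtag):
    print(f"|{ch}| {code_point} \"{name}\"")
```

Running the same function on a copied hashtag is a quick way to see whether it contains the hyphen-minus (0x002D) or the prolonged sound mark (0x30FC).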

So the question is: how do we account for the Katakana-Hiragana prolonged sound mark in a hashtag that is causing much consternation on Twitter, given that the mark could not have been produced with an Anglophone IME (or, probably, with IMEs for other Western orthographies)?

Side-note: Like Facebook, Twitter is illegal for Chinese citizens, though many privileged persons and government officials have accounts.

[Thanks to Mark Liberman and Mark Swofford]

