Hashtag of note


From Molly Des Jardin:

In the midst of our stressful times, I'm writing to share a distraction that is somehow still relevant. Given the kind of things you have noted on Language Log historically, I wondered if you observed this hashtag:

 #COVIDー19

You will probably notice immediately that it contains a full-width dash, in other words a Unicode (probably Chinese-origin?) character. For some reason, this is all over Twitter in posts from Anglophone people I am almost completely sure have no input method installed that can actually produce it. But I was recommended this hashtag by Twitter itself when I went to post about how odd it is (which I won't share here because I now have a pseudonymous account, but it won't add much to this email anyway). I was able to choose it from a suggested list when I typed COVID without having to switch to Japanese or other input, which of course I have on my phone — but I am a pretty niche demographic here. Anyway, I'm guessing this is exactly how it spread throughout social media (and I don't have Facebook to check if it's the case there too, but who knows).

Last week I saw someone posting about their irritation at wrangling this hashtag, because the dash doesn't register as punctuation on whatever system they were using to analyze tweets (if I remember correctly; in any case it caused some problem that made them look at the dash more closely and realize it is not what you'd expect in English). The researcher, who clearly had not worked with non-Western languages before, was doing some text analysis involving hashtags (probably using them to harvest tweets for analysis). There were numerous replies, not all of them accurate, about what this dash is both intellectually and as a piece of text in the computer. I am not going so far as to say there is much meaning in this, but it's yet another odd artifact of Unicode computing that I notice because I worked with (fought with?) CJK text for so long.

It's not a real dash at all but a "Katakana-Hiragana prolonged sound mark":

U+30FC

ー

(source)

Here are the Unicode designations for each element of the hashtag under discussion:

#COVIDー19

|#| 0x0023 "NUMBER SIGN"

|C| 0x0043 "LATIN CAPITAL LETTER C"

|O| 0x004F "LATIN CAPITAL LETTER O"

|V| 0x0056 "LATIN CAPITAL LETTER V"

|I| 0x0049 "LATIN CAPITAL LETTER I"

|D| 0x0044 "LATIN CAPITAL LETTER D"

|ー| 0x30FC "KATAKANA-HIRAGANA PROLONGED SOUND MARK"

|1| 0x0031 "DIGIT ONE"

|9| 0x0039 "DIGIT NINE"
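
The breakdown above can be reproduced with Python's standard `unicodedata` module. This is only a minimal sketch: the helper name `describe` is made up for illustration, and the hashtag is written with an explicit `\u30fc` escape so that the prolonged sound mark cannot be silently swapped for a look-alike dash.

```python
import unicodedata


def describe(text):
    """Return (character, code point, Unicode name) for each character in text."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


# \u30fc is KATAKANA-HIRAGANA PROLONGED SOUND MARK, not a dash.
hashtag = "#COVID\u30fc19"

for ch, code_point, name in describe(hashtag):
    print(f"|{ch}| {code_point} {name}")
```

Running the same helper over a hashtag copied from a tweet is a quick way to see which of the many dash-like characters it actually contains.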

So, the question is, how do we account for this use of the Katakana-Hiragana prolonged sound mark in the hashtag of tweets that is causing much consternation on Twitter and could not have been produced with an Anglophone IME (and probably other Western orthographical IMEs as well)?

Side-note:  Like Facebook, Twitter is illegal for Chinese citizens, though many privileged persons and government officials have accounts.

[Thanks to Mark Liberman and Mark Swofford]



23 Comments

  1. ke said,

    March 17, 2020 @ 2:08 am

    Dashes (ASCII minus, en-dash, em-dash…) can’t be part of hashtags on Twitter, so I guess that’s why people are using a character that looks similar but can be.

  2. Philip Taylor said,

    March 17, 2020 @ 2:18 am

    In Windows 7, the sequence can be created very simply by typing "#COVID", ALT-numeric-"+", "30FC", then releasing the ALT key and typing "19". It works fine in (e.g.,) Notepad, from which the following was copied and pasted : #COVIDー19, but if typed directly into (e.g.,) this browser (Seamonkey), the "F" of "30FC" launches the "File" drop-down menu; other browsers may vary.

  3. Chris Button said,

    March 17, 2020 @ 2:24 am

    It's not that different from an "em dash" though

  4. Ross Presser said,

    March 17, 2020 @ 2:27 am

    I'm going to take a wild guess that someone (who had good Unicode knowledge) used the character deliberately because it is not considered punctuation and they wanted their hashtag to remain a unit. Once it's out there, it replicates with very little friction as you saw.

  5. Ross Presser said,

    March 17, 2020 @ 2:28 am

    For what it's worth, my Windows 10 machine can't produce this character with the ALT-PLUS mechanism in Notepad. If I were tasked with generating this character, I'd reach right for charmap.exe on Windows 10.

  6. Yuval said,

    March 17, 2020 @ 3:06 am

    I was sure this is an em-dash, which is relatively easily input by many mobile devices, e.g. Gboard where you just press the hyphen button for a second and it pops up. This is the same method by which you get every non-English latin character (accents, umlauts, etc.) and all other non-shift-number symbols, so I'd wager many people are familiar with it.

  7. Twill said,

    March 17, 2020 @ 3:31 am

    @Chris Button Graphically, no, but functionally it is starkly different: a fundamental part of the katakana orthography used to represent vowel length, which is generously used in loanword transcriptions that represent much of that script's usage. It also isn't encoded as a dash so there's no potential coding issues that dashes can cause.

  8. Mark Liberman said,

    March 17, 2020 @ 3:48 am

    A technique for accessing this character that works on any computer is cut-and-paste from the Wikipedia entries for Katakana or Choonpu.

    And by the way, this character

    |ー| 0x30FC "KATAKANA-HIRAGANA PROLONGED SOUND MARK"

    is different from all the many other Unicode dash-like entities, including (among others):

    |-| 0x002D "HYPHEN-MINUS"
    |–| 0x2013 "EN DASH"
    |—| 0x2014 "EM DASH"
    |‒| 0x2012 "FIGURE DASH"
    |―| 0x2015 "HORIZONTAL BAR"
    |‐| 0x2010 "HYPHEN"
    |᠆| 0x1806 "MONGOLIAN TODO SOFT HYPHEN"
    |‐| 0x2011 "NON-BREAKING HYPHEN"
    |一| 0x4E00 "CJK UNIFIED IDEOGRAPH-4E00"
    | | 0x1680 "OGHAM SPACE MARK"
    |➖| 0x2796 "HEAVY MINUS SIGN"
    |─| 0x2500 "BOX DRAWINGS LIGHT HORIZONTAL"
    |⏤| 0x23E4 "STRAIGHTNESS"

    There's also 0x10191 "ROMAN UNCIA SIGN", which mysteriously erases the last few lines in this comment if I try to actually insert it…

  9. Philip Taylor said,

    March 17, 2020 @ 4:03 am

    I don't have any Windows 10 machines, but I tried under Windows Server 2019 (which uses the Windows 10 codebase) and found that I needed to add a registry entry, then log out and log in again. The required registry entry is HKCU\Control Panel\Input Method\EnableHexNumpad, type REG_SZ, value 1.

  10. mg said,

    March 17, 2020 @ 5:38 am

    I was originally using #COVIDー19 when it was offered, but otherwise using #COVID19. Have switched to only the latter, just for the sake of consistency, but visually the #COVIDー19 is closer to the correct name, COVID-19. I wondered why the dash was so long – thanks for explaining.

  11. Chris Button said,

    March 17, 2020 @ 7:39 am

    Interesting that they use hiragana in the name KATAKANA-HIRAGANA PROLONGED SOUND MARK. I suppose it is occasionally encountered in hiragana too, but hardly enough to warrant inclusion in the name given its atypical use.

  12. Garrett Wollman said,

    March 17, 2020 @ 8:22 am

    The relevant feature of U+30FC is that it has the "letter" property. The regular expression Twitter uses to identify the end of a hashtag allows letters, digits, non-spacing modifiers (e.g., diacritics), and underscores, but not punctuation: it's assumed that punctuation marks are being used to delimit hashtags rather than mentioned in them. I believe (I haven't scoured the tables) that U+30FC is the only character with a dash-like graphical representation that is formally a letter.

  13. Garrett Wollman said,

    March 17, 2020 @ 8:24 am

    @myl: U+10191 ROMAN UNCIA SIGN shows up like this if I paste it from another window:

    Since it's an astral-plane character, support for it may be limited on some platforms. (We'll see if this form even submits properly.)

  14. Michael Watts said,

    March 18, 2020 @ 4:12 am

    Some other Unicode dash-like characters that suggest themselves are |－| U+FF0D (the actual full-width dash) and |ㄧ| U+3127, the zhuyin character derived from |一| (U+4E00, already mentioned).

  15. Michael Watts said,

    March 18, 2020 @ 4:18 am

    U+3127 and U+4E00 are both letters. Zhuyin is relatively obscure, but anyone who knows enough Japanese to think of a vowel length marker also knows enough to think of the character for "one". I guess the kana character is slightly shorter? Or perhaps it's more likely to appear as a dash even in fancy serifed fonts?

  16. Garrett Wollman said,

    March 18, 2020 @ 9:31 am

    Michael Watts: no reason to believe that the originator did anything other than try different dash-like graphemes in their character picker until they found one that the Twitter app highlighted as part of the hashtag.

  17. Twill said,

    March 18, 2020 @ 3:28 pm

    @Chris Button I would say the chouonpu is used more than rarely in hiragana, if only to represent non-phonemic lengthening (あーんして, へー, えーと). In these cases I suppose it's interchangeable with the wave dash ~.

    Of course the main distinguishing feature of the chouonpu is that it is vertical in vertical writing, not that that bears much on this particular usage.

  18. Michael Watts said,

    March 18, 2020 @ 5:13 pm

    no reason to believe that the originator did anything other than try different dash-like graphemes in their character picker until they found one that the Twitter app highlighted as part of the hashtag.

    I think there is. Character pickers do not generally work by showing you every option that looks visually similar to something. They work by showing you code pages, or by showing you a curated selection based on a preselected language. I see no reason to believe that someone with no knowledge of Japanese would have been able to find this character in a character picker in the first place.

  19. Chris Button said,

    March 19, 2020 @ 2:57 am

    Do any specialists in the evolution of kana anyone know the history of why the prolonged sound mark was adopted primarily just for katakana but not by hiragana? In a weird way its like someone is saying that hiragana counts moras but katakana counts syllables (which of course isn't the case and we do still have the "n" coda after all)!?

  20. Chris Button said,

    March 19, 2020 @ 2:58 am

    Apologies for the typos!

  21. Twill said,

    March 19, 2020 @ 11:16 pm

    @Chris Button There's no special reason why the hiragana and katakana scripts should be used as they are (with katakana used mainly to represent [non-Sinitic] loanwords and inflection and particles being exclusively the realm of hiragana) or that the chouonpu should be used with said loanwords but not (as was the case in the 1900 orthographic reform) with Sino-Japanese lexemes, other than that is how they came to be standardized, and the use of chouonpu is to my knowledge more or less an accident itself of those.

  22. Chris Button said,

    March 20, 2020 @ 4:08 pm

    Some interesting info from Japanese Wikipedia regarding earlier use of the prolonged sound mark:

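    The long vowel mark (長音符) is said to have originated as a way of writing foreign languages; Edo-period Confucian scholars, among others, used it, but it became common only in the Meiji era. One theory holds that it was taken from the right-hand component (旁, tsukuri) of the character 引 in 引く音 ("drawn-out sound").

    In 1900 (Meiji 33), the Enforcement Regulations of the Elementary School Order stipulated that elementary school textbooks use "bar-drawing" kana orthography (棒引き仮名遣い). This wrote the long vowels of Sino-Japanese readings and interjections with 「ー」, so that 「校長」 became 「こーちょー」, 「ああ」 became 「あー」, and 「いいえ」 became 「いーえ」. However, it was abolished by a Ministry of Education ordinance in 1908 (Meiji 41).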

  23. Chris Button said,

    March 25, 2020 @ 6:46 pm

    For anyone still reading:

    https://languagelog.ldc.upenn.edu/nll/?p=46527#comment-1572430
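
The point raised in comments 12 and 15 about the "letter" property can be checked against the Unicode general category of each candidate character. The sketch below uses Python's `unicodedata` module. The pattern `#\w+` is only a simplified stand-in for a hashtag matcher, not Twitter's actual regular expression, and the names `is_letter` and `extract_hashtag` are hypothetical helpers introduced for illustration.

```python
import re
import unicodedata

# A few of the dash-like characters mentioned in the comments.
CANDIDATES = ["\u002d", "\u2013", "\u2014", "\uff0d", "\u30fc", "\u4e00", "\u3127"]


def is_letter(ch):
    """True if the character's general category is one of the letter classes (L*)."""
    return unicodedata.category(ch).startswith("L")


# Simplified assumption: a hashtag is '#' followed by word characters.
# Python's \w covers Unicode letters and digits plus underscore, so it
# stops at punctuation such as dashes.
HASHTAG = re.compile(r"#\w+")


def extract_hashtag(text):
    """Return the first hashtag-like run in text, or None if there is none."""
    match = HASHTAG.search(text)
    return match.group(0) if match else None


for ch in CANDIDATES:
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)} {unicodedata.name(ch)}")

print(extract_hashtag("#COVID-19"))        # cut short at the hyphen-minus
print(extract_hashtag("#COVID\u30fc19"))   # kept whole
```

In Python's Unicode tables the true dashes come out as category Pd (dash punctuation), U+30FC as Lm (modifier letter), and U+4E00 and U+3127 as Lo (other letter), so under this simplified pattern only the letter characters keep the hashtag in one piece.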
