Pinyin spam text message

From David Moser:

Just got this spam text, all in pinyin, to avoid spam detectors. The usual spam offering fake certificates and chops, plus their Weixin contact. What's novel is the tone markings, don't see that very often.

The corresponding characters would be:

bànzhèng 办证 ("accreditation; [we] handle / process certificates")

kèzhāng 刻章 ("[we] carve seals / stamps / chops")

wēixìn 微信 ("WeChat") (a mobile text and voice messaging communication app)

In China, you need certificates and seals to process countless transactions.  Consequently, advertisements for such fake documentation (consisting of 办证, 刻章, and a cell phone contact number) are ubiquitous.  You see them plastered all over the place:  on sidewalks, on footbridges, on buildings….  In some ways, they are comparable to graffiti in other parts of the world, but they are neither creative nor artistic, and they certainly are not political (that could get you put away for a very long time).

Nowadays, when practically everybody has a cell phone and is addicted to messaging services, the makers of fake certificates and fake seals have naturally moved their business onto that platform as well.  Of course, it didn't take long for the mobile phone service providers to figure out ways to block much of this spam, but now the fake documentation purveyors have figured out a way to circumvent the spam filters:  pinyin!  Since the filters are set up to catch 办证 and 刻章 (i.e., Chinese characters describing their services) the fakers can avoid detection.  Or perhaps I should say "could" avoid detection and blocking.

Let me explain.  The use of pinyin to get around the internet censors in China is common.  See, for example:

The widespread use of pinyin to circumvent internet censors and spam filters attests to two things:

  1. the bulk of the population is familiar with romanization
  2. the censors and filterers have not yet figured out a way to block supposedly subversive / pornographic / spam / etc. content written in pinyin

Now, when the Chinese internet police and mobile service providers get hip to pinyin, that will be proof positive that romanization has really made it in China.  Indeed, I wonder whether they are already starting to catch on, and that this is the reason the faker in this instance has added tones (as David pointed out above, generally they omit tones), since the latest filters might be able to catch "ban zheng" and "ke zhang", but not "bàn zhèng" and "kè zhāng".  Or maybe they just wanted to be precise or were being pedantic.

Let's see how this might work in a specific instance.  大宪章 ("Magna Carta") is currently being censored in Chinese web searches, but, if you enter Dà xiànzhāng  or Dà xiàn zhāng or da xianzhang or da xian zhang, etc., you probably wouldn't get caught — at least not yet.
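To make concrete why a filter keyed to "ban zheng" could miss "bàn zhèng", here is a minimal Python sketch of the normalization step a pinyin-aware filter would need. It is an illustration only, not a description of any real filter: the names `normalize_pinyin` and `is_blocked` and the two-entry `BLOCKLIST` are hypothetical, and only the standard `unicodedata` module is used. The idea is to fold toned, toneless, spaced, and unspaced spellings to a single key before matching.

```python
import unicodedata

# Hypothetical blocklist of toneless, spaceless pinyin keys (an assumption
# for illustration; real filter lists are not given in this post).
BLOCKLIST = {"banzheng", "kezhang"}


def normalize_pinyin(text: str) -> str:
    """Fold pinyin to a lowercase key with no tone marks and no spaces."""
    # NFD splits "à" into "a" plus a combining grave accent.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop every combining mark, which removes the tone diacritics.
    # Simplification: this also folds "ü" to "u".
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Lowercase and remove all whitespace.
    return "".join(stripped.lower().split())


def is_blocked(message: str) -> bool:
    """Return True if any blocklisted key occurs in the normalized message."""
    key = normalize_pinyin(message)
    return any(term in key for term in BLOCKLIST)


if __name__ == "__main__":
    variants = ["Dà xiànzhāng", "Dà xiàn zhāng", "da xianzhang", "da xian zhang"]
    print({normalize_pinyin(v) for v in variants})  # a single key: {'daxianzhang'}
    print(is_blocked("bàn zhèng, kè zhāng"))        # True
```

A filter that compares raw strings treats each of the four variants as a different keyword; after normalization they are one. Substring matching on a spaceless key is deliberately crude and would over-match in practice, so this shows the principle rather than a usable filter.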

In any event, the resort to pinyin on the part of actual spammers and alleged subverters complements well the recent post on "Chinese character inputting" (10/17/15), where I showed that the use of pinyin as the primary method for that purpose is approaching 100%.


  1. Ken Miner said,

    October 18, 2015 @ 7:34 pm

    What's novel is the tone markings, don't see that very often.

    As I've observed before (maybe not here), you can't get people to write tone marks, whatever the (tone) language. No one seems to know why.

  2. Victor Mair said,

    October 18, 2015 @ 8:20 pm

    @Ken Miner

    Good point!

    In some ways, it's like getting people to mark stress in non-tonal languages. Stress, by the way, is of various types, and can be quite complicated, so it's not easy for someone untrained in phonology to mark it accurately.

    As for romanization and tones in Mandarin, I think that teachers heavily emphasize the basic spelling and tend to gloss over the tones. Also, regional variation makes it difficult to achieve consistency, even within Mandarin, where tonal configurations vary greatly.

    I have hundreds of older friends from all over China who speak Mandarin with a wide variety of regional accents. If they're over about 50 and not trained in linguistics, few of them can transcribe their speech in romanization, and they have nary a clue about how to indicate tones. Those under 45 or so who received a proper elementary and secondary education usually can write things down in romanization, and they're even pretty good with tones, because they've been taught that in schools.

    When it comes to a language like Cantonese, for which there is no formal education in the schools, I've met very few native speakers (for the most part other than professional linguists or language teachers) who feel comfortable transcribing their speech in romanization, and many of them are uncertain about how many tones are in their language, much less how to mark them in a systematic way.

  3. michael farris said,

    October 19, 2015 @ 12:54 am

    "As I've observed before (maybe not here), you can't get people to write tone marks, whatever the (tone) language. No one seems to know why."

    In my experience Vietnamese speakers are very conscientious about writing tones and do not willingly leave them out. They leave them out at times when it's too hard to produce them (on someone else's computer without the right fonts for example) but they prefer to write them.

    Before unicode became the internet default I remember seeing Vietnamese language discussion boards where they adopted various ASCII workarounds for writing the tones.

    So Vietnamese at least is a very big counter example to the idea that speakers of tone languages tend to leave tone marking out.

    Perhaps because the tones are treated as part of the inherent spelling? (traditional spelling out loud includes the tones).

  4. Ken Miner said,

    October 19, 2015 @ 1:55 am

    @ michael farris Sounds like a valid counter-example to me. My experience was mainly with African and Amerindian lgs, where the relative novelty of the writing systems may be a factor.

  5. Matt_M said,

    October 19, 2015 @ 2:14 am

    Thai spelling, as well, fully indicates tones (although in a more indirect fashion than Vietnamese spelling). Even when Kids These Days use playful respellings on their Facebook posts (in the manner of "nite" for "night", for example), the new spellings accurately indicate tone — often more accurately (in terms of representing colloquial speech) than the standard spelling does.

  6. Bob Ladd said,

    October 19, 2015 @ 2:27 am

    I think the omission of tone marks is a specific case of a more general tendency to omit diacritics of any sort. Case in point: Romanian. Romanian has a very consistent surface-phonemic standard orthography that includes five letters with diacritics: â, î, ă, ş, and ţ. Before Ceauşescu was overthrown, there were essentially no unregulated typewriters or computers, and the printing presses used the standard orthography consistently, with all the diacritics. After 1989, as word processing rapidly became available, it became normal to omit the diacritics, because including them was seriously inconvenient (and still is, as composing this comment has made me aware). As a result, even quite a lot of fairly official stuff appears without diacritics. About four years ago I did an informal survey of Romanian university websites and found that only about a third of them used the diacritics consistently; most of the other two-thirds omitted them completely.

    Incidentally, I have been told by a usually reliable source that Vietnamese diacritics (some of which indicate vowel quality rather than tone) are often omitted. I don't know whether Michael Farris or my source is closer to correct.

  7. John Swindle said,

    October 19, 2015 @ 3:02 am

    Chinese writing, whether in characters or pinyin, normally represents or pretends to represent Mandarin, so let's compare Vietnamese to Mandarin. Vietnamese has more tones; has a tone for every syllable (no "neutral" tone); and has more vowels, many of them distinguished in writing by diacritics. All these factors make Vietnamese tone marks and other diacritics more important for reading comprehension than Hanyu Pinyin tone marks.

  8. Matt_M said,

    October 19, 2015 @ 4:09 am

    @Bob Ladd: interesting point about the inconvenience of typing diacritics for Roman-based alphabets. On a standard Thai keyboard, on the other hand, the two most common tone diacritics are typed by pressing the keys corresponding to "h" and "j" — right in the middle of the home row, and no messing around with the Shift key or anything like that. So I guess keyboard layout might have consequences for how people actually end up spelling words.

  9. flow said,

    October 19, 2015 @ 4:51 am

    I think what the discussion so far points at is that (1) alphabetic writing, for all its benefits and simplicity, is not easy to learn if you didn't start early on; (2) to put it with T.A. Edison, alphabetic writing is 90% orthography and 10% sound-writing (probably exaggerated for most languages, but maybe true for e.g. English and Thai); (3) when faced with a decision between a correct, but hard way and an easy, but still intelligible and acceptable alternative, people tend to take that short cut. When on a US keyboard doing informal stuff I don't think twice about replacing German umlauts ä, ö, ü with the more accessible ae, oe, ue and leaving out all French accents entirely. I don't know about Vietnamese but if omitting all those extra dots makes the writing unacceptable and/or opaque, I'd rather find a way to put that data back in. (4) Everybody (except for a proportion of native speakers, it would seem) gets easily overwhelmed by diacritics in foreign language orthographies. Newspapers and even serious, dedicated books simply omit all of them.

    In my experience, people have a relatively hard time analyzing their own speech sounds into clearly demarcated and categorized units on a conscious level. For 2nd L learners of, say, Mandarin or Cantonese, it may come as a surprise that native speakers are at a loss to tell how many tones their native language has, but then it is not so simple (e.g. Mandarin has four or five tones, or alternatively more, according to how you count). And quick: how many vowels does your native tongue have? Few people will be able to answer that question.

  10. michael farris said,

    October 19, 2015 @ 4:53 am

    "All these factors make Vietnamese tone marks and other diacritics more important for reading comprehension than Hanyu Pinyin tone marks."

    I remember getting some emails in Vietnamese without any diacritics (computer problems on the sending end) and being able to read them easily enough (carefully choosing vocabulary and sentence structure for a non-native user was probably a factor too).

    I think the biggest difference is that Vietnamese spelling is not a transcription of any particular spoken variety but instead is a compromise that is meant to be decoded by speakers of different dialects. In this way it's more like the orthographies of European languages like German or Italian than pinyin.
    Very roughly speaking, its initials correspond better to southern pronunciation and its finals to northern pronunciation. In practice this means most speakers have to learn to make written distinctions that don't correspond with their speech, whether distinguishing initial d- and r- (both /z/ in the north) or final -an -ang (both /aŋ/ in the south). This might make the tone markers seem more like part of the spelling (along with the vowel quality diacritics).

  11. Victor Mair said,

    October 19, 2015 @ 5:29 am

    With the advent of word processing on computers, I have been noticing that my German friends are far more lax about substituting -e for the umlaut (the impact of this change extends even into handwriting) and friends in other languages that use them omitting the tilde (˜) and cedilla (¸as on "c" –> ç). I haven't been tracking what's happening with the caron or háček (ˇ).

    I think the reason for these substitutions or omissions of diacritical marks may be due to what Bob Ladd says about Romanian typewriters versus word processing.

  12. Victor Mair said,

    October 19, 2015 @ 5:46 am

    Here's another use for pinyin:

    Despite all of the hyphens and with no word separation, I can read that quickly and easily — even with the mistake of "-fu-wai-chang-" for "-fu-wai-zhang-". The absence of tones has no impact on me whatsoever: I supply them automatically, just as I supply stress automatically when reading unmarked English texts or as a Russian speaker does when reading unmarked texts (the latter also automatically makes the necessary vowel changes without their being marked).

  13. John Swindle said,

    October 19, 2015 @ 5:54 am

    @michael farris: Yes, that may be it.

  14. michael farris said,

    October 19, 2015 @ 5:55 am

    I remember before a trip to Poland (back after the fall of communism but before the internet proper) my Polish teacher was receiving a two or three page daily email newsletter from Poland in ASCII which was no problem for native speakers like her to understand. She gave me a few with the assignment of filling in the missing diacritics which was pretty useful in several different ways.

    At that time I was using a mac that did not deal well with diacritics (you needed to switch fonts for letters like ż or ą or ł and then switch back to the other font because too many other things were different for fast word processing).

    In the mid 90s when I was already living full time in Poland and started using email I omitted the diacritics when writing in Polish because it was too much trouble. But sometime between 10 and 15 years ago the Polish programmer keyboard became widely available which was faster than the traditional Polish keyboard (which switched the places of y and z along with other weirdness) and now I always use them; it would feel weird to leave them out.

    I remember in the early internet before unicode when different fonts needed to be installed for different diacritics and different languages were adapting at different rates. I remember the double umlauts of Hungarian were very much a hit or miss affair (often the o or u with tilde or circumflex was used instead) until some time after 2000.

    Romanian seemed to lag further and longer than most other languages of the region (possibly due to an orthographic reform which didn't affect the repertoire of diacritics but did require some attention on its own).

  15. Victor Mair said,

    October 19, 2015 @ 6:03 am

    From Jason Q. Ng:

    Should also mention that when we examined the censorship lists for various Chinese chat applications that we reverse-engineered, dozens of the keywords were merely pinyin versions of sensitive keywords, e.g. "molihua", "renquanjiang". So censors are aware of the usage of pinyin for getting around restrictions. You can see more in this CSV I previously compiled:

  16. John said,

    October 19, 2015 @ 6:08 am

    Were you perhaps primed by the context in which you saw this link, allowing you to read it quickly and easily? I got only as far as 中國極端分子 (and even that took some effort before I realized jiduanfenzi is one word–I kept reading jiduan as 幾段!) Of course I then quickly fell into the trap of parsing huiliu as 匯流, and assuming that this was going to be about digital streaming somehow. After that there was really no saving my comprehension of this headline!

    But I really do think that, without context or word segmentation, it would be very difficult to read xu as a monosyllable representing Syria. I had even considered 徐, but it never occurred to me.

  17. Ken Miner said,

    October 19, 2015 @ 6:58 am

    One of the first things I noticed as a kid (and which got me interested in languages) was that English and Dutch got along just about entirely without diacritics, yet the Latin alphabet is pretty inadequate for all the languages of Europe, and English especially pays a high price for its freedom from diacritics.

    Does anybody know of any conscious attention to the issue of diacritics in the history of European writing? English and Dutch certainly have an advantage now, in the computer age, but of course this could not have been foreseen.

  18. Victor Mair said,

    October 19, 2015 @ 7:17 am


    I was not primed by any context. I read it absolutely cold.

  19. Victor Mair said,

    October 19, 2015 @ 11:34 am

    China Digital Times has documented instances of pinyin words and phrases being blocked from Weibo search results. You can find them in their sensitive words spreadsheet:

  20. Michael Watts said,

    October 19, 2015 @ 3:40 pm

    The advantage in computer input of English doesn't come from a lack of diacritics. It comes from computers being developed by English speakers.

  21. Victor Mair said,

    October 19, 2015 @ 4:45 pm

    From Michele Thompson:

    My Vietnamese friends never leave off tones, even on Facebook postings. Of course most of them are adults and many are academics. I have no idea what teenagers, for example, might do.

    It is also true that because tone marks are only part of the diacritics for a given vowel they are much more an inherent part of the spelling in Vietnamese than is true for pinyin. In other words a tone mark is not something you add after you have spelled the word.

  22. Victor Mair said,

    October 19, 2015 @ 8:10 pm

    From Steve O'Harrow:

    Just a brief note from inside a Vietnamese family: before it was possible to add diacritics on-email, we just wrote VNese without marks and, because all the folks in the correspondence were either native speakers, or fairly high functioning gringos like myself, we understood nearly everything, especially given the context of each utterance.

    To this day, because I have a "dumb phone," not a "smart phone," I tend to text to my VNese TA (from Quang Ninh) in a no-tones-no-special-vowels way and she understands everything I send her. She, on the other hand, being from VN itself and therefore a whole lot more tech-savvy than I and with an iPhone (naturlich!) replies with full diacritics, which register as little empty boxes on my el-cheap-o phone, but I can usually guess what she wants. In other words, most native speakers can read and write diacritic-less VNese and get away with it. But don't tell our students this, please.

  23. Victor Mair said,

    October 19, 2015 @ 8:13 pm

    Is it easier / simpler / faster to input with or without diacritical marks — if you can get by without them and still achieve the same results?

  24. michael farris said,

    October 20, 2015 @ 1:46 am

    "Is it easier / simpler / faster to input with or without diacritical marks — if you can get by without them and still achieve the same results?"

    It depends partly on whether you have keyboards that are designed to make adding diacritics easy and how well a person types. I learned how to touch type many years ago so in English I can type relatively rapidly.

    And a point that isn't necessarily recognized enough. Languages with lots of diacritics can be read without them as long as the diacritics appear most of the time. I really don't think Vietnamese without diacritics could be a functional system, though; since people are conditioned to use them they often sort of see them even when they're not there.

    In my old 'add in the diacritics' exercise in Polish one of the problems was that if I read for content I mentally added in the diacritics which is a very different proposition from a hypothetical Polish without diacritics at all.

  25. michael farris said,

    October 20, 2015 @ 1:54 am

    Also, I used the telex system for writing in Vietnamese and the input takes getting used to, you type ngwowfi vieejt to get người việt (Vietnamese person) but once I learned the system I could type very fast.

    Pinyin designers could look at telex for a way to type with tones very quickly.
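The Telex keystrokes quoted in the comment above can be decoded with a handful of rules. What follows is a rough Python sketch of a simplified subset of Telex, not a full input method: the names `decode_syllable` and `telex_to_vietnamese` are hypothetical, and the sketch assumes lowercase input with each tone key typed directly after the vowel it marks (real Telex implementations also accept the tone key at the end of the syllable, handle capitals, and provide escapes).

```python
import unicodedata


def nfc(s: str) -> str:
    """Compose base letters and combining marks into single code points."""
    return unicodedata.normalize("NFC", s)


# Letter-shape rules: a repeated or following key modifies the previous letter.
SHAPES = {
    ("a", "a"): nfc("â"), ("a", "w"): nfc("ă"), ("e", "e"): nfc("ê"),
    ("o", "o"): nfc("ô"), ("o", "w"): nfc("ơ"), ("u", "w"): nfc("ư"),
    ("d", "d"): nfc("đ"),
}
# Tone keys mapped to Unicode combining marks:
# acute, grave, hook above, tilde, dot below.
TONES = {"s": "\u0301", "f": "\u0300", "r": "\u0309", "x": "\u0303", "j": "\u0323"}
VOWELS = set(nfc("aăâeêioôơuưy"))


def decode_syllable(keys: str) -> str:
    """Decode one Telex-typed syllable (simplified rules, see lead-in)."""
    out = []
    for ch in keys:
        prev = out[-1] if out else ""
        if (prev, ch) in SHAPES:
            out[-1] = SHAPES[(prev, ch)]      # e.g. "e" + "e" -> "ê"
        elif ch == "w":
            out.append(nfc("ư"))              # a bare "w" stands for "ư"
        elif ch in TONES and prev in VOWELS:
            out[-1] = nfc(prev + TONES[ch])   # add the tone mark to the vowel
        else:
            out.append(ch)
    return "".join(out)


def telex_to_vietnamese(text: str) -> str:
    """Decode a space-separated sequence of Telex-typed syllables."""
    return " ".join(decode_syllable(word) for word in text.split())


if __name__ == "__main__":
    print(telex_to_vietnamese("ngwowfi vieejt"))  # người việt
```

Every keystroke stays on the plain ASCII letters, which is why the scheme can be typed quickly once learned. A pinyin analogue would need only four tone keys, although the choice of keys would have to avoid letters that occur in pinyin syllables.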

  26. Adrian said,

    October 20, 2015 @ 5:45 am

    My (Hungarian) wife absolutely does not want to leave off the diacritics when typing in Hungarian, partly because it's alien to her to do so and partly because it can make the text ugly or difficult to understand, but living in the UK makes it very awkward to apply them. Looking at the posts from her Hungarian friends on Facebook, I'd say about 30% are in a similar position. Sometimes I think what happens is that Hungarian diacritics are not available on a particular platform or piece of hardware, the writer gets used to leaving them off, and then continues to do so on other platforms. It'd be good to see some studies into the effect this will have on particular languages. In theory, it's hard to see Hungarian losing its marks, since (for example) ő and o are such different sounds, but that doesn't mean it's impossible, and it's probably more likely for certain sounds and certain languages (and may cause changes in frequency, pronunciation and spelling). You can see how it's more likely for Pinyin since it's an orthography that's learned as an adjunct.

  27. michael farris said,

    October 20, 2015 @ 6:44 am

    "but living in the UK makes it very awkward to apply them."

    It should be easy to download a Hungarian keyboard on any computer she uses often.

    I have several keyboards downloaded that I don't often use but I like having them.

  28. Victor Mair said,

    October 20, 2015 @ 7:30 am

    When I read something like this — zhong-guo-ji-duan-fen-zi-zai-xu-can-zhan-xu-fu-wai-chang-zai-hui-liu-qian-xiao-mie (see comment above) — I am definitely *not* mentally adding diacritics, but if I pronounce it out loud I would automatically (without thinking about it) read each syllable with the correct tones.

    Of course, I'd much prefer that all those distracting hyphens weren't there and that the syllables would be linked up into words like this:

    Zhongguo jiduanfenzi zai Xu canzhan
    Xu fuwaizhang: zai huiliu qian xiaomie.

    That would make it really easy to read, with correct tones no less.

  29. Eidolon said,

    October 20, 2015 @ 11:52 am

    @Michael Watts the standard "English" keyboard certainly fits your story, but in general, I think efficiency of electronic input favors orthographic systems with smaller character sets, simply because of spatial limitations with regards to your hands and, in the case of touch screens, the size of the screen. This is why developing an efficient electronic input system for the Chinese writing system, independent of any phonetics, has been challenging. Methods such as wubi do have efficiency, to a degree, but at the cost of requiring the typist to memorize a new way for decomposing characters. Even there, however, the same principle applies whereby you want to make your keyboard character set small, but not too small, so as to balance between the physical limitations of the human hand and the quantity of key presses per character.

    At times, I wonder what the most efficient *human language* for electronic input would be. I'd think such a language would have a limited quantity of phonemes, and few phonemes per word, which would nonetheless be highly expressive and easily understood when combined together. It'd probably be highly contextual.

  30. Keith said,

    October 21, 2015 @ 5:56 am

    I frequently encounter texts without diacritics and it used to really annoy me that educated, literate people would refuse to put in the fifteen minutes necessary to learn how to install the correct keyboard map and learn how to, for example, compose e + ` to get è and e + ' to get é.

    (Hmm… that second example seems to work oddly in the preview pane of this comment: I don't see the difference between the "back-tick" and "apostrophe" that I typed. In fact, if I want to force the symbol to really look like an apostrophe, I need to hit the compose key and then the apostrophe key twice, so that LL's comment system really displays the correct ´ symbol.)

    I've lost count of the number of discussions I've commented in about keyboard maps for Android phones, Windows XP and Windows 7, and Linux… it seems like hardly a month goes by without this problem coming up again and some participants claiming "it's too difficult to type [insert language here] with a US keyboard".

    Reading French is much more difficult for me when the accents are left out. In addition to abbreviations such as a.m. not being clear (a French speaker living in the USA could mean ante meridiem or après-midi), I now have to figure out whether "achete" should be "achète" or "acheté" from the context, which may already be muddled by abbreviations or skipped words.

    I make a conscious effort to try to get all diacritics in the correct places, and this is perhaps a side-effect from me trying to improve my own knowledge of foreign languages; I can accept that a native speaker probably does not have this motivation. And if I am learning mostly from written sources, I really, really want to have the clues that diacritics provide so that when I use recently learnt words orally, I can pronounce them correctly (such as saying vor'bitsʲ when I have seen it written as "vorbiți", and not pronouncing it as vor'biti because the text was written without the "ţ").
