"Written Cantonese must have word segmentation"


That's the title of an essay that appeared in my e-mail today from an outfit called Cantonese Script Reform 粵字改革.  Here's what they say:

Written Cantonese must have spaces, like Korean. The calligraphic issue must give way. For the space itself is a grammatical marker that marks the beginning and the end of a word. This tool of demarcation will allow poet and playwright to invent new words by putting words together within the confinements delineated by the spaces between words. Written Cantonese needs all the tools imaginable for it to revitalise and resurrect its lost vocabulary. A Hebrew-esque recycling of ancient words for purposes anew is the way to go. But we can’t do that if we can’t tell if this is a new word because we can’t tell if these characters familiar so and so sequenced are merely a fanciful poetic playful arrangement or other mark of the invention of a new word, where a familiar noun is turned into a verb or verb is turned into an adjective or an adjective is now henceforth interpreted as a noun in this particular context.

Written Cantonese must have word segmentation. It’s not just so that future pythonist natural language processing wizards will have an easier time. Word segmentation, is the beginning of grammatical awareness, and therefore of conscious conjugation and word coinage. The absence of word segmentation, is a symptom of a backward written language. The last languages with writing systems with no word segmentation were the first sophisticated languages – ancient Greek and Latin. Absence of word segmentation is therefore only justifiable if you’re an early civilization, like the Greeks, the Romans – or the Egyptians or the Sumerians.

Any modern orthography must do it. The Koreans did it, and the Thais did it – as late as the 1990s! – Which is why the full name of Bangkok is a poetic jumbled mess.* Even though the Japanese haven’t yet, how much of us are willing to bet that they won’t eventually? Didn’t they already sort of do it in the early days of digital device manufacturing? If they have all done it, what is the protest of a few literati with heads up their sinoglyphic arses?

—–

*My next post will be a video of the full name of Bangkok being pronounced, together with a written explanation.

I couldn't agree more heartily, and it's something I've been preaching for all Sinitic languages and topolects since I began studying them sixty years ago.  There is little doubt that one day it will come to pass even for written Mandarin / Putonghua.

 


17 Comments »

  1. Chris Button said,

    February 23, 2026 @ 7:31 am

    and the Thais did it

    I don't read/write Thai, but I don't think they did.

    Burmese doesn't put spaces between written words.

    And I don't think any of the closely related scripts in the region do either.

    the full name of Bangkok

    I still remember the first time I heard it. It's impressive. But I fail to see what it has to do with spaces between written words.

  2. KMH said,

    February 23, 2026 @ 7:51 am

    Spaces in Thai are used as punctuation.

  3. Geoffrey said,

    February 23, 2026 @ 8:52 am

    I also agree with the need for spaces between words, as well as the broader argument in the Cantonese Script Reform manifesto that Cantonese (a) needs and (b) currently does not have a written language. When I worked at a Western consulate in Guangzhou, this was the most confounding thing to me when I tried to learn about the plight of the Cantonese language. For example, my Cantonese colleagues would explain how there were fewer Cantonese radio stations or TV broadcasts than in the past, but when I would ask about newspapers or books, they insisted that there was no separate written language – they would just read written Mandarin with Cantonese pronunciations. But while it is true that you can pronounce any character of Mandarin in Cantonese, they actually have different vocabularies and grammar. It is a sad state of affairs for a beautiful, melodic language.

  4. Victor Mair said,

    February 23, 2026 @ 9:33 am

    Amen, Geoffrey!

  5. KMH said,

    February 23, 2026 @ 8:55 am

    (See some discussion of usage here: http://www.thai-language.com/ref/spacing)

  6. Victor Mair said,

    February 23, 2026 @ 9:30 am

    The link in KMH's second comment is definitely worth reading. Quite enlightening!

  7. NW said,

    February 23, 2026 @ 9:39 am

    'Sinoglyphic arses' is good.

    Japanese doesn't really need spaces because the hiragana is always a delimiter at the far end. It's not all consecutive sinoglyphs.

  8. Chris Button said,

    February 23, 2026 @ 10:37 am

    In addition to the hiragana, the particles also clearly break up words in Japanese by signifying their part of speech. I don't think word spacing would make much of a difference.

    Burmese also uses particles, like Japanese. Unlike Thai, it seems, Burmese does use designated punctuation symbols. But spaces may be used in modern writing too, for clause separation, etc. I don't recall whether there are defined rules for that, though.

  9. Jim Unger said,

    February 23, 2026 @ 3:01 pm

    Note that in Japanese braille, in which one cannot rely on the visual differences among kanji, hiragana, and katakana, there are conventions for using spaces.

  10. cliff arroyo said,

    February 23, 2026 @ 4:55 pm

    SE Asian languages like Thai or Vietnamese (and I'm told Khmer) have very… ambiguous word divisions which might be one reason they don't really use word division in writing.

    When I was learning Vietnamese (which tends to write by syllable rather than word) I could often understand a sentence perfectly well, but if you asked me to make word divisions I would have no idea where to begin (or end).

  11. Chris Button said,

    February 23, 2026 @ 9:02 pm

    That's interesting about Vietnamese using spaces to divide syllables rather than words.

    I suppose the Sanskrit virama to indicate a consonant without its concomitant vowel (and its equivalents in related scripts such as Burmese and Thai) can also help break up many syllables visually.

  12. anon said,

    February 23, 2026 @ 9:16 pm

    I think there are subtle differences between genuine long words and faux long words.
    Genuine long words are found in, e.g., Greenlandic, Ainu, Choctaw, Mohawk, Nahuatl, Sora, Navajo, i.e. polysynthetic languages, where many ideas and affixes can be packed into a single word prosody. Faux long words, on the other hand, are more prevalent in languages with their own orthographies, like Turkish, German, Japanese. A Turk can convince someone who doesn't know Turkish that the "longest word" in Turkish has 77 letters, but if that person is fluent in Turkish and linguistics he/she would know that's not a word but a combination of lexical juxtapositions and enclitics that have no evidence to be considered a single phonological unit. Or else.

  13. Jonathan Smith said,

    February 23, 2026 @ 10:07 pm

    The opposite of what anon said is true for many of the MSEA languages including all of Chinese and Vietnamese: that is, the naive view (often orthography-mediated) is that the syllable is all there is and higher-level word division is arbitrary, but then one… oh looks at a dictionary or something. The Vietnamese spacing situation is down to Chinese influence; folks who insist that e.g. Romanized Taiwanese should be syllable-spaced on the Vietnamese model remain (thankfully) a clueless minority.

  14. dainichi said,

    February 23, 2026 @ 10:12 pm

    Wow. Seldom have I seen so many non sequiturs crammed together in such a short text.

    > This tool of demarcation will allow poet and playwright to invent new words by putting words together within the confinements delineated by the spaces between words.

    It's not at all obvious that the "confinement" will benefit invention. Sure, necessity is the mother of invention, but by that logic, let's outlaw all tools but hammers, because people will make such good hammers…

    > a familiar noun is turned into a verb or verb is turned into an adjective or an adjective is now henceforth interpreted as a noun in this particular context.

    Again, not at all obvious how spaces are necessary for part-of-speech analysis.

    > future pythonist natural language processing wizards will have an easier time.

    I very much doubt that will be a problem.

    > Word segmentation, is the beginning of grammatical awareness

    Sure, and outdated spelling is the beginning of etymological awareness and so on. But assuming the purpose of an orthography is communication, I see no argument that this grammatical awareness benefits communication.

    > The absence of word segmentation, is a symptom of a backward written language.

    I'm simply speechless…

    I'd like to state that I don't have a strong preference about spaces in reformed orthographies. I'd like to see rational arguments for and against, but honestly haven't seen many and definitely don't see any here. Anything that smells like "it's good because we advanced cultures in the West have it" just rubs me the wrong way.

  15. Anthony Bruck said,

    February 23, 2026 @ 10:30 pm

    Sanskrit doesn't have spaces between words as I recall.

  16. Victor Mair said,

    February 24, 2026 @ 12:03 am

    "I very much doubt that will be a problem."

    This says nothing, as do many of your other ex cathedra assertions.

  17. dainichi said,

    February 24, 2026 @ 5:02 am

    > This says nothing

    OK, I'll try to be clearer. NLP already handles a lot of input without word boundaries quite well, like written Japanese, written Mandarin and lots of spoken languages. If there's some reason that's not the case for some Cantonese orthography, I'd love to hear why.

    > ex cathedra assertions

    That's exactly what I thought of the essay! It spews derogatives like "backward" and "only justifiable if you’re an early civilization" without providing any meaningful arguments.

    By the way, here's a better link to the essay:
    https://cantonesescriptreform.substack.com/p/written-cantonese-must-have-word
