"Written Cantonese must have word segmentation"
That's the title of an essay that appeared in my e-mail today from an outfit called Cantonese Script Reform 粵字改革. Here's what they say:
Written Cantonese must have spaces, like Korean. The calligraphic issue must give way. For the space itself is a grammatical marker that marks the beginning and the end of a word. This tool of demarcation will allow poet and playwright to invent new words by putting words together within the confinements delineated by the spaces between words. Written Cantonese needs all the tools imaginable for it to revitalise and resurrect its lost vocabulary. A Hebrew-esque recycling off ancient words for purposes anew is the way to go. But we can’t do that if we can’t tell if this is a new word because we can’t tell if these characters familiar so and so sequenced are merely a fanciful poetic playful arrangement or other mark of the invention of a new word, where a familiar noun is turned into a verb or verb is turned into an adjective or an adjective is now henceforth interpreted as a noun in this particular context.
Written Cantonese must have word segmentation. It’s not just so that future pythonist natural language processing wizards will have an easier time. Word segmentation, is the beginning of grammatical awareness, and therefore of conscious conjugation and word coinage. The absence of word segmentation, is a symptom of a backward written language. The last languages with writing systems with no word segmentation were the first sophisticated languages – ancient Greek and Latin. Absence of word segmentation is therefore only justifiable if you’re an early civilization, like the Greeks, the Romans – or the Egyptians or the Sumerians.
Any modern orthography must do it. The Koreans did it, and the Thais did it – as late as the 1990s! – Which is why the full name of Bangkok is a poetic jumbled mess.* Even though the Japanese haven’t yet, how much of us are willing to bet that they won’t eventually? Didn’t they already sort of do it in the early days of digital device manufacturing? If they have all done it, what is the protest of a few literati with heads up their sinoglyphic arses?
—–
*My next post will be a video of the full name of Bangkok being pronounced, together with a written explanation.
I couldn't agree more heartily, and it's something I've been preaching for all Sinitic languages and topolects since I began studying them sixty years ago. There is little doubt that one day it will come to pass even for written Mandarin / Putonghua.
Selected readings
- Archive for Parsing
- "Parsing of a fated kin tattoo" (11/29/25)
- "Words, morphemes, collocations, characters" (7/3/25)
- "Words in Mandarin: twin kle twin kle lit tle star" (8/14/12)
Chris Button said,
February 23, 2026 @ 7:31 am
I don't read/write Thai, but I don't think they did.
Burmese doesn't put spaces between written words.
And I don't think any of the closely related scripts in the region do either.
I still remember the first time I heard it. It's impressive. But I fail to see what it has to do with spaces between written words.
KMH said,
February 23, 2026 @ 7:51 am
Spaces in Thai are used as punctuation.
Geoffrey said,
February 23, 2026 @ 8:52 am
I also agree with the need for spaces between words, as well as the broader argument in the Cantonese Script Reform manifesto that Cantonese (a) needs and (b) currently does not have a written language. When I worked at a Western consulate in Guangzhou, this was the most confounding thing to me when I tried to learn about the plight of the Cantonese language. For example, my Cantonese colleagues would explain how there were fewer Cantonese radio stations or TV broadcasts than in the past, but when I would ask about newspapers or books, they insisted that there was no separate written language – they would just read written Mandarin with Cantonese pronunciations. But while it is true that you can pronounce any character of Mandarin in Cantonese, they actually have different vocabularies and grammar. It is a sad state of affairs for a beautiful, melodic language.
Victor Mair said,
February 23, 2026 @ 9:33 am
Amen, Geoffrey!
KMH said,
February 23, 2026 @ 8:55 am
(See some discussion of usage here: http://www.thai-language.com/ref/spacing)
Victor Mair said,
February 23, 2026 @ 9:30 am
The link in KMH's second comment is definitely worth reading. Quite enlightening!
NW said,
February 23, 2026 @ 9:39 am
'Sinoglyphic arses' is good.
Japanese doesn't really need spaces because the hiragana is always a delimiter at the far end. It's not all consecutive sinoglyphs.
Chris Button said,
February 23, 2026 @ 10:37 am
In addition to the hiragana, the particles also clearly break up words in Japanese by signifying their part of speech. I don't think word spacing would make much of a difference.
Burmese also uses particles like Japanese. Unlike Thai, it seems, Burmese does use designated punctuation symbols. But spaces may be used in modern writing too for clause separation, etc. I don't recall if there are defined rules for that though.
Jim Unger said,
February 23, 2026 @ 3:01 pm
Note that in Japanese braille, in which one cannot rely on the visual differences among kanji, hiragana, and katakana, there are conventions for using spaces.
cliff arroyo said,
February 23, 2026 @ 4:55 pm
SE Asian languages like Thai or Vietnamese (and I'm told Khmer) have very… ambiguous word divisions which might be one reason they don't really use word division in writing.
When I was learning Vietnamese(which tends to write by syllable rather than word) I could often understand a sentence perfectly well but if you asked me to make word divisions I would have no idea where to begin (or end).
Chris Button said,
February 23, 2026 @ 9:02 pm
That's interesting about Vietnamese using spaces to divide syllables rather than words.
I suppose the Sanskrit virama to indicate a consonant without its concomitant vowel (and its equivalents in related scripts such as Burmese and Thai) can also help break up many syllables visually.
anon said,
February 23, 2026 @ 9:16 pm
I think there are subtle differences between genuine long words and faux long words.
Genuine long words are found in polysynthetic languages such as Greenlandic, Ainu, Choctaw, Mohawk, Nahuatl, Sora, and Navajo, where many ideas and affixes can be packed into a single word prosody. Faux long words, on the other hand, are more prevalent in languages with their own orthographies, like Turkish, German, Japanese. A Turk can convince someone who doesn't know Turkish that the "longest word" in Turkish has 77 letters, but if that person is fluent in Turkish and linguistics he/she would know that's not a word but a combination of lexical juxtapositions and enclitics that have no evidence to be considered a single phonological unit. Or else.
Jonathan Smith said,
February 23, 2026 @ 10:07 pm
Opposite of what anon said is true for many of the MSEA languages including all of Chinese and Vietnamese — that is, the naive view (often orthography-mediated) is that the syllable is all there is and higher-level word division is arbitrary, but then one… oh looks at a dictionary or something. The Vietnamese spacing situation is down to Chinese influence; folks who insist that e.g. Romanized Taiwanese should be syllable-spaced on the Vietnamese model remain (thankfully) a clueless minority.
dainichi said,
February 23, 2026 @ 10:12 pm
Wow. Seldom have I seen so many non sequiturs crammed together in such a short text.
> This tool of demarcation will allow poet and playwright to invent new words by putting words together within the confinements delineated by the spaces between words.
It's not at all obvious that the "confinement" will benefit invention. Sure, necessity is the mother of invention, but by that logic, let's outlaw all tools but hammers, because people will make such good hammers…
> a familiar noun is turned into a verb or verb is turned into an adjective or an adjective is now henceforth interpreted as a noun in this particular context.
Again, not at all obvious how spaces are necessary for part-of-speech analysis.
> future pythonist natural language processing wizards will have an easier time.
I very much doubt that will be a problem.
> Word segmentation, is the beginning of grammatical awareness
Sure, and outdated spelling is the beginning of etymological awareness and so on. But assuming the purpose of an orthography is communication, I see no argument that this grammatical awareness benefits communication.
> The absence of word segmentation, is a symptom of a backward written language.
I'm simply speechless…
I'd like to state that I don't have a strong preference about spaces in reformed orthographies. I'd like to see rational arguments for and against, but honestly haven't seen many and definitely don't see any here. Anything that smells like "it's good because we advanced cultures in the West have it" just rubs me the wrong way.
Anthony Bruck said,
February 23, 2026 @ 10:30 pm
Sanskrit doesn't have spaces between words as I recall.
Victor Mair said,
February 24, 2026 @ 12:03 am
"I very much doubt that will be a problem."
This says nothing, as do many of your other excathedra assertions.
dainichi said,
February 24, 2026 @ 5:02 am
> This says nothing
OK, I'll try to be clearer. NLP already handles a lot of input without word boundaries quite well, like written Japanese, written Mandarin and lots of spoken languages. If there's some reason that's not the case for some Cantonese orthography, I'd love to hear why.
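To make the segmentation task concrete, here is a minimal sketch of forward maximum matching, a classic dictionary-based baseline for splitting unspaced text into words. It is not a description of how any particular NLP system works; the toy lexicon, the sample string, and the function name `forward_max_match` are all assumptions made up for illustration.

```python
from typing import List, Set


def forward_max_match(text: str, lexicon: Set[str]) -> List[str]:
    """Greedily split `text` by taking the longest lexicon entry at each position.

    Any character that starts no lexicon entry is emitted on its own.
    """
    max_len = max(len(word) for word in lexicon)
    words: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest possible candidate first, then shrink.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # size == 1 is the fallback for unknown characters.
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy lexicon (an assumption for illustration only).
TOY_LEXICON = {"the", "cat", "sat", "on", "mat"}

segmented = forward_max_match("thecatsatonthemat", TOY_LEXICON)
print(" ".join(segmented))
```

A greedy matcher like this can pick the wrong split when lexicon entries overlap, which is one reason practical segmenters tend to rely on statistical or neural models rather than a dictionary alone.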
> excathedra assertions
That's exactly what I thought of the essay! It spews derogatives like "backward" and "only justifiable if you’re an early civilization" without providing any meaningful arguments.
By the way, here's a better link to the essay:
https://cantonesescriptreform.substack.com/p/written-cantonese-must-have-word
~flow said,
February 24, 2026 @ 10:32 am
I was already almost, if not quite, taken over by this fine piece of meticulous argumentation:
> The absence of word segmentation, is a symptom of a backward written language. The last languages with writing systems with no word segmentation were the first sophisticated languages – ancient Greek and Latin. Absence of word segmentation is therefore only justifiable if you’re an early civilization
Undeniably true! What convinced me even more, tho, was the realization that opposition to the forward way of writing will only come from a
> few literati with heads up their sinoglyphic arses
which I certainly do not want to be identified as. You have my support, good sir!
BTW we can tell the Ancient Egyptian scribes did have something akin to an awareness of words because of the way that determinatives usually are placed after the root (and after or before trailing signs that indicate e.g. plural); what's more, hieroglyphs that stand for more than one consonant are not normally applied across what we would call a word boundary (e.g. *mn* can be written with a single biliteral sign, but *m#n* with an intervening boundary can not); this also means their orthography had, not entirely unlike modern written Japanese, subtle hints for where word boundaries were most likely situated.
Jerry Packard said,
February 24, 2026 @ 11:11 am
Interestingly, experiments show that inserting word spaces into Chinese character texts increases reading speed for Chinese L2 learners, but slows down native Chinese readers. My guess is that it slows down the native readers because they are not used to seeing Chinese character texts presented that way.
David Marjanović said,
February 24, 2026 @ 12:05 pm
No, at the very least the German ones are real.
It is true that each component of a German compound retains a stressed syllable that contains an unreduced vowel (unlike in English where e.g. -man and -land are often reduced). However:
1) the whole thing contains exactly one primary stress (all others are secondary);
2) some words are identical as free citation forms and as non-final components of compounds, but most are not. Instead, they get an otherwise meaningless suffix that marks them as prefixes. Historically, most of these suffixes – different words take different ones; a few even take different ones in different compounds – are genitive singular endings, but after a few centuries of scrambling this is no longer the case.
The prefix form of Sonne is Sonnen-. Today, Sonnen is the plural (the entire plural, all four cases). Once upon a time it was everything but the nominative singular, but no longer.
The prefix form of Stern is Sternen-. Today, Sternen is specifically the dative plural. Yet, that is what's used throughout, from the ancient Sternenlicht ("starlight") to the newfangled Sternenstaub ("stardust").
All the words with the suffix -ung are turned into prefixes by adding -s. That's the genitive singular, but the masculine one, even though -ung is feminine and marks its genitive singular on the article only.
cliff arroyo said,
February 25, 2026 @ 5:42 am
"Vietnamese spacing situation is down to Chinese influence"
Partly, yes. French auto as ô tô (though ô-tô and ôtô can sometimes be found) is Chinese influence.
But structurally it's just hard to make word divisions at times. I remember one example (can't remember the exact sentence) where three syllables could have been:
A B C (separate words)
AB C (AB one word C another)
A BC (A one word BC another)
(this being where both AB and BC appeared as lexemes)
All the readings had the same basic meaning and I asked the teacher about word separation and they just shrugged. I ended up thinking it may have been AB plus BC with one B eliminated for redundancy reasons.
That same thing happened… a lot.
I also know a second language learner of Mandarin (very fluent, easily reads newspapers) who was learning Thai and was frustrated at the lack of clearly articulable word boundaries in that language.
Bybo said,
February 25, 2026 @ 11:49 am
@cliff arroyo
Isn't that just syntactic ambiguity? (Written ABC is syntactically ambiguous: it may stand for A BC, AB C, etc. [any combinations that have meanings, probably closely related ones].)
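The A B C / AB C / A BC situation can be sketched as a small enumeration: given a sequence of syllables and a lexicon in which both AB and BC are listed, every split into consecutive lexicon entries is a candidate reading. The lexicon below and the names `segmentations` and `show` are hypothetical, chosen only to mirror the abstract example; this is an illustration of the ambiguity, not a claim about any actual Vietnamese words.

```python
from typing import List, Set, Tuple

Word = Tuple[str, ...]  # a word is a tuple of one or more syllables


def segmentations(syllables: List[str], lexicon: Set[Word]) -> List[List[Word]]:
    """Return every way to split `syllables` into consecutive lexicon entries."""
    if not syllables:
        return [[]]  # exactly one way to segment nothing
    results: List[List[Word]] = []
    for end in range(1, len(syllables) + 1):
        head = tuple(syllables[:end])
        if head in lexicon:
            # Combine this first word with every segmentation of the remainder.
            for rest in segmentations(syllables[end:], lexicon):
                results.append([head] + rest)
    return results


def show(segmentation: List[Word]) -> str:
    """Render a segmentation with syllables joined inside words, spaces between."""
    return " ".join("".join(word) for word in segmentation)


# Hypothetical lexicon: A, B, C are words, and so are AB and BC.
LEXICON: Set[Word] = {("A",), ("B",), ("C",), ("A", "B"), ("B", "C")}

readings = [show(s) for s in segmentations(["A", "B", "C"], LEXICON)]
print(readings)
```

With this lexicon the sketch yields three candidate spacings, and nothing in the syllable string itself chooses between them; a writer (or a spacing convention) would have to.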
번하드 said,
February 25, 2026 @ 2:44 pm
Of course this made me curious about how and when Korean acquired its spaces, so I did a search or two and it was worth it:
https://youtube.com/watch?v=NUNzBcznoPY
cliff arroyo said,
February 26, 2026 @ 1:53 am
"Isn't that just syntactic ambiguity? "
It's that and a lot more, lots of things that seem kind of like clitics or kind of like affixes and elements that can break up what seem like lexical words…
Start learning Vietnamese and somewhere in the second year you'll start to see what I mean.
It doesn't affect the functioning of the language or its comprehensibility but it does help make the choice to write by syllable rather than by word seem grounded in more than just Chinese influence (though that plays a role too).
Jonathan Smith said,
February 26, 2026 @ 1:45 pm
Re: Vietnamese and typologically similar languages, you wouldn't want to be in the position of arguing, as happens in e.g. Chinese fantasy world, that the monosyllable is basic and that there is no meaningful "word"-type level featuring units of 1,2,3+… syllables. This is like saying modern dictionaries, where most entries are 2+ syllables, just collect useful but syntactically transparent collocations.
There is much gray area when it comes to deciding what constitutes a lexical item. But the situation is European-ish, not Indigenous Americas-ish (we're only "sure" about the tons of borderline cases in e.g. English because of associated orthographical conventions.)
Re: "things that seem kind of like clitics or kind of like affixes and elements that can break up what seem like lexical words", these (monosyllabic) things should be listed in a lexicon, and the "breaking up" part is common to languages of this type (and not too different from phenomena found in English.) E.g. Chinese languages have two-part verbs of "V+Object" and "V.+Complement" types that can be broken up in syntax, often (e.g. in Taiwanese) in a very rich variety of ways. But still lexicon lists what is a lexeme. Tw. kiann5 is listed as 'walk; leave', loo7 is listed as 'road', kiann5-loo7 is listed as 'walk' (specify e.g. "generic/intransitive" or give usage examples if you like), and then when you get to e.g. kiann5-bo5-loo7 'be unfamiliar with the area / unable to walk anywhere' you have a decision to make about whether or not to list — and (as you would expect) writers over the years have made different choices in such cases about whether to use hyphens (as typical for multisyllabic words in Tw. orthography) or spaces (thus "different words").
Chris Button said,
February 26, 2026 @ 4:10 pm
Since an alphabet destroys the intuitive notion of the syllable as the basic building block of language, it's probably no wonder that alphabets developed spacing while other scripts never saw the need.
Bybo said,
February 27, 2026 @ 5:18 am
Okay, I admit that I have no idea what's special about syntactic ambiguity in Vietnamese. In comparison to written English, I mean, where phrases like 'big fat Greek wedding' or 'little girls' school' can also be segmented in different ways, yielding different (often similar) meanings. I don't understand why it should be unthinkable for Vietnamese writers to decide what nuance they want and group syllables to words accordingly. (But if there's no pressure from readers, why should writers bother?)
PS: I'm not sure about an intuitive notion of the syllable. Children are taught syllable division in school, after all.
cliff arroyo said,
March 1, 2026 @ 2:29 am
"no idea what's special about syntactic ambiguity in Vietnamese"
Less syntactic and more segmental… a sequence can be a noun phrase but it's just not clear where the word boundaries are (if there are any).
And there's a graphic element as well since many consonants can either end or begin syllables so writing syllables together makes that harder to decode. Traditionally this was solved with dashes but those make an already cluttered orthography even more cluttered.
I think ultimately, except for monomorphemic loans like ô tô (probably better as ôtô), writing separate syllables is probably optimal for Vietnamese.
Chas Belov said,
March 4, 2026 @ 1:59 am
Apparently, given a fortuitous enough character string, word segmentation is not required for English. Today I passed a business signed PERFORMFORGOLF (perform for golf).
Bybo said,
March 4, 2026 @ 4:45 pm
>[…] a fortuitous enough character string […]
like any piece of Lojban text (assuming stress is indicated) :^)
Chris Button said,
March 4, 2026 @ 5:58 pm
And there are disagreements about precisely where to break them.
But the notion of a basic syllabic rhythm does not need to be taught.
Alabaster Au-Yeung said,
March 16, 2026 @ 10:29 am
Least deranged and obnoxious ~~HK Nationalist~~ Cantonese language advocate (the most deranged and obnoxious one being ŋɔ˩˧).
I will eventually actually engage with this as a serious linguistic proposal (to be clear, this is and always has been a deeply Unserious linguistic project), but I think it's important to make it clear that this project is best viewed as like… in the realm of a conlang writing system for the author's fantasy setting of a world where HK and their idea of a 粵 cultural area is or will become a Cantonese Civilization with its own Cantonese Philosophy (not just philosophical works written in Cantonese (very cool!) but a whole new philosophical tradition based on… HK triad lingo… https://x.com/CantoneseScript/status/1993695378781204506), culturally differentiated from other Sinitic-speaking territories and united by a common literary standard based on HK Cantonese.
To be clear, I would very much love to see a vibrant literature in Cantonese and all the other linguistically underprivileged Sinitic languages, but this is a deeply unserious proposal for how to go about actually making that happen (more on this later) and if you read the author's social media (e.g. https://x.com/CantoneseScript/status/1825091227647156677) it becomes embarrassingly clear that this whole project is more or less just an expression of a common set of HK nationalist neuroses that most people native to the territory have with respect to the state on the far shore.
To prove I do actually read languagelog and I'm not just coming around to do a drive-by on this one post, let me present what's probably the most, and in all likelihood, only insightful piece of writing ever posted here (ok I know Barbara Partee and like Jason Merchant have posted before here so I take that back): https://languagelog.ldc.upenn.edu/nll/?p=72325#comment-1636909
Specifically this part:
> What that tells me is that the reason so many people felt the urgency for language reform had very little to do with the Chinese language, but a lot to do with national power, and they were obsessed with the language because the bookish intellectuals were incapable of facilitating changes in other, arguably more direct and more relevant fields, like military reform, industrial reform, or even agricultural reform. The language reform craze was merely a symptom of the inadequacies of the overall modernization and reform movement. A few – arguably less intellectual – people dared to do more than diddling with the language, and forced real change: people like Sun Yat-sen.
I… don't really even have to provide any commentary on this I feel, cause it's just transparently what's going on here, just repeated on the even more farcical and yet ever more tragic stage that is HK nationalism. Like all "good" nationalisms, this one is very concerned with the perceived Strength and Purity of its national language (https://x.com/CantoneseScript/status/1967384620908888282), and like all good 18th century ideological movements that belong in the dustbin of history, subscribes to some bizarre and frankly hilarious pseudoscientific ideas, like the "strength" of a language increasing as it's used to discuss philosophy or whatever.
There's a trend among younger generations of HK people to see themselves as more or less, temporarily embarrassed Japanese people, a people who were denied the right to be Japanese, which is to say, somewhere within the Sinosphere, at least historically, but now at a comfortable distance from it and China. (This also describes Taiwan, HK's 老友 in uncomfortably, tragically, despite everything, including how good their beef noodle soup or whatever might be, still being Chinese ).
Anyway, in my opinion, Cantonese is in a quite bizarre position among Sinitic languages in being… actually pretty well served by existing Chinese characters. For most colloquial lexemes (e.g. 揈), there generally just is a way to write it that isn't a purely phonetic approximation and people are actually familiar with a decent number, though not all of these (obnoxious quibbling about what the Real 正字 for each lexeme is, a complaint I do actually share with the author). The biggest flaws to actually representing Cantonese as a language are basically the lack of standard characters representing all the tonal variants of the SFPs, which are a hugely important part of the grammar of Cantonese, and stuff like ni1 and le1 both being written 呢. It's little details like this, but otherwise, you can already Just Write Cantonese. People do it online all the time.
If you actually read about the author's proposed script, like apparently none of the credulous commenters on this site and VM himself have, you'll see that Jyutcitzi (the proposed system) is basically just Fanqie with a standard set of initials and rhymes, and optional tone marking, where every syllable is written in a single compressed Fanqie block, sort of like Hangul. That's it. That's the whole proposal, the linguistic and orthographic engine that's supposed to usher in the thousand year Cantonese Reich or whatever.
Anyway.
Freedom and life to all the languages and people of the world. Death to all, and I do mean All, nationalisms, including and especially the deeply embarrassing and neurotic scourge that is HK nationalism.
Alabaster Au-Yeung said,
March 16, 2026 @ 3:03 pm
Ok I swear I hadn't even scrolled this far down their social media when I typed up my last comment but, umm, lmao, your champion, folks:
https://x.com/CantoneseScript/status/2000980390341996592
https://x.com/CantoneseScript/status/2001025143066485033
https://x.com/CantoneseScript/status/2023644974768890198
"languagelog stop platforming the most deranged and conspiratorial people on the internet just cause they posted something with like, a hint of the aroma of relevance to Chinese linguistics and that could be vaguely interpreted as critical of the Chinese state" challenge 2k26 IMPOSSIBLE