{"id":72850,"date":"2026-02-23T07:09:14","date_gmt":"2026-02-23T12:09:14","guid":{"rendered":"https:\/\/languagelog.ldc.upenn.edu\/nll\/?p=72850"},"modified":"2026-02-23T07:09:14","modified_gmt":"2026-02-23T12:09:14","slug":"written-cantonese-must-have-word-segmentation","status":"publish","type":"post","link":"https:\/\/languagelog.ldc.upenn.edu\/nll\/?p=72850","title":{"rendered":"\"Written Cantonese must have word segmentation\""},"content":{"rendered":"<p>That's the title of an essay that appeared in my e-mail today from an outfit called <a href=\"https:\/\/substack.com\/app?utm_campaign=email-read-in-app&amp;utm_source=email\">Cantonese Script Reform \u7cb5\u5b57\u6539\u9769<\/a>.\u00a0 Here's what they say:<\/p>\r\n<p style=\"padding-left: 40px;\">Written Cantonese must have spaces, like Korean. The calligraphic issue must give way. For the space itself is a grammatical marker that marks the beginning and the end of a word. This tool of demarcation will allow poet and playwright to invent new words by putting words together within the confinements delineated by the spaces between words. Written Cantonese needs all the tools imaginable for it to revitalise and resurrect its lost vocabulary. A Hebrew-esque recycling off ancient words for purposes anew is the way to go. But we can\u2019t do that if we can\u2019t tell if this is a new word because we can\u2019t tell if these characters familiar so and so sequenced are merely a fanciful poetic playful arrangement or other mark of the invention of a new word, where a familiar noun is turned into a verb or verb is turned into an adjective or an adjective is now henceforth interpreted as a noun in this particular context.<\/p>\r\n<p style=\"padding-left: 40px;\"><!--more--><\/p>\r\n<p style=\"padding-left: 40px;\">Written Cantonese must have word segmentation. It\u2019s not just so that future pythonist natural language processing wizards will have an easier time. Word segmentation, is the beginning of grammatical awareness, and therefore of conscious conjugation and word coinage. The absence of word segmentation, is a symptom of a backward written language. The last languages with writing systems with no word segmentation were the first sophisticated languages &#8211; ancient Greek and Latin. Absence of word segmentation is therefore only justifiable if you\u2019re an early civilization, like the Greeks, the Romans &#8211; or the Egyptians or the Sumerians.<\/p>\r\n<p style=\"padding-left: 40px;\">Any modern orthography must do it. The Koreans did it, and the Thais did it &#8211; as late as the 1990s! &#8211; Which is why the full name of Bangkok is a poetic jumbled mess.* Even though the Japanese haven\u2019t yet, how much of us are willing to bet that they won\u2019t eventually? Didn\u2019t they already sort of do it in the early days of digital device manufacturing? If they have all done it, what is the protest of a few literati with heads up their sinoglyphic arses?<\/p>\r\n<p style=\"padding-left: 40px;\">&#8212;&#8211;<\/p>\r\n<p style=\"padding-left: 40px;\">*My next post will be a video of the full name of Bangkok being pronounced, together with a written explanation.<\/p>\r\n<p>I couldn't agree more heartily, and it's something I've been preaching for all Sinitic languages and topolects since I began studying them sixty years ago.\u00a0 There is little doubt that one day it will come to pass even for written Mandarin \/ Putonghua.<\/p>\r\n<p>&nbsp;<\/p>\r\n<p><b>Selected readings<\/b><\/p>\r\n<ul>\r\n<li><a href=\"https:\/\/languagelog.ldc.upenn.edu\/nll\/?cat=228\">Archive for Parsing<\/a><\/li>\r\n<li>\"<a title=\"Permanent link to Parsing of a fated kin tattoo\" href=\"https:\/\/languagelog.ldc.upenn.edu\/nll\/?p=72127\" rel=\"bookmark\">Parsing of a fated kin tattoo<\/a>\" (11\/29\/25)<\/li>\r\n<li>\"<a title=\"Permanent link to Words, morphemes, collocations, characters\" href=\"https:\/\/languagelog.ldc.upenn.edu\/nll\/?p=69737\" rel=\"bookmark\">Words, morphemes, collocations, characters<\/a>\" (7\/3\/25)<\/li>\r\n<li>\"<a title=\"Permanent link to Words in Mandarin: twin kle twin kle lit tle star\" href=\"http:\/\/languagelog.ldc.upenn.edu\/nll\/?p=4129\" rel=\"bookmark\">Words in Mandarin: twin kle twin kle lit tle star<\/a>\" (8\/14\/12)<\/li>\r\n<\/ul>\r\n","protected":false},"excerpt":{"rendered":"<p>That's the title of an essay that appeared in my e-mail today from an outfit called Cantonese Script Reform \u7cb5\u5b57\u6539\u9769.\u00a0 Here's what they say: Written Cantonese must have spaces, like Korean. The calligraphic issue must give way. For the space itself is a grammatical marker that marks the beginning and the end of a word. [&hellip;]<\/p>\n","protected":false},"author":13,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_exactmetrics_skip_tracking":false,"_exactmetrics_sitenote_active":false,"_exactmetrics_sitenote_note":"","_exactmetrics_sitenote_category":0,"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[199,252,228,18],"tags":[],"class_list":["post-72850","post","type-post","status-publish","format-standard","hentry","category-grammar","category-language-reform","category-parsing","category-writing-systems"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/languagelog.ldc.upenn.edu\/nll\/index.php?rest_route=\/wp\/v2\/posts\/72850","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/languagelog.ldc.upenn.edu\/nll\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/languagelog.ldc.upenn.edu\/nll\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/languagelog.ldc.upenn.edu\/nll\/index.php?rest_route=\/wp\/v2\/users\/13"}],"replies":[{"embeddable":true,"href":"https:\/\/languagelog.ldc.upenn.edu\/nll\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=72850"}],"version-history":[{"count":2,"href":"https:\/\/languagelog.ldc.upenn.edu\/nll\/index.php?rest_route=\/wp\/v2\/posts\/72850\/revisions"}],"predecessor-version":[{"id":72854,"href":"https:\/\/languagelog.ldc.upenn.edu\/nll\/index.php?rest_route=\/wp\/v2\/posts\/72850\/revisions\/72854"}],"wp:attachment":[{"href":"https:\/\/languagelog.ldc.upenn.edu\/nll\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=72850"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/languagelog.ldc.upenn.edu\/nll\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=72850"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/languagelog.ldc.upenn.edu\/nll\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=72850"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}