ROT-LLM?
July 28, 2023
There's a puzzling new proposal for watermarking AI-generated text — Alistair Croll, "To Watermark AI, It Needs Its Own Alphabet", Wired 7/27/2023:
We need a way to distinguish things made by humans from things made by algorithms, and we need it very soon. […]
Fortunately, we have a solution waiting in plain sight. […]
If the companies who pledged to watermark AI content at the point of origin do so using Unicode—essentially giving AI its own character set—we’ll have a ready-made, fine-grained AI watermark that works across all devices, platforms, operating systems, and websites.
What's proposed here is a character-for-character substitution — like ROT13 encryption but using character codes that are digitally different while being visually the same. As Croll explains:
In Unicode, every character has a number. The Latin Capital Letter A, for example, is hexadecimal number 41. But there are plenty of other A’s in Unicode: There’s Fullwidth Latin Capital Letter A (Ａ, number EF BC A1), Mathematical Bold Capital A (𝐀, number F0 9D 90 80), Mathematical Sans-Serif Capital A (𝖠, F0 9D 96 A0), and plenty of others. Each A has its own name, its own Unicode value, and in some cases, its own font shape. Why not create a letter A just for AI?
If the AI-specific character sets were created — and we'd need many of them, to support all the world's writing systems — then the watermarking process would be a trivial computer program.
Of course, de-watermarking would be an equally trivial program, so what's the point?
Croll's answer:
It’s important to note that this proposed markup is not an enforcement mechanism. Bad actors could easily convert AI text to look like it was written by a human. A recipient still needs to trust a sender in order to believe what is marked up. But that’s one of the strengths of this approach. Once text is marked, a human has to actively remove the AI marker at some stage between the LLM and the consumer. We have legal mechanisms to investigate and deal with negligence or wrongdoing. The proposed protocol simply lets us apply these to AI.
This really puzzles me. It assumes that a student (for example) who's willing to use an LLM as a ghost-writer, despite this being against the rules, will draw the line at running a trivial character-substitution program to disguise their violation.
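To make the "trivial program" point concrete, here is a minimal sketch in Python. No AI-specific alphabet exists in Unicode, so this borrows the Mathematical Sans-Serif letters from Croll's example as a stand-in; the function names `watermark` and `dewatermark` are hypothetical, and only the 52 basic Latin letters are handled.

```python
import string

# Assumption: the Mathematical Sans-Serif letters stand in for a
# hypothetical "AI alphabet" (no such character set exists in Unicode).
SANS_UPPER_START = 0x1D5A0  # MATHEMATICAL SANS-SERIF CAPITAL A
SANS_LOWER_START = 0x1D5BA  # MATHEMATICAL SANS-SERIF SMALL A

PLAIN = string.ascii_uppercase + string.ascii_lowercase
MARKED = (
    "".join(chr(SANS_UPPER_START + i) for i in range(26))
    + "".join(chr(SANS_LOWER_START + i) for i in range(26))
)

# One translation table per direction; everything else passes through.
TO_MARKED = str.maketrans(PLAIN, MARKED)
TO_PLAIN = str.maketrans(MARKED, PLAIN)


def watermark(text: str) -> str:
    """Replace basic Latin letters with their look-alike stand-ins."""
    return text.translate(TO_MARKED)


def dewatermark(text: str) -> str:
    """Undo watermark(): map the stand-in letters back to basic Latin."""
    return text.translate(TO_PLAIN)
```

Under these assumptions, `dewatermark(watermark(s))` returns `s` for any string `s` that contains no Mathematical Sans-Serif letters to begin with, which is the whole problem: removing the mark takes exactly as little effort as adding it.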
Michael P said,
July 28, 2023 @ 6:06 am
What puzzles me is why anyone would identify Unicode characters or code points with their UTF-8 byte sequences rather than their direct values, or call these sequences numbers. The UTF-8 sequence is not a number; it interleaves extra bits for the convenience of other software.
For example, Fullwidth Latin Capital Letter A is usually identified as U+FF21 (https://unicodeplus.com/U+FF21), and most ways to represent it look more like FF21 than EF BC A1.
Writing "In Unicode, every character has a number" suggests unfamiliarity with Unicode, because the Unicode standard is very clear that there is not an N-to-M correspondence between code points and characters for any fixed N or M. Is there really a difference between A, Ａ and 𝐀 as characters? Is Å the same character with a diacritic or a different character? Does adding a combining diacritical mark change the character or its "number"?
Similarly with writing "Fullwidth Latin Capital Letter A (Ａ, number EF BC A1)", both because "number" is not the correct term and because UTF-8 encodings are very unusual to write out in such a context.
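The distinction between a code point and its UTF-8 bytes can be checked with a small sketch in Python, using only the standard library. The helper name `describe` is made up for illustration; `bytes.hex` with a separator assumes Python 3.8 or later.

```python
import unicodedata


def describe(ch: str) -> tuple[str, str, str]:
    """Return (code point, UTF-8 bytes in hex, Unicode name) for one character."""
    code_point = f"U+{ord(ch):04X}"
    utf8_bytes = ch.encode("utf-8").hex(" ").upper()
    return code_point, utf8_bytes, unicodedata.name(ch)


# The same character has one code point but a different-looking byte sequence.
fullwidth_a = describe("\uFF21")
bold_a = describe("\U0001D400")

# "Å" can be one code point (U+00C5) or two (A + combining ring above);
# NFC normalization composes the two-code-point form into the single one.
decomposed = "A\u030A"
composed = unicodedata.normalize("NFC", decomposed)
```

Here `fullwidth_a` pairs the code point `U+FF21` with the bytes `EF BC A1`, and `bold_a` pairs `U+1D400` with `F0 9D 90 80`, matching the values quoted from the article. The `decomposed` string has two code points while `composed` has one, which is one way the "every character has a number" framing breaks down.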
Mark Liberman said,
July 28, 2023 @ 6:12 am
@Michael P: The UTF-8 sequence is not a number; it interleaves extra bits for the convenience of other software.
Indeed, and similarly for other encodings. But this doesn't really change the nature of the idea, or the apparent problem with it.