The most recent xkcd:
Mouseover title: "Collaborative editing can quickly become a textual rap battle fought with increasingly convoluted invocations of U+202a to U+202e."
What convolutions are really available? The Wikipedia entry on Bi-directional text says:
[T]he Unicode standard provides foundations for complete BiDi support, with detailed rules as to how mixtures of left-to-right and right-to-left scripts are to be encoded and displayed.
In Unicode encoding, all non-punctuation characters are stored in writing order. This means that the writing direction of characters is stored within the characters. If this is the case, the character is called "strong". Punctuation characters however, can appear in both LTR and RTL scripts. They are called "weak" characters because they do not contain any directional information. So it is up to the software to decide in which direction these "weak" characters will be placed. Sometimes (in mixed-directions text) this leads to display errors, caused by the BiDi-algorithm that runs through the text and identifies LTR and RTL strong characters and assigns a direction to weak characters, according to the algorithm's rules.
In the algorithm, each sequence of concatenated strong characters is called a "run". A weak character that is located between two strong characters with the same orientation will inherit their orientation. A weak character that is located between two strong characters with a different writing direction, will inherit the main context's writing direction (in an LTR document the character will become LTR, in an RTL document, it will become RTL). If a "weak" character is followed by another "weak" character, the algorithm will look at the first neighbouring "strong" character. Sometimes this leads to unintentional display errors. These errors are corrected or prevented with "pseudo-strong" characters. Such Unicode control characters are called marks. The mark U+200E (left-to-right mark) or U+200F (right-to-left mark) is to be inserted into a location to make an enclosed weak character inherit its writing direction.
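As a rough illustration of the "strong"/"weak" distinction (my addition, not part of the Wikipedia text), Python's standard unicodedata module will report each character's bidi class. Latin letters come back as strong 'L', Hebrew letters as strong 'R', while punctuation and whitespace get weak or neutral classes; and the invisible marks U+200E and U+200F are themselves classified as strong, which is exactly why inserting one can settle a weak character's direction:

```python
import unicodedata

# Sample characters: Latin letter, Hebrew alef, full stop, space,
# and the two invisible directional marks.
for ch in ['a', '\u05D0', '.', ' ', '\u200E', '\u200F']:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: "
          f"bidi class {unicodedata.bidirectional(ch)}")
```

Running this shows 'L' for the Latin letter and for U+200E, 'R' for the Hebrew letter and for U+200F, and weak/neutral classes ('CS', 'WS') for the punctuation and the space.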
This suggests that the text-direction "marks" only affect the ordering of "weak" characters. But this page on "Understanding Bidirectional Text" tells a more complex story: In addition to the "implicit markers" U+200E and U+200F, Unicode has two kinds of "explicit" text-direction markers, the "embeds" (U+202A and U+202B) and the "overrides" (U+202D and U+202E). These must be canceled by the "Pop-Directional-Formatting" (PDF) character U+202C.
In the comic, Black Hat Guy uses U+202E, the "Right-to-Left Override" (RLO), which causes following characters to be considered "strong" in the right-to-left direction until the override is ended by a PDF. Does this kind of thing work?
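For reference, here are all five explicit formatting characters and their Unicode names and bidi classes, again via Python's unicodedata module, along with a sketch (my own, hypothetical example string) of the kind of RLO…PDF sandwich the comic has in mind:

```python
import unicodedata

controls = {
    '\u202A': 'embedding, LTR',
    '\u202B': 'embedding, RTL',
    '\u202D': 'override, LTR',
    '\u202E': 'override, RTL',
    '\u202C': 'pop (ends the nearest embedding/override)',
}
for ch, role in controls.items():
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} "
          f"[bidi class {unicodedata.bidirectional(ch)}]: {role}")

# Black Hat Guy's move: wrap ordinary text in RLO ... PDF, so that
# a renderer honoring the override displays "editing" as "gnitide".
trick = '\u202E' + 'editing' + '\u202C'
```

Note that the string `trick` is unchanged in memory; only its visual ordering is affected, and (as the test page shows) only in renderers that actually honor the override.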
Check out RTLtest.html to see.
Screenshot from Chrome on Linux here:
As the test page notes, Black Hat Guy has apparently managed to subvert the implementation of this corner of Unicode interpretation in all non-Microsoft browser projects, so that the effect of his interventions will be non-deterministic in the absence of detailed knowledge about his collaborators' browser and OS choices.
(We need to do the test in a separate html file because WordPress has a set of obnoxious and hard-to-evade filters, which munge the text of posts in what someone once thought would be intelligent and helpful ways. Among many other annoyances, this results in the loss of characters like the RLO and LRO marks. This same misfeature has also required me to edit the second paragraph of the Wikipedia quotation displayed above, because WordPress blithely ignores explicit instructions to the contrary and removes or defaces its attempts to display the relevant code points.)
Update — OK, following up on suggestions in the comments, here's a version of the test using Latin letters. Chrome on Linux screenshot:
So indeed this works in some browser/OS combinations where the other file didn't. I still don't understand why the number-string version of the test produces such curious and variable results. It obviously has something to do with the fact that in (most?) RTL languages, digit strings are rendered LTR, though this doesn't explain why the right-hand digit string is rendered RTL. And perhaps in the non-IE browsers the second digit string winds up on the left (a problem that I didn't think about), though it's still not clear to me why that should happen with digits and not with letters. Convolutions within convolutions; whatever, it just shows that Black Hat Guy has been hard at work in the councils of the Unicode Consortium as well.
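One relevant fact, at least (my observation, not a full explanation of the test page's behavior): the bidi algorithm does not treat digits as strong characters at all. ASCII digits have the weak class 'EN' (European Number), and Arabic-Indic digits the weak class 'AN', both handled by special number rules, whereas Latin letters are strong 'L'. That asymmetry alone guarantees that a digits-only test and a letters-only test exercise different branches of the algorithm:

```python
import unicodedata

# Digits are weak number types ('EN'/'AN'); letters are strong.
for ch in ['7', '\u0661', 'x', '\u05D0']:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: "
          f"{unicodedata.bidirectional(ch)}")
```

This prints 'EN' for the ASCII digit, 'AN' for ARABIC-INDIC DIGIT ONE, and the strong classes 'L' and 'R' for the Latin and Hebrew letters.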
Feel free to tell us about your explorations of these issues in various browsers and text editors and word processing programs. Or (as the cartoon suggests) in IM contexts?