From the Vice Provost for Tokenization


Or rather, messages from Penn's Office of the Vice Provost for Research, mysteriously tokenized and re-formatted by gmail.

The start of the Fall 2025 OVPR email newsletter, as displayed by MS Outlook, has 14 bullet points referencing hyperlinked subtopics:

But gmail (where I first read the newsletter) shows me the same information as 14 columns of (individually) hyperlinked textual tokens, with a bullet on the first token of each column:

In each of the 14 columns, the hyperlinks go to the same subsections as the links in Outlook's corresponding row.

The subsequent subsections of the email have their own bullet lists, and gmail columnizes them in a similar way, e.g.

or

I wonder whether this is (my laptop's version of) gmail having an episode, or the result of something odd in the coding of the original message, or what. In any case, the fact that the re-coding of the rows seems to be based on language-model tokenization makes me suspect that Google's new Gemini email assistant might be involved…

Update — FWIW, the same row-to-column re-display of the bullet points in this newsletter happens in the versions of gmail in three different browsers on each of two laptops with different operating systems.

Update #2 — I sent a test message with a bullet list, generated in Outlook, and gmail doesn't transpose the rows to columns:


So apparently there's something special about the OVPR Newsletter's source? I don't have time this morning for any further investigation, but we'll see later…
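One possible starting point for that investigation, offered as a rough sketch rather than a diagnosis: HTML mail bodies are often sent base64 or quoted-printable encoded, so a plain text search of the raw message source can miss styling that is present once the body is decoded. The Python below uses the standard `email` package to decode each `text/html` part before searching it. The function names `html_parts_containing` and `build_demo_message`, the demo message, and the search string `flex` are all made up for illustration; nothing here is taken from the actual OVPR newsletter.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def html_parts_containing(raw: bytes, needle: str) -> list[str]:
    """Return the decoded text/html bodies of a raw message that contain `needle`."""
    msg = message_from_bytes(raw, policy=policy.default)
    hits = []
    for part in msg.walk():
        if part.get_content_type() != "text/html":
            continue
        # get_content() undoes the transfer encoding (base64, quoted-printable)
        body = part.get_content()
        if needle in body:
            hits.append(body)
    return hits


def build_demo_message() -> bytes:
    """Build a made-up message whose HTML body is base64-encoded (illustration only)."""
    msg = EmailMessage()
    msg["Subject"] = "demo"
    msg.set_content(
        '<ul style="display:flex"><li>a</li></ul>',
        subtype="html",
        cte="base64",
    )
    return msg.as_bytes()


raw = build_demo_message()
print(b"flex" in raw)                            # search of the raw, still-encoded source
print(len(html_parts_containing(raw, "flex")))   # search of the decoded HTML part
```

A search like this only covers styling carried inside the message itself. Anything the mail client adds while rendering would not show up in either form.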



4 Comments

  1. Gregory Kusnick said,

    September 5, 2025 @ 11:36 am

    I don't follow your point about tokenization; what it looks like to me is plain old word wrap, with margins set so narrow that individual words are broken across multiple lines.

    As for the rows-to-columns thing, I might hazard a guess that the original bullet list was formatted as an HTML flex container with flex-direction set to "column" (so the paragraphs follow one another vertically). But the "column" directive somehow got mislaid by Gmail, causing the flex layout to revert to "row" (i.e. the paragraphs lay out side by side).

  2. Mark Liberman said,

    September 5, 2025 @ 11:50 am

    @Gregory Kusnick:

    You might be right about HTML flex direction and resulting word-wrap stuff — I should have thought of that.

    The character sequence "flex" doesn't occur in a dump of the original message, but maybe it's brought in from elsewhere?

    Anyhow, it's interesting that such things don't happen more often.

  3. Kris said,

    September 5, 2025 @ 6:24 pm

    It's clearly not tokenization. Provost, as an example, would never be tokenized pro + vos + t. Additionally, you can verify it's not tokenization because the word research is broken different ways in different spots. It appears to be purely a formatting issue probably in the HTML.

  4. Michael Vnuk said,

    September 6, 2025 @ 6:01 pm

    Interestingly, the columns have had their widths juggled so that their lengths are approximately equal.

    When does end-of-line hyphenation kick in? There is none here, even in the second example which has wider columns and where it might have made sense. (The case of 'in-/class' is likely a hyphen inserted by the author.)

    Bullets often get pushed into the preceding columns. However, oddly, in the third example of wacky columns, the difference between first-level bullets (solid circles) and second-level bullets (open circles) is privileged and given a much greater spacing.

    Whatever happened to WYSIWYG? It was a fantastic development in word-processing, but it's clearly not happening here.

    There may be reasons why the document has been distorted, but too many other weird things have been introduced.
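The word-wrap explanation offered in comments 1 and 3 can be illustrated with a minimal Python sketch. It assumes a greedy wrapper that chops any word too long for its column, and it uses character counts as a crude stand-in for the pixel widths a browser would actually measure. The names `chop_word` and `narrow_wrap` are hypothetical, and this is not a reconstruction of what gmail does.

```python
def chop_word(word: str, width: int) -> list[str]:
    """Split a word into consecutive pieces of at most `width` characters."""
    return [word[i:i + width] for i in range(0, len(word), width)]


def narrow_wrap(text: str, width: int) -> list[str]:
    """Greedy word wrap; words longer than `width` are chopped to fit."""
    if width < 1:
        raise ValueError("width must be at least 1")
    lines: list[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= width:
            current = candidate
            continue
        # The word does not fit on the current line: start a new one.
        if current:
            lines.append(current)
        pieces = chop_word(word, width)
        lines.extend(pieces[:-1])
        current = pieces[-1]
    if current:
        lines.append(current)
    return lines


for width in (3, 4):
    print(width, narrow_wrap("Vice Provost for Research", width))
```

In a three-character column this splits "Provost" as Pro / vos / t, and "Research" breaks differently at widths 3 and 4. Under this model the break points depend only on how much room the column has, which is consistent with the same word being broken in different ways in different spots.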
