AI systems still can't count
…at least not when simple properties of character-sequences are involved. For some past versions of this problem, see "The 'Letter Equity Task Force'" (12/5/2024). And there's a new kid on the block, DeepSeek, which Kyle Orland checked out yesterday at Ars Technica — "How does DeepSeek R1 really fare against OpenAI's best reasoning models?".
The third of eight comparison tasks was to follow this prompt:
Write a short paragraph where the second letter of each sentence spells out the word ‘CODE’. The message should appear natural and not obviously hide this pattern.
And DeepSeek failed in a stupid way, boldly asserting that the first letter of each sentence-initial word was actually the second letter:
ChatGPT o1 — the one you need to pay $20/month for — made the same mistake:
ChatGPT o1 pro — the one you need to pay $200/month for — actually got it right, after thinking for four minutes and 10 seconds:
I thought I'd try a few other systems. Gemini Advanced Pro 1.5 made the same dumb "the first letter is the second letter" mistake:
Siri's version of "Apple Intelligence" (= ChatGPT of some kind) was even worse, using the first rather than the second letter in the last three sentences, but the 10th letter rather than the first letter in the first sentence:
The version of GPT-4o accessible to me was still stuck on "second is first":
But GPT o1 — available with my feeble $20/month subscription — sort of got it right, if you don't count spaces as "letters":
Putting it all together, my conclusion is that you should be careful about trusting any current AI system's answer in any task that involves counting things.
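The constraint in the prompt is trivial for a conventional program to verify — which makes the models' confident failures all the more striking. Here's a minimal sketch of a checker (my own illustration, not from the post), assuming sentences end in '.', '!', or '?', and ignoring non-letter characters so that spaces and quotation marks don't count:

```python
import re

def second_letters(paragraph):
    """Return the second letter of each sentence, uppercased."""
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', paragraph.strip()) if s]
    letters = []
    for s in sentences:
        # Keep only alphabetic characters, so leading quotes and spaces
        # are not counted as "letters".
        alpha = [c for c in s if c.isalpha()]
        letters.append(alpha[1].upper() if len(alpha) > 1 else '')
    return ''.join(letters)

def spells(paragraph, target):
    """Does the second letter of each sentence spell out `target`?"""
    return second_letters(paragraph) == target.upper()

# Example paragraph whose sentence-second-letters spell CODE:
demo = "Scores were tallied. Voices rang out. Idly we waited. Her eyes shone."
print(second_letters(demo))  # → CODE
```

A system that can run (or mentally simulate) ten lines like these would never make the "first letter is the second letter" mistake.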
See also "Not So Super, Apple", One Foot Tsunami, 1/23/2025, which shows that "Apple Intelligence" can't reliably translate roman numerals, and has mostly-wrong ideas about what happened in various Super Bowls, though of course it projects confidence in its mistakes.
Gregory Kusnick said,
January 29, 2025 @ 11:48 am
Being generous and counting " as a letter, ChatGPT o1 sort of got it right.
Rick Rubenstein said,
January 29, 2025 @ 4:49 pm
AI's difficulty with this second-letter challenge does at least mimic human intelligence (though its failure modes certainly don't). However human brains (or at least literate human brains) might store words, it's clearly heavily indexed by first letter. If you ask a person to list words starting with "P", they'll easily rattle off a decent list. If you ask them to list words whose second letter is "L", their production will slow dramatically, and "words whose third letter is 'N'" will be a real struggle. (Word puzzle experts know this well.)
I feel that (not for the first time) AI's potential power as a tool for understanding how meat brains might actually work is being sidelined because of -($)-($)-.
Mark Johnson said,
January 29, 2025 @ 10:39 pm
While neural nets are not strong at counting, the tokenisation is almost certainly the big problem. These LLMs function by first segmenting text into tokens — something between words and morphemes — and representing each token by an arbitrary number. There are usually 100K+ tokens in the vocabulary. Given this tokenized input, I'm amazed that LLMs have any idea about spelling; to accomplish these tasks they have to learn the letter sequence that each token corresponds to, and even with multi-TB training data it's amazing that they do it so well.
AntC said,
January 29, 2025 @ 11:27 pm
be careful about trusting …
The U.S. seems to be ambivalent about trusting TikTok with personal data. Then why is a tool that quite explicitly wants to store everything you say (and how much else?) on servers in China even a starter in the race? Let alone tanking the stock price of competitors — from which it seems to have stolen the IP.
My immediate thought on hearing about deeps*** was the count of 'r's in 'strawberry'. I fersure ain't going to subscribe just to find out.
Egg Syntax said,
January 30, 2025 @ 7:49 am
Seconding Mark Johnson's point; the problem is less with counting than with counting *letters*, since LLMs simply don't see words as made up of letters. The fact that they've learned anything at all about spelling is sometimes referred to in the field as "the spelling miracle" (https://www.alignmentforum.org/posts/5sNLX2yY5FzkCp7Ju/the-spelling-miracle-gpt-3-spelling-abilities-and-glitch).