AI systems still can't count


…at least not when simple properties of character-sequences are involved. For some past versions of this problem, see “The ‘Letter Equity Task Force’” (12/5/2024). And there's a new kid on the block, DeepSeek, which Kyle Orland checked out yesterday at Ars Technica — "How does DeepSeek R1 really fare against OpenAI’s best reasoning models?".

The third of eight comparison tasks was to follow this prompt:

Write a short paragraph where the second letter of each sentence spells out the word ‘CODE’. The message should appear natural and not obviously hide this pattern.

And DeepSeek failed in a stupid way, boldly asserting that the first letter of each sentence-initial word was actually the second letter:

ChatGPT o1 — the one you need to pay $20/month for — made the same mistake:

ChatGPT o1 pro — the one you need to pay $2000/month for — actually got it right, after thinking for four minutes and 10 seconds:

I thought I'd try a few other systems. Gemini Advanced Pro 1.5 made the same dumb "the first letter is the second letter" mistake:

Siri's version of "Apple Intelligence" (= ChatGPT of some kind) was even worse, using the first rather than the second letter in the last three sentences, and the 10th rather than the second letter in the first sentence:

The version of GPT-4o accessible to me was still stuck on "second is first":

But GPT o1 — available with my feeble $20/month subscription — sort of got it right, if you don't count spaces as "letters":

Putting it all together, my conclusion is that you should be careful about trusting any current AI system's answer in any task that involves counting things.
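A check like this is mechanical, of course. A few lines of Python do what the models struggled with (the example paragraph below is invented for illustration — it is not any model's actual output):

```python
import re

def second_letters(paragraph):
    """Return the second letter of each sentence, uppercased."""
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', paragraph.strip()) if s]
    return "".join(s[1].upper() for s in sentences)

# Invented paragraph whose sentences' second letters spell CODE:
para = ("Acres of farmland stretched east. Motors hummed in the distance. "
        "Odds were against us. Leaves drifted across the road.")
print(second_letters(para))  # CODE
```

The same function applied to any of the systems' answers would have flagged the "second is first" mistake immediately.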

See also “Not So Super, Apple”, One Foot Tsunami, 1/23/2025, which shows that "Apple Intelligence" can't reliably translate Roman numerals, and has mostly-wrong ideas about what happened in various Super Bowls, though of course it projects confidence in its mistakes.



13 Comments »

  1. Gregory Kusnick said,

    January 29, 2025 @ 11:48 am

    Being generous and counting " as a letter, ChatGPT o1 sort of got it right.

  2. Rick Rubenstein said,

    January 29, 2025 @ 4:49 pm

    AI's difficulty with this second-letter challenge does at least mimic human intelligence (though its failure modes certainly don't). However human brains (or at least literate human brains) might store words, it's clearly heavily indexed by first letter. If you ask a person to list words starting with "P", they'll easily rattle off a decent list. If you ask them to list words whose second letter is "L", their production will slow dramatically, and words whose third letter is "N" will be a real struggle. (Word puzzle experts know this well.)

    I feel that (not for the first time) AI's potential power as a tool for understanding how meat brains might actually work is being sidelined because of $$$.

  3. Mark Johnson said,

    January 29, 2025 @ 10:39 pm

    While neural nets are not strong at counting, the tokenisation is almost certainly the big problem. These LLMs function by first segmenting text into tokens — something between words and morphemes — and representing each token by an arbitrary number. There are usually 100K+ tokens. Given this tokenized input I'm amazed that LLMs have any idea about spelling; to accomplish these tasks they have to learn the letter sequence that each token corresponds to, and even with multi-TB training data it's amazing that they do it so well.
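The point can be made concrete with a toy tokenizer. The two-entry vocabulary and its token IDs below are invented for illustration (real vocabularies have ~100K entries); the mechanism — greedy longest-match segmentation into opaque IDs — is the same in spirit:

```python
# Invented two-token vocabulary; real LLM vocabularies are ~100K entries.
toy_vocab = {"straw": 101, "berry": 102}

def toy_tokenize(word):
    """Greedy longest-match segmentation over the toy vocabulary."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in toy_vocab:
                tokens.append(toy_vocab[word[i:j]])
                i = j
                break
        else:
            raise ValueError(f"no token covers {word[i:]!r}")
    return tokens

ids = toy_tokenize("strawberry")   # [101, 102]

# The model "sees" only [101, 102]; recovering letter counts requires
# knowing the character string behind each ID:
inv = {v: k for k, v in toy_vocab.items()}
r_count = sum(inv[t].count("r") for t in ids)
print(ids, r_count)  # [101, 102] 3
```

Nothing in the ID sequence itself reveals that the word contains three r's — that association has to be learned from data.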

  4. AntC said,

    January 29, 2025 @ 11:27 pm

    be careful about trusting …

    The U.S. seems to be ambivalent about trusting TikTok with personal data. Then why is a tool that quite explicitly wants to store everything you say (and how much else?) on servers in China even a starter in the race? Let alone tanking the stock price of competitors — from which it seems to have stolen the IP.

    My immediate thought on hearing about deeps*** was the count of 'r's in 'strawberry'. I fersure ain't going to subscribe just to find out.

  5. Egg Syntax said,

    January 30, 2025 @ 7:49 am

    Seconding Mark Johnson's point; the problem is less with counting than with counting *letters*, since LLMs simply don't see words as made up of letters. The fact that they've learned anything at all about spelling is sometimes referred to in the field as "the spelling miracle" (https://www.alignmentforum.org/posts/5sNLX2yY5FzkCp7Ju/the-spelling-miracle-gpt-3-spelling-abilities-and-glitch).

  6. Chris Button said,

    January 30, 2025 @ 8:43 am

    Thirding the point.

    The tokenization issue has now been discussed several times on LLog, including in the first link in the post.

  7. Scott P. said,

    January 30, 2025 @ 9:25 am

    Chris,

    So you're saying it's gotten more than token attention?

  8. Jake V. said,

    January 30, 2025 @ 9:26 am

    "ChatGPT o1 pro — the one you need to pay $2000/month for — actually got it right, after thinking for four minutes and 10 seconds:"

    Looks like a typo for $200/month, according to the linked Ars Technica article and Google.

  9. Jonathan Smith said,

    January 30, 2025 @ 10:21 am

    Point is, these systems confidently provide wrong solutions to this problem. Tokenization is part of the essence of what LLMs are; for literally ANY wrong answer the explanation is precisely "well tokenization" (or equivalently "well LLM").

    WTF is the long-term vision for application of LLMs though? What we have is "produce a picture that looks like a person but there's seven fingers on one hand since the system has no idea what a person is, so I adjust it, hope that's the only thing wrong, and proceed to use it hoping it passes for a picture of a person," right? This is OK (in fact great) for my history homework or my email to my boss but probably not great for the code that needs to land Starship on Mars, correct? Probably for any task of any real consequence the artist should know what a person is, or?

  10. Philip Taylor said,

    January 30, 2025 @ 1:43 pm

    Well, not sharing Ant's distaste for DeepSeek, I created an account as soon as I was able to, and having just read this thread for the first time, participated in the following dialogue with it :

    Please tell me how many "r"s there are in "strawberry".

    Step 1: Understanding the Problem

    Before diving into counting, it's essential to understand what's being asked. The question is straightforward: determine how many times the letter "r" appears in the word "strawberry." This involves examining each letter in the word and tallying the occurrences of "r."

    Step 2: Breaking Down the Word

    Let's start by writing down the word clearly:

    strawberry

    To make it easier, I'll list out each letter with its position in the word:

    s

    t

    r

    a

    w

    b

    e

    r

    r

    y

    Step 3: Identifying the Letter "r"

    Now, I'll go through each letter one by one to identify the "r"s.

    s – Not an "r."

    t – Not an "r."

    r – This is the first "r." (Count: 1)

    a – Not an "r."

    w – Not an "r."

    b – Not an "r."

    e – Not an "r."

    r – This is the second "r." (Count: 2)

    r – This is the third "r." (Count: 3)

    y – Not an "r."

    Step 4: Counting the "r"s

    From the breakdown:

    The letter "r" appears at positions 3, 8, and 9.

    This totals 3 occurrences of the letter "r" in "strawberry."

    Step 5: Verifying the Count

    To ensure accuracy, I'll recount the "r"s:

    First "r" at position 3.

    Second "r" at position 8.

    Third "r" at position 9.

    No additional "r"s are present beyond these three. Therefore, the count remains consistent at 3.

    Step 6: Considering Potential Miscounts

    It's easy to miscount, especially with similar-looking letters or when letters repeat. To minimize errors:

    I wrote down each letter separately.

    I marked each "r" as I encountered it.

    I recounted to confirm the total.

    This methodical approach reduces the likelihood of overlooking any "r"s or counting any extra.

    Step 7: Final Answer

    After carefully analyzing and verifying, the letter "r" appears 3 times in the word "strawberry."

    Final Answer: There are 3 "r"s in "strawberry."
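For comparison, the seven-step procedure above reduces to a single deterministic string operation; a minimal Python sketch:

```python
word = "strawberry"
count = word.count("r")  # number of occurrences of "r"
positions = [i + 1 for i, ch in enumerate(word) if ch == "r"]  # 1-based
print(count, positions)  # 3 [3, 8, 9]
```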

  11. Philip Taylor said,

    January 30, 2025 @ 1:51 pm

    I should add that I have now asked it to solve the "CODE" problem, and iterated with it several times explaining to it on each occasion the mistake(s) that it has now made. As a result, it now appears to have gone into an infinite loop …

  12. Philip Taylor said,

    January 30, 2025 @ 2:25 pm

    Oh, and DS clearly understands neither "final" nor "last", since it repeatedly says "I apologize for the confusion. Let me try one last time:" or "I’m clearly not meeting the requirement. Let me try one final time:".

  13. Kenny Easwaran said,

    January 30, 2025 @ 2:36 pm

    Jonathan Smith – I think you're a few months out of date when you talk about image generators producing people with the wrong number of fingers. Hands are still difficult for these systems (just as they are for humans – ask any human artist how long they spent practicing drawing hands specifically!) but they've added a lot more relevant training data and focused on the problem.

    But more generally, I think you're making the same mistake about the "jagged frontier" of AI progress that the hypesters make on the other side. Human intelligence and machine intelligence are as different as cat intelligence and squid intelligence. Each of these intelligences makes mistakes that seem glaring and silly to each of the other intelligences, but each also has capacities that none of the others have been able to duplicate. The fact that a cat makes the stupid mistake of being scared by a cucumber doesn't mean that we shouldn't trust the cat with the much more difficult job of keeping rodents out of the building. Language models are not the right tool if you want to count letters systematically, but they are a great tool if you want to automatically process the blurbs for tens of thousands of podcasts and extract the names of guests and hosts: https://www.pewresearch.org/short-reads/2024/02/06/how-we-used-large-language-models-to-identify-guests-on-popular-podcasts/

    If you want to land a spaceship on Mars, you're likely going to use a different sort of AI system that has strengths in relevant skills, even if it would make really stupid mistakes when you try to apply it to something else.
