Grok (mis-)counting letters again
In a comment on "AI counting again" (12/20/2024), Matt F asked "Given the misspelling of ‘When’, I wonder how many ‘h’s the software would find in that sentence."
So I tried it — and the results are even more spectacularly wrong than Grok's pitiful attempt to count instances of 'e', where the correct count is 50 but Grok answered "21".
The correct answer is actually not 4, but 27. Here's the output of my little letter-counting-and-superscripting program:
Wh^1en in th^2e Course of h^3uman events, it becomes necessary for one people to dissolve th^4e political bands wh^5ich^6 h^7ave connected th^8em with^9 anoth^10er, and to assume among th^11e powers of th^12e earth^13, th^14e separate and equal station to wh^15ich^16 th^17e Laws of Nature and of Nature's God entitle th^18em, a decent respect to th^19e opinions of mankind requires th^20at th^21ey sh^22ould declare th^23e causes wh^24ich^25 impel th^26em to th^27e separation.
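(For the curious, here's a minimal Python sketch of this kind of letter-counting-and-superscripting routine; a reconstruction of the idea, not necessarily the exact program used above.)

```python
def count_and_superscript(text, letter):
    """Append a running index (rendered here as ^n) after each
    occurrence of `letter` in `text`, and return the total count."""
    count, out = 0, []
    for ch in text:
        out.append(ch)
        if ch == letter:
            count += 1
            out.append(f"^{count}")
    return "".join(out), count

sentence = (
    "When in the Course of human events, it becomes necessary for one "
    "people to dissolve the political bands which have connected them "
    "with another, and to assume among the powers of the earth, the "
    "separate and equal station to which the Laws of Nature and of "
    "Nature's God entitle them, a decent respect to the opinions of "
    "mankind requires that they should declare the causes which impel "
    "them to the separation."
)

marked, n = count_and_superscript(sentence, "h")
print(n)       # 27
print(marked)  # Wh^1en in th^2e Course of h^3uman events, ...
```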
Obviously, counting letters is not a key criterion of intelligence — but it's an easy thing for a computer program to do, and for an allegedly intelligent computer program to confidently state a false count is not exactly evidence of intellectual brilliance.
Update — RFP asks in the comments:
“How many instances of the letter ‘h’ are there in that sentence?” happens to contain 4 instances of the letter ‘h.’
Could that be part of the problem? Is Grok counting the right number of letters in this case—but from the wrong sentence?
(Assuming that a capital ‘h’ merits inclusion in the count…)
Maybe, if Grok is just counting lower-case instances of 'h' — but asking again, more explicitly, just gets a different wrong answer, namely "17" instead of 27:
[…leaving out some of the (partly correct) word listing…]
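RFP's conjecture is at least easy to verify directly (my check, not Grok's): his question does contain exactly 4 lower-case instances of 'h'.

```python
q = "How many instances of the letter 'h' are there in that sentence?"
print(q.count("h"))          # 4 -- lower-case 'h' only
print(q.lower().count("h"))  # 5 -- including the capital 'H'
```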
Update 2 — Apparently Grok can't count words either:
There are actually 9 instances of "the" in that sentence. And sometimes Grok over-counts rather than under-counting:
There are actually 5 instances of "of" in that sentence.
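Checking those counts mechanically is equally easy; a word-boundary regex keeps "them", "they", and "there" from being counted as instances of "the":

```python
import re

sentence = (
    "When in the Course of human events, it becomes necessary for one "
    "people to dissolve the political bands which have connected them "
    "with another, and to assume among the powers of the earth, the "
    "separate and equal station to which the Laws of Nature and of "
    "Nature's God entitle them, a decent respect to the opinions of "
    "mankind requires that they should declare the causes which impel "
    "them to the separation."
)

print(len(re.findall(r"\bthe\b", sentence)))  # 9
print(len(re.findall(r"\bof\b", sentence)))   # 5
```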
All this leads me to wonder, again, what's going on. Is this a consequence of weird string tokenization methods? Or are these systems just making up a plausible-seeming answer by contextual LLM word-hacking? Or both? Or both and more?
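For what it's worth, the tokenization hypothesis is at least plausible: LLMs see text as sequences of subword token IDs rather than letters. Grok's tokenizer isn't public, so the sketch below uses OpenAI's open-source tiktoken library purely to illustrate the general idea.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # an OpenAI tokenizer, for illustration
tokens = enc.encode("When in the Course of human events")
print([enc.decode([t]) for t in tokens])
# Expect something like: ['When', ' in', ' the', ' Course', ' of', ' human', ' events']
# The model manipulates opaque token IDs like these; the letters inside a
# token (and thus any letter count) are never directly visible to it.
```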
RfP said,
December 23, 2024 @ 2:26 pm
“How many instances of the letter ‘h’ are there in that sentence?” happens to contain 4 instances of the letter ‘h.’
Could that be part of the problem? Is Grok counting the right number of letters in this case—but from the wrong sentence?
RfP said,
December 23, 2024 @ 2:31 pm
(Assuming that a capital ‘h’ merits inclusion in the count…)
Philip Taylor said,
December 23, 2024 @ 2:35 pm
I make it five, RfP : “H^1ow many instances of th^2e letter ‘h^3’ are th^4ere in th^5at sentence?”
RfP said,
December 23, 2024 @ 2:47 pm
You’re right, Philip!
And that would only seem to heighten the possibility that Grok mismatched the sentences, as I’d have assumed that it could conceivably categorize ‘H’ as a completely different letter. I was actually kind of disappointed that I had to rely on the capital letter.
Thanks for noticing that.
Daniel Barkalow said,
December 23, 2024 @ 5:53 pm
I think the problem with having an AI system use a program to accurately count letters or otherwise do math is that the LLM doesn't have a model of participants in a conversation performing tasks other than participating in the conversation. There's nothing in the training data that would indicate that a participant did something non-linguistic, got a result, and incorporated the result into their response. An LLM can respond appropriately to being invited to a party that someone in the LLM's position in the interaction would agree to go to, but it's not going to actually show up to the party. Similarly, it can talk like someone who used a program to count letters in a text, but it can't actually do that.