LLMs can't reason?
…though they often do a credible job of faking it. An interesting (preprint) paper by Konstantine Arkoudas, "GPT-4 Can't Reason", brings the receipts.
The abstract:
GPT-4 was released in March 2023 to wide acclaim, marking a very substantial improvement across the board over GPT-3.5 (OpenAI’s previously best model, which had powered the initial release of ChatGPT). Despite the genuinely impressive improvement, however, there are good reasons to be highly skeptical of GPT-4’s ability to reason. This position paper discusses the nature of reasoning; criticizes the current formulation of reasoning problems in the NLP community and the way in which the reasoning performance of LLMs is currently evaluated; introduces a collection of 21 diverse reasoning problems; and performs a detailed qualitative analysis of GPT-4’s performance on these problems. Based on the results of that analysis, this paper argues that, despite the occasional flashes of analytical brilliance, GPT-4 at present is utterly incapable of reasoning.
This is generally consistent with my (much more limited) experiments, and also accords with the obvious idea that language models are designed and trained to solve very different kinds of problems. In a way, the biggest surprise is that they often do such a good job of pretending to answer questions that are entirely beyond them. Then again, anyone with experience as a teacher (or for that matter as a student) is already familiar with the same sort of behavior.
Part of Arkoudas's conclusion strikes me as especially important:
Section 3 paints a bleak picture of GPT-4’s reasoning ability. It shows that the model is plagued by internal inconsistency, an inability to correctly apply elementary reasoning techniques, and a lack of understanding of concepts that play a fundamental role in reasoning (such as the material conditional). These problems can be loosely viewed as forms of hallucination, but as pointed out in the January article, they present a fundamentally different type of challenge from empirical hallucination, because empirical hallucination concerns this particular world whereas logical properties and relations (such as consistency and entailment) must apply to all possible worlds. It is not unreasonable to believe that search engines and knowledge graphs, using techniques such as retrieval augmentation, can act as guardrails to constrain LLMs from confabulating empirical truths. But ensuring that LLM outputs are internally consistent and logically correct answers to arbitrary problems, especially logico-mathematical problems (and a lot of coding problems fall under this category), is a much harder problem. There is nothing to be retrieved from the web or from a knowledge base in response to a brand new problem (and even if there were, there would still be no guarantee of correctness or consistency) that could serve as a sandbox for the LLM.
Could LLMs make progress by outsourcing reasoning problems to external systems? That might work for toy problems where the type of reasoning needed is obvious and can be handled by a single call to an external system, although even in those cases the LLM would have to (a) decide which reasoning system is most appropriate; (b) decide whether the problem is indeed simple enough that it can be handled by the chosen system in one fell swoop; (c) correctly translate the problem into whatever formal notation is used by the chosen reasoner; and eventually also (d) translate the reasoner’s output into appropriate text. Even these tasks are far from straightforward. But the real challenge lies in harder problems that call for the right type of formulation (which is a craft by itself), decomposition, iteration, heuristics, and repeated calls to external systems. After all, automated reasoning systems, particularly those for expressive logics, are themselves of limited power, precisely due to the computational complexity issues mentioned in the introduction. That is why many computer-based proof efforts to this day are guided by humans, with automated reasoners only filling in tedious details at the leaves of the proof tree. The challenges here are similar to those for the general “plug-in” approach discussed in Section 3.1. Tackling complex problems requires planning, and planning itself requires reasoning.
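To make steps (c) and (d) above concrete, here is a minimal sketch of what "calling an external reasoner" amounts to for a toy problem. It assumes the z3-solver Python package as the external system; in a real pipeline the LLM itself, not a human, would have to produce the formalization from the natural-language input, which is exactly where the paper expects things to go wrong.

```python
# Toy illustration of outsourcing a reasoning step to an external solver.
# Assumes the z3-solver package (pip install z3-solver); the formalization
# below is hand-written, standing in for what an LLM would have to emit.
from z3 import Bools, Implies, Not, Solver, unsat

# Natural-language problem: "If it rains, the grass is wet. The grass is
# not wet. Does it follow that it did not rain?"  (modus tollens)
rains, wet = Bools("rains wet")

premises = [Implies(rains, wet), Not(wet)]
conclusion = Not(rains)

# The conclusion follows iff premises plus the negated conclusion
# are jointly unsatisfiable.
s = Solver()
s.add(*premises)
s.add(Not(conclusion))

# Step (d): translate the reasoner's verdict back into text.
if s.check() == unsat:
    print("Entailed: given the premises, it did not rain.")
else:
    print("Not entailed; counterexample:", s.model())
```

Even in this trivial case, the hard parts the paper points to — deciding that propositional logic suffices and getting the formalization right — happen before the solver is ever called.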
My experience with systems of this kind, in speech as well as in text, is that they're a big step forward in deriving input features for other problem-specific algorithms. And it's well known that there are real LLM-like end-to-end solutions for problems like translation, speech-to-text, text-to-speech, and so on. In cases where the facts of the world are relevant, the hallucination problem arises, and can perhaps be mitigated by the various "guardrails" that Arkoudas suggests. But when reasoning comes into the picture, it's a different (and difficult) matter, and a problem that deserves active investigation rather than a naive confidence that it has already been solved, or soon will be.
Update — "anyone with experience as a teacher (or for that matter as a student) is already familiar with the same sort of behavior" is far too parochial. I should have written "anyone with experience in human interactions…"
Chris Barts said,
August 8, 2023 @ 1:32 pm
It's true they can't reason, but they can do a credible job of producing natural-language text on a given topic, which raises interesting questions nobody seems to be asking: How much of human cognition is like a GPT model? How much text do we produce without much actual reasoning? People seem too polarized (either insisting that GPTs are full-bore AI or insisting that there's no similarity between what GPTs do and what humans do) to even see the question, let alone be able to reason about it.
Y said,
August 8, 2023 @ 2:08 pm
Another demonstration of the inability of current LLMs to reason is in Katzir, "Why large language models are poor theories of human linguistic cognition."
https://ling.auf.net/lingbuzz/007190
Dan Romer said,
August 8, 2023 @ 3:06 pm
The finding that these programs can't reason like people doesn't make me feel any better about what would happen if we turned over important tasks to them. HAL may not have been the best reasoner, but once we gave it the ability to control important functions (presumably due to its impressive calculation abilities), all hell broke loose.
Mehmet Oguz Derin said,
August 8, 2023 @ 3:50 pm
I think the issues in the PDF text are mostly already on the radar of research. It is not that existing models are incapable of reasoning or such operations; rather, the current architecture limits that ability in proportion to depth. Accordingly, research is pursuing both approaches: giving existing LLMs more room to increase correctness, and developing novel building blocks that can model recursion or iteration better (of course, efforts to constrain input-output and to integrate external computation are also in progress). The following papers take the first path and could be a helpful reference for those wondering how to guide the prompting of these systems to do better without training (and if I'm not mistaken, the PDF text cites neither, nor does it say much about this direction of prompting); a rough sketch of the self-consistency idea follows the references below.
Self-Consistency Improves Chain of Thought Reasoning in Language Models: https://arxiv.org/abs/2203.11171
> … Our extensive empirical evaluation shows that self-consistency boosts the performance of chain-of-thought prompting with a striking margin on a range of popular arithmetic and commonsense reasoning benchmarks…
Large Language Model Guided Tree-of-Thought: https://arxiv.org/abs/2305.08291
> … Experimental results show that the ToT framework can significantly increase the success rate of Sudoku puzzle solving…
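As a rough illustration of the self-consistency idea from the first paper: sample several chain-of-thought completions at nonzero temperature and take a majority vote over their final answers. In the sketch below, `sample_completion` and `extract_final_answer` are hypothetical placeholders for whatever model call and answer-parsing a real pipeline would use.

```python
# Minimal sketch of self-consistency decoding: sample several reasoning
# chains and return the most common final answer.
# `sample_completion` and `extract_final_answer` are hypothetical
# placeholders for an actual model call and answer parser.
from collections import Counter
from typing import Callable

def self_consistent_answer(
    prompt: str,
    sample_completion: Callable[[str], str],     # one sampled chain-of-thought
    extract_final_answer: Callable[[str], str],  # pull the answer out of the chain
    n_samples: int = 10,
) -> str:
    answers = []
    for _ in range(n_samples):
        chain = sample_completion(prompt)        # stochastic: temperature > 0
        answers.append(extract_final_answer(chain))
    # Marginalize over reasoning paths by majority vote on the final answer.
    return Counter(answers).most_common(1)[0][0]
```

The vote helps only to the extent that the sampled chains err in different ways; it does not, by itself, make any individual chain more logically sound.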
Paul Garrett said,
August 8, 2023 @ 5:50 pm
Indeed, I have come to wonder how much "human intelligence" is essentially mimicry, rather than "reasoning".
I do wonder this even about supposedly pretty fancy "human reasoning", such as mathematics research, where (in my observation) most things are almost-entirely imitations of something else.
While we're here, I'm tempted to rhetorically ask for evidence that humans generally "reason", when they don't vote in their own best interest, and so on. :) That's not my main point… but… still. :)
AntC said,
August 8, 2023 @ 6:50 pm
most things are almost-entirely imitations of something else.
Yeah. The trick is to see which thing is an imitation of what else. Why is a raven like a writing-desk?
I'm tempted to ask rhetorically why people put vacuous claims on chat sites.
Wanda said,
August 8, 2023 @ 8:42 pm
I don't have experience with ChatGPT 4, but my experience with the free version is that it's about as good at hard biology problems as my B students. Actually, the responses they give are extremely similar. I suspect that when faced with an unfamiliar domain of knowledge, my students are mostly Googling and doing word association as well, instead of using logical reasoning.
Julian said,
August 9, 2023 @ 12:52 am
@AntC
What do John the Baptist and Winnie the Pooh have in common?
…
…
…
…
Their middle name.
Daniel Barkalow said,
August 9, 2023 @ 11:28 am
I'm with Douglas Hofstadter (and, IIRC, Ray Jackendoff) on this: language is not how we perform logical reasoning, it's how we discover that we've already performed logical reasoning, in a way that's not directly accessible to consciousness. It's an important part of the process, in that we can examine an argument and discover errors in it because it is accessible through language, but the idea that language is how reasoning is performed is an illusion. Logical reasoning is one of the many things an LLM can talk as if it had done, but you should no more believe it than if it says it carved a statue or played tennis with you.
Paul Frank said,
August 9, 2023 @ 11:41 am
@Chris: "How much of human cognition is like a GPT model?" Exactly: I've been thinking this for months and telling anyone who'll listen (no one). How often do most people think original, insightful thoughts? Don't many if not most of us think and speak in platitudes, at least most of the time? That's why LLMs are so good at mimicking human speech, and autocompleting the second, third, and fourth sentences that follow an initial prompt. It's what we all do much of the time. It's also why LLMs are not capable of producing good literature. Good writers discover and uncover the extraordinary in the ordinary. LLMs reflect and reproduce the ordinariness of ordinary speech.
Julian said,
August 9, 2023 @ 6:23 pm
@paul frank
Original thoughts are probably only a very very small subset of our reasoning powers.
Reasoning is saying things like 'I see black clouds – I'd better bring the washing in.'
Being able to think thoughts like that is what makes me proud to be a human being.