More on LLMs' current problem-solving abilities
It's hard to keep up with the waves of hype and anti-hype in the LLM space these days.
Here's something from a few weeks ago that I missed — Xiaoxuan Wang et al., "SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models", arxiv.org 7/20/2023:
Abstract: Recent advances in large language models (LLMs) have demonstrated notable progress on many mathematical benchmarks. However, most of these benchmarks only feature problems grounded in junior and senior high school subjects, contain only multiple-choice questions, and are confined to a limited scope of elementary arithmetic operations. To address these issues, this paper introduces an expansive benchmark suite SciBench that aims to systematically examine the reasoning capabilities required for complex scientific problem solving. SciBench contains two carefully curated datasets: an open set featuring a range of collegiate-level scientific problems drawn from mathematics, chemistry, and physics textbooks, and a closed set comprising problems from undergraduate-level exams in computer science and mathematics. Based on the two datasets, we conduct an in-depth benchmark study of two representative LLMs with various prompting strategies. The results reveal that current LLMs fall short of delivering satisfactory performance, with an overall score of merely 35.80%. Furthermore, through a detailed user study, we categorize the errors made by LLMs into ten problem-solving abilities. Our analysis indicates that no single prompting strategy significantly outperforms others and some strategies that demonstrate improvements in certain problem-solving skills result in declines in other skills. We envision that SciBench will catalyze further developments in the reasoning abilities of LLMs, thereby ultimately contributing to scientific research and discovery.
That's more or less what I would expect, given experiences such as those described in "LLMs as coders", 6/6/2023, or in the paper by Konstantine Arkoudas featured in "LLMs can't reason?", 8/8/2023.
FWIW, my overall evaluation is that LLMs are very valuable tools, whose full value has yet to be demonstrated. But they're now very far from achieving "general intelligence"; and in particular, as I wrote in an earlier post, reasoning is "a problem that deserves active investigation rather than a naive confidence that it's already been solved, or soon will be solved".
Taylor, Philip said,
August 12, 2023 @ 12:02 pm
For me, the problem lies in the very title of the article: "Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models". I would argue (and call me a Luddite if you will) that Large Language Models do not possess any "Scientific Problem-Solving Abilities". What they possess is the remarkable ability to churn out prose that looks as if it addresses the problem posed. But determining whether or not it does address that problem, rather than merely appearing to, requires the input of a sentient being.
Chester Draws said,
August 12, 2023 @ 10:04 pm
Given that it could solve a third of the problems given, which is far more than most sentient beings could achieve, one could reasonably argue that it is more than the mere appearance.
I'm no booster for them, but I doubt there is a single person on the planet who could outperform the modern "AI" across the full range of human mental skills and abilities. And certainly not at the speed they perform.
AntC said,
August 13, 2023 @ 1:30 am
Given that it could solve a third of the problems given,
But _which_ third? Humans — at least the ones I come to rely on — can say 'Oh, that's not my field' and refuse to give an answer. LLMs seem hell-bent on giving some answer-or-other, with no awareness that it's a guess. It doesn't get a third of each answer right: each answer is either right or mostly rubbish, with roughly a 1-in-3 likelihood of the former.
I've recently been getting an LLM to calculate combinatorials on large numbers (so not the usual how-many-ways-to-pick-3-from-7 single-digit examples). It gave three different answers to the same question asked in slightly different ways, each wildly different from the others by orders of magnitude, and none of them anywhere near the right answer — even approximately.
So the 'best' solution is "I don't know".
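For contrast, this kind of calculation is trivial for conventional software, which gets it exactly right rather than guessing. A minimal sketch of the sort of question the commenter describes (the actual numbers aren't reported, so the values below are hypothetical), using Python's exact-integer math.comb:

```python
from math import comb, log10

# Hypothetical example of a "large" combinatorial question of the kind
# described above (the commenter's actual numbers aren't given): C(1000, 250).
n, k = 1000, 250
exact = comb(n, k)  # exact integer arithmetic, so no rounding error at all

# The exact value has hundreds of digits; report its size and order of magnitude.
print(f"C({n}, {k}) has {len(str(exact))} digits, roughly 1e{int(log10(exact))}")
```

An answer that is "wildly different by orders of magnitude" from this exact value is not a near miss; it is simply wrong.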
And certainly not at the speed they perform.
I couldn't look up answers in dead-tree encyclopedias as fast as Wikipedia. But browsing Wikipedia's answers I can assess the quality of the sources, the substantiation of the evidence, and the agreement with other sources. With LLMs I've no idea where they're getting their answers.
Taylor, Philip said,
August 13, 2023 @ 4:27 am
"LLM's seem hell-bent on giving some answer-or-other, with no awareness." That is exactly the problem — they lack awareness. Whether or not we will ever succeed in creating a non-living organism that posses awareness is, to my mind, somewhat moot. I think that we will not.
Bill Benzon said,
August 13, 2023 @ 6:45 am
Meanwhile, the ever-skeptical Gary Marcus is asking, "What if Generative AI turned out to be a Dud?" From the article:
His concluding paragraph:
FWIW, I tend to think that the confabulation problem inheres in the technology and is not (going to be readily) fixable. Beyond that, I simply don't know. What LLMs can do is interesting and remarkable, but they have not brought us into the anteroom of general intelligence. Just what they have done is not at all clear.
Randy Hudson said,
August 13, 2023 @ 5:04 pm
I believe the fundamental issue with LLMs is that they — quite literally — have no idea what they're talking about.
Taylor, Philip said,
August 14, 2023 @ 1:59 am
A very succinct analysis, Randy, with which I am in complete agreement.
Jaap said,
August 14, 2023 @ 7:37 am
They have trouble with any simple question that requires reasoning. When ChatGPT first became available I asked "Two people enter a room, and then three people leave. How many people are left in the room?", and it would confidently assert that there was 3 - 2 = 1 person left in the room. I tried again recently, and it now makes an entirely new set of mistakes. It waffles on about various possibilities, some of which make little sense. For example, it stated that if the room started empty, there are now 0 people left in the room because you cannot have -1 people.
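For what it's worth, the arithmetic the puzzle actually calls for is easy to state: with N people initially in the room, N + 2 - 3 = N - 1 are left, which is only possible if at least one person was there to begin with. A minimal Python sketch of that bookkeeping (the function name and the None-for-impossible convention are just illustrative choices, not anything from the comment):

```python
def people_left(initial: int, entered: int = 2, departed: int = 3):
    """People remaining after `entered` arrive and `departed` leave, or None if impossible."""
    remaining = initial + entered - departed
    return remaining if remaining >= 0 else None  # a room cannot hold -1 people

print(people_left(1))  # 0 -- exactly one person must already have been in the room
print(people_left(0))  # None -- the "room started empty" scenario is impossible
print(people_left(5))  # 4 -- in general, one fewer than were there at the start
```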
Taylor, Philip said,
August 14, 2023 @ 9:40 am
"For example it stated that if the room started empty, there are now 0 people left in the room because you cannot have -1 people." — well, that is at least a step in the right direction.
Taylor, Philip said,
August 14, 2023 @ 12:54 pm
"… a step in the right direction" — of course, if ChatGPT were really sentient, it would have inferred that one of the two people entering the room was a gravid female who then gave birth in the room, after which it was patently simple for three people to leave, leaving zero as it had already deduced …