date 2023-08-12

It's hard to keep up with the waves of hype and anti-hype in the LLM space these days.

Here's something from a few weeks ago that I missed — Xiaoxuan Wang et al., "SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models", arxiv.org 7/20/2023:



Abstract: Recent advances in large language models (LLMs) have demonstrated notable progress on many mathematical benchmarks. However, most of these benchmarks only feature problems grounded in junior and senior high school subjects, contain only multiple-choice questions, and are confined to a limited scope of elementary arithmetic operations. To address these issues, this paper introduces an expansive benchmark suite SciBench that aims to systematically examine the reasoning capabilities required for complex scientific problem solving. SciBench contains two carefully curated datasets: an open set featuring a range of collegiate-level scientific problems drawn from mathematics, chemistry, and physics textbooks, and a closed set comprising problems from undergraduate-level exams in computer science and mathematics. Based on the two datasets, we conduct an in-depth benchmark study of two representative LLMs with various prompting strategies. The results reveal that current LLMs fall short of delivering satisfactory performance, with an overall score of merely 35.80%. Furthermore, through a detailed user study, we categorize the errors made by LLMs into ten problem-solving abilities. Our analysis indicates that no single prompting strategy significantly outperforms others and some strategies that demonstrate improvements in certain problem-solving skills result in declines in other skills. We envision that SciBench will catalyze further developments in the reasoning abilities of LLMs, thereby ultimately contributing to scientific research and discovery.

That's more or less what I would expect, given experiences such as those described in "LLMs as coders", 6/6/2023, or in the paper by Konstantine Arkoudas featured in "LLMs can't reason?", 8/8/2023.

FWIW, my overall evaluation is that LLMs are very valuable tools, whose full value has yet to be demonstrated. But they're currently very far from achieving "general intelligence"; and in particular, as I wrote in an earlier post, reasoning is "a problem that deserves active investigation rather than a naive confidence that it's already been solved, or soon will be solved".

