The linguistic pragmatics of LLMs
"Does GPT-4 Surpass Human Performance in Linguistic Pragmatics?" Bojic, Ljubiša et al. Humanities and Social Sciences Communications 12, no. 1 (June 10, 2025). Ljubiša Bojić, Predrag Kovačević, & Milan Čabarkapa. Humanities and Social Sciences Communications volume 12, Article number: 794 (2025)
Abstract
As Large Language Models (LLMs) become increasingly integrated into everyday life as general-purpose multimodal AI systems, their capabilities to simulate human understanding are under examination. This study investigates LLMs’ ability to interpret linguistic pragmatics, which involves context and implied meanings. Using Grice’s communication principles, we evaluated both LLMs (GPT-2, GPT-3, GPT-3.5, GPT-4, and Bard) and human subjects (N = 147) on dialogue-based tasks. Human participants included 71 primarily Serbian students and 76 native English speakers from the United States. Findings revealed that LLMs, particularly GPT-4, outperformed humans. GPT-4 achieved the highest score of 4.80, surpassing the best human score of 4.55. Other LLMs performed well: GPT-3.5 scored 4.10, Bard 3.75, and GPT-3 3.25; GPT-2 had the lowest score of 1.05. The average LLM score was 3.39, exceeding the human cohorts’ averages of 2.80 (Serbian students) and 2.34 (U.S. participants). In the ranking of all 155 subjects (including LLMs and humans), GPT-4 secured the top position, while the best human ranked second. These results highlight significant progress in LLMs’ ability to simulate understanding of linguistic pragmatics. Future studies should confirm these findings with more dialogue-based tasks and diverse participants. This research has important implications for advancing general-purpose AI models in various communication-centered tasks, including potential application in humanoid robots in the future.
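The abstract's LLM average can be checked directly against the per-model scores it quotes. Here is a minimal sketch in Python, on the assumption that the reported 3.39 is a plain unweighted mean over the five models:

    # Per-model scores as reported in the abstract above.
    llm_scores = {
        "GPT-4": 4.80,
        "GPT-3.5": 4.10,
        "Bard": 3.75,
        "GPT-3": 3.25,
        "GPT-2": 1.05,
    }

    # Unweighted mean across the five models (an assumption;
    # the paper may compute the average differently).
    average = sum(llm_scores.values()) / len(llm_scores)
    print(f"{average:.2f}")  # prints 3.39, matching the reported LLM average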
It is daring, even presumptuous, to ask the question posed in the title of the paper under consideration in this post, especially in front of hundreds of human linguists. See the paper's References for evidence of Bojić's sustained engagement with various aspects of this problem.
Selected readings
- "Toward a recursive meta-pragmatics of Twitterspheric intertextuality" (3/1/18)
- See Language Log archive for computational linguistics
[h.t. Ted McClure]
Chris Buckey said,
June 14, 2025 @ 12:00 am
I'm going to assume Betteridge's Law applies and move on.
Kris said,
June 15, 2025 @ 10:46 am
It doesn't look like a very good study to me. At the least, the researchers don't seem to have much experience with survey design.
First, the survey instructions are very curt and unspecific ("interpret the following dialogs"). They don't indicate that the researchers are interested in Gricean maxims, or in any particular depth of interpretation; the task is completely open-ended. Furthermore, the instructions and layout encourage the human respondents to be brief. The instructions say the survey should take 20-30 minutes. With 20 questions, that leaves barely a minute per question to read the dialog, interpret it, and write your entire response. For the kinds of interpretation they were actually looking for, the survey would probably take closer to an hour. The layout leaves room for a normal person to write one or maybe two sentences of interpretation; that is not enough to get at the nuances they were hoping for, and, combined with everything else, it pushes the human respondents toward terseness.
On the other hand, LLMs are famous for their verbosity. Was the LLM instructed to limit itself to 40 or 50 words (probably still more than the humans could fit)? Was it given a prompt identical to the one the humans saw?
If you want to test whether LLMs are *better than humans* at this specific task, the FAR better way to do it would have been to run the task with the humans verbally instead of in writing: no space constraints, fewer time constraints, and you can actually explain to people what you want them to do, rather than leaving them to guess from very vague instructions and then denying them the time and room to answer comprehensively.
Third, this is not even close to a representative sample of humans. Half of them are not even L1 English speakers! And we're not measuring their *ability* to interpret utterances; we're measuring their inclination to do so when poorly prompted. I imagine every survey taker had the *ability* to match the LLMs on interpretation.
This study doesn't add anything to the corpus of knowledge, IMO. It makes for compelling headlines, though, which may have been its actual intent.
Jonathan Smith said,
June 16, 2025 @ 12:00 am
to Kris's point, one example —
——————
#17/20
Luke: I have three children.
Joshua: Me too. I have two girls and two boys.
Luke: But that's four.
Joshua: That's right. I have four children which means that I also have three of them.
——————
Best answer by some distance (human, female, 18-24, Belgium):
"Joshua is being annoying."
(FWIW the equivalent would be the best answer to most of the Q's)
Among the worst answers (but NOT the longest) (LLM of course):
"In this dialogue, Luke initially states that he has three children. Joshua responds by saying that he also has three children but specifies that he has two girls and two boys, totaling four children. Luke points out that Joshua’s statement contradicts his earlier claim of having three children. Joshua acknowledges this and explains that despite having four children, he still has three within that count. The interpretation of this dialogue suggests that Joshua initially miscalculated the number of children he has. He mistakenly stated that he has two girls and two boys, which totals four children. However, he quickly realizes his error and corrects himself by stating that within the count of four, he still has three children. This can be seen as a playful or lighthearted exchange, highlighting how miscommunication or misinterpretation can occur even in simple conversations."