The LLM-detection boom
Joe Marshall, "As AI cheating booms, so does the industry detecting it: ‘We couldn’t keep up with demand’", The Guardian 7/5/2023:
Since its release last November, ChatGPT has shaken the education world. The chatbot and other sophisticated AI tools are reportedly being used everywhere from college essays to high school art projects. A recent survey of 1,000 students at four-year universities by Intelligent.com found that 30% of college students have reported using ChatGPT on written assignments.
This is a problem for schools, educators and students – but a boon for a small but growing cohort of companies in the AI-detection business. Players like Winston AI, Content at Scale and Turnitin are billing for their ability to detect AI-involvement in student work, offering subscription services where teachers can run their students’ work through a web dashboard and receive a probability score that grades how “human” or “AI” the text is.
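For what it's worth, the "probability score" these services advertise is, at least in some published approaches, largely a function of how predictable the text looks to a reference language model. Here is a minimal sketch of that idea, using the open GPT-2 model from Hugging Face's transformers library; it's meant only as an illustration of the general approach, not a reconstruction of any particular company's product, and the threshold is entirely made up:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Perplexity of `text` under GPT-2; lower means more predictable."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing labels=input_ids makes the model return the mean
        # token-level cross-entropy as `loss`.
        loss = model(ids, labels=ids).loss
    return float(torch.exp(loss))

# A crude "detector": flag anything below some perplexity threshold.
# The threshold here is hypothetical; real tools calibrate on labeled data.
sample = "The industrial revolution had many causes and many consequences."
score = perplexity(sample)
print(f"perplexity = {score:.1f}",
      "-> flagged as AI" if score < 20 else "-> looks human")
```

The weakness is visible right in the code: anything that nudges a text's predictability (a paraphrasing pass, a few inserted typos, an unusual prompt) moves the score in one direction or the other, which is part of why the false-positive and false-negative problems discussed below won't go away.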
I've been skeptical of these LLM-detection claims ever since the issue first came up: "Detecting LLM-created essays?", 12/20/2022; "It's impossible to detect LLM-created text", 7/5/2023. And the current situation, in my opinion, is well summarized by Chris Barts' comment on the July 5 post:
There's a simple, desperate logic here:
Schools and other businesses will decide they need to detect AI-generated work.
They will see these AI-based tools claiming to be able to do the job.
Therefore, they will declare that those tools work, and ignore the evidence to the contrary.
"It works because it must work" is a powerful logic.
At any given stage of the LLM-vs.-detector arms race, the detectors will have some success — but significant rates of false positives and false negatives will remain, and there will be many simple tricks that can be used to fool them.
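To put some purely illustrative numbers on that: suppose a detector catches 90% of AI-assisted essays and wrongly flags only 5% of human-written ones, and suppose the 30% usage figure from the survey quoted above is roughly right. A back-of-the-envelope calculation then shows that more than one in ten flagged essays is a false accusation, while a few percent of all essays are AI-assisted but sail through:

```python
# Illustrative base-rate arithmetic for an AI-text detector.
# All numbers are hypothetical, chosen only to show the shape of the problem.

p_ai = 0.30            # fraction of essays actually written with LLM help
sensitivity = 0.90     # detector flags 90% of AI-assisted texts
false_positive = 0.05  # detector flags 5% of genuinely human texts

flagged_ai = p_ai * sensitivity                # AI-assisted texts correctly flagged
flagged_human = (1 - p_ai) * false_positive    # human texts wrongly flagged

# Of all flagged essays, what share are actually human-written?
share_innocent = flagged_human / (flagged_ai + flagged_human)
print(f"{share_innocent:.1%} of flagged essays are false accusations")   # ~11.5%

# And the detector still misses this share of all essays:
print(f"{p_ai * (1 - sensitivity):.1%} are AI-assisted but unflagged")    # 3.0%
```

At the scale of a large lecture course, that is a lot of wrongly accused students per semester, and adjusting the threshold just trades one error rate for the other.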
Also, as I wrote in that post,
This does NOT mean that LLMs (of current types) can reliably create coherent and correct texts. Aside from the well-documented "hallucination" issues, there are other systematic problems due to the fact that such systems try to deal with things like logic and plot using nothing more than word-association patterns on various scales. Of course, it's possible that current explorations in "neuro-symbolic AI" will fix this — for one such approach, see Lionel Wong et al., "From Word Models to World Models: Translating from Natural Language to the Probabilistic Language of Thought", 6/23/2023.
There's some excellent commentary and advice in Ethan Mollick's "The Homework Apocalypse", 7/1/2023. He agrees that "THERE IS NO WAY TO DETECT THE OUTPUT OF GPT-4", and concludes (along with a cute AI-generated illustration) that:
There is light at the end of the AI tunnel for educators, but it will require experiments and adjustment. In the meantime, we need to be realistic about how many things are about to change in the near future, and start to plan now for what we will do in response to the Homework Apocalypse. Fall is coming.
I'll add one more possible response, based on an experience from more than 20 years ago.
In 1999, Wendy Steiner and I taught an experimental undergraduate course on "Human Nature".
Both of us were skeptical of essay-question exams, mostly due to a lack of trust in our own subjective evaluations. So we decided to experiment with an in-person oral midterm. But as I recall, there were about 80 students in the class — so how to do it?
We circulated in advance a list of a dozen or so sample questions, explaining that the actual questions would be similar to those.
And we scheduled 16 (?) one-hour sessions, each involving five students and both instructors sitting around a small seminar table. We went around the table, asking a question of each student in turn, and following up with redirections or additional questions as appropriate. We limited each such interaction to 10 minutes, and before going on to the next student, we asked the group for (optional, brief) comments or additions. We both took notes on the process.
This took the equivalent of two full 8-hour days — spread over a week, as I recall — but it went smoothly, and was successful in the sense that our separately-assigned grades were almost exactly in agreement.
There are obvious problems with grades based on this kind of oral Q&A, just as there are problems with evaluating in-class essays, take-home exams, and term papers. And I've never repeated the experiment. But maybe I should.
Alyssa said,
July 7, 2023 @ 12:47 pm
At the moment, it's not terribly difficult to tell that a text is AI-generated, when it's on a topic for which one has some specialist knowledge. I've seen it happen "in the wild" a few times now with websites or organizations trying to farm out content – it doesn't fool anybody.
I do think that teachers will have to re-think the way they assign and grade essays, because until now it has been pretty easy to get a B grade for the kind of rambling nonsense that AI is so good at. Maybe teachers will have to put more emphasis on students making a coherent point. Or maybe they'll just switch to essays hand-written during class time, like it's already done for high-stakes exams.
Dave said,
July 7, 2023 @ 4:47 pm
I'm hoping that AI making "grammatically correct rambling" too cheap to meter implies we will all put more emphasis on making coherent points.
Thomas Shaw said,
July 8, 2023 @ 5:27 pm
One option is the Butlerian Jihad.
Benjamin E. Orsatti said,
July 10, 2023 @ 6:58 am
Thomas Shaw,
Great, now, whenever I think about AI programs, I'll be thinking about an old eccentric couple on a farm…
astrange said,
July 14, 2023 @ 3:42 am
The concept of "world models" is another example of "it works because it must work" – it's a leftover piece of old psychological theories that the old kind of expert-system AI research tried to create because they assumed people have one and therefore AI would need one.
They never got anywhere, so there's no reason to believe it'll happen now.