date 2023-02-16

For weeks, everyone was talking about how great the Large Language Model (LLM) ChatGPT is, or else showing that it can make serious mistakes of fact or logic. But since the alliance between OpenAI and Microsoft added (a version of) this LLM to (a version of) Bing, people have been encountering weirder issues. As Mark Frauenfelder pointed out a couple of days ago at BoingBoing, "Bing is having bizarre emotional breakdowns and there's a subreddit with examples". The cited subreddit, r/bing, has examples going back to the start of the alliance. And today, Kevin Roose posted a long series of strikingly strange passages from his own interactions with the chatbot, "Bing's A.I. Chat: 'I Want to Be Alive'", NYT 2/16/2023.

One question about these interactions is where the training data came from, since such systems just spin out word sequences that their training estimates to be probable. Someone suggested to me, "It seems like they might have gotten a bunch of conversations from a bad dating app, maybe one that is known for catfishing or something." But OpenAI tells us that

We trained this model using Reinforcement Learning from Human Feedback (RLHF), using the same methods as InstructGPT, but with slight differences in the data collection setup. We trained an initial model using supervised fine-tuning: human AI trainers provided conversations in which they played both sides—the user and an AI assistant. We gave the trainers access to model-written suggestions to help them compose their responses. We mixed this new dialogue dataset with the InstructGPT dataset, which we transformed into a dialogue format.

To create a reward model for reinforcement learning, we needed to collect comparison data, which consisted of two or more model responses ranked by quality. To collect this data, we took conversations that AI trainers had with the chatbot. We randomly selected a model-written message, sampled several alternative completions, and had AI trainers rank them. Using these reward models, we can fine-tune the model using Proximal Policy Optimization. We performed several iterations of this process.

So an army of low-paid "AI trainers" created training conversations, and also evaluated such conversations comparatively — which apparently generated enough sad stuff to fuel those "bizarre emotional breakdowns".
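For readers who want to see what "comparison data ... ranked by quality" might look like mechanically, here is a minimal sketch in Python. It is not OpenAI's code, and the quoted passage does not say which loss was used; the pairwise log-sigmoid loss below is an assumption, borrowed from the approach described for InstructGPT. The function names (`ranking_to_pairs`, `pairwise_loss`, `ranking_loss`) and the toy scores are hypothetical, made up for illustration.

```python
import math
from itertools import combinations


def ranking_to_pairs(ranked):
    """Turn a best-to-worst ranking into (preferred, rejected) pairs."""
    # combinations() keeps input order, so the first item of each pair
    # is always the one the trainer ranked higher.
    return list(combinations(ranked, 2))


def pairwise_loss(score_preferred, score_rejected):
    """Return -log(sigmoid(score_preferred - score_rejected)).

    Small when the reward model scores the preferred response higher,
    large when it scores the rejected response higher.
    """
    diff = score_preferred - score_rejected
    # Numerically stable form of log(1 + exp(-diff)).
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


def ranking_loss(ranked, scores):
    """Average pairwise loss over every pair implied by one ranking."""
    pairs = ranking_to_pairs(ranked)
    return sum(pairwise_loss(scores[a], scores[b]) for a, b in pairs) / len(pairs)


# Toy example: a trainer ranked three sampled completions a > b > c.
ranked = ["a", "b", "c"]
agreeing = {"a": 2.0, "b": 1.0, "c": 0.0}     # reward model agrees with the trainer
disagreeing = {"a": 0.0, "b": 1.0, "c": 2.0}  # reward model has it backwards

print(ranking_loss(ranked, agreeing))
print(ranking_loss(ranked, disagreeing))
```

The loss is lower when the hypothetical reward scores agree with the trainer's ranking. The point for the present discussion is that the trainers' preferences, whatever they were, are what the reward model is pushed to reproduce.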

A second question is what this all means, in practical terms. Most of us (anyhow me) have seen this stuff as somewhere between pathetic and ridiculous, but C.M. pointed out to me that there might be really bad effects on naive and psychologically vulnerable people.

