Alexa down, ChatGPT up?


Two recent developments seem to point in opposite directions. On one hand, there are R&D cutbacks as voice assistants are seen as failures. On the other hand, there's widespread enthusiasm for the impressive capabilities of ChatGPT, including suggestions that it will take over internet search (Ben Cost, "Rise of the bots: ‘Scary’ AI ChatGPT could eliminate Google within 2 years", NY Post 12/6/2022), destroy writing education (Stephen Marche, "The College Essay is Dead", The Atlantic 12/6/2022), and more.

Writing shortly before the release of ChatGPT, Gary Marcus offered a sensible (and plausible) explanation for this apparent contradiction (Gary Marcus, "How come smart assistants have virtually no ability to converse, despite all the spectacular progress with large language models?", 11/22/2022). He rejects the silly idea that the engineers responsible for practical chatbots are ignorant or incapable, and instead offers these five possible reasons why ChatGPT-style large language models (LLMs) are not yet ready to save Alexa and the rest of them:

Here are my five best guesses; I suspect all five contributed:

    1. LLMs are inherently unreliable. If Alexa were to make frequent errors, people would stop using it. Amazon would rather you trust Alexa for a few things like timers and music than sell you a system with much broader scope that you stop trusting and stop using.
    2. LLMs are unruly beasts; nobody knows how to make them refrain 100% of the time from insulting users, giving bad advice, or just plain making stuff up. (Galactica was an epic failure in this regard.)
    3. Amazon doesn't want to get sued. Any one of these scenarios of LLMs gone awry (bad advice, insults, lies, etc.) could hurt the Amazon brand, open up litigation, etc. It's just not worth the risk.
    4. Alexa has to do stuff in the world, like turning on lights, playing music, opening shades, etc; if Alexa could converse freely, user expectations would go through the roof, and mostly be unmeetable. (You could tell Alexa to wash the dishes, but until their robot division really picks up speed, that ain’t happening.)
    5. LLMs spit out words, not actions (and not API calls either). When an LLM produces a sentence, you can't directly use that sentence to control stuff, unless you build another system to parse the sentences into actions. Nobody knows how to do this reliably, either.

Bottom line: From the outset Large Language Models like GPT-3 have been great at generating surrealist prose, and they can beat a lot of benchmarks, but they are not (and may never be) great tech for reliably inferring user intent from what users say.

Turning LLMs into a product that controls your home and talks to you in a way that would be reliable enough to use at scale in millions of homes is still a long, long way away.
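
Marcus's fifth point can be made concrete with a toy example. The sketch below is purely illustrative: the `Action` type, the device names, and the regular-expression rules are all hypothetical, and nothing here reflects how Alexa or any real assistant works. It shows the kind of extra layer that would be needed to turn a generated sentence into a device command, and how easily a hand-built layer misses a paraphrase.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Action:
    """A hypothetical device command: what to do, and to which device."""
    verb: str
    device: str


# Hand-written patterns, assumed for illustration only.
# Each pattern captures the device name and maps one phrasing to a verb.
RULES = [
    (re.compile(r"\bturn on the (\w+)", re.IGNORECASE), "on"),
    (re.compile(r"\bturn off the (\w+)", re.IGNORECASE), "off"),
    (re.compile(r"\bopen the (\w+)", re.IGNORECASE), "open"),
]


def parse_action(sentence: str) -> Optional[Action]:
    """Return the Action for the first matching pattern, or None if none match."""
    for pattern, verb in RULES:
        match = pattern.search(sentence)
        if match:
            return Action(verb=verb, device=match.group(1).lower())
    return None


if __name__ == "__main__":
    # A sentence that happens to use one of the expected phrasings.
    print(parse_action("Sure, I will turn on the lights for you."))
    # A reasonable paraphrase that the rules do not cover.
    print(parse_action("Let me brighten the room a little."))
```

The first sentence maps to an action; the second, which a person would read as the same intent, maps to nothing. Covering free-form output reliably is the part that Marcus says nobody knows how to do.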

A sample of the voluminous coverage of LLM successes and failures (updated):

Gary Marcus, "How come GPT can seem so brilliant one minute and so breathtakingly dumb the next?", 12/1/2022
Mike Pearl, "The ChatGPT chatbot from OpenAI is amazing, creative, and totally wrong: Need ideas? Great! Need facts? Stay away!", Mashable 12/3/2022
Ian Bogost, "ChatGPT Is Dumber Than You Think: Treat it like a toy, not a tool", The Atlantic 12/7/2022
Kevin Roose, "The Brilliance and Weirdness of ChatGPT", NYT 12/5/2022
Janus Rose, "Stack Overflow Bans ChatGPT For Constantly Giving Wrong Answers", Vice 12/5/2022 (and ChatGPT responds…)
Tweet: "ChatGPT confidently gives the right kind of reasoning to solve this math problem, but whiffs on the algebra in the middle and gets the answer wrong"
Steven Piantadosi on ChatGPT bias (Twitter 12/5/2022)
Pritam Bordoloi, "ChatGPT is Now Available on WhatsApp", Analytics India 12/6/2022
Clive Thompson, "On Bullshit, and AI-Generated Prose", Medium 12/6/2022
Daniel Herman, "ChatGPT will end High School English", The Atlantic 12/9/2022
Daniel Lametti, "A.I. Could Be Great for College Essays", Slate 12/7/2022
Kenan Malik, "ChatGPT can tell jokes, even write articles. But only humans can detect its fluent bullshit", The Guardian 12/11/2022
Beth McMurtrie, "AI and the Future of Undergraduate Writing: Teaching experts are concerned, but not for the reasons you think", The Chronicle of Higher Education 12/13/2022
James Vincent, "Google won't launch ChatGPT rival because of 'reputational risk'", The Verge 12/14/2022
Lucas Ropek, "ChatGPT Wrote a Terrible Gizmodo Article", Gizmodo 12/14/2022

A specific example of a factual problem, taken originally from the Mashable article cited above (where the answer is a bit shorter):

Let's say I'm painting a battle scene from the napoleonic wars. What color of paint should I use for the uniforms of the Royal Marines?

The uniforms of the Royal Marines would have typically been dark blue in color, so using a dark blue paint would be a good choice for painting them. It's important to keep in mind, however, that the exact shade of blue may vary depending on the specific time period and unit you are depicting in your battle scene. For a more accurate representation, you may want to do some research on the specific uniforms worn by the Royal Marines during the Napoleonic Wars.

Some relevant past LLOG posts: "Cat names from GPT-2", 6/3/2019; "Danger: Demo!", 10/9/2019; "Conversations with GPT-3", 6/25/2022

And ChatGPT does much better than GPT-3 in the cases where I've checked:

Do ducks lay eggs?

Yes, ducks do lay eggs. Like other birds, ducks have a reproductive system that allows them to lay eggs. Female ducks, called hens, have ovaries that produce eggs, which are fertilized by a male duck's sperm and then incubated until they are ready to hatch. Ducks typically lay eggs on a regular basis and can lay several eggs each week. The number of eggs a duck can lay depends on the breed and the individual hen. Some ducks may lay as few as 5 eggs per year, while others can lay more than 200 eggs per year.

What about lions?

Lions are mammals, not birds, so they do not lay eggs. Unlike birds, which lay eggs, female lions give birth to live young after a gestation period of about 110 days. When a lioness gives birth, she typically gives birth to a litter of 2 to 4 cubs. The cubs are born blind and are completely dependent on their mother for the first few weeks of their lives. The mother lion will nurse and care for her cubs until they are old enough to start hunting and feeding on their own.

Do ticks carry Lyme Disease?

Yes, ticks can carry Lyme Disease. Lyme Disease is a bacterial infection that is transmitted to humans and animals by the bite of an infected tick. The bacterium that causes Lyme Disease, Borrelia burgdorferi, is carried by certain types of ticks and is passed on to humans and animals when the tick bites them. The most common type of tick that carries Lyme Disease is the deer tick (Ixodes scapularis), also known as the black-legged tick. These ticks are found in wooded and grassy areas, and they can attach to humans and animals as they brush against vegetation. When a tick bites an infected animal, it can pick up the bacteria and then transmit it to another host when it bites again. If you have been bitten by a tick, it is important to watch for symptoms of Lyme Disease and to see a doctor if you develop any of the characteristic signs of the infection.

Do all ticks carry Lyme Disease?

No, not all ticks carry Lyme Disease. Only certain types of ticks are known to carry the bacterium that causes Lyme Disease, Borrelia burgdorferi. The most common type of tick that carries this bacterium is the deer tick (Ixodes scapularis), also known as the black-legged tick. These ticks are found in wooded and grassy areas in the northeastern, mid-Atlantic, and north-central United States. In other parts of the world, different species of ticks may carry the bacterium that causes Lyme Disease. However, not all ticks of these species will be infected with the bacterium, and not all ticks that carry the bacterium will transmit the infection to humans or animals.


  1. Taylor, Philip said,

    December 8, 2022 @ 9:51 am

    "When an LLM produces a sentence, you can't directly use that sentence to control stuff, unless you build another system to parse the sentences into actions". So when an LLM produces a sentence, it has no "awareness" of the semantics of the sentence that it produces? If so, that surprises me: I would have expected an LLM to have an internal model of the semantics of the sentence it is about to produce, which could, if required, be transferred, through a suitable API, to a robot capable of performing (a subset of) the sentences that the LLM is capable of producing.

  2. Mark Liberman said,

    December 8, 2022 @ 10:01 am

    @Philip Taylor: "I would have expected an LLM to have an internal model of the semantics of the sentence it is about to produce, which could, if required, be transferred, through a suitable API, to a robot capable of performing (a subset of) the sentences that the LLM is capable of producing."

    You would have been wrong. These systems are very large networks of matrix (or tensor) products, with interspersed point non-linearities of various kinds. No explicit "semantic" model is involved; it's all about patterns of word (or letter or sound) co-occurrences.
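
    To make "matrix products with interspersed point non-linearities" concrete, here is a minimal sketch in plain Python. The weights and layer sizes are made up for illustration; real models have billions of learned weights plus further machinery (attention, normalization) that is not shown here.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(weights: Matrix, x: Vector) -> Vector:
    """Matrix-vector product: one row of weights per output value."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def relu(x: Vector) -> Vector:
    """Pointwise non-linearity: applied to each number independently."""
    return [max(0.0, v) for v in x]


def softmax(x: Vector) -> Vector:
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(x)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def forward(x: Vector, w1: Matrix, w2: Matrix) -> Vector:
    """Two matrix products with a non-linearity in between."""
    return softmax(matvec(w2, relu(matvec(w1, x))))


if __name__ == "__main__":
    # Made-up weights: 3 inputs -> 2 hidden units -> 3 output scores.
    w1 = [[1.0, -1.0, 0.5], [0.0, 2.0, -1.0]]
    w2 = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
    print(forward([1.0, 0.0, 0.0], w1, w2))
```

    Nothing in the computation refers to what any input or output stands for: numbers go in, and a probability for each candidate next item comes out.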

    If you're interested in learning about how such systems work, try reading some of these (or really, anything about the last 15-20 years of language model development):

    Andrej Karpathy, "The Unreasonable Effectiveness of Recurrent Neural Networks"
    Alec Radford, "Improving Language Understanding with Unsupervised Learning"
    Jacob Devlin et al., "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding"
    Ashley Pilipiszyn, "Better Language Models and Their Implications"

  3. Victor Mair said,

    December 8, 2022 @ 11:25 am

    What does GPT in this context stand for?

    (computing) Initialism of GUID Partition Table.
    (machine learning) Initialism of generative pretrained transformer.
    (physics) Initialism of generalized probabilistic theory.


    A baker's dozen of possibilities are listed here.

  4. Taylor, Philip said,

    December 8, 2022 @ 11:34 am

    Given Mark's reference to Jacob Devlin et al., BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, I would hazard a guess that in this context it means "generative pretrained transformer".

  5. Mark Liberman said,

    December 8, 2022 @ 11:54 am

    @Victor Mair:

    You could ask Wikipedia. Or ChatGPT itself:

    What does GPT stand for?

    GPT stands for Generative Pretrained Transformer. It is a type of large language model that has been trained on a vast amount of text data to generate human-like text. The "generative" part of the name refers to its ability to generate new text based on the input it receives, while the "pretrained" part indicates that it has been trained on a large amount of data in advance and can be fine-tuned for specific tasks. The "transformer" part of the name refers to the type of neural network architecture that GPT uses.

    A random 3-letter initialism is likely to have many possible references.

  6. Victor Mair said,

    December 8, 2022 @ 1:01 pm

    @Mark Liberman

    Your Wikipedia link just leads one back to this post, so no help.

    I did try on my own to find out from the actual Wikipedia article on ChatGPT what GPT in the name ChatGPT means. No help. It doesn't tell us that GPT means "Generative Pretrained Transformer". It doesn't even mention "Generative Pretrained Transformer".

    The ChatGPT link, which I also checked before writing my comment, requires an OpenAI account to access. Not all LL readers have an OpenAI account.

    One of the purposes of the LLog comments section is to clarify issues that are not discussed in the o.p.

  7. Bill Benzon said,

    December 8, 2022 @ 3:12 pm

    I've spent a LOT of time playing with ChatGPT in the last few days. It's more fun than a barrel of Monkeys. But, no, there's no semantics. Try this thought experiment.

    Let’s treat a text as a string of colored beads. We can assign each word the value of some color, any color, as long as each word TYPE is assigned a different value. Then we take each TOKEN in a text and replace it with a pixel having the color corresponding to its respective type. Now we’ve transformed a text into a string of pixels – color beads on a string. We do that for each text in a corpus and then model the corpus using the methods used in creating large language models.

    To prompt the model we have to feed it a string of pixels. That’s easily done. We simply generate a natural language prompt, translate that prompt into pixel form and present those pixels to the model. The model will then extend the string in the normal way. We could even set up a pair of models so that one prompts the other with strings of pixels.

    These pixel strings would be unintelligible to humans, nor are they likely to be very interesting as images. They’d just be a linear jumble of color. But these color jumbles are intelligible as texts if the proper word tokens are substituted for the color tokens.
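
    A minimal sketch of the bead encoding in Python. The function names and the way colors are assigned are assumptions made for illustration; the thought experiment only requires that each word type get its own color.

```python
from typing import Dict, List, Tuple

Color = Tuple[int, int, int]  # an RGB triple standing in for one bead


def build_palette(corpus: List[str]) -> Dict[str, Color]:
    """Assign each word TYPE a distinct color, in order of first appearance."""
    palette: Dict[str, Color] = {}
    for text in corpus:
        for word in text.split():
            if word not in palette:
                n = len(palette)
                # Spread the running index over three 8-bit channels.
                palette[word] = (n % 256, (n // 256) % 256, (n // 65536) % 256)
    return palette


def to_beads(text: str, palette: Dict[str, Color]) -> List[Color]:
    """Replace each word TOKEN with the color of its type."""
    return [palette[word] for word in text.split()]


def to_text(beads: List[Color], palette: Dict[str, Color]) -> str:
    """Substitute the word back in for each color."""
    inverse = {color: word for word, color in palette.items()}
    return " ".join(inverse[color] for color in beads)


if __name__ == "__main__":
    corpus = ["the cat sat on the mat", "the dog sat"]
    palette = build_palette(corpus)
    beads = to_beads(corpus[0], palette)
    print(beads)
    print(to_text(beads, palette))
```

    The round trip loses nothing, so a model trained on the bead strings sees exactly the same sequential patterns as one trained on the word strings.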

    You might object that this thought experiment is completely artificial. Well, yeah, it is. I created it for a specific purpose, to get around the exceedingly compelling illusion that mere word forms are words, in all their fullness of meaning and syntactic potential. For many purposes that distinction is of little value. But when thinking in a technical or quasi-technical way about language in computers, it is essential. I see widespread confusion on the issue, even from experts who should know better. And they do, in their technical work. But when they start thinking about and talking about that work in a non-technical way, that distinction disappears.

  8. Michael Watts said,

    December 8, 2022 @ 3:18 pm

    My thinking on the current wave of "language AI", as proponents prefer to call it, has basically been informed by the discussion in The Language Instinct of Broca's vs. Wernicke's aphasia.

    As presented there, Wernicke's aphasics are perfectly capable of producing reams of grammatical speech, but they have no understanding of what their (or others') speech means and their speech sometimes betrays this lack of understanding. By contrast, Broca's aphasics have normal mental function, except that they are unable to produce grammatical speech. They may be able, after struggling, to produce some isolated words that bear high semantic content.

    What this means is that if you ask a (stylized) Wernicke's aphasic what a lion is, they will give you a fluent, accurate description: it is a large cat, distinctive for its mane; it lives in Africa; it forms social "prides"; etc. If you ask a Broca's aphasic what a lion is, they will be unable to tell you, in the same way that they are unable to convey any information verbally.

    But if you take the Broca's aphasic to the zoo, and ask them to show you the lion, they will succeed at that task. The Wernicke's aphasic will fail; the only thing they can do with a lion is describe one.

    My analysis of enthusiasm over "language AI" is that people have been deceived by the method they're using to perform capability assessments: they provide verbal input to the model and they get verbal responses, so they assume that everything is in order. The assistants are failing because they aren't supposed to produce verbal responses; they're supposed to produce actions!

    Mostly, I guess, I'm agreeing with Gary Marcus's point #5. (Or, from a certain perspective, point #4 — I guess you could read me as saying that the chatbots already have the ability to converse freely, and this has fooled people into believing that they have capabilities which they don't have. But I see myself as being more in the space of point #5. I think chatbots are operating in a model where meaning is irrelevant to them, and if it sounds good, they say it. And this leads to "irreconcilable differences" with assistants, who are intrinsically concerned with the meaning of a command.)

  9. Jerry Packard said,

    December 8, 2022 @ 4:05 pm

    When I read Bill Benzon’s thought experiment, the first thing I thought of, like Michael, was Wernicke’s aphasic speech, in the sense of proper type but random tokening. It seems to be quite an apt metaphor.

  10. Jonathan Smith said,

    December 8, 2022 @ 4:16 pm

    Developers need to hype up and dial back in response to circumstances, like keeping your pot on simmer. So begin with terms like "LM", "LLM", which of course suggest to normal people some kind of uh model of [the human] language [capacity], followed by the need to provide endless caveats in effect (if not always directly) acknowledging that these tools just assemble word strings. "Chat" is the same kind of nonsense; what kind of chat companion answers "do ducks lay eggs?" like the above? Or more to the point, real "chat" is indeed functional and relies on actual as opposed to fake meaning. These particular tools should be called say textbuilder or textsimulator, but that would be boring I guess.

  11. Bill Benzon said,

    December 8, 2022 @ 5:03 pm

    On GPT-3 vs. ChatGPT, I asked GPT-3 to explain a Jerry Seinfeld bit. It was an interesting interaction, but it really didn't get the point. ChatGPT got the point immediately. Here's a post comparing the two.

    I've just finished a (longish) post: Of pumpkins, the Falcon Heavy, and Groucho Marx: High level discourse structure in ChatGPT. Well of course there's high level discourse structure in ChatGPT's output. But it's nice to have an explicit analysis of some easy cases. What I find particularly interesting is its ability to take an abstract theory, such as Rene Girard's accounts of mimetic desire and sacrifice, and apply them to a particular case, such as Steven Spielberg's Jaws: Conversing with ChatGPT about Jaws, Mimetic Desire, and Sacrifice.

  12. Taylor, Philip said,

    December 8, 2022 @ 6:12 pm

    Following Mark's suggestion, I endeavoured to create an OpenAI account in order to be able to interact with ChatGPT. All went well, and I opened the e-mail sent by OpenAI inviting me to "Verify [my] e-mail address". In the body of that e-mail was a white-on-green button inviting me to "Verify email address" (something I thought I had just done), only to be invited to "Verify [my] 'phone number". Somewhat puzzled as to why I should need to verify my 'phone number in order to be able to verify my e-mail address, I nonetheless provided my telephone number in standard international format (the system had already provided the leading "+44"), only to be told "SMS Verification is not supported by landline phones". Why it claims that I have no idea, since BT/Plusnet kindly provide an SMS-to-speech converter free of charge, but at that point I was forced to give up. Rather sad really, as I had rather looked forward to the idea of interacting with ChatGPT.

  13. Victor Mair said,

    December 8, 2022 @ 6:15 pm

    "[Shobita] Parthasarathy discusses implications of Large Language Models", University of Michigan, Ford School (11/7/22)


    Shobita Parthasarathy

    Professor of Public Policy; Director, Science, Technology, and Public Policy program

    Parthasarathy studies the governance of emerging science and technology and the politics of evidence and expertise in policy in comparative and international perspective. Her current research focuses on equity in innovation and innovation policy. She co-hosts The Received Wisdom podcast.



    Large Language Models (LLMs) are artificial intelligence tools that can read, summarize and translate texts and predict future words in a sentence, letting them generate sentences similar to how humans talk and write. Shobita Parthasarathy, professor of public policy and director of the Science, Technology, and Public Policy Program, recently released a report about how LLMs could exacerbate existing inequalities.

    "Big companies are all doing it because they assume that there is a very large lucrative market out there. History is often full of racism, sexism, colonialism and various forms of injustice. So the technology can actually reinforce and may even exacerbate those issues," Parthasarathy told Asian Scientist. "They’re all privately driven and privately tested, and companies get to decide what they think a good large language model is. We really need broader public scrutiny for large language model regulation because they are likely to have enormous societal impact."



    [VHM: The following is a starkly abridged version of the original, lengthy article.]

    Historical analogies for large language models

    Lessons from ice, chess, tractors, railroads, swords, sword fighting, mass production, pre-fab homes, photography, segways, scissor doors, pornography, and calculators

    How will large language models (LLMs) change the world?

    No one knows. With such uncertainty, a good exercise is to look for historical analogies—to think about other technologies and ask what would happen if LLMs played out the same way.

    I like to keep things concrete, so I’ll discuss the impact of LLMs on writing. But most of this would also apply to the impact of LLMs on other fields, as well as other AI technologies like AI art/music/video/code.

    [VHM: Here follow her historical analogies of the impact of the above 13 human innovations.]


    This exercise made me even less sure about what’s going to happen. But it helped me clarify the reasons for uncertainty. There is of course the obvious question of how good LLMs will get, and how fast. But to predict the impact of LLMs we also need to understand:

    Will LLMs act more as competitors or complements to human writing?

    How will people react to LLMs? Maybe LLMs will write amazing novels and people will love them. Or, maybe, people just can’t stand the idea of reading something written by an AI.

    If people decide they don’t like LLMs, to what degree are countermeasures possible? Can we build machine learning models to detect LLM-generated text? Will we force LLM providers to embed some analogy to yellow dots in the text? Can we create a certification process to prove that text was created by a human? (You could record a video of yourself writing the entire book, but how do you certify the video?)

    Beyond all that, I wonder to what degree these analogies are useful. One big difference between writing and these other domains is that once writing is created, it can be copied at near-zero cost. The closest historical analogy for this seems to be the printing press disrupting hand copying of books, or maybe computers disrupting paper books. But it’s also possible that this shift is something fundamentally new and won’t play out like any of these analogies suggest.

  14. Yuval said,

    December 9, 2022 @ 5:56 am

    Speaking of "insults", here's that fun time when I cold-started an attempt to play with chatGPT in Hebrew and it insisted on being identified as an Antisemite.

    But there's truly astonishing stuff, like the conlanging, recursive implementation of functions, that "virtual machine" post. Endless, really, and this is coming from a longtime general-purpose-LLM-unenthusiast.

  15. GH said,

    December 9, 2022 @ 6:13 am

    @Mark Liberman:

    @Philip Taylor: "I would have expected an LLM to have an internal model of the semantics of the sentence it is about to produce, which could, if required, be transferred, through a suitable API, to a robot capable of performing (a subset of) the sentences that the LLM is capable of producing."

    You would have been completely wrong.

    Still, ChatGPT can produce computer code for specific tasks (on specific systems) on request, so it seems like it ought to be possible without too much difficulty to turn its output into actions, "simply" by running that code instead of displaying it. (Whether they would, with acceptable reliability, be the correct actions is a separate question.)

  16. astrange said,

    December 9, 2022 @ 6:46 am

    Gary Marcus is a crank, but it's hard to notice since he does it in the anti-AI direction which is mostly the correct one. He won't even believe in the things that are true, though.

    The reasons LLMs aren't used in personal assistants are:

    1. OpenAI has the only productized one that works.

    2. They're very expensive to run.

    3. They don't work in every language the assistants support.

    #2 is the main one – we're temporarily not seeing this because ChatGPT is free, but if you submit a few requests to the real API (OpenAI playground) it can quickly get in the $10s, which is too much when all assistants are free.

  17. /df said,

    December 9, 2022 @ 8:25 am

    @Bill Benzon, isn't that the point, that a system like ChatGPT has no intrinsic pre-defined understanding of any word in any language, yet its emergent behaviour appears to demonstrate an understanding of words and utterances, even to the astonishing level shown in the previously linked

    From the OP one might infer that ChatGPT has spent a lot of time with Wikipedia.

    @Philip Taylor, same problem with (eg) Discord. There's probably some JS module on NPM that has the faulty assumption about SMS built in. You can get a £10 "burner phone" in a supermarket, possibly also useful if visiting PRC.

  18. Bill Benzon said,

    December 9, 2022 @ 2:00 pm

    @astrange: Yes, it's expensive to run. I've seen estimates of 6 figures a day (American dollars) for ChatGPT, which now has over a million users. I have no idea how many sessions it's now running per day, but it's enough that it sometimes gets backed up and you can't get in.

    @/df said: That's the point, it has no pre-defined understanding. It works on the understanding that words that co-occur share some meaning. When you get enough co-occurrences over gazillions of texts, you've got something. Warren Weaver articulated the basic principle (of distributional semantics) in a 1949 memo simply titled "Translation." Early work in machine translation, however, worked on different principles. Gerard Salton exploited distributional semantics in his 1970s work on document retrieval. That's well before current work in deep learning.
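
    A toy illustration of the co-occurrence idea, sketched in Python. The five-sentence corpus, the window size, and the function names are all made up for this example, and it is far cruder than anything in Weaver's memo, Salton's retrieval work, or a modern language model.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List


def cooccurrence_vectors(sentences: List[str], window: int = 2) -> Dict[str, Counter]:
    """For each word, count the words appearing within `window` positions of it."""
    vectors: Dict[str, Counter] = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for i, word in enumerate(words):
            lo, hi = max(0, i - window), min(len(words), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][words[j]] += 1
    return vectors


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[key] for key, count in a.items() if key in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


if __name__ == "__main__":
    corpus = [
        "the cat drinks milk",
        "the dog drinks water",
        "the cat chases mice",
        "the dog chases cars",
        "stocks fell sharply today",
    ]
    vectors = cooccurrence_vectors(corpus)
    print(cosine(vectors["cat"], vectors["dog"]))
    print(cosine(vectors["cat"], vectors["stocks"]))
```

    In this corpus "cat" and "dog" keep similar company and so score as similar, while "cat" and "stocks" share no neighbors at all. The program has been told nothing about animals or finance.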

  19. Chester Draws said,

    December 9, 2022 @ 6:19 pm

    "History is often full of racism, sexism, colonialism and various forms of injustice. So the technology can actually reinforce and may even exacerbate those issues," Parthasarathy told Asian Scientist.

    The Luddites used to be the uneducated. Now it is the over-educated.

    This man would presumably have railed against Luther translating the Bible, on the basis of who knew what evils it might bring.

  20. AntC said,

    December 11, 2022 @ 2:19 am

    Stackoverflow Temporary policy: ChatGPT is banned

    This is a temporary policy intended to slow down the influx of answers and other content created with ChatGPT. What the final policy will be regarding the use of this and other similar tools is something that will need to be discussed with Stack Overflow staff …

    Overall, because the average rate of getting correct answers from ChatGPT is too low, the posting of answers created by ChatGPT is substantially harmful to the site …

    The primary problem is that while the answers which ChatGPT produces have a high rate of being incorrect, they typically look like they might be good and the answers are very easy to produce. …
