Language Log

New frontiers in dataset corruption

May 5, 2024 @ 9:06 am· Filed by Mark Liberman under Elephant semifics, Nerdview

In a comment on yesterday's "Software testing day" post, ernie in berkeley offered a nice "QA Engineer walks into a bar" joke, and pointed us to its origin in an old xkcd comic "Exploits of a Mom":

…which in turn reminded me of an old problem, discussed in "Excel invents genes", 8/26/2016:

ChatGPT having a stroke?

February 21, 2024 @ 3:35 pm· Filed by Mark Liberman under Computational linguistics, Elephant semifics

Or a psychotic episode? ICYMI: Maxwell Zeff, "ChatGPT Went Berserk, Giving Nonsensical Responses All Night", Gizmodo 2/21/2024:

ChatGPT started throwing out “unexpected responses” on Tuesday night according to OpenAI’s status page. Users posted screenshots of their ChatGPT conversations full of wild, nonsensical answers from the AI chatbot.

Eliza reborn?

August 6, 2022 @ 8:33 am· Filed by Mark Liberman under Elephant semifics

Meta is inviting everyone to try out its BlenderBot3:

By releasing the chatbot to the general public, Meta wants to collect feedback on the various problems facing large language models. Users who chat with BlenderBot will be able to flag any suspect responses from the system, and Meta says it’s worked hard to “minimize the bots’ use of vulgar language, slurs, and culturally insensitive comments.” Users will have to opt in to have their data collected, and if so, their conversations and feedback will be stored and later published by Meta to be used by the general AI research community.

So following up on my earlier-reported "Conversations with GPT-3" (6/25/2022), here's BlenderBot3 chatting with a young person interested in philosophy:

COURTHOUHAING TOGET T ROCESS.WHE

July 25, 2022 @ 6:59 am· Filed by Mark Liberman under Computational linguistics, Elephant semifics

HE HAS ALL THE SOU OF COURSE
0:05 AND LOADED, READTOO.K
0:11 TING
0:16 A TVERY CONFIDENT.CONWAY
0:21 COURTHOUHAING TOGET T ROCESS.WHE
0:28 COIDATE'
0:30 TTACUTION'S CATHATE'
0:36 SE.
0:36 CHCEN'T KNHA
0:37 TAER OFURDI

That's the start of the automatically-generated transcript on YouTube for "See George Conway's reaction to Trump's reported plan if he wins again", CNN 7/24/2022.

"Train hard, dream big"

July 26, 2021 @ 5:10 am· Filed by Victor Mair under Alphabets, Elephant semifics, Errors, Information technology, Lost in translation

[This is a guest post by Bernhard Riedel]

I stumbled across what was probably a mis-MT in the context of the Olympic Games. (article in Korean)

"During a foot kick on the way to the gold medal, some hangul became visible. But…"

On the black belt of the athlete from Spain, one can see "기차 하드, 꿈 큰" which is wonderful gibberish. Netizens in Korea were puzzled but also quick to guess an erroneous machine translation.

기차(汽車): (railway) train (definitely *not* related to "to train")
하드: (en:hard, transliterated)
꿈: dream (noun built from the verb 꾸다(to dream) with the nominalizer ㅁ/음)
큰: big (from the verb 크다) in the form used when modifying a noun that follows

English as Afrikaans?

July 22, 2021 @ 11:17 am· Filed by Mark Liberman under Computational linguistics, Elephant semifics

Language-identification from digital text has been a solved problem for many years, so I was surprised yesterday to see Gmail offering to translate from Afrikaans an email written in perfectly idiomatic English, which started this way:

Knowing when you don't know

July 8, 2021 @ 7:33 am· Filed by Mark Liberman under Elephant semifics

It's often observed that current AI systems will generalize confidently to areas far away from anything in their training, where the right answer should be "huh?" This is true even when other available algorithms, often simple ones, could easily diagnose the lack of fit to expectations.

We've seen many amusing examples, which we've filed in the category Elephant Semifics, named for a phrase emerging from one of Google's hallucinatory translations of meaningless repetitions of Japanese or Thai characters, or random strings of ASCII vowels. Obviously a human translator would immediately notice the unexpected properties of the inputs, and in fact it's trivial to create algorithms that could screen for such things. Google and its colleagues don't bother, or at least didn't do so in the past, because why should they? Except that in real-world applications, noticing that inputs are nonsense is a clue that something has gone wrong, and maybe business-as-usual is not the right response.
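
To make the "trivial screen" idea concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not anything Google or any other system is known to use: the function names (`char_entropy`, `looks_like_nonsense`) and the thresholds are hypothetical, chosen to catch the two kinds of input mentioned above, namely heavy repetition of a few characters and strings made almost entirely of ASCII vowels.

```python
from collections import Counter
import math

VOWELS = set("aeiou")


def char_entropy(text: str) -> float:
    """Shannon entropy, in bits per character, of the non-space characters."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    total = len(chars)
    counts = Counter(chars)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def looks_like_nonsense(text: str,
                        min_len: int = 12,
                        min_entropy: float = 2.0,
                        max_vowel_share: float = 0.9) -> bool:
    """Flag input that is suspiciously repetitive or almost all ASCII vowels.

    All three thresholds are illustrative guesses, not tuned values.
    """
    chars = [c.lower() for c in text if not c.isspace()]
    if len(chars) < min_len:
        return False  # too short to judge either way

    # Case 1: a handful of characters repeated over and over (low entropy).
    if char_entropy(text) < min_entropy:
        return True

    # Case 2: ASCII letters that are nearly all vowels.
    letters = [c for c in chars if c.isascii() and c.isalpha()]
    if letters:
        vowel_share = sum(c in VOWELS for c in letters) / len(letters)
        if vowel_share > max_vowel_share:
            return True

    return False


if __name__ == "__main__":
    samples = [
        "が" * 20,
        "aeiou uoiea eiuoa aoeiu",
        "Language identification has been a solved problem for many years",
    ]
    for sample in samples:
        print(looks_like_nonsense(sample), repr(sample[:30]))
```

A screen like this would not decide what the right response is. It only raises the "huh?" flag, so that a downstream system can decline to translate or ask for clarification instead of carrying on with business as usual.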

Maltese email ARC

June 9, 2021 @ 9:56 am· Filed by Mark Liberman under Elephant semifics

Yesterday I got a strange email message, apparently from American Express. The first strange thing: Gmail showed it with no Subject and no content:

But then it got stranger…

Covered. Nineteen. At pain medicine

May 2, 2021 @ 5:20 pm· Filed by Mark Liberman under Elephant semifics

Google Fi screens my calls, so that my phone doesn't even ring unless the caller is in my contacts, or passes some kind of quasi-Turing Test. This is a Good Thing, since I get half a dozen spam calls a day, often at inconvenient times. As a result, robocalls generally end up as voicemail, which Google Fi helpfully turns into a convenient text message — which is often amusing. For example, a couple of days before my second vaccine shot last month, a robocall from Penn Medicine got transcribed like this:

Hello, this is pain medicine reaching out to you regarding covered. Nineteen. We've implemented a short sentence screening survey before coming into your appointment. All pain medicine patients are being asked to complete this brief electronic symptom checker to answer the questions, please call 215-NNN-NNNN. If your appointment has been canceled or rescheduled, please disregard this message patients and visitors. I'm presenting two pain medicine locations for inpatient outpatient or emergency department care should be wearing a cloth face covering in accordance with current CDC and state guidance. Thank you.

[Callback number obscured]

Advances in topic modeling

April 1, 2021 @ 6:15 am· Filed by Mark Liberman under Computational linguistics, Elephant semifics, Language and the media

In the middle to late 1990s, "Topic Detection and Tracking" was an active research area (see also this). And by the early 2000s, the technology was good enough to support the creation of Google News. Twenty years later, these and other innovations have transformed the mass media, for good or ill. I don't know what algorithms the AI in charge of Topic Modeling at Google News is using these days, but I'm happy to see it developing a sense of humor:

Articulate Tory gestures

February 26, 2021 @ 5:04 pm· Filed by Mark Liberman under Elephant semifics

At our most recent Penn Phonetic Lab meeting, we heard a (virtual) talk by Marc Garellek on the topic "Reconsidering voicing during glottal sounds". The talk was quite interesting, but more relevant for a general audience was what happened when someone turned on Zoom's "Live Transcription" feature:

Image search results

October 19, 2020 @ 6:13 am· Filed by Mark Liberman under Elephant semifics

Yesterday my wife challenged me to identify the person in a photo she sent. I decided to cheat, by using Google Image Search — and the results were very strange.

We've posted often about weird AI behavior in Speech-to-Text and Machine Translation and other NLP applications. Image processing has its own litany of weirdness, which is not often a topic here for obvious reasons. But this case does have a linguistic aspect, namely the cited links…

"The inspirations to be more inoperative"

September 17, 2020 @ 7:13 am· Filed by Mark Liberman under Elephant semifics, WTF

Recently I was doing some background research on Central Auditory Processing Disorder (CAPD), and one of the references that Google Scholar handed me was a Semantic Scholar page for J.A. Willeford and J. Burleigh, "Handbook of central auditory processing disorders in children", 1985, with the following abstract:

The handbook of central auditory processing disorders in children that we provide for you will be ultimate to give preference. This reading book is your chosen book to accompany you when in your free time, in your lonely. This kind of book can help you to heal the lonely and get or add the inspirations to be more inoperative. Yeah, book as the widow of the world can be very inspiring manners. As here, this book is also created by an inspiring author that can make influences of you to do more.

Archive for Elephant semifics
