Language Log

AI plagiarism

January 4, 2024 @ 4:56 pm · Filed by Mark Liberman under Language and the law

"The Times Sues OpenAI and Microsoft Over A.I. Use of Copyrighted Work", NYT 12/27/2023:

The New York Times sued OpenAI and Microsoft for copyright infringement on Wednesday, opening a new front in the increasingly intense legal battle over the unauthorized use of published work to train artificial intelligence technologies.

The Times is the first major American media organization to sue the companies, the creators of ChatGPT and other popular A.I. platforms, over copyright issues associated with its written works. The lawsuit, filed in Federal District Court in Manhattan, contends that millions of articles published by The Times were used to train automated chatbots that now compete with the news outlet as a source of reliable information.

The suit does not include an exact monetary demand. But it says the defendants should be held responsible for “billions of dollars in statutory and actual damages” related to the “unlawful copying and use of The Times’s uniquely valuable works.” It also calls for the companies to destroy any chatbot models and training data that use copyrighted material from The Times.

The lawsuit includes nearly 30 pages of persuasive examples in which OpenAI programs parrot large chunks of NYT material, essentially verbatim. Here's the start of the first example:

That same example is featured by Gary Marcus in "Things are about to get a lot worse for Generative AI: A full of spectrum of infringement", 12/29/2023, along with a selection of image examples like this one:

He concludes:

In all likelihood, the New York Times lawsuit is just the first of many. On a multiple choice X poll today I asked people whether they thought the case would settle (most did) and what the likely value of such a settlement might be. Most answers were $100 million or more, 20% expected the settlement to be a billion dollars. When you multiply figures like these by the number of film studios, video game companies, other newspapers etc, you are soon talking real money.

A multiple-choice X poll may not be an accurate predictor of settlement value, but it seems clear that copyright infringement will be a serious problem for generative AI in the near future. Effective decelerationism, even.

Ironically, this may turn out to be a Good Thing for Google. According to Myles Kruppa, "Jeff Bezos Bets on a Google Challenger Using AI to Try to Upend Internet Search", WSJ 1/4/2024:

Perplexity, a startup going after Google’s dominant position in web search, has won backing from Jeff Bezos and venture capitalists betting that artificial intelligence will upend the way people find information online.

Started less than two years ago, Perplexity has fewer than 40 employees and is based out of a San Francisco co-working space. The company’s product, which it calls an answer engine, is used by about 10 million people monthly.

Those ingredients were enough to persuade Institutional Venture Partners, Bezos and other tech executives to invest $74 million in the company, the largest sum raised by an internet search startup in recent years. The investment valued Perplexity at $520 million, including the new money, said Chief Executive Officer Aravind Srinivas. […]

Perplexity’s founders said their advantage is using advances in AI to provide direct answers, instead of website links, in response to search queries, without some of the limitations felt by larger companies.

“If you can directly answer somebody’s question, nobody needs those 10 blue links,” Srinivas said.

But based on today's "Stochastic Parrot" AI technology, those direct answers are likely to be mostly copied from other people's published texts, and so Perplexity may run into the same sort of copyright caltrops that are in OpenAI's pathway.

Of course, Google has been doing the same thing for a while — see "News Publishers See Google’s AI Search Tool as a Traffic-Destroying Nightmare", WSJ 12/24/2023:

Shortly after the launch of ChatGPT, the Atlantic drew up a list of the greatest threats to the 166-year-old publication from generative artificial intelligence. At the top: Google’s embrace of the technology.

About 40% of the magazine’s web traffic comes from Google searches, which turn up links that users click on. A task force at the Atlantic modeled what could happen if Google integrated AI into search. It found that 75% of the time, the AI-powered search would likely provide a full answer to a user’s query and the Atlantic’s site would miss out on traffic it otherwise would have gotten.

What was once a hypothetical threat is now a very real one. Since May, Google has been testing an AI product dubbed “Search Generative Experience” on a group of roughly 10 million users, and has been vocal about its intention to bring it into the heart of its core search engine.

But if "Search Generative Experience" is blocked, that leaves Google where it is today — in control of web search, and safe from an OpenAI-powered Microsoft invasion or guerilla raids by upstarts like Perplexity.

By the way, "perplexity" is an important concept in Information Theory, part of the foundations of "language models" large and small, which is no doubt why the start-up's founders chose the name.

Interestingly, perplexity's technical meaning — 2 raised to the power of the entropy, with the entropy measured in bits — doesn't seem to have made it into dictionaries yet. As the Wikipedia article informs us, the earliest published example is in the abstract for a presentation at the 1977 Acoustical Society annual meeting — Fred Jelinek, Robert Mercer, Lalit Bahl, and James Baker, "Perplexity — a measure of the difficulty of speech recognition tasks":

Perhaps there are other examples of words in widespread use that were first published in a conference abstract, but this is the only one that I know of.
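For readers who want the technical sense spelled out, here is a minimal sketch of the definition given above, assuming a discrete probability distribution and entropy measured in bits. The function names are illustrative only and are not taken from any particular library.

```python
import math


def surprisal(p):
    """Surprisal, in bits, of a single event that has probability p."""
    return -math.log2(p)


def entropy(dist):
    """Shannon entropy in bits: the probability-weighted average surprisal.

    `dist` is assumed to be a list of probabilities summing to 1.
    Outcomes with zero probability contribute nothing to the sum.
    """
    return sum(p * surprisal(p) for p in dist if p > 0)


def perplexity(dist):
    """Perplexity: 2 raised to the power of the entropy (in bits)."""
    return 2 ** entropy(dist)
```

On this definition, a uniform choice among four equally likely options has perplexity 4, and a distribution with only one possible outcome has perplexity 1, which is why perplexity is often read as an effective number of choices.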

Update — A more extensive report from Gary Marcus and Reid Southern on the visual copyright issues: "Generative AI Has a Visual Plagiarism Problem" IEEE Spectrum 1/6/2024. And more about perplexity.ai: "Jeff Bezos–backed AI search startup’s CEO says ‘Google is going to be viewed as something that’s legacy and old’", Fortune 1/6/2024; "Everyone wants better web search – is Perplexity's AI the answer?", The Register 1/5/2024. And on the legal front: "Microsoft, OpenAI sued for copyright infringement by nonfiction book authors in class action claim", CNBC 1/5/2024; "'Impossible' to create AI tools like ChatGPT without copyrighted material, OpenAI says", The Guardian 1/8/2024.

From Politico's Weekly Cartoon Gallery, by Nick Anderson, Reform Austin News:


18 Comments

Jon W said,

January 4, 2024 @ 5:05 pm

Yes, if I were trying to calculate lawsuit settlement value, I would definitively do it by asking random people on Xitter to click a button on a multiple-choice poll.
Tim Rowe said,

January 4, 2024 @ 8:56 pm

I'm amused by the notion that one has to multiply a billion dollars by *anything* before one is "talking real money".
Seth said,

January 4, 2024 @ 9:07 pm

@ Jon W – It's been a while since I've heard the buzzphrase "The wisdom of crowds" :-)
Mike Anderson said,

January 4, 2024 @ 10:35 pm

@Seth – in my G.I. days, the "wisdom of crowds" had another name: "pooling our ignorance."
Martin said,

January 5, 2024 @ 3:20 am

@Tim Rowe Mark's paraphrasing a quote originally about government spending attributed to US Senator Everett Dirksen: https://www.senate.gov/artandhistory/history/minute/Senator_Everett_Mckinley_Dirksen_Dies.htm

Re last sentence of the post, surely 'perplexity' is a perfectly standard English word dating back to Middle English, which was additionally co-opted in 1977 as a technical term…
Mark Liberman said,

January 5, 2024 @ 5:27 am

@Martin: "surely 'perplexity' is a perfectly standard English word dating back to Middle English, which was additionally co-opted in 1977 as a technical term…"

Yes, but the technical meaning — which is very widely used — is overdue for addition to dictionaries, joining the many other "perfectly standard English words" that have added widely-used technical senses.
Philip Anderson said,

January 5, 2024 @ 8:08 am

Is ‘perplexity’+technical meaning the same “word” as ‘perplexity’+common meaning? One is derived from the other, so a dictionary would include them under the same headword, but since the technical meaning is not obvious it deserves its own definition. I wouldn’t say either Martin or Mark is wrong, although I might have said ‘word definition’ myself.
Mark Liberman said,

January 5, 2024 @ 8:23 am

@Philip Anderson: "One is derived from the other, so a dictionary would include them under the same headword, but since the technical meaning is not obvious it deserves its own definition."

Yes, exactly the point I made — apparently I was somehow unclear?
RfP said,

January 5, 2024 @ 11:54 am

@Mark:

I, too, am… perplexed by that last sentence.

It seems from my layman’s perspective that it’s the same lexical item or the same word (do those two terms apply equally here, or am I missing something?) as the “perfectly standard English word dating back to Middle English,” so maybe you’re making a distinction that I don’t quite understand.
Philip Taylor said,

January 5, 2024 @ 1:19 pm

Well, I am perplexed by the fact that others appear perplexed by Mark's words. To my mind he is saying simply the following :
The word "perplexity" (in the sense of entropy) has been established since at least the mid-to-late 70s, and is seemingly now widely used and attested, yet no dictionary of which he (or I) is/am aware yet glosses that new, entropy-based, definition. Am I, too, misunderstanding something ?
KWillets said,

January 5, 2024 @ 2:30 pm

Neither Perplexity nor Surprisal appears to have an information-theoretic dictionary definition. Both were adopted from general meaning to describe a specific mathematical formula.

Surprisal is the simplest concept, the (log) unlikelihood of one specific event. The average surprisal across all possible events is the entropy, and perplexity is a more intuitive expression of entropy as a number of choices.

In the case of the text here, one could say that OpenAI has no perplexity in its output because each word is simply regurgitated from the NYT article, with only one choice at each step.
Philip Anderson said,

January 5, 2024 @ 5:27 pm

I can see where Martin and RfP are coming from, because the word “perplexity”, with that spelling and the root meaning of puzzlement, is an older word in widespread use. Despite this, I thought it clear that Mark was referring to the new(-ish) technical meaning (although I wouldn’t have said that was widespread, outside of technical literature).
Richard Hershberger said,

January 6, 2024 @ 8:40 am

I just checked out the Perplexity site. It is similar to Bing's new and unimproved search: Go out and find a website that seems to answer the question and copy a block of text from it, with a tiny hyperlink to the site it just copied from. Even apart from plagiarism issues, this is a terrible way to get information if you are at all serious about its being right.

My standard test question is "Who invented baseball?" I choose this partly because it is within my expertise (I could go on about it at any length you are willing to tolerate) and partly because the answer is not at all straightforward. (If you think you know the answer, and it is straightforward, you are wrong.) The result from Perplexity is not as bad as it could be, but this is far from its being good.

What really caught my eye is the statement that "The claim that Civil War hero Abner Doubleday invented baseball in 1839 is considered a myth." Fair enough: weakly stated and kind of weaselly (considered by whom?) but not wrong. Below this are "related questions," including "what is the origin of the name 'baseball'". The answer provided includes "The once widely accepted story that US Army officer Abner Doubleday invented baseball in Cooperstown, New York, in 1839 has been conclusively debunked." Apart from this being a rambling digression from the actual question, as is typical of AI answers, it also is different from the first version. Is the Doubleday story conclusively debunked, a strong (and accurate) statement, or the much weaker "is considered a myth"? There is no knowing. At least not here.

Asking Bing "Who invented baseball?" is funnier. It coughs up "Albert Cartwright." Apart from the historical problems with the Cartwright version, his first name was "Alexander." The link is to a botched article in Parade. This is not only funnier, but also displays the AI's inability to separate quality from crap.
Richard Hershberger said,

January 6, 2024 @ 8:46 am

Here is another gem from Perplexity: "When did the Dodgers become a major league club?" The correct answer is 1884. They first played in 1883 in the minor Inter-state Association, then jumped to the major American Association in 1884, then to the National League in 1890.

The answer Perplexity gives is "The Dodgers became a major league club in 1883. Originally based in Brooklyn, New York, the team joined the American Association in 1884 and eventually the National League in 1890…" What is really interesting is that it links to three sites, each of which gets it right. Perplexity manages to take consistently correct information and introduce an error all its own. Impressive.
RfP said,

January 6, 2024 @ 1:25 pm

In terms of the questions raised by Philip Taylor and Philip Anderson, my own puzzlement was pretty minor. I understood what Mark was getting at, but I was surprised at his terminology.

Although I am not a linguist, I have a vested, professional interest in learning as much as I can about the tools from which I make my living. So I try, in my amateurish way, to ensure that my understanding is as accurate and thoroughgoing as it can be.

(And in that spirit, I would like to make as big of a plug as ever I can for the recent second edition of A Student’s Introduction to English Grammar, which is based on The Cambridge Grammar of the English Language by Rodney Huddleston and a certain highly esteemed former poster here, namely Geoff Pullum. CGEL isn't exactly breakfast reading, but it's been invaluable to me on many occasions. This new edition of The Student Guide is a Rosetta Stone of incredible value, at a fraction of the price. Available at stores near you—or by way of a simple click online. Get yours today!)

At any rate, I'm still not completely clear on whether Mark's usage of "word[]" in his last sentence was standard, informal, or somewhere in between. I'll live.
Philip Taylor said,

January 6, 2024 @ 4:03 pm

RfP — OK, all is now clear. I think that you are referring to Mark's "Perhaps there are other examples of words in widespread use that were first published in a conference abstract, but this is the only one that I know of", where I think both you and I would agree that the word "perplexity" was first published in (perhaps) J. Gower, Confessio Amantis (Fairfax MS.) viii. 2190, where the author wrote "Tho was betwen mi Prest and me Debat and gret perplexete" in or around 1393. And therefore what Mark appears to have meant is "Perhaps there are other examples of [specialised meanings of] words in widespread use that [first appeared] in a conference abstract, but this is the only one that I know of".

As to the CaGEL, I am afraid that it is beyond my pay grade, so I shall have to rely on Quirk's Comprehensive Grammar of the English Language, my copy of which I purchased long before its price reached today's eye-watering levels…
Tuantuan said,

January 9, 2024 @ 9:35 pm

With the development of AI, copyright issues will become a major concern. Can content created by AI also be registered for copyright? For example, if I write a poem using GPT, coincidentally similar to others, how can I tell who is right?
Deedy Medici said,

February 4, 2024 @ 10:01 pm

Interestingly Google itself is rolling out an AI, which may well have the same problems.

Though we do need some better search engines to break the stranglehold of sometimes useless results that Google provides (there have been some interesting mini-docs I've seen about it, such as Google now preventing you from actually reaching the end of its "millions of results" and just letting you read the first several pages).

And, amusingly enough, it's not as though traditional publishers, particularly in certain industries such as music (looking at the RIAA for example), aren't gobbling up and destroying creators/artists and merely pretending to be acting in the artists' best interest…
