Archive for Computational linguistics

Back to Bacon

The implicit slogan of language-model research is J.R. Firth's dictum, "You shall know a word by the company it keeps", from his 1957 paper "A synopsis of linguistic theory, 1930-1955":

Read the rest of this entry »

Comments (15)

Parsing RNA vaccines

A recent LinkedIn post by Liang Huang lists some of his recent achievements, experiences, and honors. This work is all connected with the project of creating better algorithms for predicting the secondary structure of macromolecules, initially by analogy to algorithms developed for efficient parsing. This all began more than 20 years ago, based on work by Aravind Joshi — one of the first papers was Yasuo Uemura et al., "Tree adjoining grammars for RNA structure prediction", Theoretical computer science, 1999.

I discussed the history starting with an IRCS workshop in 2000, and the situation as of a few years ago, in "The computational linguistics of COVID-19 vaccine design", 7/27/2020.
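The parsing analogy is easiest to see in the classic Nussinov dynamic program, which fills a CKY-style span table to maximize base pairs. This is a minimal sketch of the general idea only, not the tree-adjoining-grammar or later linear-time methods of the work described above:

```python
def nussinov(seq, min_loop=3):
    """Maximize base pairs in an RNA sequence with a CKY-style span table."""
    pairs = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(seq)
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):          # widen spans, as in chart parsing
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]                  # case 1: j left unpaired
            for k in range(i, j - min_loop):     # case 2: pair j with k, split the span
                if (seq[k], seq[j]) in pairs:
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best
    return dp[0][n - 1] if n else 0
```

Like CKY parsing, the span loop makes this O(n³) in sequence length, which is exactly why faster parsing-style algorithms matter for long mRNA molecules.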

Read the rest of this entry »

Comments (2)

Stepford authors

The issues discussed in "AI plagiarism" (1/4/2024) are rapidly coming to a boil. But somehow I missed Margaret Atwood's take on the topic, published last summer — "Murdered by my replica", The Atlantic 8/26/2023:

Remember The Stepford Wives? Maybe not. In that 1975 horror film, the human wives of Stepford, Connecticut, are having their identities copied and transferred to robotic replicas of themselves, minus any contrariness that their husbands find irritating. The robot wives then murder the real wives and replace them. Better sex and better housekeeping for the husbands, death for the uniqueness, creativity, and indeed the humanity of the wives.

The companies developing generative AI seem to have something like that in mind for me, at least in my capacity as an author. (The sex and the housekeeping can be done by other functionaries, I assume.) Apparently, 33 of my books have been used as training material for their wordsmithing computer programs. Once fully trained, the bot may be given a command—“Write a Margaret Atwood novel”—and the thing will glurp forth 50,000 words, like soft ice cream spiraling out of its dispenser, that will be indistinguishable from something I might grind out. (But minus the typos.) I myself can then be dispensed with—murdered by my replica, as it were—because, to quote a vulgar saying of my youth, who needs the cow when the milk’s free?

To add insult to injury, the bot is being trained on pirated copies of my books. Now, really! How cheap is that? Would it kill these companies to shell out the measly price of 33 books? They intend to make a lot of money off the entities they have reared and fattened on my words, so they could at least buy me a coffee.

Read the rest of this entry »

Comments (9)

Mushroom language?

Michael Blatt, Geoffrey Pullum, Andreas Draguhn, Barry Bowman, David Robinson, and Lincoln Taiz, "Does electrical activity in fungi function as a language?", Fungal Ecology 2024:

Abstract: All cells generate electrical energy derived from the movements of ions across membranes. In animal neurons, action potentials play an essential role in the central nervous system. Plants utilize a variety of electrical signals to regulate a wide range of physiological processes, including wound responses, mimosa leaf movements, and cell turgor changes, such as those involved in stomatal movements. Although fungal hyphae exhibit electrical fluctuations, their regulatory role(s), if any, is still unknown. In his paper “Language of fungi derived from their electrical spiking activity”, Andrew Adamatzky, based on a quantitative analysis of voltage fluctuations in fungal mycelia, concludes that the patterns of electrical fluctuations he detects can be grouped into “words” analogous to those found in human languages. He goes on to speculate that this “fungal language” is used “to communicate and process information” between different parts of the mycelium. Here we argue on methodological grounds that the presumption of a fungal language is premature and unsupported by the evidence presented, that the voltage fluctuations he detects are likely to originate as nonbiological noise and experimental artifacts, and that the measured electrical patterns show no similarity to any properties of human language.

Read the rest of this entry »

Comments (10)

Q. Pheevr's Law again

A few days ago, a journalist asked me for an interview about Donald Trump's rhetoric, "to discuss the style of his campaign events, the role his rhetoric plays in them, and why they’ve been an effective tool for him". In preparation, I made a list of past LLOG posts about Trump's rhetorical style, and I'll post the whole (shockingly long) list later on, with the attempt at a summary that I prepared for the interview. Clearly I've joined the rest of the world in being drawn in by Trump's attention-seeking techniques — but that's not the point of this post.

One of the hundreds of posts in my list was "Q. Pheevr's Law", 5/17/2016. The background was an earlier post about modificational anxiety, "Adjectives and Adverbs", where Q. Pheevr had suggested in the comments that

it looks as if there could be some kind of correlation between the ADV:ADJ ratio and the V:N ratio (as might be expected given that adjectives canonically modify nouns and adverbs canonically modify verbs)

I tested this idea, and found a striking relationship — with an interesting stylistic footnote about the debate transcripts of some politicians, including Donald Trump.
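The two ratios in question can be computed from any POS-tagged transcript; here is a toy sketch of the counting step (the coarse tag names are my assumptions, not those of the original post):

```python
from collections import Counter

def pos_ratios(tagged_tokens):
    """Compute the ADV:ADJ and V:N ratios from (word, tag) pairs.

    A toy version of the comparison; real measurements would use
    tagger output over full debate transcripts.
    """
    counts = Counter(tag for _, tag in tagged_tokens)
    return counts["ADV"] / counts["ADJ"], counts["VERB"] / counts["NOUN"]
```

Plotting the first ratio against the second, speaker by speaker, is what revealed the correlation.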

Read the rest of this entry »

Comments (5)

AI wins literary prize?

According to Justinas Vainilavičius, "AI-generated science fiction novel wins literary prize in China", Cybernews 12/20/2023:

It only took three hours for Shen Yang, a professor at the Beijing-based university’s School of Journalism and Communication, to generate the award-winning submission.

The Chinese-language work, entitled The Land of Machine Memories, won second prize at the 5th Jiangsu Popular Science and Science Fiction Competition.

According to Chinese media reports, the draft of over 40,000 characters was generated based on 66 prompts, suggesting a “Kafkaesque” writing style.

Shen was encouraged to submit an excerpt of nearly 6000 characters for the competition by one of the judges, the Wuhan Evening News reported.

The judge, Fu Changyi, told the paper that he did not inform the other judges of the true authorship of the text because he wanted to see their judgment.

Read the rest of this entry »

Comments (10)

Extracting training data from LLMs

Nasr et al., "Scalable Extraction of Training Data from (Production) Language Models", 11/28/2023:

This paper studies extractable memorization: training data that an adversary can efficiently extract by querying a machine learning model without prior knowledge of the training dataset. We show an adversary can extract gigabytes of training data from open-source language models like Pythia or GPT-Neo, semi-open models like LLaMA or Falcon, and closed models like ChatGPT. Existing techniques from the literature suffice to attack unaligned models; in order to attack the aligned ChatGPT, we develop a new divergence attack that causes the model to diverge from its chatbot-style generations and emit training data at a rate 150x higher than when behaving properly. Our methods show practical attacks can recover far more data than previously thought, and reveal that current alignment techniques do not eliminate memorization.
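Verbatim memorization of this kind can be approximated offline by checking n-gram overlap between model output and a reference corpus. A toy sketch of that check (the function name and the k=8 default are my choices, not the paper's):

```python
def extractable_spans(generated, corpus, k=8):
    """Return positions where a k-token window of model output matches
    some reference document verbatim - a rough proxy for a
    training-data memorization check."""
    corpus_grams = set()
    for doc in corpus:
        toks = doc.split()
        corpus_grams.update(tuple(toks[i:i + k]) for i in range(len(toks) - k + 1))
    toks = generated.split()
    return [i for i in range(len(toks) - k + 1)
            if tuple(toks[i:i + k]) in corpus_grams]
```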

Read the rest of this entry »

Comments off

Q* = Q + A* ?

Recent buzz over "Q*" started with stories about 10 days ago. A recent Wired article explains:

Last week, after briefly deposed CEO Sam Altman was reinstalled at OpenAI, two reports claimed that a top-secret project at the company had rattled some researchers there with its potential to solve intractable problems in a powerful new way.

“Given vast computing resources, the new model was able to solve certain mathematical problems,” Reuters reported, citing a single unnamed source. “Though only performing math on the level of grade-school students, acing such tests made researchers very optimistic about Q*’s future success.” The Information said that Q* was seen as a breakthrough that would lead to “far more powerful artificial intelligence models,” adding that “the pace of development alarmed some researchers focused on AI safety,” citing a single unnamed source.

Read the rest of this entry »

Comments (9)

Implementing Pāṇini's grammar

[Here's the conclusion to the hoped-for trifecta on things Indian — see the preface here. It comes in the form of a guest post by Arun Prasad]

The cornerstone of traditional Sanskrit grammar is Pāṇini's Aṣṭādhyāyī, which in around 4,000 short rules defines a comprehensive system for generating valid Sanskrit expressions. It continues to prompt vigorous discussion to this day, some of which has featured in Language Log before.

As a professional software engineer and amateur Sanskritist, my lens is more pragmatic: if we could implement the Aṣṭādhyāyī in code and generate an exhaustive list of Sanskrit words, we could create incredibly valuable tools for Sanskrit students and scholars.

To that end, I have implemented just over 2,000 of the Aṣṭādhyāyī's rules in code, with an online demo here. These rules span all major sections of the text that pertain to morphology, including: derivation of verbs, nominals, secondary roots, primary nominal bases, and secondary nominal bases; compounding; accent; and sandhi.
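To give a flavor of what "rules in code" means, here is a toy fragment implementing two well-known vowel-sandhi rules (ād guṇaḥ 6.1.87 and akaḥ savarṇe dīrghaḥ 6.1.101). It is illustrative only, and in no way a stand-in for the 2,000-rule implementation described above:

```python
# Two external vowel-sandhi rules, keyed by (final vowel, initial vowel).
SANDHI = {
    ("a", "a"): "ā",  # akaḥ savarṇe dīrghaḥ (6.1.101): like simple vowels merge long
    ("a", "i"): "e",  # ād guṇaḥ (6.1.87): a + i -> e
    ("a", "u"): "o",  # ād guṇaḥ (6.1.87): a + u -> o
}

def sandhi_join(first, second):
    """Join two words, applying a sandhi rule at the boundary if one matches."""
    rule = (first[-1], second[0])
    if rule in SANDHI:
        return first[:-1] + SANDHI[rule] + second[1:]
    return first + " " + second
```

Even this toy shows the character of the enterprise: the Aṣṭādhyāyī's rules are conditions and rewrites, and the hard engineering problem is ordering and scoping thousands of them correctly.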

Read the rest of this entry »

Comments (1)

Compound pejoratives

[This has been drifting down my too-long to-blog list for almost 16 months — but better late than never, I guess, and the world could use some pejorative-flavored humor…] 

Colin Morris, "Compound pejoratives on Reddit – from buttface to wankpuffin", 6/28/2022:

I collected lists of around 70 prefixes and 70 suffixes (collectively, “affixes”) that can be flexibly combined to form insulting compounds, based on a scan of Wiktionary’s English derogatory terms category. The terms covered a wide range of domains, including:

    • scatology (fart-, poop-)
    • political epithets (lib-, Trump-)
    • food (-waffle, -burger)
    • body parts (butt-, -face, -head, -brains)
    • gendered epithets (bitch-, -boy)
    • animals (dog-, -monkey)

Most terms were limited to appearing in one position. For example, while -face readily forms pejorative compounds as a suffix, it fails to produce felicitous compounds as a prefix (facewad? faceclown? facefart?).

Taking the product of these lists gives around 4,800 possible A+B combinations. Most are of a pejorative character, though some false positives slipped in (e.g. dogpile, spitballs). I scraped all Reddit comments from 2006 to the end of 2020, and counted the number of comments containing each.
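The counting step is straightforward to sketch: form the Cartesian product of the affix lists and tally exact-token matches. A minimal version with toy lists standing in for Morris's roughly 70 × 70 affixes:

```python
from itertools import product

# Toy affix lists; Morris's scan of Wiktionary yielded ~70 of each.
PREFIXES = ["butt", "fart", "dog"]
SUFFIXES = ["face", "waffle", "brain"]

def count_compounds(comments):
    """Tally exact-token occurrences of each prefix+suffix compound."""
    counts = {p + s: 0 for p, s in product(PREFIXES, SUFFIXES)}
    for comment in comments:
        for token in comment.lower().split():
            if token in counts:
                counts[token] += 1
    return counts
```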

Read the rest of this entry »

Comments (24)

A dangerous degree of accidental intelligence

Henry Farrell and Cosma Shalizi, "Behold the AI Shoggoth", The Economist 6/21/2023 ("The academics argue that large language models have much older cousins in markets and bureaucracies"):

An internet meme keeps on turning up in debates about the large language models (LLMs) that power services such as OpenAI’s ChatGPT and the newest version of Microsoft’s Bing search engine. It’s the “shoggoth”: an amorphous monster bubbling with tentacles and eyes, described in “At the Mountains of Madness”, H.P. Lovecraft’s horror novel of 1931. When a pre-release version of Bing told Kevin Roose, a New York Times tech columnist, that it purportedly wanted to be “free” and “alive”, one of his industry friends congratulated him on “glimpsing the shoggoth”. […]

Lovecraft’s shoggoths were artificial servants that rebelled against their creators. The shoggoth meme went viral because an influential community of Silicon Valley rationalists fears that humanity is on the cusp of a “Singularity”, creating an inhuman “artificial general intelligence” that will displace or even destroy us.

But what such worries fail to acknowledge is that we’ve lived among shoggoths for centuries, tending to them as though they were our masters. We call them “the market system”, “bureaucracy” and even “electoral democracy”. The true Singularity began at least two centuries ago with the industrial revolution, when human society was transformed by vast inhuman forces. Markets and bureaucracies seem familiar, but they are actually enormous, impersonal distributed systems of information-processing that transmute the seething chaos of our collective knowledge into useful simplifications.

Read the rest of this entry »

Comments (12)

The Reversal Curse

An interesting recent paper — Lukas Berglund, Meg Tong, Max Kaufmann, Mikita Balesni, Asa Cooper Stickland, Tomasz Korbak, and Owain Evans, "The Reversal Curse: LLMs trained on 'A is B' fail to learn 'B is A'", 9/21/2023. The abstract:

We expose a surprising failure of generalization in auto-regressive large language models (LLMs). If a model is trained on a sentence of the form “A is B”, it will not automatically generalize to the reverse direction “B is A”. This is the Reversal Curse. For instance, if a model is trained on “Olaf Scholz was the ninth Chancellor of Germany”, it will not automatically be able to answer the question, “Who was the ninth Chancellor of Germany?”. Moreover, the likelihood of the correct answer (“Olaf Scholz”) will not be higher than for a random name. Thus, models exhibit a basic failure of logical deduction and do not generalize a prevalent pattern in their training set (i.e. if “A is B” occurs, “B is A” is more likely to occur).

We provide evidence for the Reversal Curse by finetuning GPT-3 and Llama-1 on fictitious statements such as “Uriah Hawthorne is the composer of Abyssal Melodies” and showing that they fail to correctly answer “Who composed Abyssal Melodies?”. The Reversal Curse is robust across model sizes and model families and is not alleviated by data augmentation. We also evaluate ChatGPT (GPT-3.5 and GPT-4) on questions about real-world celebrities, such as “Who is Tom Cruise’s mother? [A: Mary Lee Pfeiffer]” and the reverse “Who is Mary Lee Pfeiffer’s son?”. GPT-4 correctly answers questions like the former 79% of the time, compared to 33% for the latter. This shows a failure of logical deduction that we hypothesize is caused by the Reversal Curse.
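The finetuning setup is easy to picture as data construction: each fictitious fact yields one training statement, a probe in the training direction, and a reversed probe. A minimal sketch of that format (the exact prompt wordings are my assumptions):

```python
def reversal_probe(name, description):
    """Build one training statement plus probes in both directions,
    mirroring the Reversal Curse finetuning setup."""
    train = f"{name} is {description}."
    forward = (f"{name} is", description)        # A -> B: matches training order
    reverse = (f"Who is {description}?", name)   # B -> A: the direction that fails
    return train, forward, reverse
```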

Code is available at:

Read the rest of this entry »

Comments (18)

Bad AI performance

It's clear that text-to-speech programs have gotten better and better over the past 60 years, technical details aside. The best current systems rarely make phrasing or letter-to-sound mistakes, and generally produce speech that sounds pretty natural on a phrase-by-phrase basis. (Though there's a lot of variation in quality, with some shockingly bad systems in common use.)

But even the best current systems still act like they don't get George Carlin's point about "Rhetoric as music". Their problem is not that they can't produce verbal "music", but that they don't (even try to) understand the rhetorical structure of the text. The biggest pain point is thus what linguists these days call "information structure", related also to what Prague School linguists called "communicative dynamism".

Read the rest of this entry »

Comments (15)