[This has been drifting down my too-long to-blog list for almost 16 months — but better late than never, I guess, and the world could use some pejorative-flavored humor…]

Colin Morris, "Compound pejoratives on Reddit – from buttface to wankpuffin", 6/28/2022:

I collected lists of around 70 prefixes and 70 suffixes (collectively, “affixes”) that can be flexibly combined to form insulting compounds, based on a scan of Wiktionary’s English derogatory terms category. The terms covered a wide range of domains, including:

scatology (fart-, poop-); political epithets (lib-, Trump-); food (-waffle, -burger); body parts (butt-, -face, -head, -brains); gendered epithets (bitch-, -boy); animals (dog-, -monkey)


Most terms were limited to appearing in one position. For example, while -face readily forms pejorative compounds as a suffix, it fails to produce felicitous compounds as a prefix (facewad? faceclown? facefart?).

Taking the product of these lists gives around 4,800 possible A+B combinations. Most are of a pejorative character, though some false positives slipped in (e.g. dogpile, spitballs). I scraped all Reddit comments from 2006 to the end of 2020, and counted the number of comments containing each.
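
The counting step described above can be sketched roughly as follows. This is only an illustration, not the code from Morris's repo: the function names `build_compounds` and `count_comments` are hypothetical, the affix lists and comments are toy stand-ins, and whole-word, case-insensitive matching is an assumption (the original pipeline may treat plurals, hyphens, or spacing differently).

```python
import re
from collections import Counter
from itertools import product


def build_compounds(prefixes, suffixes):
    """Return every prefix+suffix concatenation (the Cartesian product)."""
    return [p + s for p, s in product(prefixes, suffixes)]


def count_comments(comments, compounds):
    """Count how many comments contain each compound at least once.

    Assumption: a compound counts only when it appears as a whole
    lowercase-alphabetic token; repeats within one comment count once.
    """
    targets = set(compounds)
    counts = Counter({c: 0 for c in compounds})
    for text in comments:
        tokens = set(re.findall(r"[a-z]+", text.lower()))
        for hit in tokens & targets:
            counts[hit] += 1
    return counts


# Toy affix lists and comments, invented for this example.
prefixes = ["butt", "poop", "dog"]
suffixes = ["face", "head", "pile"]
comments = [
    "What a buttface. Total BUTTFACE.",
    "dogpile on the poophead",
    "nothing to see here",
]

compounds = build_compounds(prefixes, suffixes)
counts = count_comments(comments, compounds)
```

Counting comments rather than occurrences matches the description above: the first toy comment uses "buttface" twice but contributes a single count. Note that "dogpile" is counted too, which is exactly the kind of false positive Morris mentions.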

Among other results, the graphical "Matrix of Pejoration" stands out:

And there's a GitHub repo including (quoted from the README.md):

counts.csv, a dataset mapping ~4,800 compound pejoratives to the number of Reddit comments containing that compound. (wikt.csv has the same format with a column recording whether the term has an entry in Wiktionary)

various Python scripts used to generate this dataset. This process is documented in the "Data pipeline" section below, though you probably don't need to run these unless you want to expand the dataset with a different set of affixes or time range.

various IPython notebooks exploring and visualizing the dataset (See "Guide to IPython notebooks" section below)


Colin Morris's blog post notes that the rank-frequency distribution of the compound terms follows Zipf's law:
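
For readers who want to poke at a claim like this themselves, one rough check is to regress log frequency on log rank and look at the slope, which comes out near -1 for a classic Zipfian distribution. The sketch below is an assumption-laden illustration, not Morris's analysis: `zipf_slope` is a hypothetical helper, the counts are synthetic, and a plain least-squares fit on log-log data is a crude diagnostic rather than a rigorous test of a power law.

```python
import math


def zipf_slope(counts):
    """Least-squares slope of log(frequency) against log(rank).

    Zero counts are dropped, since log(0) is undefined.
    """
    freqs = sorted((c for c in counts if c > 0), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two nonzero counts")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic counts that follow frequency = 1000 / rank exactly,
# invented for this example (not taken from counts.csv).
toy_counts = [1000 / rank for rank in range(1, 101)]
slope = zipf_slope(toy_counts)
```

To try this on the real data, one would pass in the comment counts from counts.csv; the exact column layout of that file is not shown here, so the loading step is left out.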

Colin Morris offers some additional descriptive insight about productivity:

But the principles governing the patterns of morph-combination — in the graphical matrix or the full dataset — are not so clear.

Exercise for the reader: What combination of morphology, syntax, semantics, analogy, and historico-cultural accident is responsible for which aspects of the observed distributions? How is this similar to or different from the patterns in other English complex nominals?

