More on Juola's stylometry
Worth reading if you were interested in the computational stylometric analysis by Patrick Juola that helped to unmask J. K. Rowling as the author of The Cuckoo's Calling: an article in The Chronicle of Higher Education about Juola's work.
I first encountered Juola when he was a young assistant professor and had just hit upon a delightfully simple technique for solving a problem in linguistics: deciding whether it is justified to say that one language has absolutely more complex and difficult inflectional morphology than another. Basically, he realized that you could just type some paradigms into files (one file per language) and run the files through one of the ordinary off-the-shelf file compression algorithms that people used to use to save disk space or compress email attachments. Because those algorithms look for redundancy as a way to summarize file content more compactly, the files for languages with greater regularity in their morphology would shrink more. The Sanskrit paradigm file would shrink less than the Latin one, the Latin one would shrink less than the Spanish one, and so on. Cute, I thought at the time.
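For the curious, the compression trick can be sketched in a few lines of Python. This is only an illustration of the general idea, not Juola's actual procedure: `zlib` stands in for whatever off-the-shelf compressor he used, the function names `compression_ratio` and `toy_paradigms` are made up for this sketch, and the "paradigms" are synthetic nonsense stems with invented endings rather than real Sanskrit, Latin, or Spanish data.

```python
import random
import string
import zlib


def compression_ratio(text: str) -> float:
    """Return compressed size / original size for the UTF-8 bytes of text.

    A lower ratio means the file shrank more, i.e. the compressor
    found more redundancy in it.
    """
    data = text.encode("utf-8")
    if not data:
        raise ValueError("cannot compute a ratio for empty text")
    return len(zlib.compress(data, 9)) / len(data)


def toy_paradigms(n_stems: int = 40, seed: int = 0) -> tuple[str, str]:
    """Build two synthetic paradigm 'files' of equal length (hypothetical data).

    The first attaches the same fixed endings to every stem (fully regular).
    The second gives every stem its own random endings of the same lengths
    (fully irregular). Neither represents a real language.
    """
    rng = random.Random(seed)
    endings = ["o", "as", "a", "amos", "ais", "an"]  # invented, fixed set

    def nonsense(length: int) -> str:
        return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))

    regular_lines, irregular_lines = [], []
    for _ in range(n_stems):
        stem = nonsense(5)
        regular_lines.append(" ".join(stem + e for e in endings))
        irregular_lines.append(
            " ".join(stem + nonsense(len(e)) for e in endings)
        )
    return "\n".join(regular_lines), "\n".join(irregular_lines)


if __name__ == "__main__":
    regular, irregular = toy_paradigms()
    print("regular   paradigm ratio:", round(compression_ratio(regular), 3))
    print("irregular paradigm ratio:", round(compression_ratio(irregular), 3))
```

The regular file should come out with the lower ratio, since the compressor can reuse the recurring endings. With real paradigm files you would presumably want to control for things like file length and orthography before reading much into the numbers.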
Today Juola is a full professor and runs a private consulting firm that will (for a fee) run computational checks on much trickier aspects of natural language, like whether a novel is more like one mystery author than another, or whether a man who claims to be the author of a certain anonymous article did in fact write that article. Genuine applied computer analysis of natural language material, serving real purposes in the real world. Linguistics with its shirtsleeves rolled up, as the computational linguist Geoffrey Sampson once put it.
The article contains a reference to a computer scientist who has started work on techniques for fighting author identification: programs for stripping your prose of the key identifiers that would have revealed it as yours, so that your anonymous publications can remain anonymous. The arms race of competing author-identification and author-concealment algorithms has begun!