Ex-physicist takes on Heavy Metal NLP
"Heavy Metal and Natural Language Processing – Part 1", Degenerate State 4/20/2016:
Natural language is ubiquitous. It is all around us, and the rate at which it is produced in written, stored form is only increasing. It is also quite unlike any sort of data I have worked with before.
Natural language is made up of sequences of discrete characters arranged into hierarchical groupings: words, sentences and documents, each with both syntactic structure and semantic meaning.
Not only is the space of possible strings huge, but the interpretation of a small section of a document can take on vastly different meanings depending on the context that surrounds it.
This variation and versatility are what make natural language so powerful as a way to communicate and share ideas.
In the face of this complexity, it is not surprising that getting computers to understand natural language the way humans do is still an unsolved problem. That said, an increasing number of techniques have been developed to provide some insight into natural language. They tend to start by making simplifying assumptions about the data, and then using these assumptions to convert the raw text into a more quantitative structure, like vectors or graphs. Once in this form, statistical or machine learning approaches can be leveraged to solve a whole range of problems.
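To make the "raw text into a more quantitative structure" step concrete, here is a minimal bag-of-words sketch in Python using scikit-learn. It illustrates the general idea rather than the quoted post's actual pipeline, and the two example documents are invented placeholders, not lyrics from the dataset.

```python
# Minimal bag-of-words sketch: raw text -> count vectors.
# The "documents" below are invented placeholders, not data from the post.
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "ride the lightning into the night",
    "the night is dark and full of thunder",
]

# Simplifying assumption: a document is just an unordered bag of words.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents)   # sparse matrix, one row per document

print(sorted(vectorizer.vocabulary_))     # the learned vocabulary
print(X.toarray())                        # per-document word counts
```

Once the text is in matrix form like this, standard statistical and machine-learning tools (clustering, classification, dimensionality reduction) can be applied directly.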
I haven't had much experience playing with natural language, so I decided to try out a few techniques on a dataset I scraped from the internet: a set of heavy metal lyrics (and associated genres).
[h/t Chris Callison-Burch]
Magnus said,
July 4, 2016 @ 3:28 am
There's some interesting stuff here, especially the clusters. But the Brown corpus is a bad choice for this as far as reference corpora go. It would be interesting to see a follow up with a slightly more sophisticated model and a more suitable reference.
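For anyone tempted to try the follow-up Magnus suggests, here is a rough sketch of the kind of reference-corpus comparison at issue, with NLTK's Brown corpus standing in as the (criticised) reference. The target tokens are invented placeholders and the frequency-ratio statistic is a simplification, not the original post's exact measure; swapping in a more suitable reference corpus only means replacing `reference_words`.

```python
# Rough sketch of a reference-corpus frequency comparison (not the post's code).
from collections import Counter
import nltk

nltk.download("brown", quiet=True)
from nltk.corpus import brown

reference_words = [w.lower() for w in brown.words()]   # the (criticised) reference corpus
metal_words = ["fire", "death", "the", "night"]        # placeholder target tokens, not real data

ref_counts = Counter(reference_words)
ref_total = len(reference_words)

def relative_frequency_ratio(word, target_counts, target_total):
    """How much more common a word is in the target corpus than in the reference."""
    target_freq = target_counts[word] / target_total
    ref_freq = (ref_counts[word] + 1) / (ref_total + 1)  # add-one smoothing to avoid /0
    return target_freq / ref_freq

target_counts = Counter(metal_words)
for word in target_counts:
    print(word, relative_frequency_ratio(word, target_counts, len(metal_words)))
```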
david said,
July 4, 2016 @ 11:48 am
Many (most? google?) people associate NLP with Neurolinguistic Programming.
Terry Hunt said,
July 4, 2016 @ 11:57 am
I'm intrigued that, of the metal bands in the dataset that I like and am indeed quite familiar with, all fall within the four contiguous and most heavily populated squares of the first plot.
Allowing for the obvious population-density bias, this suggests some non-obvious correlation between my musical tastes and lyrical styles (in terms of that plot); or, to put it the other way around, bands whose lyrical styles are outliers on this plot are also unlikely to appeal to me on musical grounds.
Or am I overlooking something obvious?
J.W. Brewer said,
July 5, 2016 @ 9:47 am
I agree that the Brown corpus is a suboptimal baseline. I guess the question is whether a better baseline corpus already exists or whether one would instead need to be created. Assuming that what one would want is a corpus of popular song lyrics in mainstream/non-metal genres, I believe there have been some studies that purported to show the rise/fall of various themes (or at least lexemes taken to be evidence of particular themes) over time in "top 40" or similarly-measured hit songs. Maybe one of those studies has a suitable baseline/reference corpus, although whoever built it might have the same hesitation this researcher did (I assume for copyright-infringement-liability-exposure reasons) about being willing to share it.
Rod Johnson said,
July 5, 2016 @ 3:52 pm
Many (most? google?) people associate NLP with Neurolinguistic Programming.
That particular strain of pseudoscience has been in decline in recent years, and natural language processing has blossomed. It would surprise me if many people in the linguistics community even remember it, if they've ever heard about it.
J.W. Brewer said,
July 5, 2016 @ 6:34 pm
The google books n-gram viewer says that "natural language processing" was consistently more common during the time frame it covers (ending in 2008) than "neurolinguistic programming" and "neuro linguistic programming" (the software is hyphen-averse) combined. Although it's at least possible that the hits for the former were more heavily concentrated in specialist texts and the latter sense was thus at least vaguely familiar-sounding to a wider audience.
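For the curious, that comparison can be rough-sketched programmatically. The JSON endpoint below is undocumented (it simply mirrors the viewer's own URL parameters), so the parameter names and corpus identifier are assumptions that may break without notice; the phrase variants follow the commenter's hyphen-free forms.

```python
# Hedged sketch: query the n-gram viewer's undocumented JSON endpoint.
# Parameter names and the corpus identifier are assumptions based on observed
# viewer URLs and may change without notice.
import requests

def ngram_frequencies(phrase, year_start=1900, year_end=2008):
    resp = requests.get(
        "https://books.google.com/ngrams/json",
        params={
            "content": phrase,
            "year_start": year_start,
            "year_end": year_end,
            "corpus": "en-2019",   # assumed corpus identifier
            "smoothing": 0,
        },
        timeout=30,
    )
    resp.raise_for_status()
    data = resp.json()
    return data[0]["timeseries"] if data else []

nlp = ngram_frequencies("natural language processing")
neuro = [a + b for a, b in zip(
    ngram_frequencies("neurolinguistic programming"),
    ngram_frequencies("neuro linguistic programming"),
)]
print(sum(nlp) > sum(neuro))   # crude overall comparison across the whole range
```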