"Heavy Metal and Natural Language Processing – Part 1", Degenerate State 4/20/2016:
Natural language is ubiquitous. It is all around us, and the rate at which it is produced in written, stored form is only increasing. It is also quite unlike any sort of data I have worked with before.
Natural language is made up of sequences of discrete characters arranged into hierarchical groupings: words, sentences and documents, each with both syntactic structure and semantic meaning.
Not only is the space of possible strings huge, but the interpretation of a small sections of a document can take on vastly different meanings depending on what context surround it.
These variations and versatility of natural language are the reason that it is so powerful as a way to communicate and share ideas.
In the face of this complexity, it is not surprising that understanding natural language, in the same way humans do, with computers is still a unsolved problem. That said, there are an increasing number of techniques that have been developed to provide some insight into natural language. They tend to start by making simplifying assumptions about the data, and then using these assumptions convert the raw text into a more quantitative structure, like vectors or graphs. Once in this form, statistical or machine learning approaches can be leveraged to solve a whole range of problems.
I haven't had much experience playing with natural language, so I decided to try out a few techniques on a dataset I scrapped from the internet: a set of heavy metal lyrics (and associated genres).
[h/t Chris Callison-Burch]