The "least metal words" almost seems like a challenge. I now feel strangely compelled to write metal songs with those words, and I haven't written a metal song in 20 years. I may not be particularly metal anymore.
Carnivore (Pete Steele's band before Type O Negative) also did it in 1987 (although it's probably not going to be classified as a metal song unless your name is Marcel Duchamp or Tracey Emin): https://www.youtube.com/watch?v=eGDTrWw_APM
The lyrics for the first couple of Carcass albums were pulled from medical texts, so I think lyrical content is pretty flexible. But the non-metal words almost sound like a very boring BBC program about people in an office in the 19th century; like EastEnders with stuffy accountants. It'd be an interesting challenge.
> In the face of this complexity, it is not surprising that understanding natural language, in the same way humans do, with computers is still a unsolved problem.
I have the feeling it is not just an unsolved problem, but also an undefined problem.
I noticed there were song data errors in places. For example, in this chart (http://www.degeneratestate.org/static/metal_lyrics/clusters....), the most representative songs for Tiamat include White Pearl, Black Oceans, which is a song by Sonata Arctica. It also lists 3 versions of the same song (Elan) among the most representative songs for Nightwish. Similar mistakes are made for various bands in that chart (Symphony X, Hammerfall, Therion, Sabaton, Stratovarius, Helloween, Within Temptation, etc.).
This was interesting to read in general though, especially as someone who listens to metal quite frequently (mostly of the power metal variety).
He mentioned that in the article:
> What's interesting is that while the most representative songs for each band are mostly their own songs, occasionally other bands songs creep in. For example, "Wrathchild", is an Iron Maiden song, not a Diamondhead song.
I don't think it's a problem with the song data. When the algorithm picks a song it considers most lyrically representative of a band, it chooses from all the songs in the dataset, and it doesn't always pick a song by that band.
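To illustrate how that can happen, here is a rough sketch. It assumes a plain bag-of-words profile per band and cosine similarity, which may not be what the article actually uses; the function names and the toy bands "A" and "B" are made up for the example.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lower-cased word counts for one piece of text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_representative(band, songs):
    """songs is a list of (band, title, lyrics) tuples.

    The band profile is built only from that band's songs, but the
    candidates are *all* songs in the dataset (assumed behaviour).
    """
    profile = Counter()
    for song_band, _, lyrics in songs:
        if song_band == band:
            profile.update(bag_of_words(lyrics))
    return max(songs, key=lambda s: cosine(profile, bag_of_words(s[2])))


# Hypothetical toy data: band B's song covers all of band A's vocabulary.
songs = [
    ("A", "First", "fire steel"),
    ("A", "Second", "blood thunder"),
    ("B", "Other", "fire steel blood thunder"),
]
winner = most_representative("A", songs)
print(winner[0], winner[1])  # B Other
```

In this toy case the song by band B matches band A's overall vocabulary better than either of A's own songs, so it wins without any error in the data.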
That list of most and least metal words, if I didn't know where they were from, I'd have guessed was the 19th-C. Romantic reaction to the Enlightenment (except for a few anachronisms like 'gonna'). How metal was Edgar Allan Poe?
There's also Ahab, who based a fantastic album on the narrative of Arthur Gordon Pym[0]. It's sort of halfway between funeral doom and your Isis/Neurosis style stuff.
There is something to be said about song structure. Most songs will have a chorus, which is repeated throughout the song. Choruses will generally have higher 'readability' because they are meant to be more memorable. If you don't pre-process your corpus by removing repeating paragraphs, then your comparisons to the Brown corpus are not as valid, since Brown doesn't use repetition.
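A minimal sketch of that pre-processing step, assuming paragraphs are separated by blank lines and that a repeat counts as identical text after lower-casing and collapsing whitespace (the function name is hypothetical, and near-identical choruses would slip through):

```python
def drop_repeated_paragraphs(lyrics):
    """Keep only the first occurrence of each paragraph."""
    seen = set()
    kept = []
    for paragraph in lyrics.split("\n\n"):
        # Normalise case and whitespace so trivial variations still match.
        key = " ".join(paragraph.lower().split())
        if not key or key in seen:
            continue
        seen.add(key)
        kept.append(paragraph.strip())
    return "\n\n".join(kept)


example = "Verse one\nline\n\nChorus here\n\nVerse two\n\nchorus  here"
print(drop_repeated_paragraphs(example))
```

Here the second chorus is dropped and the two verses plus the first chorus are kept.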
I believe that the convention on darklyrics is to either elide or replace with "[chorus]" all repetitions of the chorus after the first. A few spot checks seem to confirm this, but I can't find an official policy on it.
I'm surprised no difference between American and European metal was found. I feel like they are quite different, but this data doesn't seem to support that.
I found it amusing that Venom and Running Wild are grouped together in the first step. But well, by the lyrics it fits. The rest matches my expectations surprisingly well.
I wonder if the clustering method can be used/is used by apps like Spotify to create a list of "related bands", as the graph at the end was fairly accurate.