
Million Song Dataset - commons-tragedy
http://millionsongdataset.com/
======
devinplatt
I did my master's thesis (2017) using this dataset. I trained a neural network
to predict musical features from the raw audio of the songs.

Unfortunately the 7digital ids were out-of-date, so in order to get access to
the audio (30 second clips) I had to email another researcher who'd recently
published work using the audio data and politely ask them to rsync me the
audio XD

------
sundarurfriend
For anyone else who was curious about the song selection process:

<quote>

How did you choose the million tracks?

Choosing a million songs is surprisingly challenging. We followed these steps:

1. Getting the most 'familiar' artists according to The Echo Nest, then
   downloading as many songs as possible from each of them
2. Getting the 200 top terms from The Echo Nest, then using each term as a
   descriptor to find 100 artists, then downloading as many of their songs as
   possible
3. Getting the songs and artists from the CAL500 dataset
4. Getting 'extreme' songs from The Echo Nest search params, e.g. songs with
   highest energy, lowest energy, tempo, song hotttnesss, ...
5. A random walk along the similar artists links starting from the 100 most
   familiar artists

The number of songs was approximately 8950 after step 1), step 3) added around
15000 songs, and we add approx. 500000 songs before starting step 5. For more
technical details, see "dataset creation" in the "code" tab.

</quote>[1]

What I really wanted to know was if it was a worldwide-music dataset or a more
narrowly focused one. My guess based on the above is that it's mostly English-
language music, mostly American - can someone who's worked with the data
confirm/deny that?

[1] [http://millionsongdataset.com/faq/#how-did-you-choose-millio...](http://millionsongdataset.com/faq/#how-did-you-choose-million-tracks)

------
ATsch
I think this dataset is unfortunately probably most famous for an incredibly
flawed but very headlineable attempt, which you've almost certainly seen
somewhere, by a group of researchers from AI and related fields (none of whom
had musical qualifications) to "objectively" determine if music has gotten
worse over the decades. As usual, they arrived at the conclusion that it had
by computing vague numbers ("timbral diversity" and "harmonic complexity")
for each song, and then showing a scary graph of those numbers going down
over time.

Not sure what my point is exactly. I guess it's just another reminder of how
easy it is to arrive at any conclusion you want using complex algorithms on
big data.

~~~
robryk
For virtually any kind of art, newer artworks will be on average worse than
_surviving_ older artworks: the older artworks have undergone a selection
process that weeded out the less popular ones.

This is IMO a more fundamental problem with any comparisons across long time
intervals.
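
A toy simulation of that selection effect, as a rough sketch only. It assumes every work gets a single "quality" score drawn from the same distribution in both eras, and that only the top few percent of older works survive; the function name, the 5% survival rate, and the score model are all made up for illustration.

```python
import random


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def simulate(n_works=10_000, keep_fraction=0.05, seed=0):
    """Compare surviving old works with all new works.

    Both eras draw scores from the same normal distribution, so any
    difference in the averages comes from the selection step alone.
    """
    rng = random.Random(seed)
    old = [rng.gauss(0, 1) for _ in range(n_works)]
    new = [rng.gauss(0, 1) for _ in range(n_works)]

    # Assumption: only the best-scoring fraction of old works is remembered.
    k = max(1, int(n_works * keep_fraction))
    surviving_old = sorted(old, reverse=True)[:k]

    return mean(surviving_old), mean(new)


old_avg, new_avg = simulate()
print(f"surviving old works: {old_avg:.2f}, all new works: {new_avg:.2f}")
```

The surviving old works come out well ahead even though both eras were generated identically, which is the bias in question.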

~~~
BurningFrog
True for a naive approach, but I'd think looking at the chart toppers for,
say, each week would give decently comparable data.

~~~
pocketcheese
I actually did something similar! I wrote a small blog post about it here:
[https://www.popnalysis.com/blog/lyrics-over-time/](https://www.popnalysis.com/blog/lyrics-over-time/)

But I came to the same conclusion as the OP's comment. Music can't really be
judged by any one metric! But that doesn't mean that you don't gain insight
into how music has evolved!

Also, I released my entire scraped lyrics dataset (about 500k) for free.

------
craze3
I was so excited, until I realized the list hadn't been updated in 8 years...
:|

~~~
nerdponx
Probably still good if you want to practice with machine learning on music.

------
eruci
I geolocated the artist list:
[https://geocode.xyz/874101666267029,share?export=GeoCluster](https://geocode.xyz/874101666267029,share?export=GeoCluster)

Next up, geoparse all song lyrics and compare those locations to the artists'.

------
carlosaguayo
That's cool, perhaps also link it from a centralized location like Kaggle?
[https://www.kaggle.com/datasets](https://www.kaggle.com/datasets) When I look
for datasets, I first go to a place like that.

~~~
badrequest
Found this dataset on Kaggle:
[https://www.kaggle.com/c/msdchallenge/data](https://www.kaggle.com/c/msdchallenge/data)

------
psalminen
Used this dataset in a university project to try to predict genres from a
number of features.

[https://medium.com/modeling-music](https://medium.com/modeling-music)

------
vagab0nd
In case this is not clear: the dataset does not include the raw audio of the
songs, just extracted features.

------
glouwbug
Does this data set have a collection of chords in text? I'd love something
like it.

~~~
disgruntledphd2
The dataset is a bunch of SQLite files, so shouldn't be too tricky to
interrogate.
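
For what "interrogating" a SQLite file might look like, here is a minimal sketch using Python's built-in `sqlite3` module. The `songs` table, its columns, and the rows are placeholders invented for this example, not the dataset's confirmed schema; with a real file you would pass its path to `sqlite3.connect` and start by listing the tables.

```python
import sqlite3

# In-memory stand-in for a real .db file; schema and rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE songs ("
    "track_id TEXT PRIMARY KEY, title TEXT, artist_name TEXT, year INTEGER)"
)
conn.executemany(
    "INSERT INTO songs VALUES (?, ?, ?, ?)",
    [
        ("TR0001", "Placeholder Song A", "Placeholder Artist X", 1999),
        ("TR0002", "Placeholder Song B", "Placeholder Artist Y", 1999),
        ("TR0003", "Placeholder Song C", "Placeholder Artist X", 2005),
    ],
)


def list_tables(connection):
    """Return the table names in the database, sorted alphabetically."""
    rows = connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [row[0] for row in rows]


def songs_per_year(connection):
    """Count rows in the (assumed) songs table, grouped by year."""
    return connection.execute(
        "SELECT year, COUNT(*) FROM songs GROUP BY year ORDER BY year"
    ).fetchall()


print(list_tables(conn))
print(songs_per_year(conn))
```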

You can get a subset (static.echonest.com/millionsongsubset_full.tar.gz),
which is 1.8 GB compressed. The full dataset is 280 GB and, AFAICT, does
_not_ contain the full audio.

There's a script on GitHub from like 8 years ago that apparently can get you
the audio (but I would be super-impressed if that actually still works).

You can see a description of one song here:
[http://millionsongdataset.com/pages/example-track-descriptio...](http://millionsongdataset.com/pages/example-track-description/)

~~~
pontifier
It looks like they have data about many of the individual notes in the song. I
wonder if it would be possible to turn that data back into some sort of
horrible MIDI version.
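
A very rough sketch of how that could start. It assumes each segment comes with a start time and a 12-value chroma vector (one strength per pitch class), picks the strongest pitch class per segment, and maps it to a MIDI note number in the octave starting at middle C (60). The function name, the octave choice, and the toy input are all made up here, and actually writing a `.mid` file would need an extra library.

```python
def segments_to_notes(starts, chroma, track_end, base_note=60):
    """Turn per-segment chroma vectors into (start, duration, midi_note) tuples.

    starts    -- segment start times in seconds, ascending
    chroma    -- one 12-element list per segment (strength per pitch class)
    track_end -- end time of the track in seconds, closes the last segment
    base_note -- MIDI number for pitch class 0 (60 = middle C, an arbitrary choice)
    """
    notes = []
    for i, (start, vector) in enumerate(zip(starts, chroma)):
        # A segment lasts until the next one starts, or until the track ends.
        end = starts[i + 1] if i + 1 < len(starts) else track_end
        # Keep only the single strongest pitch class: lossy, hence "horrible".
        pitch_class = max(range(12), key=lambda p: vector[p])
        notes.append((start, end - start, base_note + pitch_class))
    return notes


def fake_chroma(strongest):
    """Toy chroma vector with one dominant pitch class (illustration only)."""
    return [1.0 if p == strongest else 0.1 for p in range(12)]


# Three made-up segments whose strongest pitch classes are C, E and G.
example = segments_to_notes(
    starts=[0.0, 0.5, 1.25],
    chroma=[fake_chroma(0), fake_chroma(4), fake_chroma(7)],
    track_end=2.0,
)
print(example)
```

Octave information is lost with chroma alone, so everything lands in one octave and chords collapse to a single note.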

