
Million song dataset available for download - nreece
http://flowingdata.com/2011/02/24/million-song-dataset-available-for-download/
======
SingAlong
Direct link: <http://labrosa.ee.columbia.edu/millionsong/>

If you need an API as an alternative <http://developer.echonest.com/docs/v4/>

That's the Echo Nest API, which is where the dataset was collected from.

This is very interesting data and just looking at it has sent me into a
ponder-mode as to what to do with it.

Also, an interesting note on the EchoNest site about the API usage: _The Echo
Nest's APIs are for non-commercial purposes only. If you are a lone developer
selling advertising that is barely covering your running costs, we consider
that non-commercial_

------
petercooper
Once you click through, you find out the dataset is... 280GB. So you might want
to steer clear unless you have the ability to deal with that ;-) There's a
1.8GB subset also.

~~~
bane
This thing is dying to be distributed via torrent.

~~~
aw3c2
Not really. Interest in that file is very low, so there would be few seeders,
if any at all (think 12 months from now). Few people have the bandwidth to
transmit that much data quickly.

This is better off on a good HTTP host.

~~~
bane
BT doesn't have to provide for the long term; in the short term it can help
by not making the user download a bunch of small files.

------
duck
_We planned to release a R wrapper and looked at the default HDF5 library for
R on Ubuntu. Unfortunately, it crashes on empty arrays. This happens when a
track has no musicbrainz tag, for instance. If any R specialist is willing to
help us with this, please contact us!_

I'm sure someone from HN could help with that...
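The quote is about R's default HDF5 bindings choking on zero-length arrays. As a sanity check from the Python side, here is a minimal sketch of guarding against that empty-array case with h5py; the field name `musicbrainz/artist_mbtags` is taken from the dataset's per-track schema, but treat the exact path as an assumption.

```python
import os
import tempfile

import h5py
import numpy as np

def read_artist_tags(path):
    """Return the musicbrainz tags stored in an HDF5 track file,
    or an empty list if the track has none (the empty-array case
    that reportedly crashes the default R HDF5 library)."""
    with h5py.File(path, "r") as f:
        ds = f["musicbrainz/artist_mbtags"]  # assumed field path
        if ds.shape[0] == 0:                 # guard the empty case
            return []
        return [t.decode("utf-8") for t in ds[:]]

# Build a tiny stand-in track file with no tags at all.
path = os.path.join(tempfile.mkdtemp(), "track.h5")
with h5py.File(path, "w") as f:
    f.create_dataset("musicbrainz/artist_mbtags",
                     data=np.array([], dtype="S1"))

print(read_artist_tags(path))  # → []
```

The fix on the R side would presumably be the same shape: check the array's length before touching its contents.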

------
ajays
The site is down.

Can someone please post more about it? What's the schema for the data? How
clean is it? What are they trying to achieve with this release?

Thanks!

~~~
subspaceman
I dug around for some cached files and found the following:

Why:
[http://webcache.googleusercontent.com/search?q=cache:H52osT6...](http://webcache.googleusercontent.com/search?q=cache:H52osT6tLs8J:labrosa.ee.columbia.edu/millionsong/blog/11-2-8-why-million-song-dataset)

Getting it:
[http://webcache.googleusercontent.com/search?q=cache:0I8CMo0...](http://webcache.googleusercontent.com/search?q=cache:0I8CMo0ulJEJ:labrosa.ee.columbia.edu/millionsong/pages/getting-dataset+mirror+of+columbia+million+song+dataset&cd=3&hl=en&ct=clnk&gl=us&source=www.google.com)

from the second link I found it's hosted here:

<http://www.infochimps.com/collections/million-songs>

where you can download a subset as well as the whole thing. I'm downloading
the subset now, so I can't comment on the cleanliness or schema yet.

------
hammock
Site is down. If someone manages to get the metadata file and seeds it, please
post a link here!

------
danssig
If we assume the average song is 3 minutes, that's nearly 6 years of
continuous listening...
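The back-of-the-envelope arithmetic holds up; a quick check (3 minutes per song is the assumption from the comment):

```python
songs = 1_000_000      # tracks in the dataset
avg_minutes = 3        # assumed average song length

total_minutes = songs * avg_minutes
years = total_minutes / (60 * 24 * 365)  # minutes -> years
print(f"{years:.2f} years")  # → 5.71 years
```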

