For me, Spotify is a subpar experience because I don't know what I want and my music tastes differ from my friends'. The only thing that makes Spotify useful for me is sites like http://sharemyplaylists.com/. I imagine that for people who know exactly what they want and just treat it as a giant jukebox, it is a fantastic service. I am not one of those people.
Napster, for me, was about finding people as much as music. I was part of a community of people who were super stoked to talk about some new indie band. I would search, notice who had the music I was looking for, send them a message and start talking.
I met a lot of people in real life through Napster, often at shows. I know others did as well, since we received thank-you letters and, in a couple of cases, wedding announcements.
I think the community, with its passion for music, was what made Napster great, not the massive catalog. All I had to do was enter one of the many indie channels and read for half a minute before I had three new things to listen to and a bunch of people to talk about them with.
That's not to discount the catalog; the Napster catalog may well never be duplicated. There were a lot of back-catalog works, a lot of pre-release works and a ton of bootlegs. Sure, the quality was sometimes poor, but then I used Napster as a tasting service and, you know, bought the originals when I could.
My hope is that Spotify continues to improve and eventually becomes more community oriented, or that something else comes along and re-ignites the flame. Music is central to a lot of people's lives, and you couldn't ask for a more passionate user base.
Can you talk about that a bit? What was the underlying algorithm, what was the stack and how much data were you pushing at your max?
The search engine was built on a ternary tree with a custom merging algorithm. I honestly don't know if the merging algorithm has a name as it was something I came up with (literally) while sleeping one night. Because we mostly used ID3 tags and file names, it was completely unnecessary to use a stemming algorithm because if you typed in a misspelled search, given the size of our index, there was probably someone who tagged their file using the same misspelling.
The network biasing code used BGP data combined from a number of looking glass servers to build a map of IP prefix -> ASN and of ASN -> ASN distances. It was then used to reorder search results by network distance so they would bias toward users' own networks, saving ISPs money and speeding up transfers on broadband connections.
Servers were linked through a fully meshed network. Each had presence information about every user on the network so that they could route IMs around. The chat system was semi-linked (fully linked on some servers, but we couldn't fully link the whole thing because the client had no administrative functions for chat). If we couldn't send the user back enough results for a search, the query was simply passed around the backend.
The whole thing was written in C++. At its peak, there were about 2.3 million users online at any given time (80 million total users growing by a million every 4 days). The system would be indexing about 17.6 million files per second (and de-indexing about the same amount). The whole system pushed out about 2 Gbps of bandwidth in search results (which were tiny).
Napster was one of the very first services to push past 10K connections on a single Linux machine. At peak, I could get over 100K users on a single process (though I'd run out of memory indexing files on the tiny 2 GB machines and blow out the NIC sending search results). During normal operations, each server process had around 40K users on it and between 7 and 12 million files indexed.
There were a bunch of side infrastructure things no one saw: court-mandated copyright filtering systems, recommender systems (mostly for play), load-balancing servers, bot detection and sequestration systems, analytics reporting jobs, etc.
Nowadays, I could probably fit all of Napster on one big machine. Heh.
I'm particularly pleased by the use of ASN/distancing weights in your results! None of the BT trackers I've hacked around on have had anything like that in them.
Where is the code now? It should be in a museum.
I once hacked up Transmission to re-prioritize peers based on network distance. It worked rather well when it discovered peers over the DHT. It is a lot easier when you only have to calculate the distance between yourself and other networks; storing the graph of distances between two arbitrary users is harder.
The code was part of the assets that were bought out of bankruptcy by Roxio (who renamed themselves Napster). I doubt it is being used for anything.
"...it was completely unnecessary to use a stemming algorithm because if you typed in a misspelled search, given the size of our index, there was probably someone who tagged their file using the same misspelling."
Seems to me that the Spotify API opens up all sorts of possibilities for music discovery.
Sharemyplaylists.com is basically what makes Spotify useful for me at all. I still can't talk to anyone, but at least I can have a playlist that sounds roughly like the songs belong together. I'll rarely discover new music on it.
For music discovery, I find http://hypem.com and mp3 blogs to be far more useful. Still no strong community, but at least I actually discover new interesting things.