Hacker News
Scrambling eggs for Spotify with Knuth's Fibonacci hashing (pncnmnp.github.io)
151 points by pncnmnp 10 months ago | 51 comments



The Fibonacci hashing mentioned here looks to be just a Kronecker low-discrepancy sequence https://en.wikipedia.org/wiki/Low-discrepancy_sequence#Addit... which is used for dithering too (see e.g. https://extremelearning.com.au/unreasonable-effectiveness-of...). It's quite likely already what Spotify were obliquely referring to. But I think the author missed a trick: tracks by an artist occur close together if your collection is already arranged by artist/album. Picking tracks far apart from the globally sorted list is exactly what this sequence does, so you don't need to do anything at all to split by artist/album/category etc:

1. for the Kth track in a sorted collection of N with seed S in [0,1), pick track number floor(N((K(phi-1)+S)%1)).

2. There is no step 2.

Since Spotify suggests their algorithm is only a couple of lines, I'd guess this is what they did.

edited to add: the above would get pretty boring because songs would always follow one another in sequence, no matter what the seed was. But since the chance of picking an irrational number at random in a real interval is a near certainty (because rationals are countable and reals are uncountable) you can just pick any new random number as the stepsize in the sequence when you start to shuffle play and it should be good enough; picking in [.25, .75) avoids steps that take you back too close to the same artist.
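A minimal sketch of step 1 in Python (not Spotify's actual code; the names `PHI_CONJ` and `kronecker_pick` are mine):

```python
import math

PHI_CONJ = (math.sqrt(5) - 1) / 2  # phi - 1, the irrational step size

def kronecker_pick(k, n, seed):
    """Map the k-th play to a track index in a sorted collection of n
    tracks: step around the unit interval by phi - 1, offset by seed."""
    return math.floor(n * ((k * PHI_CONJ + seed) % 1.0))
```

Note the result is not guaranteed to be a strict permutation: `floor` can occasionally put two nearby points of the sequence in the same bucket, so a track index may repeat.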


This is fantastic! I still need to carefully study it all, but I implemented the idea you described in the original comment. It seems the error lies somewhere between the algorithm I mentioned and the Fisher-Yates shuffle, based on one million playlists: Mean: 0.13, Median: 0.11, Mode: 0, p25: 0.06, p75: 0.18, p90: 0.26, p95: 0.31.

It's possible I'm missing something here. Regarding your edit about 'picking any new random number as the stepsize,' wouldn't it be affected by a 'bad break,' as mentioned by Knuth? I still need to work out the 'bad break' proof.

Edit: If it helps, here is the code I used to test it out - https://gist.github.com/pncnmnp/8afb7903f61ec69a157287435a63...


I don't have Knuth to hand...but yeah it is possible you can accidentally pick a number that is too close to a rational, so you get short repeats of tracks instead of long ones. What you're after is a number with a high approximation coefficient https://en.wikipedia.org/wiki/Liouville_number#Irrationality... but in practice you just need to know that it is irrational "enough", eg you might want to avoid repeats in the first 1000 plays.

You can get that by rejection sampling on the random number and using the Farey sequence to find nearby low-denominator fractions https://en.wikipedia.org/wiki/Farey_sequence#Applications - if I pick a number between 0 and 1 I can use 0/1, 1/1 as the starting point for the usual mediant iteration (and then scale to [.25, .75) at the end). You pick your approximation bound mu and reject if log(abs(p/q - x)) < -mu log q, i.e. if some low-denominator p/q approximates x too well; repeat the Farey iteration until q is large enough. (just making this up as I go on an iPad, I may have a sign wrong in there or whatever)
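One possible rendering of that rejection test (a sketch, with the sign as I read it: reject when a low-denominator fraction approximates x too well; the default `mu` and the mediant walk are my choices, not a tested spec):

```python
from fractions import Fraction

def step_survives(x, mu=3.2, q_max=1000):
    """Walk mediants (Farey-style) from 0/1, 1/1 toward x; reject x if
    some fraction p/q with q <= q_max satisfies |p/q - x| < q**(-mu),
    i.e. x is too well approximated by a low-denominator rational."""
    lo, hi = Fraction(0, 1), Fraction(1, 1)
    while True:
        med = Fraction(lo.numerator + hi.numerator,
                       lo.denominator + hi.denominator)
        if med.denominator > q_max:
            return True  # no dangerously good approximation found
        if abs(float(med) - x) < med.denominator ** -mu:
            return False
        if float(med) < x:
            lo = med
        else:
            hi = med
```

With mu around 3 the golden ratio conjugate survives (it is badly approximable), while a number like 0.5 is rejected on the first mediant.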

It is actually much simpler than this ^ to just explicitly check the first 1000 numbers for a loop. It's simpler than tortoise and hare: you know there is no run-in, so the first number is in the loop, and you want that number to not reappear in the first min(N,1000) items.
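The explicit check described here might look like this (a sketch, assuming the index formula from upthread; function name is mine):

```python
import math

PHI_CONJ = (math.sqrt(5) - 1) / 2

def step_is_ok(n, step, seed=0.0, limit=1000):
    """Reject a candidate step size if the very first track index
    reappears within the first min(n, limit) picks. Since the sequence
    has no run-in, the first index is already on the loop."""
    first = math.floor(n * (seed % 1.0))
    for k in range(1, min(n, limit)):
        if math.floor(n * ((k * step + seed) % 1.0)) == first:
            return False
    return True
```

A step of 0.5, for example, bounces between two buckets and is rejected almost immediately, while phi - 1 passes.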


The problem is separating tracks in the same group. If the globally sorted reference list of N songs has M consecutive entries that we want to spread evenly, they should be "shuffled" at increments of floor(N/M) or 1+floor(N/M): easy to guarantee with tortured recursive constructions like the one that opens the article, less obvious with a hash function that is affected by floating point approximations.

An integer formula that more obviously works: starting from any number between the maximum group size and (if larger) N/phi, pick an increment D as the next larger or smaller integer that has no common factors with N (to ensure full period) and map index K to index (S+K*D)%N.
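A sketch of that integer construction (my variable names; it rounds toward N/phi and ignores the group-size lower bound for brevity):

```python
from math import gcd

def integer_shuffle(n, seed):
    """Pick an increment d near n/phi that is coprime with n, so the
    map k -> (seed + k*d) % n has full period and visits every index
    exactly once."""
    d = max(2, round(n * 0.6180339887))  # start near n / phi
    while gcd(d, n) != 1:
        d += 1
    return [(seed + k * d) % n for k in range(n)]
```

Because gcd(d, n) == 1, the result is a genuine permutation of 0..n-1, avoiding the floating-point collisions of the real-valued version.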


To avoid the fixed cycle problem, for each shuffle, choose a seed, then sort by hash(artist + seed) + hash(album + seed) instead of just artist + album before you start jumping through the list using fibonacci hashing. Album tracks and artists still end up colocated, but their position in the list is randomized.
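One way to realize that seeded sort (a sketch with hypothetical track dicts; a tuple key on the two hashes keeps artist blocks contiguous):

```python
import hashlib

def seeded_order(tracks, seed):
    """Sort tracks by a seeded hash of artist, then album, so each
    artist/album block stays contiguous but lands at a random position
    in the reference list on every shuffle."""
    def h(s):
        return hashlib.sha256((s + seed).encode()).hexdigest()
    return sorted(tracks, key=lambda t: (h(t["artist"]), h(t["album"])))
```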


I was a tile installer in a previous life, and occasionally a customer would make the dreaded request to have an accent tile placed 'randomly' throughout a backsplash. They were never happy with the placement, and clearly had no understanding of randomness. Generally what they actually wanted was 'evenly distributed'. I remember one particular customer who kept changing the accent tiles and moving them until they converged upon a very specific pattern throughout the backsplash, except for one tile that was out of the pattern, which they kept 'wrong' out of some ego-driven desire to justify their request for randomness.


The perfect pattern with one wrong tile is so awful. It's a fair punishment that they have to live with that.

I want those non-repeating pattern tiles, how awful would those be to tile?


I have a playlist with 150 hours of music on it and it seems to play the same five or six songs every day, no matter what. I wish I could choose "actually random", I wouldn't mind unexpected clustering.


It's absolutely wild. I have over 50 playlists I've made, some over 10 hours long with no repetitions anywhere. I have extremely thorough and diverse music interests and habits.

And yet, when Spotify's shit tier algorithm takes over, it kicks me into a similar 5 to 6 meme set of songs every single time. It's an absolute joke.


Same. Out of 5,000 songs, am I really supposed to be hearing certain songs 3+ days in a row, when I'm only listening to 20 songs a day...?

I can't tell if it's:

- Certain artists are paying Spotify to favor them in randomization?

- There's some kind of shared random seed across devices that results in picking a tiny subset of songs to randomize from in the first place?

- Other?

I do notice that the effect seems to persist for maybe a week, then I'll never hear those songs again, but now it'll be different songs that keep popping up repeatedly.

There's a related effect when you launch a radio station based on an artist or track. If you launch it multiple times in the same day or week, you get the exact same list of tracks. But maybe a week later the tracks have changed, like the radio has been recalculated based on a different random seed.


There's maybe some counterintuitive math going on here.

If you have 100 songs and listen to 1 song per day (which is 1% of the library), on any given day your odds of hearing the same song as yesterday are 1 in 100.

If you have 1000 songs and listen to 10 per day (still 1%), the odds of hearing a song that was also played yesterday are a little less than 1 in 10, right?

So what matters is not only what fraction of your library's play time you sample daily, but also how finely subdivided the time is into individual tracks for sampling.


A bit of birthday paradox too? :)


> If you have 1000 songs and listen to 10 per day (still 1%), the odds of hearing a song that was also played yesterday are a little less than 1 in 10, right?

No. It's 10*(1/1000)=1%.

It'll happen a few times a year only.


The odds of hearing a specific song that was played yesterday is ~1%, the odds of hearing any song that was played yesterday is ~10%.


What are the chances of hearing the same three or four every day for a week though out of a set of a thousand? Certainly that must be quite unlikely.


No it's not. That's already accounted for by the "10*".

It's 1%. Any specific song is 0.1%.


That's not quite right. You need one 10 for the fact that you listened to ten songs yesterday, and another 10 for the fact that you're listening to ten songs today.

Assume that you got 10 unique songs yesterday, which is the case ~96% of the time. Then there are 990 songs you didn't hear yesterday, and for every song you listen to today, there is a 990/1000 chance that it's one of those songs. Hence the chance of only hearing new songs today is (990/1000)^10 = 90.4%.
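That estimate is easy to check numerically (treating today's picks as independent uniform draws, as above; function name is mine):

```python
def p_all_new(n, k):
    """P(none of today's k songs was among yesterday's k distinct
    songs), modeling each of today's picks as an independent uniform
    draw from the n-song library: ((n - k) / n) ** k."""
    return ((n - k) / n) ** k
```

`p_all_new(1000, 10)` comes out near 0.904, so a repeat from yesterday shows up roughly one day in ten, matching the ~10% figure upthread.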


This article is pretty insightful on the ways artists can improve their presence on Spotify, including images of the backend tools available to artists: https://blog.groover.co/en/tips/spotify-streams/


I was excited to get an insight into some hidden backend tools but it seems like this post is just about playlists (creating your own playlists, paying for placements in other people's playlists), and ads?


Out of curiosity, does it happen on some clients and not others?

I remember hearing of a bug where if you played on a remote device, it would transfer the first part of your playlist (10? 100 tracks?), and then shuffle would only choose from among them.

But it's been 5+ years so things may have changed and/or I could be remembering completely wrong.


That might be true, but my impression is that the algorithm is weighted, so some artists (more popular in general, recommended, or played more often by that individual user) would get picked more frequently by the "random" algorithm.


I wonder if this is market manipulation rather than an honest mistake, to make certain labels and artists more money.


As far as I can tell all Spotify's playlists (in all forms, e.g. including "random" and "track radio") use weighted algorithms based on several factors like user's play history, Spotify's recommendations (probably includes paid promotions), general artist's popularity, etc.

When I still used Spotify, I would get a dozen of my favorite artists mixed into basically any "playlist" I picked. That was one of the reasons I quit Spotify - they are too opinionated about what I should listen to.


It also seems to put anything I added recently on heavy rotation.


My "discovery" section is full of albums I have liked + downloaded. It's infuriating.


The effect is so strong that 3 of those songs became my top 5 most listened, despite me not actually wanting to listen to those songs THAT much.


It might think you really like those songs. I listen to my favorites every day, among others.


Eh, unexpected clustering is sort of okay, but again, it's what people respond to versus what they think they want. I've written some scripts so I can dredge playlists off of some radio stations that are hooked up to the Internet as part of a quixotic little project of mine, and I've gone through, looked for dupes, etc. Let's say we're doing new wave (sure to start an argument there). What people seem to dislike is ABC, The Buggles, the Cure, Duran Duran, Ebn Ozn, Fra Lippo Lippi, Duran Duran, etc. Just having a gap between them feels insufficiently random.

Clustering apparently ought to feel deliberate. Now think back to when you had actual DJs selecting tracks on the radio. One of the techniques was "Two from a particular band." Not two from a band with some tracks between them.

Similarly, one can do a "Four tracks from 1994" to provide a cluster in time, another technique I've heard.

If anything, the more metadata you have, the more you can provide short runs of something. Microgenres, for example.


This entire HN thread reads like some fiction made up by a free software activist 15 years ago, as a dystopian cautionary tale. It's all complaints and speculations about stuff that just worked back then.

Back then we had our music locally and we chose our own players, of which there were many, and it was easy to make another one. Actually, hacking on music playback was easy and not uncommon. We had full control of our musical lives.


Tangentially, I encourage everyone to check out Ken Thompson's talk on his jukebox: https://www.youtube.com/watch?v=kaandEt_pKw


Scrambx0red 3ggs by Cory Doctorow (2002)


Shame we can’t choose the type of shuffle based on your mood/what you’re listening to (not to add even more complexity).

e.g. For classical music I’d prefer stringing together pieces from the same orchestra/composer. But for some contemporary music I’d like to mix the artist/album up more.


I'd like one that plays on random and uses each song as an experiment to determine your mood. If you listen through, that's a positive signal; an instant skip is a negative signal. Just adapt on the fly at the beginning of each listening session.


Wasn't this how Pandora worked?

"The user can use thumbs up and thumbs down buttons to declare whether they like a track or not, which determines whether similar songs should be played in the station.[40] A second thumbs down to the same artist will ban that artist from the selected station.[41] A thumbs down immediately skips a song, but the number of times a user can skip tracks is limited unless they are using one of the paid subscription plans, or opts to watch a video ad.[42][43] More than 450 musical attributes are considered when selecting the next song.[44] These 450 attributes are combined into larger groups called focus traits, of which there are 2,000.[45] Examples of these are rhythm syncopation, key tonality, and vocal harmonies.[45]"

https://en.m.wikipedia.org/wiki/Pandora_(service)


I've never used Pandora before so I had no idea. Do they have a patent on the concept or something? It seems like such an obvious idea and yet the best Spotify can do is prepare a playlist based on taste rather than adapt on the fly.


Based on your comment, I would love to see a feature: just as we have prompt travel in image generation, how about genre travel? A playlist of 10 songs taking me from rock to French house.


I am not sure if the situation I describe is currently in production in Instagram.

But here's what I noticed when I went to the explore section of Instagram.

At display, there would be distinct choices of images and reels, varied and related to my interests. But if I select a particular reel/image type (e.g. animation or comics or 3d render), it would take that as a signal and would expand the feed based on that selection. I love that feature.

I guess Spotify Radio does that to some extent, not sure.


The song radio does that. Really well in my experience.


I was nerd-sniped by shuffling playlists a while back too and came up with this algorithm: https://ruudvanasseldonk.com/2023/an-algorithm-for-shuffling.... I've been using it since, my experience at this point is that the interleaving — although it does an optimal job of not playing the same artist twice in a row — is maybe a bit too non-random.


There was a delightful Usenet post way back around 1990, where someone described how they had just purchased a new multi-CD player. They very excitedly filled it up with their collection of Prince CDs, and set it to random shuffle play mode.

Great for a while, but then they complained that all the slow songs were bunched together, and wondered whether the random shuffle play mode was sampling the songs, deriving the tempo of each, and adjusting the shuffle accordingly.

Very funny.

---

Heh-heh, I independently came up with Fibonacci hashing for color many years ago.

My web app was drawing a diagram of N rectangular items, color-coded to tell them apart, with a table listing the details of each below.

(Normally I would use EIA standard colors, with a nod to my EE brethren.)

But I didn't want the colors to bias anything. So you'd normally try random colors. But random colors can come out weird and some can be close together.

So I used a Golden Angle around the hue circle, with a constant brightness and saturation. And sure enough, the generated colors were nicely differentiated.
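That construction is the hue-circle analogue of the track stepping above; a sketch (constant and function names are mine):

```python
GOLDEN_ANGLE = 137.50776405003785  # 360 * (2 - phi), in degrees

def golden_hues(n, start=0.0):
    """n hues stepped around the color wheel by the golden angle;
    combine each with a fixed saturation and lightness to get a set
    of well-separated colors."""
    return [(start + k * GOLDEN_ANGLE) % 360.0 for k in range(n)]
```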

BUT... not as nice as I'd like. Something was wrong.

It turns out that our perception of color is more complex. And when we're differentiating between colors, it really, really helps if the colors are familiar, and describable.

So simple colors like blue and purple are much easier to differentiate than a new weird blueish color and a new weird purplish color.

So my Golden Angle colors were technically superior, but not as good a user experience.


My biggest desire for Spotify would be the ability to randomize a playlist but give massive priority to your least-listened songs.

Say you have a playlist with 10 songs A-J that you've listened to 10 times each, and then you add 10 new songs K-T that you've listened to once each. Every time I shuffle the playlist, I want songs K-T to be the first 10 songs, in random order, until I've played them 10 times each.

I mean, things can be mixed up a little more than that... but generally speaking, I want to listen to my least-listened songs much more than the ones I've been listening to forever. But I don't want to have to separate them out into special playlists "newest", "newer", "kinda new", "old", "oldest" which is annoying.
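One way to sketch that without extra playlists (hypothetical play-count map; a stable sort after shuffling keeps the order random among equal counts):

```python
import random

def least_played_first(tracks, plays):
    """Shuffle, then stable-sort by play count: least-listened tracks
    come first, in random order among tracks with equal counts."""
    order = list(tracks)
    random.shuffle(order)               # randomize within equal counts
    order.sort(key=lambda t: plays[t])  # stable sort: fewest plays first
    return order
```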


Slightly unrelated but just learned about Hacker News pool. Very interesting.


Spotify’s shuffling is so god-tier bad that I would be flabbergasted if it wasn’t biased towards songs with lower royalties to pay out.


I'm convinced something's wrong with Spotify's randomization.

When I press the big green play button in a playlist's page, it tends to start playing from a specific song, in a weirdly familiar order. So I manually pick a song at random and let it play from there.

But even so, some songs almost never get played, but others get played fairly frequently.


I wouldn't mind the same artist a couple of times in a row. It's what I would arrange for myself; it's just that the mood of the songs being consistent with the theme is important.


What did the iPod shuffle use?


Hi, author here! Interesting question!

Martin Fiedler briefly addresses this topic in a comment on his blog post about shuffling algorithms (https://keyj.emphy.de/balanced-shuffle/)

> Apple has a so-called “smart shuffle” algorithm, but this merely puts higher-rated tracks in front of lower-rated ones. So basically, it’s just random shuffle, followed by a sort-by-rating operation. I don’t know of any product (software or hardware) that uses some kind of smart, balanced or whatever-you-like-to-call-it shuffle based on the principles I described in the text.

I'm not sure what they are using today.


What kind of categorization did they use to keep it from just becoming a sort? By artist?


I just found this 2009 research paper written by a group of statisticians titled 'Does Your iPod Really Play Favorites?' (https://www.researchgate.net/profile/Jessica-Culhane/publica...).

In this paper, they test different probability models to detect bias in iPod's shuffling algorithm and eventually conclude that:

> Our statistical tests show the long-term occurrences of these events are within expectations under the assumption of a random shuffle.

Regarding sorting by artists or groups, they found that:

> We failed to find any evidence to support the claim of users like Steven Levy of favoritism of certain groups in the shuffle.


No, I mean if you order from most to least popular it's always the same. Did they pick a random artist then sort the artist's tracks by popularity?


I wish they had a setting to go back to the Fisher-Yates shuffle.



