
How Spotify ran a large Google Dataflow job for Wrapped 2019 - jhatax
https://labs.spotify.com/2020/02/18/wrapping-up-the-decade-a-data-story/
======
spyke112
This they can do, but you can't change your display name unless you hook up
with Facebook. [0]

[0] [https://community.spotify.com/t5/Live-Ideas/Account-
Change-U...](https://community.spotify.com/t5/Live-Ideas/Account-Change-
Username/idi-p/703799)

~~~
Hasnep
You can make a new account and contact the support team who can transfer
almost all your info to the new account including followers and playlists. Not
ideal though.

~~~
spyke112
I find the biggest value of Spotify is the listening history they have
gathered on me throughout the years. I would hate to lose all those excellent
recommendations they give me based on that data.

------
gwittel
Interesting. I wish it had more details on inputs/outputs and data sizes in
the different phases.

One thing I wonder about is how much of this work could be done on a
forward-moving basis. Often I see huge lookback jobs that answer
predictable/static questions -- prime candidates for aggregation during
ingest.
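
A minimal sketch of what aggregate-on-ingest could look like (all names and the event shape here are invented for illustration; Spotify's actual pipeline isn't public): fold each play event into running per-user counters as it arrives, so the year-end question becomes a lookup instead of a full-history scan.

```python
from collections import defaultdict

# Running per-user, per-track play counters, updated as events arrive.
# A real pipeline would keep this in a keyed state store or database;
# a plain dict stands in for that here.
play_counts = defaultdict(lambda: defaultdict(int))

def ingest(event):
    """Fold one play event into the running aggregates."""
    play_counts[event["user_id"]][event["track_id"]] += 1

def top_tracks(user_id, n=5):
    """Answer the 'most played' question from pre-aggregated state."""
    counts = play_counts[user_id]
    return sorted(counts, key=counts.get, reverse=True)[:n]

# Simulated event stream
for track in ["a", "b", "a", "c", "a", "b"]:
    ingest({"user_id": "u1", "track_id": track})

print(top_tracks("u1"))  # most-played first: ['a', 'b', 'c']
```

The lookback job then only reads the aggregates, at the cost of deciding the questions up front.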

~~~
wobblykiwi
This is the thing I was most looking forward to reading about in the article,
but there were no figures about how large the "largest Google Dataflow job
ever" actually is. There are a bunch of relative figures (5x the 2018 job, for
instance), but what does that translate to? How long did it take?

~~~
tylerl
Ya, concrete details were conspicuously missing. Petabytes? Exabytes? I
suspect that the "largest Dataflow job ever" is significantly smaller than the
kind of crap Google regularly throws at the backend that Dataflow runs on.
With that infrastructure at their fingertips, I suspect engineers regularly
fire off jobs orders of magnitude larger than necessary simply because it's
not worth the 3 hours of human effort it'd take to narrow down the input set.

------
rsmets
I thought this was such a marvel! However, my excitement was tempered when I
realized the Best of the Decade playlist was not created from my listening
habits alone.

It seems users were pinned to some general playlist with characteristics
similar to their listening habits? Still, hats off from an engineering
perspective. I too wish there was more technical detail provided.

The year recap playlists, though, are a fun personal snapshot in time.

~~~
paxys
I think the decade lists were a bit underwhelming considering not too many
people were actually using Spotify all that much 10 years ago. I still got a
ton of my music from CDs, iTunes downloads and other more nefarious places.

~~~
matsemann
I became a paying customer (Premium subscriber) on Oct 5, 2009. Everyone at my
school was using Spotify at the time, albeit the free version. (Norway)

~~~
huseyinkeles
Interesting, maybe it was more popular back then in the Nordics, as Spotify is
a Swedish company?

~~~
LeonidasXIV
Maybe it also depends on the stationary and mobile internet access you have.
In Germany, streaming music wasn't feasible a decade ago since data plans were
quite limited, and arguably it still isn't all that feasible over mobile
internet unless you download for offline listening on WiFi.

Meanwhile, in Denmark or Poland there is very little in terms of data limits.

~~~
Orphis
10 years ago, barely anyone had a smartphone. Spotify back then was about
desktop usage.

~~~
Jhsto
I remember creating a mobile app for Spotify before they did. It used a
reverse-engineered API on a server to download songs and stream them to mobile
devices. Most of my friends at school used it. There were some issues with the
server providers, and eventually Spotify disliked the fact that the server
constructed DRM-free music files and stored them temporarily on disk.

Eventually, Spotify released its official mobile apps and a web player, so the
project had no use. But those were fun times; it was really marvelous how
anyone could find their favorite music on the service and listen to it in good
quality without a torrent connection.

Nowadays, I think all those friends who used the hack are Premium subscribers.

------
dna_polymerase
Basically the perfect use case for cloud computing. Tons of compute for a
short time. In this case there can’t possibly be people arguing for their own
datacenter over cloud.

~~~
wrkronmiller
> Basically the perfect use case for cloud computing. Tons of compute for a
> short time.

I completely agree.

> In this case there can’t possibly be people arguing for their own datacenter
> over cloud.

Devil's advocate time: This solution was great for the cloud because it was
designed for the cloud. There might be equally good or even superior solutions
designed for on-prem or even on-device computing. For example, this ceases to
be a big-data problem if you are simply aggregating listening metrics for a
single user on a single device.
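
As a toy illustration of that point (the log format and field names are made up), the single-user, single-device version of the problem is just a counter over one local history:

```python
from collections import Counter

# One user's local listening log: (track_id, seconds_played) pairs.
history = [
    ("song_a", 210), ("song_b", 180), ("song_a", 210),
    ("song_c", 240), ("song_a", 95),
]

# Play counts and total listening time, computed entirely on-device.
plays = Counter(track for track, _ in history)
minutes = sum(secs for _, secs in history) / 60

top_track, top_plays = plays.most_common(1)[0]
print(f"Top track: {top_track} ({top_plays} plays, {minutes:.0f} min total)")
```

No distributed anything required; the hard parts at scale are elsewhere (multi-device history, schema changes, lost devices).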

~~~
gen220
IMO, this is a great example of how the policy of “owning your own data” can
actually lead to objectively “better” engineering solutions.

If Spotify leveraged _my_ phone to calculate these statistics of _my_
listening history (owned and stored locally), this article would have been
written about an app update.

No need for a massive ad-hoc job with high-bandwidth round trips, just a
simple app update.

It’s funny to imagine how engineers of the future might look back on our pride
in this kind of computing, similar to how we look back in horror at how
wasteful we once were with oil back in the 1910s, etc.

~~~
joshuamorton
> If Spotify leveraged my phone to calculate these statistics of my listening
> history (owned and stored locally), this article would have been written
> about an app update.

Then the article would be about the challenges of battery life on users'
phones, and trying to coordinate listening history on PC vs. phone.

~~~
gen220
To be clear, I’m not a data ownership nut, I just find the problem space
interesting and underrated. Apologies for the hyperbole in the last paragraph,
it was more tongue in cheek than serious.

The article on coordinating and compressing listening history (the particular
challenges of distributed schema evolution at the “edge”), would have been a
much more interesting article to read, IMO.

Also, I know you probably weren’t very serious about it, but I don’t think
that a few SQL queries against “thousands of data points” (temporal rows,
reading between the lines) would be a significant battery life drain! It would
have still been interesting to see that benchmarked. But “big data” is cooler,
I guess. :)

~~~
foota
FWIW, a hundred listens a day works out to over 30,000 a year, or 300,000 over
a decade, which is approaching non-trivial levels for a phone, especially if
you're doing anything more than an index scan.

~~~
gen220
Oh, for sure. I was just going off the article’s own phrasing, which I agree
sounds strange (it seems too small). But if you think about it, very few
people probably listen to 30k _different_ songs on Spotify in a single year,
so maybe it does make sense.

Of course this all depends on the level of detail they want to store, it could
be a uuid, a tstzrange, and some Booleans about whether the song was liked,
downloaded, etc.

Every year (or once you reach some storage threshold) you could “compress”
this information by aggregating rows by song, and throwing away precision on
the time stamps, until you’re just left with a uuid, full/partial play
counters, and dates that the song was liked/unliked, downloaded/removed, etc.
You could give users the option to modulate the level of detail in the
records, to trade off storage constraints against recommendation UX.

It’s a set of constraints that differs greatly from a huge ETL job, but my
point is that this kind of edge work leads in interesting directions, too :)
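
A rough sketch of the compaction step described above, with a schema invented purely for illustration: collapse raw play rows into per-song counters, keeping only coarse dates.

```python
from collections import defaultdict

# Raw rows: (song_id, iso_timestamp, fraction_of_song_played)
raw = [
    ("s1", "2019-03-02T10:15:00", 1.0),
    ("s1", "2019-03-02T18:40:00", 0.3),
    ("s2", "2019-07-11T09:00:00", 1.0),
    ("s1", "2019-12-25T20:05:00", 1.0),
]

def compact(rows):
    """Aggregate play rows per song, dropping timestamp precision."""
    out = defaultdict(lambda: {"full": 0, "partial": 0,
                               "first": None, "last": None})
    for song, ts, frac in rows:
        rec = out[song]
        # Count as a full play if most of the song was heard.
        rec["full" if frac >= 0.9 else "partial"] += 1
        day = ts[:10]  # keep the date only, discard time of day
        rec["first"] = min(rec["first"] or day, day)
        rec["last"] = max(rec["last"] or day, day)
    return dict(out)

print(compact(raw)["s1"])
# {'full': 2, 'partial': 1, 'first': '2019-03-02', 'last': '2019-12-25'}
```

The "level of detail" knob mentioned above would just change which fields survive this pass.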

------
data4lyfe
One massive SQL query across a billion plus users.

~~~
ipnon
Databases are the one area of computer science that makes me realize these
machines can do magical things.

------
matlin
I'm curious how much data this involves per user. This is clearly a massive
undertaking when you're talking about ~250 million users but I bet it would be
easy to provide the same info if all the data was local on a device and each
user ran their own query. This assumes that the space required to store all of
your listening history fits on device which I think is a safe bet.
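
To make that concrete, here is a sketch (table and column names are invented; the real on-device schema is unknown) of the kind of local query each device could run with nothing more than SQLite:

```python
import sqlite3

# Hypothetical on-device history table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE plays (track TEXT, played_at TEXT)")
conn.executemany(
    "INSERT INTO plays VALUES (?, ?)",
    [("a", "2019-01-01"), ("b", "2019-02-01"),
     ("a", "2019-03-01"), ("a", "2019-04-01")],
)

# The whole 'top tracks' question as one local query.
rows = conn.execute(
    "SELECT track, COUNT(*) AS n FROM plays "
    "GROUP BY track ORDER BY n DESC LIMIT 5"
).fetchall()
print(rows)  # [('a', 3), ('b', 1)]
```

As the replies below this comment point out, the catch is less the query and more keeping that table complete across devices and reinstalls.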

~~~
paxys
> This assumes that the space required to store all of your listening history
> fits on device which I think is a safe bet

Space-wise, yes, but users are likely using multiple devices and may have
switched phones, reinstalled the app, wiped data etc.

Then you have to consider that the scripts would have to be individually
written for each platform, and would have to be careful about power
consumption, CPU usage etc., especially on mobile devices. And there's not
just data mining but also video encoding (for the stories).

And then there's this part:

> To bring you a Decade Wrapped, we had to process these data stories over 10
> years’ worth of data for all of our monthly active users

~~~
herbstein
> And there's not just data mining but also video encoding

I was under the impression that the stories were live graphics. They certainly
were on PC, as I had issues running the WebGL content because of my script
blockers.

------
deepsun
I'd recommend they check out ClickHouse for exactly these purposes. It works
well for Cloudflare, Yandex, and Sentry.

Another idea is to run probabilistic queries instead of exact ones, which
could bring costs down considerably.
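
For a sense of what "probabilistic queries" can mean here, below is a minimal HyperLogLog-style distinct counter (a simplified variant using trailing-zero ranks; not tied to anything Spotify runs, though ClickHouse exposes the idea through aggregate functions such as `uniq`). It estimates cardinality in a few KB of state instead of storing every value.

```python
import hashlib
import math

class HyperLogLog:
    """Approximate distinct-count sketch with 2^p registers."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p            # number of registers
        self.registers = [0] * self.m

    def add(self, item):
        h = int(hashlib.sha256(str(item).encode()).hexdigest(), 16)
        idx = h & (self.m - 1)     # low p bits pick a register
        w = h >> self.p            # remaining bits give a rank
        rank = 1                   # 1 + number of trailing zero bits
        while w & 1 == 0 and rank <= 64:
            rank += 1
            w >>= 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        est = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if est <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            est = self.m * math.log(self.m / zeros)
        return int(est)

hll = HyperLogLog()
for i in range(10_000):
    hll.add(i)
print(hll.count())  # close to 10000, approximate by design
```

A few percent of error is usually a fine trade for "how many distinct tracks/listeners" style questions at this scale.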

------
dang
There's more info at [https://techcrunch.com/2020/02/18/how-spotify-ran-the-
larges...](https://techcrunch.com/2020/02/18/how-spotify-ran-the-largest-
google-dataflow-job-ever-for-wrapped-2019/).

(via
[https://news.ycombinator.com/item?id=22359528](https://news.ycombinator.com/item?id=22359528))

------
justlexi93
In early December, Spotify launched its annual personalized Wrapped playlist
with its users’ most-streamed sounds of 2019. That has become a bit of a
tradition and isn’t necessarily anything new, but for 2019, it also gave users
a look back at how they used Spotify over the last decade. Because this was
quite a large job, Spotify gave us a bit of a look under the covers of how it
generated these lists for its ever-growing number of free and paid
subscribers.

------
drdoooom
Was a neat little feature, too bad the share functionality didn't actually
work.

------
dvtrn
I thought we had a thing about preserving post titles from the source?

~~~
capableweb
That's still true, submission used to link to
[https://techcrunch.com/2020/02/18/how-spotify-ran-the-
larges...](https://techcrunch.com/2020/02/18/how-spotify-ran-the-largest-
google-dataflow-job-ever-for-wrapped-2019/)

See
[https://news.ycombinator.com/item?id=22359865](https://news.ycombinator.com/item?id=22359865)

------
fmjrey
This may be a more appropriate source, from the source:

[https://labs.spotify.com/2019/11/12/spotifys-event-
delivery-...](https://labs.spotify.com/2019/11/12/spotifys-event-delivery-
life-in-the-cloud/)

~~~
mackey
This is the correct link: [https://labs.spotify.com/2020/02/18/wrapping-up-the-
decade-a...](https://labs.spotify.com/2020/02/18/wrapping-up-the-decade-a-
data-story/)

~~~
dang
Ok, we've changed to that from [https://techcrunch.com/2020/02/18/how-spotify-
ran-the-larges...](https://techcrunch.com/2020/02/18/how-spotify-ran-the-
largest-google-dataflow-job-ever-for-wrapped-2019/). Thanks all!

~~~
gabagool
The new Spotify blog only states that "the Wrapped Campaign data pipeline had
one of the largest Dataflow jobs to ever run on GCP," without claiming that it
was the largest ever. I didn't see any additional evidence in the TechCrunch
article to support this being the largest either.

Not sure if a better title is warranted ("How Spotify ran its massive Google
Dataflow job for Wrapped 2019", "How Spotify ran one of the largest Google
Dataflow jobs ever for Wrapped 2019"?).

~~~
dang
Ok, we've knocked the largest down to size in the title above.

I always tell startups not to use superlatives on HN. Modest language sounds
stronger.

------
downerending
Impressive, but I'd be more impressed if they fixed their random shuffle.

~~~
nvarsj
What's wrong with the Spotify shuffle?

edit: Did a search; it seems there are quite a few problems (only playing
recently added songs, only playing 100 songs out of the playlist, etc.). I
know Google Music has also had long-standing issues with shuffle play, and in
fact I left it over these kinds of issues. Is it really that difficult to
implement a shuffle?!

~~~
mrkeen
It may be the case that 100 tracks are sent to the device and the shuffle
logic chooses from them locally.
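
A minimal sketch of that design (purely speculative about Spotify's client): fetch only a window of tracks, then shuffle that window locally with Fisher-Yates. This would explain why tracks beyond the window never come up.

```python
import random

def local_shuffle(playlist, window=100, rng=random):
    """Shuffle only the window of tracks the client has loaded."""
    tracks = playlist[:window]  # only this slice was fetched
    # In-place Fisher-Yates over the local window.
    for i in range(len(tracks) - 1, 0, -1):
        j = rng.randrange(i + 1)
        tracks[i], tracks[j] = tracks[j], tracks[i]
    return tracks

big_playlist = [f"track_{n}" for n in range(1000)]
queue = local_shuffle(big_playlist)
# Everything past the window is unreachable, matching the observed behavior.
print(len(queue), all(t in big_playlist[:100] for t in queue))  # 100 True
```

The shuffle itself is unbiased; the complaints would come entirely from the windowing, not the algorithm.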

~~~
kingosticks
Not sure why you are being downvoted; this is essentially how Spotify's
shuffle works. At least, if you MITM the official client and load a large
playlist/context, you'll only see a small window's worth of tracks being
loaded. And you won't see any request from the client when you then shuffle
that playlist; it's done locally.

This may, of course, have changed. My experiments while (badly) implementing
librespot's shuffle functionality were a few years ago now.

------
stilisstuk
No TechCrunch... you can't have my cookies.

------
fs111
why is this link doing a redirect through some ad network?

~~~
Swtrz
I wonder why I never see this behavior despite every other person mentioning
it.

~~~
jdormit
It's really quick. Open the network tab and check the "persist logs" checkbox
to ensure that the request logs don't disappear after every redirect, then
clear your cookies for advertising.com and guce.techcrunch.com and reload the
page. You'll see the request for techcrunch.com redirect to
guce.techcrunch.com, which redirects to guce.advertising.com, which redirects
back to techcrunch.com. It happens so fast it's not noticeable on page load.

------
swagonomixxx
This is interesting, but what I find even more interesting is Spotify
continuing its use of Google Cloud products even after being acquired by
Microsoft. Can anyone shed some light on why this is the case? Has that
acquisition not been a "traditional" MS acquisition?

~~~
luhn
There was some news about Microsoft acquiring Spotify in April 2018, but as
far as I can tell that never went through.

~~~
dundun
There was also a breaking story about Google acquiring Spotify on the exact
same date a year later!

