
Managing Machines at Spotify - ran290
https://labs.spotify.com/2016/03/25/managing-machines-at-spotify/
======
scurvy
What happened to your DNS data? Did you switch to dynamic DNS based upon
database data? You talk about how much of a burden the manual DNS
information was, but then you don't specify how you actually solved it
using "automation." Is it all dynamic? Does everything use SRV records with
TTLs, or are records added and removed?

Sorry for so many questions, but you made a big deal about how manual "DNS
curation" was a bad thing, then glossed over the solution.

~~~
brown9-2
Look at the section "DNS Pushes".

~~~
negz
Post author here. A bunch of stuff was glossed over as the post was more
focused on the stack's history and evolution than specific technical details.

Ideally we hope to provide some followup posts that go deeper into technical
detail about key pieces of the stack (DNS, initramfs framework, job broker,
GCP usage, etc).

------
not_kurt_godel
> Spotify has historically opted to run our core infrastructure on our own
> private fleet of physical servers (aka machines) rather than leveraging a
> public cloud

One has to wonder why they would opt for this. The entire story is a textbook
example of where using a cloud would have been immensely better. Instead of
leveraging mature public cloud offerings, they chose a path that evidently
required huge amounts of developer time and resulted in a tremendous amount of
pain/wasted time for downstream developers, only to scrap it in the end when
they finally realized there's no point in trying to re-implement AWS/GCE.
Think clouds are expensive? I'd love to quantify the number of wasted
developer-hours resulting from this decision to use physical servers and see
how it would stack up against even a very expensive AWS bill.

~~~
jba
Depending on their workloads, running their own datacenter(s) might save them
millions a month. I know it does in my own case. That said, you give up
flexibility for the $ savings. They may think the additional flexibility is
worth the cost differential at this point.

Would be good to hear their perspective on this.

~~~
negz
I believe we expect moving to the cloud to be more expensive than running our
own DCs as you suggest, but I don't believe that takes into account any
'wasted developer time' you might factor into this.

I believe we started building this platform when AWS was very new, and we
hadn't seen a compelling reason to transition to the cloud until now. There
are a couple of posts with more details behind our decision to go to GCP,
but primarily it was to leverage their data tooling.

------
matt_wulfeck
> While we heavily utilise Helios for container-based continuous integration
> and deployment (CI/CD) each machine typically has a single role – i.e. most
> machines run a single instance of a microservice.

It's strange to me that this is still so common. My theory is that the "one
machine one port" philosophy is still built into a lot of software
(monitoring, the ELB, etc). Another is that this is the philosophy we've
always known.

Take a look at Kubernetes. Everything is accessible via localhost:<some
port>. That breaks most home-built and enterprise orchestration and
monitoring tools spectacularly, even though it's a much simpler model
(everything is a port, not an IP/port combo).
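As a rough illustration (the port number and path are just assumptions, not
anything Kubernetes mandates), a sidecar in the same pod can scrape its
neighbour on a fixed localhost port, with no discovery of an IP/port pair:

    import urllib.request

    METRICS_PORT = 9090  # assumed convention; any pod-local port works

    def scrape_local_metrics():
        """Fetch metrics from a container in the same pod over localhost."""
        url = "http://localhost:%d/metrics" % METRICS_PORT
        with urllib.request.urlopen(url) as resp:
            return resp.read().decode()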

Density is much easier to accomplish on larger machines with more cores, which
are elastic in the face of bursty residents. They are also generally cheaper
per compute/memory.

~~~
jsmthrowaway
All of those things are doing gymnastics with ports because nobody can be
bothered to ship IPv6. If you can bring v6 up you can assign every process an
IP and start assuming ports (80 is the service via HTTP, 443 via TLS, 8080 via
HTTP/2 gRPC, 9000 for monitoring, and so on). It's way cleaner than all the
work around ports in the current state of the art and means you can Just Use
DNS in a number of scenarios. There are whole subsystems built around ports
in pretty much every orchestration system, and it's such an antipattern,
really. Half of Docker's networking stack, a bunch of Kubernetes logic,
Flannel: all of it becomes unnecessary, because those are attempts to jam
the right model into limited IPv4 space and limited address table space on
the infrastructure.
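To sketch what I mean (the addresses below are made-up ULAs, and in
practice they'd come from your allocator or DNS AAAA records, not a
hard-coded dict): give every service instance its own v6 address and every
instance can bind the same well-known port without colliding.

    import socket

    # Hypothetical per-service addresses, purely for illustration.
    SERVICE_ADDRS = {
        "playlist": "fd00:1234::10",
        "metadata": "fd00:1234::11",
    }

    def serve(name, port=8080):
        """Bind the well-known port on this service's own IPv6 address."""
        sock = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
        sock.bind((SERVICE_ADDRS[name], port))
        sock.listen(5)
        return sock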

IPv6 is practically built for containers, and, to Kubernetes's credit, they
architected with that in mind. (Learned from BNS.) Weirdly, what I'm saying
here was the original idea behind ports in the first place. There just aren't
enough of them, particularly when half your space is shared with client
sockets.

I want a world where v4 is pretty much just my control plane into the v6
cluster, since I'll die before IPv4. Google and far more importantly Amazon
need to come up with a v6 story in their cloud offerings already. AWS has had
a _decade_. This isn't just blind advocacy any more; the orchestration and
software side is starting to build entire parts of the OSI stack because the
network side of our industry is stuck without any sign of moving, no matter
how dire the v4 situation.

------
scurvy
Also, this whole solution sounds like a Linux clone of Microsoft's Automated
Deployment Services -- way ahead of its time and under-appreciated in 2004.

------
scurvy
One more: How is relying on a random Python library from OpenStack better than
relying on a UNIX command line tool that's used by 100x as many people?

------
be_erik
"We also assigned each server a static unique identifier in the form of a
woman’s name – a shrinking namespace with thousands of servers."

Let me fix that for you; stop gendering your servers.

~~~
akerl_
If they'd instead said "We assigned each server a static ID in the form of a
female computer scientist's name", we'd be here praising them for their
forward-thinking inclusion. Maybe let's not see everything as offense-worthy?

