
Don’t run this on any system you expect to be up they said, but we did it anyway - vdloo
https://www.byte.nl/blog/dont-run-this-on-any-system-you-expect-to-be-up-they-said-but-we-did-it-anyway
======
contingencies
So... they're not running any kind of devops system at all. If they were, they
could just run the upgrade on the image, test it, deploy it. All the "months
of careful planning and many many tests" they did are basically wasted time.

I wouldn't be proud of this, quite the opposite. I would suggest critically
reviewing the entire infrastructure management strategy since months lost to a
single upgrade is obviously indicative of greater problems.

~~~
quasse
It sounds to me like they are in fact running a giant devops system, all for
the purpose of not using virtual static IPs.

Instead of just provisioning fresh VMs and migrating customer data they're
doing this massive upgrade in place on existing machines to avoid losing the
assigned IPs.

I guess they decided the benefits of being cloud provider agnostic outweighed
the downside of spending months of man hours automating in-place OS upgrades.

~~~
contingencies
A good system can handle multiple requirements. You present a false dichotomy.

------
falcolas
A nice writeup of a neat (if risky) upgrade.

> static IPs

FWIW, I personally love Virtual IPs (VIPs) for this (basically, an existing
network interface advertises more than one IP, and that IP can be moved
dynamically between servers with an ARP call). The downside is that there are
a lot of cloud providers who don't support externally available VIPs. They do,
however, offer their own nearly-identical solution (such as Elastic IPs from
Amazon).

The use of VIPs or similar could have avoided the need for such a risky
upgrade, potentially saving millions of dollars in the process. Of
course, I could simply be missing some hidden requirement from customers that
they _couldn't_ use VIPs, but that's pretty uncommon, even in the finance
industry.
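
As a rough illustration of the "arp call" mentioned above: a server taking over a VIP typically announces it with a gratuitous ARP. The sketch below only builds such a frame in memory and does not send it. The function name, MAC, and IP are hypothetical, and the choice of a broadcast target MAC in an ARP reply is one common convention, assumed here rather than taken from the article.

```python
import socket
import struct


def build_gratuitous_arp(mac: str, ip: str) -> bytes:
    """Build an Ethernet frame carrying a gratuitous ARP reply for `ip`."""
    mac_bytes = bytes.fromhex(mac.replace(":", ""))
    ip_bytes = socket.inet_aton(ip)
    broadcast = b"\xff" * 6

    # Ethernet header: destination MAC, source MAC, EtherType 0x0806 (ARP).
    eth = broadcast + mac_bytes + struct.pack("!H", 0x0806)

    # ARP header: hardware type 1 (Ethernet), protocol 0x0800 (IPv4),
    # hardware length 6, protocol length 4, opcode 2 (reply).
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 2)
    arp += mac_bytes + ip_bytes   # sender MAC and sender IP
    arp += broadcast + ip_bytes   # target MAC and target IP (same IP: gratuitous)
    return eth + arp


# Hypothetical values from the documentation/test address ranges.
frame = build_gratuitous_arp("02:00:00:aa:bb:cc", "192.0.2.10")
```

Actually putting the frame on the wire needs a raw socket and elevated privileges, which is why tools like keepalived usually handle this step.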

~~~
Tijdreiziger
That's addressed in the article: "We purposely don’t employ dynamic IPs to
retain multi-cloud deployment capabilities and prevent vendor lock-in with one
platform."

~~~
luhn
I was really confused by this. Cloud vendors are not bring-your-own-IP AFAIK, so
how can they even get a non-virtual static IP address in the cloud?

------
markatto
They're still taking downtime for this... Even if they're forced to have a
no-VIP, no-HA, no-LB setup (seems insane to me), it should be much simpler to set
the DNS TTL to a low value right before and switch it to the new IP after the
new box comes up.
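
A small sketch of the timing behind that idea, assuming resolvers honor TTLs: the lowered TTL only takes full effect once the old TTL has expired everywhere, so the change has to go in at least one old-TTL period before the switch. The function names and the example values are hypothetical.

```python
from datetime import datetime, timedelta


def earliest_safe_cutover(ttl_lowered_at: datetime, old_ttl_seconds: int) -> datetime:
    """Resolvers may cache the old record for up to the old TTL after it is lowered."""
    return ttl_lowered_at + timedelta(seconds=old_ttl_seconds)


def worst_case_stale_window(new_ttl_seconds: int) -> timedelta:
    """After the IP switch, clients may still hit the old IP for up to the new TTL."""
    return timedelta(seconds=new_ttl_seconds)


# Hypothetical example: TTL lowered from 1 day to 60 seconds at noon.
lowered = datetime(2017, 1, 1, 12, 0, 0)
cutover = earliest_safe_cutover(lowered, old_ttl_seconds=86400)
stale = worst_case_stale_window(new_ttl_seconds=60)
```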

~~~
lathiat
That assumes they have control over the DNS. Sounds like they don't with many
end customers.

~~~
markatto
They shouldn't be giving their customers a static IP, they should be giving
them a 'customername.ourplatform.com' address that the customer can point a
CNAME at.

~~~
vidarh
This was addressed in the article: There is a tendency within that industry of
whitelisting API access etc. by source IP, so their customers do need static
IPs not primarily for inbound traffic, but to be able to access the APIs they
need.

Now it's still stupid of them not to abstract that away from the individual
customer servers, but this issue isn't solved with CNAMEs.
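
To make the whitelisting point concrete, here is a minimal sketch of what a source-IP allowlist on the third-party API side might look like. The allowlist contents and function name are hypothetical. The check only ever sees the address the connection comes from, which is why a DNS name pointing at the customer's server does not help.

```python
import ipaddress

# Hypothetical allowlist kept by a third-party API: one customer server IP.
ALLOWED_SOURCES = [ipaddress.ip_network("203.0.113.7/32")]


def is_allowed(source_ip: str) -> bool:
    """Return True if the connecting address falls inside an allowlisted network."""
    addr = ipaddress.ip_address(source_ip)
    return any(addr in net for net in ALLOWED_SOURCES)


# The same customer on a freshly provisioned VM with a new IP is rejected.
old_box_ok = is_allowed("203.0.113.7")
new_box_ok = is_allowed("203.0.113.8")
```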

------
gargravarr
On the one hand I am very impressed they managed this, but on the other, it
does seem very sledgehammer/nut-esque. Even without virtual IPs, it seems a
little silly that their customers weren't running N+1 redundant instances that
could be taken out, upgraded and then swapped without disrupting normal
operations.

Again, very impressive as an academic exercise, especially considering the
given script isn't actually that complicated, but wow, they had some serious
guts running this in production!
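
For comparison, the N+1 approach could be sketched roughly like this. It is an in-memory simulation under the assumption that nodes are interchangeable, and `rolling_upgrade` and its parameters are hypothetical names, not anything from the article.

```python
def rolling_upgrade(nodes, required, upgrade):
    """Upgrade nodes one at a time without dropping below `required` in service.

    Returns the lowest number of nodes that were in service at any point.
    """
    if len(nodes) <= required:
        raise ValueError("need at least one spare node (N+1) to upgrade without downtime")

    in_service = set(nodes)
    min_in_service = len(in_service)
    for node in nodes:
        in_service.remove(node)            # drain the node out of rotation
        min_in_service = min(min_in_service, len(in_service))
        upgrade(node)                      # caller-supplied upgrade step
        in_service.add(node)               # put it back before touching the next one
    return min_in_service


# Three nodes where two are needed to carry the load.
upgraded = []
lowest = rolling_upgrade(["a", "b", "c"], required=2, upgrade=upgraded.append)
```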

------
sdiq
"It was like replacing the wheels on a moving vehicle"

That reminds me of this crazy video I once watched.
[https://youtube.com/watch?v=Cad8fyYeFnY](https://youtube.com/watch?v=Cad8fyYeFnY)

------
mankash666
I think the authors lost a good opportunity to move towards containers to
avoid these problems in the future. While interesting academically, this
approach is wrong for the long run.

------
astrodust
Given how much memory some servers have these days, which for an application
node is often more than the necessary hard-disk capacity, this is quite a
clever approach.

------
loa_in_
My first thought was that it was talking about the `reboot` binary.

------
conatus
Very nice!

