Would you have a write up in more detail of what you did, even high level. Seems...

vidarh · on April 7, 2022

Unfortunately not, but it's surprisingly straightforward apart from the database bit, so here's a bit more detail from memory. There are many ways of doing this, and some choices will depend strongly on which tools you're comfortable with (e.g. nginx vs. haproxy vs. some other reverse proxy is largely down to which one you know best and/or already have in the mix). [Today I might have considered K8s, but this was before that was even a realistic option, and frankly even with K8s I'm not sure -- the setup in question was very simple to maintain]:

* Set up haproxy, nginx or similar as reverse proxy and carefully decide if you can handle retries on failed queries. If you want true zero-downtime migration there's a challenge here in making sure you have a setup that lets you add and remove backends transparently. There are many ways of doing this, of varying complexity. I've tended to favour using dynamic DNS updates for this; in this specific instance we used Hashicorp's Consul to keep DNS updated w/services. I've also used ngx_mruby for instances where I needed more complex backend selection (it allows writing Ruby code to execute within nginx).

* Set up a VPN (or more than one, depending on your networking setup) between the locations so that the reverse proxy can reach backends in both/all locations, and so that the backends can reach databases in both places.

* Replicate the database to the new location.

* Ensure your app has a mechanism for determining which database to use as the master. Just as for the reverse proxy, we used Consul to select it. All backends would switch on promoting a replica to master.

* Ensure you have a fast method to promote a database replica to a master. You don't want to be in a situation of having to fiddle with this. We had fully automated scripts to do the failover.

* Ensure your app gracefully handles database failure of whatever it thinks the current master is. This is the trickiest bit in some cases, as you either need to make sure updates are idempotent, or you need to make sure updates during the switchover either reliably fail or reliably succeed. In the case I mentioned we were able to safely retry requests, but in many cases it'll be safer to just punt on true zero downtime migration assuming your setup can handle promotion of the new master fast enough (in our case the promotion of the new Postgres master took literally a couple of seconds, during which any failing updates would just translate to some page loads being slow as they retried, but if we hadn't been able to retry it'd have meant a few seconds downtime).
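As a rough illustration of the retry point (a sketch, not the setup described above): the idea is to look up the current master again before every attempt, so that a retry lands on the newly promoted node. `resolve_master` and `apply_update` are hypothetical callables standing in for a Consul/DNS lookup and an idempotent database write, `MasterUnavailable` is a made-up exception, and the attempt count and delay are arbitrary defaults.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class MasterUnavailable(Exception):
    """Hypothetical error: the node we sent the write to cannot accept it."""


def run_idempotent_update(
    resolve_master: Callable[[], str],   # assumption: e.g. a Consul/DNS lookup
    apply_update: Callable[[str], T],    # assumption: safe to run more than once
    attempts: int = 5,
    delay_seconds: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry an idempotent write, re-resolving the master before every attempt."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error: Optional[MasterUnavailable] = None
    for attempt in range(attempts):
        master = resolve_master()  # look it up again: a promotion may have happened
        try:
            return apply_update(master)
        except MasterUnavailable as exc:
            last_error = exc
            if attempt < attempts - 1:
                sleep(delay_seconds)  # give the promotion a moment to finish
    assert last_error is not None
    raise last_error
```

This only makes sense when `apply_update` really is idempotent; if it isn't, failing fast and accepting a few seconds of errors is the safer choice.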

Once you have the new environment running and capable of handling requests (but using the database in the old environment):

* Reduce DNS record TTL.

* Ensure the new backends are added to the reverse proxy. You should start seeing requests flow through the new backends and can verify error rates aren't increasing. This should be quick to undo if you see errors.

* Update DNS to add the new environment reverse proxy. You should start seeing requests hit the new reverse proxy, and some of them should flow through the new backends. Wait to see if any issues appear.

* Promote the replica in the new location to master and verify everything still works. Ensure whatever replication you need from the new master works. You should now see all database requests hitting the new master.

* Drain connections from the old backends (remove them from the pool, but leave them running until they're not handling any requests). You should now have all traffic past the reverse proxy going via the new environment.

* Update DNS to remove the old environment reverse proxy. Wait for all traffic to stop hitting the old reverse proxy.

* When you're confident everything is fine, you can disable the old environment and bring DNS TTL back up.
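The "drain" and "wait for all traffic to stop" steps in the list above boil down to polling until the old side has been quiet for a while. A minimal Python sketch of that check follows; `active_requests` is a hypothetical callable (for example, something that reads a counter from your reverse proxy's stats), and the sample counts and interval are arbitrary assumptions.

```python
import time
from typing import Callable


def wait_until_drained(
    active_requests: Callable[[], int],  # assumption: e.g. read from proxy stats
    quiet_samples: int = 3,
    max_samples: int = 120,
    interval_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once the count has been zero for quiet_samples polls in a row."""
    quiet = 0
    for sample in range(max_samples):
        if active_requests() == 0:
            quiet += 1
            if quiet >= quiet_samples:
                return True
        else:
            quiet = 0  # traffic came back, start counting again
        if sample < max_samples - 1:
            sleep(interval_seconds)
    return False  # still seeing traffic; do not shut anything down yet
```

Requiring several quiet polls in a row, rather than a single zero reading, guards against shutting things down during a momentary lull while clients with cached DNS are still around.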

The precise sequencing is very much a question of preference - the point is you're just switching over and testing change by change, and through most of them you can go a step back without too much trouble. I tend to prefer doing the changes that are low effort to reverse first. Keep in mind that some changes (like DNS) can take some time to propagate.
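One way to picture the "change by change, with a step back available" approach is as a list of steps that each know how to apply, check and undo themselves. The Python sketch below is purely illustrative; `Step`, `run_cutover` and the `healthy` check are hypothetical names, and in practice each callable would wrap whatever scripts you use for the proxy, DNS and database changes.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    apply: Callable[[], None]
    healthy: Callable[[], bool]  # e.g. "are error rates still normal?"
    undo: Callable[[], None]


def run_cutover(steps: List[Step]) -> List[str]:
    """Apply steps in order; if a check fails, undo that step and stop.

    Returns the names of the steps that remain applied.
    """
    applied: List[str] = []
    for step in steps:
        step.apply()
        if not step.healthy():
            step.undo()  # go one step back, leave earlier steps in place
            break
        applied.append(step.name)
    return applied
```

Ordering the list so that the cheapest-to-reverse steps come first means a failed check early on costs very little to back out of.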

EDIT: You'll note most of this is basically to treat both sites as one large environment using a VPN to tie them together and ensure you have proper high availability. Once you do, the rest of the migration is basically just failing over.

baq · on April 7, 2022

People get paid hard cash for lower quality plans than you’ve just provided, thanks a lot! :)

justsomehnguy · on April 7, 2022

> If you want true zero-downtime migration there's a challenge

It is astounding how many people require 24/7 ops... while working 8/5.

Otherwise this comment is an exemplar of how things should be done. My take on this is that OP is a sysadmin, not a dev. *smug smile*

vidarh · on April 8, 2022

> It is astounding how many people require 24/7 ops... while working 8/5.

In this case the client had an actually global audience. They could have afforded downtime for the actual transition, but it was a useful test for the high availability features that mattered for them.

I do agree with the overall principle, though - a whole lot of people think they need 24/7 and can't afford downtime, yet almost all of them are a lot less important than e.g. my bank, which do not hesitate to shut down their online banking for maintenance now and again. As it turns out, most people can afford downtime as long as it's planned and announced. Convincing management of that is a whole other issue.

> My take on this is that OP is a sysadmin, not a dev. *smug smile*

Hah. I'd say I was devops before devops was a thing. I started out writing code, but my first startup was an ISP where I was thrown head-first into learning networks (we couldn't afford to pay to have our upstream provider help set up our connection, so I learnt to configure Cisco routers while having our provider on the phone and feigning troubleshooting with a lot of "so what do you have on your side?") and sysadmin stuff, and I've oscillated back and forth between operations and development ever since. Way too few developers have experienced the sysadmin side, and it's costing a lot of companies a lot of money to have devs who are increasingly oblivious to hardware and networks.

throwawayboise · on April 8, 2022

> It is astounding how many people require 24/7 ops

Yet when us-east-1 goes offline, it's mostly just shrug wait for it to come back because it's not our fault...

jrumbut · on April 7, 2022

Really well done keeping this simple!

It's also another one of those situations where good design principles and coding practices pay off. If the app is a tangled mess of interconnected services, scripts, and cron jobs this kind of transition won't be possible.

ketzo · on April 7, 2022

Damn, this is why I come to HN. This is awesome, thank you so much for taking the time to write it up.

cosmodisk · on April 7, 2022

This was really nice to read. Thanks!

sruffatti · on April 7, 2022

Bump! This sounds very interesting.

withinboredom · on April 7, 2022

Highly recommend WireGuard for this (see kilo for a k8s-specific option that works with whatever network you have set up). Setting up a VPN that just works is super simple.

johnthescott · on April 7, 2022

yep, wireguard is the secret for intercloud, for sure.

czernobog · on April 7, 2022

Same. Bump!