
How do you restart a single server hosting VMs with zero downtime to the VMs?


Did I misunderstand that the original post was regarding being able to do it with Kubernetes, not a few physical hosts? I didn't see "single" in the comment when I originally responded.


Yeah, the top comment says:

"You can easily get a box that has this many (or more) cores. I wouldn't be surprised if our cloud provider analyzed the topology of the connections, and decided to put the whole thing on the same server."

And the comment you replied to is referring to "the server"


VMware can do it by live-migrating the VM. You will incur a short pause though, and the networking is a bit tricky to set up. This of course doesn't happen during unexpected downtime; it's a cold boot on another node in that case.


If you migrate a VM to a second hardware server then you have, by definition, a second server.

The question was how you reboot the posited singular hardware server with no downtime to any VMs running on it.


I have never seen this go smoothly on a production server; it's always WAY slower than expected (if you use any significant amount of memory) and something always gets f'd up wrt the network connectivity, broken caches, etc.


If I understand correctly, this is happening (nearly) transparently on GCP all the time. https://cloud.google.com/compute/docs/instances/live-migrati...

It involves copying the whole VM image over, and rewiring the network connections virtually on the fly.

I say (nearly) because it's not 100% transparent, but my understanding is that it works properly the vast majority of the time.
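To put rough numbers on why copying a large, busy VM can take much longer than expected, here is a minimal sketch of an iterative "pre-copy" model: send all memory while the VM keeps running, then repeatedly resend whatever was dirtied in the meantime, and pause the VM only for the final small remainder. This is an assumption for illustration, not a description of how GCP or VMware actually implement migration. The function name, the parameters, and the example figures (64 GB of RAM, 1 GB/s link) are all hypothetical.

```python
def precopy_estimate(mem_gb, dirty_gb_per_s, link_gb_per_s,
                     stop_threshold_gb=0.1, max_rounds=30):
    """Estimate (total_seconds, pause_seconds, rounds) for a toy pre-copy model.

    All parameters are illustrative assumptions, not measured values.
    """
    if link_gb_per_s <= 0:
        raise ValueError("link_gb_per_s must be positive")

    to_send = mem_gb      # first round sends the whole memory image
    total = 0.0
    rounds = 0
    # Keep copying while the VM runs, until the remainder is small enough
    # to send during a brief pause (or we give up after max_rounds).
    while to_send > stop_threshold_gb and rounds < max_rounds:
        round_time = to_send / link_gb_per_s
        total += round_time
        rounds += 1
        # Memory dirtied during this round must be sent again next round.
        to_send = min(mem_gb, dirty_gb_per_s * round_time)

    # The VM is paused while the last remainder is transferred.
    pause = to_send / link_gb_per_s
    return total + pause, pause, rounds


# Dirty rate is half the link speed: the remainder halves every round.
print(precopy_estimate(64, 0.5, 1.0))
# Dirty rate exceeds the link speed: the copy never converges.
print(precopy_estimate(64, 2.0, 1.0))
```

In this toy model the outcome depends entirely on the ratio of dirty rate to link speed. When the VM dirties memory faster than the link can carry it, the loop hits its round limit and the final pause is as long as a full memory copy, which matches the "way slower than expected" experience with memory-heavy workloads.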


Probably true, but I'm also pretty sure Google isn't using VMware for live migrations under the hood.


And what do you do when you need to restart the actual VM itself? You know, for when you need to patch the kernel and such...


VMware has had a high availability mode for over a decade. It keeps checkpoints of system state elsewhere and synchronously replicates them.

If the primary catches fire or crashes, the secondary boots the VM quickly without data loss.

If the primary reboots, it checkpoints RAM to the secondary, pauses the VM, and unpauses it on the secondary a few milliseconds later.

Note that this transparently handles storage replication, which is something that is notoriously difficult in Kubernetes.

If your cluster fits on one machine, you've paid for a lot of (currently) unnecessary complexity up front, both in dev time and in hardware cost.

If your app scales to need the cluster, congrats. Sometimes delaying time to market to allow for a smoother ramp makes sense. Sometimes it does not.

(Wikipedia is a good example of succeeding without ever needing to scale out the back end. I doubt they'd have won out over their competition if they took an extra 12-24 months to launch.)


> VMware has had a high availability mode for over a decade. It keeps checkpoints of system state elsewhere and synchronously replicates them.

So you don't have one VM, but two. In the OP's scenario you are paying for 196 CPUs instead of 98.

> If the primary reboots, it checkpoints RAM to the secondary, pauses the VM, and unpauses it on the secondary a few milliseconds later.

So a very small downtime, but not 0.


It depends on your SLA. Under 10ms 99.9% of the time is pretty tight. Even if the migration takes a second, you'll meet your SLA for the hour.
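The arithmetic behind that claim can be sketched as follows, assuming the SLA is time-based (latency must be under target for 99.9% of the time in each one-hour window) rather than counted per request. The function names and the one-hour default are hypothetical, chosen only for this illustration.

```python
def latency_budget_seconds(window_seconds, target_fraction):
    """Seconds within the window that are allowed to miss the latency target."""
    return window_seconds * (1.0 - target_fraction)


def pause_fits_budget(pause_seconds, window_seconds=3600.0, target_fraction=0.999):
    """True if a single pause of this length stays within the window's budget."""
    return pause_seconds <= latency_budget_seconds(window_seconds, target_fraction)


budget = latency_budget_seconds(3600.0, 0.999)
print(f"budget per hour: {budget:.1f} s")   # 0.1% of 3600 s
print(pause_fits_budget(1.0))                # a one-second migration pause
print(pause_fits_budget(5.0))                # a five-second pause
```

Under this reading, 0.1% of an hour is about 3.6 seconds, so a single one-second migration pause fits, provided nothing else in that hour uses up the rest of the budget. It still isn't zero downtime; it is downtime the SLA tolerates.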


So not 0 down time then



