I use it a lot (without criu or libsoccr) for high-availability shenanigans, to ...

dj_gitmo · on Oct 26, 2022

If you don't mind me asking, what kind of workload requires this kind of "high-availability shenanigans". Sounds fascinating.

dinosaurdynasty · on Oct 27, 2022

9-1-1 callcenters often do, some use specialized server hardware that can in <1ms switch from one motherboard/CPU to another when it fails.

Apparently there are some workloads in finance that use similar hardware.

ale42 · on Oct 27, 2022

> Apparently there are some workloads in finance that use similar hardware.

Not surprising, I've heard from someone who worked at a financial institution doing high-speed trading that basically every ms counts for them.

touisteur · on Oct 27, 2022

Every microsecond, and HFT people rent datacenter space the closest possible to physical exchanges... When speed of light is your main concern, maybe the Linux kernel is not your friend anymore (although dpdk can help here). I'm happy people keep pushing the kernel so hard, and still try to keep the kernel generic and composable, so we can profit from this huge, amazing work.

touisteur · on Oct 27, 2022

Often times it's mission critical systems where latency is key, for many reasons:

- you have a human in the loop to take a split second decision, every millisecond counts

- you have a very short time to perform 'looped' operations - where the result of one measurement must be taken into account to effect the next measurement (adaptive optics, some radar systems, some mechanical control loops) and you can't wait.

You'd think 'oh but you got more than 1 ms for that' Well not always since one must take into account the time to detect the failure, and the time to switch other parts of the system (which sometimes must be done in sequence with the connection takeover).

I'd say 'forget tcp' there but we don't always get to decide the comm layer...

hamandcheese · on Oct 27, 2022

If RDS could fail over in 1ms I would be extremely happy. In practice, a few minutes of downtime for things like DB upgrades is usually acceptable in the business I am in, however this is enough time to cause quite a lot of alerting noise.

If the window of unavailability was instead 1ms, there would be dramatically less noise, potentially none.

u8080 · on Oct 27, 2022

Such redundance shenanigans used in Cellular networks, single node could handle thousands calls at the same time.

remram · on Oct 26, 2022

How do you know about the other machine's outgoing connections in real-time?

touisteur · on Oct 27, 2022

You capture and stream the state of the connection as often and fast as you can. Ideally the tcp state would be streamed in the same operation as checkpointing, still in io_uring.