I use it a lot (without criu or libsoccr) for high-availability shenanigans, to avoid the reconnection delay. Machine A is 'main' and has an established connection to distant-machine 1. A crashes (the hardware stops), and machine B takes over its MAC, its IP, and all its established TCP connections. Nothing can be seen from the distant-machine side (maybe a gratuitous ARP slips out... to no avail). And yes, for <1 millisecond takeover this is necessary, and I thank the criu project with all my engineering heart for not going the 'just put a module in there and be done with it' route but actually making sockets checkpointable and restorable, saving me from the pains of a userland network stack.
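The kernel feature underneath all this is TCP_REPAIR (merged for criu): with CAP_NET_ADMIN you can flip an established socket into repair mode, read out its sequence numbers, and later recreate the connection on machine B without any handshake on the wire. A minimal sketch in Python — the numeric constants come from `<linux/tcp.h>`, but the `pack_state`/`unpack_state` wire format is purely illustrative, not what criu or libsoccr actually emit:

```python
import socket
import struct

# Constants from <linux/tcp.h>; these values are stable kernel ABI.
TCP_REPAIR = 19
TCP_REPAIR_QUEUE = 20
TCP_QUEUE_SEQ = 21
TCP_RECV_QUEUE, TCP_SEND_QUEUE = 1, 2

def checkpoint_seq(sock: socket.socket) -> tuple[int, int]:
    """Put an established socket into repair mode and read its send/recv
    sequence numbers. Requires CAP_NET_ADMIN; the peer sees nothing."""
    sock.setsockopt(socket.IPPROTO_TCP, TCP_REPAIR, 1)
    sock.setsockopt(socket.IPPROTO_TCP, TCP_REPAIR_QUEUE, TCP_SEND_QUEUE)
    snd_seq = sock.getsockopt(socket.IPPROTO_TCP, TCP_QUEUE_SEQ)
    sock.setsockopt(socket.IPPROTO_TCP, TCP_REPAIR_QUEUE, TCP_RECV_QUEUE)
    rcv_seq = sock.getsockopt(socket.IPPROTO_TCP, TCP_QUEUE_SEQ)
    sock.setsockopt(socket.IPPROTO_TCP, TCP_REPAIR, 0)
    return snd_seq, rcv_seq

def pack_state(snd_seq: int, rcv_seq: int, mss: int, snd_wnd: int) -> bytes:
    """Serialize the bits of state to ship to the standby machine
    (toy 12-byte format for illustration)."""
    return struct.pack("!IIHH", snd_seq & 0xFFFFFFFF,
                       rcv_seq & 0xFFFFFFFF, mss, snd_wnd)

def unpack_state(blob: bytes) -> tuple[int, int, int, int]:
    return struct.unpack("!IIHH", blob)
```

On restore, machine B creates a socket, enters repair mode, writes the sequence numbers back with the same options, binds/connects to the saved 4-tuple, and leaves repair mode: the connection is simply alive again.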
The actual details are far funnier and more interesting (we could talk about checkpointing not being an atomic operation for the kernel, how you need to do some magic with "plug" qdiscs, and how, since qdiscs are applicable on egress only, you'll end up looking into IFBs; I love Linux, it is so versatile and full of amazing little features). Don't forget to hot-update conntrack too...
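For the qdisc dance: a plug qdisc can only sit on egress, so the usual trick is to mirror ingress traffic onto an IFB device and plug it *there*, then release the buffered packets once the restore is done. A sketch that just generates the iproute2 commands (device names and the plug limit are made-up examples; the commands need root and should run in the order given):

```python
def plug_ingress_cmds(dev: str = "eth0", ifb: str = "ifb0") -> list[str]:
    """iproute2 commands to buffer *ingress* traffic during a checkpoint.
    qdiscs only act on egress, so redirect ingress to an IFB device and
    attach the 'plug' qdisc on the IFB side."""
    return [
        f"ip link add {ifb} type ifb",
        f"ip link set {ifb} up",
        f"tc qdisc add dev {dev} handle ffff: ingress",
        # steer everything arriving on dev over to the IFB
        f"tc filter add dev {dev} parent ffff: matchall "
        f"action mirred egress redirect dev {ifb}",
        # hold packets until we explicitly release them
        f"tc qdisc add dev {ifb} root plug limit 32768",
    ]

def unplug_cmds(ifb: str = "ifb0") -> list[str]:
    """Release everything buffered while the checkpoint was in flight."""
    return [f"tc qdisc change dev {ifb} root plug release_indefinite"]
```

The plug qdisc's buffer/release knobs are exactly what criu leans on so that no packet observed by the peer falls into the gap between checkpoint and restore.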
And since libsoccr is GPL you might need to do this yourself, and you'll want to do it anyway, because it's interesting and you'll learn so many things.
My only gripe is that the checkpoint is still a bit slow; maybe if I keep annoying Jens Axboe on Twitter it'll soon be an io_uring chain <3.
Every microsecond counts, and HFT people rent datacenter space as close as possible to the physical exchanges... When the speed of light is your main concern, maybe the Linux kernel is not your friend anymore (although DPDK can help here). I'm happy people keep pushing the kernel so hard while still trying to keep it generic and composable, so we can all profit from this huge, amazing work.
Oftentimes it's mission-critical systems where latency is key, for many reasons:
- you have a human in the loop who must make a split-second decision, so every millisecond counts
- you have a very short time to perform 'looped' operations, where the result of one measurement must be taken into account to drive the next measurement (adaptive optics, some radar systems, some mechanical control loops), and you can't wait.
You'd think 'oh, but you've got more than 1 ms for that'. Well, not always, since one must take into account the time to detect the failure and the time to switch over other parts of the system (which sometimes must be done in sequence with the connection takeover).
I'd say 'forget TCP' there, but we don't always get to decide the comm layer...
If RDS could fail over in 1 ms I would be extremely happy. In practice, a few minutes of downtime for things like DB upgrades is usually acceptable in the business I am in; however, this is enough time to cause quite a lot of alerting noise.
If the window of unavailability was instead 1ms, there would be dramatically less noise, potentially none.
You capture and stream the state of the connection as often and as fast as you can. Ideally the TCP state would be streamed in the same operation as the checkpoint, still in io_uring.
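On the standby side, the one subtlety of continuous streaming is ordering: if snapshots travel over something lossy or unordered, the receiver must keep only the newest one and drop stragglers, or a late datagram could roll the connection state backwards. A tiny sketch of that latest-wins logic (the sequence counter and opaque state blob are illustrative, not the actual criu/libsoccr image format):

```python
class SnapshotStore:
    """Keep only the most recent connection snapshot, identified by a
    monotonically increasing sequence number stamped by the sender."""

    def __init__(self) -> None:
        self.seq = -1
        self.state: bytes | None = None

    def offer(self, seq: int, state: bytes) -> bool:
        """Accept a snapshot only if it is strictly newer than what we
        hold; stale or duplicate snapshots are silently dropped."""
        if seq <= self.seq:
            return False
        self.seq, self.state = seq, state
        return True
```

On failover, machine B restores from whatever `self.state` holds; the tighter the streaming loop on A, the smaller the window of connection state that can be lost.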