Post author here: there are other projects that create a proxy for CUDA calls and use the log of CUDA operations to checkpoint/restore or live-migrate tasks. We haven’t used them. I don’t believe they are very popular or used outside specific orgs.
This is the only API available for snapshotting NVIDIA GPU memory, afaik.
As for needing to combine it with a host memory snapshot step, this is required because CUDA sessions need to be mapped to a host process, so you need to snapshot both things in order for the program to be restored correctly.
CRIU is another project that uses the same technique (CUDA snapshot + host memory snapshot). Unlike CRIU, our snapshots work at the function level, so we’re able to take snapshots after functions have been initialized (including GPU memory), which makes Modal cold boots fast. With CRIU, one would have to implement this entire process oneself.
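To make the ordering concrete, here is a toy Python model. It makes no real CUDA, gVisor, or CRIU calls, and every name in it is made up for illustration. It assumes the lock, checkpoint, restore, unlock sequence described for NVIDIA's cuda-checkpoint utility, where a checkpoint moves device memory into the process's host memory, so a host snapshot taken afterwards carries the GPU state with it.

```python
import copy
from dataclasses import dataclass, field
from enum import Enum


class GpuState(Enum):
    RUNNING = "running"
    LOCKED = "locked"
    CHECKPOINTED = "checkpointed"


@dataclass
class ToyProcess:
    """Hypothetical stand-in for a host process that owns a CUDA session."""
    host_memory: dict = field(default_factory=dict)
    device_memory: dict = field(default_factory=dict)
    gpu_state: GpuState = GpuState.RUNNING


def _require(proc, expected):
    if proc.gpu_state is not expected:
        raise RuntimeError(f"expected {expected.value}, got {proc.gpu_state.value}")


def lock(proc):
    # Block further GPU work so device state stops changing.
    _require(proc, GpuState.RUNNING)
    proc.gpu_state = GpuState.LOCKED


def checkpoint_gpu(proc):
    # Move device memory into host memory and release the GPU (assumed behavior).
    _require(proc, GpuState.LOCKED)
    proc.host_memory["__device_image__"] = proc.device_memory
    proc.device_memory = {}
    proc.gpu_state = GpuState.CHECKPOINTED


def snapshot_host(proc):
    # The host snapshot is only complete once the GPU state lives in host memory.
    _require(proc, GpuState.CHECKPOINTED)
    return copy.deepcopy(proc.host_memory)


def restore_host(image):
    # Recreate the process from the host image; GPU memory is still empty here.
    return ToyProcess(host_memory=copy.deepcopy(image),
                      gpu_state=GpuState.CHECKPOINTED)


def restore_gpu(proc):
    # Copy the saved device image back onto the GPU.
    _require(proc, GpuState.CHECKPOINTED)
    proc.device_memory = proc.host_memory.pop("__device_image__")
    proc.gpu_state = GpuState.LOCKED


def unlock(proc):
    # Allow GPU work to resume.
    _require(proc, GpuState.LOCKED)
    proc.gpu_state = GpuState.RUNNING


# Snapshot after "initialization" (weights already on the GPU), then restore.
proc = ToyProcess(host_memory={"step": 3}, device_memory={"weights": [1.0, 2.0]})
lock(proc)
checkpoint_gpu(proc)
image = snapshot_host(proc)

restored = restore_host(image)
restore_gpu(restored)
unlock(restored)
```

In this model, calling `snapshot_host` while the process is still running raises an error. That is the point of the two-step design: a host-only snapshot would miss whatever is still on the GPU.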
We have been using the new CUDA Checkpoint API (https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__CH...) in combination with gVisor's checkpoint/restore API and our custom file system to greatly reduce container cold-boot time. This is particularly impactful if you need to warm up GPUs, for example if you are using torch.compile (on a restored cold boot you skip torch.compile entirely).
I see no linked evidence therein, neither clinical trial results nor equivalent accredited sources. Sounds like good news, but without any accompanying evidence this is just random click-bait.
At least the early news reports I read about the Biontech/Pfizer and Moderna vaccines didn't link to the scientific studies either. This seems to be pretty standard for news these days :-(
Even if the effectiveness number doesn't pan out to 92%, every working vaccine is another useful tool in the toolbox. I could imagine that many South and Central American countries have higher trust in Cuban than in US or European vaccines.
We are working to make our new publishing platform, Bertie, into an AI-powered writing assistant. The idea is to have the machine write stories alongside human writers, enhancing the latter's ability to write performant stories. I'd love your feedback!