Check out Determined https://github.com/determined-ai/determined to help manage ...

bigbillheck · on Oct 7, 2021

Oh hey I interviewed with y'all a few years back, glad to see you're still around.

dylanbfox · on Oct 7, 2021

Interesting. How do you guys manage spot interruptions when training on spot instances?

etrain · on Oct 7, 2021

Users expose their model to our Trial API (https://docs.determined.ai/latest/topic-guides/model-definit...). The base class then implements a training loop (which can be enhanced with user-supplied callbacks, metrics, etc.) that has a whole bunch of bells and whistles: easy distributed (multi-GPU and multi-node) training, automatic checkpointing, fault tolerance, etc.
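To make the "base class owns the training loop" idea concrete, here is a minimal sketch of the pattern. It is not Determined's actual Trial API: the names `Trial`, `train_batch`, `fit`, and `LinearTrial` are hypothetical, and the "model" is a single float trained with plain SGD.

```python
from typing import Callable, List, Optional

# Hypothetical callback signature: called with (step, loss) after every batch.
Callback = Callable[[int, float], None]


class Trial:
    """Hypothetical base class: the user describes the model, the base owns the loop."""

    def __init__(self, callbacks: Optional[List[Callback]] = None):
        self.callbacks = callbacks or []

    def train_batch(self, batch: float) -> float:
        """User-supplied: run one optimization step and return the loss."""
        raise NotImplementedError

    def fit(self, batches: List[float]) -> List[float]:
        # The loop lives in the base class, so features such as checkpointing
        # or distributed training could be added here without touching user code.
        losses = []
        for step, batch in enumerate(batches):
            loss = self.train_batch(batch)
            losses.append(loss)
            for callback in self.callbacks:
                callback(step, loss)
        return losses


class LinearTrial(Trial):
    """Toy user code: move a single weight toward each target with SGD."""

    def __init__(self, lr: float = 0.1, callbacks: Optional[List[Callback]] = None):
        super().__init__(callbacks)
        self.w = 0.0
        self.lr = lr

    def train_batch(self, batch: float) -> float:
        error = self.w - batch
        self.w -= self.lr * 2 * error  # gradient of error**2 with respect to w
        return error ** 2
```

The user only writes `train_batch`; anything that wraps the loop (metrics, callbacks, and in a real system checkpointing or multi-GPU coordination) stays in the base class.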

Concretely, the system regularly takes checkpoints (which include model weights and optimizer state), so if the spot instances disappear (as they do), it has enough information to resume from the last checkpoint once resources become available again.
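The checkpoint-and-resume behavior can be sketched in a few lines. This is an illustration under stated assumptions, not how Determined implements it: the model is a single float, the optimizer state is one momentum value, checkpoints are JSON files, and `SpotInterruption`, `save_checkpoint`, `load_checkpoint`, `train`, and `run_demo` are all hypothetical names.

```python
import json
import os
import tempfile


class SpotInterruption(Exception):
    """Stands in for a spot instance being reclaimed mid-training."""


def save_checkpoint(path, state):
    # Write to a temp file, then rename, so an interruption during the write
    # never leaves a half-written checkpoint behind.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def load_checkpoint(path):
    # No checkpoint yet means a fresh start.
    if not os.path.exists(path):
        return {"step": 0, "weights": 0.0, "momentum": 0.0}
    with open(path) as f:
        return json.load(f)


def train(path, total_steps, checkpoint_every=10, interrupt_at=None):
    state = load_checkpoint(path)
    w, m = state["weights"], state["momentum"]
    # Resume from the last checkpointed step rather than from zero.
    for step in range(state["step"], total_steps):
        if interrupt_at is not None and step == interrupt_at:
            raise SpotInterruption(step)
        grad = 2 * (w - 3.0)   # gradient of the toy loss (w - 3)^2
        m = 0.9 * m + grad     # momentum is the "optimizer state"
        w -= 0.01 * m
        if (step + 1) % checkpoint_every == 0:
            save_checkpoint(path, {"step": step + 1, "weights": w, "momentum": m})
    return w


def run_demo():
    workdir = tempfile.mkdtemp()
    clean = train(os.path.join(workdir, "clean.json"), 100)
    spot_path = os.path.join(workdir, "spot.json")
    try:
        train(spot_path, 100, interrupt_at=25)  # instance reclaimed at step 25
    except SpotInterruption:
        pass
    resumed = train(spot_path, 100)             # picks up from the step-20 checkpoint
    return clean, resumed
```

Saving the momentum alongside the weights is what makes the resumed run continue on the same trajectory; restoring the weights alone would restart the optimizer cold. The work done between the last checkpoint and the interruption (steps 20 to 24 here) is simply repeated.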

birch · on Oct 7, 2021

Thanks for going open source!