
Check out Determined (https://github.com/determined-ai/determined) to help manage this kind of work at scale. Determined uses Horovod under the hood, automatically manages cloud resources (it can get you running on spot instances, T4s, etc.), and works on your local cluster as well. It also gives you additional features like experiment management, scheduling, profiling, a model registry, advanced hyperparameter tuning, etc.

Full disclosure: I'm a founder of the project.




Oh hey I interviewed with y'all a few years back, glad to see you're still around.


Interesting. How do you guys manage spot interruptions when training on spot instances?


Users expose their model to our Trial API (https://docs.determined.ai/latest/topic-guides/model-definit...). The base class then implements a training loop (which can be extended with user-supplied callbacks, metrics, etc.) that comes with a whole bunch of bells and whistles: easy distributed (multi-GPU and multi-node) training, automatic checkpointing, fault tolerance, etc.

Concretely, the system regularly takes checkpoints (which include model weights and optimizer state), so if the spot instances disappear (as they do), the system has enough information to resume from the last checkpoint when resources become available again.
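To make the checkpoint-and-resume idea above concrete, here is a minimal, self-contained sketch in plain Python. It is not Determined's Trial API or its implementation: `ToyTrial`, `SpotInterruption`, `run_demo`, the JSON checkpoint file, and the toy momentum-SGD "model" are all hypothetical names and choices made up for illustration. It only shows the general pattern of a base class that owns the training loop, periodically saves weights plus optimizer state, and picks up from the last checkpoint after an interruption.

```python
import json
import os
import tempfile


class SpotInterruption(Exception):
    """Hypothetical stand-in for a spot instance being reclaimed."""


class ToyTrial:
    """Hypothetical base class that owns the loop and checkpoints periodically."""

    def __init__(self, checkpoint_path, checkpoint_every=10):
        self.checkpoint_path = checkpoint_path
        self.checkpoint_every = checkpoint_every
        self.step = 0
        self.weight = 0.0    # plays the role of "model weights"
        self.momentum = 0.0  # plays the role of "optimizer state"

    def train_step(self):
        # One momentum-SGD step on the toy loss (weight - 3)^2.
        grad = 2.0 * (self.weight - 3.0)
        self.momentum = 0.9 * self.momentum + grad
        self.weight -= 0.01 * self.momentum

    def save_checkpoint(self):
        state = {"step": self.step, "weight": self.weight, "momentum": self.momentum}
        tmp_path = self.checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        # Rename into place so a crash mid-write never leaves a partial checkpoint.
        os.replace(tmp_path, self.checkpoint_path)

    def load_checkpoint(self):
        if not os.path.exists(self.checkpoint_path):
            return False
        with open(self.checkpoint_path) as f:
            state = json.load(f)
        self.step = state["step"]
        self.weight = state["weight"]
        self.momentum = state["momentum"]
        return True

    def fit(self, total_steps, interrupt_at=None):
        # Resume from the last checkpoint if one exists, else start from step 0.
        self.load_checkpoint()
        while self.step < total_steps:
            if interrupt_at is not None and self.step == interrupt_at:
                raise SpotInterruption(f"interrupted at step {self.step}")
            self.train_step()
            self.step += 1
            if self.step % self.checkpoint_every == 0:
                self.save_checkpoint()


def run_demo():
    """Compare an uninterrupted run with one that is interrupted and resumed."""
    with tempfile.TemporaryDirectory() as tmp:
        baseline = ToyTrial(os.path.join(tmp, "baseline.json"))
        baseline.fit(total_steps=50)

        path = os.path.join(tmp, "spot.json")
        first = ToyTrial(path)
        try:
            first.fit(total_steps=50, interrupt_at=25)
        except SpotInterruption:
            pass  # work done since the last checkpoint (steps 21-25) is lost

        second = ToyTrial(path)  # a fresh process on a new instance
        second.load_checkpoint()
        resumed_from = second.step
        second.fit(total_steps=50)
        return baseline.weight, second.weight, resumed_from


if __name__ == "__main__":
    print(run_demo())
```

Because the checkpoint stores the optimizer state along with the weights, the resumed run in this sketch ends at the same weight as the uninterrupted one; only the steps taken after the last checkpoint have to be redone. Saving weights alone would lose the momentum term and the two runs would diverge.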


Thanks for going open source!



