
PyTorch elastic training - jonbaer
https://github.com/pytorch/elastic
======
vikas-kumar
We at AWS worked on this last year and saw some promising results with our
implementation for the Apache MXNet deep learning framework.

https://aws.amazon.com/blogs/machine-learning/introducing-dynamic-training-for-deep-learning-with-amazon-ec2/

Good to see more work in this direction; we would be happy to collaborate on
it.

------
pythux
I have a more meta question about Python, only remotely related to this
post. Given how heavily Python is used for machine learning and deep
learning, I wonder whether the more traditional use cases Python has served
could suffer. As for future development of the language itself: is ML big
enough to influence Python's direction and priorities? It's far-fetched,
but if a critical mass of companies and people started using Python
exclusively for ML, could that hurt Python as a "general purpose language"?

~~~
aivosha
One can always fork Python to keep it "general" if it ever comes to that. In
fact, I would argue both the ML and non-ML domains would benefit from such a
fork. Python is used in ML not because it's intrinsically good for it, but
because it happened to have a good set of libraries and a community at the
right time and place. Having a specialized subset of Python fine-
tuned/"compiled" for ML would be a good path for the language to evolve
along. I would not mind having all kinds of ML-related algebraic operations
as part of the language so the code reads more naturally. Beyond that, all
frameworks would have to support that language and would thus become much
more interchangeable. Just as CPython, PyPy, and others converge on the same
language, imagine TensorFlow, PyTorch, and the like converging on a single
language with built-in ML operations.
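
Worth noting that mainstream Python has already taken one small step in this
direction: PEP 465 added the @ infix operator for matrix multiplication
(dispatched via __matmul__), largely at the request of the numerical
community. A minimal example with NumPy:

    import numpy as np

    A = np.random.rand(3, 4)
    B = np.random.rand(4, 2)

    # PEP 465's dedicated matmul operator lets linear algebra read
    # more naturally than nested function calls:
    C = A @ B          # equivalent to np.matmul(A, B)
    print(C.shape)     # (3, 2)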

------
sytelus
This is great to see! This framework allows defining functions that can then
be run on many machines, gathering the output in a fault-tolerant, scale-out
way. However, having only an AWS example is very limiting; I hope Azure and
GCP get added soon. Better docs on how the infrastructure works underneath
(Ray? Kubernetes?) would also be appreciated. I'd love to see an example
that trains ImageNet in just a few minutes whenever cheap spot instances are
available in the cloud of your choice.
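
For a sense of the shape of this, here is a minimal sketch of a script run
under torchelastic's launcher (the model and data are placeholders, not the
project's actual example code). The elastic agent exports RANK, WORLD_SIZE,
MASTER_ADDR and MASTER_PORT, so the script only needs an env:// init:

    # train.py -- hypothetical minimal elastic-friendly script
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    def main():
        # The elastic agent sets RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT.
        dist.init_process_group(backend="gloo", init_method="env://")
        model = DDP(torch.nn.Linear(10, 10))
        opt = torch.optim.SGD(model.parameters(), lr=0.01)
        for _ in range(10):
            opt.zero_grad()
            loss = model(torch.randn(32, 10)).sum()
            loss.backward()  # DDP all-reduces gradients here
            opt.step()
        dist.destroy_process_group()

    if __name__ == "__main__":
        main()

It's launched with an etcd rendezvous, where --nnodes is a min:max range so
workers can come and go between restarts (a real job would also checkpoint
and resume, since membership changes restart the training function):

    python -m torchelastic.distributed.launch \
        --nnodes=1:4 --nproc_per_node=2 \
        --rdzv_id=my_job --rdzv_backend=etcd \
        --rdzv_endpoint=etcd-host:2379 \
        train.py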

------
choppaface
At a high level, this looks similar to Spark's barrier mode for TensorFlow /
Horovod, except that this system relies on etcd, which k8s folk know has
some limitations and admin costs...

For a modeling-focused project, one will still probably do better with a
multi-GPU machine than with the complexity of elastic scaling.

https://medium.com/plumbersofdatascience/whats-new-in-spark-2-4-121162f1c385
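
For reference, barrier mode gang-schedules all tasks in a stage and gives
them a synchronization primitive. A bare-bones PySpark sketch (what each
task does at the barrier, e.g. starting a Horovod worker, is left as a
placeholder):

    from pyspark import BarrierTaskContext
    from pyspark.sql import SparkSession

    def train(iterator):
        ctx = BarrierTaskContext.get()
        # All tasks in this stage are scheduled together and block here
        # until every one of them reaches the barrier.
        ctx.barrier()
        # A real job would launch an MPI/Horovod worker per task here.
        yield ctx.partitionId()

    spark = SparkSession.builder.master("local[4]").appName("barrier").getOrCreate()
    rdd = spark.sparkContext.parallelize(range(8), numSlices=4)
    print(rdd.barrier().mapPartitions(train).collect())  # [0, 1, 2, 3]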

------
sandGorgon
I'm actually not sure why everyone is investing in their own scaling
frameworks.

Would it have been hard to enhance Dask and leverage that? For example,
there is a huge conversation, and a lot of work has happened in Dask
specifically to support PyTorch:
https://github.com/dask/distributed/issues/2581

Dask has built-in support for AWS ECS, Kubernetes, GKE, EMR, etc.:
https://docs.dask.org/en/latest/setup/cloud.html
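
To illustrate the shape of the Dask approach (train_shard is a hypothetical
placeholder for per-worker PyTorch training), the same Client API works
whether the cluster is local, on Kubernetes, or on ECS:

    from dask.distributed import Client, LocalCluster

    def train_shard(shard_id):
        # Placeholder for training on one shard of the data.
        return "trained shard %d" % shard_id

    # Swap LocalCluster for dask_kubernetes.KubeCluster,
    # dask_cloudprovider's ECSCluster, etc.; the Client calls below
    # stay the same.
    cluster = LocalCluster(n_workers=4)
    client = Client(cluster)

    futures = client.map(train_shard, range(8))  # fan out
    print(client.gather(futures))                # fan in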

