
Polyaxon – An open source platform for reproducible machine learning at scale - jonbaer
https://polyaxon.com/
======
aflam
It's great to see this sort of tool open-sourced. I am excited by new tools
enabling better algo/ML engineering workflows.

In addition to the infra management, it quickly becomes tricky to support
qa/viz/tuning/debugging tools for very different sorts of
algorithms/outputs/configurations/metrics. Do you see your project going in
those directions?

~~~
mmq
I can't say much about long-term roadmap apart from the fact that the platform
will be open source and that it will try to introduce features that will
increase the productivity of data scientists.

For the short-term roadmap, I am trying to work on stability. It's very hard
to pick default values since you don't know how the platform will be used,
e.g. on Minikube or by a team scheduling a lot of parallel experiments. So
what I am trying to do is at least have an automatic or simple way to scale
the workers responsible for scheduling, hyperparameter tuning, and monitoring.

For tuning, the platform will keep supporting some algorithms to automate the
hyperparameter search, maybe introducing more priors for the Bayesian
optimization. I also think more tests are needed to validate the behavior of
the Bayesian optimization and Hyperband.
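
For a feel of what Hyperband does under the hood, here is a minimal sketch of successive halving, its core subroutine (generic Python, not Polyaxon's implementation; the candidate learning rates and the toy objective are made up for illustration):

```python
def successive_halving(configs, evaluate, min_budget=1, eta=3):
    """Core subroutine of Hyperband: evaluate all configs on a small
    budget, keep the top 1/eta of them, multiply the budget by eta,
    and repeat until a single config survives."""
    budget = min_budget
    while len(configs) > 1:
        scores = {c: evaluate(c, budget) for c in configs}
        survivors = max(1, len(configs) // eta)
        configs = sorted(configs, key=scores.get, reverse=True)[:survivors]
        budget *= eta
    return configs[0]

# Toy objective: higher is better, and the best learning rate is 0.3.
# A real evaluate() would train for `budget` epochs and return a metric.
best = successive_halving(
    configs=[0.05, 0.1, 0.2, 0.3, 0.5, 0.7, 0.9, 1.0, 1.5],
    evaluate=lambda lr, budget: -(lr - 0.3) ** 2,
)
```

Hyperband proper wraps this in an outer loop that trades off the number of configs against the starting budget, but the early-stopping intuition is all here.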

For visualization, you can currently start a Tensorboard for any project
created on the platform, but there are some problems with this assumption: if
the project has a lot of experiments, Tensorboard becomes slow to
unresponsive. The next release will introduce the possibility to create
Tensorboard jobs per experiment or per hyperparameter-tuning experiment
group, and possibly for any collection of experiments, to compare them.

The platform already collects metrics from experiments, so a basic
visualization is also planned, to give a quick overview before diving into a
Tensorboard.

And most importantly, I think there are some usability issues that need to be
solved to make the experience better.

There are also a couple of ideas around team collaboration that will be
introduced in the mid term.

~~~
aflam
Thanks a lot for the answer! It's very promising.

------
mmq
Hi, I am the author of Polyaxon, a bit late to notice, but thanks for sharing
Polyaxon here. I will be around to answer questions, and any feedback is
welcome.

~~~
mallochio
Thanks a bunch for taking the effort to create this, and for making it open
source! The project looks amazing.

Could you maybe also explain what the target audience is? Are there any
benefits for using Polyaxon in (solo) research projects on a cluster, or is it
tailored towards production-ready environments at corporations?

~~~
mmq
I think the target audience is individuals or small teams who want an
organized workflow, immutable and reproducible experiments, and an organized
and easy way to access logs and outputs.

The platform also provides a lot of automation to schedule concurrent
experiments.

There are a couple of things that still need to be polished, like notes on
experiments and notifications for finished experiments, which matter
especially if you are running hundreds of experiments.

Depending on how organized you are, many times you will end up with
experiments that you no longer know how you started; having a platform that
takes care of that can be beneficial.

If you already have a cluster for running your experiments, you will most
probably end up ssh-ing into the machines, probably inside a screen session,
to check which experiments finished and to look at their results and logs.
Polyaxon simplifies that part as well.

~~~
jorgemf
These are all the reasons why I want to use your tool.

Thanks for creating it and I hope one day soon to contribute to your project.

------
mallochio
Can someone tell me how this is different from/improves over pachyderm?

~~~
syllogism
Pachyderm is a system for the nouns; this is a system for the verbs.

Polyaxon makes it easy to schedule training on a Kubernetes cluster. The
problem this solves is that machine learning engineers generally spend too
long running their jobs in series, rather than parallel. Instead of running
one thing and waiting for it to finish, it's both more efficient and better
methodology to plan out the experiments and then run them all at once.

Pachyderm is more concerned with versioning and asset management. It's more
like Git+Airflow.

Let's say your experiment depends on training word vectors from Common Crawl
dumps. You need to download the dump, extract the text you want, and train
your word-vector models. Pachyderm is all about the problem of caching the
intermediate results of that ETL pipeline, and making sure that you don't lose
track of like, which month of data was used to compute these vectors. Polyaxon
is all about the problem of, there are so many ways to train the word vectors
and use them downstream. You want to explore that space systematically, by
scheduling and automatically evaluating a lot of the work in parallel.
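
To make the "nouns" side concrete, here is a minimal sketch of content-addressed caching for one pipeline step (plain stdlib Python, nothing to do with Pachyderm's actual API; `cached_step` and the `extract` step are hypothetical names for illustration):

```python
import hashlib
import json
import tempfile
from pathlib import Path

def cached_step(cache_dir, step_name, inputs, fn):
    """Run fn(inputs) once per unique (step, inputs) pair, then reuse the
    result stored on disk. Hashing the inputs into the cache key means you
    can always tell which inputs produced a given intermediate result."""
    key = hashlib.sha256(
        json.dumps({"step": step_name, "inputs": inputs}, sort_keys=True).encode()
    ).hexdigest()
    path = Path(cache_dir) / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text())
    result = fn(inputs)
    path.write_text(json.dumps(result))
    return result

# Toy "extract text" step: the second call hits the cache, so fn runs once.
calls = []
def extract(inputs):
    calls.append(inputs)
    return {"documents": 1000}

with tempfile.TemporaryDirectory() as cache:
    first = cached_step(cache, "extract", {"dump": "2018-03"}, extract)
    second = cached_step(cache, "extract", {"dump": "2018-03"}, extract)
```

Change the dump month and the key changes, so stale intermediates can never be silently reused, which is the property Pachyderm provides at pipeline scale.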

------
stared
For reproducible ML I recommend Neptune - Machine Learning Lab
[https://neptune.ml/](https://neptune.ml/) (disclaimer: I work with the
people who created it).

Not only does it allow you to run/enqueue things in the cloud, it also does
very good tracking of source code (with code snapshots and git integration),
parameters, and output statistics (e.g. you can select all models with the
#lstm tag and sort by log-loss on the validation dataset).

------
sytelus
Great to see infrastructure like this come along. I'm wondering what everyone
else is using...

------
syllogism
I've been starting to use this in spaCy, so I'm glad to see it posted here!
It's still a young project, but Mourad has been very dedicated to it, and I
think it's already at the point where it's useful. I hope more people can
contribute. Here's a quick review.

Most people doing machine learning at the moment are using a pretty bad
workflow. It's difficult to avoid the trap people refer to as "grad student
descent": endless tinkering, where you run two or three jobs, monitor the
results, and then kick off another one. You don't really have a hypothesis in
this cycle, so you don't know when to stop. At the end of the process you've
generally gained intuition and insight, but nothing you can reliably pass on.

The solution to this trap is to commit to a matrix of results you're going to
collect, program up the experiments, and let them run. Once you have the
proper comparison, you can then decide what to do next.
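
The "commit to a matrix, then run it all at once" workflow can be sketched in a few lines of generic Python (the grid, `run_experiment`, and the toy score are hypothetical stand-ins for real training jobs):

```python
import itertools
from concurrent.futures import ThreadPoolExecutor

# The full matrix of experiments, committed to up front.
grid = [
    {"lr": lr, "depth": depth}
    for lr, depth in itertools.product([0.01, 0.05, 0.1], [2, 4, 8])
]

def run_experiment(params):
    # Placeholder for a real training job; a deterministic toy score here.
    score = -(params["lr"] - 0.05) ** 2 - 0.001 * (params["depth"] - 4) ** 2
    return {**params, "score": score}

# Run the whole matrix at once instead of one job at a time.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(run_experiment, grid))

best = max(results, key=lambda r: r["score"])
```

Polyaxon's contribution is doing this same fan-out across a Kubernetes cluster rather than local threads, with the bookkeeping of each cell's code, config, and metrics handled for you.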

Most university research groups get a grant to buy some hardware, create a
cluster, and then schedule jobs on the machines using SLURM or HTCondor. These
technologies are mature, but they leave individual researchers with a lot to
do. You can schedule your jobs, but you have to write the experiment
management yourself. My hunch is there are maybe 5-50 companies in the world
with internal systems significantly more sophisticated than this.

Polyaxon brings the shiny new "cloud native" workflow to the problem. It runs
on Kubernetes, which is much easier to use, especially with heterogeneous
hardware. I think just switching to Kubernetes and containers would be helpful
to a lot of teams. On top of the cluster solution, Polyaxon brings a nice
experiment management layer, with hyper-parameter search. It also manages the
containerisation for you, so that the researcher doesn't need to interact with
Docker directly.

There are still a number of things that are under-developed. The most
noticeable are dataset management and artifact export. You currently have to
do this yourself, e.g. by adding persistent disks to the cluster. I use
GCSFuse to mount GCS (Google's equivalent of S3) buckets as directories, which works pretty
well in the meantime. There are also a few defaults that could use refinement.
If large clusters are being created, the management services are currently a
bit under-resourced. Finally, there are a few more minor rough spots. For
instance, the web app is currently a little unpolished. It's a Django app, so
everything takes two or three more clicks and refreshes than you'd ideally
want. A more AJAXy front-end would be nice.

There are several commercial competitors. There's a pretty obvious analogy
between the use-case here and the CRM. Companies hope to own the "system of
record", and be the shared space where ML teams collaborate. DominoDataLab.com
, Neptune.ml, Cloudera.com, Datascience.com and others all have slightly
different takes on this problem. Many of the above are
built around a Jupyter Notebooks-based experience, and are targeted more
towards workflows where the primary outputs are insights and reports rather
than product development.

I think an open-source framework is valuable for a few reasons. We should be
reluctant to buy in to commercial platforms for precisely the reasons vendors
are so interested in this space. Lock-in can hurt here, and if the computation
is scheduled via the vendor, they get the chance to tax you a % of your
compute spending. Given how expensive GPU experiments can be, that's a big
ongoing cost to sign up for.

I also think uploading all your training data to someone else's system is a
bad idea, that your data sharing agreements often won't permit.

Finally, it's nice to have a local Kubernetes cluster for other reasons.
Kubernetes is basically an OS. Polyaxon is an app that runs on that OS. This
is nice: you can develop other apps to work with it as well. In contrast, if
you rent a service, the easiest way to meet your next requirements will be
with further services. As soon as you hit custom requirements, costs and risks
rise rapidly. The in-house approach signs you on to a better future --- it's a
step in the right direction. The commercial services may or may not be easier
today, but at some point you'll want to switch out of them. It may still be
worth using them today --- but the long-term perspective is at least a point
in Polyaxon's favour.

~~~
mmq
Thank you for the great feedback. A lot of things are indeed planned to
enhance the dashboard as well as the CLI in terms of searching, filtering,
and ordering experiments based on some rule, e.g. parameters or metrics.

Also, your feedback, and that of other users, was very helpful in bringing
some changes to the infrastructure for the next release, to maximize the
usage of the cluster's resources.

------
albertzeyer
It's only reproducible if the computation is deterministic. This is often not
the case for machine learning, esp. TensorFlow on GPU. How do you deal with
that?

~~~
mmq
As you said, if the library/framework provides deterministic computation,
then it should not be a problem. I am not sure if it's already fixed in
Tensorflow, but parallel computation and the order of operations can both
influence the reproducibility of a run, so sometimes you need to ensure that
in your own code.
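
At the user-code level, the controllable part usually comes down to pinning every RNG the run touches. A minimal stdlib-only sketch (the `run_experiment`/config shape here is hypothetical, not Polyaxon's API; NumPy and framework seeds would be pinned the same way):

```python
import random

def set_seeds(seed):
    """Pin every RNG the run touches. In real ML code you would also seed
    NumPy and the framework here; GPU-level nondeterminism (parallel
    reductions, op ordering) has to be addressed in the framework itself."""
    random.seed(seed)

def run_experiment(config):
    # Recording the seed in the config is what lets a restart replay
    # the run exactly, alongside the code commit and Dockerfile.
    set_seeds(config["seed"])
    return [random.random() for _ in range(3)]

config = {"seed": 42, "lr": 0.01}
first = run_experiment(config)
second = run_experiment(config)  # a "restart" with the same recorded config
```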

What Polyaxon provides is a way to restart a run based on as many collected
parameters as possible. By default it will use the same code based on the
internal git commit, reuse the same configuration and the same Dockerfile,
and, if provided, it will use the same resources (CPU, GPU, and memory) as
the original run. If the experiment had a node selector, the restarted
experiment will be scheduled on the same node as well.

------
amelius
No support for (Py)Torch?

~~~
syllogism
It supports PyTorch.

It also supports anything --- you give it a script and it builds a Docker
container and runs it. You don't need to use any particular language or
framework.

