
Run tracking liberates ML teams - lewq
https://dotscience.com/blog/run-tracking/
======
king_magic
I 100% agree with the premise of this; I can certainly attest to run tracking
being a critical component of my own personal model
development/training/tuning pipeline. I'm not sure how I feel about using yet
another 3rd party data science platform though - it'd be nice if the run
tracking piece were just a component I could import into my notebook and have
it automagically track things for me (like how seamless tqdm is for showing
progress bars while iterating over loops).
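
For illustration, the kind of drop-in interface I have in mind might look
roughly like this - `track_run` is a made-up name, not an existing package:

```python
# Hypothetical sketch only -- this is the sort of tqdm-style, import-and-go
# run tracker being wished for here, not a real library.
import json
import time
from contextlib import contextmanager

@contextmanager
def track_run(path="runs.jsonl", **params):
    """Record parameters, metrics and timing for one training run."""
    record = {"params": params, "metrics": {}, "started": time.time()}
    try:
        yield record["metrics"]            # caller fills in metrics as it trains
    finally:
        record["finished"] = time.time()
        with open(path, "a") as f:         # append one JSON line per run
            f.write(json.dumps(record) + "\n")

# In a notebook cell:
with track_run(lr=0.001, epochs=10) as metrics:
    # ... train the model ...
    metrics["val_accuracy"] = 0.93
```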

Regardless of my uncertainty around trying another 3rd party platform, the
premise is spot on.

~~~
lewq
Thank you! I am very interested in finding open source and/or non-platform
ways to deliver the solution to the community. I know how off-putting 3rd
party platforms can be. That's why I put the word "tooling" on the product
page, not "platform" :)

Please reach out to me on luke@dotscience.com if you'd be up for finding a way
to work on that together.

~~~
lukas
For sure! I'm emailing you now.

------
asdfman123
This is an ad, but they're not wrong.

At my work, we use Azure ML Studio. I think the solution for deployment is to
run some scripts to save the model information to git and automatically deploy
from there. It will take a little bit of effort to set up, but I think it
should work.
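
Roughly the kind of script I mean - a sketch only, with made-up file names and
tag, assuming git is available in the working directory:

```python
# Sketch of "save the model information to git" -- not an actual Azure ML
# Studio integration; file names, values and the tag are illustrative.
import json
import subprocess

model_info = {
    "model_file": "model.pkl",
    "params": {"max_depth": 6, "n_estimators": 200},
    "metrics": {"auc": 0.91},
}

with open("model_info.json", "w") as f:
    json.dump(model_info, f, indent=2)

# Commit and tag; a CI job watching the repo can deploy from the tag.
subprocess.run(["git", "add", "model_info.json"], check=True)
subprocess.run(["git", "commit", "-m", "Record model run"], check=True)
subprocess.run(["git", "tag", "model-run-001"], check=True)
```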

~~~
lewq
Yeah, sorry for the ad but us startup founders have to launch our products
somehow :)

That makes sense about saving the model info to git, but how do you track the
provenance of the data? Do you see that as being important? (e.g. to trace
back from a model to the data it was trained on, when later updating the model
or fixing issues with it?)

Feel free to get in touch directly - luke@dotscience.com if you'd be willing
to try our stuff.

~~~
BubRoss
Maybe you should pay for advertising.

~~~
ac2u
Jesus, give them a break. Someone who's practically a mega-corp like Stripe
can launch a new product with a flashy landing page on HN and everyone
discusses it without issue.

There's a bit of a double standard.

~~~
BubRoss
Not from me; I hate that stuff too. Unless something is really game-changing
or a step forward, it isn't news.

------
lukas
I totally agree with this and I built wandb (wandb.com) to solve this problem.
We try to do this in as lightweight a way as possible - for example we can do
keras tracking with a single line
([https://www.wandb.com/articles/visualize-keras-models-with-one-line-of-code](https://www.wandb.com/articles/visualize-keras-models-with-one-line-of-code))
and pytorch with just a couple lines
([https://www.wandb.com/articles/monitor-your-pytorch-models-with-five-extra-lines-of-code](https://www.wandb.com/articles/monitor-your-pytorch-models-with-five-extra-lines-of-code)).
Would love any feedback on it.
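
Roughly what those hooks look like, paraphrasing the linked articles - `model`,
`x_train`, `y_train`, `loader` and `train_step` are placeholders for your own
code, and exact import paths can vary between wandb versions:

```python
import wandb

# Keras: one extra callback on an existing model.fit() call.
from wandb.keras import WandbCallback

wandb.init(project="my-project")
model.fit(x_train, y_train, epochs=10, callbacks=[WandbCallback()])

# PyTorch: watch the model and log metrics from the training loop.
wandb.init(project="my-project")
wandb.watch(model)
for batch in loader:
    loss = train_step(model, batch)   # your existing training step
    wandb.log({"loss": loss.item()})
```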

~~~
lewq
Hey Lukas! Love your work on wandb and very keen to find ways to
integrate/collaborate :)

------
woeirua
I understand that for some people this might help, but frankly I find all of
these "reproducibility" frameworks fall flat as soon as truly big data enters
the picture. Data versioning is not sufficient, because I typically cannot
roll back my datasets to a previous version (and we moved forward for a
reason).

Also, we are deliberately not using Databricks for this to avoid vendor lock-in
for something that will almost certainly be open-source soon.

~~~
lewq
I agree that there should be open source solutions, and the core of Dotscience
is a project called Dotmesh
([https://github.com/dotmesh-io/dotmesh](https://github.com/dotmesh-io/dotmesh)),
which is an open source wrapper around ZFS. You can see more info about how
this works and integrates with Dotscience at
[https://dotscience.com/technology/](https://dotscience.com/technology/).

However, I don't quite understand your point that "data versioning is not
sufficient" because "I cannot roll back my datasets to a previous version".
Surely data versioning is exactly what _gives_ you the ability to roll your
datasets back to a previous version? Or are you not convinced of the need for
reproducibility in data science? My rationale is the following: if you are
building ML models that are going to make important decisions in production,
it's imperative for debuggability that you're able to re-run that model
training run later. If a model makes a bad decision in production, you need to
know which dataset it was trained on, which means you need to be able to
retrieve that dataset - you can't isolate and fix the problem without being
able to re-run the training.
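
Even a minimal version of that trace-back - fingerprint the training data and
store the hash alongside the run - is enough to make it possible later. An
illustrative sketch, not how Dotscience itself stores provenance (it assumes a
local `train.csv`):

```python
# Record a content hash of the training data with each run, so a production
# model can be traced back to exactly the data it was trained on.
import hashlib
import json
import time

def dataset_fingerprint(path, chunk_size=1 << 20):
    """SHA-256 of the training data file, computed in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

run_record = {
    "run_at": time.time(),
    "data_file": "train.csv",
    "data_sha256": dataset_fingerprint("train.csv"),
    "params": {"learning_rate": 0.01},
}
with open("runs.jsonl", "a") as f:
    f.write(json.dumps(run_record) + "\n")
```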

Yes, data changes and marches forwards, and that's why you should retrain
models. But you also need to be able to go backwards to do robust ML. My 2
cents.

I also wrote a DevOps for ML Manifesto:
[https://dotscience.com/manifesto/](https://dotscience.com/manifesto/)

To summarize:

1. All models must be reproducible by someone else 6 months later.

2. All models must be accountable: you must be able to justify the basis on
which they made their decisions, which in particular means knowing which data
version was used and where it came from (provenance).

3. Model development must be collaborative. That means I need to be able to
fork a copy of your project and maintain all the metadata about which runs you
did, with their respective provenance history.

4. Models must have a continuous lifecycle. You're not done when you ship -
because models are about finding patterns in data, and the world is constantly
changing, you need statistical monitoring and retraining to compensate for
model drift.

Do you disagree? Did I miss anything? Let me know your thoughts!

~~~
woeirua
No, data versioning typically does not solve this problem unless you are
completely isolated from everything else. Consider the case where you're
pulling data from a database to use for training a model. Typically DS teams
have no control over the backup schedule for that database, how long those
backups are maintained, etc. For large systems, if you try to restore a backup
3 or 6 months down the road, you may only have a weekly snapshot (if even that
frequent) to use. Certainly that won't give you the same data that you
originally had. Admittedly, a team that was strongly focused on
reproducibility could _try really hard_ to ensure that their queries are
reproducible, but if you want to be productive you typically delegate most of
that work to other libraries which may or may not generate reproducible
queries.

It seems to me that people working on really large-scale problems always have
these external dependencies that they just can't control, and consequently
data versioning only goes so far when you cannot make a full copy of the data
for each version.

I agree with you in general that we should strive for reproducibility.

~~~
lewq
Hmm, interesting - what if you had database snapshots that could be driven by
ML workflows? The primitive behind Dotscience can support lightweight database
snapshots quite easily...

~~~
lewq
To be clear, that primitive is dotmesh and ZFS - see
[https://dotscience.com/technology/](https://dotscience.com/technology/)

------
gyre007
So run tracking as described here is about tracking every "variable" which
comes into play when training your model?

~~~
lewq
Yes, exactly. And in particular, capturing those variables _at the point of
the run_, not just some time before/after when you remember to record them
manually.

Lots more detail here:
[https://dotscience.com/product/](https://dotscience.com/product/), and there's
a super long deep-dive demo with lots of examples (I would have made it shorter
if I'd had more time ;))

------
andbberger
Seems like there's been an explosion of startups trying to win B2B dollars
for this.

There is an excellent open source project that nails this called sacred. It's
not perfect, but it works, and as far as I can tell it has won the popularity
contest.

Please join me in using and contributing back to sacred!
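
For anyone who hasn't tried it, a Sacred experiment looks roughly like this -
details such as how the observer is constructed can vary between versions:

```python
# Rough shape of a Sacred experiment: config variables are captured per run
# and results are written out by an observer.
from sacred import Experiment
from sacred.observers import FileStorageObserver

ex = Experiment("mnist_baseline")
ex.observers.append(FileStorageObserver("runs"))  # store configs/results in ./runs

@ex.config
def config():
    learning_rate = 0.01   # every variable defined here is recorded per run
    epochs = 10

@ex.automain
def main(learning_rate, epochs):
    # ... train the model here ...
    accuracy = 0.95        # placeholder result
    return accuracy        # stored as the run's result
```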

~~~
lewq
I will take a look at that and whether we can integrate with it, thanks for
the tip! What do you like about it?

~~~
andbberger
Mostly that it exists. It neatly solves a common gripe. It might not be the
best possible version of this tool, but for better or worse it is currently
the dominant open source effort in this space, and we all benefit from jumping
on the bandwagon.

Probably Google will pull a TensorFlow soon and sacred will go the way of
Theano, but until then...

------
visarga
Does it also do hyper-parameter search? It's something you usually want to
have.

~~~
lewq
It's an environment for tracking runs of data engineering and ML training
code, so you can use tools like H2O and sklearn grid search within Dotscience.
But we could also build on the run mechanism to automate kicking off lots of
runs in parallel over a parameter search space.
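
For example, the usual sklearn pattern works as-is - nothing Dotscience-specific
in this sketch - with each candidate fit being one run worth tracking:

```python
# Plain sklearn grid search: cv_results_ holds params + scores for every
# candidate, which is exactly the per-run record worth capturing.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)
param_grid = {"n_estimators": [50, 100], "max_depth": [3, 5, None]}

search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)

for params, score in zip(search.cv_results_["params"],
                         search.cv_results_["mean_test_score"]):
    print(params, round(score, 3))
```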

What are you using at the moment?

