We're Ben & Andreas, and we made Replicate. It's a lightweight open-source tool for tracking and analyzing your machine learning experiments: https://replicate.ai/
Andreas used to do machine learning at Spotify. He built a lot of ML infrastructure there (versioning, training, deployment, etc). I used to be a product manager for Docker's open source projects, and I created Docker Compose.
We built https://www.arxiv-vanity.com/ together for fun, which led to us teaming up to build more tools for ML.
We spent a year talking to lots of people in the ML community and building all sorts of prototypes, but we kept on coming back to a foundational problem: not many people in machine learning use version control.
This causes all sorts of problems: people are manually keeping track of things in spreadsheets, model weights are scattered on S3, and results can’t be reproduced.
So why isn’t everyone using Git? Git doesn’t work well with machine learning. It can’t store trained machine learning models, it can’t handle key/value metadata, and it’s not designed to record information automatically from a training script. There are some solutions for these things, but they feel like band-aids.
We came to the conclusion that we need a native version control system for ML. It’s sufficiently different to normal software that we can’t just put band-aids on Git.
We believe the tool should be small, easy to use, and extensible. We found people struggling to migrate to “AI Platforms”. A tool should do one thing well and combine with other tools to produce the system you need.
Finally, we also believe it should be open source. There are a number of proprietary solutions, but something so foundational needs to be built by and for the ML community.
Replicate is a first cut at something we think is useful: a Python library that uploads your files and metadata (like hyperparameters) to Amazon S3 or Google Cloud Storage. You can get back to any point in time using the command-line interface, analyze your results inside a notebook using the Python API, and load your models in production systems.
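To make that concrete, here's roughly what a tracked training script looks like. This is a sketch of the general shape; see the docs for the exact function names and signatures.

```python
import replicate

def train(learning_rate=0.01, num_epochs=10):
    # Create an experiment: this records your training code and
    # hyperparameters in your S3/GCS bucket before training starts.
    experiment = replicate.init(
        path=".",
        params={"learning_rate": learning_rate, "num_epochs": num_epochs},
    )

    for epoch in range(num_epochs):
        loss = 1.0 / (epoch + 1)  # stand-in for your real training step
        # In a real run you'd save weights here, e.g.
        # torch.save(model.state_dict(), "model.pth")

        # Save the weights file plus metrics as a point in time you can
        # get back to later from the CLI or the notebook API.
        experiment.checkpoint(
            path="model.pth",
            metrics={"epoch": epoch, "loss": loss},
        )

if __name__ == "__main__":
    train()
```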
We’d love to hear your feedback, and hear your stories about how you’ve done this before.
Also – building a version control system is rather complex, and to make this a reality we need your help. Join us in Discord if you want to be involved in the early design and help build it: https://discord.gg/QmzJApGjyE
Basically Mr Kurtz saying "The horror! The horror!" gives you the right impression.
For small datasets and short training times it isn't so bad: a deterministic training script in Git and the dataset in Git LFS, so you can easily reproduce the model when needed.
Having to do this with very large datasets (which can be expensive to keep a copy of just for experiments rather than prod) or with really slow training times is basically disgusting. I'll be interested to see what you've done.
What do you mean by that? Is a JSON file no good? I guess you mean the diffs will be unordered?
There are some examples of these things on the home page, all of which would be very fiddly to do with JSON files in Git: https://replicate.ai/#features
We want to have more ways to do this though. We were close to adding SFTP support, but didn't get round to it. Another method could be to implement our own server, but we're trying to keep it simple for now. I'd be curious to hear your feedback here: https://github.com/replicate/replicate/issues/366
It isn't able to render all the papers I tried, but it's still very useful.
Add a dark theme to it and I’ll be forever happy!
* Kubeflow: https://github.com/kubeflow/kubeflow
* MLFlow: https://github.com/mlflow/mlflow
* Pachyderm: https://github.com/pachyderm/pachyderm
* DVC: https://github.com/iterative/dvc
* Polyaxon: https://github.com/polyaxon/polyaxon
* Sacred: https://github.com/IDSIA/sacred
* pytorch-lightning + grid:
* DeterminedAI: https://github.com/determined-ai/determined
* Metaflow: https://github.com/Netflix/metaflow
* Aim: https://github.com/aimhubio/aim
* And so many more...
In addition to this list, several other hosted platforms offer experiment tracking and model management. How do you compare to all of these tools, and why do you think users should move from one of them to Replicate? Thank you.
I think it’s a mix of both, honestly, but we’re betting that there’s more of the latter in the mix. :)
I could do comparisons of each of these tools, and some of them are solving quite different problems, but the overarching difference is we’re just trying to do less. These systems might make sense if you’re setting up a company’s ML pipeline, but we found lots of individuals struggling to keep track of their work and store their models. They balked at the idea of setting up things like Kubeflow or MLflow.
What I’ve seen is that most vendor solutions aren’t flexible enough. Many firms have on-prem and data privacy restrictions that make hosting model training artifacts on a vendor’s servers impossible, and that makes it way easier to sell building in-house.
I’m very surprised you are hearing that nobody is using MLflow and Kubeflow (though Kubeflow has a lot of genuine usability and bug problems). I am hearing the exact opposite: everybody, from tiny 5-person startups to giant e-commerce firms, is just spinning up simple stuff for model tracking, mostly around MLflow.
From the people we have talked to who use MLflow, we hear it gets the job done, but individual contributors don't love it.
We really believe that widespread adoption comes from making something individuals love and use every day. That's the reason Docker was so successful, for example.
The lack of flexibility really resonates. That's the reason we're trying to be small and not too opinionated. We want to be something you can drop into your in-house system as a component, kind of like how lots of deployment systems are built around Docker.
I think it's going to be a tough road to build a sustainable business in this space.
We've used https://github.com/iterative/dvc for a long time and quite happy. What's the main difference between replicate.ai and dvc?
DVC is closely tied to Git. We've heard people find that quite heavyweight when you're running experiments.
We think we can build a much better experience if we detach ourselves from Git. With Replicate, you just run your training script as usual, and it automatically tracks everything from within Python. You don't have to run any additional commands to track things.
DVC is really good for storing data sets though, and we see potential for integration there: https://github.com/replicate/replicate/issues/359
TL;DR: I think it should be compared with the upcoming DVC feature - https://github.com/iterative/dvc/wiki/Experiments . Stay tuned - it'll be released very soon but you can try it now in beta.
First of all, congrats on the launch! I do really like the aesthetics of the website, and the overall approach. It resonates with our vision and philosophy!
Good feedback on experiments feeling heavyweight! We've been focused on building a great foundation for managing data and pipelines in previous DVC versions, and we were aware of this problem (https://github.com/iterative/dvc/issues/2799). As I mentioned, the Experiments feature is already in beta testing. It means users don't have to make commits until they're ready, can still share experiments (it's a long topic and we'll write a blog post at some point, since I'm really excited about the way it's implemented using custom Git refs), get support for DL workflows (auto-checkpoints), and more. Would love to discuss and share any details; it would be great to compare the approaches.
In terms of features, Replicate points directly at an S3 bucket (so you don't have to run a server and Postgres DB), it saves your training code (for reproducibility and to commit to Git after the fact), and it has a nice API for reading and analyzing your experiments in a notebook.
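To give a flavour of the notebook side, here's a sketch; treat the accessor names as illustrative and check the docs for the exact API.

```python
import replicate

# List the experiments stored in the bucket this project points at.
experiments = replicate.experiments.list()

for exp in experiments:
    # Each experiment carries its params; each checkpoint carries metrics.
    print(exp.id, exp.params)
    for chk in exp.checkpoints:
        print("  ", chk.id, chk.metrics)
```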
>MLflow is an all-encompassing "ML platform"
Not really. We're trying to use MLflow with our "ML platform". Namely, it can save a model that expects high-dimensional inputs (which describes most non-trivial models I've seen), but it can only "deploy" the model with the expectation of two-dimensional DataFrame inputs. Apparently, they're working on that.
There are also many ambiguities around Keras and TensorFlow, stemming from questions like: what is a Keras model? Is it a TensorFlow model now that they're integrated? Why are Keras models logged with the TensorFlow model logger when you use the autolog functionality? These ambiguities are shared, since there are several ways to save and load models with TensorFlow itself, and we're looking into the Keras/TensorFlow integration closely. MLflow uses `cloudpickle`, and unpickling expects not only the same 'protocol' but the same Python version. We had to dig deeper than necessary.
One other problem is when a model relies on ancillary functions, which you must be able to ship somehow. You end up tinkering with its guts, too.
Could you shed some light on how you deal with these matters? Namely: high-dimensional inputs for models, pre-processing/post-processing functions, serialization brittleness, and the Keras/TensorFlow "duality".
We have to absorb that complexity to spare our users from having to think about saving their experiments (we do that automatically, saving models, metrics, and params). The workflow is: data --> collaborative notebooks with scheduling features and jobs --> (generate appbooks) --> automatically tracked models/params/metrics --> one-click deployment --> 'REST' API or form to invoke the model.
Aaaaaand again, congrats on the launch!
https://iko.ai
We haven't thought about it in great detail yet, so I'd be curious to hear your thoughts and ideas if you'd like to add a comment to that issue!
My project is in its infancy (open-sourced less than a month ago), but I'm pleased with its UX thus far. There's lots to add in terms of documentation, but Dud currently uses Rclone for remote syncing.
Thank you for your amazing work!
Do you intend to integrate it with PT Lightning, or as a PT Lightning Logger?
It would be nice to have it maintained there.
It’s used all over the huggingface-Transformers examples.
And great idea to integrate with PT Lightning. I just opened an issue: https://github.com/replicate/replicate/issues/367, feel free to add more detail and comments! -andreas
However, what I love about them is the amazing UI that lets me compare experiments.
But it's also out of principle: we think such a foundational thing needs to be open source. There is a reason most people use Git and not Perforce.
Replicate can work alongside visualization tools -- your data is safe in your own S3 bucket, but you can use the hosted visualization tool to complement that.
You could also imagine visualization tools built on top of Replicate. One thing we've been thinking about is doing visualization inside notebooks. It's like a programmable Tensorboard: https://colab.research.google.com/drive/18sVRE4Zi484G2rBeOYj...
I'd be curious to hear your thoughts about that. It's pretty primitive so far, but we've got to start somewhere I suppose. :)
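In case the notebook above is hard to load, the gist of it is something like this (a sketch with illustrative attribute names, plotted with matplotlib):

```python
import matplotlib.pyplot as plt
import replicate

# One line per experiment: loss across checkpoints -- the kind of view
# Tensorboard gives you, but composable inside a notebook.
for exp in replicate.experiments.list():
    losses = [chk.metrics["loss"] for chk in exp.checkpoints]
    plt.plot(losses, label=exp.id[:7])

plt.xlabel("checkpoint")
plt.ylabel("loss")
plt.legend()
plt.show()
```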
I built an open source tool (called hyperML) for a similar problem some time back.
I think the problem is not just storing things in version control but being able to quickly retrieve them in live/test systems or containers.
Mounting and loading datasets and models is painful. It's kind of what makes local training a better option.
If only the weights were version controlled by the libraries (tf, pytorch, or scikit), this whole problem would be much easier to solve.
Is R support planned? Or might this just work already with reticulate?
There are a number of businesses we could build around the project. A cloud service or enterprise products/support are the obvious ones. Right now, we're focused on community building, because a potential open source business can't be successful without a healthy open source project.
So you got funding from YCombinator without a concrete plan to make a business? That's pretty interesting, I always thought they wanted profitable businesses, and turned down ideas they didn't think would work.
Great to hear they're betting on open-source more!
Even still, I think most people apply to YC without a concrete plan of how to make a business. It's normally so early stage that the plan is yet to be validated and will probably change. A "plausible" plan is perhaps a better way to put it. ;)
So we're trying to automate recording this metadata, but then give you that metadata in various ways for you to inspect it. One of those ways is actually spreadsheets: https://github.com/replicate/replicate/issues/289
Differences in tools used (spreadsheets, flat files, logs, pen and paper, human memory). Forgetting to do it. Snippets for doing it flying around. Different locations (laptop, group workstation, Git repository, cloud sheet). Dissociated from the notebook that produced the model.
Tighter tracking should answer questions like: what notebook ran on which data and produced which model with which parameters and which scores? Then questions like: give me all notebooks that ran on this dataset which produced a model with scores that are [condition].
Once you do that, the "spreadsheet" can just be a "view" of the underlying data. Something you can export as, but not the thing itself.
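For instance, a sketch of that "view" idea, assuming the experiment metadata has already been pulled into plain dicts (pandas provides the view):

```python
import pandas as pd

# Suppose each experiment's metadata has been loaded as a dict.
experiments = [
    {"id": "ab12cd3", "dataset": "reviews-v2", "learning_rate": 0.01, "accuracy": 0.91},
    {"id": "ef45gh6", "dataset": "reviews-v2", "learning_rate": 0.10, "accuracy": 0.78},
]

df = pd.DataFrame(experiments)

# "Give me all runs on this dataset whose score meets a condition."
view = df[(df["dataset"] == "reviews-v2") & (df["accuracy"] > 0.9)]

# The spreadsheet is just an export of this view, not the source of truth.
view.to_csv("results.tsv", sep="\t", index=False)
```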
I think it's good there are tools with this granularity that can be composed.
Yes, and sorry, because rereading my comment I wasn't clear, but this is what I meant. I view spreadsheets as the view (but also an editor of the view); I don't mean people should be working with XLS files (TSVs seem to work best; I personally hate tabs but may have lost that battle). I do think of spreadsheets as the primary view, and so I always design my data structures with an eye to how they will interoperate with spreadsheets. JSON is the pits.
IMO 2-D DSLs with Spreadsheets as the primary view/editor paradigm are the future.
I think Spreadsheets are 1, if not 2 OOM better than notebooks for doing actual work (notebooks can be good for presenting results in a narrative):
- non-linear for both humans and machines (allows for really creative and fast out-of-order parsing techniques on the machine side)
- concise signal with high information density
- unlimited cursors
- fantastically easier for version control and multi-player experiences
>I would always save my experiment results so they were ready to analyze in spreadsheets (and other vis tools).
I should have elicited more information before making assumptions. My assumption was that you were referring to results of training machine learning models. Is this correct?
If my assumption is correct, how do you go about training machine learning models?
Also, one question concerning your being scared by "Throw away your spreadsheets": do you mean that you'd like a tool that exports results to spreadsheets, or something in that direction?
I think we are addressing different problems, and a large part of that is due to assumptions I have made earlier.
I would set my hyperparams in a spreadsheet, which would kick off training runs on a cluster and report the results back in the spreadsheet hours/days later (really TSVs, but my UI was a spreadsheet). And repeat.
This was primitive stuff, and I haven't done DL in a while, so I'm not even sure if hyperparam tuning is still a thing or if it's all automated now.
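For the curious, the loop looked roughly like this (a sketch; `submit_training_job` is a hypothetical stand-in for the cluster submission):

```python
import csv

def submit_training_job(params):
    """Hypothetical stand-in: submit a run to the cluster, wait, return metrics."""
    return {"accuracy": 0.0}

# Read hyperparameters from the sheet (really a TSV), run each row,
# then write the results back into the same file.
with open("runs.tsv", newline="") as f:
    rows = list(csv.DictReader(f, delimiter="\t"))

for row in rows:
    result = submit_training_job({"learning_rate": float(row["learning_rate"])})
    row["accuracy"] = str(result["accuracy"])

with open("runs.tsv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys(), delimiter="\t")
    writer.writeheader()
    writer.writerows(rows)
```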
Also, how did you deal with changing which hyperparameters you used or which algorithms you used? Did you make a spreadsheet per project per model?
Second question: can you tag or attach a comment to a group of experiments?
DVC would definitely be a good fit, and we have a ticket on our roadmap to integrate Replicate with DVC, Tecton, etc. https://github.com/replicate/replicate/issues/294
We also have a roadmap ticket for grouping experiments: https://github.com/replicate/replicate/issues/297, but for now we're recommending params for tags as well.
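Concretely, tagging via params looks something like this for now (a sketch; the `tag` and `note` keys are just a convention, nothing special-cased):

```python
import replicate

# An ad-hoc tag and comment, recorded as ordinary params until
# first-class grouping lands:
experiment = replicate.init(
    path=".",
    params={"learning_rate": 0.01, "tag": "baseline-sweep", "note": "longer warmup"},
)
```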
If you have ideas for the design of these features, we'd really appreciate feedback and comments on those GitHub issues!