Launch HN: Replicate (YC W20) – Version control for machine learning (replicate.ai)
204 points by bfirsh 4 days ago | 53 comments





Hello HN!

We're Ben & Andreas, and we made Replicate. It's a lightweight open-source tool for tracking and analyzing your machine learning experiments: https://replicate.ai/

Andreas used to do machine learning at Spotify. He built a lot of ML infrastructure there (versioning, training, deployment, etc). I used to be a product manager for Docker's open source projects, and I created Docker Compose.

We built https://www.arxiv-vanity.com/ together for fun, which led to us teaming up to build more tools for ML.

We spent a year talking to lots of people in the ML community and building all sorts of prototypes, but we kept on coming back to a foundational problem: not many people in machine learning use version control.

This causes all sorts of problems: people are manually keeping track of things in spreadsheets, model weights are scattered on S3, and results can’t be reproduced.

So why isn’t everyone using Git? Git doesn’t work well with machine learning. It can’t store trained machine learning models, it can’t handle key/value metadata, and it’s not designed to record information automatically from a training script. There are some solutions for these things, but they feel like band-aids.

We came to the conclusion that we need a native version control system for ML. It’s sufficiently different to normal software that we can’t just put band-aids on Git.

We believe the tool should be small, easy to use, and extensible. We found people struggling to migrate to “AI Platforms”. A tool should do one thing well and combine with other tools to produce the system you need.

Finally, we also believe it should be open source. There are a number of proprietary solutions, but something so foundational needs to be built by and for the ML community.

Replicate is a first cut at something we think is useful: It is a Python library that uploads your files and metadata (like hyperparameters) to Amazon S3 or Google Cloud Storage. You can get back to any point in time using the command-line interface, analyze your results inside a notebook using the Python API, and load your models in production systems.
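
To give a flavour of the workflow, here's a rough sketch of what a tracked training script might look like. replicate.init() is the real entry point; the checkpoint call and its argument names here are illustrative, so check the docs for the exact API:

    import replicate

    def train():
        # Record hyperparameters (and a snapshot of the training code) at the start of the run
        experiment = replicate.init(params={"learning_rate": 0.01, "num_epochs": 10})

        for epoch in range(10):
            # ... your usual training loop goes here ...
            accuracy = 0.5 + epoch * 0.04  # placeholder metric

            # In a real script you'd save your actual weights here, e.g.
            # torch.save(model.state_dict(), "model.pth")
            with open("model.pth", "wb") as f:
                f.write(b"placeholder weights")

            # Upload the weights file plus metrics so this point in the run
            # can be checked out and compared later
            experiment.checkpoint(
                path="model.pth",
                metrics={"epoch": epoch, "accuracy": accuracy},
            )

    if __name__ == "__main__":
        train()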

We’d love to hear your feedback, and hear your stories about how you’ve done this before.

Also – building a version control system is rather complex, and to make this a reality we need your help. Join us in Discord if you want to be involved in the early design and help build it: https://discord.gg/QmzJApGjyE


"...and hear your stories about how you’ve done this before."

Basically Mr Kurtz saying "The horror! The horror!" gives you the right impression.

For small datasets and short training times it isn't so bad: a deterministic training script in Git and the dataset in Git LFS, so you can easily reproduce the model when needed. Having to do this with very large datasets (that are possibly expensive to keep a copy of just for experiments rather than prod) or really slow training times is basically disgusting. I'll be interested to see what you have done.


> it can’t handle key/value metadata

What do you mean by that? Is a JSON file no good? I guess you mean the diffs will be unordered?


Yep, and we can do lots of other nice things. We can produce nice tables with the key/value data, filter it ("show me all experiments with an accuracy greater than 0.9"), produce well-formatted diffs across an arbitrary number of things, give you a nice Python API for analyzing the data in a notebook, and so on.

There are some examples of these things on the home page, all of which would be very fiddly to do with JSON files in Git: https://replicate.ai/#features
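
As a rough sketch of what that looks like in a notebook (treat the listing call and attribute names here, like replicate.experiments.list() and exp.best(), as illustrative rather than the exact API):

    import replicate
    import pandas as pd

    # Pull experiment metadata back out of the S3/GCS repository
    experiments = replicate.experiments.list()

    # Flatten params and best-checkpoint metrics into one table
    rows = []
    for exp in experiments:
        best = exp.best()  # best checkpoint by the primary metric (illustrative)
        if best is None:
            continue
        rows.append({"experiment_id": exp.id, **exp.params, **best.metrics})

    df = pd.DataFrame(rows)

    # "Show me all experiments with an accuracy greater than 0.9"
    print(df[df["accuracy"] > 0.9].sort_values("accuracy", ascending=False))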


Wait? So you can upload to Amazon or Google but nowhere else? Like to your own servers, for example?

You can save data to a path on the filesystem, so one way to do this is with a network mount. Lots of academic departments have their own GPU clusters, and they tend to have a shared network filesystem.

We want to have more ways to do this though. We were close to adding SFTP support, but didn't get round to it. Another method could be to implement our own server, but we're trying to keep it simple for now. I'd be curious to hear your feedback here: https://github.com/replicate/replicate/issues/366


If you supported S3-compatible storage providers, people could use things like MinIO; that could be a shorter path to supporting lots of backends.

Just commenting to say how awesome arxiv-vanity is. I didn’t know it existed, but I’ve wanted it to exist for years! Thanks!

It isn't able to render all the papers I tried, but it's still very useful.

Add a dark theme to it and I’ll be forever happy!


Congrats on the launch! This looks interesting, however I feel like this space is quite crowded. You mentioned that your most important feature is the fact that you are open-source, but off the top of my head I can think of several projects:

* Kubeflow: https://github.com/kubeflow/kubeflow

* MLFlow: https://github.com/mlflow/mlflow

* Pachyderm: https://github.com/pachyderm/pachyderm

* DVC: https://github.com/iterative/dvc

* Polyaxon: https://github.com/polyaxon/polyaxon

* Sacred: https://github.com/IDSIA/sacred

* pytorch-lightning + grid: https://github.com/PyTorchLightning/pytorch-lightning

* DeterminedAI: https://github.com/determined-ai/determined

* Metaflow: https://github.com/Netflix/metaflow

* Aim: https://github.com/aimhubio/aim

* And so many more...

In addition to this list, several other hosted platforms offer experiment tracking and model management. How do you compare to all of these tools, and why do you think users should move from one of them to Replicate? Thank you.


Yeah, I agree this space is crowded. But we’ve found so few ML researchers/engineers are actually using these tools. This could either be that people aren’t aware of them yet, or that they’re not good enough.

I think it’s a mix of both, honestly, but we’re betting that there’s more of the latter in the mix. :)

I could do comparisons of each of these tools, and some of them are solving quite different problems, but the overarching difference is we’re just trying to do less. These systems might make sense if you’re setting up a company’s ML pipeline, but we found lots of individuals struggling to keep track of their work and store their models. They balked at the idea of setting up things like Kubeflow or MLflow.


Can you share any of the data or market research on this? I am an ML manager at a large ecommerce firm, and we stand up our own feature store, experiment tracking system, and model training diagnostic/metric system (a la Tensorboard). It is exceedingly easy to DIY; I've been doing that stuff DIY with teams of fewer than 8 engineers for many years.

What I’ve seen is that most vendor solutions aren’t flexible enough. Many firms have their own on-prem and data privacy restrictions that make hosting model training artifacts on a vendor’s servers impossible, and it’s way easier to sell building in house.

I’m very surprised you are hearing that nobody is using MLflow and Kubeflow (though Kubeflow has a lot of genuine usability and bug problems). I am hearing the exact opposite. Everybody, from tiny 5 person startups to giant ecommerce firms is just spinning up simple stuff for model tracking, mostly around MLflow.


Sorry yeah -- not saying nobody's using MLflow. We see teams and organizations using MLflow. What we don't see, though, is individual researchers/engineers/data scientists picking it up and using it.

From the people we have talked to who use MLflow, we hear it gets the job done, but individual contributors don't love it.

We really believe that widespread adoption comes from making something individuals love and use every day. That's the reason Docker was so successful, for example.

The lack of flexibility really resonates. That's the reason we're trying to be small and not too opinionated. We're trying to be something you can drop into your in-house system as a component, kinda like how lots of deployment systems are built around Docker.


I am in a similar role and have the exact same opinion. Most vendors lack flexibility and can have significant costs even for small teams, and, at the end of the day, an in-house solution isn't that difficult to build.

I think it's going to be a tough road to build a sustainable business in this space.


Congratulations on the launch.

We've used https://github.com/iterative/dvc for a long time and are quite happy. What's the main difference between replicate.ai and dvc?


Thanks!

DVC is closely tied to Git. We've heard people find that quite heavyweight when you're running experiments.

We think we can build a much better experience if we detach ourselves from Git. With Replicate, you just run your training script as usual, and it automatically tracks everything from within Python. You don't have to run any additional commands to track things.

DVC is really good for storing data sets though, and we see potential for integration there: https://github.com/replicate/replicate/issues/359


Hey! I'm one of the founders at Comet.ml. We believe that Git should continue to be the approach for managing code (similar to dvc) but we adapted it to the ML workflow. Our approach is to compute a git patch on every run so later you can 'git apply' if you'd like (https://www.comet.ml/docs/user-interface/#the-reproduce-butt...).

Hey, one of the DVC maintainers here!

TL;DR: I think it should be compared with the upcoming DVC feature - https://github.com/iterative/dvc/wiki/Experiments . Stay tuned - it'll be released very soon but you can try it now in beta.

First of all, congrats on the launch! I do really like the aesthetics of the website, and the overall approach. It resonates with our vision and philosophy!

Good feedback on experiments feeling heavyweight! We've been focused on building a great foundation for managing data and pipelines in the previous DVC versions and were aware of this problem (https://github.com/iterative/dvc/issues/2799). As I mentioned, the Experiments feature is already there in beta testing. It means that users don't have to make commits anymore until they are ready, can still share experiments (it's a long topic and we'll write a blog post at some point, since I'm really excited about the way it'll be implemented using custom Git refs), get support for the DL workflow (auto-checkpoints), and more. Would love to discuss and share any details; it would be great to compare the approaches.


Would love to chat -- I'll shoot you an email. :)

I'd be curious about comparison with https://github.com/mlflow/mlflow

We talked to a bunch of MLflow users, and the general impression we got is that it is heavyweight and hard to set up. MLflow is an all-encompassing "ML platform". Which is fine if you need that, but we're trying to just do one thing well. (Imagine if Git called itself a "software platform".)

In terms of features, Replicate points directly at an S3 bucket (so you don't have to run a server and Postgres DB), it saves your training code (for reproducibility and to commit to Git after the fact), and it has a nice API for reading and analyzing your experiments in a notebook.


Congrats on the launch!

>MLflow is an all-encompassing "ML platform"

Not really. We're trying to use MLflow with our "ML platform"[0]. Namely, it can save a model that expects high-dimensional inputs (which is most non-trivial models I've seen), but it can only "deploy" the model with the expectation of two-dimensional DataFrame inputs. Apparently, they're working on that.

There are also many ambiguities concerning Keras and Tensorflow stemming from "What is a Keras model? Is it a Tensorflow model now that they're integrated? Why are Keras models logged with the tensorflow model logger when you use the autolog functionality?". These are shared ambiguities, as there are several ways to save and load models with Tensorflow, and we're looking into the Keras/Tensorflow integration closely. MLflow uses `cloudpickle`, and unpickling expects not only the same 'protocol' but also the same Python version. We had to dig deeper than necessary.

One other problem is when a model relies on ancillary functions, which you must be able to ship somehow. You end up tinkering with its guts, too.

Could you shed some light on how you deal with these matters? Namely: high-dimensional inputs for models, pre-processing/post-processing functions, serialization brittleness, and the Keras/Tensorflow "duality".

We have to inherit that complexity to spare our users from having to think about saving their experiments (we do that automatically to save models, metrics, and params). The workflow is: data --> collaborative notebooks with scheduling features and jobs --> (generate appbooks) --> automatically tracked models/params/metrics --> one-click deployment --> 'REST' API or form to invoke the model.

Aaaaaand again, congrats on the launch!

- [0]: https://iko.ai


Congrats on the launch! This looks exciting. My company has been using Comet.ml, and they cover a few use cases that are missing here, specifically things like real-time visualizations and sharing experiments, which is key when working in a team. Are you planning on adding those?

Thank you! We have an issue on the roadmap for adding a web GUI: https://github.com/replicate/replicate/issues/295

We haven't thought about it in great detail yet, so I'd be curious to hear your thoughts and ideas if you'd like to add a comment to that issue!


Congrats on the launch! Very exciting tool. I'm one of the creators of https://DAGsHub.com which lets you host and collaborate on data science projects – think data science pull requests and data merging. We're integrated with DVC for data versioning and experiments, but what you're building is definitely super interesting.

As someone who tried to use git to do this for large sets of data, I'm very glad this exists. Will be trying this out in the future.

You may also be interested in a simple tool I'm building that works in concert with source control to store, version, and reproduce large data: https://github.com/kevin-hanselman/dud

My project is in its infancy (open-sourced less than a month ago), but I'm pleased with its UX thus far. There's lots to add in terms of documentation, but Dud currently uses Rclone[1] for remote syncing.

[1]: https://rclone.org/


I'd check out dolthub.com! Dolt is built for git workflows (branches and merges) on top of large datasets.

Thanks for the shout out here. CEO of DoltHub speaking :-)

I was looking for something similar today. I just adopted it. :)

Thank you for your amazing work!

Do you intend to integrate it with PT Lightning, or as a PT Lightning Logger?

It would be nice to have it maintained there.

It’s used all over huggingface-Transformers examples.


Fantastic, thank you for those kind words!

And great idea to integrate with PT Lightning. I just opened an issue: https://github.com/replicate/replicate/issues/367, feel free to add more detail and comments! -andreas


Congrats on the launch. Have you looked at https://comet.ml? If so, how do you compare to them?

Thanks! It's open source and you're in control of your own data. See https://news.ycombinator.com/item?id=25151741

This looks similar in philosophy and approach to https://guild.ai, although Guild AI can parse stdout to extract metrics to track and is integrated with TensorBoard.

Congrats on the launch. To me this feels like DVC but with a slightly more convenient Python API, and without the pipelines, which might be really great. How are you organizing experiments, though? Without Git it seems too easy for experiments to become unusable. And filtering according to accuracy is not really a solution...

How does this compare to tools like neptune.ai, weights and biases and so on? I can see the advantage of having control of one's data, whereas these tools use their own servers.

However what I love about them is the amazing UI that allows me to compare experiments.


This came out of a practical problem: at Spotify, Andreas couldn't let any data leave their network. He wasn't going to go through procurement to buy an enterprise version of one of those products, so his only option left was open source software.

But it's also out of principle: we think such a foundational thing needs to be open source. There is a reason most people use Git and not Perforce.

Replicate can work alongside visualization tools -- your data is safe in your own S3 bucket, but you can use the hosted visualization tool to complement that.

You could also imagine visualization tools built on top of Replicate. One thing we've been thinking about is doing visualization inside notebooks. It's like a programmable Tensorboard: https://colab.research.google.com/drive/18sVRE4Zi484G2rBeOYj...
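
As a sketch of the kind of thing that notebook could do (again, the listing call and checkpoint attributes here are illustrative, not the exact API):

    import replicate
    import matplotlib.pyplot as plt

    # Plot the accuracy curve of every experiment in the repository
    for exp in replicate.experiments.list():
        epochs = [chk.metrics.get("epoch") for chk in exp.checkpoints]
        accuracies = [chk.metrics.get("accuracy") for chk in exp.checkpoints]
        plt.plot(epochs, accuracies, label=exp.id[:7])

    plt.xlabel("epoch")
    plt.ylabel("accuracy")
    plt.legend()
    plt.show()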

I'd be curious to hear your thoughts about that. It's pretty primitive so far, but we've got to start somewhere I suppose. :)


The idea looks very good! I have to say, however, that I am a bit wary of notebooks. I agree that they are great for visualizing things, but they also tend to have a negative effect on the coding standards of a team. This is probably a personal bias, and they do look well suited for this use case. Great launch anyway!

Congrats on the launch.

I built an open source tool (called hyperML) for a similar problem some time back. I think the problem is not just storing things in version control but being able to quickly retrieve them in live/test systems or containers.

Mounting and loading datasets and models is painful. It's kind of what makes local training a better option.

If the weights were version-controlled by the libraries themselves (tf, pytorch, or scikit), this whole problem would be much easier to solve.


Great to see this! I've been looking at alternatives for experiment tracking, but most of what I've come across has looked way too heavy (and expensive) for a small team.

Is R support planned? Or might this just work already with reticulate?


Congrats on the launch! What's the business concept behind the tool? I can't find anything on the homepage or in the post. I didn't know YC funded open-source tools like this; it's kind of refreshing.

Yeah, YC is funding lots of open source projects. PostHog[0] was in our batch. GitLab, Docker, Mattermost, and CoreOS come to mind as other open source YC companies.

There are a number of businesses we could build around the project. A cloud service or enterprise products/support are the obvious ones. Right now, we're focused on community building, because a potential open source business can't be successful without a healthy open source project.

[0] https://news.ycombinator.com/item?id=22376732


> There are a number of businesses we could build around the project.

So you got funding from YCombinator without a concrete plan to make a business? That's pretty interesting, I always thought they wanted profitable businesses, and turned down ideas they didn't think would work.

Great to hear they're betting on open-source more!


Funnily enough, we applied with a different thing. We tried a number of different ideas before we settled on this lower-level thing, as I describe in the main comment.

Even still, I think most people apply to YC without a concrete plan of how to make a business. It's normally so early stage that the plan is yet to be validated and will probably change. A "plausible" plan is perhaps a better way to put it. ;)


nit: "Throw away your spreadsheet" scares me a little. I love spreadsheets, and think there are 100x+ more users of spreadsheets than notebooks (though the overlap of notebook users and ML users is probably close to 1, so I see your point). I would always save my experiment results so they were ready to analyze in spreadsheets (and other vis tools).

Hi, Andreas here. Yes, spreadsheets are great, and better than notebooks in many cases. But I always felt like I was doing something wrong when I used spreadsheets and markdown files to manually record metrics and hyperparameters for my experiments. It's error prone, and it's easy to forget to update the spreadsheet with new experiments.

So we're trying to automate recording this metadata, but then give you that metadata in various ways for you to inspect it. One of those ways is actually spreadsheets: https://github.com/replicate/replicate/issues/289


Do you not think that using spreadsheets for ML experiment tracking is a symptom of broken tooling? I'm asking because one of the reasons we're building our platform[0] with automatic experiment tracking, collaborative notebooks, and a bunch of other things is that experiment tracking was inconsistent between team members.

Differences in tools used: (spreadsheets, flat files, logs, pen and paper, human memory). Forgetting to do it. Snippets to do it flying around. Different locations (laptop, group workstation, git repository, cloud sheet). Dissociated from the notebook that produced the model.

Tighter tracking should answer questions like: what notebook ran on which data and produced which model with which parameters and which scores? Then questions like: give me all notebooks that ran on this dataset which produced a model with scores that are [condition].

Once you do that, the "spreadsheet" can just be a "view" of the underlying data. Something you can export as, but not the thing itself.

I think it's good there are tools with this granularity that can be composed.

- [0]: https://iko.ai


> Once you do that, the "spreadsheet" can just be a "view" of the underlying data. Something you can export as, but not the thing itself.

Yes, and sorry because rereading my comment I wasn't clear, but this is what I meant. I view Spreadsheets as the view (but also an editor of the view), but I don't mean people should be working with XLS files (TSVs seem to work the best— I personally hate tabs but may have lost that battle). I do think though of Spreadsheets as the primary view, and so always design my data structures with the understanding of "how will this interoperate with spreadsheets". JSON is the pits.

IMO 2-D DSLs with Spreadsheets as the primary view/editor paradigm are the future.

I think Spreadsheets are 1, if not 2 OOM better than notebooks for doing actual work (notebooks can be good for presenting results in a narrative):

    - non-linear for both humans and machines (allows for really creative and fast out-of-order parsing techniques on the machine side)
    - concise signal with high information density
    - unlimited cursors
    - fantastically easier for version control and multi-player experiences
    
I maintain a list of all data science tools to try and stay on top of best practices, and when I see a tool is notebook first, I think "good, less work for me to track this one because they aren't getting the core things right yet." (Though often times notebooks will have really innovative orthogonal features).

I think we may be talking about different things. My first reply was based on my understanding you saved results in spreadsheets.

breck wrote:

>I would always save my experiment results so they were ready to analyze in spreadsheets (and other vis tools).

I should have elicited further before making assumptions. My assumption was that you were referring to results of training machine learning models. Is this correct?

If my assumption is correct, how do you go about training machine learning models?

Also, one question concerning your "scare" of "Throw away your spreadsheets"... Do you mean that you'd like a tool that exports results to spreadsheets or something in that direction?

I think we are addressing different problems, and a large part of that is due to assumptions I have made earlier.


> My assumption was that you were referring to results of training machine learning models. Is this correct?

I would set my hyperparams in a spreadsheet, which would kick off training runs on a cluster, and report the results back in the spreadsheet hours/days later (really TSVs, but my UI was a spreadsheet). And repeat.

This was primitive stuff, and I haven't done DL in a while, so I'm not even sure if hyperparam tuning is still a thing or if it's all automated now.


How did you do that? Were you using some kind of hooks or polling for the spreadsheet state and triggering training jobs whenever it had changed?

Also, how did you deal with changing which hyperparameters you used or which algorithms you used? Did you make a spreadsheet per project per model?


I forget the exact details, but my guess is I made my own DSLs and then "executed" them.

Something like:

https://jtree.treenotation.org/designer/#grammar%0A%20inferr......


What is the best way to do dataset versioning when using Replicate? I get that Replicate saves the dataset version but not the data itself. Is DVC a good fit?

Second question - can you tag or attach a comment to a group of experiments?


Great questions! At the moment we recommend passing dataset URIs as params to replicate.init(): https://replicate.ai/docs/guides/training-data, but of course this assumes immutable and stable URIs.
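
For example, roughly like this (the param names and the URI here are just hypothetical placeholders):

    import replicate

    # Record the dataset version alongside the hyperparameters, so every
    # experiment records exactly which (immutable) data URI it was trained on
    experiment = replicate.init(
        params={
            "train_data": "s3://my-bucket/datasets/train-2020-11-17.csv",  # hypothetical URI
            "learning_rate": 0.001,
        }
    )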

DVC would definitely be a good fit, and we have a ticket on our roadmap to integrate Replicate with DVC, Tecton, etc. https://github.com/replicate/replicate/issues/294

We also have a roadmap ticket for grouping experiments: https://github.com/replicate/replicate/issues/297, but for now we're recommending params for tags as well.

If you have ideas for the design of these features, we really appreciate feedback and comments on these Github issues!



