- GH: https://GitHub.com/Google/Kubeflow
- kubeflow-Discuss: https://groups.google.com/forum/m/#!forum/kubeflow-discuss
NOTE: The name "flow" does not refer directly to TensorFlow; if anything, it's a nod to all the river themes that pop up in the ML community (e.g., FBLearner Flow)
Disclosure: I work at Google on Kubeflow
Any thoughts on this vs managed ml-engine? Cost aside, it seems like this nibbles at the smaller-scale "but ML tooling is too hard" use cases?
Does that help?
Disclosure: I work at Google on Kubeflow
The use case I'm thinking of is an ML dev team building on Kubeflow and proving out a system, then wanting to hand it off to a non-engineering team while washing their hands of any ongoing infrastructure ops responsibility.
Knowing that an "ml-engine aligned" Kubeflow config would transfer cleanly (including the associated bells and whistles) would make that a much more attractive option.
Caveat: I'll admit I'm not keeping up on what's in the managed offering, but I'm assuming there are a number of value-adds of the type that end users like (visualizations, etc).
- Containerize the DL platform
- Create a k8s manifest (similar to our CRD if necessary)
- Create a service endpoint
- Integrate all that into the JH deployment
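The steps above can be sketched as a single manifest. This is a hypothetical example, not a Kubeflow artifact: the names, image, and port are placeholders for whatever DL platform you'd be containerizing.

```yaml
# Hypothetical sketch of steps 1-3: a containerized DL platform
# deployed as a plain Kubernetes Deployment plus a Service endpoint.
# All names/images/ports here are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-dl-platform
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-dl-platform
  template:
    metadata:
      labels:
        app: my-dl-platform
    spec:
      containers:
      - name: my-dl-platform
        image: example.com/my-dl-platform:latest  # step 1: the containerized platform
        ports:
        - containerPort: 8888
---
apiVersion: v1
kind: Service
metadata:
  name: my-dl-platform   # step 3: the service endpoint
spec:
  selector:
    app: my-dl-platform
  ports:
  - port: 80
    targetPort: 8888
```

A custom CRD (step 2) would replace the bare Deployment once the platform needs its own lifecycle semantics, the way TFJob does for TensorFlow.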
This is less hard than it sounds, but we'd love help! We only started with TF because that's what we know.
Scalability for people with existing on-premise (or cloud-based) Kubernetes workflows, especially when it comes to training or heavy crunching.
That's not to say that Docker Machine/Swarm/Compose couldn't handle the same, but it's an extra step for Kubernetes users and pushes people onto a slightly different toolchain than minikube->K8s.
If you have a single container, and a simple pipeline, this may be a bit more than you need. We've just found that there are normally 5 or more services/systems that people wire together to create an ML stack, and that's what we're trying to solve for/simplify.
The following are included:
- A JupyterHub to create & manage interactive Jupyter notebooks.
- A TensorFlow Training Controller that can be configured to use CPUs or GPUs, and adjusted to the size of a cluster with a single setting.
- A TF Serving container.
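For the training controller, the "single setting" is the replica count on the job resource. A hedged sketch, assuming a recent TFJob schema (the exact apiVersion and field names depend on your operator version, and the image is a placeholder):

```yaml
# Sketch of a TFJob managed by the training controller.
# Scaling to the cluster is just the `replicas` field.
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: my-training-job
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 4        # the single setting that scales the job
      template:
        spec:
          containers:
          - name: tensorflow
            image: example.com/my-model:latest  # placeholder training image
```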
Disclosure: I work at Google
However, to benefit from GPUs you need to configure the controller correctly. Default configurations are provided for GCP and Azure, but you would need to do that manually for other clouds (not that it is very hard).
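The manual part is mostly the standard Kubernetes GPU resource request on the training container. A sketch, assuming NVIDIA's device plugin is installed (the resource name varies by vendor):

```yaml
# Assumed sketch: requesting a GPU via Kubernetes resource limits.
containers:
- name: tensorflow
  image: tensorflow/tensorflow:latest-gpu  # GPU-enabled TF image
  resources:
    limits:
      nvidia.com/gpu: 1  # schedules the pod onto a GPU node
```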