Is Pachyderm actually used anywhere, though? Especially compared to the Hadoop distros?
Why would I use this over just spinning something up on EMR? If I were on-prem, I'd likely already have Hadoop installed. K8s is typically a separate cluster (mainly because it's still a bit error-prone for anything with data applications, which I'm assuming is what you're claiming to solve?).
Beyond that, re: Kubeflow. The problem with just assembling some of these things and calling it "production" is that you're still missing a lot of the basics people need when they go to production:
1. Experiment tracking for collaboration
2. Managed Authentication/connectors for anything outside K8s
3. Connecting with/integrating with existing databases
4. Proper role management: data scientists aren't deploying things to prod (at least not customer-facing prod at scale where money is on the table; "prod" could also mean internal tools for experiments); they typically need to integrate with external processes and different teams.
Many of these things are left as an exercise for the reader (especially in K8s land). Granted, tutorials and hosted environments exist, but nothing is the "one-click" deploy promised in all these "See how easy it is!" blog posts that run something on one laptop.
A lot of what's being promised here just doesn't line up for me yet. K8s is maturing quite a bit, but when the story is still "run managed K8s in the cloud or spend time upgrading your cluster every quarter," I have to say it's far from anything a typical data scientist running scikit-learn on their Mac is going to be able to get started with.
The closest I've seen to that, that isn't our own product, is AWS SageMaker, which actually solves real-world problems by gluing components together in a fairly seamless way.
Let's just be clear that we still have a long way to go. In practice, we have separate teams that need to collaborate. Data scientists aren't going to download minikube tomorrow and go to production without someone else's approval and without climbing a huge learning curve.
We're moving in the right direction though!
There's also Polyaxon: https://github.com/polyaxon/polyaxon
FWIW, our free data science platform is also Dockerized, and also bundles TensorFlow: https://docs.skymind.ai/docs
The Docker image instructions are here: https://docs.skymind.ai/docs/docker-image
The platform uses Spark to run on multi-GPUs, and includes a robust machine-learning model server. It takes you from notebooks to production, and serves as a kind of basis for ML apps. Here's a demo:
[Disclosure: I am a co-founder of Skymind.]
For those who don't understand why this is important: in NSF- and DOE-funded projects, you are increasingly being told to use supercomputing centers, which effectively all attach global GPFS and Lustre file systems to the HPC machines, usually fronted by Slurm, but sometimes Torque, LSF, GridEngine, or HTCondor.
There are two container runtimes that (mostly) work for HPC: Shifter, which is just Docker + image patching (and no layers) + some other unhelpful things like mounting in your home directory, and Singularity, which came from LBNL, which actually runs NERSC. Shifter came from NERSC, who sort of gave it to Cray but still maintain it. NERSC thinks their Shifter is going to win, and in fact they won't even install Singularity on their clusters (they say it's for security reasons). Outside of NERSC, Shifter has a tiny bit of traction, but not much. Cray is pushing it hard.
Singularity is inferior to Docker in many ways, the most glaring being that you can't use it with Kubernetes easily (if at all), and that it exports a large image. On the grid, those aren't that big of a problem. When you pair it with CVMFS, you get a nice way to cache and distribute containers (via the CVMFS cache) and a bunch of other relatively nice benefits. The Kubernetes thing is still a big issue, because it means you can't provide a secure way of making something like JupyterHub + JupyterLab available with your mounted GPFS file systems (you can sort of get away with it by exporting GPFS through NFS with root squashing, but that's not a great or complete solution).
The recommendation I give to projects is: start with Docker, export to Singularity. Don't give your scientists Singularity containers for their Macs; lord knows how crappy that experience is on a Mac.
The situation sucks. If Docker was willing to do something about it, Singularity would die tomorrow.
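If it helps anyone, the "start with Docker, export to Singularity" path looks roughly like this; a sketch, assuming Singularity 3.x, with placeholder image and registry names:

    # Build and push the image with Docker as usual (placeholder names)
    docker build -t myregistry/myproject:1.0 .
    docker push myregistry/myproject:1.0

    # On the HPC side, convert the Docker image into a Singularity image file
    singularity build myproject.sif docker://myregistry/myproject:1.0

    # Run it under the scheduler, e.g. inside an sbatch script (--nv for GPUs)
    singularity exec --nv myproject.sif python train.py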
Not to derail the post, but what does everyone else use?
There is a fairly informative post here: https://www.quora.com/How-are-most-data-sets-for-large-scale...
I have tried to use Docker for reproducible research, as a way of distributing ConceptNet with all its dependencies. I am about to give up and stop supporting the Docker version.
This was the process I needed to explain to researchers who wanted to use my stuff before Docker (a condensed shell sketch follows the list):
1. Get the right version of Python (which they can install, for example, from the official package repo of any supported version of Ubuntu).
2. Set up a Python virtualenv.
3. Install Git and PostgreSQL (also from standard packages).
4. Make sure they have enough disk space (but they'll at least get obvious error messages about it if they don't).
5. Configure PostgreSQL so that your local user can write to a database (this is the hardest step to describe, because PostgreSQL's auth configuration is bizarre).
6. Clone the repo.
7. Install the code into your virtualenv.
8. Run a script that downloads the data.
9. Run a script that builds the data and puts it in the database. (This takes up to five hours, because of PostgreSQL, but you get sorta-sensible output to watch while it's going on.)
10. Run the API on some arbitrary port.
11. Optionally, set up Nginx and uwsgi to run the API efficiently (but really nobody but me is going to do this part).
12. Optionally, extend the Python code to do something new that you wanted to do with it (this is the real reason people want to reproduce ConceptNet).
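For reference, steps 1-9 condense to roughly the following shell session. This is a sketch with illustrative paths, database names, and script names, not the project's actual instructions (the repo URL, the download/build scripts, and the API entry point are placeholders):

    # 1-3: system packages (Ubuntu) and a Python virtualenv
    sudo apt-get install python3 python3-venv git postgresql
    python3 -m venv ~/conceptnet-env
    source ~/conceptnet-env/bin/activate

    # 5: let your local user create and write to a database
    sudo -u postgres createuser --createdb $USER
    createdb conceptnet

    # 6-9: clone, install, download, build (the build is the slow part)
    git clone https://example.com/conceptnet.git && cd conceptnet   # placeholder URL
    pip install -e .
    ./download_data.sh   # hypothetical script name
    ./build_data.sh      # hypothetical script name; takes hours

    # 10: run the API on an arbitrary port
    python run_api.py 8084   # hypothetical entry point and port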
Now, here is the process after Docker:
1. Install Docker and Docker Compose.
2. No, not the ones in Ubuntu's repositories. Those are too old to work. Remove those. You need to set up a separate package archive with sufficiently new versions of Docker software.
3. Configure a Docker volume that's mapped onto a hard disk that has enough space available. (If you do this wrong, the build will fail and you will not find out why.)
4. Configure your user so that you can run Docker stuff and take over port 80. If you can't take over port 80, jump through some hoops.
5. Run the Docker build script. This will install the code, set up PostgreSQL, download the data, build the data, and put the data in the database. (The built data, of course, can't be distributed with the Docker image; that would be too large.)
6. At this point I tell people to wait five hours and then see if the API is up. If it is, it worked! But of course it didn't work, because the download got interrupted or you ran out of disk space or you had the wrong version of Docker or something.
7. Check your Docker logs (see the sketch after this list).
8. Disregard all the irrelevant messages from upstream code, especially uwsgi, which is well known for yelling incoherently when it's working correctly. Somewhere around the middle of the logs, there might be a message about why your build failed.
9. Translate that message into a solution to the problem. (There is no guide to doing this, so at this point users have to send me an e-mail, a GitHub issue, or a Gitter message, and I have to take the time to do tech support for them.)
10. Learn some new commands for administering Docker to fix the problem.
11. Delete your Docker volumes so that the PostgreSQL image knows it needs to reload the data, because Docker images don't actually have a concept of loading data and everything is a big hack.
12. Try again.
13. Wait five hours again. (Look, I can't just give you a tarball of what the PostgreSQL data directory should end up as; that's not portable. It needs to all go in through a COPY command.)
14. Now you can probably access the API.
15. Oh shit, you didn't just want the API, you wanted to be able to extend the code. Unfortunately, the code is running out of sight inside a Docker image with none of the programming tools you want. This is where you come and ask me for the non-Docker process.
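For anyone stuck at steps 3, 7, or 11, these are roughly the commands involved; a sketch, assuming a reasonably recent Docker and Compose, with placeholder service and volume names:

    # Step 3: put Docker's data on a disk with room, e.g. by changing the data root
    # in /etc/docker/daemon.json:  { "data-root": "/big-disk/docker" }
    docker info | grep "Docker Root Dir"

    # Step 7: see what actually happened inside the containers
    docker-compose logs db
    docker-compose logs api

    # Step 11: throw away the volumes so the database rebuilds from scratch
    docker-compose down
    docker volume ls
    docker volume rm myproject_postgres-data   # placeholder volume name

    # or, more bluntly:
    docker-compose down -v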
Attempting to use Docker for data science has led to me doing unreasonable amounts of tech support, and has not made anything easier for anyone. I constantly run into the fact that Docker is designed for tiny little services and not designed for managing any amount of data. What's the real story about Docker for data science?
To me, Docker serves as a sure way to get programs to work regardless of the OS while also being composable to end-user needs. They should also be easily portable to a cloud/production environment. The purpose of containers was, in the first place, to handle complexity at scale, with some performance benefits.
Anyway, suffice it to say that we've dealt with the same issues you've listed, and our general thinking on this is that a version of container software adapted specifically to the needs of academics is going to go further here than one designed by and for software engineers.
I can see how you would have run into issues with Compose, as it creates multiple containers and links them instead of creating them individually. It doesn't display output when run detached, so you'll have to check your logs if something goes wrong.
Your order could also have been optimized and automated further (a minimal sketch follows the list):
1. Clone repo
2. Install Docker and Compose, preferably from a setup script
3. Fetch the data set (though shouldn't this live somewhere remote?) and wait. This can be done with a container, btw.
4. Have Compose bring up the database container
5. Have Compose bring up the API server
6. Run tests to verify all is operational
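A minimal sketch of steps 4-6, assuming Compose services named "db" and "api" (placeholders) and a database image that ships PostgreSQL's pg_isready. Note the readiness check only confirms the server answers connections, not that the data load has finished:

    docker-compose up -d db

    # crude wait loop, since Compose alone won't wait for the database
    until docker-compose exec -T db pg_isready -U postgres; do
        sleep 30
    done

    docker-compose up -d api
    curl -fsS http://localhost:8084/   # placeholder port; acts as the smoke test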
You can't just give arbitrary people remote access to a PostgreSQL database.
The rest of what you've described is pretty much the happy path of what I've already got, where one container has the DB and one has the API. There's a complication you're missing: you can't load data into the DB if you're not running the DB container, and there is apparently no way to script Compose to start the API only when the DB is ready.
So actually you start the containers first, and the DB container spends a while loading the data and the API spends a few hours saying "I don't have data yet, sorry :(". So even the happy path has this step where you wait for five hours and hope things are working.
But most people don't land on the happy path.
The exception is if I make a complete image of the machine, including PostgreSQL and its data, using AWS or VirtualBox or something.
If there's a way to package up a self-contained, non-root copy of PostgreSQL with data already in it (like you can do with SQLite), I would love to know about it.
And yet I won't give up on PostgreSQL; despite the awkwardness of setting it up, it really is the best database out of many that I've tried.
FWIW, you can zip up and move a data directory around with some caveats: the data directory stores the data in ways specific to how the PostgreSQL binary was built (including the Postgres version), so if the environment and build are the same, you're often fine. However, that's not a great method if you're looking for portability.
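The caveat-laden version of that looks roughly like this; a sketch assuming the same major PostgreSQL version and architecture on both machines, with illustrative paths and database names:

    # on the source machine: stop the server cleanly, then archive the data directory
    sudo systemctl stop postgresql
    sudo tar -C /var/lib/postgresql/12 -czf conceptnet-pgdata.tar.gz main

    # on the destination: unpack into the (empty) data directory and restart
    sudo systemctl stop postgresql
    sudo tar -C /var/lib/postgresql/12 -xzf conceptnet-pgdata.tar.gz
    sudo chown -R postgres:postgres /var/lib/postgresql/12/main
    sudo systemctl start postgresql

    # a more portable (but slower) alternative is a logical dump
    pg_dump -Fc conceptnet > conceptnet.dump
    pg_restore -d conceptnet conceptnet.dump   # target database must already exist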
Docker almost answers these questions, but it creates an equal number of new questions.
Not fragile at all. Just mount the appropriate X11 socket and there you go.
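On a Linux host, that's roughly the following; a sketch, not a hardened setup (the image name is a placeholder, and xhost is being used to loosen X access control):

    # allow local containers to talk to your X server (coarse-grained!)
    xhost +local:

    docker run --rm -it \
        -e DISPLAY=$DISPLAY \
        -v /tmp/.X11-unix:/tmp/.X11-unix \
        some-gui-image   # placeholder image name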
Alternatively, you can also run VNC or RDP inside the container:
Maybe it's still possible to mount an X11 socket from Windows/Mac to the VM (and then to Docker)?
It's still trickery, but if you can get that to work on the Mac and Windows Docker clients, then you could distribute the Linux version of the custom apps and everyone is happy :)
You would have to do similar things if this were running in a VM.
Disclosure: I work on this project.
If you are lucky! Docker makes no guarantees about this. I'm betting most data science Dockerfiles start with "RUN apt-get update && apt-get install -y -q \ && ...". How much do you stick in the numerous RUN commands following that? Docker builds are not deterministic by default.
However, a docker image is deterministic (each one gets a SHA256 identifier based on its contents -- if the hashes match, then the images are identical), and if you care about the particular versions of dependencies, an image is what you should be sharing with your colleagues/readers/etc. in order to let them reproduce your results.
I like Docker and I get a lot of value from using it, but personally I feel this is one place where the project made an error in design. Too many git repos these days have a Dockerfile in them that you can use to build the code, but that will only work if the stars align and the dependencies you install when you build your image are the same as (or compatible with) the ones the author used. IMO a better design would be for Docker images to be flat files, e.g., "docker build Dockerfile > my-cool-image.img" and then "docker run my-cool-image.img ...". I think if this were the pattern, more people would be adding their my-cool-image.img files to their git repos instead of using the Dockerfile as a flawed source of truth.
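FWIW, you can get most of the way to that flat-file workflow today with docker save / docker load, which round-trip the built image (layers and metadata included) through a tarball; a sketch with placeholder names:

    # on the author's machine: build once, snapshot the exact image
    docker build -t my-cool-image:1.0 .
    docker save -o my-cool-image.tar my-cool-image:1.0

    # on the reader's machine: load the identical image instead of rebuilding
    docker load -i my-cool-image.tar
    docker run --rm my-cool-image:1.0

Whether you actually want a multi-gigabyte tarball sitting in a Git repo is another question, but it does make the image itself, rather than the Dockerfile, the thing you hand to readers.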