Hacker News
Docker for Data Science (towardsdatascience.com)
118 points by rbanffy 5 months ago | 59 comments

Docker is really starting to be used a lot in data science. Kubernetes too, as it makes it easy to run that code in a distributed way. There's starting to be an ecosystem of tools that help with this too, such as Kubeflow [0], which brings TensorFlow to Kubernetes in a clean way. I work on one myself [1] that manages the data as well as using Docker containers for the code, so that your whole data process is reproducible.

[0] https://github.com/kubeflow/kubeflow

[1] https://github.com/pachyderm/pachyderm

Disclaimer: I am not fully on board with k8s and enable clusters that are not Kubernetes as well as k8s deployments.

Is pachyderm actually used anywhere though? Especially compared to the hadoop distros?

Why would I use this over just spinning something up on EMR? If I were on prem, I likely already have Hadoop installed. K8s is typically a separate cluster (mainly because it's still a bit error prone for anything with data applications, which I'm assuming is what you're claiming to solve?).

Beyond that, re: kubeflow. The problem with just assembling some of these things and calling it "production" is you're still missing a lot of the basic things when people need to go to production:

1. Experiment tracking for collaboration

2. Managed Authentication/connectors for anything outside K8s

3. Connecting with/integrating with existing databases

4. Proper role management: Data Scientists aren't deploying things to prod (at least customer facing prod at scale where money is on the table, "prod" could also mean internal tools for experiments), they typically need to be integrating with external processes and different teams.

Many of these things are left as an exercise for the reader (especially in k8s land). Granted, tutorials and the hosted environments exist, but nothing is a "1 click" deploy that is being promised on any of these "See how easy it is!" blog posts that run something on 1 laptop.

A lot of things being promised here just don't line up for me yet. K8s is maturing quite a bit, but when the story is still "Run managed k8s on the cloud or spend time upgrading your cluster every quarter", I have to say it's far from anything a typical data scientist running scikit-learn on their Mac is going to be able to get started with.

The closest I've seen to that that isn't our own product is AWS Sagemaker which actually solves real world problems by actually gluing components together in a fairly seamless way.

Let's just be clear here that we still have a long way to go yet. In practice, we have separate teams that need to collaborate. Data scientists aren't going to download minikube tomorrow and go to production without someone else's approval and going through a huge learning curve yet.

We're moving in the right direction though!

jdoliner is right to mention kubeflow.

There's also Polyaxon: https://github.com/polyaxon/polyaxon

Fwiw, our free data science platform is also Dockerized, and also bundles TensorFlow: https://docs.skymind.ai/docs

The Docker image instructions are here: https://docs.skymind.ai/docs/docker-image

The platform uses Spark to run on multi-GPUs, and includes a robust machine-learning model server. It takes you from notebooks to production, and serves as a kind of basis for ML apps. Here's a demo:


[Disclosure: I am a co-founder of Skymind.]

This article misses the point for the Scientist in data science. HPC will not support Docker; Singularity seems to be the winner: http://singularity.lbl.gov/

This is a great point. People on HN often forget that data science is more than just machine learning, and that we have existing computing infrastructure (beyond AWS) to interface with.

Well... Singularity can import Docker containers, so I assume it's easy to cross over to that space.

More than that really:

For those who don't understand why this is important: In NSF and DOE funded projects, you are increasingly being told to use supercomputing centers, which effectively all install global GPFS and Lustre file systems to the HPC computers, usually fronted by Slurm, but maybe Torque, LSF, GridEngine, or HTCondor.

There are two container runtimes that (mostly) work for HPC: Shifter and Singularity. Shifter is just Docker + image patching (and no layers) + some other unhelpful things like mounting in your home directory. It came from NERSC, who sort of gave it to Cray but still maintain it. Singularity came from LBNL, which actually runs NERSC. NERSC thinks their Shifter is going to win, and in fact they won't even install Singularity on their clusters (they say it's for security reasons). Outside of NERSC, there is a tiny bit of traction for Shifter, but not much. Cray is pushing it hard.

Singularity is inferior in many ways to Docker. The most glaring being you can't use it with Kubernetes easily (if at all). It exports a large image. In the grid, those aren't that big of a problem. When you pair it with CVMFS, you get a nice way to cache and distribute containers (with the CVMFS cache) and a bunch of other benefits that are relatively nice. The Kubernetes thing is still a big issue, because it means you can't provide a secure way of making something like JupyterHub+Jupyterlab available with your mounted GPFS file systems (though you can sort of get away with it by exporting GPFS through NFS with root squashing, but that's not a great or complete solution).

The recommendation I give to projects is: start with Docker, export to Singularity. Don't give your scientists Singularity containers for their Macs; lord knows how crappy that experience is on a Mac.

The situation sucks. If Docker was willing to do something about it, Singularity would die tomorrow.

Can you explain what you mean w.r.t. "NSF and DOE funded projects, you are increasingly being told to use supercomputing centers"?

If you're a DOE project at a lab and you need compute, they aren't willing to give you money to purchase hardware or cloud unless you have an extremely compelling reason why you need to do that, and "Our software doesn't support GPUs/KNL/the Cray OS" isn't a valid excuse. They don't want you buying pizza boxes and they don't want you using cloud, they want you using the supercomputers they've already sunk money in. Small NSF projects are probably fine to do what they want, 50M+ range is probably where they kick you off to an XSEDE service provider.

We are experimenting with Nix (https://nixos.org/nix/) to create reproducible "containers" of the tools used to build a model. Nix allows us to pin versions of R, Python, R/Python packages, etc. and all the underlying C++/Fortran/etc. libraries all the way down to (but excluding) the kernel. Deployment of such "derivations" (or sharing them with other team members, since Nix works on both Linux and macOS) is also very easy if Nix is installed everywhere: the dependencies of one derivation are not visible to other derivations, so I can deploy e.g. R models requiring different versions of the dplyr package on the same node. It looks very promising; here (https://blog.wearewizards.io/why-docker-is-not-the-answer-to...) is a blog post which I found very helpful.

One issue my team is dealing with is how to store our large models. We are using git lfs at the moment but it feels woefully inadequate. It is trivial for a team member to accidentally commit a 500 MB file to git and completely blow up our repo size.

Not to derail the post, but what does everyone else use?
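One low-tech guard against the accidental large commit described above is a size check that runs before files are committed. The following is only a minimal sketch: the function name `oversized_files` and the 50 MB threshold are assumptions made for illustration, and wiring it into an actual git hook is left out.

```python
import os
import tempfile

# Hypothetical threshold; pick whatever your team considers too big for plain git.
MAX_BYTES = 50 * 1024 * 1024


def oversized_files(paths, max_bytes=MAX_BYTES):
    """Return (path, size) pairs for every file larger than max_bytes."""
    flagged = []
    for path in paths:
        size = os.path.getsize(path)
        if size > max_bytes:
            flagged.append((path, size))
    return flagged


if __name__ == "__main__":
    # Demo with two temporary files and a tiny threshold so it runs anywhere.
    with tempfile.TemporaryDirectory() as tmp:
        small = os.path.join(tmp, "notes.txt")
        big = os.path.join(tmp, "model.bin")
        with open(small, "wb") as f:
            f.write(b"x" * 10)
        with open(big, "wb") as f:
            f.write(b"x" * 1000)
        for path, size in oversized_files([small, big], max_bytes=100):
            print(f"refusing to commit {os.path.basename(path)} ({size} bytes)")
```

In practice the list of paths would come from whatever is staged for commit; that part depends on your tooling and is not shown here.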

Pachyderm (my company) offers versioning of large data sets in a similar fashion to git without having to deal with Git LFS. Pachyderm isn't Git under the hood, but we borrow some of the major concepts -- repos as versioned data sets, commits, branches, etc. The analogy to Git breaks down as you go a little deeper, because building efficient data versioning and management is a surprisingly different problem than code VCS.

Have you considered using HDFS?

There is a fairly informative post here: https://www.quora.com/How-are-most-data-sets-for-large-scale...

Upload the serialized model to S3/GS, configure the URL to be re-pulled--treat model versions like feature flags
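To make the "model versions like feature flags" idea concrete, here is a rough sketch. Everything in it is an assumption for illustration: the `ModelSlot` class, the `model_url` and `sha256` config keys, and the injected `fetch` callable, which in real use might wrap an S3 or GCS client. An in-memory dict stands in for the bucket so the example runs without network access.

```python
import hashlib
import pickle


class ModelSlot:
    """Holds the currently active model and swaps it when the config changes."""

    def __init__(self, fetch):
        # fetch(url) -> bytes. Hypothetical hook; a real one would download the blob.
        self._fetch = fetch
        self._url = None
        self.model = None

    def refresh(self, config):
        """Reload the model if config points at a new URL. Returns True on a swap."""
        url = config["model_url"]
        if url == self._url:
            return False
        blob = self._fetch(url)
        expected = config.get("sha256")
        if expected is not None and hashlib.sha256(blob).hexdigest() != expected:
            raise ValueError(f"checksum mismatch for {url}")
        # pickle is only safe for blobs you produced and trust.
        self.model = pickle.loads(blob)
        self._url = url
        return True


if __name__ == "__main__":
    # Made-up URLs; the dict plays the role of the object store.
    store = {
        "s3://example-bucket/models/v1.pkl": pickle.dumps({"version": 1}),
        "s3://example-bucket/models/v2.pkl": pickle.dumps({"version": 2}),
    }
    slot = ModelSlot(fetch=store.__getitem__)
    slot.refresh({"model_url": "s3://example-bucket/models/v1.pkl"})
    print(slot.model)
    # Flipping the "flag" is just a config change pointing at another URL.
    slot.refresh({"model_url": "s3://example-bucket/models/v2.pkl"})
    print(slot.model)
```

Rolling back is then the same operation as rolling forward: point the config at the previous URL and refresh.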

This article doesn't tell me anything about using Docker for data science, just about using Docker in general.

I have tried to use Docker for reproducible research, as a way of distributing ConceptNet with all its dependencies [1]. I am about to give up and stop supporting the Docker version.

[1] https://github.com/commonsense/conceptnet5/wiki/Running-your...

This was the process I needed to explain to researchers who wanted to use my stuff before Docker:

1. Get the right version of Python (which they can install, for example, from the official package repo of any supported version of Ubuntu).

2. Set up a Python virtualenv.

3. Install Git and PostgreSQL (also from standard packages).

4. Make sure they have enough disk space (but they'll at least get obvious error messages about it if they don't).

5. Configure PostgreSQL so that your local user can write to a database (this is the hardest step to describe, because PostgreSQL's auth configuration is bizarre).

6. Clone the repo.

7. Install the code into your virtualenv.

8. Run a script that downloads the data.

9. Run a script that builds the data and puts it in the database. (This takes up to five hours, because of PostgreSQL, but you get sorta-sensible output to watch while it's going on.)

10. Run the API on some arbitrary port.

11. Optionally, set up Nginx and uwsgi to run the API efficiently (but really nobody but me is going to do this part).

12. Optionally, extend the Python code to do something new that you wanted to do with it (this is the real reason people want to reproduce ConceptNet).


Now, here is the process after Docker:

1. Install Docker and Docker Compose.

2. No, not the ones in Ubuntu's repositories. Those are too old to work. Remove those. You need to set up a separate package archive with sufficiently new versions of Docker software.

3. Configure a Docker volume that's mapped onto a hard disk that has enough space available. (If you do this wrong, the build will fail and you will not find out why.)

4. Configure your user so that you can run Docker stuff and take over port 80. If you can't take over port 80, jump through some hoops.

5. Run the Docker build script. This will install the code, set up PostgreSQL, download the data, build the data, and put the data in the database. (The built data, of course, can't be distributed with the Docker image; that would be too large.)

6. At this point I tell people to wait five hours and then see if the API is up. If it is, it worked! But of course it didn't work, because the download got interrupted or you ran out of disk space or you had the wrong version of Docker or something.

7. Check your Docker logs.

8. Disregard all the irrelevant messages from upstream code, especially uwsgi, which is well known for yelling incoherently when it's working correctly. Somewhere around the middle of the logs, there might be a message about why your build failed.

9. Translate that message into a solution to the problem. (There is no guide to doing this, so at this point users have to send me an e-mail, a Git issue, or a Gitter message and I have to take the time to do tech support for them.)

10. Learn some new commands for administering Docker to fix the problem.

11. Delete your Docker volumes so that the PostgreSQL image knows it needs to reload the data, because Docker images don't actually have a concept of loading data and everything is a big hack.

12. Try again.

13. Wait five hours again. (Look, I can't just give you a tarball of what the PostgreSQL data directory should end up as; that's not portable. It needs to all go in through a COPY command.)

14. Now you can probably access the API.

15. Oh shit, you didn't just want the API, you wanted to be able to extend the code. Unfortunately, the code is running out of sight inside a Docker image with none of the programming tools you want. This is where you come and ask me for the non-Docker process.


Attempting to use Docker for data science has led to me doing unreasonable amounts of tech support, and has not made anything easier for anyone. I constantly run into the fact that Docker is designed for tiny little services and not designed for managing any amount of data. What's the real story about Docker for data science?
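The silent out-of-disk failure mentioned in both step lists is the kind of thing a pre-flight check could surface before a five-hour build starts. A minimal sketch follows; the `has_enough_space` helper and the 100 GB figure are assumptions, since the post does not say how much space the build actually needs.

```python
import shutil

# Hypothetical requirement; the post does not state the real number.
REQUIRED_GB = 100


def has_enough_space(path, required_bytes):
    """Return True if the filesystem containing `path` has at least required_bytes free."""
    return shutil.disk_usage(path).free >= required_bytes


if __name__ == "__main__":
    target = "."  # assumption: the directory that backs the Docker volume
    if has_enough_space(target, REQUIRED_GB * 1024**3):
        print("Disk space looks fine, safe to start the build.")
    else:
        print(f"Need {REQUIRED_GB} GB free under {target!r}; fix this before building.")
```

This only helps if it runs against the path the volume is really mapped to, which is exactly the part that is easy to get wrong.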

Hah, I applaud you for writing all the details. I have to agree that in some cases using Docker doesn't really do much of anything at all. When speed and productivity are on the line, at times it's better to just stick with the old ways. I'm sure some Docker genius might be able to streamline that process a bit, but agony-wise it might not matter.

To me Docker serves as a sure way to get programs to work regardless of the OS while also being composable to end-user needs. Also, they should be easily portable to cloud/production environments. The purpose of containers in the first place was to handle complexity at scale, with some performance benefits.

What do you mean by "regardless of the OS"? Docker is primarily for running Linux programs.

I read that as: using Docker, you can run an application (or an application stack) in an OS independent of the host OS. This is no different really from hosting any virtual machine, but Docker has made it a lot easier.

When you say "in an OS", what possible OSes might these be?

Hi rspeer - as it happens I work for a company, Code Ocean, that aims to simplify the use of tools like Docker for scientists. In practice, it looks like https://codeocean.com/2017/07/14/3d-convolutional-neural-net.... The user put in all the necessary work to configure their environment, uploaded their code and data, and appropriate metadata, and a person who wishes to reproduce their results has to just press 'run' (conditional on having an account -- this Docker container is running on an AWS server with a GPU, and to avoid becoming a bitcoin mining platform, we require signup to run stuff).

Anyway suffice to say that we've dealt with the same issues you've listed, and our general thinking on this is that a version of container software adapted specifically to the needs of academics is going to go further here than one designed by and for software engineers.

The answer is to use Docker to configure your data analysis runtime environment, but not actually your data. Especially since we're often dealing with large datasets, sticking with a simple storage solution like S3 works pretty well.

I always thought that Docker made sense if the code or environment in which your code ran would be replicated onto many and different bits of hardware. In other words, Docker makes sense for maximum portability. Without knowing more about your use case, this seems like bespoke, one-time work. Docker doesn't make sense for that.

I can see how you would have run into issues with Compose, as it creates multiple containers and links them instead of creating them individually. It doesn't display output by default, so you'll have to hit your logs if something went wrong.

Your order could have also been optimized and automated further:

1. Clone repo

2. Install Docker and Compose, preferably from a setup script

3. Fetch data set (though shouldn't this live somewhere remote?). Wait. This can be done with a container, btw

4. Have Compose bring up database container

5. Have Compose bring up API server

6. Run tests to verify all is operational

What do you mean by "shouldn't this live somewhere remote"? The input data is on S3 and/or Zenodo, but you need to download it and put it in PostgreSQL.

You can't just give arbitrary people remote access to a PostgreSQL database.

The rest of what you've described is pretty much the happy path of what I've already got, where one container has the DB and one has the API. There's a complication you're missing because you can't load data into the DB if you're not running the DB container, and also there is apparently no way to script Compose to start the API when the DB is ready.

So actually you start the containers first, and the DB container spends a while loading the data and the API spends a few hours saying "I don't have data yet, sorry :(". So even the happy path has this step where you wait for five hours and hope things are working.
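The "start the API when the DB is ready" gap is often papered over with a small polling loop that runs before the API process starts. Here is a sketch under stated assumptions: `port_is_open` and `wait_until` are made-up helper names, and the host and port are placeholders. Note the limitation: an open port only shows that PostgreSQL accepts connections, not that the hours-long data load has finished, so a real check would have to query for some marker that the load writes at the end.

```python
import socket
import time


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_until(check, timeout_s, interval_s=1.0):
    """Poll check() until it returns True or timeout_s elapses. Returns the last result."""
    deadline = time.monotonic() + timeout_s
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


if __name__ == "__main__":
    # Assumption: PostgreSQL is published on localhost:5432.
    ready = wait_until(
        lambda: port_is_open("localhost", 5432, timeout=0.5),
        timeout_s=2,
        interval_s=0.5,
    )
    print("database reachable" if ready else "database not reachable yet")
```

Because `wait_until` takes any callable, the port check can be swapped for a "has the data finished loading" query without touching the loop.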

But most people don't land on the happy path.

Similar experience - I've flirted with the idea of using Docker from time to time, but it added more overhead than really felt justified.

Thanks for taking the time to write this. I’m confused on point 13, would "RUN wget tarball_url" work?

Getting the data into PostgreSQL is a problem in nearly any setup. You can't really copy PostgreSQL's actual data directory around. Subtle differences in configuration (I don't know what they are) will probably cause PostgreSQL to consider the database corrupted and lock up. So you have to export a portable version of the database and spend time loading it back in.

The exception is if I make a complete image of the machine, including PostgreSQL and its data, using AWS or VirtualBox or something.

If there's a way to package up a self-contained, non-root copy of PostgreSQL with data already in it (like you can do with SQLite), I would love to know about it.

And yet I won't give up on PostgreSQL; despite the awkwardness of setting it up, it really is the best database out of many that I've tried.

Yeah, I know what you mean. Best suggestion I have is to get really familiar with pg_dump and pg_restore. It's really quite a powerful pair of tools with a lot of options. The COPY command is also really flexible if you'd like to use CSV for interop with non-PG tools. Likely this is all stuff you already know. At the end of the day, I consider it just part of massaging data between systems.

FWIW, you can zip up and move a data directory around with some caveats: the data directory stores the data in ways specific to how the PostgreSQL binary was built (including Postgres version), so if the environment and build is the same, you're often fine. However, that's not a great method if you're looking for portability.

I am using the COPY command; it's faster than pg_restore, as far as I can tell. The COPY plus building the index is the part that takes a few hours and doesn't fit well into Docker.

It depends on how you're using pg_dump: you can choose to dump data and indexes separately. I suspect your COPY + index builds are comparable to the pg_restore, depending on your degree of parallelism with the index builds.
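For anyone wondering what "dump data and indexes separately" looks like, pg_dump has a `--section` option (`pre-data`, `data`, `post-data`, where the last one covers indexes and constraints), and pg_restore accepts `--jobs` for custom-format archives. The sketch below only builds the command lines and prints them rather than running anything; the database name and output directory are placeholders, and whether this beats COPY plus manual index builds is something you would have to measure.

```python
import shlex

SECTIONS = ["pre-data", "data", "post-data"]


def dump_commands(dbname, outdir):
    """Build pg_dump commands writing schema, data, and indexes/constraints separately."""
    return [
        ["pg_dump", "--format=custom", f"--section={s}",
         f"--file={outdir}/{s}.dump", dbname]
        for s in SECTIONS
    ]


def restore_commands(dbname, outdir, jobs=4):
    """Build matching pg_restore commands; --jobs lets index builds run in parallel."""
    return [
        ["pg_restore", f"--dbname={dbname}", f"--jobs={jobs}", f"{outdir}/{s}.dump"]
        for s in SECTIONS
    ]


if __name__ == "__main__":
    # Placeholder names, not taken from the ConceptNet setup.
    for cmd in dump_commands("mydb", "/tmp/dumps") + restore_commands("mydb", "/tmp/dumps"):
        print(shlex.join(cmd))
```

Restoring the sections in order means the bulk data goes in before any indexes exist, and the parallelism mostly pays off in the `post-data` step.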

Shouldn’t method A be fairly easily scriptable?

Sure; what tools should I assume are on the target computer to script it, and how do I safely use these tools to modify system configuration like /etc/postgresql/10/main/pg_hba.conf?

Docker almost answers these questions, but it creates an equal number of new questions.

At RunKit we back code executions not only with Docker (which “freeze dries” the filesystem), but with CRIU, an awesome open source project that “freeze dries” the state of the memory as well. This allows you to “rewind” code executions without losing initial state. More information here if you’re interested: http://blog.runkit.com/2015/09/10/time-traveling-in-node-js-...

The issue with Docker is that it doesn't support a GUI at all. Jupyter (and its ilk) work for some needs, but I haven't found a good one that works for many of the (weird) libraries/applications that scientists use that need a GUI. My wife does IT for Physics/Astronomy and would love something like Docker that supports a GUI. There's an option of doing X-Windows trickery, but that's very fragile.

Any ideas?


Not fragile at all. Just mount the appropriate socket and there you go.

Alternatively, you can also run vnc or rdp inside the container:


A robust solution for Windows, Mac, and Linux is needed :/

Doesn't Docker work with VMs on Windows and Mac? (At least I know it does on the Mac).

Maybe it's still possible to mount an X11 socket from Windows/Mac to the VM (and then to Docker?)??

It's still trickery, but if you can get that to work on the Mac and Windows Docker clients, then you could distribute the Linux version of the custom apps and everyone is happy :)

Yeah, an X11 solution is one path forward - but it seems to be a bit brittle on Windows and Mac. But it is possible!

A lot of custom lab software is written in things like Python 3 / PyQt5. I prefer web interfaces, though, and think Chrome is as portable an interface as any. It's part and parcel of a more germane problem I have seen across domains: the need for a way to visually peek into a long-running server process or a computation's progress.

At Kyso.io we provide a ready-made docker image (with all the popular data science and machine learning libraries) in the Jupyter environment. With our GUI it's super simple to start up projects with one click. Use our terminal to easily install additional packages.

I was able to get Firefox running from a Docker container pretty easily. Just need to export your local $DISPLAY to your container with the -e switch (by IP) and volume-mount your .Xauthority file.
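Roughly, the recipe above translates into a `docker run` with one `-e` and one `-v`. The sketch below only assembles and prints the command. The image name, the host IP in DISPLAY, and the assumption that the container runs as root (so the cookie file lands in /root) are all made up for illustration and may need adjusting for a real setup.

```python
import os
import shlex


def x11_docker_command(image, display, xauthority, program):
    """Build a `docker run` argument list that points the container at the host's X server."""
    return [
        "docker", "run", "--rm",
        # Tell X clients in the container where the display lives.
        "-e", f"DISPLAY={display}",
        # Assumption: the container user is root, so its home is /root.
        "-v", f"{xauthority}:/root/.Xauthority:ro",
        image, program,
    ]


if __name__ == "__main__":
    cmd = x11_docker_command(
        image="my-firefox-image",        # hypothetical image name
        display="192.168.1.10:0",        # hypothetical host IP and display number
        xauthority=os.path.expanduser("~/.Xauthority"),
        program="firefox",
    )
    print(shlex.join(cmd))
```

On Windows and Mac the same command still needs an X server running on the host, which is where the brittleness people mention in this thread tends to come from.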

You would have to do similar things if this were running in a VM.

I have found it useful for some purposes to install lubuntu-desktop and tigervnc in a docker container. Leads to enormous images, but it runs GUI applications via a VNC viewer like remmina.

What about just using Vagrant with GUI VBox?

Is this robust across Windows, Mac, and Linux?

I'm pretty sure, yeah.

Cool, giving it a try today - thanks :)

Flatpak was designed for packaging desktop applications, but it seems like it would fit your needs reasonably well.

Environment is Windows, Mac, and Linux. This looks linux only.


Disclosure: I work on this project.

I think your definition of GUI and the parent's are very different.

We do x-windows trickery to enable inline graphics within Jupyter natively.

At least on a Linux host, Docker can support running GUI applications just fine.

That's the issue - the environment consists of Windows, Mac, and Linux, so it isn't as simple as the Linux case :/

This article seems to miss the mark on reproducibility. It’s not, “Oh how do I run pip install?” It’s the data. It’s the cleaning. It’s knowing what little script you ran to massage the data a bit. It’s all the stuff that happens after you install software. Docker doesn’t help you with any of this, and none of this was even discussed.

Reminds me of this post from a month ago: “Improving your data science workflow with Docker”


>Reproducible Research: I think of Docker as akin to setting the random number seed in a report. The same dependencies and versions of libraries that was used in your machine are used on the other person’s machine.

If you are lucky! Docker makes no guarantees about this. I'm betting most data science dockerfiles start with "RUN apt-get update && apt-get install -y -q \ && ... ". How much do you stick in numerous run commands following that? Docker images are not deterministic by default.

IMO one thing the author could have improved about the article is to make a more clear distinction between a Dockerfile and a docker image. What you say is true about builds in docker: using the `docker build` command to create an image from a Dockerfile isn't necessarily deterministic, for the reason you stated.

However, a docker image is deterministic (each one gets a SHA256 identifier based on its contents -- if the hashes match, then the images are identical), and if you care about the particular versions of dependencies, an image is what you should be sharing with your colleagues/readers/etc. in order to let them reproduce your results.
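The content-addressing idea behind that claim fits in a few lines. This is an illustration of the principle only, not of how Docker actually computes image IDs (which involves the image's configuration and layers); `content_id` is a made-up helper.

```python
import hashlib


def content_id(blob: bytes) -> str:
    """Return a sha256-based identifier that depends only on the bytes given."""
    return "sha256:" + hashlib.sha256(blob).hexdigest()


if __name__ == "__main__":
    # Stand-ins for the contents of two builds from the same Dockerfile.
    build_monday = b"python 3.6.4\nnumpy 1.14.0\n"
    build_friday = b"python 3.6.4\nnumpy 1.14.1\n"  # upstream shipped a new version

    print(content_id(build_monday) == content_id(build_monday))  # same bytes, same id
    print(content_id(build_monday) == content_id(build_friday))  # any change, new id
```

That is the sense in which sharing the image pins the environment while sharing the Dockerfile does not: the recipe can produce different bytes on different days, but a given identifier can only ever refer to one set of bytes.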

I like Docker and I get a lot of value from using it, but personally I feel this is one place where the project made an error in design. Too many git repos these days have a Dockerfile in them that you can use to build the code, but that will only work if the stars align and the dependencies you install when you build your image are the same (or compatible with) the ones the author used. IMO a better design would be for docker images to be flat files e.g. "docker build Dockerfile > my-cool-image.img" and then "docker run my-cool-image.img ...". I think if this were the pattern, more people would be adding their `my-cool-image.img` files to their git repos instead of using the Dockerfile as a flawed source of truth.

Follow up: "Nix for Data Science"?

Out of curiosity, anybody come up with a way to start a Spark job from within a Docker container?

This wasn't a very good article.
