
Docker for Data Science - rbanffy
https://towardsdatascience.com/docker-for-data-science-4901f35d7cf9
======
jdoliner
Docker is really starting to be used a lot in data science. Kubernetes too, as
it makes it easy to run that code in a distributed way. An ecosystem of tools
is starting to emerge to help with this, such as Kubeflow [0], which brings
TensorFlow to Kubernetes in a clean way. I work on one myself [1] that manages
the data as well as using Docker containers for the code, so that your whole
data process is reproducible.

[0]
[https://github.com/kubeflow/kubeflow](https://github.com/kubeflow/kubeflow)

[1]
[https://github.com/pachyderm/pachyderm](https://github.com/pachyderm/pachyderm)

~~~
agibsonccc
Disclaimer: I am not fully on board with k8s; I enable clusters that are not
Kubernetes as well as k8s deployments.

Is pachyderm actually used anywhere though? Especially compared to the hadoop
distros?

Why would I use this over just spinning something up on EMR? If I were on
prem, I likely already have Hadoop installed. K8s is typically a separate
cluster (mainly because it's still a bit error prone for anything with data
applications, which I'm assuming is what you're claiming to solve?).

Beyond that, re: Kubeflow. The problem with just assembling some of these
things and calling it "production" is that you're still missing a lot of the
basics people need when they go to production:

1\. Experiment tracking for collaboration

2\. Managed Authentication/connectors for anything _outside_ K8s

3\. Connecting with/integrating with existing databases

4\. Proper role management: Data scientists aren't deploying things to prod
(at least not customer-facing prod at scale where money is on the table;
"prod" could also mean internal tools for experiments); they typically need to
integrate with external processes and different teams.

Many of these things are left as an exercise for the reader (especially in k8s
land). Granted, tutorials and hosted environments exist, but nothing is the
"1 click" deploy being promised in these "See how easy it is!" blog posts that
run something on one laptop.

A lot of things being promised here just don't line up for me yet. K8s is
maturing quite a bit, but when the story is still "Run managed k8s on the cloud
or spend time upgrading your cluster every quarter", I have to say it's nowhere
close to something a typical data scientist running sklearn on their Mac is
going to be able to get started with.

The closest I've seen to that that isn't our own product is AWS SageMaker,
which solves real-world problems by gluing components together in a fairly
seamless way.

Let's just be clear that we still have a long way to go. In practice, we have
separate teams that need to collaborate. Data scientists aren't going to
download minikube tomorrow and go to production without someone else's
approval, and not without going through a huge learning curve.

We're moving in the right direction though!

------
tbenst
This article misses the point for the Scientist in data science. HPC will not
support Docker; Singularity seems to be the winner:
[http://singularity.lbl.gov/](http://singularity.lbl.gov/)

~~~
codles
This is a great point. People on HN often forget that data science is more
than just machine learning, and that we have existing computing infrastructure
(beyond AWS) to interface with.

~~~
rbanffy
Well... Singularity can import Docker containers, so I assume it's easy to
cross over to that space.
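
For instance, something along these lines (a rough sketch; the exact commands
depend on your Singularity version, and the image names are just examples):

    # Pull an image from Docker Hub and convert it into a Singularity image
    singularity pull docker://python:3.6

    # Or, on Singularity 2.4+, build a named image from any Docker image
    singularity build my-analysis.simg docker://your-analysis-image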

------
alex-v
We are experimenting with Nix
([https://nixos.org/nix/](https://nixos.org/nix/)) to create reproducible
"containers" of the tools used to build a model. Nix allows you to pin versions
of R, Python, R/Python packages, etc., and all the underlying C++/Fortran/etc.
libraries all the way down to (but excluding) the kernel. Deployment (or
sharing with other team members; Nix works on both Linux and macOS) of such
"derivations" is also very easy if Nix is installed everywhere, since the
dependencies are not visible to other derivations and I can deploy, e.g., R
models requiring different versions of the dplyr package on the same node. It
looks very promising; here ([https://blog.wearewizards.io/why-docker-is-not-
the-answer-to...](https://blog.wearewizards.io/why-docker-is-not-the-answer-
to-reproducible-research-and-why-nix-may-be)) is a blog post that I found
very helpful.
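
As a rough sketch of what this looks like day to day (the package attribute
names are from nixpkgs and may differ between channel revisions):

    # Ephemeral shell with R, dplyr, Python and numpy as pinned by the channel
    nix-shell -p R rPackages.dplyr python36 python36Packages.numpy

    # Pin the whole package set to an exact nixpkgs revision for reproducibility
    # (<rev> is a placeholder for the commit you want to freeze on)
    nix-shell -p R rPackages.dplyr \
      -I nixpkgs=https://github.com/NixOS/nixpkgs/archive/<rev>.tar.gz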

------
qudat
One issue my team is dealing with is how to store our large models. We are
using git lfs at the moment, but it feels woefully inadequate. It is trivial
for a team member to accidentally commit a 500 MB file to plain git and blow
up our repo size.

Not to derail the post, but what does everyone else use?
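
For context, the git-lfs side of our setup is roughly the standard pattern
(file patterns here are just examples):

    # Route model artifacts through LFS instead of the normal git object store
    git lfs track "*.pkl" "*.h5"
    git add .gitattributes

    # Sanity check that the big files are actually going through LFS
    git lfs ls-files

The problem is that nothing stops someone from committing a model with an
extension we forgot to track.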

~~~
jaz46
Pachyderm (my company) offers versioning large data sets in a similar fashion
to git without having to deal with gitLFS. Pachyderm isn't Git under the hood,
but we borrow some of the major concepts -- repos as versions data sets,
commits, branches, etc. The analogy to Git breaks down as you go a little
deeper because building efficient data versioning and management is a
surprisingly different problem than code VCS
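
Roughly, the day-to-day workflow looks like this (pachctl syntax has changed
across releases, so treat this as a sketch rather than a reference):

    # A repo is a versioned data set; every put creates a new commit on a branch
    pachctl create repo training-data
    pachctl put file training-data@master:/images.tar -f images.tar

    # Commits are addressable, so pipelines and colleagues can pin to an exact version
    pachctl list commit training-data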

------
rspeer
This article doesn't tell me anything about using Docker for data science,
just about using Docker in general.

I have _tried_ to use Docker for reproducible research, as a way of
distributing ConceptNet with all its dependencies [1]. I am about to give up
and stop supporting the Docker version.

[1] [https://github.com/commonsense/conceptnet5/wiki/Running-
your...](https://github.com/commonsense/conceptnet5/wiki/Running-your-own-
copy#running-conceptnet-using-docker)

This was the process I needed to explain to researchers who wanted to use my
stuff before Docker:

1\. Get the right version of Python (which they can install, for example, from
the official package repo of any supported version of Ubuntu).

2\. Set up a Python virtualenv.

3\. Install Git and PostgreSQL (also from standard packages).

4\. Make sure they have enough disk space (but they'll at least get obvious
error messages about it if they don't).

5\. Configure PostgreSQL so that your local user can write to a database (this
is the hardest step to describe, because PostgreSQL's auth configuration is
bizarre; see the sketch after this list).

6\. Clone the repo.

7\. Install the code into your virtualenv.

8\. Run a script that downloads the data.

9\. Run a script that builds the data and puts it in the database. (This takes
up to five hours, because of PostgreSQL, but you get sorta-sensible output to
watch while it's going on.)

10\. Run the API on some arbitrary port.

11\. Optionally, set up Nginx and uwsgi to run the API efficiently (but really
nobody but me is going to do this part).

12\. Optionally, extend the Python code to do something new that you wanted to
do with it (this is the real reason people want to reproduce ConceptNet).
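
For step 5, the incantation usually boils down to something like this on
Debian/Ubuntu packaging (the database name is just an example; use whatever
the build scripts expect):

    # Let your regular login create and own databases, then create one as yourself
    sudo -u postgres createuser --createdb "$USER"
    createdb conceptnet5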

-

Now, here is the process after Docker:

1\. Install Docker and Docker Compose.

2\. No, _not_ the ones in Ubuntu's repositories. Those are too old to work.
Remove those. You need to set up a separate package archive with sufficiently
new versions of Docker software.

3\. Configure a Docker volume that's mapped onto a hard disk that has enough
space available. (If you do this wrong, the build will fail and you will not
find out why; see the sketch after this list.)

4\. Configure your user so that you can run Docker stuff and take over port
80. If you can't take over port 80, jump through some hoops.

5\. Run the Docker build script. This will install the code, set up
PostgreSQL, download the data, build the data, and put the data in the
database. (The built data, of course, can't be distributed with the Docker
image; that would be too large.)

6\. At this point I tell people to wait five hours and then see if the API is
up. If it is, it worked! But of course it didn't work, because the download
got interrupted or you ran out of disk space or you had the wrong version of
Docker or something.

7\. Check your Docker logs.

8\. Disregard all the irrelevant messages from upstream code, especially
uwsgi, which is well known for yelling incoherently when it's _working
correctly_. Somewhere around the middle of the logs, there might be a message
about why your build failed.

9\. Translate that message into a solution to the problem. (There is no guide
to doing this, so at this point users have to send me an e-mail, a Git issue,
or a Gitter message and I have to take the time to do tech support for them.)

10\. Learn some new commands for administering Docker to fix the problem.

11\. Delete your Docker volumes so that the PostgreSQL image knows it needs to
reload the data, because Docker images don't actually have a concept of
loading data and everything is a big hack.

12\. Try again.

13\. Wait five hours again. (Look, I can't just give you a tarball of what the
PostgreSQL data directory should end up as; that's not portable. It needs to
all go in through a COPY command.)

14\. Now you can probably access the API.

15\. Oh shit, you didn't just want the API, you wanted to be able to extend
the code. Unfortunately, the code is running out of sight inside a Docker
image with none of the programming tools you want. This is where you come and
ask me for the non-Docker process.
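
For reference, steps 3 and 11 of the above in shell form (the volume name and
paths are only examples, and this assumes the stock local volume driver):

    # Step 3: put the Postgres volume on a disk that actually has room
    docker volume create --driver local \
      --opt type=none --opt o=bind --opt device=/mnt/bigdisk/pgdata pgdata

    # Step 11: when a build goes wrong, wipe the volumes so Postgres rebuilds from scratch
    docker-compose down --volumes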

-

Attempting to use Docker for data science has led to me doing unreasonable
amounts of tech support, and has not made anything easier for anyone. I
constantly run into the fact that Docker is designed for tiny little services
and not designed for managing any amount of data. What's the _real_ story
about Docker for data science?

~~~
tekkk
Hah, I applaud you for writing up all the details. I have to agree that in
some cases using Docker doesn't really accomplish much of anything at all.
When speed and productivity are on the line, it's sometimes better to just
stick with the old ways. I'm sure some Docker genius might be able to
streamline that process a bit, but agony-wise it might not matter.

To me, Docker serves as a sure way to get programs to work regardless of the
OS, while also being composable to end-user needs. They should also be easily
portable to a cloud/production environment. The purpose of containers was, in
the first place, to handle complexity at scale with some performance benefits.

~~~
sigjuice
What do you mean by "regardless of the OS"? Docker is primarily for running
Linux programs.

~~~
grzm
I read that as: using Docker, you can run an application (or an application
stack) in an OS independent of the host OS. This is no different really from
hosting any virtual machine, but Docker has made it a lot easier.
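
A trivial example of what I mean, assuming Docker is installed: on an Ubuntu
or macOS host you can run a CentOS userland without installing CentOS anywhere.

    # The container carries the CentOS userland; the Linux kernel is shared with
    # the host (or provided by the Docker VM on Mac/Windows)
    docker run --rm -it centos:7 cat /etc/os-release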

~~~
sigjuice
When you say "in an OS", what possible OSes might these be?

------
tolmasky
At RunKit we back code executions not only with Docker (which “freeze dries”
the filesystem), but with CRIU, an awesome open source project that “freeze
dries” the state of the memory as well. This allows you to “rewind” code
executions without losing initial state. More information here if you’re
interested: [http://blog.runkit.com/2015/09/10/time-traveling-in-node-
js-...](http://blog.runkit.com/2015/09/10/time-traveling-in-node-js-
notebooks/)
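
Stripped of all the RunKit-specific plumbing, the checkpoint/restore flow is
roughly this (the pid and paths are placeholders; our real integration is more
involved):

    # Freeze-dry a running process tree: memory, file descriptors, etc.
    mkdir -p /tmp/checkpoint
    criu dump -t <pid> --images-dir /tmp/checkpoint --shell-job --leave-running

    # Later, thaw it back to exactly that state
    criu restore --images-dir /tmp/checkpoint --shell-job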

------
gilbetron
The issue with Docker is that it doesn't support a GUI at all. Jupyter (and
its ilk) works for some needs, but I haven't found a good one that works for
many of the (weird) libraries/applications that scientists use that need a
GUI. My wife does IT for Physics/Astronomy and would love something like
Docker, but with GUI support. There's the option of doing X-Windows trickery,
but that's very fragile.

Any ideas?

~~~
diffeomorphism
[https://blog.jessfraz.com/post/docker-containers-on-the-
desk...](https://blog.jessfraz.com/post/docker-containers-on-the-desktop/)

Not fragile at all. Just mount the appropriate socket and there you go.

Alternatively, you can also run vnc or rdp inside the container:

[https://gist.github.com/gmacario/11385611](https://gist.github.com/gmacario/11385611)
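
The X11 variant is essentially a one-liner (the image name is just an example;
on many setups you also need to open the X server to local connections first):

    # Allow local containers to talk to your X server, then hand it the socket
    xhost +local:
    docker run --rm -e DISPLAY=$DISPLAY \
      -v /tmp/.X11-unix:/tmp/.X11-unix \
      some-gui-app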

~~~
gilbetron
A robust solution for Windows, Mac, and Linux is needed :/

~~~
mbreese
Doesn't Docker work with VMs on Windows and Mac? (At least I know it does on
the Mac.)

Maybe it's still possible to mount an X11 socket from Windows/Mac into the VM
(and then to Docker)?

It's still trickery, but if you can get _that_ to work on the Mac and Windows
Docker clients, then you could distribute the Linux version of the custom apps
and everyone is happy :)

~~~
gilbetron
Yeah, an X11 solution is one path forward - but it seems to be a bit brittle
on Windows and Mac. But it is possible!

------
jonathankoren
This article seems to miss the mark on reproducibility. It’s not, “Oh how do I
run pip install?” It’s the data. It’s the cleaning. It’s knowing what little
script you ran to massage the data a bit. It’s all the stuff that _happens
after_ you install software. Docker doesn’t help you with any of this, and
none of this was even discussed.

------
stablemap
Reminds me of this post from a month ago: “Improving your data science
workflow with Docker”

[https://news.ycombinator.com/item?id=16071612](https://news.ycombinator.com/item?id=16071612)

------
justinwp
>Reproducible Research: I think of Docker as akin to setting the random number
seed in a report. The same dependencies and versions of libraries that was
used in your machine are used on the other person’s machine.

If you are lucky! Docker makes no guarantees about this. I'm betting most data
science Dockerfiles start with "RUN apt-get update && apt-get install -y -q \
&& ... ". How much do you stick in the numerous RUN commands that follow?
Docker images are not deterministic by default.
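
A quick illustration (image tags are hypothetical): build the same Dockerfile
twice, some time apart, and you will usually get two different images, because
apt and pip resolve versions at build time.

    docker build -t analysis:build1 .
    # ...a week later, same Dockerfile, upstream packages have moved on...
    docker build -t analysis:build2 .
    docker image inspect --format '{{.Id}}' analysis:build1 analysis:build2

    # Pinning inside the Dockerfile narrows the drift but doesn't eliminate it, e.g.
    #   RUN pip install numpy==1.14.0 pandas==0.22.0 scikit-learn==0.19.1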

~~~
jake-low
IMO, one thing the author could have improved about the article is to make a
clearer distinction between a Dockerfile and a Docker image. What you say is
true about _builds_ in Docker: using the `docker build` command to create an
image from a Dockerfile isn't necessarily deterministic, for the reason you
stated.

However, a docker _image_ is deterministic (each one gets a SHA256 identifier
based on its contents -- if the hashes match, then the images are identical),
and if you care about the particular versions of dependencies, an _image_ is
what you should be sharing with your colleagues/readers/etc. in order to let
them reproduce your results.

I like Docker and I get a lot of value from using it, but personally I feel
this is one place where the project made a design error. Too many git repos
these days have a Dockerfile in them that you can use to build the code, but
that will only work if the stars align and the dependencies you install when
you build your image are the same as (or compatible with) the ones the author
used. IMO a better design would be for Docker images to be flat files, e.g.
"docker build Dockerfile > my-cool-image.img" and then "docker run my-cool-
image.img ...". I think if this were the pattern, more people would be adding
their `my-cool-image.img` files to their git repos instead of using the
Dockerfile as a flawed source of truth.
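
The closest thing Docker offers to that flat-file workflow today is save/load,
which ships the built image, with its exact dependency versions, rather than
the recipe (the image name is just an example):

    # Export the built image to a single tarball you can check in or pass around
    docker save -o my-cool-image.tar my-cool-image:latest

    # A colleague loads it and runs exactly the bits you ran
    docker load -i my-cool-image.tar
    docker run my-cool-image:latest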

------
bane
Out of curiosity, anybody come up with a way to start a Spark job from within
a Docker container?

------
muninn_
This wasn't a very good article.

