
Docker for data scientists: Introduction and use cases - raab
https://unsupervisedpandas.com/data-science/docker-for-data-science/
======
tbenst
I would recommend looking at Singularity as well:
[http://singularity.lbl.gov/](http://singularity.lbl.gov/). Docker is not
supported in the HPC world, but Singularity is seeing rapid adoption.

~~~
thebeardedone
It really depends on what you are trying to do with it. For example, we have
around 15 different integration test configurations that run every night, for
which VMs may be better suited, as we want to test installation and deployment
automatically for 3 distributions (6 different versions, e.g. Ubuntu 14.04,
Ubuntu 16.04, CentOS 6, CentOS 7, etc.) plus the last 2 Windows Server
releases. The good thing is that they reproduce the customer environment.

But they have large downsides as well, which slow us down. They are a pain to
maintain as they are somewhat undocumented (you build a PoC for one and
management always wants more without improvements), a lot of edge cases cause
issues which are tough to reproduce (locally sometimes impossible, wasting a
lot of time), and they take a while to start, run, etc.

This is not too tragic for nightly tests, as we get the results in the morning,
but for tests which are started every hour you do not want to wait that long
to verify your changes work/didn't break anything. You can do these in stages,
where you create different images based on the result of a previous job (run
basic tests that cover base functionality that should always work, then run
more in-depth tests, then run performance tests at the end to ensure no
significant degradation was introduced, etc.) and send out notifications ASAP
in case of failure. The Dockerfile is essentially the documentation, as you can
see what is installed/configured. You can run everything locally just as it
would run in a k8s env., which for some reason everyone always struggles with.
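
As a sketch of that "Dockerfile as documentation" idea — every dependency of
the test environment is spelled out in one file (the app paths and package
choices here are hypothetical, just to illustrate):

```dockerfile
# Pin the distribution under test
FROM ubuntu:16.04

# Every system dependency is visible here, so the file doubles as setup docs
RUN apt-get update && apt-get install -y --no-install-recommends \
        python3 python3-pip \
    && rm -rf /var/lib/apt/lists/*

# Hypothetical application under test and its pinned Python dependencies
COPY ./app /opt/app
RUN pip3 install -r /opt/app/requirements.txt

CMD ["python3", "/opt/app/run_tests.py"]
```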

I am sure there are edge cases with Docker that are a pain as well, but the
other selling points suggest it may be the right direction. You just have to
find use cases and evaluate them.

~~~
mbreese
I'm not sure what that has to do with Singularity?

I know that the HPC clusters I've used in the past few years have all
supported Singularity, but none have supported Docker (aside from our small
lab cluster). Many HPC admins are (understandably) hesitant to allow non-
admins access to start Docker containers (requiring root), but Singularity has
no such user permission issues -- and it's faster than initializing a full VM
to run a job. I don't expect that to change so long as starting a container
requires root-effective permissions.
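
For context, the unprivileged workflow looks something like this (a sketch
assuming a 2.x-era Singularity CLI; the script name is made up):

```shell
# Pull a Docker Hub image and convert it to a Singularity image -- no daemon, no root
singularity pull docker://python:3.6

# Run a job inside it as your own user
singularity exec python-3.6.simg python3 my_analysis.py
```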

I suspect that many data scientists will be in a similar situation w.r.t. HPC
clusters (except for those using custom clouds like Seven Bridges).

~~~
thebeardedone
Sorry for the mix up, I was replying to rb808's post using the HN app on my
tablet, no idea what happened :(.

Regarding HPC, from what I remember they usually run old kernels (2.6) for
compatibility reasons, on which Docker is usually not supported (unless it is
backported, as in RHEL).

------
wyattjoh
I'd have thought Pachyderm [0] would be mentioned in the article, but it
wasn't. It uses Docker under the hood for its data pipelines. I've always
wanted an excuse to do some data science...

[0]: [https://www.pachyderm.io/](https://www.pachyderm.io/)

~~~
jdoliner
Thanks for the plug, wyattjoh. I'm one of the founders of Pachyderm, so I'd
like to expand on what you said a bit to clarify.

Pachyderm does use Docker under the hood, but we don't abstract it away, so
Data Scientists get the full power of Docker (and most of the power of
Kubernetes) at their fingertips. This means you can easily grab prefabricated
environments such as Jupyter or TensorFlow container images and deploy them
directly. Docker is so good for packaging environments that we didn't want to
conceal it.

On the other hand, we felt the data orchestration capabilities of Docker were
pretty lacking for Data Science use cases, so that's where we've focused our
energy. Volumes are a good basic tool for getting data to your code, but
they're a pretty blunt instrument. How do you split data up to parallelize
over it? How do you make sure you've got the right version of data? How do you
schedule new computations when data becomes available? Those are some of the
use cases we solve with our distributed file system PFS.
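
To make that concrete, a Pachyderm pipeline is declared as a spec that answers
those questions (repo and image names here are hypothetical; the glob pattern
controls how data is split for parallelism, and PFS versions the inputs):

```json
{
  "pipeline": { "name": "word-count" },
  "input": {
    "pfs": { "repo": "raw-text", "glob": "/*" }
  },
  "transform": {
    "image": "my-registry/word-count:1.0",
    "cmd": ["python3", "/count.py", "/pfs/raw-text", "/pfs/out"]
  }
}
```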

------
robszumski
As someone dabbling in ML for the first time, setting up the correct Python
env has been a nightmare just to run some sample code. I am familiar with
Docker and would love it if this got more mainstream adoption.

Fighting with dependencies sucks when you aren't intimately familiar with the
ecosystem: Python, Node and other JS, and even Go.

~~~
smnrchrds
I haven't worked in ML, but I've used Python extensively for scientific
computing. Setting up the environment used to be a headache, but Anaconda
solved that issue. Are ML tools not available in Anaconda?
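
For reference, the Anaconda workflow boils down to one declarative file — a
minimal `environment.yml` might look like this (package list is just an
example of a typical scientific stack):

```yaml
name: ml-sandbox
channels:
  - conda-forge
dependencies:
  - python=3.6
  - numpy
  - pandas
  - scikit-learn
  - jupyter
```

Then `conda env create -f environment.yml` followed by
`source activate ml-sandbox` gives a self-contained environment.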

~~~
robszumski
This is on my list of tools to investigate next. Glad I’m headed in the right
direction.

~~~
JPKab
Move it to the top of your list. Anaconda is absolutely where you should start
if you are trying to mess with ML/Predictive.

------
rb808
I spent a lot of time learning Docker and k8s, thinking it would be really
useful. Interest from the rest of the team is nearly zero. Actual VMs work
great; I'm thinking it's not really necessary except when you need
scalable/serverless applications.

------
vonnik
[Disclosure: Skymind co-founder here.]

The Skymind Intelligence Layer helps data scientists operationalize their
models on-prem and in the public cloud, and uses Docker to do that.

[https://skymind.readme.io/v1.0.1/docs/quickstart](https://skymind.readme.io/v1.0.1/docs/quickstart)

[https://skymind.readme.io/v1.0.1/docs/docker-image](https://skymind.readme.io/v1.0.1/docs/docker-image)

------
bane
Does anybody have some good patterns for bundling dependencies for analytics
in Docker containers, but then handling the execution of the analytics on
Spark clusters? There seem to be various permutations of this notion, but
I've heard that most of them have issues or don't work quite as expected.
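
One permutation is Spark's native Kubernetes mode (Spark 2.3+), where driver
and executors run inside your dependency image — a sketch, with the cluster
URL, image name, and job path as placeholders:

```shell
spark-submit \
  --master k8s://https://my-cluster:6443 \
  --deploy-mode cluster \
  --conf spark.kubernetes.container.image=my-registry/analytics-deps:latest \
  --conf spark.executor.instances=4 \
  local:///opt/app/job.py
```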

------
ganeshkrishnan
Pedantic: it's "whet your appetite" not "wet your appetite"

~~~
raab
[https://github.com/Raab70/Raab70.github.io/commit/d69addbf12...](https://github.com/Raab70/Raab70.github.io/commit/d69addbf124cd35f804d3fe29795816a72472496)

------
iamwil
How many data scientists are interested in deploying their own models? And how
many of you would learn Docker in order to do so?

~~~
minimaxir
There is a strong overlap between data science and devops, which is something
not often discussed in all the data science/machine learning MOOCs.

~~~
claytonjy
I agree this is a big gap in the curricula; "engineering chops" (managing
servers, environments, etc.) is what I've seen be the biggest force multiplier
for data scientists (esp. at non-huge orgs), not how good at stats they are or
how fancy a model they can build.

~~~
iamwil
Which orgs have you seen this as the biggest force multiplier?

From the sibling comments in this thread, it seems like some agree and others
don't. I was curious whether there's a segment that does, and who they might
be.

~~~
claytonjy
Biased sample for sure, but me and my friends/colleagues in the Midwest at
companies with DS teams of <10 people, often <5. Companies/teams far too
small to separate the builders from the shippers.

I also practice data science from a more engineering angle than many which
further colors my opinions. Building models is fun, but I also enjoy
deployment and maintenance, so I don't _want_ to hand that part off.

------
cle
How does this compare with the development and deployment functionality in
e.g. AWS SageMaker?

~~~
raab
SageMaker has three parts: hosted notebooks (similar to case #2), API access
to ML algorithms which are "optimized", and deployment via API access to
trained models.

------
road2stat
Personally, while Docker can be a very valuable addition to a data scientist's
toolbox, I don't think pinning dependencies to specific versions is the best
idea. Reproducibility, just like integration, deployment, and delivery, should
be a continuous process.

For data scientists who write workflows as R Markdown documents and want to
containerize them, you might want to check out our R package liftr:
[https://liftr.me/](https://liftr.me/).

