
Ask HN: How does your data science or machine learning team handle DevOps? - mlthoughts2018
Machine learning teams often face operating needs not seen in many other domains.

Some examples:

- instrumenting observability that not only monitors data quality and upstream ETL job status, but also domain-specific considerations of training ML models, like overfitting, confusion matrices, business use case accuracy or validation checks, ROC curves, and more (all needing to be customized and centrally reported per model training task).

- standardizing end-to-end tooling for special resources, e.g. queueing and batching to keep utilization high for production GPU systems, high-RAM use cases like approximate nearest neighbor indexes, and just run-of-the-mill stuff like how to take a trained model and deploy it behind a microservice in a way that bakes in logging, tracing, alerting, and more.

Machine learning engineers and data scientists tend to have a comparative advantage when they can focus on understanding the data, running experiments to decide which models are best, pairing with product managers or engineers to understand constraints around the user experience, and designing software tools and abstractions around unique training or serving architectures (like the GPU queueing example).

Increasingly, teams of data scientists are required to do DevOps work: configuring and maintaining e.g. Kubernetes & CI/CD workloads, alerting and monitoring, logging, and instrumenting security or data access control compliance solutions.

This is harmful because it reduces the time and effort these engineers can spend on their comparative advantages (a direct loss to the customer or user) in exchange for DevOps jobs they are not trained for and not interested in (which often leads data scientists to burnout), and that many non-specialists can do.

How do you structure teams, build tools, and establish compliance or operations expectations that allow data scientists, related statistical scientists, and ML backend engineers to flourish?
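To make the first bullet concrete, here is a minimal sketch of the kind of per-training-task report that could be centrally collected. The report schema and the `run_id` naming are illustrative assumptions, not a real standard:

```python
import json
from collections import Counter

def training_report(y_true, y_pred, run_id):
    """Build a per-run validation report of the kind each model
    training task could emit to a central metrics store."""
    # Count each (true label, predicted label) combination.
    pairs = Counter(zip(y_true, y_pred))
    labels = sorted(set(y_true) | set(y_pred))
    # Rows are true labels, columns are predicted labels.
    confusion = [[pairs[(t, p)] for p in labels] for t in labels]
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    # A central store would key reports on run_id for per-model dashboards.
    return json.dumps({"run_id": run_id, "labels": labels,
                       "confusion_matrix": confusion, "accuracy": accuracy})

report = training_report([0, 1, 1, 0], [0, 1, 0, 0], run_id="churn-v3")
```

The same report could carry ROC points or business-metric checks; the point is a uniform, machine-readable artifact per training run.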
======
viig99
ML engineer here. The team started as a research team; now that we have things in
production and a lot of DevOps and engineering work, we bifurcated into pods
that work on specific bits and pieces, though there's a lot of constant fire-fighting.
We re-wrote the entire stack from Python to a C++ threadpool async gRPC server
(is Thrift the only good threadpool server implementation available?), deployed on
OpenShift, used Vector + InfluxDB + Grafana for dashboards / internal model
monitors, Elasticsearch for logging, and a lot of other tools for validation,
filtering for potential training candidates, etc. Right now we're working on CI/CD for
ML: if training finds a better model based on different
validation sets, a one-click deployment is staged and ready for approval.
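A hedged sketch of that champion/challenger gate (the function name, score dictionaries, and `min_gain` threshold are illustrative assumptions, not their actual pipeline):

```python
def ready_for_approval(candidate, champion, min_gain=0.01):
    """Gate a freshly trained model for one-click deployment: the
    candidate must beat the current champion on every validation set
    by at least min_gain before a human is asked to approve."""
    if candidate.keys() != champion.keys():
        raise ValueError("models must be scored on the same validation sets")
    return all(candidate[k] >= champion[k] + min_gain for k in candidate)

stage = ready_for_approval({"holdout": 0.91, "stress": 0.84},
                           {"holdout": 0.89, "stress": 0.82})
```

Requiring a win on every validation set (rather than an average) keeps a regression on one slice from being masked by gains elsewhere.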

~~~
dnautics
Wow, thanks for the detail.

> Re-wrote entire stack from python to C++ threadpool async grpc

Incredible. Presumably this is for latency/performance on the inference side?

~~~
viig99
Yes. Accuracy, latency & throughput are the three poles we try to balance; C++
helps with latency & throughput and helps keep costs low.

~~~
mlthoughts2018
Why would c++ help with latency in comparison to say Python with numpy / numba
/ Cython? All the production critical “this needs to be as fast as possible
stuff” I’ve ever worked on has been all Python, achieving complete speed
parity with C, at a much faster development speed and with way way less
boilerplate code.
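As a minimal sketch of that pattern: a hot loop JIT-compiled with numba (assumed installed, with a plain-Python fallback if it isn't), which is the kind of code where Python can approach C speed:

```python
import numpy as np

try:
    from numba import njit  # JIT-compiles the decorated function to machine code
except ImportError:          # fall back to interpreted Python if numba is absent
    njit = lambda f: f

@njit
def dot(xs, ys):
    # Explicit loop: slow in CPython, compiled to a tight native loop by numba.
    total = 0.0
    for i in range(xs.shape[0]):
        total += xs[i] * ys[i]
    return total

result = dot(np.arange(4.0), np.arange(4.0))  # 0*0 + 1*1 + 2*2 + 3*3
```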

~~~
disgruntledphd2
If you have hard constraints at inference time, then it can be much easier to
tune to a time budget with C++.

Like, it's normally not worth it, but when you need it, you really need it.

~~~
mlthoughts2018
I definitely agree that could be a case where you want a statically compiled
module that avoids any interpreted-language overheads or high-cost
abstractions. But what would make C++ easier to write, tune, integrate, or
deploy in that case than using Cython to create the C++ extension for you?

~~~
viig99
I personally find C++ + pybind11 vastly easier to work with; also,
transitioning completely to C++ from there was a pretty small leap.

~~~
mlthoughts2018
Interesting. I’ve never heard anyone who frequently uses Python and C++
together express this preference; it’s always the other direction, that Cython
is easier.

~~~
viig99
PyTorch is pybind11 + C++.

~~~
mlthoughts2018
True, but that one project is just a drop in the bucket of scientific
computing and C++ interop in Python, even despite the success and popularity
of PyTorch. So it doesn’t really say much in favor of pybind11 that this or
that project got good mileage out of it; it’s still a deep minority
compared to Cython.

------
lettergram
Depends on what you mean by machine learning. In deep learning applications,
optimization and getting the thing to train effectively is 90% of the job.

My team manages a platform / framework we built similar to FloydHub (we wrote
a Django app that integrates with AWS, but any cloud provider would do) and
another similar to SigOpt (we built a server-client system that utilizes the
first system's APIs to deploy nodes)[1]. This lets us effectively develop and
then hyperparameter-tune our models. Finally, we deploy them within a library
and a Flask app. This makes them easily digestible across the enterprise.
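A hedged sketch of what such a Flask wrapper around a model can look like (the route, JSON schema, and the stand-in model are illustrative assumptions, not their actual service):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Stand-in for a trained model loaded from the shared library.
model = lambda x: 2 * x + 1

@app.route("/predict", methods=["POST"])
def predict():
    # Consumers across the enterprise only need to speak JSON-over-HTTP.
    x = request.get_json()["x"]
    return jsonify({"prediction": model(x)})
```

Run with `app.run()`; any team can then POST `{"x": 3}` to `/predict` without knowing anything about the model's internals.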

We are a team of three that leverage other teams to make improvements (aka
“inner sourcing” development). It’s a busy job, but managing the whole stack
gives us the ability to develop models much faster and effectively. With
multiple teams utilizing our frameworks we have a kind of critical mass to
keep everything running quite well.

[1] [https://medium.com/capital-one-tech/system-language-agnostic...](https://medium.com/capital-one-tech/system-language-agnostic-hyperparameter-optimization-at-scale-and-its-importance-for-automl-92d9f9add416)

------
hprotagonist
would you accept “very badly”?

a colleague of mine attempted to share a Dropbox link to a git repo and
working directory he had helpfully zipped up, including 4 datasets.

So in order to get the 50 lines of code I was meant to merge in, he thought it
was reasonable to have me download a headless 4 GB archive.

I told him no.

~~~
dnautics
it's good to hear that people are reaching for these sorts of things, which are
obvious to most devs (I mean, this is something I did once early in my career,
with a container, and was told "never do this again"), but maybe not so
obvious for ML devs.

~~~
Jugurtha
It is due to one central point: the product is not simply software that
_processes_ data, the product _is a product of data_.

Imagine if you wrote a mobile application that stops working when the mood of
the user changes. How much of a headache would you have developing, deploying,
and maintaining that kind of app?

Concrete example: You work on a churn problem. You're good and you have
support, so you get the data fast. You produce a great model. That model is
perishable. The market changes and the model you trained with the data your
client gave you becomes stale and loses its "predictive power". In the
simplest scenario, you must get fresh data, and you do it all over again with
training, deployment, etc.

One other difference: for normal software development, your stack is pretty
much set and you spend most of your time using that stack to develop, test, push.
In ML, a lot of the effort is in exploration. You want to try a new paper or a
new algorithm, but that algorithm is only implemented in one library and not
the other, and that library conflicts with another one. You want to try as
many combinations as possible. This doesn't really happen in standard software
development, where components change relatively slowly.

There's also the data problem. Unless one does Kaggle competitions, you don't
get handed clean JSON or CSV in projects. In most cases, you get whatever the
client has [emails, powerpoints, files, archives, audio, video, esoteric third
party systems you have to interface with without vendor support]. There's no
"API" to tap into, and there isn't only one source of data you can build an
interface for and call it a day. Hence a lot of custom code to process all that.
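That custom code often reduces to routing each kind of client artifact to its own handler. A hypothetical sketch (the registry, handler names, and return values are all illustrative; real handlers would parse emails, decks, audio, and so on):

```python
from pathlib import Path

# Registry mapping a file suffix to its ingestion handler.
LOADERS = {}

def loader(*suffixes):
    """Decorator: register a handler for the given file suffixes."""
    def register(fn):
        for s in suffixes:
            LOADERS[s] = fn
        return fn
    return register

@loader(".csv", ".tsv")
def load_table(path):
    return ("table", path.name)

@loader(".eml", ".msg")
def load_email(path):
    return ("email", path.name)

def ingest(path):
    """Dispatch a client file to the right loader, or fail loudly."""
    path = Path(path)
    handler = LOADERS.get(path.suffix.lower())
    if handler is None:
        raise ValueError(f"no loader for {path.suffix!r}")
    return handler(path)
```

New client formats then cost one decorated function each, instead of another branch in a growing if/else.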

There are many problems like these. We spend time with applicants who do
competitions and imagine that the job is building models, explaining to them
that we're not there yet.

------
navbaker
We have been fortunate enough to be able to hire people for our group of about
60 that fall into two categories:

1) Researchers/mathematicians

2) DevOps/software engineers that are fluent enough in ML/AI processes and
methodologies that they can listen to what group 1 says they need and
implement a system that will efficiently solve their problem.

~~~
ynouri
Are 1) and 2) separate teams? How many people in each?

~~~
navbaker
They are separate, but we all sit in two person offices along one long
hallway. This allows us to quickly talk through issues or bounce ideas off
someone, then go back and spend long stretches implementing. It’s about 2/3rds
researchers, 1/3rd devs.

------
was_boring
I work on part of this problem on my team at my job. We're organized so that
specific teams do exactly what you describe.

I work on data ingestion and getting the data to a safe place as quickly as possible
(this team has more senior members than other teams because of its importance).
Another team does ETL on it, another handles data storage and access, another
ML governance, another creates the models, and so on.

We focus on our domain and communicate. Seems to work well.

~~~
Jugurtha
Thanks for the input. What would be a bad day for each of the teams? What
would be the most common problems they face, the most irritating, and the
worst? What does each of these teams dread?

Also, "It's always sunny in Philadelphia".. What's the "It's always [x] in
[team]"?

------
firstfewshells
All the companies I've worked at have a dedicated Platform team that does all
of the things that you mentioned.

~~~
LSTMeow
Heh, that's why we open-sourced our platform + DevOps. Check it out; it's
totally next-level magical stuff:
[https://github.com/allegroai/trains](https://github.com/allegroai/trains)

------
atmosx
They have a dedicated devops team who handles infrastructure and operations on
top of AWS.

~~~
calebkaiser
Shameless self-plug, but I maintain an open source project that automates
DevOps for ML deployments, built on top of AWS:

[https://github.com/cortexlabs/cortex](https://github.com/cortexlabs/cortex)

~~~
atmosx
I'll share your project with the team, maybe they'll use it to automate stuff!

Ty!

------
msapaydin
Some projects started by researchers from Stanford University seem to address
these issues. Some keywords I have come across are MLflow, Sisu, and
Databricks (the last one being the company behind Spark). Sisu is a company I
have not tried, and I had trouble working with MLflow, but the ideas are worth
a look.

------
theo31
I couldn't agree more. I started building an ML hosting platform to solve your
second point, and I'm thinking of building a managed NN service, as it's a common
pain point.

[https://inferrd.com](https://inferrd.com)

------
LSTMeow
Disclosure/plug: Evangelist for AllegroAI here, but I'm only going to allude
to our FREE open-source platform+devops solution, Allegro Trains:
[https://github.com/allegroai/trains](https://github.com/allegroai/trains)


1100% agree with you about unnecessary time spent on configuration and
maintenance.

As a research-oriented professional, you need something that will seamlessly
integrate with _your own_ flow.

We are in the ML stone-age, the playbook is not really written yet. Currently,
CI/CD + agile is (necessary?) overhead that costs us precious time-to-product.

Here is my manifesto:

1. Anything related to "production" should be taken care of by DevOps peeps,
yes, even if it is "MLOps". Monitoring, standardization, etc. should not be your
responsibility. If it is somehow on you, then it should be part of the same
experimentation platform you are using. Extra tools? Extra people.

2. Likewise, anything related to data engineering, preparation, etc. should be
compartmentalized and have separate version control (it is not as complicated
as doing it the DVC way, BTW). If you do have to do these tasks, then, you guessed
it, they should be part of the same experimentation platform you are using.

3. Research MLOps (ResOps?): Did I say experimentation platform? Any team
member should be able to work as she wishes: notebooks, scripts, whatnot. And
if you forget to commit something before you run? You want to know about all
the changes. Sharing? Comparing? Must-haves. Reproducible experimentation?
You need to be able to automatically track environment variables, packages
installed, etc. Most importantly, you need to be able to offload to the cluster in
the same running environment with a button click. I am not going to spend
hours deciphering logs to find out that the wrong version of a package was
installed in our container. I am not going to spend days sorting through
containers to find "the one that works".

4. Lastly, IT work ("devops") on cluster management: monitoring your GPU
usage per task, scheduling experiments, early stopping with a button click,
an on-prem managed platform. WHY IS THIS OUR JOB? Well, it isn't. But if it
is, it should be integrated with your platform, day-to-day operations
should be "automagical", and cluster config should be done once, by professionals
(even outsourced help).
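The reproducibility tracking in point 3 is mostly mechanical. A minimal sketch using only the standard library (the tracked variable list and schema are illustrative assumptions; a real platform records much more, automatically):

```python
import os
import platform
import sys

try:
    from importlib.metadata import distributions  # Python 3.8+
except ImportError:
    distributions = None

def snapshot_environment(tracked_vars=("CUDA_VISIBLE_DEVICES", "PYTHONPATH")):
    """Capture what is needed to re-run an experiment: interpreter,
    platform, selected environment variables, installed packages."""
    packages = {}
    if distributions is not None:
        packages = {d.metadata["Name"]: d.version for d in distributions()}
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "env": {k: os.environ.get(k) for k in tracked_vars},
        "packages": packages,
    }

snap = snapshot_environment()
```

Attach a snapshot like this to every run and "which package version was in the container?" stops being an archaeology project.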

If you feel me here, then know that you are REALLY not alone. We took to heart
what our clients & friends told us, and we launched Allegro Trains as a
solution for all of this. Magically simple, and FREE.

Sorry for caps, I tend to be emotional on this ;) Hit me up on twitter
@LSTMeow

~~~
viig99
This is super cool!

------
Jugurtha
Like anyone working in that space who wants to keep their sanity, we're
building our own machine learning platform[0]. We shipped _many_ projects, and it
can be taxing to work on different stacks, especially given that we
build for enterprises and they always want _complete_ solutions. The model is
but a foot in the door; you must do custom front-end/back-end/model/pipe/data
acquisition work.

We decided to build tooling around that workflow. We shipped and were paid
with that workflow, so we wanted to make it efficient and effective. Other
solutions didn't fit our needs. Straight out of
[https://xkcd.com/927/](https://xkcd.com/927/), except it needs to address
_our_ use cases. We backtest with our past projects and use it for our current
ones to scale our consulting capabilities.

> _\- standardizing end to end tooling for special resources, eg queueing and
> batching to keep utilization high for production GPU systems, high RAM use
> cases like approximate nearest neighbor indexes, and just run of the mill
> stuff like how to take a trained model and deploy it behind a microservice
> in a way that bakes in logging, tracing, alerting, and more._

We do schedule notebooks[1]. We also publish AppBooks[2], which are
automatically parametrized notebooks: the platform automatically generates a
form so anyone can set variables the notebook author chose to expose and run a
notebook without changing the code. Extremely useful when you want to have a
domain expert tweak a domain specific variable, without them having to know
what a notebook is. In some projects, there's someone with deep, deep
expertise in a field for whom a variable is really important, but that
variable gets dismissed by the ML practitioner because they didn't see a
correlation or an impact on AUC or something, so the domain expert has an
input, whether on relevant variables, or the _real_ world metrics we're
working for. We also added instrumentation for the basic CPU/RAM/GPU, data,
servers running, etc. Again, so that our teammates don't bother with this. We
use different Docker images for notebook servers with some that are 30GB so
members don't bother with dependencies, GPU/tensorflow/cuda and version
conflicts.

We automatically track metrics and parameters without the notebook's author
writing boilerplate code, because they forget. The models are saved, and the
ML practitioner can click a button to deploy a model[3]. We put a lot of
emphasis on self-service because its absence put a lot of stress on us: an ML
practitioner wants to deploy a model and asks someone on the team who's probably
busy. So we said: anyone should be able to click and deploy. This is also useful
because a developer downstream only needs to send HTTP requests to interact
with the model. We used to have application developers who also needed to know
more than they should have about the internals/dependencies of the models. Not
anymore.
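That downstream experience can be sketched in a few lines; the endpoint URL and JSON schema below are made up for illustration, but the point stands that consumers need nothing beyond HTTP:

```python
import json
from urllib import request as urlreq

def build_predict_request(endpoint, features):
    """Build the HTTP call a downstream developer sends to a deployed model."""
    body = json.dumps({"features": features}).encode()
    return urlreq.Request(endpoint, data=body,
                          headers={"Content-Type": "application/json"})

# Hypothetical internal endpoint for a deployed churn model.
req = build_predict_request("http://models.internal/churn/v1", [0.2, 0.8])
# urlreq.urlopen(req) would POST it and return the model's JSON response.
```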

We also added near real-time collaboration/editing so many people can work on
the same notebook, especially useful when a team member is struggling to
implement something, and others chime in to help debug/refactor, and
review[4]. Everyone sees everyone's cursor for better awareness of what's
being done. A use case is an ML practitioner working through a paper who's
struggling with the algorithms or data structures part of the paper. They can
solicit another team member, who'll chime in and help. One of the features
that's useful is multiple checkpoints[5], which allows one to revert to an
arbitrary checkpoint, not just the last one. This, again, is porcelain, because
an ML practitioner playing with git is a context switch, and they don't really
like it.

So, we've done a few things to make our life easier. The workflow isn't
perfect, and the tools aren't perfect, but we're removing friction.

We add applications for the usual timeseries forecasting, sentiment analysis,
anomaly detection, churn, to leverage projects we did in the past.

[0]: [https://iko.ai](https://iko.ai)

[1]: [https://iko.ai/docs/notebook/#long-running-notebooks](https://iko.ai/docs/notebook/#long-running-notebooks)

[2]: [https://iko.ai/docs/appbook/](https://iko.ai/docs/appbook/)

[3]: [https://iko.ai/docs/appbook/#deploying-a-model](https://iko.ai/docs/appbook/#deploying-a-model)

[4]: [https://iko.ai/docs/notebook/#collaboration](https://iko.ai/docs/notebook/#collaboration)

[5]: [https://iko.ai/docs/notebook/#multiple-checkpoints](https://iko.ai/docs/notebook/#multiple-checkpoints)

~~~
D2187645
Is appbook a long running notebook?

~~~
Jugurtha
Are you a colleague? We're working on that. We preferred to make the notebooks
asynchronous first to ship the feature, since that's the "natural habitat"
where users spend the most time, and then to add that functionality to
AppBooks.

Great question!

------
akx
I'm the CTO at Valohai (we almost got into YC some years back!). We solve many
of these issues to let data scientists focus on the interesting bits. See
[https://valohai.com](https://valohai.com) :)

