
Ask HN: What's Your CI/CD Workflow for Your Machine Learning Projects? - swyea
We are working on an ML project (Computer Vision specifically) and we are in the process of productionizing our models. What kind of tools do you use and what&#x27;s your workflow for integrating, testing and deploying your models? Do you have any suggestions or tips about what to do and what to avoid?
======
volker48
Having put many models into production in an almost real time environment (ad
servers that need predictions in < 10 ms) I would say a large portion of what
you need depends on your production requirements. The system I work on is very
high volume (100k requests/sec) and we have massive amounts of data, which
complicates everything.

First, I would highly recommend wrapping your ML models in some kind of
microservice. Depending on your production requirements and if the ML is in
Python a fairly simple Flask/Sanic web server should be sufficient. This is
great because you can leave all your feature transformation code as is in
Python.
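To make the microservice idea concrete, here's a minimal sketch using only the Python standard library (a real service would more likely use Flask/Sanic as described; the `predict` function and its weights are made-up stand-ins for a loaded sklearn model):

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical stand-in for a trained model; in practice you would
# load a pickled sklearn estimator here.
WEIGHTS = [0.5, -0.25]
BIAS = 0.1

def predict(features):
    return BIAS + sum(w * x for w, x in zip(WEIGHTS, features))

class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON payload, run inference, return JSON.
        length = int(self.headers["Content-Length"])
        payload = json.loads(self.rfile.read(length))
        body = json.dumps({"prediction": predict(payload["features"])}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the sketch quiet

def serve(port=8080):
    HTTPServer(("", port), PredictHandler).serve_forever()
```

The point is the boundary: feature transformation and inference stay behind one HTTP endpoint, so the rest of the stack never touches Python internals.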

If your production environment has very low latency requirements, you will
have your work cut out for you. You'll most likely have to rewrite all your
transformation code in a faster language like Go or Java. You might also need
to implement the inference code as well to get the speed you need. This adds
considerable time and a ton of surface area for potentially insidious bugs.
The ML will still make predictions, but they will be wrong or very slightly
wrong.

Because I'm working with large amounts of data and my source of truth is
Parquet logs in S3, the pipelines start with Spark. We do as much data
wrangling as possible in Spark to get things into a manageable size to create
our train/dev/test sets. This data gets uploaded to S3.

The datasets are then trained on EC2 instances using Pandas & sklearn. When
everything is fully automated the Spark job will push a message onto an SQS
queue with the S3 path of the fresh dataset. An EC2 instance will be polling
that queue and pull down the data and train a new model.

The final result of training in my case is a text or binary model file that goes
back up to S3. Our prediction microservice polls an S3 bucket and pulls down
any updated model files and swaps out the running models.
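The polling-and-swap step can be sketched like this; a local file path stands in for the S3 bucket (in production you would poll S3, e.g. with boto3, and compare ETags or LastModified instead of mtimes):

```python
import os
import pickle

class ModelReloader:
    """Hot-swap sketch: reload the running model only when the file changes."""

    def __init__(self, path):
        self.path = path
        self.mtime = None
        self.model = None

    def maybe_reload(self):
        # Swap in a fresh model only if the file changed since last check.
        mtime = os.path.getmtime(self.path)
        if mtime != self.mtime:
            with open(self.path, "rb") as f:
                self.model = pickle.load(f)
            self.mtime = mtime
        return self.model
```

The key property is that the service keeps serving the old model until the new file is fully downloaded and parsed, so a bad upload never leaves you with no model at all.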

Tips:

    
    
      1. Instrument everything! Hopefully you have something like graphite/datadog/prometheus in place already, but you'll want metrics on your predictions.
      2. Exception tracking on everything especially anything in your model creation pipeline. Sentry or something like that.
      3. Try and keep everything as simple as possible.
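To make tip 1 concrete, here's a toy stand-in for the kind of counters and timers a graphite/datadog/prometheus client gives you (the class and metric names are made up for illustration):

```python
import time
from collections import Counter

class Metrics:
    """Toy metrics client: count predictions and record per-call latency."""

    def __init__(self):
        self.counts = Counter()
        self.timings = []  # (metric name, seconds) pairs

    def timed(self, name, fn, *args):
        # Wrap any call so prediction volume and slowdowns show up
        # on a dashboard.
        start = time.perf_counter()
        result = fn(*args)
        self.timings.append((name, time.perf_counter() - start))
        self.counts[name] += 1
        return result
```

In practice you would emit these to your metrics backend instead of storing them in memory; the wrapping pattern is the same.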

~~~
claytonjy
Thanks for this; I really appreciate the detail here. There seems to be a lack
of these kinds of explanations around.

One piece did surprise me a bit:

    
    
      You might also need to implement the inference code as well to get the speed you need
    

I've never had the super-low-latency requirements you have, but as you point
out this seems amazingly error-prone. I'd love to hear anything else you can
share about the cost-benefit analysis you do before deciding to go this route,
and if there's any tools or languages you've had a better time with than
others for this.

~~~
volker48
Yeah, the requirements are pretty different than most Data Science teams,
especially the very low latency requirements.

The constraints force us to use simple models like linear regression and
logistic regression, at least some of the time, or at least as a version 1.
The inference here is straightforward: multiply and add, then take the sigmoid
if doing logistic regression.
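That "multiply and add, then sigmoid" step is short enough to show in full; a Python sketch with made-up weights (the production version would be the Go/Java rewrite):

```python
import math

def predict_logistic(weights, bias, features):
    # Multiply and add...
    z = bias + sum(w * x for w, x in zip(weights, features))
    # ...then take the sigmoid for logistic regression. For plain
    # linear regression you would return z directly.
    return 1.0 / (1.0 + math.exp(-z))
```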

What we tried to do initially was integrate with C/C++ APIs where possible. We
ran into some issues with speed and bugs doing this though, which is why we
wrote the inference ourselves. Where we had issues was calling the XGBoost C
API from Go. It was extra overhead and too slow. In our benchmarks our
implementation in pure Go was many times faster than calling the C API. We
also found the multithreaded version to be slower than the single-threaded
one, both when calling XGBoost from Java and from Go, and in our own inference
implementation: it was always faster to walk the trees in a single goroutine
than to create some number of worker goroutines to walk the trees in parallel.

We were very careful implementing the inference ourselves to make sure the
predictions matched. What we did to verify this was create a few toy datasets
of about 100 rows with sklearn's make_classification function. We then trained
a model using the reference implementation, saved the predictions and the
model. We then loaded this model into our implementation and made predictions
on the same dataset. We wrote unit tests to compare the predictions and make
sure they are the same within some delta. We were able to get our
implementation to be within 1e-7 of the reference implementation, in this
specific case XGBoost. It was actually more time consuming to deal with
parsing the inconsistent JSON model output of XGBoost than it was to implement
the GBDT inference algorithm. We also had to make a slight change to the
XGBoost code to write out floats to 18 decimal places when writing out the
JSON model in order to get the two implementations to match.
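The verification step described above, sketched as a unit-test helper (the 1e-7 delta comes from the comment; the helper itself is illustrative):

```python
import math

def assert_predictions_match(reference, ours, delta=1e-7):
    """Check a reimplementation against the reference, row by row."""
    assert len(reference) == len(ours), "prediction counts differ"
    for i, (a, b) in enumerate(zip(reference, ours)):
        assert math.isclose(a, b, abs_tol=delta), f"row {i}: {a} vs {b}"
```

Saving both the model file and the reference predictions for a small fixed dataset means this check can run on every CI build, catching drift between the two implementations before deployment.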

~~~
claytonjy
This is all SO fascinating to me. Multiple threads slowing stuff down & 18
decimal places being relevant stick out as surprising.

Part of me is thankful to not have these problems, while another part thinks
it'd be a lot of fun to do this kind of last-mile engineering.

~~~
volker48
Yeah, I had a very hard time believing that the multithreaded approach would
be slower. It's so counterintuitive, since at first blush it seems that walking
N trees is an embarrassingly parallel problem. I tested up to 1000 trees and
single threaded was still faster. I'm sure at some point the multithreaded
approach will win out, but it's beyond the number of trees and max depth we are
using.

~~~
claytonjy
I've half-convinced myself it's because we're talking about GBM's and not
Random Forests (where my mind goes first). One of the smart things about
XGBoost is parallelizing training by multithreading the variable selection at
each node, but that doesn't apply to inference; I imagine you gotta predict
trees sequentially since each takes the previous output as an input? Now I
wonder what those extra threads were even doing...

~~~
volker48
Each tree can be traversed independently. Each tree is traversed until a leaf
node is reached. Those leaf values are summed across all the trees, and the
sum of all the leaves plus a base score is the raw prediction. In the case of
binary classification that sum is passed through a sigmoid function; for
regression the sum is returned directly.
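A toy sketch of that procedure, with trees as nested dicts standing in for parsed XGBoost JSON (the field names are made up):

```python
import math

def walk(node, x):
    # Traverse one tree until a leaf node is reached.
    while "leaf" not in node:
        node = node["left"] if x[node["feat"]] < node["thresh"] else node["right"]
    return node["leaf"]

def predict_gbdt(trees, x, base_score=0.0, binary=False):
    # Sum the leaf values of every tree plus the base score...
    total = base_score + sum(walk(tree, x) for tree in trees)
    # ...then sigmoid for binary classification, raw sum for regression.
    return 1.0 / (1.0 + math.exp(-total)) if binary else total
```

Since each `walk` call is independent, the trees *could* be traversed in parallel; the discussion above is about why that turned out slower in practice for realistic tree counts.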

------
jeffreysmith
Shameless plug: I cover this topic in my book, Reactive Machine Learning
Systems: [https://www.manning.com/books/reactive-machine-learning-
syst...](https://www.manning.com/books/reactive-machine-learning-systems) In
particular, chapters 7 and 8 cover the design of model servers and model
microservices, from publishing through to serving. And chapter 9 focuses on
how to build CI/CD pipelines for ML systems.

Definitely lots more options than I even had space to cover in a single book,
though. This is an area where the tooling continues to rapidly improve.

------
sevmardi
I read this [1] a while back. Have a look; it might help out or give you some
ideas.

[1] [https://engineering.semantics3.com/moving-machine-
learning-f...](https://engineering.semantics3.com/moving-machine-learning-
from-practice-to-production-9c462eeef9fa)

------
shabazp
I have deployed ML algorithms into production, including computer vision and
data science models. From my experience, the right approach depends on the
application and the cost we are comfortable with.

1\. Keras/TensorFlow-based algorithms (applicable to any compute-intensive or
GPU-capable algorithm): deploy the model (as a Flask service) inside a Docker
container along with a queueing system (for reliability, with Redis). We can
then decide the server type and the kind of orchestration tool to use to
manage these containers. Some options:

    
    
        a. ECS on AWS
        b. Kubernetes 
        c. Docker Swarm
        d. custom orchestration tool
    

2\. If your ML model is simpler, with a small enough model size, then using
Lambdas on AWS would also work. This can provide high throughput and low cost
per request if your computation time isn't very high.
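A minimal sketch of what such a Lambda handler might look like, assuming the API Gateway proxy event shape (the tiny linear model is a placeholder):

```python
import json

# Placeholder model small enough to ship inside a Lambda package.
WEIGHTS = [0.4, 0.6]
BIAS = -0.2

def handler(event, context):
    # API Gateway proxy integration delivers the request body as a string.
    features = json.loads(event["body"])["features"]
    score = BIAS + sum(w * x for w, x in zip(WEIGHTS, features))
    return {"statusCode": 200, "body": json.dumps({"score": score})}
```

Keeping the handler a plain function like this also makes it trivially unit-testable: you call it with a fake event, no AWS required.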

Tips:

    
    
       i. Flush memory in the code after the service is used so that there are no memory leaks.
       ii. You can use tools such as htop to understand memory usage.
       iii. For system performance, you can use Prometheus to gather stats, along with a Grafana dashboard to view them.
    

I consider CI/CD essential for achieving a seamless workflow for a data
scientist, and after having faced the same problems ourselves, our team and I
have been working on Datmo, a tool to help companies more easily and cost-
effectively deploy + manage their models in production.

~~~
shabazp
3\. Another important component for CI/CD is the integration of tools like
Airflow in order to schedule and monitor workflows. This helps us in deploying
newly trained models.

------
asampat
We faced the issue of building CV models a lot as grad students but at that
time reliability wasn't really something we had to solve for. Once we had to
implement them for industrial applications we found we had to ensure there was
reproducibility and versioning throughout the process. Now that we have put a
few computer vision algorithms into production we decided to architect our
code in the following way.

1) Training code is written with standard packages (we tend to prefer Keras
with a TensorFlow backend); models are trained on our own GPUs, since this is
often the cheapest option and training is done in bulk. Side note: cloud
TPUs/GPUs may one day be better, but they are certainly too expensive currently.

2) Prediction/inference code is written with the same packages, but is tied to
the final weights files that we get from training the model.

3) Deployment code: in order to enable a reliable system, we use Celery for
distributed queueing, converting our prediction functions into Celery tasks
which can then be passed to workers that process and return the result to an
API endpoint. This allows us to scale our workloads as needed with throughput
(depending on request requirements).
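The Celery pattern in miniature, with a stdlib queue and thread standing in for the broker and workers (real code would use Celery tasks backed by Redis/RabbitMQ; `predict` here is a placeholder):

```python
import queue
import threading

def predict(payload):
    return payload * 2  # placeholder for real model inference

def worker(tasks, results):
    # Pull task payloads off the queue, run inference, post results back.
    while True:
        item = tasks.get()
        if item is None:  # poison pill shuts the worker down
            break
        task_id, payload = item
        results.put((task_id, predict(payload)))

tasks, results = queue.Queue(), queue.Queue()
worker_thread = threading.Thread(target=worker, args=(tasks, results))
worker_thread.start()
```

The design win is the same either way: the API endpoint only enqueues and collects, so you scale throughput by adding workers rather than touching the prediction code.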

This architecture allows us to test during training time using our training
code and validation sets, while also enabling testing of different model
versions through our prediction APIs. We often just write scripts for
testing that we then run via our CI/CD workflow.

Tip: keep your code as simple as possible and don't rewrite it unless your
throughput requirements mandate it. Good error handling will go a long way
here, and will likely be easier if written in a language you're most familiar
with.

------
claytonjy
I don't work on computer vision specifically, and can generally suffer a few
seconds of latency, but we have a two-stage process that I think applies
fairly generally:

1\. all lowish-level production code (importing, transforming, modeling) is
written in packages, which are thoroughly unit-tested with a CI system

2\. that code is wrapped up into docker container(s) in separate repositories,
which are built and integration-tested with CI. In addition to the Dockerfile
and any testing scripts, there's usually a single code file here which handles
I/O specifics, API endpoints, and primarily calls code from the package

This works well with R or Python and should work with others; we use Gitlab
for the free private repos and awesome built-in CI.

This doesn't cover where the data or models are stored, but that varies more
per-project for us. Lately we've been using Pachyderm and loving it, but you
can get pretty far with a postgres instance for data and storing trained model
objects in S3/GCS.

------
lee101
Hi, I'm the founder of [https://bitbank.nz](https://bitbank.nz), a cryptocurrency
live prediction dashboard/API/bulk data service.

Our system streams in market data from exchanges, creates forecasts with
Python/sklearn, and displays the data. We also have background processes that
update our accuracy over time once real data is available.

We test our code with the normal Python unit/integration/end-to-end testing
methods, locally, with local copies of all of our components. The exception is
Firebase (used for live UI updates); we don't have a dev version of that yet
and just use live forecast data when testing the UI charts/display, since it
would probably get expensive/cumbersome to set up dev environments with local
Firebases in them.

For deployment we simply ssh into machines, git pull the latest code, and
supervisorctl restart, so it's fairly low tech. The forecaster has a roughly
one-minute outage when we deploy new models, because there is a process that
computes and caches a data structure of historical features for use in the
forecaster.

To keep the online stream of data input and the feature
computation/prediction pipeline reliable, we run the code under supervisor, as
well as a manager process (also under supervisor) that checks whether
conditions are turning bad (OOM, no progress updates from the forecaster) and
restarts things if anything goes wrong.
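The manager's health check can be sketched as a heartbeat staleness test (class name and threshold are illustrative):

```python
import time

class Watchdog:
    """Restart-decision logic: the forecaster beats, the manager checks."""

    def __init__(self, max_staleness=60.0):
        self.max_staleness = max_staleness
        self.last_beat = time.monotonic()

    def beat(self):
        # Called by the forecaster whenever it makes progress.
        self.last_beat = time.monotonic()

    def is_stale(self):
        # The manager restarts the forecaster when this returns True.
        return time.monotonic() - self.last_beat > self.max_staleness
```

Using a monotonic clock matters here: wall-clock adjustments (NTP, DST) would otherwise trigger spurious restarts.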

For testing we also use the standard training/test data split when running
backtesting/machine learning optimisation algorithms to train parameters of
the algorithm. If things perform better on training and test data over a long
enough time period to build confidence then we will deploy a new model.
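For backtesting time-series data, the split should be chronological rather than shuffled, so the test set sits strictly in the model's "future". A small sketch (the `ts` field name is illustrative):

```python
def time_split(rows, test_fraction=0.2):
    # Sort by timestamp and cut once: everything after the cut is
    # "future" data the model never saw during training.
    rows = sorted(rows, key=lambda r: r["ts"])
    cut = round(len(rows) * (1 - test_fraction))
    return rows[:cut], rows[cut:]
```

A shuffled split would leak future information into training, which is exactly what inflates backtest accuracy relative to live performance.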

Using Graphite/Grafana to monitor prediction accuracy over time is a good
idea, as mentioned already :) along with some kind of alerting/monitoring for
when things go down.

------
larrydag
I'm a statistician by trade so I mostly do prototype work. As far as building
models goes, my key workflow uses R and RStudio. The biggest issue is data
management. I suggest a good API or wrapper for a data source that has all of
the ETL already done for the most part. R connects very well to most database
systems. RStudio makes development easier with connectivity to GitHub or other
popular version control systems.

As far as putting models into production, I'm not as familiar, but I hear that
a good Python workflow would probably work best.

~~~
vfulco
RStudio also mates well with Bitbucket, for those who want private repos for
free.

~~~
claytonjy
For private repos, I'd say gitlab is an order of magnitude (or two) better
than bitbucket. Or, it clearly was 2 years ago, and while I haven't kept up
with bitbucket, gitlab has improved by leaps and bounds in those two years.

The killer features for me are nested subgroups (which bitbucket may have, but
github does not) and a really awesome CI system with a generous free tier
(2000 minutes/month). For R packages, we have it set up very similarly to github
+ travis (devtools::check() on every push), and for deployable bits we have it
build containers and run integration tests on them. Super impressed with all
we get for free there.

------
agibsonccc
Disclaimer: I compete in this space and may compete with your internal team,
your cloud vendor, or something you are interested in.

FWIW: Production is an overloaded term. CI may not even be applicable here.
Say you're doing batch inference where you need to run jobs every 24 hours on
a large amount of data: that might be tied to some cron job.

That being said you _could_ use a CI system for that in theory.

There are also other factors here: what other kinds of things do you want to
track? Experiment results? Wrong results by your machine learning algorithm?

Concisely: What kind of deployment requirements do you have and what are your
goals?

If you are doing real time, what does "deployment" even mean? Are you serving
in real time via a rest api? Are you doing streaming? What are your throughput
requirements? What about latency? Is that even hooked up to a CI system?

Something that _is_ vaguely related: How do you test the accuracy of different
models across your cluster? Say you want to do a self deployment, what if you
want to tie that to say: a workspace where you produced the results?

Is that hooked up to a CI system? If so, what's your use case?

Then there's that common hand off from data scientist to production, what does
that look like? A sibling thread mentioned some of these things.

If anyone else is curious about this stuff, we deploy deep learning models in
locked down environments both on kubernetes as well as touching hadoop
clusters. Happy to answer questions.

------
MisterEd
Shameless plug (I'm head of product at Algorithmia): we've written a lot of
algorithms ourselves including many computer vision algos [1] that are
published publicly for use. The underlying platform at Algorithmia is able to
run production models at scale and puts a simple REST API in front of all of
the algorithms so developers can easily call them from their production apps.
The algorithms you see in the marketplace are publicly published but the
platform runs private algorithms for software teams as well. Check it out: [2]

[1]: [https://algorithmia.com/tags/computer-
vision](https://algorithmia.com/tags/computer-vision) [2]:
[https://algorithmia.com](https://algorithmia.com)

------
pilooch
We use DeepDetect everywhere. I started coding it up in 2015 out of immediate
need for my customers in production at the time, and waited for something else
to come out. But until now we've stuck with it, and customers ask for it when
they see us using the pre-configured pipelines. Some run their production with
the DD server. So three years later we're still putting in some time on
improving it, and support for Caffe2 is coming up, among other things. It
already has built-in semantic search due to customer demand. Any feedback and
comments are welcome, btw. Also, we're hiring; contact us, we're at
[http://jolibrain.com/](http://jolibrain.com/)

------
eggie5
I try to do as much of the transformations as possible in tensorflow using the
Datasets API so that I don't have to write them in another language/system in
production.

~~~
sseveran
I had tried to do the same. However, given the CPU/GPU imbalance on AWS GPU
instances, I have resorted to building a fully "rendered" training set and
doing all the transforms in Spark.

See:
[https://github.com/tensorflow/tensorflow/issues/13610](https://github.com/tensorflow/tensorflow/issues/13610)

------
mindhash
I've heard of datmo.io; they automate versioning and deployments. Check them
out if it suits you.

------
mikeyanderson
If you use Algorithmia.com you can add your model in the language of your
choice (on GPUs if you want) and it will do all of the DevOps and give you an
API endpoint. You get free credits at sign-up and quite a few each month for
testing.

~~~
JosephRedfern
Full Disclosure (usually considered courteous to provide this yourself):
@mikeyanderson is head of Marketing at Algorithmia.com.

It does look like a cool service, though!

