
Ask HN: What does your ML pipeline look like? - mallochio
Experienced machine learning professionals - How do you create scalable, deployable and reproducible data/ML pipelines at your work?
======
tomasdpinho
As a DevOps Engineer working for an ML-based company, having worked for
others in the past, these are my quick suggestions for production readiness.

DOs:

If you are doing any kind of soft-realtime (i.e. not batch) inference,
exposing a model behind a request-response lifecycle, use TensorFlow Serving
for concurrency reasons.

Version your models and track their training. Use something like MLflow for
that (a hedged sketch follows this list). Devise a versioning system that
makes sense for your organization.

If you are using Kubernetes in production, mount NFS in your containers to
serve models. Do not download anything (from S3, for instance) at container
start-up unless your models are small (<1 GB).

If you have to write heavy preprocessing or postprocessing steps, eventually
port them to a more efficient language than Python, such as Go or Rust.
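
A minimal sketch of the version-and-track point, assuming MLflow and scikit-learn are installed; the experiment name, parameters, and dataset here are just placeholders:

    import mlflow
    import mlflow.sklearn
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression

    # Hypothetical experiment name; without a tracking server, runs land in ./mlruns.
    mlflow.set_experiment("iris-demo")

    X, y = load_iris(return_X_y=True)

    with mlflow.start_run(run_name="logreg-v1"):
        model = LogisticRegression(max_iter=200).fit(X, y)
        # Record the knobs and a metric so runs are comparable later.
        mlflow.log_param("max_iter", 200)
        mlflow.log_metric("train_accuracy", model.score(X, y))
        # Store the fitted model itself as a versioned artifact.
        mlflow.sklearn.log_model(model, artifact_path="model")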

DO NOTs:

Do NOT make your ML engineers/researchers write anything above the model
stack. Don't make them write queue management logic, webservers, etc. That's
not their skillset, they will write poorer and less performant code. Bring in
a Backend Engineer EARLY.

Do NOT mix and match if you are working with an asynchronous model, i.e. don't
have a callback-based API and then a mix of queues and synchronous HTTP
calls. Use queues EVERYWHERE (a toy sketch follows this list).

DO NOT start new projects in Python 2.7. From past experience, some ML
engineers/researchers are quite attached to the older versions of Python.
These reach end of life in 2020, and it makes no sense to start a project
using them now.
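
A toy illustration of the queues-everywhere pattern; an in-process queue.Queue and a thread stand in for a real broker (RabbitMQ, SQS, etc.), and the scoring logic is a placeholder:

    import queue
    import threading

    request_q = queue.Queue()
    result_q = queue.Queue()

    def inference_worker():
        # The model consumes requests from one queue and publishes results to
        # another; no synchronous HTTP call ever blocks on the model.
        while True:
            job = request_q.get()
            if job is None:
                break
            result_q.put({"id": job["id"], "score": 0.5 * len(job["payload"])})
            request_q.task_done()

    threading.Thread(target=inference_worker, daemon=True).start()

    request_q.put({"id": 1, "payload": "some input text"})
    request_q.join()
    print(result_q.get())
    request_q.put(None)  # shut the worker down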

~~~
fcolas
+1, tomasdpinho. Yes to everything, and notably the queues everywhere,
versioning the models, and the issue of mixing sync and async (go for queues).

As a scientist designing risk management systems, I also like to:

. avoid moving the data;

. bring the (ML/stats) code to the data (a toy sketch, in Python, follows these lists);

. make in-memory computations (when possible) to reduce latency
(network+disk);

. work on live data instead of copies that drift out-of-date; and

. write software to keep models up to date, because they drift with time too,
and that's a major, operationally unnoticed, and extremely costly problem.

I'm not yet into Tensor/ML-Flow, but I use R, JS, and Postgres, thereby
relying on open-source ecosystems (and packages) that are:

. as standard as possible;

. well-maintained;

. with a long expected support; and

. with as few dependencies as possible.
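
A toy Python sketch of the "bring the code to the data" point: the aggregation runs inside Postgres and only the small summary crosses the network. The connection string and the positions table are hypothetical:

    import psycopg2

    # Hypothetical connection string and table; adjust to your environment.
    conn = psycopg2.connect("dbname=risk host=db.internal user=analyst")

    with conn, conn.cursor() as cur:
        # The heavy lifting (scan + aggregate) happens in the database, so only
        # one row per day comes back to the client.
        cur.execute("""
            SELECT date_trunc('day', observed_at) AS day,
                   avg(exposure)                  AS mean_exposure,
                   count(*)                       AS n
            FROM positions
            GROUP BY 1
            ORDER BY 1
        """)
        daily = cur.fetchall()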

~~~
tixocloud
+2 for bringing the (ML/stats) code to the data instead of the other way
around

~~~
thiggy
Could you speak to your experience with this particular list item?

~~~
tixocloud
We deal with fairly large volumes of data on a frequent basis so it would not
make sense for each data scientist to create a copy within their own
environment. Everyone works off a centralized data source and we provide them
with Jupyter/Spark in an internal cloud environment.

------
AndrewKemendo
Depends on what you're trying to do.

Are you putting a trained inference model into production as a product? Is it
an RL system (a completely different architecture from an inference system)? Are
you trying to build a model with your application data from scratch? Are you
doing NLP or CV?

As a rule of thumb I look at the event diagram of the application/systems
you're trying to implement ML into, which should tell you how to structure
your data workflows in line with the existing data flows of the application.
If it's strictly a constrained research effort then pipelines are less
important, so go with what's fast and easy to document/read.

Generally speaking, you want your ML/DS/DE systems to be loosely coupled to
your application data structures - with well-defined RPC standards informed by
the data team. I generally hate data pooling, but if we're talking about
pooling 15 microservices vs pooling 15 monoliths, then the microservices
pooling might be necessary.

Realistically this ends up being a business decision based on organizational
complexity.

~~~
mallochio
Thanks for the reply. Could you give some more insight into how and what tools
you choose for the different sort of tasks (say NLP vs CV vs RL)? Also, how
and why are different tools/pipelines better for production and product
building?

~~~
AndrewKemendo
How you parse and manage the inputs is significantly different between those
types.

With NLP, as one example, you need to determine when you are going to do
tokenization, i.e. break the inputs up into "tokens." Do you do this at
ingest, in transit, or at rest?

With CV you don't need to do tokenization at all (probably).

So the tools really come out of the use case and how/when you put them into
the production chain.
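
As a hedged illustration of the "when do you tokenize" question (the regex tokenizer and pipeline shape are made-up stand-ins, not anything from the comment above):

    import re

    def tokenize(text):
        # Deliberately simple stand-in for a real tokenizer (spaCy, SentencePiece, ...).
        return re.findall(r"[a-z0-9]+", text.lower())

    # Option A: tokenize at ingest and persist tokens alongside the raw text.
    def ingest(document, store):
        store.append({"raw": document, "tokens": tokenize(document)})

    # Option B: tokenize at inference time, so the stored data stays raw.
    def predict(document):
        return tokenize(document)  # the model would consume these tokens

    store = []
    ingest("The quick brown fox.", store)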

------
lostdog
Here are some things that have helped me make shipping models easier.

\- Tag and version your datasets, especially your test set, so that you are
confident when you compare performance across models or over time.

\- Test your training pipeline. Run it on a single batch or a tiny dataset.
The whole test should take less than a minute to confirm that your training
pipeline didn't break. Once one test works, people will write others (a rough
sketch follows this list).

\- You should be able to measure the accuracy of your model with a single
command, or as part of your CI process. This command should spit out a single
report (PDF or html) which is all you need to look at to decide whether to
ship a new model or not, including run-time performance (if needed).

\- Don't create hermetic training environments that prevent you from doing
debugging. Sometimes you just need to ssh in and put in some print statements
to track down a problem.
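
A rough sketch of the single-batch smoke test; train_one_epoch is a hypothetical stand-in for your own pipeline entry point (run with pytest):

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    def train_one_epoch(model, X, y):
        # Stand-in for the real training step; replace with your own pipeline call.
        model.partial_fit(X, y, classes=np.unique(y))
        return model

    def test_training_smoke():
        # A tiny synthetic batch: the point is only that training runs end to end
        # in well under a minute, not that the model is any good.
        rng = np.random.default_rng(0)
        X = rng.normal(size=(32, 4))
        y = rng.integers(0, 2, size=32)
        model = train_one_epoch(SGDClassifier(), X, y)
        assert model.predict(X).shape == (32,)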

------
bertil
The one I’m working with _now_ is very low tech: daily Python jobs processing
data from GCP and writing back to GCP; a handful of scripts that check
everything is reasonable. That’s because we serve internal results, mostly
read by humans.

The most impressive description that I’ve seen run live is described here:
[https://towardsdatascience.com/rendezvous-architecture-for-d...](https://towardsdatascience.com/rendezvous-architecture-for-data-science-in-production-79c4d48f12b)

I’d love to have feedback from more than Jan because I’m planning on
encouraging it internally.

The best structure that I’ve seen at scale (at a top 10 company) was:

\- a service that hosted all models, written in Python or R, stored as Java
objects (built with a public H2O library);

\- Data scientists could upload models by drag-and-drop on a bare-bones
internal page;

\- each model was versioned (data and training not separate) by name, using a
basic folders/namespace/default version increment approach;

\- all models were run using Kubernetes containers; each model used a standard
API call to serve individual inferences;

\- models could use other models’ output as input, taking the parent model’s
inputs as their own in the API;

\- to simplify that, most models were encouraged to use a session id or user
id as the single entry point, and most inputs were gathered from an
all-encompassing live storage connected to that model-serving structure;

\- every model had extensive monitoring of the distribution of inputs
(especially missing values), outputs, and response latency, to make sure they
matched expectations from the trained model (a bare-bones sketch appears at
the end of this comment);

e.g.: if you are running a fraud model and more than 10% of outputs in the
last hour were positive, warn the DS to check and consider calling security;

e.g.: if more than 5% of “previous page looked at” values are empty, there’s
probably a pipeline issue;

\- there were some issues with feature engineering: I feel like the solution
chosen was suboptimal because it created two data pipelines, one for live and
one for storage/training.

For that problem, I’d recommend that architecture instead:
[https://www.datasciencefestival.com/video/dsf-day-4-trainlin...](https://www.datasciencefestival.com/video/dsf-day-4-trainline-2019-dan-taylor-and-sam-taylor/)
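
A bare-bones sketch of the output-distribution check mentioned above (the 10% threshold, the alert wording, and the in-memory prediction log are all placeholders):

    from datetime import datetime, timedelta

    def check_positive_rate(predictions, now, threshold=0.10, window=timedelta(hours=1)):
        # predictions: (timestamp, label) pairs already logged by the serving layer.
        recent = [label for ts, label in predictions if now - ts <= window]
        if not recent:
            return "no recent traffic - probably a pipeline issue"
        rate = sum(recent) / len(recent)
        if rate > threshold:
            return "positive rate {:.1%} over the last hour - page the DS on call".format(rate)
        return "ok"

    # Toy usage: two fraud-positive predictions out of three in the last hour.
    now = datetime.utcnow()
    log = [(now - timedelta(minutes=m), label) for m, label in [(5, 1), (20, 1), (40, 0)]]
    print(check_positive_rate(log, now))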

~~~
georgeburdell
What does "model" mean? What kind of data are contained in a machine learning
model? Second, how do you decide a model is robust? I'm asking because I'm
looking at using ML to more efficiently use some quality assurance tools for a
product line. The idea is to develop a model such that product A, B, C can
have existing (or supplementary data) QA data plugged into a model, and then
an appropriate sampling plan can be output.

An intern showed proof of concept of such a model based on one product, and
it's fantastic work that could save thousands of dollars, but we're struggling
with how to "qualify" it. How do we know we won't get a "garbage in/garbage
out" situation?

~~~
joshvm
So you want to figure out how often you need to sample products for QA?

A model is two things: a description of what's in the black box (could be a
linear model, a neural network architecture, etc) and some weights which
uniquely define "that specific model". Each model will have some known input
(eg image, tabular data) and output (eg number, image, list etc).

You need to store both the structure and weights: for example your model is y
= mx + c, but you need to know m and c to uniquely define it.
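
To make that concrete, a toy sketch (nothing here is from the comment itself): the "structure" is the predict function below, and the "weights" are the two numbers you have to persist to reproduce its predictions.

    import json

    # Structure: y = m * x + c. Weights: the specific m and c learned from data.
    weights = {"m": 2.5, "c": 0.3}

    def predict(x, w):
        return w["m"] * x + w["c"]

    # Persist the weights; the structure lives in version control alongside them.
    with open("linear_model_v1.json", "w") as f:
        json.dump(weights, f)

    with open("linear_model_v1.json") as f:
        restored = json.load(f)

    assert predict(4.0, restored) == predict(4.0, weights)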

To answer your second question: robustness means a smart test strategy. Train
on representative data, validate during training on a second dataset, and test
on hold-out data that the model has never seen.

Unfortunately it's quite hard to prove model robustness (in the case of deep
learning, anyway); you have to assume that you trained on enough realistic
data.

If you really have no idea about robustness, then you should probably do a
kind of soft-launch. Run your model in production alongside what you currently
use, and see whether the output makes sense.

You could try, for example, sampling with your current strategy as well as the
schedule defined by your ML model (so you lose nothing but a bit of time if
the ML system is crap). Then compare the two datasets and see whether the ML
model is at least performing the same as your current method.

Surely you can make some naive estimates of robustness though? eg if the model
says sample 5% of your product, you then have a bound on the chance that you
miss something.

------
tixocloud
We're building a deployment system based on the following principles that
we've learned:

\- Version control of models and datasets

\- Code, naming conventions and formatting consistency

\- Testing of models before deployment into production (in most cases, this
helps gain credibility across engineering)

\- Model review process with a senior data scientist

\- Kubernetes/Docker for deployment serving as an API or running a scheduled
job

\- Model performance monitoring when in production to identify degradation

~~~
mallochio
Could you also give us some details about the software you specifically use in
the pipeline, other than Kubernetes/Docker? Do you use any available (possibly
open-source) tools for versioning and monitoring? Or are you building this
all up from scratch for your needs?

~~~
tixocloud
We use Jenkins for our builds (and soon simple test cases), which then feeds
into our system that's built from scratch. However, we are looking to
productize the system, as we've had some discussions with other corporate
partners who are interested as well.

The system we've built in-house is fairly simple to keep development fast -
versioning is manual ([https://packaging.python.org/guides/hosting-your-own-index/](https://packaging.python.org/guides/hosting-your-own-index/))
but there's no reason why you couldn't use a repository manager.

For monitoring, we essentially track any activity related to the model,
including inputs, outputs, timestamps, duration, etc., to a database and have
JavaScript charts render it. We might put this into Kafka, but that seems
overkill at the moment and would likely force us to hire an actual support
team.
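
For a flavour of what that tracking can look like at its simplest (the table schema is made up, and sqlite stands in for whatever database you actually use):

    import json
    import sqlite3
    import time

    conn = sqlite3.connect("model_monitoring.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS predictions (
            ts REAL, model_version TEXT, inputs TEXT, output REAL, duration_ms REAL
        )
    """)

    def log_prediction(model_version, inputs, output, duration_ms):
        # One row per scoring call; the charts are rendered off this table later.
        conn.execute(
            "INSERT INTO predictions VALUES (?, ?, ?, ?, ?)",
            (time.time(), model_version, json.dumps(inputs), output, duration_ms),
        )
        conn.commit()

    log_prediction("churn-v3", {"tenure": 14, "plan": "basic"}, 0.72, 35.0)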

------
jcims
Would be interested if anyone in here has a pipeline operating on regulated
data (HIPAA, financial, etc). Having a hard time drawing boundaries around
what the data science team has access to for development and experimentation
vs. where production pipelines would operate. (e.g. where in the process do
people/processes get access to live data)

~~~
reureu
I do machine learning work in healthcare, and work for a HIPAA covered entity.
The issue of permissions and data access often gets applied in an
unnecessarily strict fashion to data scientists in these environments, often
due to a lack of understanding from engineering managers who have been brought
in from a non-regulated environment (e.g., hiring a salesforce engineering
manager into a healthcare system so they can "disrupt" or "solve" healthcare
-- they hear "regulation" and immediately clamp down on everything).

HIPAA allows use of clinical data for treatment, payment, and operations. You
can also get around consent issues if the data is properly deidentified. If
you have a data scientist who is working to further treatment, payment, or
operations (i.e., isn't working on purely marketing uses, selling the data, or
doing "real" research), then they are allowed to use the data, assuming it's
the minimum necessary for their job. For training machine learning models that
support operations, "minimum necessary" is probably a lot of data. And,
obviously, the production pipelines and training/experimentation/development
would need access to the same amount of data if you want to train and deploy
models.

Data scientists are also likely to be the first to notice problems with how
your product is working, often before the data engineering team. At my
company, I've found numerous bugs in our data engineering pipelines and
production code because I've seen anomalies in the data and went digging
through the data warehouse, replicas of the production databases, and within
our actual product. You probably want to support and encourage that kind of
sleuthing - but each organization is different, so maybe you have better QA
that's more attuned to data issues.

My opinion, from having done this for over a decade, is that the question
shouldn't be about how much access you give your data scientists. They should
have access to nearly all of the data that's within their domain, assuming
they're legally entitled to it. The question you should be solving for is what
restrictions should be placed on how they access and process that data: e.g.,
have EC2 instances and centralized jupyter notebooks available for them to
download and process data, and prohibit storing data on a laptop.

~~~
SkyPuncher
I'd just like to point out that "the minimum necessary for their job" is the
reason many engineering managers apply unnecessarily strict rules.

It's very difficult to build rules and policies that allow broad access while
maintaining minimum necessary. A project may be completely justified in
accessing "all" (waves hands) data at its conception but slowly morph to
focus on only a few key identifiers while still processing "all" data.

~~~
reureu
Totally, and to extend that further, one employee may work on multiple
projects for which the "minimum necessary" is different. Part of my job
involves patient matching and reporting patient-level data to partners we have
a BAA with. That means I need access to patient names and addresses. However,
if I'm working on training some ML model to predict diabetic progression, it's
not necessary for me to pull the names of patients.

I think there's an incorrect assumption in here that there exists a technical
solution which entirely solves this problem; that we just need to figure out
what the right set of rules are, or get the right column-level and row-level
security policies in place and we're all set. It's necessary but not
sufficient to have those kinds of safeguards in place. You also need to trust
_somebody_ in the organization, and you need to give those somebodies training
and support to do the right thing.

In my case, I need access to all of the (clinical) data within the
organization. I don't really care how that end is achieved: with one account
that has every permission, with multiple accounts that are used for different
purposes, or whatever. Ultimately, it's in the interest of the organization to
make sure that I have the access I need to successfully do my job.

------
kmax12
I see this as a very timely question. As ML has proliferated, so has the
number of ways to construct machine learning pipelines. This means that from
one project to another the tools/libraries change, the codebases start looking
very different, and each project gets its own unique deployment and evaluation
process.

In my experience, this causes several problems that hinder the expansion of ML
within organizations:

1\. When a ML project inevitably changes hands, there is a strong desire from
the new people working on it to tear things down and start over

2\. It's hard to leverage and build upon success from previous projects, e.g.
"my teammate built a model that worked really well for predicting X, but I
can't use any of that code to predict Y"

3\. Data science managers face challenges tracking progress and managing
multiple data science projects simultaneously.

4\. People and teams new to machine learning have a hard time charting a
single path to success.

While coming up with a single way to build machine learning pipelines may
never be possible, consolidation in the approaches would go a long way.

We've already seen that happen in the modeling-algorithms part of the pipeline
with libraries like scikit-learn. It doesn't matter which machine learning
algorithm you use; the code will be fit/transform/predict.
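
That consolidation fits in a few lines: swapping the algorithm changes one import, not the surrounding pipeline code (this just uses scikit-learn's built-in iris sample):

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression

    X, y = load_iris(return_X_y=True)

    # Two very different algorithms, one identical interface.
    for model in (LogisticRegression(max_iter=200), RandomForestClassifier()):
        model.fit(X, y)
        print(type(model).__name__, model.predict(X[:3]))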

Personally, I've noticed this problem of multiple approaches and ad-hoc
solutions to be most prevalent in the feature engineering step of the process.
That is one of the reasons I work on an open source library called
Featuretools
([http://github.com/featuretools/featuretools/](http://github.com/featuretools/featuretools/))
that aims to use automation to create more reusable abstractions for feature
engineering.

If you look at our demos
([https://www.featuretools.com/demos](https://www.featuretools.com/demos)), we
cover over a dozen different use cases, but the code you use to construct
features stays the same. This means it is easier to reuse previous work and
reproduce results in development and production environments.

Ultimately, building successful machine learning pipelines is not about having
an approach for the problem you are working on today, but something that
generalizes across all the problems you and your team will work on in the
future.

------
jamesblonde
At Logical Clocks, we build a horizontally scalable ML pipeline framework on
the open-source Hopsworks platform, based around its feature store and Airflow
for orchestration:

* [https://hopsworks.readthedocs.io/en/latest/hopsml/hopsML.htm...](https://hopsworks.readthedocs.io/en/latest/hopsml/hopsML.html)

* [https://www.logicalclocks.com/feature-store/](https://www.logicalclocks.com/feature-store/)

The choice for the DataPrep stage basically comes down to Spark or Apache
Beam, and we currently support Spark, but plan to soon add support for Beam,
because of some of the goodies in TFX (TensorFlow Extended).

For distributed hyperparam opt and distributed training, we leverage Apache
Spark and our own version of YARN that supports GPUs -

* [https://www.youtube.com/watch?v=tx6HyoUYGL0](https://www.youtube.com/watch?v=tx6HyoUYGL0)

For model serving, we support Kubernetes:

* [https://hopsworks.readthedocs.io/en/0.9/hopsml/hopsML.html#s...](https://hopsworks.readthedocs.io/en/0.9/hopsml/hopsML.html#serving)

Our platform supports TLS/SSL certificates everywhere and is open-source.
Download it and try it; it runs in several large enterprises in Europe. We
have a cluster with >1000 users in Sweden here:

* [https://www.hops.site](https://www.hops.site)

(Edited for formatting)

~~~
mallochio
This is great, thanks for the link. Could you expand on how this workflow
would be different/better than sticking to just something like TFX and
TensorFlow Serving? Is it easier to use or more scalable?

~~~
jamesblonde
It is pretty much the same as TFX - but with Spark for both DataPrep and
Distributed HyperparamOpt/Training, and a Feature Store. Model serving is
slightly more sophisticated than just TensorFlow Serving on Kubernetes. We
support serving requests through the Hopsworks REST API to
TF Serving/Kubernetes. This gives us both access control (clients have a TLS
cert to authenticate and authorize themselves) and we log all predictions to a
Kafka topic. We are adding support to enrich feature vectors using the Feature
Store in the serving API, not quite there yet.

We intend to support TFX as we already support Flink. Flink/Beam for Python 3
is needed for TFX, but it's not quite there yet, almost.

It will be interesting to see which one of Spark or Beam will become the
horizontally scalable platform of choice for TensorFlow. (PyTorch people don't
seem as interested, as they mostly come from a background of not wanting
complexity).

------
kwooding
Here’s a framework we’ve been developing for this purpose, delivered as a
python cookiecutter:

[https://github.com/hackalog/cookiecutter-easydata](https://github.com/hackalog/cookiecutter-easydata)

The framework makes use of conda environments, Python 3.6+, makefiles, and
Jupyter notebooks, so check it out if that tooling fits into your data/ML
workflow.

We gave a tutorial on the framework at Pydata NYC, and it’s still very much in
active development - we refine it with every new project we work on. The
tutorial is available from:

[https://github.com/hackalog/bus_number](https://github.com/hackalog/bus_number)

While it works well for our projects, the more real-world projects we throw at
it, the better the tooling gets, so we’d love to hear from anyone who wants to
give it a shot.

~~~
mallochio
+1 for mentioning this. Took it out for a test run. Really neat stuff.

------
pilooch
We do everything with
[https://www.deepdetect.com/](https://www.deepdetect.com/). We're a small team
with targeted applications, so this may not apply to all needs. The time from
dev to prod is very low, though.

------
Liveanimalcams
I'm not a professional, but I built a pipeline for Makers Part List. It
involves ingesting a video URL, converting the video into images, then storing
the images in Google Storage. Once stored, I trigger the model to classify the
images. The images are then displayed to annotators who verify/relabel them.
Once I get enough new images, the system creates a .csv and uploads it to
Google's AutoML, where it retrains my model.

My bottleneck now is splitting the videos into images, as it's a very
CPU-intensive process. Implementing a queue here is my best choice, I think.

~~~
mallochio
Interesting. Could you maybe expand on the tools that you utilize inside the
pipeline for ETL, model creation, annotation and testing?

~~~
Liveanimalcams
I'm running the front and backend of the consumer site on Heroku. The meat of
the pipeline is hosted on a DigitalOcean High CPU Droplet. I use ffmpeg to
extract images from the provided videos. I store everything in Google Cloud
Storage and create references to each photo in Firestore. I use Firebase to
power the image verifying/labeling app I built. It's a simple app that
presents the viewer with the image and the label that it was given. If it's
not correct, they enter the correct label. I use a cloud function to move the
images into an exportable format for AutoML once a new-image threshold has
been hit. Testing is me using it and seeing if it is correctly identifying the
objects.
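
Roughly what the extract-and-store step can look like (the paths, bucket name, and frame rate are hypothetical, ffmpeg must be on the PATH, and google-cloud-storage needs credentials configured):

    import subprocess
    from pathlib import Path

    from google.cloud import storage

    def extract_frames(video_path, out_dir, fps=1):
        Path(out_dir).mkdir(parents=True, exist_ok=True)
        # One JPEG per second of video; this is the CPU-heavy step mentioned above.
        subprocess.run(
            ["ffmpeg", "-i", video_path, "-vf", "fps={}".format(fps),
             "{}/frame_%05d.jpg".format(out_dir)],
            check=True,
        )
        return sorted(Path(out_dir).glob("frame_*.jpg"))

    def upload_frames(frames, bucket_name="my-training-images"):
        bucket = storage.Client().bucket(bucket_name)
        for frame in frames:
            bucket.blob("frames/" + frame.name).upload_from_filename(str(frame))

    upload_frames(extract_frames("input.mp4", "frames"))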

------
kanzure
I've been referencing this lately:
[https://github.com/pditommaso/awesome-pipeline](https://github.com/pditommaso/awesome-pipeline)

------
6gvONxR4sf7o
Followup question: How often do you all re-tune your hyperparameters?
Automatically whenever your training ETL runs? Only on first deployment?
Somewhere in between?

------
usgroup
GitHub repo and a Makefile...

Cut C++ binaries, statically linked, test locally against small data, and
ship them to production with scp. Run them either standalone, on a GPU
cluster, or on an MPI cluster.

How the data gets to where it needs to be depends on the environment.

Makefile contains all the steps : from compile to test to deploy.

~~~
kotrfa
That sounds terrible to me. I hope there is at least proper CI which executes
stuff from the makefile, but even then, using `scp` to ship something in 2019
is a step back
([https://www.phoronix.com/scan.php?page=news_item&px=OpenSSH-...](https://www.phoronix.com/scan.php?page=news_item&px=OpenSSH-8.0-Released)).

~~~
usgroup
Nah, it’s very easy to know what’s going on as a result. The endpoint is only
ever a single binary that uses environment variables. Nothing mutates the
environment. And the environment is only ever a place with an interface to
hardware and to data.

There is a single threaded CI for each environment that checks prerequisites,
builds, tests, copies, executes and saves logs.

So CI is also a simple queue and binaries remove the need for containers.

Experience has made me a believer in making things only as complicated as they
need to be and favouring solutions that can fit into one brain. Do it whenever
it fits your requirements and is possible.

------
miket
Obligatory XKCD: [https://xkcd.com/2054/](https://xkcd.com/2054/)

