
Ask HN: What is your ML stack like? - imagiko
How did your team build out AI/ML pipelines and integrate them with your existing codebase? For example, how did your backend team (using Java?) work in sync with data teams (using R or Python?) so as to deploy models to production with as little rewriting/glue code as possible? What were your architectural decisions that worked, or didn't?

I'm currently working to make an ML model written in R work on our backend system written in Java. After the dust settles I'll be looking for ways to streamline this process.
======
aaron-santos
What didn't work:

Shipping pickled models to other teams.

Deploying Sagemaker endpoints (too costly).

Requiring editing of config files to deploy endpoints.

What did work:

Shipping http endpoints.

Deriving api documentation from model docstrings.

Deploying lambdas (less costly than Sagemaker endpoints).

Writing a ~150 line python script to pickle the model, save a
requirements.txt, some api metadata, and test input/output data.

Continuous deployment (after model is saved no manual intervention if model
response matches output data).
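A minimal sketch of what such a packaging script might look like (the names and layout here are illustrative, not the actual script):

    # package_model.py -- illustrative sketch, not the actual ~150 line script
    import json
    import pickle
    from pathlib import Path

    def package(model, requirements, metadata, test_input, test_output,
                out_dir="artifact"):
        out = Path(out_dir)
        out.mkdir(exist_ok=True)
        # Pickle the trained model itself
        with open(out / "model.pkl", "wb") as f:
            pickle.dump(model, f)
        # Pin the environment the model needs at serving time
        (out / "requirements.txt").write_text("\n".join(requirements))
        # API metadata drives the endpoint/doc generation downstream
        (out / "metadata.json").write_text(json.dumps(metadata))
        # The test I/O pair gates continuous deployment: deploy only if
        # the served model reproduces test_output from test_input
        (out / "test_io.json").write_text(
            json.dumps({"input": test_input, "output": test_output}))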

~~~
exhilaration
Could you possibly define "pickling" in this context for us ML noobs?

~~~
PascLeRasc
If by ML noob you mean to say that you're like me and have zero formal CS
training (as in, I don't know what a data structure is), pickling lets you
write Python objects from your workspace to a file, much like saving and
loading Matlab's .mat files. It's excellent for writing scripts defining
different parts of a data pipeline, or just for debugging/trying new things
without waiting 20 minutes for something to filter.
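A minimal example of the round trip:

    import pickle

    data = {"weights": [0.1, 0.2], "labels": ["a", "b"]}

    # Serialize any (picklable) Python object to a file...
    with open("checkpoint.pkl", "wb") as f:
        pickle.dump(data, f)

    # ...and load it back later, skipping the expensive recompute
    with open("checkpoint.pkl", "rb") as f:
        restored = pickle.load(f)

    assert restored == data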

~~~
shantly
> as in, I don't know what a data structure is

Basically everything you work with in programming is a value (the number one-
hundred-seventy-five, for example: "175") or the address—location in computer
memory, say—of a value. You might record the address of that value above as
the count of characters from the beginning of this post, for example, were
this post the layout of data in some RAM, just as numbering houses on a
street. Add the concept of data "width"—how long the number is, in terms of
how many characters represent it (3, in this case)—and you've got basically
all there is in terms of primitive stuff that computer programs operate on.

Observe that the value stored at a location in memory, say, might itself be an
address—location—of some other thing stored in memory.

The width+value concept can get you pretty far, in that you can store a bunch
of stuff and find it again, given the address of the beginning and some
convention that the first so-many bits of the value describe the width of the
rest of the value, or some other means of knowing the size (width) of the
value, as long as all its parts are stored right next to each other in memory,
and in the correct order. That's called an array, in fact, which is a data
structure! One problem with arrays is that if you want to make them longer,
you might not have more memory available at the end of them—something else may
be using that location already, and if you just overwrite it that'll likely
break something. So you'll have to copy your entire array to a larger piece of
empty memory to add on to the end of it.

EXAMPLE!

Say we have some RAM large enough to store nine things, and we know that we
have something stored at position 4—programmers like to order items starting
with zero, but I'll refrain because it's not really important here and makes
it more confusing. The RAM contains the stuff we're looking for, plus some
other stuff that we don't care about right now:

[429317501]

We look at position 4, and know (by convention, or whatever) that the "3" we
find tells us how many more locations to read past that to get our entire
value, and extract "175" as our value, by proceeding to do just that. That's
an array! One of the most basic data structures. Of course all this is binary
under the hood, and those binary numbers can also represent letters or color
values of a pixel in an image or whatever, but I'm using simple base-10
numbers to keep things easier to follow.

You can use these basic pieces to build up more complex data structures than
that, of course. You can have a series of places in memory, not necessarily
next to each other, each containing a value and then the address of another
value+address pair. Given the address of the first of these, one could write a
program to read each in order, following the addresses to hop from one to the
next until it finds one without an address provided. Ta-da, it's (one kind of)
a linked list! That's another kind of data structure. Now if you want to add
to the end of your value, you just pick an empty spot in memory, add its
address to the last piece of the existing list (first "walking" the list to
find it by starting at the beginning and reading each piece in turn, if you
don't already know where the last part is located), then fill it in with the
value you want. This takes up more space than an array, though, and may take a
little longer to "read" (get the value[s] from). It's very common for many
types of data structure to be able to represent the same thing, just with
different trade-offs in terms of space used, or time to locate a given part of
the structure, or how much space or time it takes to modify it (recall, having
to copy an entire array in order to add on to it) and so on.

EXAMPLE!

[950818972]

If we know (somehow) that our linked list starts at position (address) 5, and
know that we can expect a value then an address at that location, next to one
another, we see "18", so our value starts with 1 and we should look at address
8 next, where we find "72"—value is 7, and now look at address 2, finding
"50". Zero in our little make-believe addressing system here conventionally
means there are no further addresses to look up (there is no address 0) so,
without knowing how long the list would be when we started, we now know we're
done, that there were three values stored in the list, and that they are, in
order, "1", "7", and "5".

If you've followed this far, you may be able to see how one could make "trees"
(pairs of addresses instead of just one, suggesting a "left" and "right" path)
and other things from these fundamental parts, and may be able to think of
reasons _why_ one might do this. A linked-list where all the "values" are
addresses to parts of some other structure (with the "address" portion of the
linked list item used as normal)? That's a sort of index, right? A list where
the "value" at each position is the address of the beginning of another list?
We call that a multidimensional list (or multidimensional array, if it's laid
out as an array) and it's one way a person might represent a grid of, for
example, colors in an image (this is basically what a bitmap is). And so on.

Disk storage uses the same fundamental building blocks. FAT, as in FAT32 or
FAT16, old DOS and Windows file systems? File Allocation Table is what FAT
stands for. There's a table (kinda like conjoined lists, much as above),
starting at a conventional spot (address) on a FAT-formatted disk partition,
that describes where all the files are on the rest of the disk, along with
some other info about the filesystem. It's basically exactly what it sounds
like, and uses precisely the same concepts as above, just applied to locations
of files on a disk rather than locations of values in RAM.

One last insight: program code—the instructions for the CPU—is also stored in
memory, and that works basically the same way as the stuff above. This
unified, undifferentiated storage system is called Von Neumann architecture,
and it's what pretty much all computers you're likely to encounter use. Point
being, those addresses pointing to things stored in memory? They can also
point to places where code is stored, which, once located, one might direct
the CPU to execute. A little thought on that, combining it with the above
notions, should suggest some cool things this would enable.

And that's about it. That's data structures, and indeed much of programming.
It's all values, addresses, widths, and more than a little bit of convention.

~~~
0wis
I have never read such a crystal clear explanation of what a computer / a Von
Neumann architecture is. Simpler is always better. Thanks a lot for this.

Any other clear descriptions of CS concepts - especially for Data Science - to
share ? Links to them ?

~~~
shantly
Haha, thanks. Nah, I wrote that just now and don't have, like, a source I go
to for this stuff that's not pretty well-known already, or a blog or anything.
Maybe I should start one. Not exactly ready to explain anything else like that
off-the-cuff in an HN post at the moment :-)

------
rozgo
Custom Unreal Engine simulator, simulating agents with NVidia PhysX and
publishing sensors through GStreamer. GStreamer has sinks and sources for ROS,
and tensorflow elements for inferencing. We package this all into NVidia
Docker for scalable simulations. The setup is similar for training and
inference. The core framework is a streaming engine with stream combinators
that enable reasoning about spatio-temporal data streams, where each datum is
related to a point in space and time. The goal is for tensors to be the
streaming primitives, but the pipeline is still fragile; it's a challenge to
keep all this working with so many core technologies changing constantly (UE4
+ PhysX + CUDA + cuDNN + Tensorflow + …). We train robots in simulation.

Most of our tools are built in Rust. Several of those are for creating
(cleaning) data streams out of datasets. They are converted into tensors or
ROS messages.

~~~
eigenvalue
How well does the training in the simulator transfer over to real world
environments? Do the models require fine tuning on real life data, or do they
immediately work? I would imagine they would get confused by things like
wheel/tracks slipping a bit on the ground versus in the Unreal simulation.

~~~
rozgo
Traditional simulations don’t translate well. Usually one has to choose a
fidelity level: controls, navigation, exploration, mission, etc. We are having
more success with models that can tolerate variable dynamics. Simulations
don't need to be realistic, only consistent. We then reinforce models with an
infinite number of simulation universes and rules, such that reality is just
another sample. We don’t have the resources for end-to-end training, and our
initial tests did not yield good results. But chaining smaller specific-
purpose networks within traditional control and planning systems is looking
really good.

------
atheriel
Our organization looks similar: models are almost all written in R but the
business operates in C#. We just use simple HTTP APIs to intermediate.
Specifically, we do the following:

1. Use R packages to bundle up the models with a consistent interface.

2. Create thin Plumber APIs that wrap these packages/models.

3. Build Docker images from these APIs.

4. Deploy API containers to a Docker Swarm (but you could use any
orchestration).

5. Stick Nginx in front of them to get pretty, human-readable routes.

6. Call the models via HTTP from C#.

This stack works pretty well for us. Response times are generally fast and
throughput is acceptable. And the speed at which we can get models into
production is massively better than when we used to semi-manually translate
model code to C#...

Probably the biggest issue was getting everyone on board with a standard
process and API design, after a few iterations. And putting in place all the
automation/process/culture to help data teams write robust, production-ready
software.
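And the calling side is just a plain HTTP request. A sketch in Python for
illustration (the route and payload are made up; our real callers are C#):

    import requests

    # Hypothetical route exposed by Nginx in front of a Plumber container
    url = "http://models.internal/churn/v2/predict"
    features = {"tenure_months": 18, "plan": "premium"}

    resp = requests.post(url, json=features, timeout=1.0)
    resp.raise_for_status()
    print(resp.json())  # e.g. {"churn_probability": 0.12}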

------
NegatioN
We train models as Kubernetes cronjobs, defined by a minimal properties file
per model that specifies the number of CPUs/GPUs and memory. They start from a
given image (e.g. PyTorch or TF) based on where in the repository these files
are placed, and then run a user-specified bash file to start the job.

Data scientists have a similar docker image running in kubernetes which
includes all of these images as conda environments for experimenting in prod-
like environments. Spark is used to fetch data for the most part.

Models report a finished state over Kafka after being persisted to buckets in
Google Cloud, then get mirrored over to a Ceph cluster connected to our
serving Kubernetes cluster.

We have an in house Golang server binding to c++ for serving pytorch neural
nets persisted with the torch.jit API (I can really recommend this for hassle-
free model serving). We also have some Java apps for serving normal ALS or
Annoy based models.
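For reference, the torch.jit export round trip is pleasantly small. A rough
sketch (the model here is a stand-in):

    import torch

    # Stand-in for a real trained network
    model = torch.nn.Sequential(torch.nn.Linear(16, 8), torch.nn.ReLU())
    model.eval()

    # Trace with a representative input and persist a self-contained module
    example = torch.randn(1, 16)
    traced = torch.jit.trace(model, example)
    traced.save("model.pt")

    # The serving side (Go binding to C++/libtorch in our case) loads it
    # without any Python; the same call works from Python for testing:
    loaded = torch.jit.load("model.pt")
    print(loaded(example))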

Our traffic is not as wild as many here, but we're serving around 10M user
requests a day.

We also merge results from several models, joining them with a separate
"meta-model" that estimates which model the user has recently preferred, in
order to weight those up.

There's probably a lot of detail left out here, especially about the serving
part, since we have various services in front of the models enriching data and
presenting it to the user, but that's the gist of it.

------
__erik
What has worked fairly well so far:

Models:

- Models are structured as Python packages; each model inherits a base class

- The base class defines how to train and how to predict (as well as a few
other, more specific things)

- ML engineers can override model serialization methods; the default is just
pickle

Infra:

- Code is checked in to GitHub; a Docker container is built on each merge into
master

- Use SageMaker BYO containers to train models; each job gets a job_id that
represents the state produced by that job (code + data used)

Inference / deployment:

- Deploy models as HTTP endpoints (SageMaker or internal) using a job_id

- Have a service that centralizes all score requests, finds the correct
current endpoint for a model, emits score events to Kinesis, and tracks the
health of model endpoints

- A/B test either in the scoring service or in the product, depending on
requirements

- Deploy prediction jobs using a job_id and a data source (usually SQL) that
can be configured to output data to S3 or our data warehouse

So far this has been pretty solid for us. The tradeoff is that there's a step
between notebook and production for ML engineers, which can slow them down,
but it forces code review and increases the number of tests checked in.
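A minimal sketch of the base-class pattern described above (method names are
illustrative, not the actual interface):

    import pickle
    from abc import ABC, abstractmethod

    class BaseModel(ABC):
        """Every model package ships a subclass of this."""

        @abstractmethod
        def train(self, X, y):
            ...

        @abstractmethod
        def predict(self, X):
            ...

        # Default serialization is pickle; ML engineers may override both
        def save(self, path):
            with open(path, "wb") as f:
                pickle.dump(self, f)

        @classmethod
        def load(cls, path):
            with open(path, "rb") as f:
                return pickle.load(f)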

~~~
aaron-santos
> The tradeoff is that there's a step between notebook and production for ML
> engineers, which can slow them down, but it forces code review and increases
> the number of tests checked in.

This was a game-changer for us. What does your testing story look like?

~~~
__erik
We're still ironing out a few things, but: unit tests for various functions,
then small, statistically representative datasets for each model. In CI we
train a model on the small dataset (aiming for <5 mins e2e), then we have a
suite of model metrics we care about; tests confirm values are within
acceptable bounds, and the test values are pulled into the PR.
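A sketch of what such a CI gate might look like (the dataset and threshold are
stand-ins):

    # test_model_metrics.py -- illustrative CI gate, not the actual suite
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    def test_auc_within_bounds():
        # Stand-in for the small, statistically representative dataset
        X, y = make_classification(n_samples=500, random_state=0)
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
        model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
        # Fail the PR if the metric drifts out of acceptable bounds
        assert 0.80 <= auc <= 1.0, f"AUC {auc:.3f} out of bounds"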

------
Datenstrom
We are framework agnostic for model development, models get converted to
ONNX[1] and served with the ONNX runtime[2]. They are deployed as
microservices with docker.

We are currently looking at MLflow[3] for the tracking server, though it has
some major pain points. We use Tune[4] for hyperparameter search, and MLflow
provides no way to delete artifacts from the parallel runs, which leads to
massive amounts of wasted storage or dangerous external cleanup scripts. They
have also been resisting requests for the feature in numerous issues. There's
not a good open-source solution in the space.

Note that this is for an embedded deployment environment.

[1] [https://github.com/onnx/onnx](https://github.com/onnx/onnx)

[2]
[https://github.com/Microsoft/onnxruntime](https://github.com/Microsoft/onnxruntime)

[3] [https://mlflow.org/](https://mlflow.org/)

[4]
[https://ray.readthedocs.io/en/latest/tune.html](https://ray.readthedocs.io/en/latest/tune.html)
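Serving an ONNX model with the runtime is only a few lines. A minimal sketch
(the model path and batch shape are placeholders):

    import numpy as np
    import onnxruntime as ort

    # Load the converted model once at service startup
    session = ort.InferenceSession("model.onnx")
    input_name = session.get_inputs()[0].name

    # Run inference on a batch; None means "return all outputs"
    batch = np.random.rand(4, 10).astype(np.float32)
    outputs = session.run(None, {input_name: batch})
    print(outputs[0])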

~~~
hn2017
Any issues with the relatively new ONNX format? How do you handle model
monitoring, verifying accuracy of model over time?

------
marcinzm
The systems I've seen basically break things into different services, tied
together with gRPC or Thrift, which have code generators for most languages.
So the Java backend simply makes RPC requests to a server running R.

In one case, though, we had very tight latency requirements (~10ms), so the ML
results were pre-computed and loaded from a cache on the backend servers.

~~~
imagiko
I imagine I will be breaking things down into different services as well: an
ML "blackbox" that makes a call to the back-end for data and returns a
result/prediction. This could happen through an API; what kind of API to
choose is still open.

I'm not very sure what you mean when you say the ML results were pre-computed?

~~~
marcinzm
If you're pulling data from somewhere automatically then you should make sure
to define the data contracts well and that they're not changed. Also, A/B test
things if possible. I've had issues in the past where the data pipeline's view
of the data and the API's view of the data weren't the same, or where a bug
fix resulted in the values of certain fields changing.

>I'm not very sure what you mean when you say the ML results were pre-
computed?

We were scoring ads per page, and possible values for both were known ahead of
time. So for each ad-page combination we generated the scores and then pushed
them into a giant cache.
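In sketch form: score everything offline, push it into a cache, and make the
serving path a lookup (the data and scoring function here are stand-ins):

    # Offline job: all ads and pages are known ahead of time
    ads = ["ad1", "ad2"]
    pages = ["home", "article"]

    def score(ad, page):  # stand-in for real model inference
        return hash((ad, page)) % 100 / 100

    # Precompute every (ad, page) score and push into a cache
    cache = {(a, p): score(a, p) for a in ads for p in pages}

    # Serving path: a lookup instead of inference, which easily
    # fits a ~10ms latency budget
    def get_score(ad, page):
        return cache.get((ad, page))

    print(get_score("ad1", "home"))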

------
chewxy
Almost entirely in Go. I use Gorgonia [0] and Gonum [1]. Granted, I wrote
Gorgonia. All solutions fit into the company's CI/CD infra with almost no
additional overhead.

Sometimes the end result is gRPC services, sometimes it's some sort of
serialized model (weights). Sometimes the model is specified in protobuf. Very
rarely is it an HTTP API. I don't fancy those.

Ironically I haven't done much with distributed models. Or if it's
distributed, it's not some Kafka-esque monstrosity.

I rarely use Python for anything other than exploratory analyses now.

Being able to type `go build .` and have it run anywhere is pretty awesome.

[0] [https://gorgonia.org](https://gorgonia.org) [1]
[https://gonum.org](https://gonum.org)

------
jointpdf
Check out the CRAN task view on this topic:
[https://cran.r-project.org/web/views/ModelDeployment.html](https://cran.r-project.org/web/views/ModelDeployment.html)

One dead simple way to do this (R model —> Java production) that I’ve done in
the past is to use PMML (via _pmml_ package), which converts models to an XML
representation. ONNX is a similar/newer framework along these lines. You can
also look at _dbplyr_ for performing ( _dplyr_ -like) data preprocessing in-
database.

~~~
oli5679
I didn't see this comment when I posted, but I would strongly recommend the
PMML-based approach.

------
saternius
What we do at [https://quillbot.com](https://quillbot.com)

Training:

Currently we just use a bunch of beefy desktop workstations for training
(using Pytorch).

Deployment:

This is the vast majority of our cost. Each time a paraphrase comes in we add
it to a queue through Google Cloud Pub/Sub. We have a cluster of GPU (T4)
servers pulling from the queue, generating paraphrases and then sending the
responses back through Redis pub/sub. Ideally we would have a system that
makes it easier to batch sentences of similar length together, but this seems
to be the most cost-effective setup that is still relatively simple to put
together, for models that are too computationally expensive for the CPU.

------
gautamcgoel
I'm a PhD student at Caltech, working on the theoretical foundations of ML. I
personally don't do a lot of coding, but basically everyone in my department
uses Python (especially Pytorch) for deep learning/ML. This all runs on Nvidia
GPUs (never seen an AMD GPU in the office). Occasionally people code in
Matlab, especially if they work in optimization or control. Tmux and git are
the only command line tools I see commonly used. Occasionally people ssh into
an Amazon box if they need more compute.

~~~
chronic379019
You didn't really describe a stack. Which is fine, because academic research
usually doesn't really reuse code ;)

A proper ML stack is something like:

- Data format in X schema

- Model trained on Y library/platform

- Evaluated and tested using Z

- Serialized in A format

- Stored on cloud B

- Deployed using C

- Versioned using D

- Real-time monitoring using E

------
oli5679
I'd recommend exporting the R model as a PMML file, and getting your Java team
to interact with an Openscoring server.

PMML is a language-agnostic model specification (XML-based). The Python and R
machine learning ecosystems can easily generate these (caveat: I've only tried
gbdt and linear models, and I'm not sure this works well for neural nets).

Openscoring is a Java library that creates a REST API for scoring models. It's
lightweight and battle-tested, has a nice API and good model versioning, and
in my experience is 10x faster than Python Flask. You don't need to write any
Java code, just download and run the .jar and post valid PMML to the right
endpoint.

Another feasible approach is SageMaker deploy: code in a Jupyter notebook can
deploy an API in one line. I think this can be less economical and have higher
latency if you have high usage, but a data scientist can do model updates from
within a notebook.

Please NEVER hardcode regression model coefficients within Java. This is a
nightmare to maintain, prevents increasing model complexity, and is no simpler
than PMML + Openscoring. I think you can wrap the Java PMML library in another
Java web framework like Spring if you need something more bespoke.

[https://www.rdocumentation.org/packages/pmml/versions/2.1.0/...](https://www.rdocumentation.org/packages/pmml/versions/2.1.0/topics/pmml.xgb.Booster)

[https://github.com/openscoring/openscoring](https://github.com/openscoring/openscoring)

[https://aws.amazon.com/blogs/machine-learning/using-r-with-amazon-sagemaker/](https://aws.amazon.com/blogs/machine-learning/using-r-with-amazon-sagemaker/)
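A sketch of the deploy-and-score cycle against Openscoring, assuming the
endpoint layout from its README (worth double-checking against the version you
run):

    import requests

    base = "http://localhost:8080/openscoring/model/churn"

    # Deploy: PUT the PMML document to the model's endpoint
    with open("churn.pmml", "rb") as f:
        requests.put(base, data=f, headers={"Content-Type": "text/xml"})

    # Score: POST a JSON record with the model's input fields
    resp = requests.post(base, json={
        "id": "record-001",
        "arguments": {"tenure_months": 18, "plan": "premium"},
    })
    print(resp.json())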

~~~
imagiko
Looks very interesting. Will definitely explore in this direction.

> Please NEVER hardcode regression model coefficients within Java.

Amen to that.

~~~
oli5679
If you aren't wedded to R, then pickling a sklearn Pipeline and loading it in
a Flask app can be nice. The advantage of this is that data pre-processing can
also be included in the sklearn Pipeline.

[https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html)
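A minimal sketch of that pattern, train side and serve side in one file (the
toy data is illustrative):

    import pickle

    from flask import Flask, jsonify, request
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    # Training side: pre-processing travels with the model
    pipeline = Pipeline([
        ("scale", StandardScaler()),
        ("clf", LogisticRegression()),
    ])
    pipeline.fit([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]],
                 [0, 1, 1, 0])
    with open("pipeline.pkl", "wb") as f:
        pickle.dump(pipeline, f)

    # Serving side: load once, score per request
    app = Flask(__name__)
    with open("pipeline.pkl", "rb") as f:
        model = pickle.load(f)

    @app.route("/predict", methods=["POST"])
    def predict():
        features = request.get_json()["features"]  # e.g. [[0.5, 0.5]]
        return jsonify(predictions=model.predict(features).tolist())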

The bit I'm not sure about how to do well is model monitoring.

------
ivalm
We have a bit of a problem like what you mention: our backend/app is in Java
but the DS/ML team generally works in Python. The ML team basically doesn't
ship production code.
Here are the artifacts we produce:

1. For new models we often build demo endpoints/glue code written in
Python/Flask that can be compared against the prod output in dev/psup.

2. Deep learning models (much of what I do personally): saved in TF SavedModel
format. If it is an update to an existing model, it is often just a drop-in
replacement. If it is a brand new model, I will often include a Flask demo
(the Python code does proper data transformation before calling on TF). On the
production side, after testing/regression these models are deployed via
tensorflow-serving containers and used as gRPC endpoints. For production,
whatever data pre-processing needs to be done is written by the backend team,
who compare their preprocessing output with our demo's.

3. Logistic regression/tree models: again, for new models we provide the demo,
but what goes into production is either a csv (logistic regression) or json
(tree) of the weights/decision boundaries, which are used as resources by the
backend team's Java code.

The overall flow is:

ETL (via apache airflow/custom code) => model training/feature engineering =>
(saved model file + flask demo endpoint/documentation on feature
transformations) => dev incorporate model/test into java backend => comparison
of demo vs java backend => regression of java backend (if they had previous
versions of model) => psup (small amount of prod data duplicated and ran in
parallel with prod) => prod (model deployed + monitored)

There is a caveat that we also do some batch processing / not-really-live
analysis that is just done in Python, with the results pushed wherever they
need to be pushed. In this case we don't involve the backend/Java team.
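For the SavedModel piece in (2), the export itself is tiny. A sketch with a
placeholder model:

    import tensorflow as tf

    # Placeholder standing in for the real network
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])

    # Export in SavedModel format; tensorflow-serving containers can load
    # this directly and expose it as a gRPC endpoint
    tf.saved_model.save(model, "export/my_model/1")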

~~~
chaoyu
BentoML ([https://github.com/bentoml/BentoML](https://github.com/bentoml/BentoML))
may help you with the process of building endpoints for both deep learning
models and logistic regression/tree models, and it automatically containerizes
the API server into a docker image that's ready for production deployment.

It also provides an OpenAPI spec for your API endpoint, which allows you to
generate API clients in Java for your backend/app teams.

~~~
wing328hk
FYI, one way to automatically generate API clients is to use OpenAPI Generator
([https://github.com/OpenAPITools/openapi-generator](https://github.com/OpenAPITools/openapi-generator)),
which is free, open-source and supports 30+ programming languages.

Disclosure: I'm the top contributor to the project.

~~~
ivalm
So we've been using Swagger; is there some comparison between OpenAPI
Generator and swagger-codegen?

~~~
wing328hk
Here is the Q&A: [https://github.com/OpenAPITools/openapi-generator/blob/master/docs/qna.md](https://github.com/OpenAPITools/openapi-generator/blob/master/docs/qna.md)

Another way to compare is as follows:

[https://www.openhub.net/p/openapi-generator](https://www.openhub.net/p/openapi-generator) vs
[https://www.openhub.net/p/swagger-codegen](https://www.openhub.net/p/swagger-codegen)

------
FridgeSeal
Currently:

Models and feature engineering are done in Python, trained locally, with
weights uploaded to S3. A Dockerfile with a tiny little web server gets
deployed through our CI/CD pipeline for serving.

Soon: Argo Workflows + Polyaxon for data collection, feature engineering,
training etc. Push the best model to S3, and the same CI/CD process with a
docker container deploys a little web server onto our Kubernetes environment.

Deep learning stuff will probably use a similar setup, but with PyTorch
instead of Sklearn. Would like to look at serving with ONNX exporting.

When the Julia packages evolve a little more, we'll be looking forward to
using them in production.

~~~
mmq
Glad to see that you are interested in using Polyaxon[0] for your MLOps. I was
going to write a blog post about the upcoming v1.0 release of Polyaxon, but I
just wanted to point out that there will be native support for different types
of workflows: currently it supports parallelism and distributed learning, and
in the next release there will be native support for DAGs as well. Here's a
test fixture[1] of what a DAG workflow will look like in Polyaxon.

Happy to answer any questions or provide more information.

[0]:
[https://github.com/polyaxon/polyaxon](https://github.com/polyaxon/polyaxon)

[1]:
[https://github.com/polyaxon/polyaxon/blob/master/cli/tests/f...](https://github.com/polyaxon/polyaxon/blob/master/cli/tests/fixtures/pipelines/simple_dag_pipeline.yml)

~~~
superkitty
Any inputs on Argo workflows vs Kubeflow vs MLFlow? Which is better suited?

~~~
FridgeSeal
Kubeflow, unless I'm missing some things, is for Tensorflow pipelines; if
you're not using TF, or it's not the only thing you use, it's not ideal.

I thought MLflow was a Spark thing, and we're trying to migrate off of
Spark/Databricks due to the resource inefficiencies of Spark (at our scale)
and the maintenance nightmare that Python notebooks are causing us.

Argo is just a container workflow tool, not ML-specific. We're planning on
using Argo for the data engineering parts, and Polyaxon for the ML training
parts because of the convenient monitoring and hyperparameter search tools.

~~~
TheIronYuppie
Hi! Co-founder of Kubeflow here - definitely not TensorFlow only! You can see
([1]) many, many different repos and operators. The nice part about Argo for
us is that it let us build an ML-specific DSL that was also Kubernetes-native.

------
atak1
Background:

We have a team of 6 DS working on parallel projects. We tried the Java service
approach. It's great for a one-time model, but very painful to iterate on.

We develop on top of SageMaker, and since we're a funded company, can somewhat
get away with the 40% price increase of "ML Instances".
We have a mix of R/Python models. For each, we keep a separate repo with a
Dockerfile, build file, and src code.

Training:

If the jobs are small, we train them locally, package assets into the
container, and deploy. If it's a bigger job, we leverage Sagemaker training
jobs and S3 for model storage.

Serving:

We have a boilerplate web service layer with an entrypoint that DS fill in
with their own code. Yes, this allows almost arbitrary code to be written, but
we do force code reviews and enforce standards. Convention over configuration.

We do the feature engineering using Python/R, which, when parallelized, has
good enough performance (sub-200ms latency on SageMaker prod). If we needed
latencies in the 1-10ms range, we'd consider refactoring the feature
engineering into a separate layer written in a more performant language. It's
always FE that takes the most time.

One learning from tuning Python services: for max performance, try to push the
feature engineering work onto consumers. Have them fully specify the shape of
the data in a format that your models expect, so you do as little feature
engineering as possible in the serving step.

------
orasis
Data ingest: AWS Lambda (JavaScript) using the Serverless Framework to Kinesis
Firehose.

Continuous Training: Lambdas triggering SageMaker BYO training jobs.

Continuous Model Deployment: Lambdas polling SageMaker training jobs to update
SageMaker BYO endpoints.

Inference endpoint: Lambda proxying to a SageMaker endpoint.

Our BYO endpoints allow us to do batch inference without a bunch of round
trips between Lambda and SageMaker.

If endpoint costs get too high, we’ll implement some caching at the lambda
layer.

------
brucej
For our current projects we use our open source Apache Spark framework, Arc
([https://arc.tripl.ai/](https://arc.tripl.ai/)), for feature prep; then,
depending on the type of model, we will either:

- use built-in Spark ML models

- call a model running as a service

- write files for a model to ingest (for a legacy project)

- develop a custom plugin or UDF (for calling via SQL)

We have built-in stages for running Spark ML models in the framework, as well
as HTTP and Tensorflow Serving stages to call services. We recently ran a
series of models for NLP that were in Python and OCaml via the HTTP stage,
sending the payload either in JSON or other formats that the services needed.
The text extraction via OCR (tesseract) had been done as a prior Spark stage.
This design allows us to call these more custom ML models but keep them part
of a larger Spark job, and to use SQL and other features when needed. The
services were deployed in AWS Fargate to allow for scaling. For other jobs we
are deploying our Arc jobs using Argo for orchestration. We spin up compute on
demand vs running inside a persistent cluster.

For training we use Jupyter Notebooks where possible. We have a plugin that
generates Arc jobs from these notebooks.

For special cases we can add custom plugins or UDF functions to extend the
framework. I have done similar plugins to run XGBoost models in Spark for
example.

Whilst we try to be prescriptive around the ML stack for Data Scientists, this
approach has allowed flexibility where needed and for different teams to own
their part of the job. This is particularly useful in larger teams where
development is more federated.

------
superkitty
We use self-deployable configuration to allow Data Scientists to control the
model's destiny.

Models are written in Python (a mix of PyTorch/NLP/TensorFlow) and serve about
35 predictions/second on average. The API server is also written in Python;
its container writes requests to a distributed queue cluster, and the models
pick up samples from the queue in batches. This lets us experiment with
different flavors of a model based on routing set at deployment time, which in
turn is stored in the cache. We use AWS-managed cache, queuing and container
orchestration platforms.

Next:

1) The training and production pipelines are currently separate; we want to
combine them, possibly using MLflow, Airflow or Kubeflow. Deployment to
production is done through Jenkins.

2) Active retraining and auto-deployment to production.

3) Tying the version of the model in production to the model being trained.
There is currently no way for us to tie the version back.

------
Sloppy
We developed an OSS ML server called Harness. It does all the ingest,
preparation, algorithm management and workflow bits for pluggable ML
"Engines". These are Algorithms + Datasets + Models, and are flexible enough
to do most anything. We use the built-in Universal Recommender Engine, and
have built our own for other uses.

Harness exposes a framework for adding Engines and does all the routing for
Engine instance workflow and lifecycle management. It also provides a toolbox
of abstractions for using the Spark ecosystem with Mongo and Elasticsearch.

It comes as a docker-compose system for vertical scaling, and Kubernetes for
the ultimate in scaling and automation. Quite a nice general system with
out-of-the-box usefulness.

[https://github.com/actionml/harness](https://github.com/actionml/harness)

------
MrSaints
We are currently experimenting with workflow engines for orchestrating
different components, e.g. data ingestion, data preprocessing, feature
engineering, scoring, automated decision making / escalation. Namely, Argo for
offline processing, and bulk processing; Zeebe, and Cadence (trying both out)
for online processing, and business logic / application services.

We don't yet have a polyglot architecture, but we do have the requirement of
running distributed services (partly because certain components of the
pipeline need to be run on-premise), and we have found that workflow engines /
orchestrators definitely make it a lot easier to reason about the wider
architecture / have a bird's-eye view. It works for us. No need to handle
callbacks, events, queues, etc. We also have the potential to run a polyglot
architecture.

We tried out Celery Workflows, and struggled to get it "production ready", so
I'd advise against this for complex workflows. We also found the visibility
lacking.

We have yet to fully try out Kubeflow, and MLflow. What is not quite working
at the moment is creating, and deploying portable models. And I don't mean
simply pickling, and storing an artifact.

Leveraging containers (Docker), and slapping simple anti-corruption layers
(e.g. simple web APIs) has also helped. We have a more consistent way of
deploying, and isolating code without having to rewrite much.

We want to look into using Nuclio, and/or knative to ease the process of
deployment, and to empower the data scientists to deliver without much
engineering expertise.

Others have mentioned using base classes or standard interfaces for their
models. We tried this too, but it didn't work. The generalisation early on was
met with conflicting requirements, and broke the interface segregation
principle (not that it matters too much, but it can be confusing to not know
precisely what is being used or not used). We figured it's much easier to
defer any abstractions. Let the data, and its flow, do the talking.

------
klowrey
Full stack Julia for RL (and trajectory optimization)

~~~
jointpdf
This sounds pretty awesome. Do you have any write-ups that explain more?

~~~
klowrey
After wrapping the Mujoco physics engine in Julia, I did a few reinforcement
learning / robotics control projects, combining it with the Flux.jl library
for neural nets.

While I never made my research code public,
[https://openreview.net/forum?id=SyxytxBFDr](https://openreview.net/forum?id=SyxytxBFDr)
documents a similar setup to what I use.

------
ZeroCool2u
Plotly's Dash to prototype front ends that ingest/use the model output. (Our
team is Python-only, but others use R, so this works great because it supports
both.) [https://dash.plot.ly/](https://dash.plot.ly/)

FastAPI for quickly creating new API endpoints. It has automatic _interactive_
docs and super simple data validation via Python typehints, so that we don't
waste compute time with malformed data.
[https://fastapi.tiangolo.com/](https://fastapi.tiangolo.com/)
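A minimal sketch of that pattern (the field names are illustrative); the
pydantic model gives you validation and docs for free:

    from fastapi import FastAPI
    from pydantic import BaseModel

    app = FastAPI()

    class Features(BaseModel):
        # Typehints double as validation: malformed requests are
        # rejected with a 422 before ever touching the model
        tenure_months: int
        monthly_spend: float

    @app.post("/predict")
    def predict(features: Features):
        # Stand-in for real model inference
        score = 0.1 * features.tenure_months + 0.01 * features.monthly_spend
        return {"score": score}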

We deploy on prem most of the time, but have started using GCP on occasion.

------
zmmmmm
> For example, how did your backend team (using Java?) work in sync with data
> teams

I guess we are a bit of an outlier, but we deploy the ML using Java / JVM. Not
really in the same league as others here so the models are simple enough that
the various Java ML frameworks are fine for it (DL4J, Smile, etc). We even do
a lot of the interactive exploratory / training type work on the JVM (though
via Groovy and Scala with BeakerX notebooks [1] - sometimes combined with
Python and R).

I think as the field matures a lot more could move to this model.

[1] [http://beakerx.com/](http://beakerx.com/)

~~~
imagiko
We once had a team member re-write their R code in Java, using weka just to
avoid too much hassle for the back-end team. So I guess you're not alone!

------
sam0x17
crystal / shainet
([https://github.com/NeuraLegion/shainet](https://github.com/NeuraLegion/shainet))

I contract for some clients in fintech and some defense-related stuff.

~~~
wyldfire
What kind of targets is crystal used on? Strictly x86_64 linux?

EDIT: Took a little digging but I found it [1] (and yes it's primarily x86_64
linux+macOS)

[1] [https://github.com/crystal-lang/crystal/wiki/Platform-Support](https://github.com/crystal-lang/crystal/wiki/Platform-Support)

~~~
sam0x17
Windows will probably be done in a year, but it's hard to say. Hardest part is
finding people who want to work on windows support xD

~~~
wyldfire
No kidding. (I wasn't complaining btw nor was I even focused on OS support, I
was curious about which arches people use it on).

What are the popular use cases - either ones driving current development or
ones where it's best suited?

~~~
sam0x17
Right now people mainly use it for web apps, sometimes data processing. With
windows support, I think there would be a rush to make cross platform GUI
apps.

------
eggie5
Development of models happens in our data environment: notebooks, PySpark EMR
clusters for analytical workloads and offline models, TensorFlow/EC2 P2s for
online models.

Jobs are scheduled (Azkaban) for reruns/re-training and pushed from the data
env to the feature/model store in the live env (Cassandra). Online models are
exported to SavedModel format and can be loaded on any TF platform, e.g. Java
backends.
Online inference using TF Serving. Clients query models via grpc.

A lot of our models are NN embedding lookups, we use Annoy for indexing those.
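The Annoy piece looks roughly like this (dimension, metric and data are
placeholders):

    import random

    from annoy import AnnoyIndex

    dim = 64  # embedding dimension from the NN

    # Offline: build the index from the trained embeddings
    index = AnnoyIndex(dim, "angular")
    for item_id in range(1000):
        index.add_item(item_id, [random.random() for _ in range(dim)])
    index.build(10)  # 10 trees; more trees = better recall, bigger index
    index.save("embeddings.ann")

    # Serving: load (mmap) the index and query nearest neighbours
    index = AnnoyIndex(dim, "angular")
    index.load("embeddings.ann")
    print(index.get_nns_by_item(0, 10))  # 10 closest items to item 0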

------
kevinskii
We're currently running a single NVIDIA RTX2080 with Tensorflow 2.0 on a
Windows 10 station. We'll soon be switching to a standard multi-GPU rig
running an air gapped Linux distro. Linux seems overall much better for ML
because of better Docker integration and tensor core support on the newer
GPUs. Also, we'll probably be switching from Tensorflow to Pytorch for model
development. Pytorch requires a little bit more code, but debugging is 10X
easier.

~~~
jedieaston
Why airgapped? Is it a business/security requirement? If you have to share the
machine, does everybody have to thumb drive over their files to run with the
big GPU?

~~~
kevinskii
Yes. The air-gapping isn’t ideal, but trying to get our IT org to accommodate
a Linux workstation on the network just isn’t worth the hassle.

------
rhacker
I'm pretty impressed with the level of automation I'm seeing in general. Looks
like many are using docker/k8s or containers in some way or another.
Inspiring.

~~~
chewxy
It's pretty horrifying to me. Why are we adding layers and layers of
abstraction that don't really make life easier?

------
tvinko
Hi Aaron, I was having similar issues. My main problem was integrating
different programming languages and tools under the same roof, so I started
building my own ML platform. Currently C# is supported, but others are on the
roadmap (Python, R, Node.js...). You can check it out here:
[https://github.com/Zenodys/ZenDevTool](https://github.com/Zenodys/ZenDevTool)

------
elwell
On a meta note, if you're interested in viewing & sharing stacks, that's the
primary feature of the startup I'm working on: Vetd (app.vetd.com). The
communities we host (often VC portfolio companies) share their stacks and
leverage them for discounts.

- CTO (chris at vetd.com)

------
mendeza
How do you all monitor concept drift, or do monitoring to detect when it's
time to deploy a fresh model?

~~~
hn2017
Wondering the same thing!

------
jacquesm
Python (the Conda distribution; not small, but quite easy and batteries
included), Keras, Tensorflow, NVidia hardware (a GTX1080ti; not sure what the
current sweet spot for price/performance in GPU land is, but that was the best
I could get at the time).

------
rewritteninrust
I'm also curious—and maybe someone here can chime in—about how you get
organizational buy-in for introducing ML. There are a couple of problem areas
at my company that I think would be great for ML, but I don't know how to get
others on board.

~~~
sten
Someone here wrote a POC. They happened to be fairly high up, not C-level but
well respected. The pitch was basically: "I can save x% cost using this." You
can bootstrap that if you're not high enough to sit at the table; you just
need to convince someone who does.

I appreciate that this approach won't work for everyone.

------
suyash
For my pet projects, I do training and testing locally on my machine, either
using notebooks or an IDE, then test and validate further before deploying to
a server as a microservice.

------
cercatrova
We used TensorFlow Serving running in a docker container that exposes a REST
endpoint for predictions. That way the backend can be agnostic to Serving. We
personally used Node to query the container for whatever we needed.

------
mistrial9
Our team built a Postgres backend on a single physical server (with a lot of
good disk in RAID, among other things); Python ran against the database,
calling sklearn libs (I'm skipping the problem-specific libs involved, but
they added to either the PG or the Python side). It worked really, really
well, was easy to work on, and had good architectural separation of stages in
the process. Completed and shipped to an impressed customer. No GPUs.

------
GolDDranks
scikit-learn + Optuna for hyperparameter search + Python + Docker + AWS
Batch/AWS Fargate. We are going to incorporate XGBoost or LightGBM at some
point. The AWS services have some slowness in starting up, but the pipelines
are not for online services, so it works.
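A minimal sketch of the scikit-learn + Optuna combination (the dataset and
search space are illustrative):

    import optuna
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    X, y = load_breast_cancer(return_X_y=True)

    def objective(trial):
        # Optuna suggests a point in the search space per trial
        params = {
            "n_estimators": trial.suggest_int("n_estimators", 50, 300),
            "max_depth": trial.suggest_int("max_depth", 2, 16),
        }
        clf = RandomForestClassifier(**params, random_state=0)
        return cross_val_score(clf, X, y, cv=3).mean()

    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=20)
    print(study.best_params)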

------
itronitron
why not write the ML model in Java?

~~~
m00x
Because all of the large tools are in Python. Tensorflow, PyTorch, OpenCV,
Numpy, Pandas, etc.

~~~
itronitron
The OP mentions that they are currently working in R.

------
xiaodai
We use our own product, [https://PI.EXCHANGE](https://PI.EXCHANGE).

We deploy the models as HTTP endpoints and consume them in R/Python/Excel.

We also have advanced functionality, available to enterprise clients, that is
exposed as APIs with a customised JSON format to trigger various agents.

------
burner6565
My company is considering licensing DataRobot.

It looks to be a time saver.

------
mister_hn
OpenCV + dlib + CUDA or caffe2 as alternative

------
copperfitting
Idk, I would love good answers here.

