
Ask HN: How do you deploy and scale ML (±DL) models? - aaai
_Broad answers to the broad question are OK too, everyone will benefit from them, but more specifically:_

1. Why is there so little "unbiased" info about production deploying/serving ML models? (I mean except the official docs of frameworks like e.g. TensorFlow, which obviously suggest their mothership's own services/solutions.)

2. Do you hand-code microservices around your TF or PyTorch (or sklearn / homebrewed / "shallow" learning) models?

3. Do you use TensorFlow Serving? (If so, is it working fine for you with PyTorch models too?)

4. Is using Go infra like e.g. the Cortex framework common? (I keep reading about it, _love the point and I'd love using a static language here, just not Java_, but I've talked with _no one_ who's actually used it.)

5. And going beyond the basics: _is there any good established recipe for deploying and scaling models with dynamic re-training (e.g. the user app exposes something like a "retrain with params X + Y + Z" API action, callable in response to user actions, i.e. users control training too) that does not break horribly with more than tens of users?_

P.S. Links to any collections of "established best practices" or "playbooks" would be awesome!
======
tixocloud
1\. By unbiased, do you mean unopinionated? The MLOps industry is still in its
very early stages and there's no single standard. Every dev and company has
come up with their own implementation, but there are so many tiny little use
cases that new implementations keep springing up. The closest thing to a
standard is a Docker/Kubernetes flavour.

2\. Hand-coding is fine to begin with, but as you scale the number of
production models and actually productionize them, it becomes infeasible and
leads to plenty of maintenance issues. There are a few model infrastructure
tools that help with this but again, many are homegrown because the market is
still new. Algorithmia and Seldon are pretty good starts.

3\. We rarely use the serving options provided, as the challenge is
integrating them with the rest of engineering. Service monitoring gets handled
by different teams.

4\. Depends on the industry and use case. Again, integration and maintenance
come into play. Go/Cortex might make sense, but a lot of companies leverage
Spark, so Scala/Java could be the choice for production models.

5\. We’re working on creating this recipe for enterprises. I believe Seldon
(open source) might contain this capability. The challenge as you pointed out
is ensuring things don’t break!

------
calebkaiser
Cortex contributor here/the guy who wrote that article about using Go. The
project is on the young side, so we don't have the "footprint" of older
projects yet, but if you want to talk to people deploying models with Cortex
I'd recommend checking out our Gitter channel:
[https://gitter.im/cortexlabs/cortex](https://gitter.im/cortexlabs/cortex)

All of our core contributors + a good number of users are in there, and we're
all happy to chat.

~~~
aaai
Thanks! Will dig more into Cortex and/vs MLflow now ;)

------
Jugurtha
We're building an internal platform for that. Description in my profile.
Please get in touch if you'd like to know more or play with it, we'd love your
feedback. We have given access to about thirty students to prepare their final
year projects in vision, NLP, etc.

We've been doing consulting for more than six years, and we're building a
platform precisely to solve the problems we have encountered and that you are
writing about. We have learned some things that we are encoding in the
platform, in case you want to build your own. We started doing this because we
hit a ceiling on the projects we could take on, and we were under stress.
We're a tiny, tiny team.

The problems are in the interfaces between different roles, with each role
having a stack with a gazillion tools and a different "language" they speak
and universe they live in. Stitching together people's interactions, the
workflow, the business problems, and the fragmented tooling is problematic.
The inflexibility of said tooling and frameworks, which you addressed, also
meant we couldn't use them or other platforms. This is why we are working hard
to build a coherent, integrated experience, while still trying to build
abstractions that let us substitute tools and treat them as simple
components, rather than being tied to them.

For now, it allows you to create a notebook from several images with most
libraries pre-installed. The infrastructure it's deployed on offers Tesla K80
GPUs, which you can use. You can of course install additional libraries.

This solves the problem of setting up the environment, CUDA, the Docker
engine, runtime versions, and the usual yak shaving. We're only using
JupyterHub and JupyterLab for Python notebooks for now, as that's what our
colleagues use, but we plan to support more.

It also solves the "it works on my machine" problem of running a colleague's
notebook.

You can click a button to publish an AppBook and share it with a domain expert
right away to play with. It is automatically parametrized for you, so you
don't fiddle with widgets: form fields are generated automatically for the
parameters. Parameters and metrics are tracked behind the scenes without you
doing anything, and models are saved to object storage. Again, one role we
target is the ML practitioner who does not necessarily remember to do these
things, so we do it for them.

Here's a video from a very early version:
[https://app.box.com/s/mwsw79g3d5b974o625f1mw979cc4znf0](https://app.box.com/s/mwsw79g3d5b974o625f1mw979cc4znf0)

We're using MLflow for that, but plan to support GuildAI and Cortex. We think
hard about keeping things loosely coupled and configurable, so you get to pick
your stack and easily integrate the platform with it.

The AppBook is super useful in that you can publish it and then use it to
train the model, or share it with a domain expert so they can play with
different parameters. One of the problems we've seen is that some features
are considered unimportant by an ML practitioner but are critical to domain
experts.

Tightening that feedback loop from notebook to domain expert is what makes the
one-click AppBook important: it saves you scheduling meetings and figuring out
how to "show" the domain expert the work, while letting them interact with it.

You can also deploy models you choose with one click and it will give you an
endpoint and generate a tutorial on how to hit that endpoint to invoke the
model with curl or Python requests. You can generate a token and invoke the
model in other places or services.
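
For illustration, here's roughly what invoking such an endpoint from Python might look like. The URL, token, and payload shape below are all made up, since they depend on the platform; the commented-out `requests` call shows where the third-party library would come in:

```python
import json

# Hypothetical values -- the real endpoint URL and token come from the platform.
ENDPOINT = "https://example.com/models/my-model/invoke"
TOKEN = "paste-your-generated-token-here"

def build_invocation(endpoint, token, features):
    """Assemble the pieces of an authenticated model-invocation request."""
    headers = {
        "Authorization": f"Bearer {token}",  # token generated from the platform
        "Content-Type": "application/json",
    }
    body = json.dumps({"instances": [features]})
    return endpoint, headers, body

url, headers, body = build_invocation(ENDPOINT, TOKEN, {"sepal_length": 5.1})
# With the third-party `requests` library you would then send it off:
#   import requests
#   response = requests.post(url, headers=headers, data=body)
#   print(response.json())
```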

This self-service feature is important because it allows ML practitioners to
"deploy" their own models without asking a colleague, who might be busy with
other things, to do it for them. Self-service is super important throughout.

Right now, we're focusing on fixing bugs, improving tests, and adding
monitoring before going back to feature development. Some features we were
working on are more flexible and scalable model deployment strategies,
monitoring, collaboration, retraining, data streams, and building the SDK.

~~~
Jugurtha
One of the problems is that the demos at PyData or Spark Summit and whatnot do
not survive first contact with reality, even for really simple things.

For example, some libraries expect a filepath for their data. Say you want to
use Keras from a notebook and your data lives somewhere other than on disk. If
your job isn't to write blog posts on ML deployments, but you have real
clients who expect you to explore data, build, deploy, and manage models, then
build good-looking applications that use them, with money on the line rather
than toy projects, you suddenly have to dive into the framework internals to
make it work with, say, object storage.
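
The usual workaround for the filepath expectation is to materialize the object to a local temp file first. This is just a sketch with a made-up storage-client interface (anything with a `get_object(bucket, key) -> bytes` method, e.g. a thin wrapper you'd write around a boto3 or Min.io client), not our actual code:

```python
import os
import tempfile

def materialize(storage_client, bucket, key):
    """Download an object to a local temp file and return its path, so a
    filepath-only API (like some Keras loaders) can read it.
    `storage_client` is a hypothetical wrapper exposing
    get_object(bucket, key) -> bytes."""
    fd, path = tempfile.mkstemp(suffix=os.path.basename(key))
    with os.fdopen(fd, "wb") as f:
        f.write(storage_client.get_object(bucket, key))
    return path

# Stand-in client for illustration; a real one would talk to object storage.
class FakeStorage:
    def get_object(self, bucket, key):
        return b"fake image bytes"

path = materialize(FakeStorage(), "datasets", "cats/001.jpg")
# keras.preprocessing.image.load_img(path)  # filepath API is now satisfied
```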

Another example: say your project is image classification and you have 100k+
images. Min.io does not support pagination because it's not really "S3", so
you have to build pagination for your users, because you're displaying the
bucket like a directory and it must act like a directory. The way Min.io does
it in their front end is to download the whole list recursively and then do an
infinite scroll. That can be 20 MB+ of data over the network. It works great
if you have great internet bandwidth, but in many parts of the world with
maybe 1 Mbps (notice Mbps, not MBps), it won't work when a user just wants to
"explore the directory structure".
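
The pagination you end up writing is roughly the classic continuation-token loop. A sketch, with a fake backend standing in for an S3-style listing API (the `list_page(token, page_size) -> (items, next_token)` shape is a hypothetical wrapper, modeled on what boto3's `list_objects_v2` gives you via `ContinuationToken`):

```python
def paginate(list_page, page_size=1000):
    """Lazily walk a token-paged listing API. `list_page(token, page_size)`
    must return (items, next_token), with next_token=None on the last page."""
    token = None
    while True:
        items, token = list_page(token, page_size)
        yield from items
        if token is None:
            break

# Fake backend with 2500 "objects", to show pages are only pulled on demand.
OBJECTS = [f"img_{i:05}.jpg" for i in range(2500)]

def fake_list_page(token, page_size):
    start = token or 0
    page = OBJECTS[start:start + page_size]
    nxt = start + page_size if start + page_size < len(OBJECTS) else None
    return page, nxt

# The UI only needs the first screenful, not the whole 20 MB listing.
first_fifty = [key for _, key in zip(range(50), paginate(fake_list_page))]
```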

Heck, one of our colleagues wasn't using the product, and when pressed, she
said the notebook was taking forever to load. There were 30 megabytes of
static files being downloaded and she had 5 kB/second during lockdown. We dug
into rebuilding it, then compressing static files, then caching. She was also
having trouble using the AppBook on real projects in vision, for example
specifying a data source and having the boxes display properly.

One way we're developing the product is by going through the _real projects we
have worked on, with real data_, and redoing them retrospectively on our
platform to make sure it works on real problems. We're not optimizing for a
demo at an event; we're optimizing for something that really works for us,
because we don't have teams of "data scientists", "ML engineers", and
"deployment engineers", and we want the couple of ML practitioners we have to
be able to get data projects running in a self-service way, which means that
by definition we have to inherit all the complexity we're trying to spare
users.

The same problems appear when you can't trivially create an "empty bucket".
Users don't care that S3 is not the same as a filesystem; you're pretending it
is by showing a "folder" icon, and you damn well better get it to work like a
"folder" where one can create structures for image classes and then traverse
them. The API does not allow that, so you have to write the code to give it
the look and feel of a directory, and you must thus write something that makes
"pagination" work to display hundreds of thousands of images. And that's just
100k+ images, not millions or billions. But you wouldn't hit that problem with
a hello world example or in the talk you give.

Take the deployment problem, for instance. Yes, you see the example and it
looks great. Then you try to reproduce the example in the repo, and it does
not work.

Let's say you use MLflow to "deploy". It has a client and a server. As you'd
expect, the client makes a request to the server, and the server does
"things". But let's say you're deploying a model that's in object storage: the
object storage credentials must be put server-side _and_ client-side. You
can't just make a request from the client to save a model and have the server
handle it in the backend with whatever storage you're using. No, you must
specify the object storage URL and credentials in the client code.

Which means, if you don't want to play house, you have to proxy requests and
authenticate them in a man-in-the-middle fashion between the MLflow client and
the MLflow server itself, just so your credentials don't leak.
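
Concretely, with MLflow's S3-compatible artifact storage, the client-side requirement boils down to environment variables like these (the hostname and keys below are placeholders); this is what ends up having to be injected into, or proxied away from, every user's notebook:

```python
import os

# These must be set in the *client* process (the notebook), not just on the
# tracking server, because the MLflow client uploads artifacts to object
# storage directly rather than routing them through the server.
# All values below are placeholders.
os.environ["MLFLOW_S3_ENDPOINT_URL"] = "http://minio.example.internal:9000"
os.environ["AWS_ACCESS_KEY_ID"] = "placeholder-access-key"
os.environ["AWS_SECRET_ACCESS_KEY"] = "placeholder-secret-key"
```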

This would be mitigated by running Min.io in multi-tenant mode so each user
has their own "object storage", but Min.io does not have an API for that (user
creation, etc.); you must do it with their `mc` client. Which means you have
to create users on the fly and wrap those commands.
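
"Wrapping `mc`" amounts to shelling out once per user. A sketch (the alias and user names are hypothetical, and a real version would also attach a policy scoping each user to their own bucket prefix):

```python
import subprocess

def mc_user_add_cmd(alias, username, secret):
    """Build the `mc admin user add` invocation used to provision a per-user
    Min.io identity. `alias` is the mc alias for your Min.io server."""
    return ["mc", "admin", "user", "add", alias, username, secret]

def provision_user(alias, username, secret, run=subprocess.run):
    # In production you'd check the result and then attach a per-user policy;
    # `run` is injectable so the wrapper can be exercised without mc installed.
    return run(mc_user_add_cmd(alias, username, secret), check=True)

cmd = mc_user_add_cmd("myminio", "student42", "s3cretkey")
```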

There's also the problem of workload scheduling, notebook collaboration, and
versioning. You give 2 GB of RAM? OK. Users need way more. What do you do
next? Give 100 GB of RAM? Make it elastic? How do you deal with "runaway
models" (as opposed to Instagram models) that are hemorrhaging your resources?
You have to think about resource management and workload management. Do you
institute quotas so that one user doesn't monopolize all the resources?
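
For the quota question, JupyterHub at least exposes per-user limits in its config, though whether they're enforced depends on the spawner. A minimal fragment:

```python
# jupyterhub_config.py -- a sketch; enforcement depends on the spawner
# (container-based spawners like DockerSpawner/KubeSpawner honor these).
c.Spawner.mem_limit = "2G"       # hard per-user memory ceiling
c.Spawner.mem_guarantee = "1G"   # amount reserved for each user
c.Spawner.cpu_limit = 2.0        # cap on CPU cores per single-user server
```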

How do you deal with real-time collaboration and versioning? Because, you
know, you're working on real projects with real people. Do they have to
version their notebooks? They don't know how to use Git when they do ML. Do
you hack on the Contents API and write a custom ContentsManager? Do you dig
through operational transformation or CRDTs to give it the look and feel
people expect from collaboration these days?

It is that stitching, and managing the idiosyncrasies of these fragmented
tools, that makes the posts I read on data science Medium blogs, or the talks
I watch about machine learning lifecycle management, completely shock me. I
really would love it to be that way, but it simply isn't. Maybe it is when
you're toying with a Jupyter notebook or on Kaggle, training a model on data
on your disk and wrapping a Flask application around it, then writing a blog
post on how easy it is.

Let's then say you have "deployed" your model with the super ML lifecycle
management library, which really just starts a process and launches a Flask
application. How do you shut it down or manage it? What about drift? How do
you retrain it? Do you use Airflow or NiFi or the like? Who configures them,
the user? What's the schedule?

So, yes... I understand why your question is really: "Since everybody has it
figured out and blogs about it and demos it at conferences, am I that stupid,
or is everyone full of baloney? Is there something everyone knows that I
don't?"

