
Wootric moved ML deployment pipeline from AWS to GCP - prabhatjha
https://engineering.wootric.com/move-from-aws-to-gcp-ml-deployment-pipeline/
======
minimaxir
Given the architecture demands and shortcomings noted in the article, it would
likely be more efficient to cut out the overhead incurred by Cloud Run and AI
Platform and just run everything in Kubernetes Engine (backed by Knative for
Cloud Run-esque autoscaling). This would also solve the latency and scale-to-
minimum-size issues, and it would likely be cheaper in the long term, at the
cost of a bit more configuration to get it started.
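
Roughly, it could look like this with the Kubernetes Python client (the
service name, image, and scale bounds below are placeholders, not anything
from the article): a Knative Service with a minScale annotation keeps one pod
warm, which avoids scale-to-zero cold starts.

    from kubernetes import client, config

    # Hypothetical Knative Service: one pod always warm, scaling up to 10
    # under load. Name, image, and bounds are illustrative placeholders.
    manifest = {
        "apiVersion": "serving.knative.dev/v1",
        "kind": "Service",
        "metadata": {"name": "model-server", "namespace": "default"},
        "spec": {
            "template": {
                "metadata": {
                    "annotations": {
                        "autoscaling.knative.dev/minScale": "1",
                        "autoscaling.knative.dev/maxScale": "10",
                    }
                },
                "spec": {
                    "containers": [
                        {"image": "gcr.io/my-project/model-server:latest"}
                    ]
                },
            }
        },
    }

    config.load_kube_config()
    client.CustomObjectsApi().create_namespaced_custom_object(
        group="serving.knative.dev",
        version="v1",
        namespace="default",
        plural="services",
        body=manifest,
    )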

~~~
prabhatjha
Thanks for the tips. Having everything colocated in a k8s cluster will
definitely help with latency and probably overall infra cost, but it will come
at the expense of engineering time spent running a k8s cluster in prod.

Fingers crossed that we keep growing, which would mean we can justify working
on a v2 architecture.

------
marinhero
Can you elaborate on what changed from one architecture to the other regarding
Custom Models? It feels like it would still be easier to do that on AWS, since
you control more of the stack. Isn't that the case?

~~~
rsmith49
While AWS does allow more hands-on control of the model stack, our "Custom
Models" are more just different sets of weights using the exact same
methodology. Basically, each customer that creates their own custom model is
plugging into our existing framework, but with a different configuration.

Because of this, GCP's AI Platform allows us a more microservices-style
approach to interacting with the ML models themselves, as opposed to our
previous deployment strategy on AWS, which put all of the models into one big
bucket on every instance that was serving requests.
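
Roughly, the interaction looks like this (the project and model names are
made up for illustration); each custom model is just another named model
resource behind AI Platform's online prediction API, called the same way:

    from googleapiclient import discovery

    # Client for AI Platform's online prediction API.
    service = discovery.build("ml", "v1")

    def predict(project, model, instances):
        # Each customer's "custom model" is a separately named model
        # resource; the calling code is identical for all of them.
        name = "projects/{}/models/{}".format(project, model)
        response = service.projects().predict(
            name=name, body={"instances": instances}
        ).execute()
        if "error" in response:
            raise RuntimeError(response["error"])
        return response["predictions"]

    # e.g. route a request to one customer's own model (names made up)
    predict("my-project", "nps-custom-acme", [{"text": "Great product!"}])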

Hope that answered your question!

------
ambaragrawal
Setting aside the scaling issue with the v0 AWS architecture for a moment:
would it be right to say that the v0 issues like excessive load times were
solved by decoupling the models from the Flask app (e.g. making batch
prediction calls to each necessary model for the current request) rather than
by changing the architecture itself?

Was it really that hard to make batch prediction calls to each necessary
model for the current request on AWS?

~~~
rsmith49
Making the calls to the corresponding model from Flask was actually easier on
AWS, since the models were loaded into memory. Unfortunately, the scaling
issues/excessive load times were a big enough problem that we had to make
the switch, as our number of hosted models continues to grow.
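
For context, the v0 pattern was roughly this (an illustrative sketch, not our
actual code; model names and paths are placeholders): every model lives in
the web process, so calling one is just a dict lookup, but boot time and
memory grow with every model we add.

    import tensorflow as tf
    from flask import Flask, request, jsonify

    app = Flask(__name__)

    # Illustrative v0-style setup: all models are loaded into memory at
    # startup, so a prediction is a cheap in-process call, but startup
    # time and RAM grow linearly with the number of hosted models.
    MODEL_PATHS = {"base": "models/base.h5", "acme": "models/acme.h5"}
    MODELS = {name: tf.keras.models.load_model(path)
              for name, path in MODEL_PATHS.items()}

    @app.route("/predict/<model_name>", methods=["POST"])
    def predict(model_name):
        instances = request.get_json()["instances"]
        preds = MODELS[model_name].predict(instances)  # already in memory
        return jsonify(predictions=preds.tolist())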

------
LiamPa
Why not just have a keep alive/warm style serverless function to prevent the
start up latency?

~~~
prabhatjha
I am assuming you are talking about our deployment on GCP Cloud Run? We have
thought about sending a heartbeat API call. If we notice any user-experience
friction because of this lag, then we will definitely do that. As we said in
the blog, this has not been a major pain point as of today.
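
If we do it, it would be something simple along these lines (a sketch only;
the URL and interval are placeholders): a scheduled ping that keeps a Cloud
Run instance resident so user requests don't pay the cold-start cost.

    import time
    import requests

    # Hypothetical keep-warm heartbeat; in practice this would run from
    # Cloud Scheduler rather than a long-lived loop. URL is a placeholder.
    SERVICE_URL = "https://model-server-xyz-uc.a.run.app/healthz"

    while True:
        try:
            requests.get(SERVICE_URL, timeout=10)
        except requests.RequestException:
            pass  # a missed ping just means the next request eats the cold start
        time.sleep(300)  # ping every 5 minutes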

------
mohsen1
Love it when companies decide not to use Microservices "just because"!

~~~
prabhatjha
Totally. We did not have a need for custom models initially, and we could load
all our models on one VM, so there was no need for microservices. We were
tempted to get on the bandwagon. ;-)

------
msoad
What was the main motivation behind this? Is GCP cheaper?

~~~
rsmith49
Not inherently, but the way we're using it makes it cheaper. Our stack has ~50
different ML models being served live, and GCP makes it easy to treat each
model as a micro-service and give auto-scaling to each one.

This is in contrast to the easiest way we found to deploy the same
architecture on AWS using Elastic Beanstalk, which involved one really big
instance (one that was constantly growing as we added more models) and the
costs that come with that.

------
robinjha1
Just curious about the errors in v0 - were they resource allocation issues or
response timeout errors?

~~~
rsmith49
A bit of both. Flask was obviously not designed with serving TensorFlow models
locally in mind, which is how we had it set up in v0. Towards the beginning we
had to debug some weird threading issues, but towards the end, once it was
more stable (as a result of some hacky fixes), the timeouts were the main
issue.
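
For the curious, those threading issues are a known TF1-plus-Flask pitfall; a
common fix from that era (illustrative, and not necessarily the exact hack we
used; the model path is a placeholder) is to capture the graph the model was
loaded into and re-enter it on each request thread, with a lock to serialize
access.

    import threading
    import tensorflow as tf  # TF1-era API

    # Load once at startup, then remember which graph the model lives in.
    model = tf.keras.models.load_model("models/base.h5")
    graph = tf.get_default_graph()
    lock = threading.Lock()

    def predict(instances):
        # Flask handles requests on worker threads, but TF1's default
        # graph is thread-local, so re-enter the load-time graph here
        # and serialize access with a lock.
        with lock, graph.as_default():
            return model.predict(instances)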

