I'd like to propose an alternative... Build a model (once) on your dev machine. Copy it to S3. Do CPU inference in some microservice. Get the production system to query your microservice, and if it doesn't reply within some (very short) timeout, fall back to whatever behaviour your company was using before ML came along.
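The caller side of that can be a minimal sketch like this - the endpoint URL, payload shape and 50 ms budget are all invented for illustration:

```python
import requests

def get_score(customer_id: str, default_score: float = 0.0) -> float:
    """Query the scoring microservice; fall back to the pre-ML default on any failure."""
    try:
        resp = requests.get(
            "http://ml-scorer.internal/score",   # hypothetical internal endpoint
            params={"customer_id": customer_id},
            timeout=0.05,                        # the "very short" timeout
        )
        resp.raise_for_status()
        return resp.json()["score"]
    except requests.RequestException:
        return default_score                     # whatever you did before ML
```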
If the results of your ML can be saved (e.g. a per-customer score), save the output values for each customer and don't even run the ML in real time at all!
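For that variant, a nightly batch job is all it takes - something like this sketch, where the model format, file names and feature columns are made up:

```python
import pickle
import pandas as pd

FEATURES = ["recency", "frequency", "spend"]    # hypothetical feature columns

with open("model.pkl", "rb") as f:              # built once on a dev machine
    model = pickle.load(f)

customers = pd.read_csv("customers.csv")
customers["score"] = model.predict(customers[FEATURES])
customers[["customer_id", "score"]].to_csv("scores.csv", index=False)
```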
Don't handle retraining the model. Don't bother with high reliability or failover. Don't page anyone if it breaks.
By doing this, you get rid of 80% of the effort required to deploy an ML system, yet still get 80% of the gains. Sure, retraining the model hourly might be optimal, but for most businesses the gains simply don't pay for the complexity and ongoing maintenance.
Insider knowledge says some very big companies deploy the above strategy very successfully...
A model that can't be continuously trained inevitably rots due to data, interface or environment changes (and the code is typically very difficult to maintain across team members - if the author leaves, it's often a ticking time bomb). If you're OK with the model rotting, then it wasn't that important to your business to begin with. That trade-off is acceptable for some businesses, but not for all.
The next value add was consolidating data in a DB with a web UI so relevant non-devs can view it and help add to it, plus easily integrating different data sources with automatic validation. I wish there was a nice open source thing here you could spin up without much effort. There's middle ground between a complex/paid full solution that can scale to huge data sets and integrate with turking etc., and emailing CSVs or sharing Google Sheets links around.
If you're interested in ML Ops, I have a shameless plug to share: on November 19th I host a free online panel, "Rage Against the Machine Learning", with industry experts. 
In my experience, explaining results to the business is also a very time-consuming part of deploying a model.
There does seem to be a dearth of writing on the actual topic of deploying models as prediction APIs, however. I work on an open source ML deployment platform ( https://github.com/cortexlabs/cortex ) and the problems we spend the most time on/teams struggle with the most don't seem to be written about very often, at least in depth (e.g. How do you optimize inference costs? When should you use batch vs realtime? How do you integrate retraining, validation, and deployment into a CI/CD pipeline for your ML service?).
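To make the CI/CD question concrete, here's a sketch of the kind of promotion gate one might run as a pipeline step after retraining and validation - metric names and file layout are assumptions, not Cortex's API:

```python
import json
import sys

def promotion_gate(candidate: float, production: float, min_lift: float = 0.0) -> None:
    """Fail the CI step unless the candidate model beats the production model."""
    if candidate < production + min_lift:
        sys.exit(f"candidate AUC {candidate:.3f} does not beat production {production:.3f}")

with open("metrics.json") as f:   # hypothetical file written by the validation step
    metrics = json.load(f)

promotion_gate(metrics["candidate_auc"], metrics["production_auc"])
print("promotion gate passed; safe to deploy")
```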
Not taking anything away from the article of course - it is well written and interesting imo.
Using a fresh draw is difficult and expensive, especially since the labels may not be available. A/B testing is expensive; multi-armed bandits are more efficient, but again there is an optimisation element there (waits for shouting to start)
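For concreteness, an epsilon-greedy bandit - the simplest of the family - fits in a few lines. A sketch, assuming binary rewards and hypothetical variant names:

```python
import random

def choose_variant(stats: dict, epsilon: float = 0.1) -> str:
    """stats maps variant name -> (successes, trials)."""
    if random.random() < epsilon:
        return random.choice(list(stats))   # explore a random variant
    # exploit the variant with the best observed success rate so far
    return max(stats, key=lambda v: stats[v][0] / max(stats[v][1], 1))

def record(stats: dict, variant: str, reward: int) -> None:
    successes, trials = stats[variant]
    stats[variant] = (successes + reward, trials + 1)
```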
Additionally, surely there is a really significant qualitative judgement step for any model that is going to be used to make real-world decisions?
You would do model comparisons, quality checks, ablation studies, goodness-of-fit tests and so forth using only the training & validation portions.
Finally you test the chosen models (in their fully optimized states) on the test set. If performance is not sufficient to solve the problem, then you do not deploy that solution. If you want to continue work, now you must collect enough data to constitute at minimum a fully new test set.
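In scikit-learn terms, the discipline looks like this - a sketch, where `X`, `y` and `model` stand in for whatever data and estimator you're working with:

```python
from sklearn.model_selection import train_test_split

# 60/20/20 split; the test set is touched exactly once, at the very end.
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=0)

model.fit(X_train, y_train)                # all tuning decisions use X_val only
final_score = model.score(X_test, y_test)  # the single, final look at the test set
```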
The meetups are all on YouTube and have great topics like putting models into production, but also more interesting ones (to me) like ML observability and feature stores.
Their Slack channel is great too - I learned a lot about the reality of using Kubeflow vs the Medium article hype.
Any thoughts on that?
On a more serious note - as always, it depends: on the logic, on the company's processes and skills, on how frequently the logic will change, on visibility into the decision making, ... It may be a good idea, or it may bring in a lot of manual work.
It is very complex because most of the time there is no simple rule such as a threshold on the confidence score of the prediction. In practice it might be more like, “if the user has more than 7 items in their cart and if the user is not a returning customer that filled out personal data and the value of their cart is greater than $100 and they have not put a new item in the cart for 2 minutes, and the confidence score of the predictor is less than 0.4, THEN don’t show the next recommended item, just display a checkout link.”
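Written out as code (with hypothetical objects and the thresholds from the example above), that rule is pure business logic wrapped around a single model output:

```python
def show_next_recommendation(cart, user, confidence: float) -> bool:
    """Hypothetical rule: returns False when we should show a checkout
    link instead of the next recommended item."""
    if (
        len(cart.items) > 7
        and not (user.is_returning and user.filled_personal_data)
        and cart.value > 100
        and cart.seconds_since_last_add > 120
        and confidence < 0.4
    ):
        return False   # don't recommend; just display the checkout link
    return True
```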
And the number of items, the cart value limit, the time since last item-add, etc., will all be hotly debated by product management and changed 5 times every quarter.
In grad school there was a professor who said of machine learning that “parameters are the death of an algorithm” - so you want to avoid coupling extra business logic parameters tightly with the use of machine learning models.
I work in automatic train traffic planning, mainly for heavy-haul railways. Recently, we've been working on a regression model to predict train sectional running times based on historical data.
As our tool is used during real-time operation, we can't risk the model outputting an infeasible value. So we're thinking about defining possible speed intervals, e.g. (0 km/h, 80 km/h] for ore trains, and falling back to a default value if the predicted running time causes the speed to fall outside this range.
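A sketch of that guard, with illustrative bounds and names - the real intervals and defaults would come from operations:

```python
SPEED_BOUNDS_KMH = {"ore": (0.0, 80.0)}   # half-open interval (0, 80]

def safe_running_time(predicted_s: float, section_km: float,
                      train_type: str, default_s: float) -> float:
    """Accept the model's prediction only if the implied speed is feasible."""
    if predicted_s <= 0:
        return default_s                   # non-positive time is never feasible
    lo, hi = SPEED_BOUNDS_KMH[train_type]
    implied_speed_kmh = section_km / (predicted_s / 3600.0)
    if lo < implied_speed_kmh <= hi:
        return predicted_s
    return default_s                       # infeasible: fall back to the default
```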
I think there might be some better research on this... I haven't bothered to look it up much.
You very much should be choosing models that actually map to the physical reality of your problem domain, rather than using something that is fundamentally unphysical for your use case and attempting to correct it with hand-made business logic.
That said, there is a piece here on TFX, which is valuable in this context. I also think the advice about going with proprietary tools that speed up the process is good. Tools like Microsoft's AI tooling, Dataiku and H2O fit that description.
I would have liked to have seen some discussion around when you should deploy a model as an API vs generating batch predictions and storing them - I've done both on a test bench, but I don't really know how well the API scales.
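For reference, the realtime option can be as small as this FastAPI sketch (model file and payload shape are invented); the batch option is essentially a scoring loop on a cron schedule writing to a table:

```python
import pickle
from typing import List

from fastapi import FastAPI
from pydantic import BaseModel

with open("model.pkl", "rb") as f:   # hypothetical pickled estimator
    model = pickle.load(f)

app = FastAPI()

class Features(BaseModel):
    values: List[float]

@app.post("/predict")
def predict(features: Features):
    # predict() expects a 2-D batch, hence the wrapping list
    return {"prediction": float(model.predict([features.values])[0])}
```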
This seems to be a common theme of a lot of articles about 'how to put ML models in to production'.
Since basically everyone is on K8s, I'm wondering if Kubeflow isn't the more natural fit.
This workflow also doesn't work well in hybrid on-prem + cloud environments because, for example, your model training might run in a cloud Spark task, but your CI pipeline (responsible for building and publishing a container to an on-prem container repo) might run on-prem. Kubeflow, for example, has a hard requirement to put containers into cloud container registries, and makes assumptions about the networking allowing connections between on-prem and cloud container resources.
I think the industry shifting focus to Kubeflow is actually a giant mistake.
What I'm seeing is that ML/data engineering is diverging from devops reality and building its own orchestration layers, which is impractical except at the largest orgs.
I've yet to find something that fits in with Kubernetes, which is why it seems everyone here is using fully managed solutions like SageMaker.
Kubernetes is very poor for workload orchestration for machine learning. It’s ok for simple RPC-like services in which each isolated pod just makes stateless calls to a prediction function and reports a result and a score.
But it’s very poor for stateful combinations of ML systems, like task queues or robustness in multi-container pod designs for cooperating services. And it is especially bad for task execution. Operating Airflow / Luigi on k8s is horrendous, which is why nearly every ML org I’ve seen ends up writing their own wrappers around native k8s Job and CronJob.
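Those wrappers tend to look something like this sketch using the official `kubernetes` Python client - the function name, defaults and namespace are illustrative:

```python
from kubernetes import client, config

def submit_training_job(name: str, image: str, command: list, namespace: str = "ml") -> None:
    """Thin wrapper over the native Job API: one container, no restarts, a retry budget."""
    config.load_kube_config()
    job = client.V1Job(
        metadata=client.V1ObjectMeta(name=name),
        spec=client.V1JobSpec(
            backoff_limit=2,   # retry policy lives here, not in an orchestrator
            template=client.V1PodTemplateSpec(
                spec=client.V1PodSpec(
                    restart_policy="Never",
                    containers=[client.V1Container(name=name, image=image, command=command)],
                )
            ),
        ),
    )
    client.BatchV1Api().create_namespaced_job(namespace=namespace, body=job)
```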
Kubeflow can be thought of as an attempt to do this in a single, standard manner, but the problem is that there are too many variables in play. No org can live with the different limitations or menu choices that Kubeflow enforces, because they have to fit the system into their company's unique observability framework or unique networking framework or unique RBAC / IAM policies, etc. etc.
I recommend leveraging a managed cloud solution that takes all that stuff out of the internal-datacenter model of operations, moving it off of k8s, and only using systems you have end-to-end control over (e.g. do your own logging, do your own alerting, etc. through vendors & cloud - don't rely on SRE teams to give you a solution, because it almost surely will not work for machine learning workloads).
If you cannot do that because of organizational policy, then create your own operators and custom resources in k8s and write wrappers around your main workload patterns, and do not try to wedge your workloads into something like Kubeflow or TFX / TF Serving, MLflow, etc. You may have occasional workloads that use some of these, but you need to ensure you have wrapped a custom "service boundary" around them at a higher level of abstraction, otherwise you are hamstrung by their (deep-seated) limitations.
Airflow is useful as a component of an ML platform, but even in principle it can only address a really tiny part of the requirements.
You also need to ensure Airflow can easily provision the required execution environment (e.g. distributed training, multi-GPU training, heavily custom runtime environments).
Overall, Airflow isn't a big part of ML workflows - just a small side tool for a small subset of cases.
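To be fair, that small role is easy to fill - a retraining DAG is only a few lines. A sketch assuming the Airflow 2.x import path, with stub task bodies:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def retrain():
    pass   # stub: e.g. submit the training job to the cluster

def validate():
    pass   # stub: e.g. run a promotion gate against held-out data

with DAG("retrain_model", start_date=datetime(2020, 11, 1),
         schedule_interval="@weekly", catchup=False) as dag:
    t1 = PythonOperator(task_id="retrain", python_callable=retrain)
    t2 = PythonOperator(task_id="validate", python_callable=validate)
    t1 >> t2   # validate only runs after retraining succeeds
```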