As a researcher applying ML in my work, the OP reads like me looking at web frameworks from the outside: there's a LOT of noise, but in practice people run just the same 3 things.
For example, when I first got into ML the ADAM optimizer was the new big thing; since then, hundreds of 'better' optimizers have been published. This paper from August '21 shows that most of that is overblown and no optimizer consistently outperforms ADAM: https://arxiv.org/pdf/2007.01547.pdf
I'd even go so far as to say that OP's big image of MLOps is misleading. 'Data Science Notebooks' shows Jupyter, Binder, Colab (I can't see the other logos), but Binder and Colab both run Jupyter notebooks? Under ML platforms there are a ton of companies in this picture which effectively do the same thing. Some of these logos are tools (Jupyter, R), some are companies using ML in some way or other (John Deere, Siemens) - and once you go down this path you might as well put any mid-sized company in the world onto this.
OP doesn't weight each technology by adoption -- if they did, then yeah, there are some centroids. But I think their point is correct that parts of the ecosystem remain quite fragmented.
> For example, when I first got into ML the ADAM optimizer was the new big thing, since then hundreds of 'better' optimizers have been published. This paper from August '21 shows that most of that is overblown and no optimizer consistently outperforms ADAM: https://arxiv.org/pdf/2007.01547.pdf
FWIW this paper acknowledges that optimizer choice is application-dependent, yet draws its conclusions from < 10 canned datasets (mostly classification). "No optimizer consistently outperforms ADAM" is a misleading statement -- I think a more correct one is "no optimizer consistently outperforms ADAM on {MNIST, CIFAR-10, SVHN}". If you do work on much larger and more novel architectures / datasets, I think you'll find the advice to "just use Adam" to be insufficient. At least at the orgs I've worked at, folks often do a (sometimes smart) param sweep to find the best optimizer / settings -- if you're lucky enough to have infra that supports this :)
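For a sense of what such a sweep looks like in practice, here's a minimal PyTorch sketch; `build_model` and `train_and_eval` are hypothetical placeholders for your own model factory and training loop, and the search space is illustrative:

```python
# Minimal sketch of an optimizer / learning-rate sweep.
# `build_model()` and `train_and_eval()` are made-up helpers, not real APIs.
import itertools
import torch

optimizers = {
    "adam":  lambda params, lr: torch.optim.Adam(params, lr=lr),
    "adamw": lambda params, lr: torch.optim.AdamW(params, lr=lr),
    "sgd":   lambda params, lr: torch.optim.SGD(params, lr=lr, momentum=0.9),
}
learning_rates = [3e-4, 1e-3, 3e-3]

results = {}
for name, lr in itertools.product(optimizers, learning_rates):
    model = build_model()                               # hypothetical model factory
    opt = optimizers[name](model.parameters(), lr)
    results[(name, lr)] = train_and_eval(model, opt)    # hypothetical train loop, returns a val metric

best = max(results, key=results.get)
print("best config:", best, "val metric:", results[best])
```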
That aside and more generally, there certainly are commonly accepted best practices in ML. But that doesn't mean ML organizations all work the same way, and the OP's point is that the dust has not settled on what the best tools are to do X.
Except where's the data extraction, warehousing, labeling, deployment, and eval infra here? Are you calling `model.fit` on distributed training infra or just on your local machine? I think that's the OP's point.
I work as a Senior Software Engineer and DevOps for an AI consultancy, and I've been dealing with MLOps for the past 2 years.
I'm mostly aligned with what the article says; MLOps today is definitely in a frazzled state. However, I disagree on the following points:
- Google et al. are not good examples to follow for ML deployment best practices. Sure, their sophistication is higher, but they also have a lot more staff to handle "the other side" of modelling: Data, Infrastructure, Tooling. They build tools to suit their needs as big SaaS products holding *bytes of already organized data ready to be used for data science, and it's a really different perspective from a large retailer trying to get better sales forecasts.
- Vendors from all sides have really, really badly fitting products for Data Science and ML in general. "Platforms" are trying to profit off MLOps with a commercial product claiming to be the silver bullet to every pain your team has. Three months later, it's just another life-sucking lock-in with a list of tickets to be addressed. We really miss a new "Docker" here. A few examples: Databricks? It's a Spark-as-a-Service platform with the worst APIs you could imagine. Git-for-data vendors? They understand neither Git nor data: is a model data or code? Both?
Finally, ML at Reasonable Scale builds on top of regular software engineering best practices. If you wouldn't store a shell script's output in a repository, don't store a notebook's output either. Same goes for Idempotency, Reproducibility (of model training), Composability (of pipeline steps), (data) Versioning, etc.
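To make the notebook point concrete, a minimal sketch of stripping outputs before committing (the same thing tools like nbstripout automate); the notebook path is made up, the nbformat calls are real:

```python
# Strip outputs from a notebook so the repo only tracks source,
# analogous to not committing a shell script's output.
import nbformat

path = "analysis.ipynb"  # hypothetical notebook
nb = nbformat.read(path, as_version=4)
for cell in nb.cells:
    if cell.cell_type == "code":
        cell.outputs = []          # drop rendered outputs
        cell.execution_count = None
nbformat.write(nb, path)
```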
It’s astonishing how similar all these MLOps platforms are as well. It always starts with some notebook capability and then goes end-to-end / cradle-to-grave through model serving.
Model serving, real-time/online inference at scale, has been one of the most complex tasks I’ve worked on. It’s always funny seeing these platforms offer the same solution: some UI wrapper over a Kubernetes deployment.
I've worked on ~5 ML products over my career across startups, unicorns, and FAANG.
The general factors I've observed during scale-up are common across all projects.
1. Very few people know how the underlying model works, or why it works - many of the people you would expect to know do not.
2. The business value of marginal improvements to the model or extending to new use cases in the business is much smaller than you would hope. Or more advanced methods introduce unfortunate tradeoffs which make them impractical.
3. Whenever you add new modelers to the mix, they want to use radically different technology - reducing the effectiveness of platform efforts. Many of these divergent efforts do not produce net gains and instead come down to the old use X instead of Y type debates.
4. Few modelers want to touch anyone else's code. Most believe internal tools will inherently be garbage (which they often are).
5. Platform efforts tend to spiral into cost pits.
6. Investment in the product area is largely dependent on leadership buy-in for ML related efforts, usually with a big initial thrust - moderate gains, then slow wind down.
This reminds me of systems engineering and DevOps pre-cloud, and tells me that companies are going to want to outsource this tooling as fast as possible. I'd also expect that there will be a good market for directly offering specific customizable platforms for things like ASR, Recommendations, Search, Computer Vision, and others - but the challenges in 1 & 3 will make this a tough sell.
The problem is scaling up. You do 2, 3, maybe 5 projects the "hard way", manually deploying the data pipelines, model images, API, etc., and then you're like... there has got to be a better way. It's the difference between a side project and a full department supporting the entire enterprise's needs.
It covers differing levels of maturity, with Maturity Level 0 being what the OP was mentioning. Our team is trying to dig out of Level 1 to Level 2.
AWS has SageMaker Pipelines, which you can spin up into a default MLOps setup through the SageMaker Studio -> Projects -> Default Projects(?) flow. It's a bunch of stuff effectively vended through Service Catalog, so you're free to modify and inspect it all.
I would consider defining MLOps more broadly than the article (which only discusses production).
For example, it can be as simple as wondering how to organize, collaborate on, and version lots of Jupyter notebooks! Or how to keep track of and compare different models/hyperparameters/datasets, not only individually but also across a team. Neither is discussed in typical data science courses.
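As one example of the tracking part, a minimal sketch using MLflow (one option among many); `train_model` and the parameter values are placeholders:

```python
# Log params and metrics per run so teammates can compare experiments later.
import mlflow

mlflow.set_experiment("churn-model")  # hypothetical experiment name
with mlflow.start_run():
    params = {"lr": 1e-3, "n_estimators": 200, "dataset": "2024-01-snapshot"}
    mlflow.log_params(params)
    model, val_auc = train_model(**params)   # hypothetical training helper
    mlflow.log_metric("val_auc", val_auc)
```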
Interesting, hah - I'm of the opinion that even just the first 3 steps you listed, before you actually get to the learning part itself, can each contain a world of hidden assumptions that can get you mired in a cycle of constantly repairing and tweaking pipelines.
You're totally right. All of the ML hype out there isn't helpful either.
The hardest part for me is not the tooling or processes, but ironing out expectations, assumptions, what can and can't be done (or what's worthwhile) with ML.
Getting to a common understanding about metrics and outcomes with an end user / customer can take chunks of time out of a project.
The problem is simple, except that Data Scientists seem to resist good software engineering practices and storm ahead to the model development part which leaves a lot of mess for engineers to pick up. MLOps then looks like a 'difficult' problem because of this mess.
That's quite a simplified and antagonistic comment.
Data scientists have their own problems too, like crappy data (if any) and unclear business objectives. This usually means fast iteration at all costs. That's why Python or R are the languages of choice, which are not precisely the most amenable to deployment. Now, think about their KPIs vs yours and everything falls into place.
I view the job of the MLOps framework as essentially a contract where "You place your messy code that does X here, and we'll take care of turning it into a retrainable, resilient, secure API endpoint without any extra fuss". This requires the software engineer/DevOps person to think through ahead of time what the data scientist will need to do as part of their process, templatize it, and form a good abstraction. The problem is that the abstractions are often leaky, so some constraints need to be put on the data scientist. This also requires the MLOps engineer to work on new feature development to form a new abstraction that handles the new features. "Oh, you want to do active learning? What does that look like?"
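A minimal sketch of what that contract can look like; the class and method names are illustrative, not a real framework:

```python
# The data scientist fills in a small interface; the platform owns everything
# around it (serving, retraining, monitoring). Everything here is made up.
from abc import ABC, abstractmethod
from typing import Any

class ModelContract(ABC):
    @abstractmethod
    def load(self, artifact_uri: str) -> None:
        """Load weights/preprocessing from a versioned artifact."""

    @abstractmethod
    def predict(self, payload: dict) -> Any:
        """Map a raw request payload to a prediction."""

class SpamModel(ModelContract):            # what the data scientist writes
    def load(self, artifact_uri: str) -> None:
        self.model = ...                   # e.g. joblib.load / torch.load

    def predict(self, payload: dict) -> Any:
        return {"spam_score": 0.42}        # placeholder inference

# The platform side then wraps any ModelContract into a retrainable,
# monitored HTTP endpoint without the data scientist touching that code.
```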
But whose job is it to run those steps? And note that most of what I think of as DevOps might have an analogy, but no direct equivalent.
Of course this is all solvable; the thing is that it is new. DevOps assumes that changes are to the code, and has processes and tools for managing code changes.
(I'm biased, as I'm the CTO/co-founder of Efemarai)
Yes, there are usually several changes that need to be tracked - code/model changes (those usually happen early on and then stabilise), input/code changes (e.g. pre-processing the data with either new transformations or _other_ models), and data changes (both changes for training and testing). At Efemarai we are thinking about it as: any change to the above should automatically trigger a test suite for the model/process. And by "test" we mean not just the different forms of unit testing the input/output formats and sizes from the model, but also unit tests on model performance over the data you've collected, plus stress testing the model with data it is expected to see in production.
So in reality, it's indeed nothing new, but the standard DevOps pipeline needs to be extended to work with the ML assumptions.
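As a rough illustration of what those extra checks can look like, plain pytest-style tests; `load_model`, `load_eval_set`, the shapes and the 0.90 threshold are all assumptions, not Efemarai's API:

```python
# Run these on every code/data/model change, alongside the usual CI tests.
import numpy as np

def test_output_shape_and_range():
    model = load_model("latest")                  # hypothetical loader
    x = np.zeros((8, 3, 224, 224), dtype=np.float32)
    y = model.predict(x)
    assert y.shape == (8, 10)                     # expected (batch, classes)
    assert np.all((y >= 0) & (y <= 1))            # probabilities

def test_accuracy_does_not_regress():
    model = load_model("latest")
    x, labels = load_eval_set("holdout-v3")       # versioned eval data
    acc = (model.predict(x).argmax(axis=1) == labels).mean()
    assert acc >= 0.90                            # agreed minimum bar
```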
Maybe I’m a bit naive, but I’m convinced any great traditional software engineer or DevOps engineer, in combination with a data scientist/ML person, should be able to set up the ops pipeline for an ML project. The details and algorithms themselves may look new and exciting, but operationalizing algorithms isn’t a totally new thing.
Dunno if this technically counts since it's managed, but I'm a semi devops-y person (had all the buzzwords foisted on me and just learned on the job by making all the mistakes) and guided an ML intern through setting up their ML pipeline in AWS. A few more mistakes from unfamiliarity with AWS's ML-specific options at the time notwithstanding, it was certainly not the most traumatic devops experience I've had.
IIRC the only stuff to learn there was (a) PyTorch (I think that was the right ML bit) and (b) the AWS black magic, and the intern ended up picking up both once I got them pointed in the right general direction on the devops end.
In my experience, you’re right. A team of software engineers & platform engineers is doing its normal job deploying production-grade models at scale. Caveat: this requires deeper or more advanced computer engineering in practice.
DevOps roles are generally expanding into data engineering more and more. Again, in my case, the DevOps/platform team creates all the automation to move a lot of data around daily.
In terms of model training, we essentially treat it like a supply chain where the software team is closely engaged with the model team, who aren’t even really deep data scientists themselves -- just software engineers focused on building AI/ML models.
ML models are unlike traditional algorithms in that most need to be retrained frequently to deal with distribution shifts, and monitoring to ensure your model is still accurately modeling the real world distribution is a key component. If you are someone relying on ML to do a good job / needing to continuously improve your ML models, you end up caring a lot about how much it costs to retrain, how ongoing labeling operations are going & how quickly it goes from newly-labeled data --> retrained model --> evaluated model --> deployed model.
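A minimal sketch of the monitoring half of that loop -- flagging features whose live distribution has drifted from the training distribution -- using a simple two-sample KS test; the dataframes and the p-value threshold are assumptions:

```python
# Flag numeric features whose live distribution differs from training data.
import pandas as pd
from scipy.stats import ks_2samp

def drifted_features(train_df: pd.DataFrame, live_df: pd.DataFrame, p=0.01):
    flagged = []
    for col in train_df.select_dtypes("number").columns:
        stat, pvalue = ks_2samp(train_df[col].dropna(), live_df[col].dropna())
        if pvalue < p:                 # shift looks significant
            flagged.append((col, stat))
    return flagged                     # candidates for relabeling / retraining
```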
Furthermore, ML orgs often have to make labeling and training efficient by sharing resources & strategically triaging only the most impactful experiments / picking the most impactful things to label.
There's also the pace of progress in the field, and often organizations will have dedicated data science / ML folks who want to run experiments to improve the models upon each retraining. The infra needed to unlock rapid prototyping is quite a lot.
If you have a lot of models and you're re-building the end-to-end stack each time, you end up with a ton of wasted work. A lot of this stuff is also pretty specialized. It's a bit like asking all your engineers to set up their own web server, proxy, monitoring / alerting, maybe a load balancer, etc etc. Plenty of people know how to do it, and for small companies doing 1 or 2 ML models maybe that's fine. But for corporations at scale it makes no sense to do it like that. And when you're at scale, a small team probably works on the load balancers vs the base infra vs the software platform vs the features. ML works the same way.
Some real world examples where you can imagine how much infra you need within ML-heavy organizations:
- Say you want to build a _motorcycle detector for a self-driving car_. You need to build a data extraction pipeline that processes images, gets the segmented objects, sends each object for labeling, then when you have all your labeled data, you need to split it into test/train/validation (and make sure you use the same splits as everyone else building the car's software), then you need to have these piles of images integrated with additional information needed for training (e.g. how fast the object was moving, what time of day), you need to upsample/downsample certain cases (e.g. maybe you need to upsample examples of motorcycles at night), then train a net (locally, or in the cloud, or maybe as part of N experiments to tune, on shared infrastructure that M people are using where jobs need to be prioritized), then evaluate (do you need to build infra to harvest important metrics, like how well your model performs when the car is driving on a slope? in fog?), then optimize for onboard inference (are you going to run it on CPU? GPU? Accelerator? Do you optimize it using TensorRT? Your own quantization infra? Distillation?), then deploy (and monitor -- is the model eating too much memory during inference? Being run too many times? Not doing the right thing?). Okay, your model works -- are you sure it will keep working when motorcycles look different 10 years or even just 1 year from now?
- Say you want to build a _spam detector for your social media website_. You do everything above, build and deploy your model to the cloud, and suddenly you realize it's not working, a new spam campaign has occurred that your model can't account for. You need to add more labeled data, but how much and where are you going to get it? After adding it, what does your overall data look like? Adding it didn't help your net as much as you expected, why? The model-level eval looked improved, but combined with the rule system, it got worse. Crap, how to debug? Okay, finally working, how stale is the data in your model after 1 year? Did we regress on something when we solved the spam campaign? You have a computational budget for how big the net can get, because it's used in real time to judge the spamminess of posts on a major website -- maybe you care about what hardware your model is running on and how to best optimize for that hardware. Maybe you use cloud TPUs, where large batch sizes help you to scale. Maybe you use Graphcore or something that thrives on small batch sizes. What if you started on one, moved to the other, and suddenly your net isn't working as well? What if you upgrade from an RTX 2080 Ti to an RTX 3080 Ti and see that your net has a prediction regression? Do you have infra to detect these regressions? Over time, when your data got an order of magnitude bigger, you noticed that your net's hyperparams were no longer optimal. You needed to increase your learning rate, or decrease it. Did you notice this issue, and do you have the infra to do that tuning quickly? You notice your labeling budget is too small to label everything flagged as spam. How do you decide which things are most worthwhile to label?
You have to build infra for all of this. MLOps are needed every step of the way. It's not that different from needing SREs and cloud infra engineers to run your cloud services & organizations.
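To pick one small, concrete slice of that infra: keeping train/val/test splits consistent across everyone working on the same data (as in the motorcycle example above) is often done with a deterministic hash on a stable example ID. A minimal sketch, with made-up ratios and ID format:

```python
# Assign each example to a split by hashing its ID, so the assignment is
# stable across runs, machines and teams. 80/10/10 is illustrative.
import hashlib

def split_for(example_id: str) -> str:
    bucket = int(hashlib.sha256(example_id.encode()).hexdigest(), 16) % 100
    if bucket < 80:
        return "train"
    if bucket < 90:
        return "val"
    return "test"

print(split_for("frame_000123/object_7"))  # always returns the same split
```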
Labeling infra alone is a big enough market for companies like Hive and Scale to build billion-dollar businesses.
Is there really a difference between "labelling data" and assigning properties to "traditional" inputs (e.g. assigning tax codes, classifying new products, filing cases, managing customer data, ...)?
Is there something fundamentally unique about sharing and monitoring data during ML training as opposed to say feedback loops between trading algorithms and profit or production planning, logistics and market response?
Or to address your examples, wouldn't the same issues as in your motorcycle detector arise with any other software implementation? Hardware constraints, runtime limitations and requirements are in no way unique to ML, after all.
The same applies to your spam detector example. The same questions arise with any other software. It's all just constraints versus benefit, data quality, monitoring loops, infrastructure, and cost.
I honestly don't see anything that's truly unique to ML here.
The part that is described as "model training" in ML is just done manually by developers and expressed as iterations in engineering. I would therefore think that the skillset is very much transferable and much of the apparent novelty is just traditional software engineering and management practices hidden behind ML jargon.
> I honestly don't see anything that's truly unique to ML here.
- The workloads are specific (lots of offline batch processing, accelerator powered offline stuff, then speed/power/resource constrained inference stuff)
- The hardware is specific (ML accelerators are for ML, you don't really use TPUs for anything else do you?)
- Debugging is specific (ML-specific tools like XLA)
- Labeling is specific (e.g. labeling audio, video, 3D points requires specific tooling)
If what you're saying is, "ML engineering sounds like engineering" that's obvious and was never a point in contention. OP's comment was "a couple of motivated engineers can make ML work" and my point is -- kind of, but at scale you need a lot of very specific things which are best done by specialized folks.
That there are billion dollar ML infra companies, as well as companies with ML infra teams that are hundreds of people, means that folks are finding it worthwhile to have, say, a team of folks who work on deploying nets efficiently and only that. Or a team of folks who only build labeling tools. Or a team of folks who only build model evaluation tools. My ramble was mainly to illustrate just how many sub-problems there are in ML and why ML infra is rightfully a big business -- there's a reason companies that use a lot of ML don't just have 2-3 randos building everything end-to-end for each model.
> The part that is described as "model training" in ML is just done manually by developers and expressed as iterations in engineering. I would therefore think that the skillset is very much transferable and much of the apparent novelty is just traditional software engineering and management practices hidden behind ML jargon.
Yeah ML engineering is engineering, so plenty of skills transfer between ML engineering <-> other engineering. But if you want to go from other engineering -> ML engineering, you do have to learn ML-specific things that I would not dismiss as "novelized software engineering" or just "jargon."
"New fundamental science advances come out every week"? While there's certainly a lot being published, I think that the word "fundamental" is being abused in that sentence.
WhyLabs cofounder here, so my opinions are probably biased.
When it comes to MLOps, data makes it much more complex to handle. Think of it as the curse of dimensionality. Nobody wants to deal with metrics across tens, if not hundreds or thousands, of features. In addition, data is often not stored in a nice SQL-based system with strong schema enforcement, so we see data bugs creeping in all the time. An example is when an upstream API service returns the 9-digit zip code instead of the 5-digit one. This sort of data issue can creep in at many parts of the ML system, especially when you use JSON to pass data around.
You can defend against some of these problems with some basic devops monitoring, but when you deal with tons of features this becomes a tedious task. DevOps tools focus on solving problems around code, deployment and systems health. They are not designed to address the curse of dimensionality above, and you sacrifice a lot by trying to reduce data problems into DevOps signals.
To be fair, I don’t think we need some fancy algorithms, but I think we need tools that are optimized around the user experience (i.e. removing friction) and workflows for these data-specific problems. There’s a lot to learn and apply from the DevOps world when thinking about data health, such as logging and collecting telemetry signals.
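As a rough illustration (not the WhyLabs/whylogs API), the kind of lightweight per-feature telemetry that catches things like the 9-digit zip example can be as simple as logging value-length histograms and missing counts per batch:

```python
# Profile one feature per batch; a sudden spike of 9-character zips would
# show up in the length histogram before it corrupts training or inference.
from collections import Counter

def profile_batch(records: list[dict], feature: str) -> dict:
    lengths = Counter(len(str(r.get(feature) or "")) for r in records)
    missing = sum(1 for r in records if r.get(feature) in (None, ""))
    return {"feature": feature, "length_hist": dict(lengths), "missing": missing}

batch = [{"zip": "98101"}, {"zip": "981012345"}, {"zip": None}]
print(profile_batch(batch, "zip"))
# {'feature': 'zip', 'length_hist': {5: 1, 9: 1, 0: 1}, 'missing': 1}
```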
Operationalizing ML is hard. But it has nothing to do with the models at all. It is hard because the main use case (besides image and text processing) is feature fusion: you generate a bunch of distinct features about, say, people, their history, the products they like, etc. (thinking of a recommender system now). However, these are things that usually live in really distinct parts of your DB / your backend. So as the MLOps person you are now tasked with getting info from all of these places. In a big org, often with different responsible people, security protocols, etc.
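For a rough picture of that fusion step, a minimal pandas sketch; the three loaders are placeholders for whatever systems actually hold the user, history and product data:

```python
# Pull features from different backend systems and join them into one frame.
import pandas as pd

users    = load_users_from_crm()          # hypothetical: one team's system
history  = load_purchase_history_dwh()    # hypothetical: the warehouse
products = load_product_catalog_api()     # hypothetical: another service

features = (
    history
    .merge(users, on="user_id", how="left")
    .merge(products, on="product_id", how="left")
)
# `features` is what feeds the recommender; most of the ops pain is keeping
# these sources fresh, consistent and access-controlled.
```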
Data acquisition is definitely hard, but it's far from the only challenge. Labeling is also hard for many use cases. Curating your labels is pretty annoying. Making your model inference performant enough to be launchable is also hard. Making your model something you can quickly iterate on is also hard. Evaluating your model (with the system it's embedded in) is also hard.
I wouldn't say it has "nothing" to do with the models. Maybe "little to do with model architecture" and "a lot to do with everything around the model." There's just a lot of work to be done to get the business wins you want from ML.
I'm not sure there is anything extra messy about MLOps: there are lots of vendors in pretty much any area that has profit potential. If you put all vendors on a chart, it will look messy, but you aren't going to be working with ALL of them (e.g., if you pick Snowflake you likely won't also be working with RedShift, Databricks...). The messy part I guess is the evaluation/selection, but not the integration or learning per se, as this article seems to imply. The article looks like a good reference for what's out there though.
I have an internal document just for comparing different ML vendors/frameworks/etc. Most of them have low value add, so it's easy to remove them from the picture.
One challenge is whether to pick a vendor that offers solutions for each stage in the lifecycle or one that specializes in a particular stage. Ultimately it's a false choice because you'll run into the limitations of a vendor and need to complement it with another. Then you have the problem of trying to staple vendor solutions together.
Hi, not sure if this question is better as a PM. I’m working on a project in this space and am curious what you find low value add about most current offerings?
My project focuses on the actual deployment of the model artifact and transformation of that artifact into a callable API.
Do you find most offerings low value add because you don’t want to deal with creating containers (Docker, etc.)? (That’s my experience of most offerings in this space.) Is it because you prefer to do that work yourself? Some other reason?
My frustration so far with MLOps is how unnecessarily large containers need to be to serve even the smallest models. Want to serve an MNIST PyTorch model? It'll likely be a huge image compared to the model size, exceeding the capacity of most free-tier hosts.
Your model-serving code (likely a Python installation and a few libs to answer HTTP requests, like FastAPI, Pydantic, etc.) lives in your Docker image, and its installation is documented in the Dockerfile. This is likely ~10-20 lines and about 500MB for a "standard" Python installation.
Then, on container startup, your model-serving app collects the latest model from an external artifact repository. This step usually belongs in an init container if you use Kubernetes.
Finally, the app starts and begins serving requests with its in-memory model.
The size of the running container scales linearly with the model size, while the image size stays "small" (for a Python installation). You also gain a fast auto-update mechanism where your app just needs a container restart to fetch the latest model.
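A minimal sketch of that serving app, assuming FastAPI and a TorchScript model; `fetch_latest_model` and the artifact path are made-up placeholders for whatever artifact store you use:

```python
# Small serving image: only the web layer is baked in; the model is fetched
# at startup and held in memory.
import torch
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = None

class PredictRequest(BaseModel):
    features: list[float]

@app.on_event("startup")
def load_model():
    global model
    path = fetch_latest_model("s3://models/spam/latest.pt")  # hypothetical helper + path
    model = torch.jit.load(path)
    model.eval()

@app.post("/predict")
def predict(req: PredictRequest):
    with torch.no_grad():
        score = model(torch.tensor([req.features])).item()
    return {"score": score}
```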
Yes, this is a generic workflow, but when you start adding in e.g. torch[vision, audio], scipy, or any other "useful" processing library, things get out of hand pretty quickly and you need to start being more intelligent about what is really necessary to have inside the container. So for a model that is even a few MB, you have a surrounding container of a couple of GB.
You need to run a forward pass somehow through a model, so at the very least you need something to define your operations + load your weights. You also need whatever libraries are required for any signal processing you might do on your inputs beforehand.
> Then, on container startup, your model-serving-app collects the latest model from an external artifact repository. This step usually belong in an init container if you use Kubernetes.
I agree but there are some reasons why folks choose to bake the model into the container image rather than fetching. E.g. time waiting for the model to transfer, artifact repository may be down, etc.
It's mostly because it's easier. No additional token necessary, no config, no service discovery. For small models (less than 100MB), it's generally not an issue to embed them directly in the image.
However, if you're getting GB-size images with less than half of it related to the serving code and its library, then it's time to upgrade.
* Distributed computing in most cases w/ spark + hadoop stack
* Keeping state, which may be required to mutate
* Rapid iteration
The ML tooling part of it is an implementation detail, i.e., the software and dependencies required. These are hard problems even with trad deterministic computing. I don't understand why the author seems to think ML engineers or scientists need to know these Ops tools.
For example, in this tweet https://twitter.com/mihail_eric/status/1486750600343822343 the author complains that data scientists need to learn Kubeflow (they don't), and that it's complicated. Thing is, as scalable architecture diagrams with all the other security side-requirements go, it's about as complicated as one would expect, maybe a little too abstract for those that do this for a living. I mean, your typical k8s-based SaaS tech stack can reach that complexity, but it's managed complexity, about as complex as needed for the stakes at play.
I don't know if ML folk are at the peak-hype-cycle arrogance where they think global ops problems can be solved for their use case, or if there's some misunderstanding of what an iceberg of a problem managing infra is.
I do agree it is messy. I did some MLOps (w/ a big data stack) as a "DevOps engineer", but I stuck with k8s and infra primitives, filtering out most of the list. The ML aspect was the easy part, mainly managing the install deps, Jupyter notebook state, etc.; the hard part was scaling to manage costs, managing a big data stack in general, and making the entire flow UX-friendly for ML engineers and data scientists, since you can't expect them to learn new CLI tools and trad software dev tooling (they're paid too much to waste time not working on ML problems). I think a lot of these problems are solved if your company has a lot of money to burn on SaaS solutions, or doesn't care about scaling down, or can afford its own datacenter.
My counterpoint to the article is that the industry has bent over backwards to cater to the ML space, integrating all these tools with existing tech (Spark on k8s, Kubeflow), making entire pipelines Jupyter-driven (https://netflixtechblog.com/notebook-innovation-591ee3221233), and generally using massive amounts of resources for ML. The ROI and the massive push to burn resources and time on the tooling seem to work out for big tech more than anyone else.