
We Need DevOps for ML Data - amargvela
https://tecton.ai/blog/devops-ml-data/
======
softwaredoug
In my experience, the problem has a lot to do with how teams organize around
ML.

When you have an engineering team separate from the data science team, you'll
inevitably have unproductive conflict & politics. One team might be
incentivized for stability and speed (engineering or ops) and the other for
model accuracy (data science). The end result can be disastrous... An
engineering team that refuses to bend at all to help data scientists get
their work into production. Or a data science team that only cares about
maximizing accuracy, even if it might destroy prod, or be impractical to
implement in a performant way.

To hit the sweet spot on accuracy, speed, and stability, you need to have one
team that focuses on the end feature. It needs to be cross-functional and
accountable for doing a great job on that feature. And the data scientists
arguably need to be more focused on measuring and analyzing the feature's
success, rather than just building models for their own sake.

I'd recommend the book Agile IT Organization Design if you're interested in
good team design patterns.

~~~
tixocloud
This. In my experience across larger enterprises, data science teams rarely
hold the keys to production environments and therefore rely heavily on IT to
productionize ML. And I completely agree that data scientists need to be
focused on measuring and analyzing success as opposed to churning out more
and more models.

~~~
moandcompany
In many cases, this is not just because production teams don't want data
science teams to be able to deploy to production (due to a lack of trust or
confidence); data science teams often don't want this responsibility either.

~~~
softwaredoug
There’s also perhaps a syndrome of not wanting to do the organizational work
to do ML well. Instead of changing the whole org by integrating ML into every
team, a data science team is hired to do god knows what. There’s status
assigned to being the “data scientist,” and they work away, siloed, on
fun-sounding deep learning models. In this mode, if they produce anything,
it’s impractical, divorced from the product realities, and rather hard for
the main engineering/product org to maintain or implement.

The reality is there’s more work to embracing ML than hiring data scientists.
Everyone needs to understand ML a little, and it needs to be OK to critically
question data science work from product and engineering angles.

~~~
moandcompany
Another aspect of this I've observed: a personal sense of value (and industry
pay feeds into this) contributes to the partitioning of work. If we're
charitable, it comes from a belief in comparative advantage; if we're
brutally honest about some people, it's because they often feel that "_____
isn't a good use of their time." This is also fed by the "sexiest job of the
21st century" saying that's been created.

We see this in data science and machine learning where people complain about
spending their time cleaning data, etc., when their time should be spent
"generating insights/etc." We also see that those insights are interesting
but not very useful if they aren't actionable, or are too costly or too
impractical to implement.

Ultimate value is related to being able to contribute to and achieve the
holistic outcome, but the lens of success is often focused on models or
insights instead. This is a cultural and organizational problem, rather than a
technological one. It also takes a dose of humility to appreciate the true
value of the so-called dirty work.

~~~
tixocloud
Another point: while we technologists love to marvel at data science and
machine learning, it still raises the question of what value they bring to
the business. Is the added responsibility of creating all the infrastructure
and processes worth it to justify a 5% increase in conversion rates? As you
say, even the dirty work has a cost, and that cost may just buy you the
discovery that there's nothing you can do to improve the business. That's why
all the massive multi-year central data warehouse cleansing projects keep
failing without yielding much value. There's just a lack of focus on
delivering incremental value with these data projects.

~~~
atupis
I think it currently creates new possibilities to do business. Computer
vision is at the point where it is now more engineering than data science, so
adding something like reasonably good object detection is not that hard. NLP
is probably at the same point CV was 5 years ago, so we're starting to see
very good NLP models.

~~~
tixocloud
Definitely. Even as an engineer working on CV 10 years ago, the hard part
wasn’t object detection but rather the network bandwidth needed to stream
incredible amounts of data and process it in real time.

Spoke to an experienced engineer who used to lead NLP at MSFT and heard the
same comment. NLP models are already fantastic, and it isn’t very hard to
build a smart chatbot. The implementations these days are just very poor
because they are not well thought out from a user perspective.

------
simonw
I see this as more of an organizational challenge than a technology challenge.

Getting ML models into production isn't particularly hard... if you put an
engineering team on it that knows how to write automated release procedures,
design architecture that can scale, and build robust APIs to surface the
data.

But in many companies the engineers with those operations-level skills and the
researchers who work on machine learning live completely separate lives. And
then the researchers are expected to deploy and scale their models to
production!

That's not to say that this organizational problem cannot be solved with
technology/entrepreneurship. If a company can afford it, it's likely much
cheaper to pay an external company to solve your "ML in production" problems
than to re-design your organization so that your internal ML teams have the
skills they need to go to prod.

~~~
Cacti
I disagree. It’s not about getting the data where it needs to be. It’s about
data version control at a very fine level with very large datasets (in a way
that is efficient). It’s about detecting changes in model results based on
changes in data. It’s about tracking the provenance of data in the datasets.
It’s about potentially controlled access to the data (e.g. allowing models to
use health care data without actually exposing the underlying data). It’s
about detecting bias in datasets over time.

It’s actually quite complex, which is why generally speaking very few people
do anything like this. I am unaware of any general solution to this problem,
either in industry or academia.
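
For what it's worth, the fine-grained-versioning and provenance-diff parts
can at least be approximated with content addressing, git-style. A minimal
Python sketch (all names here are hypothetical, and it ignores the efficiency
problem for truly large datasets):

```python
import hashlib
import json

def file_fingerprint(content: bytes) -> str:
    """Content hash of a single data file (or shard)."""
    return hashlib.sha256(content).hexdigest()

def dataset_version(files: dict) -> dict:
    """Build a manifest of per-file hashes plus a hash over the whole
    manifest. Two datasets share a version id iff every file's content is
    identical, so a changed shard is detectable without re-reading the
    unchanged ones."""
    manifest = {name: file_fingerprint(data) for name, data in sorted(files.items())}
    version_id = hashlib.sha256(json.dumps(manifest, sort_keys=True).encode()).hexdigest()
    return {"version": version_id, "files": manifest}

def changed_files(old: dict, new: dict) -> list:
    """Provenance diff: which shards changed between two dataset versions."""
    names = set(old["files"]) | set(new["files"])
    return sorted(n for n in names if old["files"].get(n) != new["files"].get(n))

v1 = dataset_version({"a.csv": b"1,2\n", "b.csv": b"3,4\n"})
v2 = dataset_version({"a.csv": b"1,2\n", "b.csv": b"3,9\n"})
# only the shard that actually changed shows up in the diff:
assert changed_files(v1, v2) == ["b.csv"]
```

This covers diffing and reproducibility; it says nothing about controlled
access or bias detection, which is part of why no general solution exists.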

~~~
tixocloud
You've brought up a lot of very interesting points that we're actually
looking to solve with our startup Orchestra
([https://orchestrahq.com](https://orchestrahq.com)): data distribution
changes, data version control and reproducibility, privacy guards, and bias
detection.

Would love to chat if you have further thoughts around the subject - there's a
ton of problems we're looking to tackle in the space and would be good to get
input.

------
moandcompany
Fig 4 looks like it's derived from Hidden Technical Debt in Machine Learning
(2015).

[https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf](https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf)

As someone else says in this comment thread, this is very much an
organizational problem, and cannot be viewed as just a technology problem.

The common behavior of individuals and teams is the pursuit of solutions that
solve problems for them. The problem here with ML, as we've seen with "Data
Science" and other magic technologies, is that having an appreciation for the
domain or context goes a long way. Being familiar with the entire process, or
"pipeline," is valuable, and role/functional silos often lead to the problems
people experience.

For some classes of machine learning problems and associated data, sourcing
solutions from vendors can work, but as with any tools you can procure, you
need the right people to use them appropriately. This also applies to "DevOps"
which is used for comparison in the blog post.

DevOps example -- the philosophy seems to be about having software developers
also share build/release and infrastructure responsibility. But some
organizations have made "DevOps" teams that silo build/release and
infrastructure work... they ended up just renaming what used to be called
their Build/Release or SysAdmin teams. Siloing things to be "someone else's"
problem doesn't result in the major transformations that are needed.

Now imagine what happens if we substitute MLDevOps for DevOps above.

I'll continue to say "The Role of a Data Engineer on a Team is Complementary
and Defined By The Tasks That Others Don’t (Want To) Do (Well)"

~~~
amznthrowaway5
> The Role of a Data Engineer on a Team is Complementary and Defined By The
> Tasks That Others Don’t (Want To) Do (Well)

Those types of tasks are also often not recognized or rewarded by management,
despite being a hugely critical part of the system. I believe the incorrect
hiring of scientists who are often strong in terms of core theory or number of
papers published but have no clue about building real production ML systems is
a huge organizational problem, often causing ML teams to fail to deliver any
real value.

------
gas9S9zw3P9c
Wow, I have probably seen 10 of these kinds of companies over the past few
months. Personally I believe (and hope) the winners in this space are going
to be modular open-source companies/products as opposed to "all-in-one
enterprise solutions."

~~~
yanovskishai
Could you mention which other solutions you've seen in this space?

~~~
verdverm
[https://dolthub.com](https://dolthub.com) is the cool kid right now. There
are also Pachyderm, Git LFS, and IPFS.

Really what we need is version control for data; it's not just an ML data
problem. It's a little different though, because you would like to move
computation to the data, rather than the other way around.

~~~
wenc
The utility of version-controlling production-sized data (not sample training
data), as opposed to code, is something I'm having trouble grasping unless
I'm missing something here -- and I may be, so please enlighten me.

It seems to me that to be able to time-travel in data you almost need to
store the write-ahead log of database transactions and be able to replay it.
Debezium captures the CDC information, but it's an infrastructure-level tool
rather than a version control tool.

In data science, most time-travel issues are worked around using bitemporal
data modeling, which is a fancy way of saying "add a separate timestamp
column to the table to record when the data was written." Then you can roll
things back to any ETL point in a performant fashion. This is particularly
useful for debugging recursive algorithms that get retrained every day.
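
A toy version of that bitemporal workaround in pandas (the column names are
made up for illustration):

```python
import pandas as pd

# Bitemporal-style table: each row carries the timestamp of the ETL run
# that wrote it ("record_time"), alongside the business data itself.
rows = pd.DataFrame({
    "customer": ["a", "a", "b"],
    "balance":  [100, 120, 50],
    "record_time": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-01"]),
})

def as_of(df: pd.DataFrame, ts: str) -> pd.DataFrame:
    """Time-travel: reconstruct the table as it looked after the ETL run at
    `ts` -- keep, per customer, the latest row written at or before ts."""
    visible = df[df["record_time"] <= pd.Timestamp(ts)]
    return (visible.sort_values("record_time")
                   .groupby("customer", as_index=False)
                   .last())
```

Calling `as_of(rows, "2020-01-01")` reconstructs customer "a" with balance
100, while as of 2020-01-02 the same query sees 120 -- no WAL replay needed,
just an extra column and a filter.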

But these are infrastructure level approaches. I'm not sure that it's a
problem for a version control tool.

~~~
sgt101
I worry about retraining every day. Isn't that a flag that says "It hasn't
learned a thing and actually I'm just improving my backfitting score"?

~~~
somurzakov
As long as it works well on out-of-sample data at deployment time, it is
okay.

Until some major data drift happens, but you would notice it anyway.

~~~
sgt101
Honestly, I've heard people in Vegas tell me the same about their strategies
vs. slots. Genuinely, if you have made money from this - well done, take it
out now, congratulate yourself. If you haven't...

------
remmargorp64
I was the main data science engineer at one of my previous companies. We used
tools like airflow for running python scripts to import data, clean/transform
it, train models, and even test various models against datasets. We also used
Azure for similar things.

It's easy to do "dev ops" for machine learning. Basically, just automate
everything and implement gatekeeping mechanisms along with active monitoring.

It's true, though. I had to cobble together a lot of custom things at the
time, but it wasn't that hard to do.

~~~
nik_s
I'm the CTO at a data science company, and this has been my experience too.
I've been lucky enough to have quite a few engineers go from zero practical
experience to being able to train and deploy complex ML solutions, and the
most successful solutions have always involved a combination of just a
handful of tools:

- airflow and/or celery for running data extraction and transformation jobs
- pandas and numpy for data wrangling
- sklearn, xgboost, lightgbm, pytorch or tensorflow for training/inference
- flask or Django to serve results

It's a handful of technologies, but they're (generally) mature, battle tested,
and well documented.
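
As an illustration of how little machinery the happy path needs, here's a
compressed sketch of that stack's middle steps -- pandas for wrangling,
sklearn for training -- with the extraction and serving layers stubbed out
and all data fabricated:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy "extraction" step: in a real setup this would be an airflow/celery
# task pulling from a warehouse; here we just fabricate a small frame.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "spend": rng.normal(100, 20, 500),
    "visits": rng.poisson(3, 500),
})
df["churned"] = (df["spend"] < 90).astype(int)  # fabricated label

# Wrangling with pandas, training with sklearn, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    df[["spend", "visits"]], df["churned"], random_state=0)
model = LogisticRegression().fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
```

A real deployment would wrap `model.predict` in a flask/Django endpoint, but
the train/evaluate core really is this small.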

~~~
softwaredoug
Generally true. Though I will say that in larger orgs, you will occasionally
get someone doing some ML they read a paper on that's not well supported by
major tooling. I mean it's the same trend chasing you see in engineering...

------
starpilot
Good god it's hard to do this at a non-tech company. MLOps would be great,
but we don't really have "Ops," just IT, since our main business is not
software. And we don't have Dev either, so we don't have anyone to emulate on
the inside. Our data scientists are foremost analysts who can write some
Python; they don't know OO or memory optimization or anything. They've never
used a bash prompt and don't know what one is. Management thought we could
orchestrate this huge waterfall schedule for a project, and now it's falling
apart as we open each new box of surprises...

~~~
proverbialbunny
If you don't have a dev, how are you collecting any data to begin with?

------
kostas_f
I'll disagree with most comments that it's mainly an organizational problem.
Creating tooling for things like:

- managing different data sources

- versioning data

- monitoring how new data affects the model

- testing that certain SLAs are met before new features are deployed

- the ability to roll back

- data & model quality monitoring

is technically challenging.
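
On the SLA and rollback items specifically: even the "hacked together"
version is more subtle than it looks, because the gate has to combine quality
and latency checks and default back to the previous version on failure. A
stripped-down, entirely hypothetical sketch of such a gate:

```python
import time

# Hypothetical SLA gate: promote a candidate model only if it clears hard
# thresholds on an evaluation set and on latency; otherwise keep the
# currently deployed version (i.e. an automatic rollback path).
SLAS = {"min_accuracy": 0.90, "max_latency_s": 0.05}

def meets_slas(model, eval_fn, sample_input) -> bool:
    accuracy = eval_fn(model)       # offline evaluation on held-out data
    start = time.perf_counter()
    model(sample_input)             # single-call latency probe; a real gate
    latency = time.perf_counter() - start  # would use percentiles over many calls
    return accuracy >= SLAS["min_accuracy"] and latency <= SLAS["max_latency_s"]

def deploy(candidate, current, eval_fn, sample_input):
    """Return the model that should be live after this release attempt."""
    return candidate if meets_slas(candidate, eval_fn, sample_input) else current
```

The hard parts the sketch omits -- versioned artifacts to roll back *to*,
monitoring after promotion, data drift -- are exactly where hacked-together
solutions fall over.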

Obviously there are engineers that will quickly hack something together and
will falsely think that they have a good enough MLOps solution. I have been
part of such teams.

Most companies are not Google, Facebook, or Uber. The large ones very often
don't have the know-how to create a robust technical solution around this
process, and even if they do, it can take them years; the smaller ones lack
both the resources and the technical expertise.

I'm always looking for new ideas that can become successful businesses, and
when I saw Uber's Michelangelo here on HN a few years ago, I thought that
selling similar tooling to other companies had great potential. Seems that
the right team to create that company was the one that built Michelangelo
itself :)

------
smeeth
I really find it difficult to put into words just how little I care to pay for
a web ui so I can "manage" my data.

Data pipelines are a real problem though, and I'm very interested in what
startups do with this space.

~~~
factorialboy
> Data pipelines are a real problem though

Can you please elaborate more, thanks.

~~~
prions
It's not trivial to create and manage data pipelines if you care about scale,
serving a wide range of inputs and outputs, or making the data easy to
surface and spread throughout your org (i.e. making it actually useful to
regular people).

"Static ETL," like running the same database load every day at 1:00am, isn't
a super challenging problem. Doing it across many tables with complex
transformations and multiple steps easily can be. You really have to consider
reliability, processing speed, failure modes, and other problems that don't
really arise until you hit a certain scale.

There's also the issue that what people want out of a pipeline is changing.
If you want people to be "data driven," then they need easy access to
potentially all of your company's data on an ad hoc basis. So now your boring
1am ETL pipeline isn't really serving any of these new use cases.

How do you create flexible pipelines that can be created from any dataset on
an ad hoc basis? This is where tools like Airflow or Prefect come in. Creating
a platform that can create these types of Pipelines is a real problem.
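
At its core, the "flexible pipelines from any dataset" idea reduces to
building and executing a task DAG on the fly -- which is what Airflow and
Prefect productize. A toy, stdlib-only sketch of that core (task names and
bodies are made up):

```python
from graphlib import TopologicalSorter

# Toy ad-hoc pipeline: tasks keyed by name, each declaring its upstream
# dependencies, executed in topological order -- the skeleton underneath
# Airflow/Prefect-style DAG runners.
def run_pipeline(tasks, deps):
    """tasks: name -> callable(results_so_far); deps: name -> upstream names."""
    results = {}
    for name in TopologicalSorter(deps).static_order():
        results[name] = tasks[name](results)
    return results

results = run_pipeline(
    tasks={
        "extract":   lambda r: [3, 1, 2],                      # pull raw data
        "transform": lambda r: sorted(r["extract"]),           # clean/shape it
        "load":      lambda r: {"rows": len(r["transform"])},  # publish it
    },
    deps={"extract": set(), "transform": {"extract"}, "load": {"transform"}},
)
```

Everything the comment lists -- retries, failure modes, scheduling, scale --
is what separates this toy from a real platform.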

And before you even ask yourself _how_ to process this data, you need to ask
_where_. If you want to do what I outlined above -- making your data more
accessible and easy to use -- then you probably need to rework how you're
storing your data. But data lakes (and the alternatives) are a whole topic in
and of themselves.

------
ska
"We need fewer data scientists and more data janitors" \- anon

------
dnautics
Honest question (though I suppose the clickthroughs to the comments are
likely to be a biased sample): is "getting to prod" really the
gatekeeper/bottleneck for most ML? I would have thought "a model that works"
is much harder, especially given how hyped the field is and how many people
are trying to tackle problems that are ill-suited to the current batch of ML
techniques.

Unless the issue here is data collection in prod to start training your
model.

~~~
proverbialbunny
>is "getting to prod" really the gatekeeper/bottleneck for most ML?

The most common bottleneck is collecting the right data. It can take years,
or even a dedicated task force, just to get the right data before the data
scientist can begin.

>I would have thought "a model that works" is much harder

It depends on how experienced the data scientist is. Early on in a project, a
data scientist can do a feasibility assessment. They should identify what is
possible, and how possible. Some data science projects are heavy on the
research side, where it can take 2 weeks to 3 months to figure out if
something is possible. Sometimes the feasibility assessment ends up being
incorrect and a goal is shown to be impossible.

Once research is done it usually takes 4 weeks to 6 months for a data
scientist to build a model. The upper bound is rare and happens because of
recursive refinement to increase accuracy, trying to get every last drop out
of what is possible.

In contrast, it can take months to years for the company to begin to collect
the right data for a data scientist to be able to do what benefits the
company. Sometimes crowdsourcing projects need to be created just to collect
the required data. It then takes an average of 3 to 6 months for
productionization if there is clever feature engineering in the model. Note:
when I say productionization, I mean all the way to the end customer, so
setting up and maintaining pipelines, frontend devs updating websites to add
the service, and whatever else is necessary. There is more work involved on
the production side, but it can be split up among multiple engineers.

~~~
dnautics
That's exactly what I suspected (I work at a dl/ml hardware co), thanks!

------
Jaruzel
Managing data is not an IT job. Data is just unformatted information, and
should be managed and governed by those who are trained in Information
Management: Modern day Librarians.

IT own the platform, and the software. They should never own the data as well.

~~~
alexilliamson
I agree with this, and I think data librarian is a role that any "data-driven"
company needs. IMO it makes a lot of sense for data scientists to fill that
role, but I think that's an issue for many. Data scientists may think being a
librarian and organizing the knowledge base is beneath them, or maybe
management thinks it's beneath them. Execs tend to not care about the state of
knowledge infrastructure as long as their reports get to them when they
expect.

------
akarve
This is close to home. Our approach to DevOps for ML Data is to use S3 as the
git core and build immutable datasets and models on top of S3 object
versioning. I wrote the piece below on "Versioning data and models for faster
iteration in ML" earlier this year. The key idea is for every model iteration
to be a pure function F(code, environment, data). Ideas welcome:
[https://medium.com/pytorch/how-to-iterate-faster-in-machine-learning-by-versioning-data-and-models-featuring-detectron2-4fd2f9338df5](https://medium.com/pytorch/how-to-iterate-faster-in-machine-learning-by-versioning-data-and-models-featuring-detectron2-4fd2f9338df5)
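
That pure-function framing can be made concrete with nothing but hashing: if
the version id is a digest of code, environment, and data, then any change to
any input yields a new model version, and identical inputs reproducibly yield
the same one. A minimal stdlib sketch (the input strings below are
hypothetical placeholders, e.g. a script, a pinned requirements line, and an
S3 object-version reference):

```python
import hashlib

def model_version(code: bytes, environment: bytes, data_manifest: bytes) -> str:
    """Treat a model build as a pure function F(code, environment, data):
    the same three inputs always map to the same version id, so any change
    in any input yields a new, distinct model version."""
    h = hashlib.sha256()
    for part in (code, environment, data_manifest):
        # hash each input separately to avoid concatenation ambiguity
        h.update(hashlib.sha256(part).digest())
    return h.hexdigest()
```

In a real setup the `data_manifest` argument would itself be the hash of an
immutable, S3-versioned dataset, so the whole chain stays reproducible.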

------
Tehchops
I'm reminded of: [https://blog.acolyer.org/2019/06/03/ease-ml-ci/](https://blog.acolyer.org/2019/06/03/ease-ml-ci/)

------
iddan
This startup is trying to build the next GitHub for ML Data:
[https://dagshub.com/](https://dagshub.com/)

~~~
gunshai
This seems pretty cool.

------
beckingz
Data is hard to automate, and standardized pipelines and processes are really
helpful. This is interesting.

------
tkyjonathan
Isn't this DataOps?

~~~
schnitsel
I was thinking the same, it could be that the OP isn't familiar with the term
yet.

------
pottertheotter
Is this not just a really long advertisement?

------
flaxton
Rule number one: define your terms as you introduce them. On and on about ML.
But what is it?

I had to search to see it was Machine Learning.

How hard is it to define it the first time you use it?

I'd bet lots of people were scratching their heads but didn't bother to look
it up or continue reading...

~~~
oplav
Genuinely curious, did you think "ML" stood for anything else? My day to day
work is not machine learning but if I ever see ML, "machine learning" is the
first thing I think of.

~~~
Tommah
There is also the ML family of programming languages:
[https://en.wikipedia.org/wiki/ML_(programming_language)](https://en.wikipedia.org/wiki/ML_\(programming_language\))

