
Why we do machine learning engineering with YAML, not notebooks - ChefboyOG
https://towardsdatascience.com/why-we-do-machine-learning-engineering-with-yaml-not-notebooks-a2a97f5e04f8
======
yuy7878
It's quite simple to develop a model, bundle it up for deployment, and deploy it.
Nobody cares about your fancy YAML-based containerized deployment and
monitoring setup; everyone has that. The challenge comes when you have a
continuous cycle of data ingestion, model optimization, training, evaluation,
and deployment. Pretty much everybody has a huge amount of code duplication in
there. It also comes from the fact that ML researchers are barely capable of
programming a light switch; how are you ever going to put the horrible trash
of code they duct-taped together from Medium posts into a production
environment? Hopeless.

~~~
wooders
I don't think it's simple to deploy scalable predictions - that's why model
hosting solutions like SageMaker and GCP's AI Platform exist, and there's no
need for people to be re-implementing model deployment/monitoring.

------
s1t5
The title is seriously misleading. They aren't doing their ML engineering in
YAML. If you look at the snippets in the article, you can see that their code is
in flat .py files. The config is in YAML (which is also how everyone else uses
it). It's like someone saying that they do their ML in a Dockerfile.
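Concretely, that split (logic in flat .py files, config in YAML) usually looks something like the minimal sketch below; the keys and values are made-up examples, not taken from the article, and it assumes PyYAML is installed:

```python
# Minimal sketch of the code-in-.py, config-in-YAML split.
# The keys below are illustrative; requires PyYAML (pip install pyyaml).
import yaml

CONFIG_TEXT = """
model:
  type: xgboost
  learning_rate: 0.1
training:
  epochs: 20
"""


def load_config(text):
    """Parse a YAML config string into a plain dict for the training script."""
    return yaml.safe_load(text)


config = load_config(CONFIG_TEXT)
print(config["model"]["type"])       # xgboost
print(config["training"]["epochs"])  # 20
```

The code stays generic; reruns with different hyperparameters only touch the YAML.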

~~~
ABeeSea
The linked site (towardsdatascience) is kind of like Medium blogs: high
variability in quality, with a lot of self-promoters and the occasional diamond
in the rough. But a ton of rough.

~~~
bart_spoon
It _is_ a Medium blog. It's literally a Medium page; they just aggregate other
people's contributions, which their editors look over, rather than producing
their own. I suppose the editing process may help with quality some, but it
really shouldn't be seen as much more than a typical blog.

------
karlicoss
For .ipynb notebooks, I _highly_ recommend using nbstripout [0] to strip the
Jupyter output before committing the notebooks to the repository (thus making
the diffs sane).

You can also set it up as a 'filter', so it automatically runs before any git
operations, whether it's add, commit, diff or an interactive rebase.

[0] [https://github.com/kynan/nbstripout](https://github.com/kynan/nbstripout)
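The core of what such a filter does is mechanical enough to sketch in a few lines of stdlib Python; this is an illustrative sketch of the idea, not nbstripout's actual implementation:

```python
import copy


def strip_outputs(notebook):
    """Return a copy of a parsed .ipynb dict with outputs and execution
    counts cleared, so only the source cells show up in diffs."""
    nb = copy.deepcopy(notebook)
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return nb


nb = {
    "cells": [
        {"cell_type": "code", "source": "1 + 1",
         "outputs": [{"output_type": "execute_result",
                      "data": {"text/plain": "2"}}],
         "execution_count": 3},
        {"cell_type": "markdown", "source": "# Notes"},
    ]
}

stripped = strip_outputs(nb)
print(stripped["cells"][0]["outputs"])          # []
print(stripped["cells"][0]["execution_count"])  # None
```

Run as a git filter, the working copy keeps its outputs while the committed version stays clean.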

~~~
amirathi
> thus making the diffs sane

You can also use ReviewNB [1], which is built specifically for Jupyter notebook
diffs. You can see visual notebook diffs for any commit or pull request on
GitHub. For pull requests, you can also write comments on a notebook cell
(emulating the typical code review experience for Jupyter notebooks).

Disclaimer: I built ReviewNB for Jupyter notebook code reviews on GitHub.

[1] [https://www.reviewnb.com/](https://www.reviewnb.com/)

------
thelastbender12
The Cortex tool mentioned looks really useful for getting a service running out
of a trained model. Though I didn't really understand what the article is trying
to get at; storing your deployment configuration in YAML and JSON files is
pretty much the standard.

~~~
mikorym
Perhaps for young people it's actually necessary to mention JSON and YAML, as
you people tend to read the news rather than history or best-practices
textbooks?

~~~
alxlaz
Many hyped development and administration practices make me snarky, too, but
let's not take it out on random (young) people on the Internet. There's a
great deal of history and best practices that are being ignored not just by
ADHD juniors, but also by team leads and managers of every seniority level.
Lots of people out there spend decades in the industry, build solid careers
by playing the right office-politics cards and being friends with the right
people, and miraculously manage to learn almost nothing.

(Edit: also, clearly, I don't read the news -- is YAML being "superseded"? By
_what_ now??)

~~~
baq
> is YAML being "superseded"? By what now??)

I hope by something that isn't the kitchen sink... I actually prefer XML, and I
hate XML.

I started looking at [https://dhall-lang.org/#](https://dhall-lang.org/#), which
compiles to JSON/YAML, is seriously strongly typed, and is explicitly not
Turing-complete, both as a design goal and in current reality.
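For a taste of what that buys you, here is a hypothetical service config in Dhall (the field names are illustrative, not from any real schema); it type-checks before it ever becomes YAML:

```dhall
let Service = { name : Text, port : Natural, replicas : Natural }

let predict : Service = { name = "predict", port = 8080, replicas = 2 }

in  predict
```

Running `dhall-to-yaml` on this emits ordinary YAML, and a mistake like `replicas = "2"` is rejected at type-check time rather than discovered at deploy time.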

------
mumblemumble
So, every month or two I see another article tut-tutting people for putting
notebooks into production, and I'm curious, _who is actually doing this?_ I've
never seen such a thing in the wild, and I'm genuinely (morbidly?) curious
what it would look like in practice.

~~~
bart_spoon
Netflix apparently. They've built an entire software framework for enabling
their data scientists to put their notebooks into production [0].

[0] [https://netflixtechblog.com/notebook-innovation-591ee3221233](https://netflixtechblog.com/notebook-innovation-591ee3221233)

~~~
vtuulos
We (Netflix) do a ton of prototyping/exploration in notebooks like everyone
else. We run many ETL pipelines in production as _templated notebooks_. When
something fails, you can just open a notebook to see the input and the output,
which is handy.

We don't deploy or execute ML models in production as notebooks. We have many
other solutions for that use case. In particular, check out
[https://metaflow.org](https://metaflow.org)

------
conjectures
> When I say production machine learning, I’m referring to machine learning
> that manifests as a product feature. For example, Uber’s ETA prediction, or
> Gmail’s Smart Compose.

You can bet that prod services from companies you've heard of are running on
something more analogous to versioned Docker images, not a YAML file which
says, 'Go run whatever predict.py is in the current folder.'

The moment one of your dependencies breaks your code, or snookers your
performance, there will be a lot of head scratching going on.
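The versioned-image approach looks roughly like this hypothetical Dockerfile (file names and versions are illustrative): dependencies are pinned and baked into one immutable artifact alongside the code:

```dockerfile
# Build an immutable, versioned image: dependencies are pinned in
# requirements.txt (e.g. scikit-learn==0.23.2), code is baked in at build time.
FROM python:3.8-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY predict.py .
CMD ["python", "predict.py"]
```

Tag the image per release and a rollback restores bit-for-bit the service you actually tested, whatever has since changed in the folder or on PyPI.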

------
dirtydroog
Reads like someone was forced by their marketing team to write an article
about anything at all.

~~~
BasilPH
I find the quality of articles on towardsdatascience.com has decreased
significantly over the past few months. This article is no exception.

------
musingsole
Reason 1 (your pipeline should be reproducible) for avoiding Jupyter doesn't
make any sense; that's the whole point of Jupyter. Out-of-order execution can
happen, but you can just as easily restart the kernel and run all... I can
only imagine this is a problem for someone who doesn't understand the tool
they're using.

Reasons 2 and 3 for avoiding Jupyter are more justified, but easy enough to
work around with jupytext and papermill.
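That workaround can be sketched as follows (the notebook name and parameter are made up, but both CLI invocations are real): jupytext mirrors the notebook as a diffable script, and papermill executes it headlessly with injected parameters:

```shell
# Keep a plain-Python mirror of the notebook for sane diffs and code review
jupytext --to py:percent pipeline.ipynb          # writes pipeline.py

# Run the notebook top to bottom with a parameter injected, saving the
# executed copy (with outputs) as a separate artifact of this run
papermill pipeline.ipynb runs/output.ipynb -p learning_rate 0.1
```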

------
ljvmiranda
I once wrote a survey of tools within the Jupyter Notebook ecosystem:
[https://ljvmiranda921.github.io/notebook/2020/03/16/jupyter-notebooks-in-2020-part-2/](https://ljvmiranda921.github.io/notebook/2020/03/16/jupyter-notebooks-in-2020-part-2/) (it's a three-part series and that link is Part 2).

The topic of production notebooks often shows up. I've seen tools like
papermill and dagster being used for notebook prod, just as at Netflix.

I concluded that using notebooks for prod is always a tech decision, often
influenced by a tradeoff between the risk of premature optimization (writing
scripts early on in the project that may only be used once) and
under-engineering (using non-maintainable and clunky code to support
mission-critical workloads):
[https://ljvmiranda921.github.io/notebook/2020/03/16/jupyter-notebooks-in-2020-part-2/#putting-it-together](https://ljvmiranda921.github.io/notebook/2020/03/16/jupyter-notebooks-in-2020-part-2/#putting-it-together)

