
Kedro: open-source library for production-ready machine learning code - ereli1
https://medium.com/@QuantumBlack/introducing-kedro-the-open-source-library-for-production-ready-machine-learning-code-d1c6d26ce2cf
======
FridgeSeal
> Machine learning models which can be deployed effortlessly and operate
> unattended are far more likely to achieve commercial objectives.

Likelihood of achieving commercial objectives is tied to the commercial
usefulness and accuracy of your analysis and predictions, not the ease of
deployment, or, even more curiously, the ability to be left unattended.

~~~
IanCal
It's surely not a particularly contentious point that hard-to-deploy systems
that require lots of attention to keep running are less likely to achieve
commercial objectives.

Just like your website being stable and easy to update helps your business use
it to make money. Of course it _also_ needs to be tied to commercial
usefulness.

------
prepend
I really like how they implemented the data catalog [0]: it's YAML-based and
has a paths-style cascading scheme of files, so catalogs can be shared across
or within teams as well as kept personal for individual projects. I think this
makes it easy to build tooling on top for meta-analysis (how many data sets
are used, etc.) and even visualization using a variety of tools, rather than
having the metadata management tied to one system or product.

Are there other techniques for data catalogs that are file-based, or at least
built on an open standard, that scale all the way up from a single developer?

[0]
[https://kedro.readthedocs.io/en/latest/04_user_guide/04_data...](https://kedro.readthedocs.io/en/latest/04_user_guide/04_data_catalog.html)
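
Since the catalog is plain YAML, that kind of meta-analysis can be ordinary
scripting. A minimal sketch (the entries and dataset type names here are
hypothetical, modelled on the linked docs; exact class names vary between
Kedro versions):

    from collections import Counter
    import yaml

    # stand-in for the contents of a catalog.yml file
    catalog_yml = """
    cars:
      type: CSVLocalDataSet
      filepath: data/01_raw/cars.csv
    trucks:
      type: ParquetLocalDataSet
      filepath: data/01_raw/trucks.parquet
    """

    entries = yaml.safe_load(catalog_yml)
    print(len(entries), "datasets in use")               # how many data sets
    print(Counter(e["type"] for e in entries.values()))  # grouped by storage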

~~~
infinite8s
There's the Intake project from the Anaconda folks.

------
domenicrosati
Conjecture: the production quality of ML code has mostly to do with how
heuristics are designed and battle-tested, and almost nothing to do with how
the training/inference pipeline is constructed.

~~~
stichers
Just because the challenge is relatively trivial to solve doesn't make it any
less important, though. Experiment management, and the transition to
production, is recognised as having a potentially high impact on successful
delivery. My understanding is that this takes care of details which can
otherwise get forgotten in the race for the best model. But YMMV.

------
bserial
I'm curious whether anyone can say how this compares to Dagster, since both
libraries seem to rely on deploying to engines like Airflow.

~~~
Peteris
Kedro puts the emphasis on a seamless transition to prod without jeopardizing
work in the experimentation stage:

- pipeline syntax is absolutely minimal (even supporting lambdas for simple
transformations), inspired by the Clojure library Plumbing/Graph
[https://github.com/plumatic/plumbing](https://github.com/plumatic/plumbing);
see the sketch after this list

- sequential and parallel runners are built in (you don't have to rely on
Airflow)

- the io layer provides wrappers for existing, familiar data sources, and
directly borrows arguments from the Pandas and Spark APIs, so there is no new
API to learn

- flexibility, in the sense that you could rip anything out, for example
replacing the whole Data Catalog with another mechanism for data access like
Haxl

- there's a project template which serves as a framework with built-in
conventions from 50+ analytics engagements
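
A minimal sketch of that pipeline syntax (the dataset names here are made up,
and exact import paths may vary between Kedro versions):

    from kedro.pipeline import Pipeline, node

    def clean(cars):
        # an ordinary pure Python function used as a task
        return cars.dropna()

    pipeline = Pipeline([
        node(clean, inputs="raw_cars", outputs="clean_cars"),
        # lambdas work for simple transformations
        node(lambda df: len(df), inputs="clean_cars",
             outputs="row_count", name="count_rows"),
    ])

Either SequentialRunner or ParallelRunner from kedro.runner can then execute
the same pipeline, per the second point above.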

------
wokwokwok
tl;dr, if you really dig past the marketing (from the FAQ (1)):

> We see Airflow and Luigi as complementary frameworks: Airflow and Luigi are
> tools that handle deployment, scheduling, monitoring and alerting. Kedro is
> the worker that should execute a series of tasks, and report to the Airflow
> and Luigi managers.

> Create the data transformation steps as pure Python functions

Personally, I'm mystified as to why you would use something like this rather
than a more mature product like, say, Spark, which natively supports
clustering, etc. That's the comparison I would really like to see in the FAQ.

Is it a processing solution? Not really, since it suggests you can offload the
heavy lifting to an engine, e.g. Spark. An orchestrator? Apparently not,
because that's a complementary product. So... it's like a configuration
management tool?

It's pretty hard for me to see the use case.

1. [https://kedro.readthedocs.io/en/latest/06_resources/01_faq.h...](https://kedro.readthedocs.io/en/latest/06_resources/01_faq.html#how-does-kedro-compare-to-other-projects)

~~~
FridgeSeal
Because running Spark to do anything that doesn't actually require a whole
cluster is like using earthmoving equipment to assemble a series of small IKEA
tables?

~~~
wokwokwok
If you're doing something that trivial, you don't need anything more
complicated than Airflow.

~~~
tsanikgr
We experienced a big hit to our productivity when we were using Airflow, as
there is significant overhead when running pipelines.

We think this is easier than Airflow and needs less setup:

- You don't need a scheduler, a database, or any initial setup. On the
contrary, Kedro provides the `kedro new` command, which will create a project
for you that runs out of the box (optionally with a small pipeline example).

- You can run your pipelines as simple Python applications, making it easy to
iterate in IDEs or terminals.

- Tasks are simple Python functions, instead of operators.

- Datasets are first-class citizens. You don't need to explicitly define
dependencies between tasks: they are resolved according to what each task
produces/consumes (see the sketch below).
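
A toy illustration of that dependency resolution, assuming only public Kedro
APIs (`Pipeline`, `node`, `DataCatalog`, `MemoryDataSet`, `SequentialRunner`);
the node functions and dataset names are made up:

    from kedro.io import DataCatalog, MemoryDataSet
    from kedro.pipeline import Pipeline, node
    from kedro.runner import SequentialRunner

    def double(x):
        return x * 2

    def summarise(x):
        return f"result is {x}"

    # nodes are declared "out of order" on purpose: the pipeline orders
    # them by what each one produces/consumes, not by position
    pipeline = Pipeline([
        node(summarise, inputs="doubled", outputs="summary"),
        node(double, inputs="start", outputs="doubled"),
    ])

    catalog = DataCatalog({"start": MemoryDataSet(21)})
    print(SequentialRunner().run(pipeline, catalog))  # {'summary': 'result is 42'}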
    

We also think that a big differentiating factor is the `DataCatalog`. Being
able to define in YAML files where your data is and how it is stored/loaded
means that the same code will run in any environment given the appropriate
configuration files.

This makes testing & moving from development to production much easier.
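
For example (a hedged sketch: the dataset class names follow the early docs
and may differ between Kedro versions, and the bucket and file names are made
up), the same logical dataset can point at a local CSV in development and an
S3 object in production:

    import yaml
    from kedro.io import DataCatalog

    LOCAL = """
    cars:
      type: CSVLocalDataSet
      filepath: data/01_raw/cars.csv
    """

    PROD = """
    cars:
      type: CSVS3DataSet
      filepath: cars.csv
      bucket_name: my-prod-bucket
    """

    def catalog_for(env):
        conf = yaml.safe_load(LOCAL if env == "local" else PROD)
        return DataCatalog.from_config(conf)

    # pipeline code stays identical; only the configuration changes
    cars = catalog_for("local").load("cars")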

(Disclaimer: I am one of the lead developers of Kedro.)

We hope you give it a try and send us feedback :)

------
coverman
Starting to see a lot of these frameworks pop up to simplify deployment of
machine learning models. I'm really hoping one or two start to stand out...
but it doesn't feel like this is the one.

