> Machine learning models which can be deployed effortlessly and operate unattended are far more likely to achieve commercial objectives.
Likelihood of achieving commercial objectives is tied to the commercial usefulness and accuracy of your analysis and predictions, not to the ease of deployment or, even more curiously, the ability to be left unattended.
It's surely not a particularly contentious point that hard-to-deploy systems which require lots of attention to keep running are less likely to achieve commercial objectives.
Just as a website that is stable and easy to update helps your business make money from it. Of course, it also needs to be commercially useful.
I really like how they implemented the data catalog [0]: it's YAML-based and has a paths-style cascading hierarchy of files that can be shared across or within teams, as well as kept personal for individual projects. I think this makes it easy to build up tooling for meta-analysis (how many data sets are used, etc.) and even visualization with a variety of tools, rather than tying metadata management to a single system or product.
Are there other techniques for data catalogs that are file-based, or at least open-standard-based, and scale all the way up from a single developer?
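To make that concrete, here's a rough sketch of a catalog entry and how it gets loaded (the dataset name, file path and the exact dataset type string are all illustrative, and the import paths reflect the 2019-era API):

    # Entries like this would normally live in conf/base/catalog.yml; a
    # conf/local/catalog.yml with the same keys can override them per developer
    # or per project, which is the cascading behaviour described above.
    import yaml
    from kedro.io import DataCatalog

    BASE_CATALOG = """
    cars:
      type: CSVLocalDataSet           # illustrative; dataset type names vary by Kedro version
      filepath: data/01_raw/cars.csv
    """

    catalog = DataCatalog.from_config(yaml.safe_load(BASE_CATALOG))
    cars = catalog.load("cars")       # a pandas DataFrame for CSV-backed entries

Because everything is plain files, counting or grepping datasets for meta-analysis is just a matter of walking the conf directories.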
Conjecture: the production quality of ML code has mostly to do with how heuristics are designed and battle-tested, and almost nothing to do with how the training/inference pipeline is constructed.
Just because the challenge is relatively trivial to solve doesn't make it any less important, though. Experiment management, and the transition to production, is recognised as having a potentially high impact on successful delivery. My understanding is that this takes care of details which can otherwise get forgotten in the race for the best model. But YMMV.
Kedro puts emphasis on a seamless transition to prod without jeopardising work done in the experimentation stage:
- pipeline syntax is absolutely minimal (even supporting lambdas for simple transformations), inspired by the Clojure library core.graph https://github.com/plumatic/plumbing (see the sketch after this list)
- sequential and parallel runners are built in (you don't have to rely on Airflow)
- IO provides wrappers for familiar existing data sources, and directly borrows arguments from the pandas and Spark APIs, so there's no new API to learn
- flexibility, in the sense that you could rip out anything; for example, replacing the whole Data Catalog with another mechanism for data access like Haxl
- there's a project template which serves as a framework with built-in conventions from 50+ analytics engagements
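If it helps, here's a minimal sketch of what that pipeline syntax looks like (the preprocessing function, dataset names and the toy DataFrame are all made up, and the import paths reflect the 2019-era API, so treat it as an assumption rather than gospel):

    import pandas as pd
    from kedro.io import DataCatalog, MemoryDataSet
    from kedro.pipeline import Pipeline, node
    from kedro.runner import SequentialRunner  # ParallelRunner is also built in

    def preprocess(raw_df):
        # stand-in for a real cleaning step
        return raw_df.dropna()

    catalog = DataCatalog({
        "cars": MemoryDataSet(pd.DataFrame({"mpg": [21.0, None, 33.9]})),
    })

    pipeline = Pipeline([
        node(preprocess, inputs="cars", outputs="clean_cars"),
        # lambdas are allowed for simple transformations
        node(lambda df: df.assign(kpl=df["mpg"] * 0.425),
             inputs="clean_cars", outputs="cars_metric"),
    ])

    SequentialRunner().run(pipeline, catalog)

Nodes are just plain functions plus named inputs/outputs, which is what keeps the syntax small.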
tldr, if you really dig past the marketing (from the FAQ (1)):
> We see Airflow and Luigi as complementary frameworks: Airflow and Luigi are tools that handle deployment, scheduling, monitoring and alerting. Kedro is the worker that should execute a series of tasks, and report to the Airflow and Luigi managers.
> Create the data transformation steps as pure Python functions
Personally, I'm mystified as to why you would use something like this rather than a more mature product like, say, Spark, which natively supports clustering, etc. That's what I would really like to see addressed in the FAQ.
Is it a processing solution? Not really, since it suggests you can offload the heavy lifting to an engine, e.g. Spark. An orchestrator? Apparently not, because that's a complementary product. So... it's like, a configuration management tool?
> Is it a processing solution? Not really, since it suggests you can offload the heavy lifting to an engine, e.g. Spark. An orchestrator? Apparently not, because that's a complementary product. So... it's like, a configuration management tool?
I actually had the same questions when I was first introduced to Kedro! In my case, I didn't understand the value proposition over something like Apache Beam. After using it, I feel like Kedro provides:
1. A consistent structure across analytics pipelines. It's easy to start and pick up other Kedro projects after you've used it once.
2. Convenient and consistent I/O via the data catalog. The fact that we can configure and swap out data sources with ease is a huge plus, and we also rely heavily on data versioning (see the sketch after this list).
3. Easy integration with existing frameworks (PySpark, vanilla Pandas, Dask, Airflow, Luigi, etc.)
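To make point 2 concrete, here's a hypothetical pair of catalog configs; the names, paths and exact dataset type strings are invented, but the idea is that pipeline code only ever refers to "model_input", and the config you load decides where that data lives and whether saves are versioned:

    import yaml
    from kedro.io import DataCatalog

    DEV_CATALOG = """
    model_input:
      type: CSVLocalDataSet                  # local CSV while iterating
      filepath: data/05_model_input/model_input.csv
      versioned: true                        # each save goes to a new timestamped location
    """

    PROD_CATALOG = """
    model_input:
      type: CSVS3DataSet                     # assumption: an S3-backed dataset type
      filepath: model_input/model_input.csv
      bucket_name: my-bucket
    """

    # Swap environments by loading a different config, not by changing pipeline code.
    catalog = DataCatalog.from_config(yaml.safe_load(DEV_CATALOG))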
> Personally, I'm mystified as to why you would use something like this rather than a more mature product like, say, Spark, which natively supports clustering, etc. That's what I would really like to see addressed in the FAQ.
I'd say 80-90% of projects at QuantumBlack use (Py)Spark, so we've built out `SparkDataSet`s, `pandas_to_spark` and `spark_to_pandas` utility decorators, etc. There's a brief integration tutorial here: https://github.com/quantumblacklabs/kedro/tree/develop/kedro...
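For anyone wondering what that looks like in practice, a rough sketch (the import path is my guess at the contrib location from that era, and the dataset/column names are invented, so check the linked tutorial for the real thing):

    from kedro.contrib.io.pyspark import SparkDataSet  # assumption: path differs across versions
    from kedro.io import DataCatalog
    from kedro.pipeline import Pipeline, node

    catalog = DataCatalog({
        "transactions": SparkDataSet(filepath="data/01_raw/transactions.parquet",
                                     file_format="parquet"),
    })

    def filter_large(df):
        # df arrives as a pyspark.sql.DataFrame: Spark still does the heavy
        # lifting, Kedro just wires the step into the pipeline.
        return df.filter(df.amount > 1000)

    pipeline = Pipeline([node(filter_large, "transactions", "large_transactions")])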
Full disclosure: I'm a data engineer at QuantumBlack (if it wasn't obvious already!)
Because running Spark to do anything that doesn’t actually require a whole cluster is like using earthmoving equipment to assemble a series of small IKEA tables?
We experienced a big hit to our productivity when we were using Airflow, as there is significant overhead when running pipelines.
We think this is easier than Airflow and needs less setup:
- You don't need a scheduler, a DB, or any initial setup. On the contrary, Kedro provides the `kedro new` command, which will create a project for you that runs out of the box (optionally with a small pipeline example).
- You can run your pipelines as simple Python applications, making it easy to iterate in IDEs or terminals
- Tasks are simple Python functions, instead of operators
- Datasets are first-class citizens. You don't need to explicitly define dependencies between the tasks: they are resolved according to what each task produces/consumes
We also think that a big differentiating factor is the `DataCatalog`. Being able to define in YAML files where your data is and how it is stored/loaded means that the same code will run in any environment given the appropriate configuration files.
This makes testing & moving from development to production much easier.
(Disclaimer: I am one of the lead developers of Kedro)
We hope that you give it a try and give us feedback :)
I personally don't think it's that black and white. Not everyone has the same training in best practices for software engineering, and this tool looks like it places some constraints on the anarchy that can result, without requiring huge amounts of front-loading.
I personally find it simpler than Airflow since there is less boilerplate required to construct DAGs, and in my opinion there is less of a learning curve.
I think one of the big differences is that during development the pipeline DAG is inferred from the datasets each node consumes and produces, rather than explicitly coded the way you need to do in something like Airflow.
The logic being that once you've finished experimenting and iterating, it's much easier to move to Airflow.
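A tiny sketch of what that means in practice (function and dataset names invented): the nodes below are deliberately listed in the "wrong" order, but clean() still runs before train() because train consumes the dataset that clean produces; in Airflow you would declare that edge yourself (e.g. clean >> train).

    from kedro.pipeline import Pipeline, node

    def clean(raw_df):
        return raw_df.dropna()

    def train(clean_df):
        return {"n_rows": len(clean_df)}  # stand-in for fitting a real model

    pipeline = Pipeline([
        # declared out of order; the DAG comes from the dataset names
        node(train, inputs="clean_data", outputs="model"),
        node(clean, inputs="raw_data", outputs="clean_data"),
    ])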
Starting to see a lot of these frameworks pop up to simplify deployment of machine learning models. I’m really hoping one or two start to stand out... but it doesn’t feel like this is the one.