> Machine learning models which can be deployed effortlessly and operate unattended are far more likely to achieve commercial objectives.
Likelihood of achieving commercial objectives is tied to the commercial usefulness and accuracy of your analysis and predictions, not to the ease of deployment or, even more curiously, the ability to be left unattended.
It's surely not a particularly contentious point that hard-to-deploy systems which require lots of attention to keep running are less likely to achieve commercial objectives.
Just as a website that is stable and easy to update helps your business make money from it. Of course, it also needs to be commercially useful.
I really like how they implemented the data catalog [0]: it's YAML-based and has a paths-style cascading hierarchy of files that can be shared across or within teams, as well as kept personal for individual projects. I think this makes it easy to build up tooling for meta-analysis (how many data sets are used, etc.) and even visualization with a variety of tools, rather than tying metadata management to a single system or product.
Are there other techniques for data catalogs that are file-based, or at least open-standard-based, and scale all the way up from a single developer?
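To make that concrete, here's a rough sketch of a catalog entry and how it gets loaded (the dataset name, file path and the exact dataset type string are all illustrative, and the import paths reflect the 2019-era API):

    # Entries like this would normally live in conf/base/catalog.yml; a
    # conf/local/catalog.yml with the same keys can override them per developer
    # or per project, which is the cascading behaviour described above.
    import yaml
    from kedro.io import DataCatalog

    BASE_CATALOG = """
    cars:
      type: CSVLocalDataSet           # illustrative; dataset type names vary by Kedro version
      filepath: data/01_raw/cars.csv
    """

    catalog = DataCatalog.from_config(yaml.safe_load(BASE_CATALOG))
    cars = catalog.load("cars")       # a pandas DataFrame for CSV-backed entries

Because everything is plain files, counting or grepping datasets for meta-analysis is just a matter of walking the conf directories.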
Conjecture: the production quality of ML code has mostly to do with how heuristics are designed and battle-tested, and almost nothing to do with how the training/inference pipeline is constructed.
Just because the challenge is relatively trivial to solve doesn't make it any less important, though. Experiment management, and the transition to production, is recognised as having a potentially high impact on successful delivery. My understanding is that this takes care of details which can otherwise get forgotten in the race for the best model. But YMMV.
Kedro puts emphasis on a seamless transition to prod without jeopardising work done in the experimentation stage:
- pipeline syntax is absolutely minimal (even supporting lambdas for simple transformations), inspired by the Clojure library core.graph https://github.com/plumatic/plumbing (see the sketch after this list)
- sequential and parallel runners are built in (you don't have to rely on Airflow)
- IO provides wrappers for familiar existing data sources, and directly borrows arguments from the pandas and Spark APIs, so there's no new API to learn
- flexibility, in the sense that you could rip out anything; for example, replacing the whole Data Catalog with another mechanism for data access like Haxl
- there's a project template which serves as a framework with built-in conventions from 50+ analytics engagements
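If it helps, here's a minimal sketch of what that pipeline syntax looks like (the preprocessing function, dataset names and the toy DataFrame are all made up, and the import paths reflect the 2019-era API, so treat it as an assumption rather than gospel):

    import pandas as pd
    from kedro.io import DataCatalog, MemoryDataSet
    from kedro.pipeline import Pipeline, node
    from kedro.runner import SequentialRunner  # ParallelRunner is also built in

    def preprocess(raw_df):
        # stand-in for a real cleaning step
        return raw_df.dropna()

    catalog = DataCatalog({
        "cars": MemoryDataSet(pd.DataFrame({"mpg": [21.0, None, 33.9]})),
    })

    pipeline = Pipeline([
        node(preprocess, inputs="cars", outputs="clean_cars"),
        # lambdas are allowed for simple transformations
        node(lambda df: df.assign(kpl=df["mpg"] * 0.425),
             inputs="clean_cars", outputs="cars_metric"),
    ])

    SequentialRunner().run(pipeline, catalog)

Nodes are just plain functions plus named inputs/outputs, which is what keeps the syntax small.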
tldr, if you really dig past the marketing (from the FAQ (1)):
> We see Airflow and Luigi as complementary frameworks: Airflow and Luigi are tools that handle deployment, scheduling, monitoring and alerting. Kedro is the worker that should execute a series of tasks, and report to the Airflow and Luigi managers.
> Create the data transformation steps as pure Python functions
Personally, I'm mystified as to why you would use something like this rather than a more mature product like, say, Spark, which natively supports clustering, etc. That's what I would really like to see addressed in the FAQ.
Is it a processing solution? Not really, since it suggests you can offload the heavy lifting to an engine, e.g. Spark. An orchestrator? Apparently not, because that's a complementary product. So... it's like, a configuration management tool?
> Is it a processing solution? Not really, since it suggests you can offload the heavy lifting to an engine, e.g. Spark. An orchestrator? Apparently not, because that's a complementary product. So... it's like, a configuration management tool?
I actually had the same questions when I was first introduced to Kedro! In my case, I didn't understand the value proposition over something like Apache Beam. After using it, I feel like Kedro provides:
1. A consistent structure across analytics pipelines. It's easy to start and pick up other Kedro projects after you've used it once.
2. Convenient and consistent I/O via the data catalog. The fact that we can configure and swap out data sources with ease is a huge plus, and we also rely heavily on data versioning (see the sketch after this list).
3. Easy integration with existing frameworks (PySpark, vanilla Pandas, Dask, Airflow, Luigi, etc.)
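To make point 2 concrete, here's a hypothetical pair of catalog configs; the names, paths and exact dataset type strings are invented, but the idea is that pipeline code only ever refers to "model_input", and the config you load decides where that data lives and whether saves are versioned:

    import yaml
    from kedro.io import DataCatalog

    DEV_CATALOG = """
    model_input:
      type: CSVLocalDataSet                  # local CSV while iterating
      filepath: data/05_model_input/model_input.csv
      versioned: true                        # each save goes to a new timestamped location
    """

    PROD_CATALOG = """
    model_input:
      type: CSVS3DataSet                     # assumption: an S3-backed dataset type
      filepath: model_input/model_input.csv
      bucket_name: my-bucket
    """

    # Swap environments by loading a different config, not by changing pipeline code.
    catalog = DataCatalog.from_config(yaml.safe_load(DEV_CATALOG))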
> Personally, I'm mystified as to why you would use something like this rather than a more mature product like, say, Spark, which natively supports clustering, etc. That's what I would really like to see addressed in the FAQ.
I'd say 80-90% of projects at QuantumBlack use (Py)Spark, so we've built out `SparkDataSet`s, `pandas_to_spark` and `spark_to_pandas` utility decorators, etc. There's a brief integration tutorial here: https://github.com/quantumblacklabs/kedro/tree/develop/kedro...
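For anyone wondering what that looks like in practice, a rough sketch (the import path is my guess at the contrib location from that era, and the dataset/column names are invented, so check the linked tutorial for the real thing):

    from kedro.contrib.io.pyspark import SparkDataSet  # assumption: path differs across versions
    from kedro.io import DataCatalog
    from kedro.pipeline import Pipeline, node

    catalog = DataCatalog({
        "transactions": SparkDataSet(filepath="data/01_raw/transactions.parquet",
                                     file_format="parquet"),
    })

    def filter_large(df):
        # df arrives as a pyspark.sql.DataFrame: Spark still does the heavy
        # lifting, Kedro just wires the step into the pipeline.
        return df.filter(df.amount > 1000)

    pipeline = Pipeline([node(filter_large, "transactions", "large_transactions")])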
Full disclosure: I'm a data engineer at QuantumBlack (if it wasn't obvious already!)
Because running Spark to do anything that doesn’t actually require a whole cluster is like using earthmoving equipment to assemble a series of small IKEA tables?
We experienced a big hit to our productivity when we were using Airflow, as there is significant overhead when running pipelines.
We think this is easier than Airflow and needs less setup:
- You don't need a scheduler, a DB, or any initial setup. On the contrary, Kedro provides the `kedro new` command, which will create a project for you that runs out of the box (optionally with a small pipeline example).
- You can run your pipelines as simple Python applications, making it easy to iterate in IDEs or terminals
- Tasks are simple Python functions, instead of operators
- Datasets are first-class citizens. You don't need to explicitly define dependencies between the tasks: they are resolved according to what each task produces/consumes
We also think that a big differentiating factor is the `DataCatalog`. Being able to define in YAML files where your data is and how it is stored/loaded means that the same code will run in any environment given the appropriate configuration files.
This makes testing & moving from development to production much easier.
(Disclaimer: I am one of the lead developers of Kedro)
We hope that you give it a try and give us feedback :)
I personally don't think it's that black and white. Not everyone has the same training in best practices for software engineering, and this tool looks like it places some constraints on the anarchy that can result, without requiring huge amounts of front-loading.
I personally find it simpler than Airflow since there is less boilerplate required to construct DAGs, and in my opinion there is less of a learning curve.
I think one of the big differences is that during development the pipeline DAG is inferred from the datasets each node consumes and produces, rather than explicitly coded the way you need to do in something like Airflow.
The logic being that once you've finished experimenting and iterating, it's much easier to move to Airflow.
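A tiny sketch of what that means in practice (function and dataset names invented): the nodes below are deliberately listed in the "wrong" order, but clean() still runs before train() because train consumes the dataset that clean produces; in Airflow you would declare that edge yourself (e.g. clean >> train).

    from kedro.pipeline import Pipeline, node

    def clean(raw_df):
        return raw_df.dropna()

    def train(clean_df):
        return {"n_rows": len(clean_df)}  # stand-in for fitting a real model

    pipeline = Pipeline([
        # declared out of order; the DAG comes from the dataset names
        node(train, inputs="clean_data", outputs="model"),
        node(clean, inputs="raw_data", outputs="clean_data"),
    ])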
Starting to see a lot of these frameworks pop up to simplify deployment of machine learning models. I’m really hoping one or two start to stand out... but it doesn’t feel like this is the one.