That seems really out of place. I'm somewhat used to automatic data collection from applications, but automatic data collection from programming libraries / frameworks? Really?
On the other hand, I'm glad that they mentioned it --- I would have a much more negative reaction if I had to find this out on my own.
I don’t like packages that require external access to function. I understand the business model and think there are clear ways to do this (plotly and graphistry come to mind), but I don’t think the benefits outweigh the downsides of using these types of libraries.
At least this can be easily refactored out; plotly and graphistry don’t really function well without the API calls. Plotly offline exists, but trying to keep track of features between the two is a pain. And the reasoning given for the API (massive-scale compute) could easily be abstracted into a local mode if they wanted.
1. The day-to-day experimentation and iteration by a computational researcher.
2. The repeated execution of a workflow on different data sets submitted by different people, such as in a clinical testing lab.
3. The ongoing processing of a stream of data by a deployed system, such as ongoing data processing for a platform like Facebook.
For (1), there is a crucial insight that is often missing: the unit of work for such people is not the program, but the execution. If you have a Makefile or a shell script or even a nicely source-controlled program, you end up running small variations of it, with different parameters and different input files. Very quickly you end up with hundreds of files, and no way of tracking what comes from which execution under what conditions. Make doesn't help you with this. Workflow engines don't help you with this. Some years ago, when I was still in computational science, I wrote a system to handle this situation (https://github.com/madhadron/bein), but I haven't updated it to Python 3, and I would like to use Python's reflection capabilities to capture the source code as well. It should probably be integrated with Jupyter at this point, too, but Jupyter was in its infancy when I did that.
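To make the "unit of work is the execution" idea concrete, here is a minimal sketch of what such a tracker might record (an illustration of the idea, not bein's actual API): the parameters, the hashes of the input files, and the source of the function that was run, captured via Python's reflection.

```python
# Hypothetical execution tracker (illustrative only, not bein's API).
import hashlib, inspect, json, time, uuid

def file_hash(path):
    """Hash an input file so the record pins down exactly what was read."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def track_execution(func, params, input_files, log_path="executions.jsonl"):
    """Run func(**params) and append a record of the execution to a log."""
    record = {
        "id": str(uuid.uuid4()),
        "started": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "function": func.__name__,
        "source": inspect.getsource(func),  # capture the code via reflection
        "params": params,
        "inputs": {p: file_hash(p) for p in input_files},
    }
    result = func(**params)
    with open(log_path, "a") as log:
        log.write(json.dumps(record) + "\n")
    return result
```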
For (2), there are systems like KNIME and Galaxy, and, crucially, they integrate with a LIMS (Laboratory Information Management System), which is the really important part. The workflow is the same, but it's the provenance, tracking, and access control of all the steps of the work that matter in that setting.
For (3), what you really want is a company-wide DAG where individuals can add their own nodes, and which handles schema matching as nodes are upgraded, invalidation of downstream data when upstream data is invalidated, backfills when you add a new node or when an upstream node is invalidated, and all the other upkeep tasks required at scale. I have yet to see a system that does this seriously, but I also haven't been paying attention recently.
For none of these is chaining together functions with error handling and reporting the limiting factor. It's just the first problem a programmer sees when looking at one of these domains.
At the other end of the spectrum you have every small team with some data analysis steps producing their own workflow engine when Make would be just fine.
I agree, though the streaming case in particular is where Make is poor; but consider that, paired with an appropriate FUSE file system, Make can address most use cases.
I've never seen this work in practice, and doubt it can work, due to the complexities involved.
There are just engineering projects. There isn't any other kind.
For some engineering projects, you need to support dashboard-like, interactive interfaces that depend on data assets or other assets (like a database connection, a config file, a static representation of a statistical model, whatever). Sometimes you need a rapid feedback system to investigate properties of the engineering project and deduce implications for productively modifying it. These are universal requests that span tons of domains, and have very little to do with anything that differentiates data science from any other type of engineering.
At the level of an engineering project, you should use tools that have been developed by highly skilled systems engineers, like Make or Bazel, or devops tools for containers, deployment, and task orchestration, like luigi, Kubernetes tooling, and many others.
For a web service component, you should use web service tooling, like existing load-balancing tools, nginx, queue systems, key-value stores, and frameworks like Flask.
For continuous integration or testing, use tools that already exist, like Jenkins or Travis, testing frameworks, load testing tools, profilers, etc.
Stop trying to stick a handful of these things into a bundle with abstractions that limit their applicability to only “data science” projects, and then brand them to fabricate some idea that they are somehow better suited to data science work than decades' worth of tools that apply to any kind of engineering project, whether focused on data science or not.
The first one, I think, is that more and more people are now using, or trying to use, machine learning models in production, and they discover that the workflows and tools they are used to are not suited to delivering machine learning models in a fast, repeatable, and simple way.
The second reason is that I honestly think a machine learning pipeline or CI/CD system is a bit different from the one used for pure software engineering, partly because machine learning does not only involve code, but more layers of complexity: data, artifacts, configuration, resources... All these layers can impact the reproducibility of a "successful build". Hence, a lot of engineering is required both to ensure that teams can achieve reproducible and reliable results and to increase their productivity.
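To illustrate those extra layers, here is a rough sketch (the helper and file names are made up, not from any particular tool) of why a "successful build" in ML has to pin more than the code revision: the same identifier should only come back when code, configuration, and data are all unchanged.

```python
# Hypothetical sketch: an ML "build" is reproducible only if code, data,
# and configuration are all pinned, not just the code revision.
import hashlib, json

def build_fingerprint(data_paths, config, code_revision):
    """Combine code revision, config, and data hashes into a single build ID."""
    h = hashlib.sha256()
    h.update(code_revision.encode())
    h.update(json.dumps(config, sort_keys=True).encode())
    for path in sorted(data_paths):
        with open(path, "rb") as f:
            h.update(f.read())
    return h.hexdigest()

# A change in any layer (code, config, or data) yields a new fingerprint,
# so two runs are only comparable when their fingerprints match, e.g.:
# build_id = build_fingerprint(["train.csv"], {"lr": 0.1}, "a1b2c3d")
```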
All I can say is that, based on my experience, I would strongly disagree with what you wrote.
I’ve always found pre-existing generalist engineering tooling to work more efficiently and cover all the features I need in a more reliable and comprehensive way than any of the latest and greatest ML-specific workflow tools of the past ~10 years.
I’ve also worked on many production systems that do not involve any aspects of statistical modeling, yet still rely on large data sets or data assets, offline jobs that perform data transformations and preprocessing, require extensibly configurable parameters, etc. etc.
I’ve never encountered or heard of any ML system that is in any way different in kind than most other general types of production engineering systems.
But I have seen plenty of ML projects that get bogged down with enormous tech debt stemming from adopting some type of fool’s gold ML-specific deployment / pipeline / data access tools and running into problems that time-honored general system tools would have solved out of the box, and then needing to hack your own layers of extra tooling on top of the ML-specific stuff.
You're not wrong about the value of all the different tools you mention, but I think you're overlooking the integration and maintenance costs that a specialty tool can reduce, at the expense of some flexibility. I think that's the same reason many people prefer an IDE.
But they didn't really explain/sell what those optimisations are in the readme.
- The workflow in the readme is missing the part where you actually use the model. That step often needs to be connected to your original preprocessing in some way - for example, if the dataset you're predicting on has a categorical variable with a value that wasn't present in the training dataset, it effectively introduces a new feature, which makes it impossible to call model.predict(). The need to manage things like this changes that workflow chart quite a bit.
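As a rough sketch of that failure mode and one common way around it (the data here is made up, and this uses scikit-learn's OneHotEncoder rather than anything from the readme): fitting the encoder inside the same pipeline as the model keeps train-time and predict-time preprocessing connected.

```python
# Illustrative data; shows how an unseen category is handled when the
# encoder is fit together with the model in a single pipeline.
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression

train = pd.DataFrame({"city": ["berlin", "paris", "paris", "berlin"], "y": [0, 1, 1, 0]})
new = pd.DataFrame({"city": ["tokyo"]})  # category never seen during training

model = Pipeline([
    ("encode", OneHotEncoder(handle_unknown="ignore")),  # unseen value -> all-zero row
    ("clf", LogisticRegression()),
])
model.fit(train[["city"]], train["y"])
print(model.predict(new))  # works because preprocessing travels with the model
```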
I'd rather assemble a set of tightly scoped "UNIX philosophy" libraries and tools than try to use an all-encompassing framework and be straitjacketed by its imposed structure.
At work I have what are essentially cron jobs running scripts which invoke sklearn pipelines. I've never even thought to make the scheduler aware of what they were running and I'm not sure why I would.
I think it depends on the use case; sometimes the components of the pipeline aren't running on the same machine and don't know where or how to get access to data and artifacts generated by previous steps, so scheduling and orchestration become an important component of the pipeline itself.
> "I'd rather assemble together a set of tightly scoped "UNIX philosophy" libraries and tools rather than try and use an all encompassing framework and be straitjacketed by its imposed structure."
I think the idea behind building such frameworks is to help people avoid going through the same steps of building such a tool internally by "assembling a set of tightly scoped 'UNIX philosophy' libraries". In general, these frameworks use existing libraries and tools and expose an easy way to leverage them, instead of everyone spending time doing that over and over.
If cron works for you, that’s great, and you should continue to use it. However, I would be interested to know how many data sources you have, how you handle failures in pipe segments, and your general throughput.
In more complicated flows, ones that require different data sets to be combined, or lots of data flows that depend on each other, moving to a DAG with event triggering is a much better setup in my experience. Data is generated faster, errors are handled more gracefully, and recovery is much faster, since data is only recalculated when needed.
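As a rough sketch of what that looks like in practice, here is a minimal DAG in luigi (mentioned elsewhere in the thread; the task and file names are invented): each step declares its dependencies and its output, so a step is only re-run when its output is missing.

```python
# Minimal luigi sketch (task/file names are made up): each task declares
# what it needs and what it produces, so steps are only re-run as needed.
import luigi

class Extract(luigi.Task):
    def output(self):
        return luigi.LocalTarget("data/raw.csv")

    def run(self):
        with self.output().open("w") as f:
            f.write("id,value\n1,10\n2,20\n")

class Transform(luigi.Task):
    def requires(self):
        return Extract()  # dependency edge in the DAG

    def output(self):
        return luigi.LocalTarget("data/clean.csv")

    def run(self):
        with self.input().open() as raw, self.output().open("w") as out:
            out.write(raw.read().lower())

if __name__ == "__main__":
    # Runs Extract only if data/raw.csv is missing, then Transform.
    luigi.build([Transform()], local_scheduler=True)
```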
Sole developer of a process doesn't see the need for anything more than cron jobs; news at 11.
I think it would actually be professionally negligent to introduce coupling at this point.
It would be much clearer if they compared a few simple examples side by side with regular Makefiles that do the same thing, so people could see the advantages.
Internationalisation is long. Databolt isn't.