D6tflow: Python library for building data science workflows (github.com)
140 points by DBCerigo 19 days ago | 49 comments

> We have put a lot of effort into making this library useful to you. To help us make this library even better, it collects ANONYMOUS error messages and usage statistics. See d6tcollect for details including how to disable collection. Collection is asynchronous and doesn't impact your code in any way.

That seems really out of place. I'm somewhat used to automatic data collection from applications, but automatic data collection from programming libraries / frameworks? Really?

I have a strong, negative reaction to this. I read the collection code (in d6tcollect), and it does as they claim (with perhaps minor qualms about what anonymous really means). And (for now) it's easy to disable without mucking around in the code. But in fact I'm not sure what I was really looking for, since I don't imagine using this library when I'd need to re-verify that they still aren't collecting anything I don't want collected.

On the other hand, I'm glad that they mentioned it --- I would have a much more negative reaction if I had to find this out on my own.

I feel like an even better approach here would have been for the developers to offer the data collection functionality as a separately installed module, and then make the case to the user during installation of the main package.

This needs to be opt-in or banned. Otherwise anyone not collecting data will have a disadvantage, meaning sooner or later every library will be collecting data. And it doesn't stop at collecting usage statistics: some popular software already records which websites you visit and what you search for! (For example, how to do x in library y, so they can improve their documentation or whatnot.)

This library has a GDPR consent problem -- just putting it at the end of the README doesn't cut it. I wonder if this is the first open source library to violate GDPR.

gdpr really only cares about personally identifiable information, afaik. You don't need consent for anonymized stats.

If you read what it sends, they're sending function names / kwargs / module names etc. Put an IP in there (for example) or a person's name and you have a potential GDPR violation.

A while back I found out that the popular Serverless framework/library tracks and reports back usage (https://serverless.com/framework/docs/providers/aws/cli-refe...). This similarly struck me as really out of place, and (at the time at least) it didn't seem sufficiently disclosed or described in the docs. If I npm install it and invoke it, have I implicitly agreed to this?

Also interesting to see what happens when such a library becomes nested within a more popular library. Disabling should still work, but fewer people would be aware.

I thought this was a neat package and would have tested it. I won’t install or use it because of this.

I don’t like packages that require external access to function. I understand the business model and think there are clear ways to do this (plotly and graphistry come to mind), but I don’t think the benefit outweighs the downsides of using these types of libraries.

At least this can be easily refactored out; plotly and graphistry don’t really function well without the API calls. Plotly offline exists, but trying to keep track of features between the two is a pain. And the reasoning given for the API (massive-scale compute) could easily be abstracted for local mode if they wanted.

Actually, Plotly offline works exactly the same way as Plotly online: same code runs on both ends, total feature parity from Python :)

I recently read that there are over 100 workflow engines [1], yet it is still often difficult to make the case that Make is insufficient.

[1] https://vatlab.github.io/blog/post/sos-workflow-engine/

It's easy to make the case that make is insufficient. There are three very distinct domains being conflated here:

1. The day-to-day experimentation and iteration by a computational researcher.

2. The repeated execution of a workflow on different data sets submitted by different people, such as in a clinical testing lab.

3. The ongoing processing of a stream of data by a deployed system, such as ongoing data processing for a platform like Facebook.

For (1), there is a crucial insight that is often missing: the unit of work for such people is not the program, but the execution. If you have a Makefile or a shell script or even a nicely source controlled program, you end up running small variations of it, with different parameters, and different input files. Very quickly you end up with hundreds of files, and no way of tracking what comes from which execution under what conditions. make doesn't help you with this. Workflow engines don't help you with this. I wrote a system some years ago when I was still in computational science to handle this situation (https://github.com/madhadron/bein), but I haven't updated it to Python 3, and I would like to use Python's reflection capabilities to capture the source code as well. It should probably be integrated with Jupyter at this point, too, but Jupyter was in its infancy when I did that.
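The execution-tracking idea above can be sketched in a few lines. This is a minimal, hypothetical helper (not the actual bein API): it appends one record per run, capturing the parameters, a hash of each input file, and a timestamp, so outputs can later be traced back to the exact execution that produced them.

```python
import hashlib
import json
import time
from pathlib import Path

def record_run(log_dir, params, input_files):
    """Append one execution record: parameters, input-file hashes, and a
    timestamp, so hundreds of result files stay traceable to their runs."""
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "params": params,
        "inputs": {
            str(f): hashlib.sha256(Path(f).read_bytes()).hexdigest()
            for f in input_files
        },
    }
    log = Path(log_dir) / "executions.jsonl"
    with log.open("a") as fh:
        fh.write(json.dumps(record) + "\n")
    return record
```

A real system would also capture the source code (as the comment suggests, via Python's reflection facilities), but even this much answers "which run produced this file, under what conditions?"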

For (2), there are systems like KNIME and Galaxy, and, crucially, they integrate with a LIMS (Laboratory Information Management System) which is the really important part. The workflow is the same, but it's provenance, tracking, and access control of all steps of the work that matters in that setting.

For (3), what you really want is a company wide DAG where individuals can add their own nodes and which handles schema matching as nodes are upgraded, invalidation of downstream data when upstream data is invalidated, backfills when you add a new node or when an upstream node is invalidated, and all the other upkeep tasks required at scale. I have yet to see a system that does this seriously, but I also haven't been paying attention recently.

For none of these is chaining together functions with error handling and reporting the limiting factor. It's just the first one that a programmer sees when looking at one of these domains.

Ok, but none of those situations can be addressed with zero budget. If you've got those problems to solve then you usually have a budget to build/buy an appropriate tool.

At the other end of the spectrum you have every small team with some data analysis steps producing their own workflow engine when Make would be just fine.

I agree, however, that the streaming case is particularly poorly served; but consider that, paired with an appropriate FUSE file system, Make can address most use cases.

> what you really want is a company wide DAG where individuals can add their own nodes and which handles schema matching as nodes are upgraded, invalidation of downstream data when upstream data is invalidated

I've never seen this work in practice, and doubt it can work, due to the complexities involved.

It really helped in our case. We have a team of 10+ researchers who also ship code in production. They were repeatedly running into a problem where they'd recompute the same data at runtime, or reinvent the wheel because they didn't know somebody had already computed that datum. I ended up writing a small single-process (for now) workflow engine running a “company-wide DAG” of reusable data processing nodes (all derived from user-submitted input + models). Now it is much easier for individuals to contribute and much easier to optimize pipelines separately. I might open source it some time soon.
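The core of such a shared DAG can be tiny. This is an assumed design sketch (not the poster's actual engine): each node declares its upstream dependencies and a compute function, results are cached so shared data is computed once, and invalidating a node invalidates everything downstream of it.

```python
class Node:
    """One reusable data-processing node in a shared DAG."""
    def __init__(self, name, compute, deps=()):
        self.name = name
        self.compute = compute
        self.deps = list(deps)
        self._cache = None
        self._computed = False

    def value(self):
        # Recompute only if not cached; upstream values are pulled lazily,
        # so shared intermediate data is computed exactly once.
        if not self._computed:
            upstream = [d.value() for d in self.deps]
            self._cache = self.compute(*upstream)
            self._computed = True
        return self._cache

    def invalidate(self, graph):
        # Invalidate this node and, recursively, everything downstream.
        self._computed = False
        for n in graph:
            if self in n.deps:
                n.invalidate(graph)
```

For example, `raw = Node("raw", load_data)` and `feats = Node("feats", featurize, [raw])` share one computation of `raw`, and `raw.invalidate(graph)` forces `feats` to recompute on next access.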

It's what is done de facto by large enough groups anyway. They just have to kludge tooling together for it.

It's not so much that Make is insufficient; there are a huge number of reasons to use something over Make. However, it's true that the difference between what you can implement with any reasonable amount of work and Make doesn't usually justify not using something simple that everyone understands (like Make). The next step up "worth making" involves hundreds of features that go deep into territory most people don't realize exists when they start re-inventing this particular wheel.

Thanks for the link! That lightweight SoS notebook seems like the sweet spot between agility and tidiness.

I feel so frustrated by the emergence of these things and the constant attempt to brand them or stylize them towards data science.

These are just engineering projects; there are no other kinds of things here.

For some engineering projects, you need to support dashboard-like, interactive interfaces that depend on data assets or other assets (like a database connection, a config file, a static representation of a statistical model, whatever). Sometimes you need a rapid feedback system to investigate properties of the engineering project and deduce implications for productively modifying it. These are universal requests that span tons of domains, and have very little to do with anything that differentiates data science from any other type of engineering.

At the level of an engineering project, you should use tools that have been developed by highly skilled system engineers, for example Make or Bazel, or devops tools for containers, deployment, and task orchestration, like luigi, kubernetes tools, and many others.

For a web service component, you should use web service tooling, like existing load balancing tools, nginx, queue systems, key value stores, frameworks like Flask.

For continuous integration or testing, use tools that already exist, like Jenkins or Travis, testing frameworks, load testing tools, profilers, etc.

Stop trying to stick a handful of these things into a bundle with abstractions that limit the applicability to only “data science” projects, and then brand them to fabricate some idea that they are somehow better suited for data science work than decades worth of tools that apply to any kind of engineering project, whether focused on data science or not.

I might be wrong, and I also highly value and use the tools that you mentioned. However, I have to say that I see the emergence of these tools as a positive thing for 2 reasons.

The first is that more and more people are now trying to use machine learning models in production, and they discover that the workflows and tools they are used to are not suited to delivering machine learning models in a fast, repeatable, and simple way.

The second reason is that I objectively think that a machine learning pipeline or CI/CD system is a bit different than the one used for pure software engineering practices, partly because machine learning does not only involve code, but more layers of complexity: data, artifacts, configuration, resources... All these layers can impact the reproducibility of a "successful build". Hence, a lot of engineering is required to both ensure that teams can achieve both reproducible and reliable results, and increase their productivity.

I am a long-time practitioner of putting machine learning tools into production, improving ML models over time, doing maintenance on deployed ML models, and researching new ways to solve problems with ML models.

All I can say is that, based on my experience, I would dramatically disagree with what you wrote.

I’ve always found pre-existing generalist engineering tooling to work more efficiently and cover all the features I need in a more reliable and comprehensive way than any of the latest and greatest ML-specific workflow tools of the past ~10 years.

I’ve also worked on many production systems that do not involve any aspects of statistical modeling, yet still rely on large data sets or data assets, offline jobs that perform data transformations and preprocessing, require extensibly configurable parameters, etc. etc.

I’ve never encountered or heard of any ML system that is in any way different in kind than most other general types of production engineering systems.

But I have seen plenty of ML projects that get bogged down with enormous tech debt stemming from adopting some type of fool’s gold ML-specific deployment / pipeline / data access tools and running into problems that time-honored general system tools would have solved out of the box, and then needing to hack your own layers of extra tooling on top of the ML-specific stuff.

I was going to make a less general comment along these lines, that I have put models to production with GitLab CI, Make and Slurm, and it keeps us honest and on task. There’s no mucking about with fairy dust data science toolchains and no excuses not to find a solution when problems arise because we’re using well tested methodology on well tested software.

But I don't want to learn 50 different tools that are all best-in-class. My use case is social media analysis of specific communities with fairly limited resources, so every hour that I spend on tooling is time not spent observing my subjects.

You're not wrong about the value of all the different tools you mention, but I think you're overlooking the integration and maintenance costs that a specialty tool can reduce, at the expense of some flexibility. I think that's the same reason many people prefer an IDE.

Learning the time tested tools almost always involves spending less time setting up / reading tutorials / etc. The time sink of betting the farm on latest and greatest data science frameworks is often gigantic and gets worse over time.

"For data scientists and data engineers, d6tflow is a python library which makes building complex data science workflows easy, fast and intuitive. It is built on top of workflow manager luigi but unlike luigi it is optimized for data science workflows."

But they didn't really explain/sell what those optimisations are in the readme.

- I really don't think that Python needs an additional layer of abstraction, especially when using scikit-learn which is high level enough as it is.

- The workflow in the readme is missing the part where you actually use the model. This will often need to be connected to your original preprocessing in some way - for example, if the dataset you're predicting on has a categorical variable with a unique value which wasn't present in the training dataset, this effectively introduces a new feature into your dataset, which makes it impossible to do model.predict(). The need to manage things like this changes that workflow chart quite a bit.
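One common mitigation for the unseen-category problem (a generic sketch, independent of d6tflow) is to freeze the category vocabulary at fit time and map anything unseen at predict time into a dedicated "unknown" bucket, so the feature space never changes between training and prediction:

```python
class CategoryEncoder:
    """Toy one-hot encoder that freezes its vocabulary at fit time.
    Categories unseen during training fall into a shared 'unknown'
    column instead of silently changing the feature space."""
    def fit(self, values):
        self.categories = sorted(set(values))
        self.index = {c: i for i, c in enumerate(self.categories)}
        return self

    def transform(self, values):
        width = len(self.categories) + 1  # last column = unknown
        rows = []
        for v in values:
            row = [0] * width
            row[self.index.get(v, width - 1)] = 1
            rows.append(row)
        return rows

enc = CategoryEncoder().fit(["red", "blue", "red"])
# "green" was never seen in training, so it lands in the unknown column
print(enc.transform(["blue", "green"]))  # → [[1, 0, 0], [0, 0, 1]]
```

In practice scikit-learn's `OneHotEncoder(handle_unknown="ignore")` gives you the same behavior (unseen categories become all-zero columns) without hand-rolling an encoder.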

The full machine learning pipeline goes far beyond sklearn's scope. Think about scheduled data reading or scheduled model updates. I think there is a need for such frameworks to help with using ML.

Going far beyond the scope of sklearn pipelines isn't necessarily a good thing. I don't really see what there is to be gained by making scheduling part of the remit of pipelines.

I'd rather assemble together a set of tightly scoped "UNIX philosophy" libraries and tools rather than try and use an all encompassing framework and be straitjacketed by its imposed structure.

At work I have what are essentially cron jobs running scripts which invoke sklearn pipelines. I've never even thought to make the scheduler aware of what they were running and I'm not sure why I would.

> "I don't really see what there is to be gained by making scheduling part of the remit of pipelines."

I think it depends on the use case, sometimes the components of the pipeline aren't necessarily running on the same machine, and they don't know where and how to get access to data and artifacts generated by previous steps, and so scheduling and orchestration becomes an important component of the pipeline itself.

> "I'd rather assemble together a set of tightly scoped "UNIX philosophy" libraries and tools rather than try and use an all encompassing framework and be straitjacketed by its imposed structure."

I think the idea behind building such frameworks is to help people avoid going through the same steps of building such tool internally by "assembling together a set of tightly scoped "UNIX philosophy" libraries". In general these frameworks are using libraries and tools, and exposing an easy way to leverage them instead of spending time doing that over and over.

>At work I have what are essentially cron jobs running scripts which invoke sklearn pipelines.

If cron works for you, that’s great, and you should continue to use it. However, I would be interested to know how many data sources you have, how you handle failures in pipe segments, and your general throughput.

In more complicated flows, ones that require different data sets to be combined, or lots of data flows that depend on each other, moving to a DAG with event triggering is a much better setup in my experience. Data is generated faster, errors are handled more gracefully, and recovery is much faster since data is only recalculated when needed.

>At work I have what are essentially cron jobs

Sole developer of process doesn't see the need for anything more than cron jobs, news at 11

Sure, you can use cron jobs for it. But cron is missing, for example, notifications about failures, or info on how long the task takes to execute.
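Both of those gaps can be closed in a handful of lines inside the script itself, as the replies suggest. A minimal sketch, where `job` is a hypothetical stand-in for the actual pipeline script:

```python
import logging
import sys
import time

logging.basicConfig(level=logging.INFO)

def run_pipeline(job):
    """Wrap a cron-invoked job with the failure notification and
    duration reporting that plain cron lacks."""
    start = time.monotonic()
    try:
        job()
    except Exception:
        logging.exception("pipeline failed")  # or hook in email/Slack here
        sys.exit(1)  # nonzero exit code lets the scheduler detect failure
    logging.info("pipeline finished in %.1fs", time.monotonic() - start)
```

Cron itself can then mail the captured output on failure (its default behavior when a job writes to stderr), or a monitoring system can watch the exit codes.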

I don't actually use cron directly. What I do use is capable of scheduling and error detection (nonzero exit codes). Even if it weren't, the script it invokes could do both with < 4 lines of code.

I think it would actually be professionally negligent to introduce coupling at this point.

This is so abstract that its purpose is incomprehensible.

It would be much clearer if they compared side by side a few simple examples with regular makefiles that do the same thing, and people could see the advantages.

For another way to do data science workflows, there's the Common Workflow Language (http://commonwl.org)

Most of what I've heard about CWL is that it's unwieldy to use directly; its main value is as a backend or as an interchange format between different workflow systems. However, I haven't tried it myself and would be interested to hear others' experiences with it.

Am I missing any major difference between this tool and a solution like Apache Airflow?

Is there any advantage in using this (D6tflow) instead of some Data Management tools like StreamSets (https://streamsets.com/) or NiFi ?

Would be interested to find out what other tools are available in this space and how this compares with them. What's the difference between d6tflow, Luigi, and Airflow?

Quickly perusing the code, it looks entirely luigi based.

The reasons I choose not to use any pre-existing workflow engine are complexity and extra compute overhead. I think there's a need for micro workflows: simple enough that you can set up a basic production workflow in a single python script without requiring a ton of heavy packages. d6tflow seems light enough for me, using luigi and two dataframe libraries.

"Databolt" is nowhere near long enough to need abbreviation and in any case you should tell people what it is before you start abbreviating it. Starting with "D6t", which is horrendous to pronounce, is silly.

Internationalisation is long. Databolt isn't.

This looks pretty interesting. I am currently implementing DS workflows that are essentially python classes to orchestrate R scripts. I'll have a closer look on Monday but if I can use it to handle rpy2 R format data I'll be happy

Are you using reticulate (https://github.com/rstudio/reticulate), or having Python spawn a new worker process for R?

We have python spawn processes for R.

I've been wondering about this - why would you pass data from one language to the other? If you already have existing R code, you can easily manage it through R itself.

I suppose we could have written an R API to serve ML results directly, but there are a lot of other auxiliary tasks needed like protobuf handling, AWS integration and some other business logic that means python makes more sense. The quickest way to make it work was to wrap the ML scripts (and they really are __scripts__, not even functions..) in python and just handle everything outside of R. I think eventually we will just re-implement the ML stuff in python anyway.

D6t appears to stand for databolt. Took me a while to figure that out.
