
D6tflow: Python library for building data science workflows - DBCerigo
https://github.com/d6t/d6tflow
======
SebiH
> We have put a lot of effort into making this library useful to you. To help
> us make this library even better, it collects ANONYMOUS error messages and
> usage statistics. See d6tcollect for details including how to disable
> collection. Collection is asynchronous and doesn't impact your code in any
> way.

That seems really out of place. I'm somewhat used to automatic data collection
from applications, but automatic data collection from programming libraries /
frameworks? Really?

~~~
mixedmath
I have a strong, negative reaction to this. I read the collection code (in
d6tcollect), and it does as they claim (with perhaps minor qualms about what
anonymous really means). And (for now) it's easy to disable without mucking
around in the code. But in fact I'm not sure what I was really looking for,
since I don't imagine using this library when I might need to reverify that
they still aren't collecting anything I don't want to be collected.

On the other hand, I'm glad that they mentioned it --- I would have a much
more negative reaction if I had to find this out on my own.

~~~
ende
I feel like an even better approach here would have been for the developers to
offer the data collection functionality as a separately installed module, and
then make the case to the user during installation of the main package.

------
marmaduke
I recently read that there are over 100 workflow engines [1]; it is still
often difficult to make the case that Make is insufficient.

[1] [https://vatlab.github.io/blog/post/sos-workflow-engine/](https://vatlab.github.io/blog/post/sos-workflow-engine/)

~~~
madhadron
It's easy to make the case that make is insufficient. There are three very
distinct domains being conflated here:

1\. The day-to-day experimentation and iteration by a computational
researcher.

2\. The repeated execution of a workflow on different data sets submitted by
different people, such as in a clinical testing lab.

3\. The ongoing processing of a stream of data by a deployed system, such as
ongoing data processing for a platform like Facebook.

For (1), there is a crucial insight that is often missing: the unit of work
for such people is not the program, but the execution. If you have a Makefile
or a shell script or even a nicely source controlled program, you end up
running small variations of it, with different parameters, and different input
files. Very quickly you end up with hundreds of files, and no way of tracking
what comes from which execution under what conditions. make doesn't help you
with this. Workflow engines don't help you with this. I wrote a system some
years ago when I was still in computational science to handle this situation
([https://github.com/madhadron/bein](https://github.com/madhadron/bein)), but
I haven't updated it to Python 3, and I would like to use Python's reflection
capabilities to capture the source code as well. It should probably be
integrated with Jupyter at this point, too, but Jupyter was in its infancy
when I did that.

For (2), there are systems like KNIME and Galaxy, and, crucially, they
integrate with a LIMS (Laboratory Information Management System) which is the
really important part. The workflow is the same, but it's provenance,
tracking, and access control of all steps of the work that matters in that
setting.

For (3), what you really want is a company wide DAG where individuals can add
their own nodes and which handles schema matching as nodes are upgraded,
invalidation of downstream data when upstream data is invalidated, backfills
when you add a new node or when an upstream node is invalidated, and all the
other upkeep tasks required at scale. I have yet to see a system that does
this seriously, but I also haven't been paying attention recently.

For none of these is chaining together functions with error handling and
reporting the limiting factor. It's just the first one that a programmer sees
when looking at one of these domains.

~~~
m0zg
> what you really want is a company wide DAG where individuals can add their
> own nodes and which handles schema matching as nodes are upgraded,
> invalidation of downstream data when upstream data is invalidated

I've never seen this work in practice, and doubt it can work, due to the
complexities involved.

~~~
ematvey
It really helped in our case. We have a team of 10+ researchers who also ship
code in production. They were repeatedly running into a problem where they
recompute the same data at runtime, or reinvent the wheel because they didn't
know somebody had already computed that datum. I ended up writing a small
single-process (for now) workflow engine running a "company-wide DAG" of
reusable data processing nodes (all derived from user-submitted input +
models). Now it is much easier for individuals to contribute and much easier
to optimize pipelines separately. I might open source it sometime soon.
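The core of such an engine can be sketched in a few lines. This is not
ematvey's actual code, just a minimal single-process illustration of the
idea: nodes register their dependencies, and computed results are cached so
two pipelines sharing an upstream node never recompute the same datum:

```python
# Toy sketch of a shared DAG of reusable data-processing nodes.
REGISTRY = {}   # node name -> (function, dependency names)
CACHE = {}      # node name -> computed value (the shared "datum" store)


def node(*deps):
    """Register a function as a DAG node with named dependencies."""
    def register(func):
        REGISTRY[func.__name__] = (func, deps)
        return func
    return register


def compute(name):
    """Resolve a node, computing (and caching) its dependencies first."""
    if name in CACHE:
        return CACHE[name]
    func, deps = REGISTRY[name]
    CACHE[name] = func(*(compute(d) for d in deps))
    return CACHE[name]


@node()
def raw_data():
    return [1, 2, 3]


@node("raw_data")
def features(data):
    return [x * 10 for x in data]


@node("features")
def model_input(feats):
    return sum(feats)
```

A new contributor adds one decorated function; everything upstream is reused
from the cache rather than recomputed.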

------
mlthoughts2018
I feel so frustrated by the emergence of these things and the constant attempt
to brand them or stylize them towards data science.

There are just engineering projects. There are not any other things.

For some engineering projects, you need to support dashboard-like, interactive
interfaces that depend on data assets or other assets (like a database
connection, a config file, a static representation of a statistical model,
whatever). Sometimes you need a rapid feedback system to investigate
properties of the engineering project and deduce implications for productively
modifying it. These are universal requests that span tons of domains, and have
very little to do with anything that differentiates data science from any
other type of engineering.

At the level of an engineering project, you should use tools that have been
developed by highly skilled system engineers, for example like Make or Bazel,
or devops tools for containers or deployment and task orchestration, like
luigi, kubernetes tools, and many others.

For a web service component, you should use web service tooling, like existing
load balancing tools, nginx, queue systems, key value stores, frameworks like
Flask.

For continuous integration or testing, use tools that already exist, like
Jenkins or Travis, testing frameworks, load testing tools, profilers, etc.

Stop trying to stick a handful of these things into a bundle with abstractions
that limit the applicability to only “data science” projects, and then brand
them to fabricate some idea that they are somehow better suited for data
science work than decades worth of tools that apply to _any_ kind of
engineering project, whether focused on data science or not.

~~~
mmq
I might be wrong, and I also highly value and use the tools that you
mentioned. However, I have to say that I see the emergence of these tools as a
positive thing for 2 reasons.

The first is that more and more people are now using, or trying to use,
machine learning models in production, and they discover that the workflows
and tools they are used to working with are not suited for delivering machine
learning models in a fast, repeatable, and simple way.

The second reason is that I think a machine learning pipeline or CI/CD system
is a bit different from the one used for pure software engineering practices,
partly because machine learning does not only involve code, but more layers
of complexity: data, artifacts, configuration, resources... All these layers
can impact the reproducibility of a "successful build". Hence, a lot of
engineering is required both to ensure that teams can achieve reproducible
and reliable results and to increase their productivity.

~~~
mlthoughts2018
I am a long-time practitioner of putting machine learning tools into
production, improving ML models over time, doing maintenance on deployed ML
models, and researching new ways to solve problems with ML models.

All I can say is that, based on my experience, I would dramatically disagree
with what you wrote.

I’ve always found pre-existing generalist engineering tooling to work more
efficiently and cover all the features I need in a more reliable and
comprehensive way than any of the latest and greatest ML-specific workflow
tools of the past ~10 years.

I’ve also worked on many production systems that do not involve any aspects of
statistical modeling, yet still rely on large data sets or data assets,
offline jobs that perform data transformations and preprocessing, require
extensibly configurable parameters, etc. etc.

I’ve never encountered or heard of any ML system that is in any way different
_in kind_ than most other general types of production engineering systems.

But I have seen plenty of ML projects that get bogged down with enormous tech
debt stemming from adopting some type of fool’s gold ML-specific deployment /
pipeline / data access tools and running into problems that time-honored
general system tools would have solved out of the box, and then needing to
hack your own layers of extra tooling on top of the ML-specific stuff.

~~~
marmaduke
I was going to make a less general comment along these lines, that I have put
models to production with GitLab CI, Make and Slurm, and it keeps us honest
and on task. There’s no mucking about with fairy dust data science toolchains
and no excuses not to find a solution when problems arise because we’re using
well tested methodology on well tested software.

------
DBCerigo
"For data scientists and data engineers, d6tflow is a python library which
makes building complex data science workflows easy, fast and intuitive. It is
built on top of workflow manager luigi but unlike luigi it is optimized for
data science workflows."

But they didn't really explain/sell what those optimisations are in the
readme.

------
altairiumblue
\- I really don't think that Python needs an additional layer of abstraction,
especially when using scikit-learn which is high level enough as it is.

\- The workflow in the readme is missing the part when you actually use the
model. This will often need to be connected to your original preprocessing in
some way - for example, if the dataset that you're predicting on has a
categorical variable with a unique value which wasn't present in the training
dataset, this effectively introduces a new feature in your dataset, which
makes it impossible to do model.predict(). The need to manage things like this
changes that workflow chart quite a bit.
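The unseen-category problem is easy to demonstrate without any framework.
This is a pure-Python illustration (column names and data invented here);
scikit-learn's `OneHotEncoder(handle_unknown="ignore")` implements the same
fix as the second transform below:

```python
def fit_one_hot(values):
    """Learn the category -> column mapping from the training data."""
    return {cat: i for i, cat in enumerate(sorted(set(values)))}


def transform_strict(mapping, values):
    """Naive transform: crashes on categories absent from training."""
    rows = []
    for v in values:
        row = [0] * len(mapping)
        row[mapping[v]] = 1  # KeyError if v was never seen in training
        rows.append(row)
    return rows


def transform_ignore(mapping, values):
    """Unseen categories become an all-zero row, keeping the feature
    width fixed so model.predict() still gets the shape it expects."""
    rows = []
    for v in values:
        row = [0] * len(mapping)
        if v in mapping:
            row[mapping[v]] = 1
        rows.append(row)
    return rows


mapping = fit_one_hot(["red", "blue", "red"])   # training categories
# transform_strict(mapping, ["green"]) would raise KeyError
safe = transform_ignore(mapping, ["green", "red"])
```

Whether dropping the unseen value to all zeros is acceptable is itself a
modeling decision, which is exactly why this belongs in the workflow chart.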

~~~
pplonski86
The full machine learning pipeline goes far beyond sklearn scope. Think about
scheduled data reading or scheduled model updates. I think there is a need for
such frameworks to help using ml.

~~~
pytester
Going far beyond the scope of sklearn pipelines isn't necessarily a good
thing. I don't really see what there is to be gained by making scheduling part
of the remit of pipelines.

I'd rather assemble together a set of tightly scoped "UNIX philosophy"
libraries and tools rather than try and use an all encompassing framework and
be straitjacketed by its imposed structure.

At work I have what are essentially cron jobs running scripts which invoke
sklearn pipelines. I've never even thought to make the scheduler aware of what
they were running and I'm not sure why I would.

~~~
pplonski86
Sure, you can use cron jobs for it. But cron lacks, for example, notifications
about failures or information on how long a task takes to execute.

~~~
pytester
I don't actually use cron directly. What I do use is capable of scheduling
and error detection (nonzero exit codes). Even if it weren't, the script it
invokes could do both with < 4 lines of code.

I think it would actually be professionally negligent to introduce coupling at
this point.
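A sketch of the kind of thin wrapper the comment alludes to, with `notify()`
as a stand-in for whatever alerting channel already exists (email, chat
webhook, etc.):

```python
import sys
import time
import traceback


def notify(message):
    print(message, file=sys.stderr)  # placeholder for a real alert channel


def run_pipeline(pipeline, *args):
    """Run a pipeline callable, reporting duration and failures.

    Exits nonzero on failure so any scheduler's own error detection
    picks it up; no coupling between scheduler and pipeline needed.
    """
    start = time.time()
    try:
        result = pipeline(*args)
    except Exception:
        notify(f"{pipeline.__name__} failed:\n{traceback.format_exc()}")
        sys.exit(1)
    notify(f"{pipeline.__name__} finished in {time.time() - start:.1f}s")
    return result
```

The scheduler only ever sees an exit code; the script owns notification and
timing.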

------
enriquto
This is so abstract that its purpose is incomprehensible.

It would be much clearer if they compared side by side a few simple examples
with regular makefiles that do the same thing, and people could see the
advantages.

------
tetron
For another way to do data science workflows, there's the Common Workflow
Language ([http://commonwl.org](http://commonwl.org))

~~~
snackematician
Most of what I've heard about CWL is that it's unwieldy to use directly; its
main value is as a backend or as an interchange format between different
workflow systems. However, I haven't tried it myself and would be interested
to hear others' experiences with it.

------
afrnz
Am I missing any major difference between this tool and a solution like Apache
Airflow?

------
damajor
Is there any advantage in using this (D6tflow) instead of some Data Management
tools like StreamSets ([https://streamsets.com/](https://streamsets.com/)) or
NiFi ?

------
asimjalis
Would be interested to find out what other tools are available in this space
and how this compares with them. What's the difference between d6tflow,
Luigi, and Airflow?

~~~
SmirkingRevenge
Quickly perusing the code, it looks entirely luigi based.

------
theblackcat1002
The reasons I choose not to use any pre-existing workflow engine are
complexity and extra compute overhead. I think there's a need for a micro
workflow engine simple enough that you can set up a production workflow in a
single Python script without requiring a ton of heavy packages. d6tflow seems
light enough for me, using luigi and two dataframe libraries.

------
black-tea
"Databolt" is nowhere near long enough to need abbreviation and in any case
you should tell people what it is before you start abbreviating it. Starting
with "D6t", which is horrendous to pronounce, is silly.

Internationalisation is long. Databolt isn't.

------
L_226
This looks pretty interesting. I am currently implementing DS workflows that
are essentially Python classes orchestrating R scripts. I'll have a closer
look on Monday, but if I can use it to handle rpy2 R-format data I'll be
happy.

~~~
perturbation
Are you using reticulate
([https://github.com/rstudio/reticulate](https://github.com/rstudio/reticulate)),
or having Python spawn a new worker process for R?

~~~
L_226
We have python spawn processes for R.

------
mijoharas
D6t appears to stand for databolt. Took me a while to figure that out.

