
Pypeline: A Python library for creating concurrent data pipelines - cgarciae
https://github.com/cgarciae/pypeln
======
anentropic
Too much abbreviation!

pypeline --> pypeln

multiprocessing pipeline --> pr

threads pipeline --> th

asyncio pipeline --> io

this is totally unnecessary

If I want to use short abbreviated names in my code I can always `from
pypeline import multiprocess_pipeline as pr`

Your library shouldn't export them like this as the default.

`io` is especially bad since it shadows the `io` module in the Python stdlib

~~~
cgarciae
Point taken! Thanks a lot for your feedback. Just a few points:

* pypeline is already taken :(

* My main reason for this was that initially I expected you to do `import
pypeln as pl` and then call things like `pl.pr.map`; since you can't
abbreviate the module inside `pl`, I picked short names. But then I decided
to go for the import-the-module kind of strategy instead.

I am thinking about expanding the module names to their worker names:

* pr --> process
* th --> thread
* io --> task

And then have the conventions:

* from pypeln import process as pr
* from pypeln import thread as th
* from pypeln import task as io  # as ta?

This conversation is very valuable, thank you all for the feedback.

~~~
neuromantik8086
"import pypeln as pl" could cause quite a bit of confusion in Poland I would
imagine.

~~~
bb88
And maybe perl...

------
elsherbini
Snakemake [0] is a tool worth checking out. You can use it to create
declarative workflows, and similar to make, it creates a DAG of dependencies
when you give it your desired output. Each rule can specify how many threads
it needs and other arbitrary resources and the scheduler uses that to
constrain execution. Workflows are architecture independent - you should be
able to execute a snakemake workflow on a laptop, in the cloud, or on an HPC
cluster.
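
(For concreteness, a minimal hypothetical rule showing the per-rule
threads/resources declarations; the file names and shell command are made
up:)

        rule sort_reads:
            input: 'reads.{sample}.fastq'
            output: 'sorted.{sample}.fastq'
            threads: 4
            resources: mem_mb=2000
            shell: 'sort --parallel={threads} {input} > {output}'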

It also allows you to use UNIX pipes with your dependent jobs when that is
appropriate [1].

[0]
[https://snakemake.readthedocs.io/en/stable/index.html](https://snakemake.readthedocs.io/en/stable/index.html)

[1]
[https://snakemake.readthedocs.io/en/stable/snakefiles/rules....](https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#piped-
output)

------
somewhatoff
I wonder if you might compare this to Bonobo [[https://www.bonobo-
project.org/](https://www.bonobo-project.org/)] which I think has similar
design goals?

~~~
nestorD
Pypeline is a library you use in your code, while Bonobo seems to be a
framework that uses your code. I tend to think that you lose flexibility with
the latter.

------
adamcharnock

        Pypeline was designed to solve simple medium 
        data tasks that require concurrency 
        and parallelism but where using frameworks 
        like Spark or Dask feel exaggerated or unnatural.
    

This is exactly what I was looking for very recently. Thank you for writing
this, I'll certainly look into it.

~~~
marcyb5st
What about Apache Beam? Getting started with the Python SDK has been very easy
IMHO. Also, you are future proof as you can easily switch runner from Local to
Dataflow/Flink/...
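
(A minimal sketch of that runner switch with the Beam Python SDK; the data
and transforms here are just placeholders:)

      import apache_beam as beam
      from apache_beam.options.pipeline_options import PipelineOptions

      # Swap 'DirectRunner' for 'DataflowRunner' (plus project/region options)
      # to move the same pipeline off your laptop.
      options = PipelineOptions(runner='DirectRunner')

      with beam.Pipeline(options=options) as p:
          (p
           | beam.Create(range(10))
           | beam.Map(lambda x: x * x)
           | beam.Map(print))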

~~~
cgarciae
I use BEAM for my Dataflow jobs, but its local "DirectRunner" is just for
testing purposes. As with Spark, BEAM is a huge beast; Pypeline was created
with simplicity in mind: it's a pure Python library with no dependencies.

------
chrisjc
Seems like a good time to link to this curated list of pipeline toolkits (not
all python).

[https://github.com/pditommaso/awesome-
pipeline/blob/master/R...](https://github.com/pditommaso/awesome-
pipeline/blob/master/README.md)

~~~
neuromantik8086
This too:

[https://github.com/common-workflow-language/common-
workflow-...](https://github.com/common-workflow-language/common-workflow-
language/wiki/Existing-Workflow-systems)

Also, whenever these conversations about flow-based / pipelining tools come
up, I always like to point people to the Common Workflow Language, to remind
them that there is an attempt at standardizing workflow descriptions so that
they can be used with different packages:

[https://www.commonwl.org/](https://www.commonwl.org/)

------
snidane
From my experience building similar pipelining and reverse-polish
function-application tooling in Python:

Piping with the | operator can make tracebacks pretty ugly with some
operators.

If you want to keep the code somewhat 'pythonic' without the | syntax magic,
then rather than writing:

    
    
      range(10)
      | pp.flatmap(lambda x: [x + 1, x + 2])
      | pp.map(lambda x: x * x)
      ...
    

You can do this instead:

    
    
      xs = range(10)
      xs = pp.flatmap(xs, lambda x: [x + 1, x + 2])
      xs = pp.map(xs, lambda x: x * x)
      ...
    

It helps to keep the operand as first argument, instead of last, because those
lambdas are best kept at the end.

So instead of

    
    
      map(fn, xs)
    

do

    
    
      map(xs, fn)
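
(To make the syntax magic above concrete: a toy pipe stage can be built by
giving the right-hand operand an __ror__ method. This is just an
illustration, not necessarily how pypeln implements it; note how an error
raised inside fn would surface from __iter__, far from the line with the |
in it, which is where the ugly tracebacks come from.)

      class PipeStage:
          # Toy '|' piping: `iterable | stage` calls stage.__ror__(iterable),
          # which attaches the upstream source and returns the stage lazily.
          def __init__(self, fn):
              self.fn = fn
              self.source = None

          def __ror__(self, source):
              self.source = source
              return self

          def __iter__(self):
              # Exceptions from fn surface here, not at the '|' expression.
              return (self.fn(x) for x in self.source)

      def pipe_map(fn):
          return PipeStage(fn)

      xs = range(10) | pipe_map(lambda x: x * x)
      print(list(xs))  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]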

------
roel_v
None of these frameworks (there are many) seem to have support for repeating a
certain target multiple times, with different arguments. For example, say you
have a data set with per-country data; how do you repeat the same analysis on
each country? This simple example is easy with a loop, but when you have
multiple dimensions like this, you want to call each target with all possible
permutations, depending on which type of dimension is actually relevant for
that target. Does any ETL framework support that?

(I was actually just writing a spec for a new tool that does just this this
afternoon because I can't find anything suitable)

~~~
memoir2comment
snakemake does this trivially:

    
    
        rule analyze_country:
            input: 'whatever.{country}.txt'
            output: 'analysis.{country}.txt'
            shell:
                'run-analysis-on-country {input} {output} --country=country'
    
        rule analyze_target_countries:
            input: ['analysis.usa.txt', 'analysis.canada.txt', 'analysis.mexico.txt']

~~~
elsherbini
Small change: you have to use wildcards.country inside the shell call:

    
    
        rule analyze_country:
            input: 'whatever.{country}.txt'
            output: 'analysis.{country}.txt'
            shell:
                'run-analysis-on-country {input} {output} --country={wildcards.country}'
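
(On the multi-dimension point raised upthread: snakemake's expand() helper
takes the product of all wildcard values, so one line enumerates every
combination, e.g. with a hypothetical second "year" dimension:)

        COUNTRIES = ['usa', 'canada', 'mexico']
        YEARS = [2016, 2017]

        rule all:
            # expand() yields every combination: analysis.usa.2016.txt,
            # analysis.usa.2017.txt, analysis.canada.2016.txt, ...
            input: expand('analysis.{country}.{year}.txt', country=COUNTRIES, year=YEARS)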

------
bayesian_horse
Dask is relatively lightweight actually, because it is pure Python.

Also, there is "Streamz" which solves a similar problem, seems more mature and
can work with or without Dask or Dask-Distributed.

~~~
cgarciae
Dask might be lightweight internally but resorting to it just to solve a
simple task that requires concurrency is not "simple".

Streamz looks nice! However:

"Streamz relies on the Tornado framework for concurrency. This allows us to
handle many concurrent operations cheaply and consistently within a SINGLE
THREAD."

Apparently you can set it up to use Dask to escape the single thread, but
that is kind of a global config. With Pypeline you can mix and match
Processes, Threads, and asyncio.Tasks where each makes sense; resource
management per stage is simple and explicit. If you have some understanding
of the multiprocessing, threading, and asyncio modules, Pypeline will save
you tons of time.
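
(A rough sketch of that mix-and-match idea using the current short module
names; the exact signatures and the `workers` keyword below are assumptions
based on the readme examples, so check the docs for the real API:)

      from pypeln import pr, th

      def cpu_bound(x):
          return x ** 2              # heavy computation -> process stage

      def io_bound(x):
          return "result-%s" % x     # e.g. a network call -> thread stage

      data = range(10)
      stage = pr.map(cpu_bound, data, workers=4)    # runs in separate processes
      stage = th.map(io_bound, stage, workers=10)   # threads consume the process stage
      print(list(stage))                            # every stage is itself an iterable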

Still, I will keep an eye on Streamz; it's very nice work with lots of
features, and it should get more visibility.

~~~
bayesian_horse
Pypeline doesn't seem that simple itself.

~~~
unethical_ban
Why not? It seems to be a boilerplate remover for simple parallel processing
tasks.

~~~
bayesian_horse
Maybe I'm just not quite able to get why "lightweight" is a thing. I also
prefer Django over Flask for even the simplest of server software...

------
timkpaine
Similar to a library I've been working on as well:
[https://github.com/timkpaine/tributary](https://github.com/timkpaine/tributary)

------
TBastiani
mpipe might also be of interest.

[http://vmlaker.github.io/mpipe/](http://vmlaker.github.io/mpipe/)

~~~
cgarciae
Thanks! I did take a look at mpipe (it's actually referenced in the readme),
but mpipe has its flaws:

1. It uses None as the stage terminator; this is VERY error prone: what if
you actually want to send None? Pypeline uses a special private terminator.

2. You have to first manually put all the data into the pipe in a for-loop
and then manually get it out. In Pypeline all this is simplified: it consumes
iterables and all stages are iterables, so it's 100% compatible with any
function/framework that accepts iterables.
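
(A minimal thread-based sketch of the private-terminator idea, not pypeln's
actual internals: because the sentinel is a dedicated object and the check is
by identity, even a literal None flows through the pipe as ordinary data.)

      import queue
      import threading

      _DONE = object()   # private sentinel: user data can never be mistaken for it

      def stage(in_q, out_q, fn):
          while True:
              item = in_q.get()
              if item is _DONE:      # identity check, so None is just data
                  out_q.put(_DONE)   # propagate shutdown downstream
                  return
              out_q.put(fn(item))

      in_q, out_q = queue.Queue(), queue.Queue()
      threading.Thread(target=stage, args=(in_q, out_q, str), daemon=True).start()

      for item in [1, None, 3]:      # None passes through unharmed
          in_q.put(item)
      in_q.put(_DONE)

      while True:
          result = out_q.get()
          if result is _DONE:
              break
          print(result)              # prints '1', 'None', '3'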

------
davidnet
Wow. It seems to save a lot of boilerplate code for ETL.

------
make3
Looks similar to what tf.data does for Tensorflow

~~~
cgarciae
Nice to hear that. When I wrote this:

"Pypeline was designed to solve simple medium data tasks that require
concurrency and parallelism but where using frameworks like Spark or Dask feel
exaggerated or unnatural."

it was actually because I've resorted to / hacked into tf.data and Dask in
the past just to get concurrency and parallelism. Pypeline is way more
natural for pure Python stuff.

