
Databot: High-performance Python data-driven programming framework - garygoog
https://github.com/kkyon/databot
======
shoo
I was staring at the example in the readme, and considering some of the
features ("replayable"), and it sounds a bit like what a Makefile does. Well,
if you had a single event to process, and decided to use the filesystem to
store input, intermediate results, and output.

So here's an implementation of the example from the readme using make, and
bash, and jq, and a silly python script to implement a timer by modifying a
file every X seconds:

    
    
      ~/projects/makething$ cat Makefile 
    
      default: d
    
      b:  a
          curl -o $@ "http://api.coindesk.com/v1/bpi/currentprice.json"
    
      c:  b
          jq '.bpi.USD.rate_float' $< > $@
    
      d:  c
          cat $<
          cp $< $@
    
    
      ~/projects/makething$ cat timer.py 
    
      import sys
      import time
    
      def main():
          delay = float(sys.argv[1])
          fn = sys.argv[2]
          while True:
              with open(fn, 'w') as f:
                  f.write(str(time.time()))
              time.sleep(delay)
    
      if __name__ == '__main__':
          main()
    
    
      ~/projects/makething$ cat go.sh 
    
      #! /usr/bin/env bash
    
      python timer.py 2.0 a &
      TIMER_PID=$!
    
      function cleanup() {
          kill $TIMER_PID
      }
      trap cleanup EXIT
      
      while true
      do
          while make -j -q
          do
              sleep 0.1
          done
          make -j
      done
    
    

make decides if things are up-to-date by comparing timestamps of files in the
filesystem, so we can emulate a timer that triggers an event every 2 seconds
by having a process modify a file every 2 seconds, and rig a rule in our
makefile to use that file as an input.
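
That timestamp check is simple enough to sketch in a few lines of Python. This is a simplified model of what make does (it ignores missing prerequisites, pattern rules, phony targets, and so on), but it captures the cascade in the Makefile above: touching `a` makes `b` stale, which in turn makes `c` and `d` stale.

```python
import os

def is_up_to_date(target, prerequisites):
    """A target is up to date if it exists and is at least as new as
    every prerequisite (a simplified model of make's timestamp check)."""
    if not os.path.exists(target):
        return False
    target_mtime = os.path.getmtime(target)
    return all(os.path.getmtime(p) <= target_mtime for p in prerequisites)
```

This is also why `make -q` works in the go.sh loop: `-q` runs no recipes, it just exits 0 when everything is up to date, so the loop sleeps until the timer process bumps `a`'s mtime.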

~~~
LiveTheDream
I am a huge fan of using `make` for this sort of ad hoc data pipeline. The
workflow is very natural, as you can play around with each step on the command
line and then drop it into the makefile once you get it right. Better for
reproducibility than searching back through terminal history to replay
individual lines!

In your example, I would drop the shell and python scripts and simply run:

    
    
        watch -n 2 -d "touch a && make"

~~~
shoo
thank you for the review and the improvement! much cleaner.

------
cleansy
Recently I started using Apache NiFi[1] for everything that doesn't require
overly complex operations. It's pretty much what this framework does, just
with a UI and a lot of monitoring features.

However, one downside is the massive RAM consumption: 1GB of RAM even when
it's doing pretty much nothing is quite a bill to start off with.

1: [https://nifi.apache.org/](https://nifi.apache.org/)

~~~
ekianjo
Yes, but NiFi is pretty good at what it does. A 1GB RAM cost is nothing
compared to the time it saves you in the end building very robust data flows.

~~~
cleansy
Of course 1GB of RAM is nothing nowadays, but I wanted to point it out since
sometimes you have constrained resources, or need to work out what machine
size to use.

NiFi might not run on an AWS t2.micro instance. Whereas Apache Airflow does.

~~~
ekianjo
When you use NiFi you are usually not in a budget-constrained environment.
CPU-wise it takes its toll too, and you need more than one core to be
comfortable, so micro instances are out anyway.

------
yetkin
[https://www.enterpriseintegrationpatterns.com/patterns/messaging/PipesAndFilters.html](https://www.enterpriseintegrationpatterns.com/patterns/messaging/PipesAndFilters.html)

[https://www.coursera.org/lecture/software-architecture/3-2-7-pipes-and-filters-bYHgh](https://www.coursera.org/lecture/software-architecture/3-2-7-pipes-and-filters-bYHgh)

------
asavinov
Another project relying on lambdas for data processing,
[https://github.com/asavinov/lambdo](https://github.com/asavinov/lambdo), yet
focused more on feature engineering and ML.

------
gabcoh
This reminds me a lot of reactive programming like ReactiveX
[[http://reactivex.io](http://reactivex.io)] which has a python implementation

~~~
Rotareti
_> which has a python implementation_

I think this is the most popular one:

[https://github.com/ReactiveX/RxPY](https://github.com/ReactiveX/RxPY)

This one is a rewrite of RxPY, that makes use of async / await / asyncio:

[https://github.com/dbrattli/aioreactive](https://github.com/dbrattli/aioreactive)

Pretty interesting stuff!
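
For anyone unfamiliar with the model, the core idea can be sketched in plain
Python. This is a toy illustration of the push-based pattern, not the actual
RxPY API: values are pushed to subscribers as they arrive, and each operator
returns a new stream.

```python
class Observable:
    """Toy push-based stream: operators wrap the subscribe function
    and return new Observables, so pipelines compose."""
    def __init__(self, subscribe):
        self._subscribe = subscribe  # function taking an on_next callback

    def subscribe(self, on_next):
        self._subscribe(on_next)

    def map(self, fn):
        # Push fn(v) downstream for every v pushed into this stream.
        return Observable(lambda on_next: self.subscribe(lambda v: on_next(fn(v))))

    def filter(self, pred):
        # Only push values downstream that satisfy the predicate.
        def subscribe(on_next):
            self.subscribe(lambda v: on_next(v) if pred(v) else None)
        return Observable(subscribe)

def of(*values):
    """Create a stream that pushes a fixed sequence of values."""
    def subscribe(on_next):
        for v in values:
            on_next(v)
    return Observable(subscribe)

# Usage: a small declarative pipeline over a stream of values.
results = []
of(1, 2, 3, 4).map(lambda x: x * 10).filter(lambda x: x > 15).subscribe(results.append)
# results == [20, 30, 40]
```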

------
lixtra
Somehow I would expect pipes to be connectable and nestable, but that does
not seem to be the case (judging by the source). Then you could have some
function to build and parametrise a sub-pipeline and connect it to something
bigger.

I'm still looking for the perfect natural Python ETL DSL, so I will follow
this project.

So far I'm using [https://github.com/petl-developers/petl](https://github.com/petl-developers/petl)
and am mostly happy with it.

~~~
erikb
> I would expect pipes to be connectable and nestable.

What would that look like? I mean, a pipe is something one party writes data
into and another loads data from, in the same order it was written. I don't
see how that could be nested, or why two pipes would need to be connected.
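
One common reading of "nestable": a sub-pipeline is itself just a stage, so
it can be dropped into a bigger pipeline. With Python generators that
composes naturally. This is a sketch of the idea, not databot's (or petl's)
actual API:

```python
def source(items):
    # Entry point of a pipeline: any iterable of values.
    yield from items

def double(stream):
    # A stage: consumes an upstream iterator, yields transformed values.
    for x in stream:
        yield x * 2

def positive(stream):
    # A filtering stage.
    for x in stream:
        if x > 0:
            yield x

def sub_pipeline(stream):
    """A sub-pipeline built from stages. It has the same shape as a
    single stage (iterator in, iterator out), so it nests anywhere."""
    return double(positive(stream))

# Connect the sub-pipeline into a bigger pipeline.
result = list(double(sub_pipeline(source([-1, 1, 2]))))
# [-1, 1, 2] -> positive -> [1, 2] -> double -> [2, 4] -> double -> [4, 8]
```

"Connectable" then just means every stage speaks the same protocol (here, the
iterator protocol), so any output can feed any input.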

------
ekianjo
So it's like Nifi, but not as good? What would be the benefit to use that?

------
edem
Isn't using "performance" and "python" in the same sentence an oxymoron?

