
Bonobo – A data processing toolkit for Python 3.5+ - rdorgueil
https://www.bonobo-project.org/
======
goodside
I'm annoyed that I bothered to read the tutorial to this. The TLDR: "Write
some generators or functions, put them in a list, and Bonobo will call them
all for you in order. Look at the example files for more." The example files
are all basic string transformations. The docs are mostly blank pages and
missing sections. What little is written has more jokes and conversational
tics than information.

What does this even do? There's mention of DAGs and different execution
strategies if you really dig through the docs, but is that it? If so, why
would you use this instead of joblib or some other established parallelism
lib?

~~~
rdorgueil
Bonobo runs each function in the pipeline in parallel and makes the FIFO-queue
plumbing and thread-pool management completely transparent.

The TLDR would then be: "Write some generators or functions, link them in a
graph, and Bonobo calls them in order on each row of data as soon as the
previous transformation node's output is ready." For example, if you have a
database cursor that yields one row of a query at a time, the next step(s) in
the graph start running as soon as the first row is ready, while the cursor
keeps yielding rows from the database in parallel. I did not find that easy to
do with the libraries I tried.
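A minimal plain-Python sketch of that behavior (hypothetical names, not the
actual Bonobo internals): each stage runs in its own thread, and FIFO queues
connect the stages, so a stage can work on row N while upstream is still
producing row N+1.

```python
import queue
import threading

SENTINEL = object()  # end-of-stream marker

def extract():
    # stand-in for a database cursor yielding one row at a time
    yield from ["alpha", "beta", "gamma"]

def transform(row):
    return row.upper()

def load(row, sink):
    sink.append(row)

def run_pipeline(source, stages):
    """Run each stage in its own thread, connected by FIFO queues."""
    queues = [queue.Queue() for _ in stages]

    def feed():
        for item in source():
            queues[0].put(item)
        queues[0].put(SENTINEL)

    def worker(i, stage):
        while True:
            item = queues[i].get()
            if item is SENTINEL:
                if i + 1 < len(queues):
                    queues[i + 1].put(SENTINEL)  # propagate shutdown
                break
            out = stage(item)
            if i + 1 < len(queues):
                queues[i + 1].put(out)

    threads = [threading.Thread(target=feed)]
    threads += [threading.Thread(target=worker, args=(i, s))
                for i, s in enumerate(stages)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

sink = []
run_pipeline(extract, [transform, lambda row: load(row, sink)])
# sink now holds ["ALPHA", "BETA", "GAMMA"]
```

With one thread per stage and one queue between stages, row order is
preserved and the "plumbing" stays out of the transformation code.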

The docs clearly lack completeness, to say the least. They need an example
with a big dataset, one with long-running individual operations, and one with
a non-linear graph, so it's more obvious that, of course, it's not made to
uppercase strings twice in a row.

Stay tuned. I'm very happy HN brought it to the homepage; I did not really
think that could happen at this stage, and I understand your reaction. But
it's a good thing for the project moving forward.

~~~
robzyb
This is really cool!

Python is my usual language of choice, but recently I picked up Go for some
data processing because there were a lot of benefits to parallelising the
task, which Go made easy.

------
spangry
I haven't tried this yet, but am praying that it delivers even half of what it
promises. For whatever reason I _just can't get my head around pandas_,
despite multiple attempts.

If this also turns out to be inscrutable I may be forced to conclude that I'm
stupid...

~~~
closed
You're not alone! I think pandas made some design decisions around their
transformation functions that make it a lot more cumbersome to use than R's
dplyr. It's not obvious from the documentation, though.

As an example from the pandas docs [1], in dplyr you can do

> gdf <- group_by(df, col1)

> summarise(gdf, avg=mean(col1))

In pandas this is similar to

> df.groupby('col1').agg({'col1': 'mean'})

But dplyr's summarise is much more flexible than agg, since you can do all
kinds of things to any number of columns. E.g.

> summarise(gdf, some_name = f1(col1) + f2(col2))

But in pandas, agg can only apply one function to one column at a time.

[1] [http://pandas.pydata.org/pandas-docs/stable/comparison_with_...](http://pandas.pydata.org/pandas-docs/stable/comparison_with_r.html#grouping-and-summarizing)
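For what it's worth, a multi-column summary like the dplyr example is still
expressible in pandas via groupby().apply() instead of agg(). A sketch with
made-up column names and functions (mean and std standing in for f1 and f2):

```python
import pandas as pd

df = pd.DataFrame({
    "col1": ["a", "a", "b", "b"],
    "col2": [1.0, 3.0, 5.0, 7.0],
    "col3": [10.0, 20.0, 30.0, 40.0],
})

# agg() maps one function to one column; for a summary built from
# several columns at once, apply() over each group does the job:
result = df.groupby("col1").apply(
    lambda g: g["col2"].mean() + g["col3"].std()
)
```

It works, but it is noticeably more ceremony than dplyr's
`summarise(gdf, some_name = f1(col2) + f2(col3))`.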

~~~
madenine
you can supply a map, i.e.:

gdf = df.groupby('col1').agg({'col2': np.mean, 'col3': np.std, 'col4': lambda x: np.mean(x) / np.std(x)})

once you've got your grouped dataframe, go nuts:

gdf['some_name'] = gdf['col1'].apply(f1) + gdf['col2'].apply(f2)

~~~
closed
The summarise example I gave creates a single new column (some_name) that is
a function of two columns from the grouped data frame. Passing a map to agg
just creates multiple columns, each a function of at most one column in the
dataframe.

(I should have used a better example, like summarise(gdf, some_col = f(cola /
colb)).)
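One way to express that better example in pandas (hypothetical column names,
mean standing in for f) is to compute the derived column first, then
aggregate it:

```python
import pandas as pd

df = pd.DataFrame({
    "col1": ["a", "a", "b"],
    "cola": [2.0, 4.0, 9.0],
    "colb": [1.0, 2.0, 3.0],
})

# dplyr's summarise(gdf, some_col = mean(cola / colb)) becomes:
out = (
    df.assign(ratio=df["cola"] / df["colb"])  # derived column first
      .groupby("col1")["ratio"]
      .mean()
      .rename("some_col")
)
```

The assign-then-groupby pattern sidesteps agg's one-function-per-column
limitation, at the cost of materialising the intermediate column.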

------
payne92
I'm trying to figure out if this is "all hat, no cattle". There seems to be a
lot of "framework" here, without much core functionality.

Stated more precisely: if I'm stitching together things that process data and
'yield' results, why can't I just do that in pure Python? What does this
framework add?

~~~
rdorgueil
Short answer: parallel execution.

~~~
rs86
Multiprocessing or multithreading? Why don't you market it as parallel
coroutine processing? That would get me interested, because there are dozens
of frameworks with overly general descriptions.

~~~
rdorgueil
Today, the default is multithreading, but that's an implementation detail.
Bonobo does not actually support coroutines (as in asyncio coroutines), so it
would be a lie to market it that way. The plan, though, is to allow
coroutines/futures in the future, for specific cases (like long-
running/blocking operations where tying output order to input order does not
matter). Still, there is a lot on the roadmap before this becomes a priority.

I note that I still have a lot of work to do explaining in simple terms what
Bonobo actually is, without falling into the trap of an "overgeneral
description".

------
IanCal
It'd be good to see some comparisons, why this and not one of the other
currently available systems? Why should I use this over, for example, Luigi?

What scale is this intended for?

Is it intended to nearly solve a simple problem over my 20TB of data on S3?
Big complex graphs? Or more for transitioning a small local report system
that's currently in three excel files into a tested python script?

~~~
rdorgueil
It's indeed intended for «small data», as opposed to «big data». I know that
does not say much, but I basically wanted to handle small flows of data
without having to install the big guns.

I'm preparing explanation pages for a lot of the questions I got, including
comparisons, data volumes, and where it is a good fit and where it is not.

All that will be ready well before 1.0, but for now, we're at 0.2 ...

Thanks for all the hackerlove, though!

------
rkda
All those references to monkeys hurt my head. Bonobos are not monkeys. If they
wanted to name it after monkeys, they should've called it Capuchin or
something.

~~~
rdorgueil
Noted, sorry about that. I'll read up on bonobos.

~~~
nn3
The picture looks more like a gorilla than a bonobo, too.

~~~
e5an
Came here to post just that. It's called 'Bonobo', there's a picture of a
gorilla, and the page keeps saying 'monkey'. As petty as it sounds, you're
probably losing potential users to zoological nerdrage.

~~~
rdorgueil
Yes, Hacker News and Twitter brutally told me I should take animal kingdom
classes ASAP ...

This being said, if any of you have a good picture of bonobos that I can use
instead of the current one, I'd be really glad to replace it! It needs to be
released under a free license, though.

Thanks HN

------
vittore
Interesting. Right now we are using PETL[1] for what we used to do with SSIS;
for some reason bonobo reminds me of the Bubbles library.

[1]:
[https://petl.readthedocs.io/en/latest/](https://petl.readthedocs.io/en/latest/)

~~~
ctippett
+1 for petl, I'm using it right now on a project that deals with a lot of
tabular data and it's been a huge time saver.

------
ziikutv
Ah, interesting. The example on the 'on-boarding' page reminds me of what we
used to do at work.

We used itertools chains to write producers and consumers, creating 'Chain'
objects that process data exactly like bonobo.Graph.

Can't wait to try this.
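A minimal version of that producer/consumer chaining (hypothetical stage
names, not the actual 'Chain' class) can be written as plain generator
composition:

```python
def producer():
    # source node: emits raw values
    yield from range(5)

def keep_even(items):
    # filter node: passes even values through
    for x in items:
        if x % 2 == 0:
            yield x

def double(items):
    # transform node: doubles each value
    for x in items:
        yield x * 2

def chain(source, *stages):
    """Compose generator stages left to right, like nodes in a graph."""
    stream = source()
    for stage in stages:
        stream = stage(stream)
    return stream

result = list(chain(producer, keep_even, double))
# producer yields 0..4; keep_even keeps 0, 2, 4; double gives [0, 4, 8]
```

Nothing runs until the final list() pulls values through, so each row flows
through the whole chain lazily, one at a time.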

------
srean
Sweet! Generator-based utilities for ETL. I think this is really a good use
of generators and coroutines. Reminds me of the Stackless-based
[https://bitbucket.org/diji/pypes/src](https://bitbucket.org/diji/pypes/src)
(backing video: [http://pyvideo.org/pycon-us-2011/pycon-2011--large-scale-dat...](http://pyvideo.org/pycon-us-2011/pycon-2011--large-scale-data-conditioning--amp--p.html))

------
teilo
Before there was pandas, I wrote a website (using Django) to transform grid
data into a denormalized CSV file. In other words, a reverse pivot: it
converts multiple header rows and header columns into separate fields for
each intersecting value.

I've written this basic routine several times over in my career (once in
Access VBA!) for different reasons. The current version of it is used to
convert a store/item/quantity grid into per-store pick/pack slips.

Pandas has a built-in function that can de-pivot a table. I'm not sure it can
handle my use case, however, with multiple header rows. Mine also has extra
goodies like populating blank row or column values with the previous value in
the row-column, among other bizarre features written to grapple with the
inconsistent ways our clients make their distro spreadsheets. Trying to break
them of their reliance on Excel for this type of planning has proven futile.
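For the simple case, the built-in de-pivot mentioned above is pd.melt; a
sketch with made-up store/item columns, including the "fill blanks with the
previous value" step (ffill), though this does not cover multiple header
rows:

```python
import pandas as pd

# a small store/item/quantity grid where blank store names mean
# "same as the row above", as in the client spreadsheets described
grid = pd.DataFrame({
    "store": ["North", None, "South"],
    "widget": [3, 5, 2],
    "gadget": [1, 0, 4],
})

# fill blank store names forward, then de-pivot the item columns
flat = (
    grid.assign(store=grid["store"].ffill())
        .melt(id_vars="store", var_name="item", value_name="quantity")
)
```

Multiple header rows would need a MultiIndex on the columns before melting,
which is where the gnarly custom logic tends to creep back in.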

I'll have to spend some time with Bonobo and Pandas before I take on
refactoring our grid tool. It needs a refactor mostly because I'm the only one
who understands it. The new data munging libraries would surely simplify some
very gnarly logic, and make it accessible to other developers should I get hit
by a bus or leave the company.

------
rcarmo
Hmm. So what would be the advantage over Dask, which lets me scale out over a
cluster?

------
jwilk
This name is already taken.

[https://wiki.gnome.org/Attic/Bonobo](https://wiki.gnome.org/Attic/Bonobo)

------
maxerickson
There's a syntax error in mutate_my_dict_like_crazy at
[http://docs.bonobo-project.org/en/0.2/guide/purity.html](http://docs.bonobo-project.org/en/0.2/guide/purity.html).

Seems the documentation is still quite WIP.

------
throwaway_374
So how is this different from Airflow, other than Windows compatibility and a
lack of dashboard?

~~~
glial
That's my question too. I've come to heavily rely on Airflow. As an Apache
project now it's becoming mature.

From what I can tell browsing the site, Bonobo looks like it's designed to do
data processing within the framework. Airflow insists that it's really a task
coordinator/scheduler...however, tasks can be Python function calls. So it
seems like Bonobo is a specific use case, where Airflow is the more general
case (tasks can be SQL queries, bash commands, etc).

------
erwinvaneyk
So, what is the advantage of using this over existing workflow management
systems, such as Airflow, Azkaban and Luigi?

------
madenine
Seems like sklearn pipelines with a more generalized use case + additional
helpful features for ETL. Very interested

------
zfrenchee
Who is behind this?

~~~
rdorgueil
Me (as an individual), and a few great people that helped me along the way.
Not commercially endorsed, or supported.

~~~
lookACamel
How does this compare to Dask, Luigi or Airflow?

~~~
rdorgueil
As soon as I can, I'll add comparison pages to the documentation, keeping
them as objective as possible. I can't seriously answer this question in
depth here, but it is planned, so experts on the other systems can jump in
and complement/correct my understanding of each one. I've used a bunch of
them, but I'm by no means an expert user of each, so making it collaborative
sounds like a better idea than just giving my point of view.

