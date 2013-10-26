What does this even do? There's mention of DAGs and different execution strategies if you really dig through the docs, but is that it? If so, why would you use this instead of joblib or some other established parallelism lib?
The TLDR would then be "Write some generators or functions, link them in a graph, and call them in order on each line of data as soon as the previous transformation node's output is ready." For example, if you have a database cursor that yields each row of a query as its output, it starts to run the next step(s) in the graph as soon as the first result is ready (and it keeps yielding from the database without waiting for the graph to finish the current row). I did not find that easy to do with the libraries I tried.
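To make the "each row flows on as soon as it is ready" idea concrete, here is a minimal plain-Python sketch. It does not use bonobo's API; `fake_cursor`, `normalize`, `load` and the sample rows are all made-up names for illustration. It only shows the row-by-row streaming part and runs single-threaded, so it says nothing about the execution strategies mentioned above.

```python
# Plain-Python sketch (not bonobo's API): chained generators process each
# row as soon as the previous step yields it, instead of step by step
# over the whole dataset.

events = []  # records the order in which the steps actually run


def fake_cursor(rows):
    """Hypothetical stand-in for a database cursor yielding one row at a time."""
    for row in rows:
        events.append(f"fetch {row['id']}")
        yield row


def normalize(rows):
    """Transformation step: clean up the name of each row as it arrives."""
    for row in rows:
        events.append(f"normalize {row['id']}")
        yield {**row, "name": row["name"].strip().title()}


def load(rows):
    """Final step: collect the rows (a real pipeline would write them somewhere)."""
    out = []
    for row in rows:
        events.append(f"load {row['id']}")
        out.append(row)
    return out


data = [{"id": 1, "name": " ada "}, {"id": 2, "name": "grace HOPPER"}]
result = load(normalize(fake_cursor(data)))

# Row 1 goes through every step before row 2 is even fetched.
print(events)
print(result)
```

The recorded events show row 1 being fetched, normalized and loaded before row 2 leaves the "cursor", which is the behaviour described above for a linear graph.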
The docs are clearly incomplete, to say the least, and need an example with a big dataset, one with long individual operations, and one with a non-linear graph, so it's more obvious that, of course, it's not made to uppercase strings twice in a row.
Stay tuned. I'm very happy HN brought it to the homepage; I did not really think that could happen at this stage, though, and I understand your reaction. But that's a good thing for the project to move forward.
Python is my usual language of choice, but recently I picked up Go for some data processing because there were a lot of benefits to parallelising the task, which Go made easy.
If this also turns out to be inscrutable I may be forced to conclude that I'm stupid...
You need to work with pandas consistently for a month or two, and then it'll all click.
pandas is not complex, nor deep. It is, however, very broad. Most of the time it is "Here's what I need to do. I'm sure there's an API or two in pandas that will let me do this," and then you spend an hour or so looking at the documentation to find those APIs.
My first month or two was: "I need to do this. Let me Google". Pretty much every time someone had asked that same question on SO.
If you stick to it for 2 months, you'll eventually "learn" all the routine tasks and Googling stuff becomes only occasional.
And it does help if you're familiar with NumPy.
https://pythonprogramming.net/search/?q=pandas
Modern pandas is a bit more idiomatic now though: https://tomaugspurger.github.io/modern-1.html
Pandas is basically an R data frame for Python. A sloppy description of that is a text mode spreadsheet.
The description of Bonobo doesn't immediately invite the comparison to Pandas, to me anyway.
Mostly, when I want a quasi-mathematical look over a dataset, pandas is my tool of choice. For all those data pipeline things that reasonably fit on one computer, I do use bonobo.
Can anybody comment how Bonobo compares to Luigi?
Is pandas the wrong kind of tool for this type of thing? Going off what rdorgueil has said, I'm beginning to suspect so. Is there a data-wrangling 'gold standard' library for python?
Create an object/class called
AuctionResult
- some datetime
- value
then you load it into a pandas dataframe:
df = read_frame(qs)
After that you can do all sorts of the fun stuff I imagine.
As an example from the pandas docs [1], in dplyr you can do
> gdf <- group_by(df, col1)
> summarise(gdf, avg=mean(col1))
In pandas this is similar to
> df.groupby('col1').agg({'col1': 'mean'})
But dplyr's summarise is much more flexible than agg, as you can do all kinds of things to any number of columns. E.g.
> summarise(gdf, some_name = f1(col1) + f2(col2))
But in pandas, agg only lets you apply one function to one column at a time.
[1] http://pandas.pydata.org/pandas-docs/stable/comparison_with_...
gdf = df.groupby('col1').agg({'col2': np.mean, 'col3': np.std, 'col4': lambda x: np.mean(x / np.std(x))})
once you've got your grouped dataframe, go nuts
gdf['some_name'] = gdf['col1'].apply(f1) + gdf['col2'].apply(f2)
(I should have used a better example, like summarise(gdf, some_col = f(cola / colb)))
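For what it's worth, one way to get close to that last summarise call in pandas is groupby followed by apply, since apply hands each group to your function as a whole DataFrame. This is only a sketch: the data is made up, and `f` is a hypothetical summary function (here just the mean of the ratio).

```python
import pandas as pd

# Made-up data; col1 is the grouping key, cola/colb are the columns to combine.
df = pd.DataFrame({
    "col1": ["a", "a", "b", "b"],
    "cola": [1.0, 4.0, 9.0, 6.0],
    "colb": [2.0, 2.0, 3.0, 3.0],
})


def f(ratio):
    """Hypothetical summary function: the mean of the per-row ratio."""
    return ratio.mean()


# Rough equivalent of: summarise(gdf, some_col = f(cola / colb))
summary = (
    df.groupby("col1")[["cola", "colb"]]
      .apply(lambda g: f(g["cola"] / g["colb"]))  # g is one group's DataFrame
      .rename("some_col")
      .reset_index()
)

print(summary)
```

The trade-off is that apply with a Python lambda is usually slower than the built-in aggregations, so for big frames it can be worth computing the ratio column first and then using agg on it.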
https://courses.edx.org/courses/course-v1:Microsoft+DAT208x+...
There are some practical exercises that you do in your browser that really help to get a grasp of it.
Don't miss Pandas, they are really cool!
http://nikgrozev.com/2015/07/01/reshaping-in-pandas-pivot-pi...
(I'm not affiliated with the site)
Stated more precisely: if I'm stitching together things that process data and 'yield' results, why can't I just do that in pure Python? What does this framework add?
I note that I still have a lot of work to do explaining in simple terms what bonobo actually is, without falling into the trap of an "overgeneral description".
What scale is this intended for?
Is it intended to neatly solve a simple problem over my 20TB of data on S3? Big complex graphs? Or more for transitioning a small local report system that's currently in three Excel files into a tested Python script?
I'm preparing explanation pages for a lot of the questions I got, including comparisons, volumes of data, where it is good and where it is not ...
All that will be well ready before 1.0, but for now, we're at 0.2 ...
Thanks for all the hackerlove, though!
This being said, if any of you have a good picture of bonobos that I can use instead of the current one, I'd be really glad to replace it! It needs to be released under a free license, though.
Thanks HN
Gorillas are a whole different species and you have at least 4 subspecies of gorillas, none of which look like chimps or bonobos.
But yeah, a CGI gorilla for a site called bonobo. Le sigh.
We used itertools chains to write producers and consumers and create 'Chain' objects that process data exactly as bonobo.Graph does.
Can't wait to try this.
I've written this basic routine several times over in my career (once in Access VBA!) for different reasons. The current version of it is used to convert a store/item/quantity grid into per-store pick/pack slips.
Pandas has a built-in function that can de-pivot a table. I'm not sure it can handle my use case, however, with multiple header rows. Mine also has extra goodies like populating blank row or column values with the previous value in the row-column, among other bizarre features written to grapple with the inconsistent ways our clients make their distro spreadsheets. Trying to break them of their reliance on Excel for this type of planning has proven futile.
I'll have to spend some time with Bonobo and Pandas before I take on refactoring our grid tool. It needs a refactor mostly because I'm the only one who understands it. The new data munging libraries would surely simplify some very gnarly logic, and make it accessible to other developers should I get hit by a bus or leave the company.
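As a rough idea of what the pandas side of that refactor could look like, here is a small sketch of the grid-to-slips conversion using melt for the de-pivot and ffill for the "populate blanks with the previous value" step. The column names and data are invented, and it assumes a single header row, so it does not cover the multiple-header-row case mentioned above.

```python
import pandas as pd

# Made-up store/item/quantity grid; the blank category is meant to inherit
# the value from the row above, as in a hand-made spreadsheet.
grid = pd.DataFrame({
    "category": ["shirts", None, "hats"],
    "item": ["tee-S", "tee-M", "cap"],
    "store_1": [2, 0, 5],
    "store_2": [1, 3, 0],
})

# Fill blank cells with the previous value in the column.
grid["category"] = grid["category"].ffill()

# De-pivot: one row per (item, store) pair instead of one column per store.
slips = grid.melt(
    id_vars=["category", "item"],
    var_name="store",
    value_name="quantity",
)

# Drop empty lines and order the result so each store's slip is contiguous.
slips = (
    slips[slips["quantity"] > 0]
    .sort_values(["store", "item"])
    .reset_index(drop=True)
)

# One pick/pack slip per store.
for store, lines in slips.groupby("store"):
    print(store, lines[["item", "quantity"]].to_dict("records"))
```

The client-specific quirks would still need their own cleanup steps before the melt, but the core reshaping stays a few lines that another developer can read.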
[1]: https://petl.readthedocs.io/en/latest/
https://wiki.gnome.org/Attic/Bonobo
Seems the documentation is still quite WIP.
From what I can tell browsing the site, Bonobo looks like it's designed to do data processing within the framework. Airflow insists that it's really a task coordinator/scheduler...however, tasks can be Python function calls. So it seems like Bonobo is a specific use case, where Airflow is the more general case (tasks can be SQL queries, bash commands, etc).