Plotnine: A grammar of graphics for Python (github.com)
170 points by carlosgg on May 27, 2017 | 38 comments

I feel like a lot of attempts to recreate ggplot2 end up being superficial because they don't recognize / duplicate the power of the underlying Grid graphics that ggplot2 uses.

I know that web technologies are all the rage these days, but at least for static, publication-ready graphics, Grid is a really nice substrate, with well-thought-out lower-level abstractions.

EDIT: I should also add that it's documented within an inch of its life should anyone feel that it's worth recreating: https://stat.ethz.ch/R-manual/R-devel/library/grid/html/grid...

In what way does eschewing the underlying Grid graphics make a less desirable experience for a ggplot2 port?

The ports I see feel like a < 100% enumeration of the plots in the mainline ggplot2 package. However, there are heaps of great extensions to ggplot2 that I suspect are in part due to there being a carefully thought out set of abstractions at the low level of Grid that mesh nicely with the high level abstractions of ggplot.

ggplot2 being built on top of Grid means that modestly complex stuff is easy (in ggplot2 by itself), but that it's relatively easy to drop down into the lower layer (grid) to do more.

Surprised to see this at the top. I am the creator* of plotnine. The most common question seems to be: what to expect of plotnine? The answer: a high-quality implementation of a grammar of graphics with an API that closely matches ggplot2, and more.

I also want other packages to be able to build off of plotnine, e.g. a package with the functionality of Seaborn could be built off of plotnine. The only constraint should be whether the backend -- in this case Matplotlib -- stands in the way. Matplotlib is evolving (though slowly) and has a very receptive community, so there is lots of hope.

* - Many people contributed to its history.

I watched your refactor of yhat's py ggplot branch, and was disappointed when glamp dropped in a totally new implementation out of the blue. Thanks for all your hard work--glad it is its own package now :).

Well, I think it was different priorities. My main objective for contributing was to have a full on grammar of graphics package in python. I appreciate those warm feelings from afar.

Looks sweet. May I ask: what did you find wanting in the other attempts (like Seaborn)?

Seaborn is not lacking in any way; it has a goal and it accomplishes it. However, I think Seaborn would have been easier to create if it had been based on a grammar-based package, a few caveats notwithstanding.

Recreating and keeping up with Hadley's hard work is challenging, particularly because ggplot2's layout and extensions are really nice and continue to evolve.

As an alternative that preserves the full power of Wickham's implementation, pygg[1] is a Python wrapper that provides R's ggplot2 syntax in Python and runs everything in R.

[1] https://github.com/sirrice/pygg

Another grammar of graphics: altair[1]. Its plot specifications are simpler and easier to read, e.g.

    x='age', y='height', color='sex')
Also, see Jake VanderPlas's talk giving an overview of Python visualization tools at https://youtube.com/watch?v=FytuB8nFHPQ


If this is an implementation of ggplot2, what does it offer over http://ggplot.yhathq.com/?

I don't mean to undermine your project, just wanted to know about significant differences.

I too am interested in what the differences are. I have used the yhathq ggplot library for a while and it is quite useful but I sometimes find it lacking certain types of plots and documentation.

Also the last commit to the yhathq ggplot library was on Nov 20, 2016, so this library looks like it is currently more active in development.

Edit: Someone made an article comparing the two here: http://pltn.ca/plotnine-superior-python-ggplot/

This library started as a refactor of that project, after it lay broken and unmaintained for a long time. Having followed the yhat ggplot for a long time, I've lost faith that it will be actively maintained.

- yhat's ggplot has 256 commits, 13 contributors, last commit Nov 2016.

- plotnine has 1,283 commits, 42 contributors, most recent commit is 3 days ago.

These comparisons are pointless. Would you like to compare loc as well? How about man hours spent?

The only comparison that is important is how well the two projects work. I have no idea how well plotnine works yet (but I intend to find out). I do know that ggplot works OK, and since it leverages matplotlib, I can finish a plot off manually if anything isn't implemented.

EDIT: it seems that plotnine also leverages matplotlib and produces nicer plots for some common cases :).

It's not pointless at all. For long-term maintenance, the community strength and level of active development are just as important as (and sometimes more important than) minor feature differences.

I would often rather use a decade-old project that was developed solo by a world-class expert dumping code over the wall once every 6 months than a community project being hacked on by 100 amateurs.

Without additional context I find recency of last commit and number of committers to be almost impossible to draw useful conclusions from.

I wish this contrarian argument would die already. It amounts to basically this:

"On average, X is better than Y."

"Ah but I would rather have the top end of Y than the bottom end of X, therefore comparing averages is useless."

Yes, a brand new Ford Focus would be better than a Ferrari that doesn't run, but generally Ferrari is the better brand.

I wish this both-lazy-and-condescending missing-the-point hand-wavy-analogy argument style would die already.

The earlier poster in this thread implied that number of contributors and recency of commits in one of two competing github projects was evidence that it was better.

My point is that these are inadequate (often totally misleading) heuristics unless both projects are otherwise extremely similar, which they usually are not, and even then are usually not very useful heuristics compared to other ways of comparing the projects.

Unless you know who the authors are, what the project management/organization style is, how the project is funded / what level of commitment the authors have, what the project release cycle is like, etc., or unless you directly examine the code yourself, the only thing that looking at the most recent git commit tells you is how recently someone published public code changes. Which is not something that anyone evaluating two projects cares about directly, but only as some heuristic signal of other features that might be more costly to examine.

But note that commit recency doesn’t give a remotely useful sense of how extensible the project is, how readable or efficient the code is, how well designed the API is, how good the documentation is, how friendly the community is, how competent the project management is, .....

If we want to make a car analogy, it’s like choosing which car to buy based on how frequently the company introduces new models, or how many engineers they employ, rather than based on customer reviews, reliability estimates, accessibility of mechanics, gas mileage, top speed, or storage capacity.

Your argument is basically analogous to: “because the average car with frequent updates is better than the average car with infrequent model updates, criticizing that as a primary criterion for choosing a car is an invalid argument”. Notice that you haven’t even bothered to examine whether your premise about the relation between updates and quality is true, or whether that average relationship makes update frequency a practically useful heuristic or not.

You are arguing a red herring. I never posited that update frequency is the only useful metric, or that you shouldn't consider how well the library itself works. You should consider all the aspects that are relevant to your use cases...and often this should include the community strength, along with the intrinsic library design, funding, documentation, etc. Certainly you shouldn't stop at the library design and code itself, that is just one of many considerations to weigh when adopting a dependency.

In the absence of any other information, the more recently updated codebase is preferred to the least recently updated, for the same reason that an abandoned codebase is dispreferable.

Of course you should seek additional context if you need it. Community strength and commit activity are pretty useful to consider though.

One useful conclusion: security bugs are likely to be fixed in a timely manner that won't put your users at risk.

Alternative equally speculative conclusion from the same data: the very-in-flux code is so shoddy that there are constant security bugs needing weekly fixes, whereas the stable and relatively inactive code is so rock solid that nobody ever needs to touch it for it to keep working.

For example, how often does DJB publish new code changes to his various projects?

You are right, but we don't know the proficiency of the developers. The best info we have is the repository update info. That's what you have to quickly compare the projects.

Just look at the code.

What specifically should we be looking for?

Plotnine is a real grammar of graphics. Check out http://plotnine.readthedocs.io/en/stable/about-plotnine.html for a little more information about the two projects.

What is meant by "a grammar"?

Is it the way we concatenate functions to create what's essentially a sentence of what we want the plot to be?

ggplot2 was inspired by: https://www.amazon.com/Grammar-Graphics-Statistics-Computing...

and Hadley Wickham wrote about it in http://vita.had.co.nz/papers/layered-grammar.pdf.

I'm no expert, but I think that one of the main ideas is to separate the elements of making a plot from the way that the data is presented. For example, in ggplot2, you have the data that will go into the graph, the type of plot (or "geometry") that defines how the data are presented (scatterplot, bar plot, etc.), and then various "layers" that can be added that affect style.

In order to split a plot into subplots, you simply define how it is to be faceted (what column should be used to define groups). Grammar-of-graphics moves plotting away from the "turtle graphics" model and lets you specify what should be done. Then ggplot figures out how to do it, kind of like SQL vs. writing for loops to retrieve information.
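
To make the SQL analogy a bit more concrete, here is a toy sketch in plain Python. It is not plotnine or ggplot2 code: the `facet` helper and the sample rows are made up for illustration. The caller only declares which column defines the groups, and the helper works out the splitting.

```python
from collections import defaultdict

# Hypothetical sample data: one dict per observation.
rows = [
    {"age": 10, "height": 140, "sex": "F"},
    {"age": 12, "height": 150, "sex": "M"},
    {"age": 11, "height": 145, "sex": "F"},
]

def facet(rows, by):
    """Split rows into one group per distinct value of column `by`."""
    panels = defaultdict(list)
    for row in rows:
        panels[row[by]].append(row)
    return dict(panels)

# Declarative: say *what* defines the subplots, not how to loop over them.
panels = facet(rows, by="sex")
for key, group in sorted(panels.items()):
    print(key, [(r["age"], r["height"]) for r in group])
```

A real grammar-of-graphics library would go on to draw each group as its own panel; the sketch stops at the grouping step.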

Aha! Thank you. So it's kind of like a declarative way of plotting.

I found http://r-statistics.co/ggplot2-Tutorial-With-R.html helpful for learning about the ggplot "grammar of graphics".

I get that it is ggplot-based, but as it is Python, why not be more idiomatic and use chaining instead of adding?
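
For what it's worth, the two styles can be made to do the same thing. Below is a toy sketch with a hypothetical `Plot` class (an assumption for illustration, not how plotnine is implemented), where `+` and a chained `.add()` both return a new object with the layer appended.

```python
class Plot:
    """Toy plot spec holding an immutable tuple of layer names (hypothetical)."""

    def __init__(self, layers=()):
        self.layers = tuple(layers)

    def add(self, layer):
        # Chaining style: return a new Plot so calls can be strung together.
        return Plot(self.layers + (layer,))

    def __add__(self, layer):
        # ggplot2 style: `plot + layer` delegates to the same logic.
        return self.add(layer)

added = Plot() + "geom_point" + "facet_wrap"
chained = Plot().add("geom_point").add("facet_wrap")
print(added.layers == chained.layers)
```

One possible trade-off: with `.geom_point()`-style chaining, every layer type would have to be a method on the plot class, whereas `+` accepts any standalone object, which may make third-party extensions easier.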

This is nice and all, but the syntax and names are totally unintuitive.

If I'm to dig in the manual, I might as well build my plots with the standard syntax of any random plotting library.

Is this "grammar of graphics" any good if you invest more time in it?

I find it the opposite. The syntax is the most intuitive of all plotting applications.

Layers are as follows [1]

1. Data

2. Aesthetic mappings

3. Statistical transformation (stat)

4. Geometric object (geom)

5. Position adjustment

Once you get the hang of this, it becomes easy to create new plots purely from the understanding of the layers. In matplotlib or even in Seaborn, I find myself constantly Googling for examples.
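
As a rough illustration of the five parts listed above, here is a toy `Layer` record in plain Python. The class and its default values are assumptions for the sake of the example, not the ggplot2 or plotnine API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Layer:
    """Toy layer made of the five components (hypothetical)."""
    data: list                   # 1. data: rows to draw
    mapping: dict                # 2. aesthetic mappings: aesthetic -> column
    stat: str = "identity"       # 3. statistical transformation
    geom: str = "point"          # 4. geometric object
    position: str = "identity"   # 5. position adjustment

rows = [{"age": 10, "height": 140}, {"age": 12, "height": 150}]

# A scatterplot: raw values drawn as points.
scatter = Layer(data=rows, mapping={"x": "age", "y": "height"})

# A bar chart is a different combination of the same five parts.
bars = Layer(data=rows, mapping={"x": "age"},
             stat="count", geom="bar", position="stack")

print(scatter.geom, bars.geom)
```

The point of the sketch is that a new plot type does not need a new function, only a different choice for each of the five slots.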

ggplot2 is the most beautiful thing to happen in visualization space!

[1] Wickham, Hadley, and Carson Sievert. "4.4.1 Layers." Ggplot2: Elegant Graphics for Data Analysis. Dordrecht: Springer, 2016. N. pag. Print.

> Is this "grammar of graphics" any good if you invest more time in it?

For some, it was probably one of THE things that kept them using R over something else. Yes, definitely worth it.

There are a few concepts to learn. With a grammar, you can create plots in 5 minutes that would take an hour to create using "the standard syntax". For many people once they experience it, they do not want to go back. You could be one of them.

Yes, it's worth investing the time. Grammar of graphics is life changing if data visualization is part of your day to day.

This looks interesting, but I find the documentation somewhat lacking. Is the user supposed to know ggplot?
