Ggplot2 for python (github.com)
187 points by martingoodson on Oct 14, 2013 | 60 comments

This is really exciting! I've always liked ggplot2 but sadly struggled with R syntax (the only time I ever used it was for ggplot2, and found it really confusing).

For anyone reading this, I don't think this is intended to replace e.g. matplotlib which I suspect is more powerful overall. It's more for creating good looking, publication-quality graphs. A lot of thought has gone into ggplot2's default theme. A lot of influence from Tufte et al. if I remember rightly. Though, sadly, it looks like this port doesn't replicate ggplot2's graphs pixel-for-pixel, which is a shame.

Edit: Incidentally if anyone has a recommendation for a good graphing app for OS X, I'd be all ears. I still haven't found a good one. I usually use Plot (http://plot.micw.eu) which is great (scriptable, produces good looking graphs) but has a bit of a clunky interface. I personally find DataGraph's interface horrendous, even though its graphs are good, and Excel takes forever to make anything remotely decent. matplotlib and its ilk are fine but require custom scripting. I've been using xmgrace recently simply because it launches quickly, but it's (understandably) not retina and not native and just a pain really. What I'd give for a nice graphing app...

ggplot2 may be influenced by Tufte, but the main influence is Leland Wilkinson's grammar of graphics. In fact, that is what the 'gg' in the name stands for.

The grammar of graphics approach attempts to identify the components of a graphic so we can specify them in a high level, abstract manner. This is also why the syntax is so unusual compared to most plotting libraries. (Grammar of graphics also deeply influenced protovis and D3.js.)

I've found learning the basics of ggplot2 to be a great investment in my productivity at data analysis. Basically you get to say "make a plot; x-axis is page views and y-axis is time and make it a bar plot with bars grouped and colored by user age" and you get something that looks great.
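To make the "say which column maps to what" idea concrete, here is a minimal pure-Python sketch. It is not the ggplot2 API or this port's API: `render_bars`, the column names, and the sample rows are all invented for illustration. It only shows how a declarative mapping from columns to x, y and fill can drive a grouped (text) bar chart.

```python
from collections import defaultdict

# Hypothetical sample data, one row per observation.
rows = [
    {"time": "Mon", "page_views": 3, "age_group": "18-29"},
    {"time": "Mon", "page_views": 5, "age_group": "30-49"},
    {"time": "Tue", "page_views": 4, "age_group": "18-29"},
    {"time": "Tue", "page_views": 2, "age_group": "30-49"},
]


def render_bars(data, mapping):
    """Draw grouped text bars from a column-to-aesthetic mapping.

    `mapping` names the column feeding each aesthetic: x (bar group),
    y (bar length) and fill (the symbol used to draw the bar).
    """
    symbols = "#*o+"
    # Assign one symbol per distinct fill value, in order of appearance.
    fill_values = []
    for row in data:
        value = row[mapping["fill"]]
        if value not in fill_values:
            fill_values.append(value)
    symbol_for = {v: symbols[i % len(symbols)] for i, v in enumerate(fill_values)}

    # Group rows by the x aesthetic, preserving input order.
    groups = defaultdict(list)
    for row in data:
        groups[row[mapping["x"]]].append(row)

    lines = []
    for x_value, members in groups.items():
        lines.append(str(x_value))
        for row in members:
            fill = row[mapping["fill"]]
            bar = symbol_for[fill] * row[mapping["y"]]
            lines.append("  {:<6} {}".format(fill, bar))
    return lines


chart = render_bars(rows, {"x": "time", "y": "page_views", "fill": "age_group"})
print("\n".join(chart))
```

Changing the chart means changing the mapping dictionary, not the drawing code, which is roughly the appeal being described.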

(edited to fix minor autocomplete typo.)

You could try my Veusz http://home.gna.org/veusz/ package. GUI, scripting and multiplatform. Quite a few satisfied users!

I did all my graphs in phd work with Veusz; awesome work. Now I study it for the PyQt excellence. Thank you for the great work!

Wow, I'm surprised I haven't heard of this. Looks like it could be a good competitor to the Sigmaplot/Origin/Igor pro audience?

Hopefully it's a good competitor to these (people tell me so), but I hadn't used them before embarking on Veusz! I mainly got annoyed with the state of command-line plotting facilities in astronomy and unix (like IDL, gnuplot, etc).

I started with the idea that graphs should be built in an object-oriented fashion out of widgets. The widgets are arranged in a tree, a page holds a graph or a grid of graphs and a graph has axes and plotting widgets, and so on. Each widget has a set of properties and formatting settings which can be changed. It's now gained more data manipulation abilities. Data is stored in named datasets, linked to external files or entered manually, or you can even capture it from sockets or external programs. You can easily write a plugin to load your own file format. There are a set of data manipulation plugins for filtering and so on (again you can write your own). When plotting, you can just enter expressions to investigate your data.

Most of Veusz is written in Python with PyQt. There's a bit of C++ code to handle the inner loops, but it's pretty responsive. The next release should support Python 3, too.

One nice thing is that the saved files are simply python commands to regenerate the plot, so it's easy for the user to automate making plots. You can also use it as a python module for plotting, using the same command interface as the saved plots use.
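A rough sketch of the two ideas above (a tree of widgets, and a saved file that is just the commands needed to rebuild it) might look like the following. This is a toy, not Veusz code: the `Widget` class and the command names `Add`, `To` and `Set` are placeholders chosen for the sketch and are not meant as a statement about Veusz's real command set.

```python
class Widget:
    """A node in the plot tree: a type, a name, settings, and child widgets."""

    def __init__(self, kind, name, **settings):
        self.kind = kind
        self.name = name
        self.settings = settings
        self.children = []

    def add(self, child):
        """Attach a child widget and return it so calls can be chained."""
        self.children.append(child)
        return child

    def to_commands(self):
        """Emit command text that would rebuild this subtree when replayed."""
        lines = [
            "Add({!r}, name={!r})".format(self.kind, self.name),
            "To({!r})".format(self.name),  # descend into the new widget
        ]
        for key in sorted(self.settings):
            lines.append("Set({!r}, {!r})".format(key, self.settings[key]))
        for child in self.children:
            lines.extend(child.to_commands())
        lines.append("To('..')")  # climb back to the parent
        return lines


# A page holds a graph; the graph holds an axis and a plotting widget.
page = Widget("page", "page1")
graph = page.add(Widget("graph", "graph1"))
graph.add(Widget("axis", "x", label="time"))
graph.add(Widget("xy", "xy1", marker="circle"))

script = page.to_commands()
print("\n".join(script))
```

Because the saved form is plain command text, a user could generate or edit it from a script, which is the automation benefit mentioned above.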

Veusz is nicely designed and quite original. There are a few Qt-based programs more similar to Origin: QtiPlot, LabPlot, SciDAVis. In 2009 there was a big announcement, featured on LWN and other Linux websites, that LabPlot and SciDAVis (a QtiPlot fork) would merge, but soon after both projects died.

At that time QtiPlot was very actively developed by one guy who was living off license fees for his GPL'ed program. I was trying a similar thing then (with a program called fityk) and emailed him to ask how he was doing. In my case, it was an interesting experience, but after a few months and depleted savings I had to find a day job. He survived much longer with a more popular program, but I suppose he also has a day job now.

Thanks for the recommendation :)

Just FYI: This is just a wrapper around matplotlib, so it's not intended to replace it at all. Thus the non-pixel-to-pixel relationship with ggplot.

Quite a nice project, though!

Please don't blame R for the ggplot syntax -- it is bizarre on its own. Also, ggplot requires a very specific form of input, and thus many scripts must be heavily polluted with nontrivial preprocessing routines.

Your pollution is my tidy-ness: http://vita.had.co.nz/papers/tidy-data.html.
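For readers who haven't met the reshaping step being argued about here, a minimal pure-Python sketch follows. The data and the `melt` helper are made up for illustration (the name only nods at the reshape-style operation; this is not any library's implementation). It turns "wide" rows, with one column per measured condition, into "long" rows with one observation each, which is the kind of input the grammar-style tools expect.

```python
# Hypothetical wide-form data: one row per subject, one column per condition.
wide = [
    {"subject": "a", "control": 1.2, "treated": 3.4},
    {"subject": "b", "control": 0.9, "treated": 2.8},
]


def melt(rows, id_column, variable_name="variable", value_name="value"):
    """Turn wide rows into long rows: one output row per (id, measured column)."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue  # the id is carried along, not melted
            long_rows.append({
                id_column: row[id_column],
                variable_name: column,
                value_name: value,
            })
    return long_rows


long_form = melt(wide, "subject")
for row in long_form:
    print(row)
```

Whether those extra lines count as pollution or tidiness is exactly the disagreement in this subthread.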

hadley, I couldn't pass up an opportunity to say thank you for ggplot2 and plyr. Both have made R infinitely more useful to me and have been used thousands of times by me over the past 3 years.

You're very welcome!

It is certainly a pollution of code -- a necessity to convert a good, efficient data structure which base graphics can plot into some specific form will always seem redundant from a programming point of view (and I was replying to a programmer).

Obviously the situation is the opposite when one considers, let's say, "analysts", i.e. people who expect to be able to completely ignore the form of the data and focus purely on the content -- for them your stack of tools is pure gold, even with its trade-offs.

I'd love to see an example of that - I agree that you might have to do some data transformation to use ggplot2, but I think the net effect is still fewer lines of code than with base graphics (except for v. simple graphics).

Some clarification about what you mean regarding efficiency would be useful - do you prefer arrays?

Boxplots from a list, matplot of a matrix, all the cool default plot methods for various complex objects like clusterings, trees, models...

Maybe I'm alone in that, but I believe the power of R lies in its resilience and the fact that it allows me to write code that is equally obvious in what it does and how it does it. The philosophies of one-true-data-structure and natural-language-like expressions don't play well with that, though I admit they are very useful for people accustomed to declarative programming.

R is a scripting language used a lot in statistics and just graphing.


EDIT: oops, posted to the wrong person

heh, no worries. thanks for the link.

I'd say more that it's approaching "domain-specific language". (I think R's syntax is just horrendous; love the libraries, though! ggplot2 included. Thanks Hadley!)

I think R is a lot like JavaScript: there's a beautiful language at its heart, but it's surrounded by layers of cruft. It doesn't help that most people using R are not programmers.

hm. i've never even heard of R .... it's a graphing library ??

Also, seems like that's a lot of dependencies for ggplot2. I'm wondering if this is really worth switching off of matplotlib. Sounds like all it's really doing is changing the method that you call ?

R is the language of choice for statisticians. That means just about any statistical technique you can imagine is available as a library in R. It is very nice as an interactive environment to explore your data. The fact that it was designed by stats people also leads to some funky language choices. There are actually a number of other graphics packages in addition to ggplot. Good integration with C for speed (it can be slow otherwise on iterative code, and a bit of a memory hog, as it copies objects in memory with reckless abandon.)

It is my second choice language these days after python, and the only one to use when you need a certain library.

have you looked at plyr/pryr/reshape2 for the preprocessing routine?

Igor pro is quite popular among my colleagues. It's not free though, but does produce nice-looking figures. It's scriptable, and when you create a graph from GUI it leaves a trace of your actions in its history so you can copy and use it as a function template. I don't like the language myself personally, but there are several research groups completely built around it.

I did my PhD in such a group and was able to put up with Igor Pro for a couple of years, before going over to python+matplotlib. Igor does contain some nice ideas, and can certainly produce great plots, but the scripting language is so arcane and inconsistent that I just couldn't take it. At least back then there was no way around it.

I've tried Igor Pro before. I was impressed by its curve fitting abilities, but not much by its ease of use / speed of just getting things done. But perhaps it's time to revisit it. Thanks for the rec :)

This is amazing.

FYI, this works out of the box with Mac OS X Lion's system Python and the gfortran compiler from homebrew - that's the only thing I need to install separately (Xcode and brew are required to install that, of course, but most people doing any sort of coding seem to have them anyway).

Also, a tip for those of you wanting to do the same:

I'm running IPython notebook, ggplot and all required packages under my user account without any hassles whatsoever thanks to this:

$ cat ~/.pydistutils.cfg
[install]
install_lib = ~/Library/Python/$py_version_short/site-packages
install_scripts = ~/Library/Python/$py_version_short/bin
install_data = ~/Library/Python/$py_version_short/share

Just set $PATH to include the install_scripts, run pip, and you're all set. Everything just works, without the need to install another Python interpreter or mess about with system components.
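As a rough illustration of what those config entries resolve to, the sketch below expands the `$py_version_short` placeholder for whichever interpreter runs it and prints a candidate `PATH` line. It assumes the placeholder stands for the interpreter's "major.minor" version; the `expand` helper and the printed `export` line are only suggestions, not something from the original comment.

```python
import os
import sys

# Assumption: $py_version_short means "major.minor" of the running interpreter.
py_version_short = "{}.{}".format(*sys.version_info[:2])

# The three path templates from the config file shown above.
templates = {
    "install_lib": "~/Library/Python/$py_version_short/site-packages",
    "install_scripts": "~/Library/Python/$py_version_short/bin",
    "install_data": "~/Library/Python/$py_version_short/share",
}


def expand(template):
    """Fill in the version placeholder and expand ~ to the home directory."""
    path = template.replace("$py_version_short", py_version_short)
    return os.path.expanduser(path)


paths = {name: expand(template) for name, template in templates.items()}

# A line one might add to a shell profile so installed scripts are found.
path_line = 'export PATH="{}:$PATH"'.format(paths["install_scripts"])
print(path_line)
```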

(edited for formatting)

> I've tried other libraries like Bockah and d3py but what I really want is ggplot2.

I wonder if that's meant to say, "bokeh"... and I would be interested to know the major differences. This library might be better than bokeh if there's no dependencies on coffeescript and such. But I wonder in what way this is un-pythonic, except perhaps in its imitation of R syntax. But ggplot2 is also not very much like the rest of R either.

I'm a Bokeh developer. We have plans to make the CoffeeScript situation a lot easier to deal with; this fix should be coming soon.

I think you only need coffeescript if you want to develop bokeh not if you want to use it.

Is there a reason that the dependencies are not installed via the pip call for ggplot? I ask this out of ignorance, not trying to be facetious.

For a more painless install try Continuum's Anaconda or Enthought's Canopy. They bundle together the most important scientific libraries, work cross-platform and in this case include all dependencies.

Wow, I just checked these out. Thanks for pointing me in this direction.

At times numpy/scipy/matplotlib have proven pretty troublesome to install with pip. I think they have it working now, but it's still not 100% reliable.

This is why you use debian / ubuntu.

Something like 70% of R users are on windows.

It's 2013 - Microsoft needs to bite the bullet and start shipping a C/C++ compiler as part of Windows :)

It doesn't come with, but Visual Studio Express has been free for a while: http://www.microsoft.com/visualstudio/eng/downloads#d-2012-e...

I write a couple of python c extensions - it would be nice if you could assume your users would be able to compile the extensions out of the box without expecting them to go install another piece of software.

Same goes for OS X, having to install a multi GB IDE to get gcc is ridiculous.

I do have XCode, but I am wondering if I could have just installed gcc with macports since they seem to have it.

The main problem is that they have system library dependencies that pip doesn't handle.

I think all you need is build-essential and python-dev.

Matplotlib depends on freetype, libpng, and libjpeg as well. Numpy can be built without any external dependencies, but it will be a very slow installation. You're better off with an accelerated BLAS library (e.g. ATLAS/MKL/etc) and some sort of LAPACK library. Some of numpy's functionality also needs a fortran compiler to build. At any rate, there are other system dependencies beyond a basic compiler and the header files for python.
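One way to see which of these system libraries are visible before attempting a build is a quick check with the standard library's `ctypes.util.find_library`. This is only a sketch: the short names in `wanted` are assumptions ("blas" and "lapack" stand in for whichever implementation the system provides), and a hit only shows that a shared library is present, not that the headers needed for compiling are installed.

```python
from ctypes.util import find_library

# Library names taken from the comment above; "blas" and "lapack" stand in
# for whichever implementation (ATLAS, MKL, ...) the system provides.
wanted = ["freetype", "png", "jpeg", "blas", "lapack"]


def check_libraries(names):
    """Map each library name to what the system lookup finds, or None."""
    return {name: find_library(name) for name in names}


found = check_libraries(wanted)
for name in wanted:
    print("{:<10} {}".format(name, found[name] or "not found"))
```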

Ah I think the reason why I didn't think I needed BLAS/LAPACK is because I install R on the server first which installs those as shared libraries.

Nope, numpy needs libblas and scipy needs libatlas, both of which have to be installed from the system package manager. I slap my forehead every time I move to a new system and find that the pip install in the virtualenv failed.

it's on the todo list. i was hesitant to put all of them in the dependencies list since not all are pip installable (all of the time)

I'm really excited for this. I might finally switch to python now!

Indeed, there's some incredible work put into the Python data analysis toolkit. Pandas is very impressive. However, the lack of ggplot was holding me back from switching; this and improvements to other graphing libraries really make me want to use Python in my next projects.

As someone who does all my data analysis in Python, I'm curious what it is about ggplot that made its absence a deal breaker for you. It's obviously quite powerful, considering all the excitement this port has generated, but I haven't seen anything explain what it provides over other plotting solutions.

You need to understand what ggplot is all about, i.e. a structured/formulaic way to think about and build graphs. Explaining it cannot possibly do it justice. You have to dive into it, even a little bit, and the way you see graphs and plots totally changes; you immediately buy into it and would never wish to go back.

I haven't really said anything to give you a sense of comparison. It's kind of like PG's essay (http://www.paulgraham.com/avg.html) about how the merits of higher level programming (specifically in Lisp) can hardly be explained to those who have not been exposed to it.

The merits of higher level programming are hard to explain to a Blub programmer, but the differences are trivial to explain. I can tell a C programmer that Lisp has automatic memory management, higher order functions, macros, and atoms. She may not know why those matter, but I can tell her they exist. I can tell a Python programmer that Haskell has pure functions, Monads, Arrows, and an advanced type system. He'll think those will make his life harder, but he'll know they're there.

What makes GGplot different? I know that I won't understand why this difference matters until I've played with it, but it will at least tell me where to start playing.

Here's another stab at why ggplot2 is a Good Thing. In the easiest plotting case, there is a function that builds exactly the visualization that you're looking for. This is great as long as you don't need to do anything that deviates from the normal set of barplots, histograms, linegraphs, piecharts (shudder), etc.

But let's say you need something a little different. Maybe you want to add additional dimensions to that dot plot. Maybe you want the dots' size or color to be mapped to different aspects of your data. In a world without ggplot2 (or more generally the grammar of graphics), you're pretty much stuck going to a very primitive drawing system in which you're specifying the virtual path of a pen, or working with basic geometric figures.

The grammar of graphics and ggplot2 occupy a sweet middle ground between being able to simply pick an off-the-shelf visualization, and needing to draw the whole damn works manually. And because the grammar really is consistent, you can also play with different facets of your data and build completely different charts to see which is better at presenting your thesis.

In short, ggplot2 rocks, Hadley/Leland rock, and a port of ggplot2 to Python is nothing but good news for the Python community.

I mentioned the difference

"a structured/formulaic way to think about and build graphs"

This is what plotting using the "grammar" provides. Each plot is like a sentence, and a sentence can be composed using some of the following: verb, noun, adverb, adjective, interjection, pronoun, preposition, conjunction. Similarly, ggplot2 allows you to think of a graph in terms of layers, and each layer can have different components that describe the "geometry" or "statistics" of the plot. Plus, there is a lot more.
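A toy version of the "layers with a geometry and a statistic" idea can be sketched in plain Python. Nothing here is ggplot2's or the port's real API: `Plot`, `Layer`, `mean` and the geometry names are invented for illustration. The point is only that a plot becomes a data-plus-mapping spec to which layers are added with `+`, each layer pairing a geometry with a statistic computed from the data.

```python
class Layer:
    """One layer: a geometry name plus a statistic applied to the data."""

    def __init__(self, geom, stat=None):
        self.geom = geom
        # Default to the identity statistic (use the values unchanged).
        self.stat = stat if stat is not None else (lambda values: values)


class Plot:
    """A plot spec: data, an aesthetic mapping, and an ordered list of layers."""

    def __init__(self, data, mapping, layers=()):
        self.data = data
        self.mapping = mapping
        self.layers = list(layers)

    def __add__(self, layer):
        # Adding a layer returns a new spec and leaves the original untouched.
        return Plot(self.data, self.mapping, self.layers + [layer])

    def describe(self):
        """Return one (geom, computed values) pair per layer."""
        y_values = [row[self.mapping["y"]] for row in self.data]
        return [(layer.geom, layer.stat(y_values)) for layer in self.layers]


def mean(values):
    """A summary statistic: collapse the values to their average."""
    return [sum(values) / len(values)]


data = [{"x": 1, "y": 2.0}, {"x": 2, "y": 4.0}, {"x": 3, "y": 9.0}]
base = Plot(data, {"x": "x", "y": "y"})
plot = base + Layer("point") + Layer("hline", stat=mean)

for geom, values in plot.describe():
    print(geom, values)
```

Swapping `Layer("point")` for another geometry, or `mean` for another statistic, changes the chart without touching the data or the mapping, which is the consistency the comments above are praising.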

You could probably describe your favourite plotting package as structured and formulaic, but an experience with ggplot2 would convince you otherwise.

ggplot is a beautiful thing - excellent job porting it to python. I only wish something similar were available for ruby.

Having written all my simulation code in Python, and having done all the visualization for said simulations in R mainly for ggplot2 graphs, this is very exciting.

This is fantastic - I remember the pain of sorting out R issues for some finance analysts who were wedded to ggplot. So much better to be able to do it in Python.

Great stuff. Thanks a ton!

