I would strongly recommend against anyone using Seaborn.
I used it for charts in a paper recently, since it includes swarm plots. I hit a problem when overlaying certain types of plot on the same axes (I think it was swarm plots on top of box plots, so I could show every data point as well as the quartiles). The problem was that the data would end up shifted, so the x axis labels weren't correct, even though each plot on its own would work fine when the other was commented-out. The only reason I noticed this was because I spotted that a peak I knew occurred at x=13 was showing up at x=14.
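For reference, a minimal sketch of the kind of overlay I mean (not the exact script from the issue; the data here is just seaborn's bundled example set):

    import matplotlib.pyplot as plt
    import seaborn as sns

    # Example data bundled with seaborn; my real data was also categorical on x.
    tips = sns.load_dataset("tips")

    ax = plt.gca()
    # Box plot for the quartiles...
    sns.boxplot(x="day", y="total_bill", data=tips, ax=ax)
    # ...with every individual data point overlaid as a swarm on the same axes.
    sns.swarmplot(x="day", y="total_bill", data=tips, color="black", size=3, ax=ax)
    plt.show()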
Although it took a specific set of circumstances to trigger, I thought this was still a serious problem, since it causes data to be misrepresented and doesn't produce any warning that something is wrong. I made a minimal example script and opened a github issue ( https://github.com/mwaskom/seaborn/issues/1409 ), where the library author insulted me and locked the issue. I sent them a followup email to explain that I was not after "free tech support" (I'd already worked around the issue for my paper by 'faking' the labels), that I wanted to help improve the library so that others would avoid having incorrect plots (especially those not lucky enough to spot it like I was), and that I'd already spent considerable time narrowing down the problem to the minimal example, as evidence that I wasn't trying to be a freeloader. I also began my email with an apology for using this side channel, and a promise that I wouldn't contact them again unless they consented to it. The author replied with more insults.
I didn't contact them again, as promised, but now I'm actively opposed to anyone using this project, due to the author's complete disregard for corruption of scientific data.
Going to agree with the Seaborn guy here. While you qualify your stance with some diplomatic terms, it does come off as very aggressive. Adding polite statements to a rude letter doesn't make it polite; and an open source maintainer doesn't have to do anything for you. You call the software objectively wrong and tell the author his product is broken when you really should be asking how to use it on Stack Overflow.
Seaborn is a really nice way to take the overhead out of matplotlib, and it makes your non-seaborn plots look better too. I highly recommend it.
I'm not sure I get this characterization of the exchange; the parent comment is a little curt in his third reply, but the original issue is very well-formulated and the Seaborn guy's replies are incomprehensible. After elaborating, the parent gets a little pedantic in trying to get the maintainer to respond with something other than a one-sentence dismissal, and although at that point the comment is a little hostile, it clearly isn't "an obnoxious rant" or "asking for free tech support". Relative to the number of stars the repository has, this exchange was probably one of the most user-hostile I've ever seen for a legitimate question on a bugtracker.
As I wrote in my above comment, and in the followup email I sent after the bug was locked, I got the graphs I needed for my paper after about 5 minutes playing with the axes. No StackOverflow needed.
My problem wasn't "I need these particular graphs to look right", my problem was "Using this library in a seemingly reasonable way can silently produce corrupt/misleading results". That's a problem for all potential new users, and hence why I decided to invest a bunch of time in narrowing it down, report and pursue it.
I also didn't call the software itself objectively wrong; only the decisions/execution trace which lead to the incorrect graphs:
"this script doesn't set any axis labels or ticks: those decisions are made by matplotlib and/or seaborn, and are objectively wrong in this situation, regardless of the algorithmic details of how they have been arrived at."
I wouldn't have minded if it were closed as unfixable or something, but it was instead mischaracterised:
"This is, in my opinion, a legitimate issue worthy of acknowledging (even if such acknowledgement takes the form of a "won't fix" closure)."
I wouldn't recommend Seaborn to anyone who wasn't prepared to manually cross-check all of the entries and statistics in the resulting plots.
I find Pandas+Bokeh to be a pretty killer combination. You get nice vectorized operations for manipulating the data upfront, and a pretty wide range of visualizations that are plenty customizable. Main beefs with bokeh are:
- It's great that I can manipulate a view by zooming, panning, turning a series off, whatever, but all that state is lost when I send the URL of the plot to someone else (e.g., for the case where it was generated by a CI job or something rather than locally on my machine). It should be easier in Bokeh to grab all that state and stuff it into a querystring that can then rehydrate the same view later on.
- Doing plots with more than one y-axis is a lot more awkward and fiddly than it should be, and even once you succeed, the axes can't be panned independently.
- If your data series name overlaps with the label in your legend, the whole plot breaks in an extremely non-obvious way (basically, Bokeh thinks that you want a legend entry per row in the named data series). Even being fully prepared for it, I lose time to this every once in a while, and it's almost a rite of passage that every junior dev hits it and burns an hour or two trying to figure out what is wrong.
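To illustrate that last gotcha, a rough sketch of the failure mode (assuming an older Bokeh where the single `legend` keyword is still accepted; newer releases split it into legend_label/legend_field, which largely fixes this):

    from bokeh.models import ColumnDataSource
    from bokeh.plotting import figure, show

    # Hypothetical data where a column happens to share its name with the
    # legend label we want.
    source = ColumnDataSource(data=dict(x=[1, 2, 3], temperature=[10, 20, 30]))

    p = figure()
    # Intended: a single legend entry reading "temperature".
    # Actual: because "temperature" is also a column in `source`, Bokeh treats
    # the legend as a *field* to group by, giving one entry per distinct value.
    p.line(x="x", y="temperature", source=source, legend="temperature")
    show(p)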
Yup. If you pick one, spend the time to learn it, and use it for a project, there's a non-trivial chance that your choice will join the ever-growing collection of library abandonware in the not-too-distant future.
This is why my favorite Python visualization tools are not Python - I've been burned too many times by libraries coming and going, and I just don't have the time to spend farting around trying to track the latest library fads.
Argh, it's so frustrating to see this kind of sentiment. We're truly spoiled by choice. The libraries don't all serve the same purpose, and not everyone needs every library. This is literally a guide to help pick which one I may want. What more could one ask for??
That said, I understand it sucks to build on top of someone else's code only to have it be discontinued, but honestly it only takes a few minutes to judge a project's maturity. Here are some very easy rules of thumb. Don't write a lot of important code relying on a library that:
- Is younger than 3 years old.
- Has fewer than 3-4 major contributors and 20 contributors overall.
- Has lost steam: a lot of issues and pull requests open and stale with no triage tags, no discussion or responses.
- Doesn't seem to have any automated testing / packaging infrastructure set up.
- Doesn't seem to have a regular ongoing cycle of releases, be it long or short.
- Doesn't have nicely laid out documentation.
- Doesn't seem to have a user base, as indicated by a preponderance of questions and answers on stackoverflow, etc.
- Just posted their "we made a cool new thing" post on HN a few weeks ago.
Yes, the cutoffs are arbitrary (and flexible!), but this hasn't failed me yet. The python world is filled with many wonderful and mature libraries. It just also has a lot of up and coming, promising young ones. Use whichever!
A refactoring and merging of these libraries to enable fewer people to maintain more functionality, such that the bus-factor of any given part of the functionality is higher.
That's entirely backwards: to reduce bus-factor risk, you need more people maintaining less, not fewer people maintaining more.
In any case, the incentives are simply not there, as different projects have very different priorities (which is why there are so many projects). Some folks want to monetize their special sauce (Plotly). Some folks want to focus on high-level statistical charting (Altair, Chartify). Some folks want to focus on interactive data exploration (HoloViews). Some folks want to focus on high performance and streaming (Bokeh). Some folks want to focus on high-quality static image generation (MPL). The human and economic cost of getting all those groups under one tent is astronomical.
> you need more people maintaining less, not fewer people maintaining more
I see how you got confused, but I was making two separate assertions.
Assertion 1: If you merge the library projects together, and strip out the redundancies between them, then you'll have the same number of contributors, now distributed over fewer total lines of code. So each contributor can learn more of the codebase. (Bus factor goes up.)
Assertion 2: If you refactor the resulting libraries to reduce the total complexity (i.e. reduce the API surface from that of the union of all the merged-in libs), then you can begin to strip out technical debt from the project from the outside in. By eliminating now-dead code, you remove places where bugs can arise, and you lower the number of dependencies (which could otherwise have been sources of API-breaking changes when they update.) Thus, the number of people needed to maintain the project goes down. So the same total functionality can then be maintained by fewer contributors. (You don't actually remove contributors; they just become reserve capacity, with each contributor able to be less-overworked for the same result.)
Alas, it's not so simple. How much actual overlap is there to merge between Bokeh and Matplotlib, for instance? MPL renders images in Python and has no JS component; Bokeh does all of its rendering in JavaScript! The Python API is mostly just a thin wrapper around BokehJS. MPL has no server component at all. Neither of them does what Datashader does for large data sets. Neither of them has a high-level statistical charting API; that's only in Seaborn or Chartify. Merging all these things together would cost a fortune in time and money, and at the end of the day it wouldn't actually reduce the total code size to any appreciable degree.
Then again, if you learn one well it's not too difficult to learn another one as well. And once you've worked with a couple different libraries, you start to understand the different data models and it's even easier to adapt to the next library.
Also, for many of the visualizations I've worked with I'm fine to use an older library that's reached a stable point of development. It feels like the whole conversation about whether to build your new project on the latest web framework, or use something old and established like Django or Rails. Boring, relatively stable software in any area is a pleasure to work with if you really want to focus on your real-world problem more than you want to focus on newer tech.
there's a non-trivial chance that your choice will join the ever-growing collection of library abandonware in the not-too-distant future.
I’ve been using Matplotlib for the last 10 years and expect to continue using it for the next 20. The stability is there, you just need to resist the urge to keep switching. It’s like vi: whatever your “favourite” editor is, you know that vi will always be there.
Of course, as others have said, ggplot2 is better, but I am going to check out plotnine now...
I think what happens with visualization libraries is similar to what happens with things like cars or clothes: there's a lot of diversity because people have different tastes. If it were only about features there would be a lot less variety, but since it's very subjective there are a lot of very similar choices with slight differences, and people pick one or another mostly based on personal preference.
For other things, people choose based on more objective factors, so almost everyone reaches the same conclusion and you end up with only a few choices.
Python visualization needs a visionary like Hadley Wickham. Once you get used to the clarity of "Grammar of Graphics", every other plotting library feels like a clumsy tool.
Also, people pointing out the diversity of needs behind the multiverse of plotting libraries in Python probably aren't aware of the large number of extensions built on top of ggplot2[1]. In Python, they end up being completely new packages.
Personally, I don't _want_ a grammar of graphics; I want a grammar of data, where the data happens to have a graphical representation. I don't want to spend ages piecing together a fancy plot; I want to spend just a little time annotating my data to declare what it means, and then no matter how I slice and dice my data it will show up in a meaningful way. That way I can explore it to really understand it, which is the point of HoloViews (http://holoviews.org). But people approach plotting in lots of different ways, and some people actually _do_ want to spend their time making plots, so they are welcome to their ggplot2!
I don't _want_ a grammar of graphics; I want a grammar of data, where the data happens to have a graphical representation
The point of having a graphical representation is to communicate with other humans therefore some subjective judgement is required; there’s no “one true plot” for each type of data.
Of course! HoloViews does allow infinite customizability, to pull out more and more subtle features, show things more clearly, and just to make it look nice or to match your favorite style. But unlike ggplot2, HoloViews does that in a way that can apply to the _data_, rather than having to recapitulate the process every single time you build an individual plot. That way you and your colleagues can together build up whatever style you find most effective, then keep working with it across the full multidimensional landscape of data that you work with in a particular field. HoloViews is a completely different approach, if you really let the ideas sink in (e.g. from our paper about it at http://conference.scipy.org/proceedings/scipy2015/pdfs/jean-...), and is in no way a second-class citizen compared to ggplot2 or any other approach in R...
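To make that concrete, here's a tiny sketch of the annotate-once approach (made-up columns; the widget over the remaining dimension is generated automatically):

    import pandas as pd
    import holoviews as hv
    hv.extension('bokeh')

    # Hypothetical measurements: declare once what each column *means*,
    # as key dimensions (kdims) and value dimensions (vdims).
    df = pd.DataFrame({'time':    [0, 1, 2, 0, 1, 2],
                       'trial':   [1, 1, 1, 2, 2, 2],
                       'voltage': [0.1, 0.5, 0.3, 0.2, 0.6, 0.4]})
    ds = hv.Dataset(df, kdims=['time', 'trial'], vdims=['voltage'])

    # Any slice of the annotated data now knows how to display itself:
    # this gives a curve of voltage vs. time, with a slider over 'trial',
    # without writing any plotting code.
    curves = ds.to(hv.Curve, 'time', 'voltage')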
I think we have one in the team behind Vega Lite. Vega has a concise and well-designed grammar of graphics. It comes from the same academic centre (UW Interactive Data Lab) where Mike Bostock developed d3, and I think complements d3 nicely as a higher level tool.
Hadley Wickham's slightly abortive ggvis project was a wrapper around Vega.
Altair is the python library that wraps Vega Lite, making it even more concise to write chart specs.
One important unique feature is that Vega and Vega Lite (and Altair) provide declarative syntax for complex interactions such as cross filters.
There's a lot of good things to say about ggplot, but I feel like the best thing is RStudio's ability to auto-complete column names of the input dataframe throughout your pipe of ggplot commands / functions.
I think I would enjoy Python just as much if I figured out how to auto complete column names when writing code for my plots and pandas operations.
In those kinds of cases, RStudio is smart enough to know that such an argument should be populated by a column name from the dataframe the function is being applied to.
ggplot2 is great as a high-level plotting library, though as soon as you want to do something slightly different you end up in the rat's nest that is grob and ggproto (because 3 OOP systems aren't enough, so let's create an entirely new one just for this library).
I normally give up and tolerate matplotlib instead.
Altair hands down. I've tried and worked with a lot of those in data analysis, Altair just seems like 'magic'.
What should be easy is easy, and what is hard is still possible - for me it's the perfect mix. The only drawback (but a big one) is the performance implication of the JSON generated for the chart - over 50k rows of raw data per chart, you start to feel the lag.
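For what it's worth, one mitigation (assuming a reasonably recent Altair) is to stop inlining the rows in the JSON spec and reference them by file instead:

    import altair as alt

    # Write the dataset to a separate JSON file and reference it by URL,
    # rather than embedding every row in the chart spec itself.
    alt.data_transformers.enable('json')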
Vega-Lite author here. Performance is hugely important for us and is covered by the first three points on our roadmap: https://docs.google.com/document/d/1fscSxSJtfkd1m027r1ONCc7O.... Some improvements will ship with Vega-Lite 3. Others are being worked on as we speak.
I'm pretty good at Python and do most of my work in it nowadays, but if I need to make a data visualization, I still go to R just for ggplot2. Nothing currently in Python compares (not even the "ggplot2 port"), and it takes an order of magnitude longer to make a comparable viz.
I heartily recommend altair, which is maturing nicely - while the API is not a clone of ggplot, it takes the same grammar-of-graphics idea and applies it to interactive graphics.
It's 5-6 lines of code per chart for most examples here, with a few more lines for interaction (e.g. selection and linked brushing across graphs).
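For instance, linked brushing across two scatter plots is roughly this (a sketch using the vega_datasets example data; add_selection is the Altair 2/3-era spelling):

    import altair as alt
    from vega_datasets import data

    cars = data.cars()

    # An interval selection ("brush") shared by every view it's attached to.
    brush = alt.selection_interval()

    base = alt.Chart(cars).mark_point().encode(
        y='Miles_per_Gallon',
        color=alt.condition(brush, 'Origin', alt.value('lightgray')),
    ).add_selection(brush)

    # Two views of the same data; brushing in one highlights points in both.
    chart = base.encode(x='Horsepower') | base.encode(x='Acceleration')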
Huh, when I was looking into Python options I had thought plotnine was dead, but looking at the GitHub that's not the case: https://github.com/has2k1/plotnine
Plotnine is quite nice. Its API is very similar to R's ggplot2, so it gives you the same level of expressiveness and flexibility while letting you stay in the Python ecosystem and use numpy, pandas, etc.
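A quick sketch of how close it is (using plotnine's bundled mtcars data):

    from plotnine import ggplot, aes, geom_point, facet_wrap
    from plotnine.data import mtcars

    # Reads almost character-for-character like the equivalent ggplot2 code in R.
    (ggplot(mtcars, aes('wt', 'mpg', color='factor(gear)'))
     + geom_point()
     + facet_wrap('~cyl'))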
If you are using Python with Jupyter notebooks, another way to create graphs with ggplot2 is to interface directly with an R session using the rpy2 package. A nice tutorial can be found here: https://www.linkedin.com/pulse/interfacing-r-from-python-3-j...
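The basic pattern looks something like this, split across two notebook cells (a sketch; `df` is whatever pandas DataFrame you've built on the Python side, and the columns x and y are made up):

    # --- cell 1: load the R magic provided by rpy2 ---
    %load_ext rpy2.ipython

    # --- cell 2: an R cell; -i df copies the DataFrame into the R session ---
    %%R -i df
    library(ggplot2)
    ggplot(df, aes(x = x, y = y)) + geom_point()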
Feather, based on the Apache Arrow format, is a lot faster than CSVs, and there's good support in both R and Python for it, maintained by Hadley Wickham (who has developed Almost Every R Package You've Heard Of) and Wes McKinney (who is the BDFL of pandas) respectively. It also involves less inferring of data types using heuristics.
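The pandas side is just this (a minimal sketch; on the R side, the feather package's read_feather/write_feather are the counterparts):

    import pandas as pd

    df = pd.DataFrame({'x': [1, 2, 3], 'label': ['a', 'b', 'c']})

    # Dtypes round-trip exactly - no CSV-style type inference on read.
    df.to_feather('data.feather')
    same_df = pd.read_feather('data.feather')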
>You probably shouldn't be keeping data as pickles, for both compatibility and security reasons
It's science research. My own simulation code and data. Some of it is hdf5, but pickle files are pretty convenient. Workflow is mainly turning data into plots. Heh, I thought I was doing alright since I'm not using text files.
It seems like a common critique of Python is that there’s choice in which tool to use for a given task (serve web, fast numerical code, and here, visualization). This is ironic given a language that says there should be one obvious way to do it, but perhaps it reflects a diversity of applications, similarly to how enterprise Java seems overengineered until the day that that dependency injection mumbo jumbo allows you to cover a paying use case that would’ve otherwise taken ages.
If I want pretty web-ish scatter plots, Bokeh or Plotly; if I need desktop GUI visualizations, PyQtGraph, etc. Thankfully there's no one saying we need a single toolkit for everything (though Matplotlib does try to do this to some extent).
Ideology is meh. I always took the whole "there should be one obvious way" as a guideline for the language or libraries close to the core. Who cares if it isn't that way for everything? Data visualization is a library at the end of a pipeline, which makes the requirements for it to interface with anything else much less stringent than for the standard library, for example.
It's fine you have multiple tools. The only thing that matters at the end of the day is the product you provide, so it doesn't seem to matter to me.
A friend of mine created a page to show code snippets and comparisons between a few of the plotting libraries available in Python (and R) -- http://pythonplot.com/
The reason there's so many libraries is that none of them has emerged as a clear leader. If there was one really great library for the majority of use-cases, you'd see consolidation around it, like you mostly have with ggplot2 in R.
It hinders collaboration and the development of an ecosystem. One nice thing about ggplot2 is how there's a growing universe of things to support it -- packages to extend it or to add themes or what have you. Having more options means that effort gets split.
I can totally recommend Vega and Vega-Lite. It's just JSON and web-based, so it's pretty language-independent. There's no difference between just messing around with data-sciency stuff in a REPL and making a graph for your website; it's all the same.
I tried a number of libraries a few years ago. I eventually settled on Plotly. Between that and Dash (https://plot.ly/products/dash/) I can build some pretty powerful visualizations very easily.
While I appreciate this catalog of options, it's a lengthy article about different visualization libraries with no actual images of what they look like.
There are links to each library included, which will let you look at galleries of examples that are much more helpful than any single image would be. But if you want to collect images from each library, feel free to post that in the comments!
I'm a huge fan of Altair, which I recommend as the 'sensible default' tool in the data science team I work in. I recently wrote a blog post about why I think it's the best option:
As someone who used to use ROOT, I would strongly recommend against it. Its C++ implementation has all manner of hidden state, and this leaks into PyROOT.
For example, ROOT has the concept of ownership, entirely distinct from the python sense of references. A file owns the histograms that were read from it. Closing the file changes all references to histograms read from that file to None. It took a large amount of digging to figure out how that was even possible (overwriting an existing object from within the C API), but I cannot for the life of me understand why.
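A sketch of the surprise, from memory (data.root and h1 are hypothetical names):

    import ROOT

    f = ROOT.TFile("data.root")  # hypothetical file containing histogram "h1"
    h = f.Get("h1")              # h is *owned by the file*, not by Python

    f.Close()
    # The reference h is now invalidated along with the file - despite nothing
    # above suggesting its lifetime was tied to f. The usual workaround is to
    # detach the histogram before closing: h.SetDirectory(0)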
ROOT has way too many gotchas for me to ever recommend it.
Yeah, it has a lot of state that's not very pythonic (and I almost always use it from C++ or cling), but it is very good for semi-interactive plot-making that's not possible in something like matplotlib. A lot of ROOT's legacy from the '90s shows, but ROOT6 has made a lot of progress.
Does anyone outside of High Energy Physics use ROOT? I'm not trying to be argumentative, I'm genuinely curious, I'm a physicist by day so seeing ROOT talked about on here amongst popular libraries is like someone recommending my local pub as the best in town.
I used ROOT in astrophysics for a couple of things, but that's hardly outside its intended audience. yt is generally considered to be easier to pick up now instead, though.
ROOT and yt have quite different domains (although it's probably possible for each to emulate the other...) so I'm not sure it makes sense to compare them.
Python is an easy to use powerful development language. Ironically, this leads to too many projects solving the same problems. It isn’t just visualization that has so many libraries, it’s almost everything.
I first noticed this effect with Python web-frameworks years ago. There was always some new framework that worked better or differently for some use case. (There are 73 web-frameworks listed in the python.org wiki.)
A few more examples from pypi.org searches:
“Reed-Solomon”: 84 packages
“Elliptic-curve”: 535 packages
“Nearest neighbor”: 408 packages
“Simplex” (as in simplex method): 29 packages
“Django”: 10,000+ packages
I just noticed a HN post about a RAFT implementation in Haskell. Pypi.org says there are 22 hits for RAFT in the python package index.
The more I try to understand plotting documentation, the more I end up going to SO for even a minute detour from the conventional use case. It's the same situation with pandas. I wonder how I would ever work with these without an internet connection.
Perhaps at least some of the Python visualization libs are built upon Matplotlib. I do feel that Matplotlib is kind of unintuitive, trying in some ways to emulate Matlab, which isn't an elegant system to start with. For people coming directly to Python without any prior experience with Matlab, that is unnecessary baggage.
Also, 3D plotting and graphing need many more features. Trying to plot vectors in 3D, I found things are still very rudimentary. Even for surfaces: if you want to plot an ellipsoid, for example, you need to reformulate the surface in polar/parametric form before matplotlib can generate it, whereas Mathematica can generate surfaces directly from Cartesian expressions.
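For example, to get the ellipsoid x^2/a^2 + y^2/b^2 + z^2/c^2 = 1 out of matplotlib you have to parametrize it yourself first (a sketch):

    import numpy as np
    import matplotlib.pyplot as plt
    from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (registers the 3d projection)

    a, b, c = 3.0, 2.0, 1.0  # semi-axes

    # Reformulate the Cartesian surface in spherical/parametric coordinates.
    u = np.linspace(0, 2 * np.pi, 60)
    v = np.linspace(0, np.pi, 30)
    x = a * np.outer(np.cos(u), np.sin(v))
    y = b * np.outer(np.sin(u), np.sin(v))
    z = c * np.outer(np.ones_like(u), np.cos(v))

    fig = plt.figure()
    ax = fig.add_subplot(111, projection='3d')
    ax.plot_surface(x, y, z)
    plt.show()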
I wonder if it's too easy to make one of these libraries. Don't get me wrong, I've never tried to build one, and I'm sure it's hard. But maybe it's not so hard that people won't just make their own that fits their needs better.
If this is true, I would say it's a positive feature of Python, but also a reason for fragmentation. You tend to see less fragmentation in earlier and lower-level technologies. This is partially due to time: only good software or hardware lasts. But I also hypothesize it's because things are harder down there - you need bigger teams and more resources, which tends toward centralization.
I don't really understand the appeal of D3. It's basically just a thin wrapper over SVG - so thin, in fact, that to do anything interesting you're stuck manipulating SVG elements yourself.
There’s a good idea there, but D3 just isn’t quite right.
Remember the old joke, "Python is the only language that has more web frameworks than keywords."?
What eventually happened was that the BDFL blessed Django and most of the others withered. Some had enough of an ecosystem, or were components of some other larger project, or had a niche advantage over Django, and they managed to survive.
I think the reason why web frameworks and data viz systems proliferate in Python is just that they are so easy to write, yet still challenging enough to be really fun, and you get a lot of highly, uh, visible feedback and reward for doing it.
I'd argue exactly the opposite, the reason there are so many is that it's really hard to write a good data visualization framework. Data visualization encompasses a vast range of different tasks that are often only loosely related, and it's extremely difficult (if not impossible) to write one library that serves all uses well. Add on top of that the fact that you have different output types - web, interactive analysis, print etc and you end up with a situation where you get lots of different libraries optimized for different use cases.
And that's not even counting differences of opinion on aesthetics...
5+ years and 321 contributors later, I can state that Bokeh was not easy to write. Fairly sure MPL and Plotly devs would disagree with this assessment as well.
Python has a multitude of vis systems because there are multiple use-cases, sometimes overlapping and sometimes not, and different people prioritize these very differently (and that's OK).
I teach Python for Data Analysis and struggle in recommending good visualization solutions in Python.
Most of my students are used to Excel and are new to programming.
Expensive proprietary packages like Tableau (and even Power BI) provide a much better experience when it comes to presenting your results to management.
So we do the data gathering/wrangling/munging/analysis in Python but export the results for commercial visualization packages.
I can't in good conscience recommend any one Python library for visualization.
I mean for people who are just getting a hang of Python not Wes McKinney :)
I think SQL can be used, so I just use SQLite for the data visualization. I wrote an extension to do so, though there are many things it does not yet include. I think SQLite extensions for doing graphics would be a good idea.
The data is either stored in a SQLite database or can be imported (either permanently or temporarily), and then queried (including with a WHERE clause and/or various other stuff, if applicable), and extensions for graphics can be used for making the graphics (sqlext_graph), displaying on screen (sqlext_xwindows), or saving to disk (sqlext_pipe and Farbfeld Utilities). Mainly, I fail to see why I should need a different program if SQL seems good enough. (Of course, if other programs are more helpful to you, you can use those other programs, either instead of or in addition to SQL. For example, I think sometimes SQLite is combined with R for statistics and graphics.)
Google searches for both sqlext_graph and sqlext_xwindows return your comment as the only result, but I'd love to learn more about these. Can you provide an example of how to produce a scatter plot of, for example, the iris dataset, with points coloured by species?
Also, by the time you're leveraging SQL extensions, are you really still "using SQL"?