
Python Data Visualization 2018: Why So Many Libraries? - ehudla
https://www.anaconda.com/blog/developer-blog/python-data-visualization-2018-why-so-many-libraries/
======
chriswarbo
I would strongly recommend against anyone using Seaborn.

I used it for charts in a paper recently, since it includes swarm plots. I hit
a problem when overlaying certain types of plot on the same axes (I think it
was swarm plots on top of box plots, so I could show every data point as well
as the quartiles). The problem was that the data would end up shifted, so the
x axis labels weren't correct, even though each plot on its own would work
fine when the other was commented-out. The only reason I noticed this was
because I spotted that a peak I knew occurred at x=13 was showing up at x=14.

Although it took a specific set of circumstances to trigger, I thought this
was still a serious problem since it causes data to be misrepresented, and it
doesn't cause any warning, etc. that something is wrong. I made a minimal
example script, opened a GitHub issue (
[https://github.com/mwaskom/seaborn/issues/1409](https://github.com/mwaskom/seaborn/issues/1409)
), where the library author insulted me and locked the issue. I sent them a
followup email, to explain that I was not after "free tech support" (I'd
already worked around the issue for my paper by 'faking' the labels), I wanted
to help improve the library so that others would avoid having incorrect plots
(especially those not lucky enough to spot it like I was) and how I'd already
spent considerable time narrowing down the problem to the minimal example, as
evidence that I wasn't trying to be a freeloader. I also began my email with
an apology for using this side channel, and that I wouldn't contact them again
unless they consented to it. The author replied with more insults.

I didn't contact them again, as promised, but now I'm actively opposed to
anyone using this project, due to the author's complete disregard for
corruption of scientific data.

~~~
b_tterc_p
Going to agree with the Seaborn guy here. While you qualify your stance with
some diplomatic terms, it does come off as very aggressive. Adding polite
statements to a rude letter doesn’t make it polite, and an open source
maintainer doesn’t have to do anything for you. You call the software
objectively wrong and are telling the author that his product is wrong when
you really should be asking how to use it on Stack Overflow.

Seaborn is a really nice way to take off matplotlib overhead and make your
non-seaborn plots look better too. I highly recommend it.

~~~
notafraudster
I'm not sure I get this characterization of the exchange; the parent comment
is a little curt in his third reply, but the original issue is very well-
formulated and the Seaborn guy's replies are incomprehensible. After
elaborating, the parent gets a little pedantic in trying to get the maintainer
to respond with something other than a one-sentence dismissal, and although at
that point the comment is a little hostile, it clearly isn't "an obnoxious
rant" or "asking for free tech support". Relative to the number of stars the
repository has, this exchange was probably one of the most user-hostile I've
ever seen for a legitimate question on a bugtracker.

------
mikepurvis
I find Pandas+Bokeh to be a pretty killer combination. You get nice vectorized
operations for manipulating the data upfront, and a pretty wide range of
visualizations that are plenty customizable. Main beefs with bokeh are:

\- It's great that I can manipulate a view by zooming, panning, turning a
series off, whatever, but all that state is lost when I send the URL of the
plot to someone else (e.g., for the case where it was generated by a CI job or
something rather than locally on my machine). It should be easier in Bokeh to
grab all that state and stuff it into a querystring that can then rehydrate
the same view later on.

\- Doing plots with more than one y-axis is a lot more awkward and fiddly than
it should be, and even once you succeed, the axes can't be panned
independently.

\- If your data series name overlaps with the label in your legend, the whole
plot breaks in an extremely non-obvious way (basically, Bokeh thinks that you
want a legend entry _per row in the named data series_ ). Even being fully
prepared for it, I lose time to this every once in a while, and it's almost a
rite of passage that every junior dev hits it and burns an hour or two trying
to figure out what is wrong.

------
frankc
There's like 15 comments in this post and 10 different suggestions on what
visualization library to use. That's not great.

~~~
ms013
Yup. A likely result is that if you pick one and spend the time to learn it
and use it for a project, there's a non-trivial chance that the choice you
make will join the ever-growing collection of library abandonware in the
not too distant future.

This is why my favorite Python visualization tools are not Python - I've been
burned too many times by libraries coming and going, and I just don't have the
time to spend farting around trying to track the latest library fads.

~~~
makmanalp
Argh, it's so frustrating to see this kind of sentiment. We're truly spoiled
by choice. The libraries don't all serve the same purpose, and not everyone
needs every library. This is literally a guide to help pick which one I may
want. What more could one ask for??

That said, I understand it sucks to build on top of someone else's code only
to have it be discontinued, but honestly it only takes a few minutes to judge
a project's maturity. Here are some very easy rules of thumb. Don't write a
lot of important code relying on a library that:

\- Is younger than 3 years old.

\- Has less than 3-4 major contributors and 20 overall contributors.

\- Has lost steam: a lot of issues and pull requests open and stale with no
triage tags, no discussion or responses.

\- Doesn't seem to have any automated testing / packaging infrastructure set
up.

\- Doesn't seem to have a regular ongoing cycle of releases, be it long or
short.

\- Doesn't have nicely laid out documentation.

\- Doesn't seem to have a user base, as indicated by a preponderance of
questions and answers on stackoverflow, etc.

\- Just posted their "we made a cool new thing" post on HN a few weeks ago.
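
A minimal sketch of how the rules of thumb above could be turned into a checklist. Everything here is illustrative: the `ProjectSignals` fields, the 0.5 cutoff for "stale" and the 8-week reading of "a few weeks" are assumptions, not numbers from the comment, and the contributor rule is read as "fewer than 3 major or fewer than 20 overall".

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProjectSignals:
    """Hypothetical snapshot of a library's public health signals."""
    age_years: float
    major_contributors: int
    total_contributors: int
    stale_open_ratio: float          # share of open issues/PRs with no response
    has_ci_and_packaging: bool
    has_regular_releases: bool
    has_good_docs: bool
    has_visible_user_base: bool
    weeks_since_announcement: float


def maturity_warnings(p: ProjectSignals) -> List[str]:
    """Return one warning per rule of thumb that the project trips."""
    warnings = []
    if p.age_years < 3:
        warnings.append("younger than 3 years")
    if p.major_contributors < 3 or p.total_contributors < 20:
        warnings.append("too few contributors")
    if p.stale_open_ratio > 0.5:             # assumed cutoff for "lost steam"
        warnings.append("many stale issues and pull requests")
    if not p.has_ci_and_packaging:
        warnings.append("no automated testing or packaging")
    if not p.has_regular_releases:
        warnings.append("no regular release cycle")
    if not p.has_good_docs:
        warnings.append("documentation is not well laid out")
    if not p.has_visible_user_base:
        warnings.append("no visible user base")
    if p.weeks_since_announcement < 8:       # assumed reading of "a few weeks"
        warnings.append("only just announced")
    return warnings


# Two made-up projects: one long-lived, one announced two weeks ago.
established = ProjectSignals(10, 8, 300, 0.1, True, True, True, True, 500)
brand_new = ProjectSignals(0.2, 1, 2, 0.0, False, False, False, False, 2)

for name, project in [("established", established), ("brand_new", brand_new)]:
    print(name, maturity_warnings(project))
```

As the comment says, the cutoffs are arbitrary and flexible, so the thresholds are best treated as knobs rather than rules.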

Yes, the cutoffs are arbitrary (and flexible!), but this hasn't failed me yet.
The python world is filled with many wonderful and mature libraries. It just
also has a lot of up and coming, promising young ones. Use whichever!

~~~
derefr
> What more could one ask for??

A refactoring and merging of these libraries to enable fewer people to
maintain more functionality, such that the bus-factor of any given part of the
functionality is higher.

~~~
bigreddot
That's entirely backwards. To improve the bus factor, you need more people
maintaining less, not fewer people maintaining more.

In any case, the incentives are simply not there, as different projects have
very different priorities (which is why there are so many projects). Some
folks want to monetize their special sauce (Plotly). Some folks want to focus
on high level statistical charting (Altair, Chartify). Some folks want to focus
on interactive data exploration (Holoviews). Some folks want to focus on high
performance and streaming (Bokeh). Some folks want to focus on high quality
static image generation (MPL). The human and economic cost of getting all
those groups under one tent is astronomical.

~~~
derefr
> you need more people maintaining less, not fewer people maintaining more

I see how you got confused, but I was making two separate assertions.

Assertion 1: If you merge the library _projects_ together, and strip out the
redundancies between them, then you'll have the same number of _contributors_
, now distributed over fewer total lines of code. So each contributor can
learn more of the codebase. (Bus factor goes up.)

Assertion 2: If you refactor the resulting libraries to reduce the total
complexity (i.e. reduce the API surface from that of the union of all the
merged-in libs), then you can begin to strip out technical debt from the
project from the outside in. By eliminating now-dead code, you remove places
where bugs can arise, and you lower the number of dependencies (which could
otherwise have been sources of API-breaking changes when they update.) Thus,
the _number of people needed to maintain the project_ goes down. So the same
total _functionality_ can then be maintained by fewer contributors. (You don't
actually _remove_ contributors; they just become reserve capacity, with each
contributor able to be less-overworked for the same result.)

~~~
bigreddot
> If you merge the library projects together

Alas, it's not so simple. How much actual overlap is there to merge in Bokeh
and Matplotlib, for instance? MPL renders images in Python and has no JS
component. Bokeh does all of its rendering in JavaScript! The Python API is
mostly just a thin wrapper around BokehJS. MPL has no server component at all.
Neither of them does what Datashader does for large data sets. Neither of them
has a high level statistical charting API; that's only in Seaborn or Chartify.
Merging all these things together would cost a fortune in time and money, and
at the end of the day not actually reduce the total code size to any
appreciable degree.

------
msaharia
Python visualization needs a visionary like Hadley Wickham. Once you get used
to the clarity of "Grammar of Graphics", every other plotting library feels
like a clumsy tool.

Also, people pointing out the diversity of needs behind the multiverse of
plotting libraries in Python probably aren't aware of the large number of
extensions built on top of ggplot2[1]. In Python, they end up being completely
new packages.

1\.
[http://www.ggplot2-exts.org/gallery/](http://www.ggplot2-exts.org/gallery/)

~~~
jbednar
Personally, I don't _want_ a grammar of graphics; I want a grammar of data,
where the data happens to have a graphical representation. I don't want to
spend ages piecing together a fancy plot; I want to spend just a little time
annotating my data to declare what it means, and then no matter how I slice
and dice my data it will show up in a meaningful way. That way I can explore
it to really understand it, which is the point of HoloViews
([http://holoviews.org](http://holoviews.org)). But people approach plotting
in lots of different ways, and some people actually _do_ want to spend their
time making plots, so they are welcome to their ggplot2!

~~~
gaius
_I don't _want_ a grammar of graphics; I want a grammar of data, where the
data happens to have a graphical representation_

The point of having a graphical representation is to communicate with other
humans, so some subjective judgement is required; there’s no “one true
plot” for each type of data.

~~~
jbednar
Of course! HoloViews does allow infinite customizability, to pull out more and
more subtle features, show things more clearly, and just to make it look nice
or to match your favorite style. But unlike ggplot2, HoloViews does that in a
way that can apply to the _data_, rather than having to recapitulate the
process every single time you build an individual plot. That way you and your
colleagues can together build up whatever style you find most effective, then
keep working with it across the full multidimensional landscape of data that
you work with in a particular field. HoloViews is a completely different
approach, if you really let the ideas sink in (e.g. from our paper about it at
[http://conference.scipy.org/proceedings/scipy2015/pdfs/jean-...](http://conference.scipy.org/proceedings/scipy2015/pdfs/jean-luc_stevens.pdf)),
and is in no way a second-class citizen compared to ggplot2
or any other approach in R...

~~~
gaius
I’ll check it out, thanks for the tip!

------
TheAlchemist
Altair hands down. I've tried and worked with a lot of those in data analysis,
Altair just seems like 'magic'.

What should be easy is easy, what is hard is still possible - for me it's the
perfect mix. The only (but big) drawback is the performance implication
of the JSON generated for a chart - over 50k lines of raw data for a chart, and
you start to feel the lag.
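
A rough sketch of why the generated JSON grows with the data, assuming the rows are embedded inline in the chart specification (the default in Altair, as far as I know). The spec below is a hand-built dictionary with made-up values, not output from Altair itself, and `inline_spec` and `spec_size_bytes` are hypothetical helper names.

```python
import json


def inline_spec(n_rows: int) -> dict:
    """Build a Vega-Lite-style spec with n_rows of made-up data embedded."""
    rows = [{"x": i, "y": (i * 37) % 101} for i in range(n_rows)]
    return {
        "data": {"values": rows},          # every row ends up in the JSON
        "mark": "point",
        "encoding": {
            "x": {"field": "x", "type": "quantitative"},
            "y": {"field": "y", "type": "quantitative"},
        },
    }


def spec_size_bytes(n_rows: int) -> int:
    """Size of the serialized spec, which the browser has to parse."""
    return len(json.dumps(inline_spec(n_rows)).encode("utf-8"))


for n in (1_000, 10_000, 50_000):
    print(f"{n:>6} rows -> {spec_size_bytes(n) / 1024:.0f} KiB of JSON")
```

The size scales roughly linearly with the row count, which is consistent with the lag described above. Aggregating before plotting, or pointing the spec at a data URL instead of inline values, keeps the spec itself small.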

~~~
domoritz
Vega-Lite author here. Performance is hugely important for us and is covered by
the first three points on our roadmap:
[https://docs.google.com/document/d/1fscSxSJtfkd1m027r1ONCc7O...](https://docs.google.com/document/d/1fscSxSJtfkd1m027r1ONCc7O8RdZp1oGABwca2pgV_E/edit?ts=5bcc2d98#).
Some improvements will ship with Vega-Lite 3. Others are being worked on as we
speak.

~~~
stuartaxelowen
I love vega-lite, thank you so much for your hard work!

------
minimaxir
I'm pretty good at Python and do most of my work in it nowadays, but if I need
to make a data visualization, I still go to R just for ggplot2. Nothing
currently in Python compares (not even the "ggplot2 port"), and it takes an
order of magnitude longer to make a comparable viz.

~~~
ObscureMind
Check plotnine, which is a decent reimplementation of ggplot2 in python:
[https://plotnine.readthedocs.io/en/latest/](https://plotnine.readthedocs.io/en/latest/)

~~~
ExcaliburZero
Plotnine is quite nice. Its API is very similar to R's ggplot so it gives you
the same level of expressiveness and flexibility while being able to stay in the
Python ecosystem to use numpy, pandas, etc.

~~~
ObscureMind
The authors also provide a reimplementation of dplyr in Python that integrates
well with plotnine:
[https://github.com/has2k1/plydata](https://github.com/has2k1/plydata)

------
marmaduke
It seems like a common critique of Python is that there’s choice in which tool
to use for a given task (serve web, fast numerical code, and here,
visualization). This is ironic given a language that says there should be one
obvious way to do it, but perhaps it reflects a diversity of applications,
similarly to how enterprise Java seems overengineered until the day that that
dependency injection mumbo jumbo allows you to cover a paying use case that
would’ve otherwise taken ages.

If I want pretty web-ish scatter plots, Bokeh or Plotly, and if I need desktop
GUI visualizations, PyQtGraph, etc. Thankfully there’s no one saying we need a
single toolkit for everything (though Matplotlib does try to do this to some
extent).

~~~
noobermin
Ideology is meh. I always took the whole "there should be one obvious way" as
a guideline for the language or libraries close to the core. Who cares if it
isn't that way for everything? Data visualization is a library at the end of a
pipeline, which makes the requirements for it to interface with anything else
much less stringent than they are for the standard library, for example.

It's fine you have multiple tools. The only thing that matters at the end of
the day is the product you provide, so it doesn't seem to matter to me.

------
danthelion
Missed one! Spotify just released their own dataviz library yesterday
[https://labs.spotify.com/2018/11/15/introducing-chartify-eas...](https://labs.spotify.com/2018/11/15/introducing-chartify-easier-chart-creation-in-python-for-data-scientists/)

~~~
anotheryou
(which is a wrapper for bokeh, sounds good to me)

------
rsivapr
A friend of mine created a page to show code snippets and comparisons between
a few of the plotting libraries available in Python (and R) --
[http://pythonplot.com/](http://pythonplot.com/)

~~~
spott
I've run across that page before. It really needs altair, bokeh and holoviews
support.

------
cwyers
The reason there's so many libraries is that none of them has emerged as a
clear leader. If there was one really great library for the majority of use-
cases, you'd see consolidation around it, like you mostly have with ggplot2 in
R.

~~~
noobermin
A variety of choice is fine. It certainly doesn't hinder anyone.

~~~
cwyers
It hinders collaboration and the development of an ecosystem. One nice thing
about ggplot2 is how there's a growing universe of things to support it --
packages to extend it or to add themes or what have you. Having more options
means that effort gets split.

------
thecryusofiran
I can totally recommend Vega and Vega-Lite. It's just JSON and web-based, so
it's pretty language-independent. There's no difference between just messing
around with data sciency stuff in a REPL or making a graph for your website;
it's all the same.
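
To make the "it's just JSON" point concrete, here is a small sketch that builds a Vega-Lite bar chart specification as a plain Python dictionary and serializes it with the standard library. The data values are made up, and the schema version in the URL is an assumption; any language that can emit the same JSON would do equally well.

```python
import json

# A complete Vega-Lite spec is just nested dicts and lists.
spec = {
    "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
    "data": {
        "values": [                      # made-up example data
            {"category": "A", "amount": 28},
            {"category": "B", "amount": 55},
            {"category": "C", "amount": 43},
        ]
    },
    "mark": "bar",
    "encoding": {
        "x": {"field": "category", "type": "nominal"},
        "y": {"field": "amount", "type": "quantitative"},
    },
}

# The serialized text is what a Vega-Lite renderer in the browser consumes.
spec_json = json.dumps(spec, indent=2)
print(spec_json)
```

The same text can be pasted into the online Vega editor or embedded in a web page, which is why the REPL and website cases end up looking identical.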

------
broken_symlink
I tried a number of libraries a few years ago. I eventually settled on plotly.
Between that and dash
([https://plot.ly/products/dash/](https://plot.ly/products/dash/)) I can build
some pretty powerful visualizations very easily.

------
bleair
While I appreciate this catalog of options, this is a lengthy article about
different visualization libraries and there are no actual images of what they
look like.

~~~
jbednar
There are links to each library included, which will let you look at galleries
of examples that are much more helpful than any single image would be. But if
you want to collect images from each library, feel free to post that in the
comments!

------
xioxox
Also missed my Veusz GUI plotting package and library.
[https://veusz.github.io/](https://veusz.github.io/)

------
RobinL
I'm a huge fan of Altair, which I recommend as the 'sensible default' tool in
the data science team I work in. I recently wrote a blog post about why I
think it's the best option:

[https://medium.com/@robin.linacre/why-im-backing-vega-lite-a...](https://medium.com/@robin.linacre/why-im-backing-vega-lite-as-our-default-tool-for-data-visualisation-51c20970df39)

------
cozzyd
As a ROOT guru and occasional pythonist, I recommend PyROOT especially if you
want to interactively explore what you're plotting.

~~~
MereInterest
As someone who used to use ROOT, I would strongly recommend against it. Its
C++ implementation has all manner of hidden state, and this leaks into PyROOT.

For example, ROOT has the concept of ownership, entirely distinct from the
python sense of references. A file owns the histograms that were read from it.
Closing the file changes all references to histograms read from that file to
None. It took a large amount of digging to figure out how that was even
possible (overwriting an existing object from within the C API), but I cannot
for the life of me understand why.

ROOT has way too many gotchas for me to ever recommend it.

~~~
cozzyd
Yeah, it has a lot of state that's not very Pythonic (and I almost always use
it from C++ or cling), but it is very good for semi-interactive plotmaking
that's not possible in something like matplotlib. A lot of ROOT's legacy from
the 90's shows, but ROOT6 has made a lot of progress.

------
todd8
Python is an easy-to-use, powerful development language. Ironically, this leads
to too many projects solving the same problems. It isn’t just visualization
that has so many libraries, it’s almost everything.

I first noticed this effect with Python web-frameworks years ago. There was
always some new framework that worked better or differently for some use case.
(There are 73 web-frameworks listed in the python.org wiki.)

A few more examples from pypi.org searches:

“Reed-Solomon” 84 packages

“Elliptic-curve” 535 packages

“Nearest neighbor” 408 packages

“Simplex” (as in simplex method) 29

“Django” 10,000+

I just noticed a HN post about a RAFT implementation in Haskell. Pypi.org says
there are 22 hits for RAFT in the python package index.

------
abhishekjha
The more I try to understand plotting documentation, the more I end up going
to SO for even a minute detour from the conventional use case. This is the same
situation with pandas. I wonder how I would ever work with these without an
internet connection.

------
billfruit
Perhaps at least some of the Python visualization libs are built upon
Matplotlib. I do feel that Matplotlib is kind of unintuitive, trying in some
ways to emulate Matlab, which isn't an elegant system to start with. For
people using Python directly without any prior experience with Matlab, that is
unnecessary baggage.

Also, 3D plotting and graphing needs many more features. Trying to plot vectors
in 3D, I found things are still very rudimentary. Even for surfaces, if you
want to plot an ellipsoid, for example, you need to reformulate the surface in
polar form; only then is Matplotlib able to generate it, whereas Mathematica
can generate surfaces from Cartesian expressions.
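
A small sketch of the reformulation described above, using only the standard library. The Cartesian form of an ellipsoid is (x/a)^2 + (y/b)^2 + (z/c)^2 = 1, while surface plotters that expect explicit coordinate grids (Matplotlib's `plot_surface` takes X, Y, Z arrays) need a parametric form in two angles. The function names and semi-axis values are illustrative assumptions.

```python
import math


def ellipsoid_points(a, b, c, n_theta=12, n_phi=24):
    """Sample an ellipsoid with semi-axes a, b, c from its angular form."""
    points = []
    for i in range(n_theta + 1):
        theta = math.pi * i / n_theta            # polar angle, 0..pi
        for j in range(n_phi):
            phi = 2.0 * math.pi * j / n_phi      # azimuth, 0..2*pi
            x = a * math.sin(theta) * math.cos(phi)
            y = b * math.sin(theta) * math.sin(phi)
            z = c * math.cos(theta)
            points.append((x, y, z))
    return points


def implicit_residual(point, a, b, c):
    """How far a point is from satisfying the Cartesian equation."""
    x, y, z = point
    return (x / a) ** 2 + (y / b) ** 2 + (z / c) ** 2 - 1.0


pts = ellipsoid_points(3.0, 2.0, 1.0)
worst = max(abs(implicit_residual(p, 3.0, 2.0, 1.0)) for p in pts)
print(f"{len(pts)} points, largest residual {worst:.2e}")
```

Every sampled point satisfies the Cartesian equation up to floating-point error, so the two forms describe the same surface. The extra work is in deriving the parametrization by hand, which is the step the comment wishes the plotting library would do.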

------
sgillen
I wonder if it's too easy to make one of these libraries. Don't get me wrong,
I've never tried to build one of these, and I'm sure it's hard. But maybe it's not
so hard that people won't just make their own that fit their needs better.

If this is true I would say it's a positive feature of python, but also a
reason for fragmentation. You tend to see less fragmentation in earlier and
lower level technologies. This is partially due to time, only good software or
hardware lasts. But I also hypothesize it was because things are harder down
there, you need bigger teams and more resources, which tends towards
centralization.

------
jonathankoren
I don’t really understand the appeal of D3. It’s basically just a thin wrapper
over SVG, so thin in fact, to do anything interesting, you’re stuck
manipulating SVG elements yourself.

There’s a good idea there, but D3 just isn’t quite right.

------
carapace
Remember the old joke, "Python is the only language that has more web
frameworks than keywords."?

What eventually happened was that the BDFL blessed Django and most of the
others withered. Some had enough ecosystem, or were components of some other
larger project, or had a niche advantage over Django, and they managed to
survive.

I think the reason why web frameworks and data viz systems proliferate in
Python is just that they are so easy to write, yet still challenging enough to
be really fun, and you get a lot of highly, uh, visible feedback and reward
for doing it.

~~~
bigreddot
5+ years and 321 contributors later, I can state that Bokeh was not easy to
write. Fairly sure MPL and Plotly devs would disagree with this assessment as
well.

Python has a multitude of vis systems because there are multiple use-cases,
sometimes overlapping and sometimes not, and different people prioritize these
very differently (and that's OK).

~~~
carapace
Sorry! I didn't mean to disparage your efforts, or anyone's.

I should have said maybe, " _relatively_ easy in _Python_ ".

------
sireat
I teach Python for Data Analysis and struggle in recommending good
visualization solutions in Python.

Most of my students are used to Excel and are new to programming.

Expensive proprietary packages like Tableau (and even Power BI) provide a much
better experience when it comes to presenting your results to management.

So we do the data gathering/wrangling/munching/analysis in Python but export
the results for commercial visualization packages.

I can't in good conscience recommend any one Python library for visualization.

I mean for people who are just getting the hang of Python, not Wes McKinney :)

------
zzo38computer
I think SQL can be used, so I just use SQLite for the data visualization. I
wrote an extension to do so, but there are many things it does not yet include.
I think SQLite extensions for doing graphics would be a good idea.

~~~
EForEndeavour
Can you elaborate on how exactly you're "using SQLite for data visualization?"

~~~
zzo38computer
The data is either stored in a SQLite database or can be imported (either
permanently or temporarily), and then queried (including with a WHERE clause
and/or various other stuff, if applicable), and extensions for graphics can be
used to make the graphics (sqlext_graph), display them on screen
(sqlext_xwindows), or save them to disk (sqlext_pipe and Farbfeld Utilities).
Mainly, I fail to see why I should need a different program if SQL seems good
enough. (Of course, if other programs are more helpful to you, you can use
those other programs, either instead of or in addition to SQL. For example, I
think sometimes SQLite is combined with R for statistics and graphics.)
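
A minimal sketch of the general workflow described above (store, query with a WHERE clause, then draw), using Python's built-in `sqlite3` module and a text bar chart. It does not use or imitate the `sqlext_*` extensions mentioned in the comment, whose interfaces are not shown here; the table, its values and the `text_bars` helper are made up for illustration.

```python
import sqlite3

# In-memory database with a small made-up table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (grp TEXT, value REAL)")
conn.executemany(
    "INSERT INTO samples VALUES (?, ?)",
    [("a", 1.0), ("a", 3.0), ("b", 4.0), ("b", 6.0), ("c", 10.0), ("c", None)],
)

# SQL does the filtering and aggregation.
rows = conn.execute(
    "SELECT grp, AVG(value) FROM samples "
    "WHERE value IS NOT NULL GROUP BY grp ORDER BY grp"
).fetchall()


def text_bars(rows, width=20):
    """Render (label, value) rows as bars scaled to the largest value."""
    top = max(value for _, value in rows)
    return [
        f"{label:<4}{'#' * round(width * value / top)} {value:.1f}"
        for label, value in rows
    ]


for line in text_bars(rows):
    print(line)
```

The drawing step is the only part that lives outside SQL in this sketch; an extension of the kind described would move that step into the database as well.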

~~~
EForEndeavour
Google searches for both sqlext_graph and sqlext_xwindows return your comment
as the only result, but I'd love to learn more about these. Can you provide an
example of how to produce a scatter plot of, for example, the iris dataset,
with points coloured by species?

Also, by the time you're leveraging SQL extensions, are you really still
"using SQL"?

------
mschuetz
I can absolutely recommend Jupyter +
[https://github.com/maartenbreddels/ipyvolume](https://github.com/maartenbreddels/ipyvolume)

------
taeric
Should have visualized the different libraries situation, using the different
libraries. :)

------
truculent
give me ggplot2 or give me death

~~~
j88439h84
[https://plotnine.readthedocs.io/en/stable/api.html](https://plotnine.readthedocs.io/en/stable/api.html)

~~~
truculent
Oh wow, thanks! I'd seen the yhat port before but not this one.

~~~
iron0013
It's way, WAY better than the yhat port

~~~
truculent
If I were being picky I would say that directly porting everything down to the
`+ geom_*` syntax isn't that idiomatic for Python.

But if I said that, I'd have to be a pedant with no joy in their life, so...
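
For readers wondering how a `+ geom_*` style can work in Python at all, here is a toy sketch using operator overloading via `__add__`. This is not plotnine's implementation; `Plot` and `Layer` are hypothetical classes that only record layer names.

```python
class Layer:
    """Stand-in for a geom: just carries a name."""

    def __init__(self, name):
        self.name = name


class Plot:
    """Immutable container that grows by one layer per `+`."""

    def __init__(self, layers=()):
        self.layers = tuple(layers)

    def __add__(self, other):
        if not isinstance(other, Layer):
            return NotImplemented        # let Python raise TypeError
        return Plot(self.layers + (other,))


plot = Plot() + Layer("geom_point") + Layer("geom_smooth")
print([layer.name for layer in plot.layers])
```

A more conventionally Pythonic API might chain methods or take a list of layers, which is roughly the trade-off the comment is poking at.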

------
th0ma5
I wonder if Plotly will ever take privacy and security seriously... I've had
this ticket open for years.
[https://github.com/plotly/plotly.js/issues/316](https://github.com/plotly/plotly.js/issues/316)

~~~
minimaxir
The transparent button-removal config option is a fair workaround. (in the
Plotly R docs, removing the button is the first code example:
[https://www.rdocumentation.org/packages/plotly/versions/4.8....](https://www.rdocumentation.org/packages/plotly/versions/4.8.0/topics/config)
)

~~~
th0ma5
1) shouldn't be there at all IMHO 2) should be off by default if anyone still
thinks it is even remotely a good idea

------
nerdponx
There's also Vaex for big(ger) data visualization.

~~~
eesmith
It's in the image, in green, slightly above the center, and mentioned twice in
the text.

~~~
nerdponx
Oops, I was skimming too fast. Thanks.

