Comprehensive Guide on Data Visualization with Pandas (kanoki.org)
211 points by min2bro 9 months ago | 63 comments

I'm really not sure how "comprehensive" I would call this. After glancing over it, it looks like one of the million basic pandas plotting guides that already exist.

What should one do if after following those million guides it still doesn’t stick? I always end up googling what I want and hit SO. Is there something wrong with me or does numpy and pandas seem more difficult than they should be?

The pandas API is a mess and that's why it feels that way. It is a great tool, and very powerful, but boy does it make the user's life difficult with a convoluted API that makes it very hard to discover functionality on your own. As others have said, unless you use it all the time, you're essentially in for a bad time that involves lots of Googling and SO browsing even to get basic things done in pandas. I say this as someone who has developed a few smaller projects that use pandas extensively, so I'm not just criticizing it without having used it.

As a daily pandas user for a couple years, I agree that the API is rough in places, though the official docs are extremely helpful. Also, the Stack Overflow community has come up with some helpful "missing manual" writeups, such as "Pandas Merging 101" https://stackoverflow.com/questions/53645882/pandas-merging-...

API pain points off the top of my head:

- There are both `pivot` and `pivot_table` methods which behave slightly differently. I'd have to look up why `pivot` exists; I've learned simply to reach for `pivot_table` every time, since it aggregates more flexibly.

- The dreaded SettingWithCopyWarning is a huge pain for new pandas users, and simply should not exist in any mature data analysis software.

- Unlike in SQL, in pandas you can take a data table (a DataFrame) and group it by one or more columns without immediately specifying an aggregation function. The result of this `groupby` operation is a "groupby object", which contains all the information of the input DataFrame, but behaves completely differently from a DataFrame. By contrast, in R's tidyverse, my understanding is that when you group a tibble (the tidyverse's data table), the result is still a tibble, just grouped.

- It took a while to mentally sort out the similarities and differences of merge, join, append, and concatenate. In practice, I almost always reach for `merge` or `concat`.
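A few of the pain points above can be seen in a small sketch. The `sales` and `prices` tables are made up for illustration, and this shows only one way to hit each behavior:

```python
import pandas as pd

# Hypothetical sales data with a duplicated (region, product) pair.
sales = pd.DataFrame({
    "region": ["east", "east", "west", "east"],
    "product": ["a", "b", "a", "a"],
    "units": [1, 2, 3, 4],
})

# pivot only reshapes, so duplicate index/column pairs raise ValueError.
try:
    sales.pivot(index="region", columns="product", values="units")
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table aggregates the duplicates instead (here by summing).
table = sales.pivot_table(index="region", columns="product",
                          values="units", aggfunc="sum")

# groupby without an aggregation returns a DataFrameGroupBy, not a DataFrame.
grouped = sales.groupby("region")
totals = grouped["units"].sum()  # aggregating yields a Series indexed by region

# merge joins on key columns; concat stacks frames along an axis.
prices = pd.DataFrame({"product": ["a", "b"], "price": [10.0, 20.0]})
merged = sales.merge(prices, on="product", how="left")
stacked = pd.concat([sales, sales], ignore_index=True)
```

So `pivot` is the strict reshape and `pivot_table` is the forgiving one, which is probably why reaching for `pivot_table` by default works out in practice.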

I really wouldn't worry about it - I've been using Python for data work for the last 5 years or so and I have to look stuff up all the time. Eventually the basic stuff sticks, but it's like any kind of coding: I don't think anyone ever hits the point where they hardly ever have to Google stuff.

One useful tip I can suggest is to create a repo for useful code snippets, so if you ever find yourself doing something new that you think you might need again, just spend a bit of time commenting and describing it and add it to the repo. That way instead of having to spend time searching you'll hopefully remember doing it before and be able to find it easily.

After googling the same things every other week, I started using a snippets application (in my case it is SnippetsLab) where I wrote a nice description and keywords for every snippet. Life is so much easier that way.

Back in the old days, experts collected reference manuals. Googling things is the same as flipping through the pages of a manual.

It is possible to do the same things in Pandas in many different ways. It is good to have this flexibility, but it is confusing for new users. On top of that, the bracket operator ([]) is overloaded in many ways.
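To make the bracket overloading concrete, here is a minimal sketch on a toy DataFrame (the column names are made up). The same `[]` does four different things depending on what goes inside it:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

col = df["a"]            # a string selects one column -> Series
sub = df[["a", "b"]]     # a list of strings selects columns -> DataFrame
rows = df[0:2]           # a slice selects rows by position -> DataFrame
mask = df[df["a"] > 1]   # a boolean Series filters rows -> DataFrame

# .loc (labels) and .iloc (positions) spell out the intent explicitly.
same_rows = df.iloc[0:2]
same_mask = df.loc[df["a"] > 1]
```

Sticking to `.loc` and `.iloc` for row selection is one common way to keep the meaning of a line obvious.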

Things started to make sense after I read a very good book on Pandas[1]. Reading a book is better than reading blog posts, because it is consistent. In contrast, reading small tutorials for every little thing is confusing, because every blog post is using a different way to do the same thing.

[1] https://github.com/jakevdp/PythonDataScienceHandbook

I suggest looking into matplotlib structures: figures and axes. I think this article [0] is pretty good at detailing how to work with them. They definitely can be confusing but I think most can grasp what to use after reading the article.

[0] http://jonathansoma.com/lede/algorithms-2017/classes/fuzzine...
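As a rough illustration of the figure/axes idea (not taken from the linked article), this sketch creates one figure with two axes and lets pandas draw onto each. The data is made up, and the `Agg` backend is only there so it runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": [2, 4, 8]})

# A Figure is the whole canvas; each Axes is one plot area inside it.
fig, (left, right) = plt.subplots(1, 2, figsize=(8, 3))

# pandas draws onto an Axes you pass in and returns that same Axes.
returned = df.plot(x="x", y="y", ax=left)
df.plot.bar(x="x", y="y", ax=right)

# Customization then goes through the Axes and Figure objects.
left.set_title("line")
right.set_title("bar")
fig.tight_layout()
```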

>Is there something wrong with me or does numpy and pandas seem more difficult than they should be?

If you're only an occasional user, this will be your life forever. My experience with pandas is that if you use it heavily for 3 months, then things start to "stick" and you need to look it up less often.

Unfortunately I changed jobs and have forgotten most of pandas, so I'm back to looking things up again.

pandas has a terrible API (IMO obviously). It looks like a weird mixture of Numpy and pre-2010 R code.

Like, it has great functionality, but I waste so much time trying to figure out how to do something with pandas.

And I can remember the R-API's, which are even more annoying ;)

re: weird mixture of Numpy and pre-2010 R

It does because it is.

Pandas was written by someone that was just starting out with Python at the time and was coming from programming in R.

Pandas was extensively rewritten since its first few versions, and its GitHub repo has 1500+ contributors now, so I doubt that Wes McKinney's initial lack of python experience had much to do with the API rough spots of modern pandas. "Pre-2010 R" doesn't mean much to me, but I can confidently point out that pandas "looks like a weird mixture of numpy" because it is based heavily on numpy.

It basically means pre-tidyverse R. pandas looks like a weird hybrid of Numpy and R. It's such a shame, as python has marvellous abstraction facilities, and yet they aren't used in pandas (__methods__ etc).

Is this actually a new feature of pandas? I've only used other libraries like seaborn and matplotlib.

It's not new, I remember using the .plot function 2 or 3 years ago. Seaborn is much better anyway though, so don't bother switching.

And the pandas plots are ugly

IIRC, if you import seaborn, then pandas plots will use seaborn styling. I don't think seaborn is a standalone plotting library - it merely provides styling. So you can have your cake and eat it too.

Seaborn adds new plotting functions, not merely a new style, though I'm not sure that qualifies it to be called standalone. Certainly Seaborn has matplotlib as a dependency.

You can use plt.style.use("seaborn") [0] to just use the style. Style sheets reference [1]

[0] https://matplotlib.org/tutorials/introductory/customizing.ht...

[1] https://matplotlib.org/gallery/style_sheets/style_sheets_ref...
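One caveat, as far as I know: the bundled seaborn-like style names have been renamed in later matplotlib releases, so a hard-coded "seaborn" may not resolve everywhere. A hedged sketch that looks the name up instead (the fallback to "default" is my own choice):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt

# Look the style name up instead of hard-coding it, since the bundled
# seaborn-like styles are named differently across matplotlib versions.
seaborn_styles = sorted(s for s in plt.style.available if s.startswith("seaborn"))

# Fall back to matplotlib's "default" style if none is bundled.
style_name = seaborn_styles[0] if seaborn_styles else "default"

# style.context applies the style only inside the with-block.
with plt.style.context(style_name):
    fig, ax = plt.subplots()
    ax.plot([1, 2, 3], [2, 4, 8])
```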

It is wrong to say Seaborn only provides styling. There are many types of plots available in Seaborn which are not available in Matplotlib. It might not be extensive, but what it does, it does better than Matplotlib.

This capability is at least a few years old which is when I first started using pandas. I believe it uses matplotlib on the backend for the plotting by default, and works pretty seamlessly with seaborn too.

It's a matplotlib wrapper.

I've been using pandas since 2012 and it had decent plotting capabilities back then.

I’m not sure I would describe those graphs as “beautiful”.

These are raw matplotlib graphs. It's crude. Not sure what the definition of beautiful is?

Beautiful means you want to post it on Instagram.

Yeah, this doesn't seem to add much...was hoping for something a little deeper.

Lately, `pip install pandas` is my first step after making a new virtual environment. Its read_sql and read_csv methods are magic. The resulting DataFrames are just like DataTables in C#. And for complex joins and aggregations, I can DataFrame.to_sql into an in-memory SQLite database.
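A minimal sketch of that round trip, with two made-up tables standing in for whatever read_csv would load:

```python
import sqlite3
import pandas as pd

# Hypothetical tables standing in for data loaded with read_csv.
orders = pd.DataFrame({"customer_id": [1, 1, 2], "amount": [10.0, 5.0, 7.5]})
customers = pd.DataFrame({"customer_id": [1, 2], "name": ["ann", "bob"]})

# An in-memory SQLite database lives only as long as the connection.
conn = sqlite3.connect(":memory:")
orders.to_sql("orders", conn, index=False)
customers.to_sql("customers", conn, index=False)

# Do the join and aggregation in SQL, then read the result back.
query = """
    SELECT c.name, SUM(o.amount) AS total
    FROM orders AS o
    JOIN customers AS c ON c.customer_id = o.customer_id
    GROUP BY c.name
    ORDER BY c.name
"""
totals = pd.read_sql(query, conn)
conn.close()
```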

Pandas feels like the wrong tool for this job. I don't use multi-indexes or any statistical methods. I don't chart anything.

But it's so darn convenient. If the time comes to optimize I can `import csv` directly and improve performance. But nothing beats it for prototyping.

Are there better options in this space?

On occasion, I've fired up pandas just to sanitize a CSV file and drop malformed rows as preparation to bulk ingesting into a database:

  import pandas as pd
  # on pandas >= 1.3, on_bad_lines='skip' replaces error_bad_lines=False
  pd.read_csv('bad_file.csv', error_bad_lines=False).to_csv('good_file.csv', index=False)

It's not efficient (reads everything into memory), but read_csv is robust when it comes to handling embedded unescaped quotes/commas/etc., and supports dropping rows with the incorrect number of columns due to anomalies it can't handle.

Genuine question - would you be willing to spend money for a better version of pandas?

Better in some generic sense of lighter, faster, better API.

I share your implied concern that pandas can be quite large and I personally disagree with a lot of the design decisions when it comes to the pandas API, but building an alternative tool would be a full time job. Unfortunately, there is no mechanism to support Python library developers and the expectation is for Python libraries to be free.

I'm curious how many people would be ok paying for a Python library.

I think that would be an uphill battle and very hard to succeed financially. I agree with you regarding the API being a mess, but pandas is so heavily entrenched in the data science space (in Python land) that it is almost impossible for a free replacement to take over, let alone a paid library.

And pandas is valuable particularly because it has so many users. I'd worry that a paid product would stagnate over time, or change to meet the needs of its largest customers, leaving me behind.

I go out of my way to support open source projects. Closed source would be a much harder sell.

I have so many things to say about this but I also want to remind you that Wes is working on pandas 2.


Good to know!

I thought that Wes had said a while ago that he was taking a break from working on Pandas

If you need to scale out or speedup pandas, there's Modin https://modin.readthedocs.io/en/latest/ (which uses Ray from)

Other graphing libs for jupyter notebooks include:

  - Bokeh
  - Plotly
  - Seaborn

These libraries were built to improve upon matplotlib or each other, weren't they? Yet, people continue to reach for "the original" ¯\(ツ)/¯

Seaborn isn't usable without matplotlib, nor does it aim to be. It gives you simple high-level calls and as soon as you need to tweak your plot you're back to matplotlib. (Similar to the pandas plotting shown in the article actually.)

Bokeh is nice. And has made huge improvements over the past year or two. But it still doesn't directly compete with matplotlib because it's more focused on interactive plots.

Bokeh and Plotly obviously do much, much more than matplotlib, and if you want what they provide, you would go for those. Seaborn is more presentable-looking by default and easier to get things laid out in. I personally haven't used straight-up matplotlib for a long time.

Bokeh and Seaborn are very, very thin skins over matplotlib, and honestly don't improve the user experience at all. Only Altair has changed things for me.

Bokeh is not a skin over matplotlib.

The matlab plotting syntax got transferred to Python through matplotlib and got very deeply ingrained - was first, got popular, built into pandas and statsmodels, foundation of seaborn etc. Recently I saw a snippet of Julia code that uses "pyplot", I assume because people find it familiar and convenient. That API just refuses to die.

The pandas wrappers around matplotlib are convenient but for anything that needs customising, you'll need to reach for the full matplotlib API anyway.

Most things like ticks and limits are covered by the wrappers, which goes beyond the basics. But if you are looking for annotations or animations, then you need to write matplotlib code.
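Roughly what that split looks like, on made-up data: ticks, limits and the title go through the pandas call, while the annotation needs the Axes that pandas returns.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import pandas as pd

df = pd.DataFrame({"year": [2018, 2019, 2020], "sales": [3, 7, 5]})

# Limits, ticks and titles can be passed straight to the pandas wrapper.
ax = df.plot(x="year", y="sales", xticks=df["year"], ylim=(0, 10), title="Sales")

# Annotations are not exposed by the wrapper, so use the returned Axes.
peak = df.loc[df["sales"].idxmax()]
ax.annotate("peak", xy=(peak["year"], peak["sales"]),
            xytext=(peak["year"], peak["sales"] + 1.5),
            arrowprops={"arrowstyle": "->"})
```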

Does anybody here think that Pandas API design is ugly and inconsistent? It feels like hack after hack.

I really only use Pandas for DataFrame structures. Doesn't really bother me if the rest of it is bad.

absolutely. pandas is just something i put up with to be able to use everything else in python, i'd drop it in a heartbeat.

I strongly recommend Altair (https://altair-viz.github.io/) as an extremely Pandas-friendly alternative approach to data visualization. It's the first library that has successfully "hidden" the ugly, gnarled matplotlib layer underneath for me. It also looks killer.

Second that but there is no matplotlib underneath AFAIK. It does html based interactive graphs using a variant of the grammar of graphics, with extensions for interactions. Grammar of graphics means a great, proven (see ggplot) compromise of power and simplicity. <shameless plug> If you need a library of one-liners built on top of altair (that is if you need some standard stats graph) I wrote altair_recipes (https://github.com/piccolbo/altair_recipes/ or pip install altair_recipes) for that. </shameless plug>

I like Altair, but it still has some annoying missing features like the inability to caption and footnote charts. Or the inability to format filters like sliders.

It makes nice-looking charts in html/d3, but it is a hassle to save a real image because that requires Chrome or Firefox, which happens to not work in my CI environment.

So at least matplotlib can save png without needing a bunch of stuff.

Add lack of support of polar coordinates. But every lib started immature, and altair is quite new. I think they could use a few good PRs though.

I recommend reading the official docs. I think they improved the plotting interface recently, and I learned a lot from reading through the guide:


Here's a dataset link that doesn't require registering with Kaggle:


I had no idea what Pandas was (other than the plural of the cute fluffy creature). I was really hoping this was going to talk about how to use panda images to visualize data!

It shows how to generate the different charts but it doesn't show you how to save them as an image.

When you are using the matplotlib Tk backend, it opens the chart in a separate window, from which you can save it as an image.
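Saving from code also works without any window. A small sketch, with a made-up DataFrame and a temp-directory output path chosen only to keep it self-contained:

```python
import os
import tempfile
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, no window needed
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": [2, 4, 8]})
ax = df.plot(x="x", y="y")

# Hypothetical output path; a temp directory keeps the sketch self-contained.
out_path = os.path.join(tempfile.mkdtemp(), "chart.png")

# The Axes knows its parent Figure, which can write itself to disk.
ax.get_figure().savefig(out_path, dpi=150, bbox_inches="tight")
```

The file extension decides the format, so swapping `.png` for `.svg` or `.pdf` should work the same way.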

Pandas should listen to the unix philosophy a bit and remove its plotting API.

The name “Pandas” is really misleading, especially in the title of this article. It's missing the phrase, “No animals were harmed in the making of these graphs”.

I would have thought if you focused on bamboo leaves as the main visual motif then that would keep their attention.

Leave those peaceful furry creatures alone, please.

What do you mean exactly here?

I assume it's a pun about pandas.
