API pain points off the top of my head:
- There are both `pivot` and `pivot_table` methods which behave slightly differently. I'd have to look up why `pivot` exists; I've learned simply to reach for `pivot_table` every time, since it aggregates more flexibly.
- The dreaded SettingWithCopyWarning is a huge pain for new pandas users, and simply should not exist in any mature data analysis software.
- Unlike in SQL, in pandas you can take a data table (a DataFrame) and group it by one or more columns without immediately specifying an aggregation function. The result of this `groupby` operation is a "groupby object", which contains all the information of the input DataFrame, but behaves completely differently from a DataFrame. By contrast, in R's tidyverse, my understanding is that when you group a tibble (the tidyverse's data table), the result is still a tibble, just grouped.
- It took a while to mentally sort out the similarities and differences of merge, join, append, and concatenate. In practice, I almost always reach for `merge` or `concat`.
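A minimal sketch of the pain points listed above, for anyone who hasn't hit them yet. The `sales` and `managers` frames and their column names are made up for illustration; only the pandas calls themselves are real.

```python
import pandas as pd

# Hypothetical data: note that ("west", "q2") appears twice.
sales = pd.DataFrame({
    "region": ["east", "east", "west", "west", "west"],
    "quarter": ["q1", "q2", "q1", "q2", "q2"],
    "amount": [10, 20, 30, 40, 5],
})

# pivot only reshapes, so a repeated index/column pair raises ValueError.
try:
    sales.pivot(index="region", columns="quarter", values="amount")
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table aggregates the duplicates instead (mean by default, sum here).
table = sales.pivot_table(
    index="region", columns="quarter", values="amount", aggfunc="sum"
)

# groupby without an aggregation yields a DataFrameGroupBy, not a DataFrame.
grouped = sales.groupby("region")
totals = grouped["amount"].sum()

# Chained indexing such as sales[mask]["amount"] = 0 is what typically
# triggers SettingWithCopyWarning; a single .loc call avoids the ambiguity.
fixed = sales.copy()
fixed.loc[fixed["region"] == "east", "amount"] = 0

# merge is a SQL-style join on key columns; concat stacks frames along an axis.
managers = pd.DataFrame({"region": ["east", "west"], "manager": ["Ann", "Bob"]})
merged = sales.merge(managers, on="region", how="left")
stacked = pd.concat([sales, sales], ignore_index=True)
```

Nothing here is specific to one workflow; it just shows why `pivot_table`, `.loc`, `merge`, and `concat` tend to be the safe defaults.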
One tip: create a repo for useful code snippets. Whenever you do something new that you think you might need again, spend a bit of time commenting and describing it, then add it to the repo. That way, instead of searching all over again, you'll hopefully remember having done it before and be able to find it quickly.
Things started to make sense after I read a very good book on pandas. A book is better than blog posts because it is consistent: reading a small tutorial for every little thing is confusing, since every blog post does the same thing a different way.
If you're only an occasional user, this will be your life forever. My experience with pandas is that if you use it heavily for 3 months, then things start to "stick" and you need to look it up less often.
Unfortunately I changed jobs and have forgotten most of pandas, so I'm back to looking things up again.
It has great functionality, but I waste so much time trying to figure out how to do something with pandas.
And I can remember the R APIs, which are even more annoying ;)
It does because it is.
Pandas was written by someone who was just starting out with Python at the time and was coming from programming in R.
And the pandas plots are ugly.
Pandas feels like the wrong tool for this job. I don't use multi-indexes or any statistical methods. I don't chart anything.
But it's so darn convenient. If the time comes to optimize I can `import csv` directly and improve performance. But nothing beats it for prototyping.
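For reference, a rough sketch of what dropping down to the standard-library `csv` module might look like. The `name`/`score` columns and the `total_score` helper are hypothetical, and whether this actually beats pandas depends on the workload, so measure before switching.

```python
import csv
import io

# Hypothetical sample data standing in for a real file on disk.
raw = "name,score\nalice,3\nbob,5\n"


def total_score(handle):
    """Sum the 'score' column of a CSV file-like object, row by row."""
    reader = csv.DictReader(handle)
    # csv yields every field as a string, so convert explicitly.
    return sum(int(row["score"]) for row in reader)


total = total_score(io.StringIO(raw))
```

With a real file you would pass `open(path, newline="")` instead of the `StringIO` object.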
Are there better options in this space?
import pandas as pd
Better in some generic sense of lighter, faster, better API.
I share your implied concern that pandas can be quite large, and I personally disagree with a lot of the design decisions in the pandas API, but building an alternative tool would be a full-time job. Unfortunately, there is no mechanism to support Python library developers, and the expectation is for Python libraries to be free.
I'm curious how many people would be ok paying for a Python library.
I go out of my way to support open source projects. Closed source would be a much harder sell.
I thought that Wes had said a while ago that he was taking a break from working on pandas.
Bokeh is nice, and it has made huge improvements over the past year or two. But it still doesn't directly compete with matplotlib because it's more focused on interactive plots.
It makes nice-looking charts in HTML/JavaScript, but saving a real image is a hassle because it requires Chrome or Firefox, which happens to not work in my CI environment.
So at least matplotlib can save a PNG without needing a bunch of extra stuff.
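A minimal sketch of the headless case, assuming matplotlib is installed and using its non-interactive `Agg` backend. The plotted values are arbitrary, and the image is written to an in-memory buffer only to keep the example self-contained.

```python
import io

import matplotlib

matplotlib.use("Agg")  # non-interactive backend: no display or browser needed
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([1, 2, 3], [2, 4, 8])
ax.set_title("example")

# Render to PNG in memory; pass a file path instead to write to disk.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)

png_bytes = buffer.getvalue()
```

In a CI job, `fig.savefig("plot.png")` works the same way without extra system dependencies.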