* old attitude: why does pandas have to make things so hard
* new attitude: pandas has a crazy difficult job
And then, for the ubiquitous object dtype, pandas often has to figure out which of the many possible more specific types to cast it to.
If you think that is easy, ask yourself what this outputs:
import numpy as np
import pandas as pd
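The example itself is truncated in the thread; a representative snippet of the kind of inference surprise the comment alludes to (my own illustration, not the original poster's):

```python
import numpy as np
import pandas as pd

# Integers silently become floats the moment a NaN appears, because
# NaN is a float and int64 has no missing-value representation.
s = pd.Series([1, 2, np.nan])
print(s.dtype)  # float64
```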
It also has a pathological fixation on when it tries to convert dtypes, since avoiding all the bad conversion outcomes is a relatively time-intensive process (compared to, e.g., creating a NumPy array).
I realize things could be much easier in pandas' user-facing interface, but I really appreciate the sheer amount of effort that has gone into its dtype wrangling.
Now, to be clear, that's a hard problem. Heterogeneous named bags of homogeneous columns with a variety of data types, storage patterns, and ideas about missingness aren't an easy domain... but instead of just trying to make everything work through hammering 6+ semi-coherent interfaces (indices, databases, mutability, immutability/chaining, numpy, dataframes) together, I'd be willing to pay a lot more in verbosity and explicitness for something simple.
pd.Series(str, [np.nan, "a"]) => ["nan", "a"] # or even an exception!
pd.Series(nullable(str), [np.nan, "a"]) => [nan, "a"]
I'd chalk this all up to just being "hard", but at the same time I can go pick up R's dplyr library and get a very nice existence proof of how a nice interface could work. Not to say dplyr has it all figured out, but it's a night-and-day improvement over pandas.
Pandas is great. It makes doing data science in Python so vastly much less of a chore than working with straight Numpy. It steals some great ideas and tries out a few interesting ones of its own... but it is far from a joy to work with.
That's a pretty succinct way of expressing one of pandas' biggest pain points.
The API surface is huge because it heroically tries to handle all possible use cases. I do think it would've been significantly easier to develop, maintain and _use_ if pandas was more opinionated and offered a smaller set of very focused functions.
As I've worked on a port of dplyr to python over the past year, though, I've realized the dtype issue (like you said), indexes, and GroupBy being difficult are likely connected. Basically,
* dplyr can chop up a dataframe into 50,000 groups and apply arbitrary functions to it--no problem.
* custom pandas grouped applies are very slow
1. creating an index for each subgroup is slow (will not be a RangeIndex)
2. initializing a series for each group is slow (mostly due to type inference being re-run; could be avoided)
3. AFAIK more type inference is run when concatenating results
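A rough sketch of the gap described above (sizes here are illustrative, not a benchmark):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "g": rng.integers(0, 5_000, size=100_000),
    "x": rng.random(100_000),
})

# Python-level apply: pandas builds a fresh Series (and index) per group
# and re-runs type inference when stitching the results back together.
slow = df.groupby("g")["x"].apply(lambda s: s.mean())

# Built-in aggregation stays in compiled code and skips the per-group costs.
fast = df.groupby("g")["x"].mean()
```

Both produce the same result; the per-group Python overhead is where the time goes.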
I wrote a bit on how I tried to work around that, to enable fast dplyr-like syntax over grouped data in Python. Would definitely be interested in your take! There are other libraries, like ibis, that do a good job with it, too!
I think at its core, the groupby issue is a really big problem, and am devoting most of this year to working on it. So if you ever want to pair to work on pandas / pandas wrapping libraries send me an email (link in profile)!
plyr came before and I think there was something else
That’s how we got the amazing `dplyr`
I think pandas is well liked by those who move from C++ or Java, but is disliked by those who move from R
My god is it poorly documented. A lot of us newbies trying to do dataframe stuff for the first time get stuck on GroupBy, trying to figure out how it works. Dplyr's functions are miles ahead on usability and documentation.
Aaand.. well, I guess it kind of is. But at the same time it isn't. Our biggest issues haven't been the complexity of what we've been implementing, but rather the insanity of the pandas interface. We've tried to keep in mind that we're all inexperienced wrt. pandas, and it has definitely gotten better as we have gained experience, but that doesn't give us back the hundreds of man-hours we've spent trying to please pandas.
All the insane overloading of everything is awful and stupid. Give me 3 different ways to do 3 different things, not 9 ways to use 1 feature to do 1 thing.
OP was wondering why pandas hit a strange overhead after every 100th iteration in some very specific case. There was a suggestion that Python's GC was responsible, but it was not clear at all.
Finally, I dived into pandas and found that it has a hard-coded constant == 100 (!) for the number of internal data storage blocks. After reaching this value it runs some consolidation routines, and they consume a lot of memory, even leading to a crash with a memory error.
What was much more surprising is that changing this constant to some large value (1000000, which effectively disables consolidation entirely) reduces memory consumption dramatically! Consolidation is supposed to reduce storage and memory consumption, so I still do not know why the opposite happens here and why it works well in all other cases.
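A sketch of the access pattern that fragments the internals (the 100-block threshold lives in pandas' private BlockManager, so this only shows the column-by-column insertion that triggers it, not the internal counter itself):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"c0": np.zeros(1_000)})

# Each column assignment lands in its own internal storage block.
# Once the block count crosses the hard-coded threshold (100 in the
# versions discussed above), pandas runs a consolidation pass that
# copies everything into one contiguous array.
for i in range(1, 130):
    df[f"c{i}"] = np.zeros(1_000)
```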
It wasn’t functionally very different from everything being a float, but I will appreciate not having to format floats as ints in all my reports.
The key idea is to allow abstraction of physical types (booleans, integers) by defining custom semantic types (URLs, paths, probabilities). The idea originated while working on pandas-profiling and running into similar problems. We found this abstraction to be effective for many other downstream tasks, too, including compression and AutoML. More coming soon...
By comparison, the API of something like Pytorch is an absolute pleasure to use, and even though I'm not using it all the time, I have almost no trouble every time I begin training models/trying out new things in Pytorch.
All that being said, this is definitely a step in the right direction and hopefully the API gets a bit more coherent over time.
Agreed. In particular one might have hoped that 1.0 would fix indexing. .ix (deprecated), .loc, .iloc and "[" is an example of what people mean by saying the API is (a) a mess and (b) "deeply unpythonic". Shouldn't "[" be removed entirely if .loc and .iloc are recommended, given the odd and unpredictable edge cases with "["?
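A small demonstration of why "[" is unpredictable (an integer index makes the ambiguity visible; behavior shown here is the pandas 1.x/2.x semantics):

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=[2, 1, 0])

s[0]       # 30 -- "[" with a scalar looks up the *label* 0 here...
s[0:2]     # 10, 20 -- ...but "[" with a slice is *positional*

s.loc[0]   # 30 -- always label-based
s.iloc[0]  # 10 -- always position-based
```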
> unless you use it regularly for a long period of time, it is almost impossible to get fluent with it.
I know a huge amount of valiant, voluntary, open source work has gone into it, but it is a shame that the primary data-frame library in the Python ecosystem lacks a clean, pythonic API. Having been negative, I don't want to obscure the fact that it does have some great and powerful code behind its API.
Like Example x3 vs x5 in their docs on the sum function:
Pandas' API might be a mess, but that's partly because they've been really good about experimenting with the best way to do things for the last 10 (?) years. Adding newer alternatives to fiddly APIs etc. but never removing them. Now they can start the removals.
Corporate control means tighter development schedules and consistent APIs. It also means that if you don't like the path FAIR has chosen, too bad. As a result, there are multiple competing options in the deep learning space: Tensorflow (Google), MXNet (Amazon), CNTK (Microsoft), Paddle (Baidu), etc.
On the other hand, Pandas is something for everyone. The lack of strong opinions means that it can be easily adopted anywhere. Can you imagine what data science/analysis would feel like with multiple low-level Pandas competitors, from different corporations? Each one would feel consistent, but none would work together (and imagine building an ML platform which supported multiple dataframe sources).
I do sometimes miss working in R - yes, R takes flexibility to a fault, but there's a consistent set of primitives that mostly get reused. Perhaps R gives off that impression because of the work done by Hadley and others to build tooling according to the tidyverse principles. I wonder if Julia will combine the best of these worlds in the future.
> I do sometimes miss working in R - yes, R takes flexibility to a fault, but there's a consistent set of primitives that mostly get reused. Perhaps R gives off that impression because of the work done by Hadley and others to build tooling according to the tidyverse principles. I wonder if Julia will combine the best of these worlds in the future.
Funny that you mention R, which has exactly what you criticized before (base R data.frame, tidyverse/tibble, data.table), not to mention at least 6 different packages/datastructures to represent time series.
Debugger support is almost non-existent. Using the Atom/Juno IDE is a D-grade experience. Julia offers little help in debugging problems - errors are, almost without fail, completely tangential to what the real issue is.
Julia takes forever to start, syntax was wonderful in the pre-0.4 days, now the syntax just looks absolutely jarring to my eyes. Julia's speedups can be offset by using many far better technologies - Numba, Numpy, Cython, PyPy, etc.
Julia's ecosystem of libraries is a deserted land - that's expected for a new language. I hope it improves, but the core Julia experience needs to improve first.
On the other hand, I've been following Rust development and the developers made absolutely sure from the get-go to focus on debugging/errors that show what the actual problem is, provide a stack trace and figure out where the problem started. Julia absolutely sucks at this.
Really? I'd recommend assembler at gunpoint; even at stick-point for that matter. You either spin a mean hyperbole or you're one serious programmer.
I've had the same two complaints as you when I tried Julia ~2 years ago. I was told that the startup situation improved to the point of it not being a problem anymore, but even then people were just reusing Julia's processes. Don't know about the state of errors, but I'd expect noticeable improvements there as well, as with most new languages.
I'm anxious to one day come back to Julia, due to its focus on numerical computation, but at the moment it's somewhat counter-balanced by Python ecosystem's maturity.
Although I spent a long time optimising numerical code with Numba the speedup I got (whilst significant) wasn’t really comparable to the speed of a Julia implementation.
The core developers of Julia are very smart folks; they want to develop a great language that's fast and easy to use. They missed the opportunity to restrict syntax, provide useful exception messages (just look at Rust! it is a thing of art when you get an exception, it is beautiful), and generally provide good documentation.
For example, just creating a Julia local registry requires significant overhead and time investment. Spinning up a registry should not take more than 30 mins.
All these aspects of Julia are prohibitive and in my opinion, Julia should not be used in any company or production use until perhaps version 2. People who have dealt with Julia issues will tell you the truth - not the academicians or researchers. The people that maintain infrastructure/maintenance support for Julia apps are almost ready to quit their jobs. It sucks so bad.
Any specific examples?
Pandas .map() and .apply() get real slow on big datasets. I found it quicker to solve a problem with a million line dataset by just using base python iterables instead, so nothing needed to fit into my RAM and I didn't have to work with slow pandas mapping.
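The gap is easy to reproduce; roughly (my own illustration, sizes arbitrary):

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(1_000_000, dtype="float64"))

# .map/.apply call a Python function once per element.
slow = s.map(lambda x: x * 2 + 1)

# The vectorized form is one compiled pass over the underlying array.
fast = s * 2 + 1
```

Both paths apply the same arithmetic elementwise; only the vectorized one avoids a million Python function calls.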
For the existing project, I'm thinking the switch to Julia + the DataFrames library (despite it being a completely different language) is more of a 1:1 port. In contrast, I would need to use brain power to build arrays or dicts to mimic Pandas (and probably get it wrong and introduce bugs).
This so much. I've answered quite a bit of pandas on SO, and I have to say the APIs are a mess. There are always multiple ways to do things, there are hidden traps that can lead to huge run times, and stuff that is just wildly un-pythonic.
It's still the best general data processing has to offer. But a smaller, cleaner package might just take the cake.
On the other hand, I’ve gotten tremendous value from it, and I can’t aggressively criticize an open source project I can use without paying.
>>TypeError: Passing a bool to header is invalid.
Fuck, every time!!!!
pd.read_csv(filepath, sep="\t", header=False)
And I've been using this library almost every day for years
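For what it's worth, `header` expects a row number (or `None`), not a bool, so the call above would be written like this (using an in-memory buffer for illustration):

```python
import io
import pandas as pd

data = "a\tb\n1\t2\n"

# header=0 uses the first row as column names; header=None means "no header".
df = pd.read_csv(io.StringIO(data), sep="\t", header=0)
```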
I think it's worth it to acknowledge the great stewardship of the community by all the Pandas developers (and the rest of people in the PyData ecosystem). It has been an inspiration for me as I create and contribute to open source libraries for data science .
How are we supposed to interpret this in light of the promise that there will be no more API breakages until 2.0? It reads as if this promise does not apply to string data, which impacts rather a lot of use cases.
There are quite a few complaints here about the interface being confusing and difficult to use, and I feel like some of this is due to there being significant differences between versions. I would love to read a medium-length online free tutorial on Pandas 1.0, but it seems like most of what turns up on Google are short idiosyncratic tutorials on specific tasks in various versions.
Whenever one can use a UTC epoch column for time-indexed data in a raw numpy array instead, one should.
It has a cleaner, leaner API + the ability to use memory-mapped files.
It's great that the whole thing with extension arrays, custom types etc. has led to this, but when the devs have, after 10+ years, the biggest chance for a backward-incompatible change, this is the one to make. By making it optional, they are fixing it for the very few that know of its existence.
I love pandas and a sizeable part of my career depended on it - and while I don't use it anymore (partly because of the NULLs), I wish it the best and I hope there will be a future release with this breaking change.
Long live the King.
If anyone else is wondering what this is (source: project homepage)