Hacker News new | past | comments | ask | show | jobs | submit login
Pandas 1.0 (pydata.org)
510 points by kylebarron 19 days ago | hide | past | web | favorite | 76 comments

I've had to dive into the pandas code over the last year for a project [0], and my attitude has shifted dramatically from...

  * old attitude: why does pandas have to make things so hard
  * new attitude: pandas has a crazy difficult job
I think this is most apparent in the functions that decide what "[d]type" a Block--the most basic thing that stores data in pandas--should be.


And then, for the ubiquitous Object dtype, often figure out which of the many possible more specific types to cast it to.

If you think that is easy, ask yourself what this outputs:

  import numpy as np
  np.array([np.nan, 'a'])

Lo and behold--it produces an array where the np.nan has been converted to the string "nan".

And yet

  import pandas as pd
  pd.Series([np.nan, "a"])

Knows this, has your back, and does not stringify it.

It also has a pathological fixation on when it tries to convert dtypes, since avoiding all the bad conversion outcomes is a relatively time intensive process (compared to e.g. creating a numpy array).

I realize things could be much easier in pandas user facing interface, but really appreciate the sheer amount of effort that has gone into its dtype wrangling.

[0]: http://github.com/machow/siuba

I really, really dislike all the dtype wrangling and how those choices resonate throughout the API. I understand that a lot of work has been done to make that API "work", but in practice it feels like that effort would have been better avoided by changing expectations and interfaces.

Now, to be clear, that's a hard problem. Heterogenous named bags of homogenous columns with a variety of data types, storage patterns, and ideas about missingness isn't an easy domain... but instead of just trying to make everything work through hammering 6+ semi-coherent interfaces (indices, databases, mutability, immutability/chaining, numpy, dataframes) together, I'd be willing to pay a lot more in verbosity and explicitness for something simple.

    pd.Series(str, [np.nan, "a"]) => ["nan", "a"] # or even an exception!
    pd.Series(nullable(str), [np.nan, "a"]) => [nan, "a"]
Indexing is vastly over-designed. GroupBy is a very common API and is poorly documented and just weird in no small part due to attempts at dtype inference. Foundational useful concepts like categories feel bolted on. There's join, merge, pivot, pivot_table.

I'd chalk this all up to just being "hard", but at the same time I can go pick up R's dplyr library and get a very nice existence proof of how a nice interface could work. Not to say dplyr has it all figured out, but it's a night-and-day improvement to Pandas.

Pandas is great. It makes doing data science in Python so vastly much less of a chore than working with straight Numpy. It steals some great ideas and tries out a few interesting ones of its own... but it is far from a joy to work with.

> hammering 6+ semi-coherent interfaces (indices, databases, mutability, immutability/chaining, numpy, dataframes) together,

That's a pretty succinct way of expressing one of panda's biggest pain points.

The API surface is huge because it heroically tries to handle all possible use cases. I do think it would've been significantly easier to develop, maintain and _use_ if pandas was more opinionated and offered a smaller set of very focused functions.

Agreed RE GroupBy being challenging, especially compared to dplyr.

As I've worked on a port of dplyr to python over the past year, though, I've realized the dtype issue (like you said), indexes, and GroupBy being difficult are likely connected. Basically,

  * dplyr can chop up a dataframe into 50,000 groups and apply arbitrary functions to it--no problem.
  * custom pandas grouped applies are very slow
There are basically three reasons for slow pandas apply methods...

  1. creating an index for each subgroup is slow (will not be a RangeIndex)
  2. initializing a series for each group is slow (mostly due to type inference being re-run; could be avoided)
  3. AFAIK more type inference is run when concatenating results
This leads to a world where grouped calculations can't be run using arbitrary expressions (e.g. lambdas), but have to go through specific SeriesGroupBy methods.

I wrote a bit on how I tried to work around that, to enable fast dplyr-like syntax over grouped data in python. Would definitely be interested in your take! There are other libraries, like ibis that do a good job with it, too!


I have been through this path of trying to figure out why pandas Groupby is slow. It is a big bottleneck when building packages on top of pandas.

Well, I'm a big fan of plotnine, and plydata was part of the inspiration for siuba!

I think at its core, the groupby issue is a really big problem, and am devoting most of this year to working on it. So if you ever want to pair to work on pandas / pandas wrapping libraries send me an email (link in profile)!

Hadley had the discipline to let go and start from scratch 2 if not 3 times before getting it perfect with `dplyr`.

plyr came before and I think there was something else

That’s how we got the amazing `dplyr`

I think pandas is well liked by those who move from C++ or Java, but is disliked by those who move from R

I agree - R has performant and robust dataframe functions. dplyr is great for small-medium sized datasets, data.table seems to be really performant for larger sets.

And now there is dtplyr, which simply creates a data.table back end with dplyr syntax on the front end.

I think this is key: there was a massive learning effort that went into the current dplyr interface. This risked fragmenting the community, but Hadley and his collaborators navigated those waters effectively. That's super tough work and R benefits significantly from it. If I had a magic wand for Pandas, I feel giving that team the opportunity to rework interfaces without killing momentum and community would go so far.

>GroupBy is a very common API and is poorly documented

My god is it poorly documented. A lot of us newbies trying to do dataframe stuff for the first time get stuck on GroupBy, trying to figure out how it works. Dplyr's functions are miles ahead on useability and documentation.

I've just spent the past 3 months reengineering what is essentially a spreadsheet calculator to use pandas. Pandas was chosen before I was there but it seemed like it was perfect for the job.

Aaand.. well, I guess it kind of is. But at the same time it isn't. Our biggest issues haven't been the complexity of what we've been implementing, but rather the insanity of the pandas interface. We've tried to keep it in mind that we're all inexperienced wrt. pandas, and it has definitely gotten better as we have gained experience, but that doesn't give us the hundreds of man-hours back we've spent trying to please pandas.

All the insane overloading of everything is awful and stupid. Give me 3 different ways to do 3 different things, not 9 ways to use 1 feature to do 1 thing.

I think another area pandas has done a lot of work on is with datetimes. Numpy's datetime objects are pretty deficient when you need to perform computations / data wrangling with them and utilizing python's native datetime objects would slow things down a decent bit. So they have done a lot of work to create their own datetime implementation that helps a lot when dealing with tabular data and performing date/time based arithmetic/manipulations.

Datetimes are still a mess. Now we have 3 datetime objects and converting between them is not obvious or trivial:


Yeah sure, but that isn't their fault really. I can see a role for python's datetime being separate from numpy/pandas, but I do think a consolidated datetime object would be better to have rather than the numpy and pandas versions that are similar but not the same.

Definitely agree, pandas/numpy can have internal representations of efficient datetimes but the user should not have to deal with these conversions.

I use Julia and this is one area where Pandas still kicks the Julia ecosystem's proverbial ass: awesome support for working with nanosecond precision epoch timestamps.

I dare to promote one StackOverflow question [1] about pandas I have tried to investigate and answer [2] half a year ago. And I was rather horrified by its internal complexity after digging into pandas source :)

OP was wondering, why pandas facing a strange overhead after each 100th iteration in some very specific case. There was a proposal about Python's GC, but it was not clear at all.

Finally, I have dived into pandas and found that it has a hard-coded constant == 100 (!) of a number of internal data storage blocks. After reaching this value it runs some consolidation routines [3], and they consume a lot of memory even leading to crash with memory error.

What was much more wondering, is that after changing this constant to some large value (1000000, actually it disables consolidation at all) reduces memory consumption dramatically! This consolidation seems to reduce storage and memory consumption, so I still do not know why the opposite happens and why it works well in all other cases.

[1] https://stackoverflow.com/questions/56690909/python-is-facin... [2] https://stackoverflow.com/a/56705419/978424 [3] https://github.com/pandas-dev/pandas/blob/761bceb77d44aa63b7...

I like that pandas 1.0 took “Int64” out of trial mode.

It wasn’t functionally very different from everything being a float, but I will appreciate not having to format floats as ints in all my reports.

Ah, same! I'm also relieved to hopefully never again have to say: "whoops I converted nan into a very large integer".

Hi, I'm working on this if anyone is interested (work in progress):


The key idea is to allow abstraction of physical types (boolean, integers) by defining custom semantic types (URLs, paths, probabilities). The idea originated while working on pandas-profiling [0] and running into similar problems. We found this abstraction to be effective for many other downstream tasks, too, including compression and AutoML. More coming soon...

[0]: https://github.com/pandas-profiling/pandas-profiling

There are always choices to make. In this case, I would much prefer to let the data be treated as is, i.e., no silent casting of np.nan to the string "nan". When dealing with numerical data, a string "nan" is rarely useful. But when you need it, you can still create a data series with a string "nan" using

    pd.Series([str(np.nan), "a"])

So you want the behavior that Pandas has (i.e., no silent casting of np.nan to the string "nan").

The point I wanted to make was that string "nan" is usually not as useful as floating point NaN when dealing with numerical data. Therefore, as default behavior it is acceptable, if one has to make a choice.

Great accomplishment and kudos to the dedicated maintainers. That being said, I've always had a love-hate relationship with pandas. It is a very powerful library and does a ton, but yet the API is all over the place and unless you use it regularly for a long period of time, it is almost impossible to get fluent with it. Every time I am away from it for a couple of months, I find even doing the most basic things to be complicated/confusing and find myself on stackoverflow way too often.

By comparison, the API of something like Pytorch is an absolute pleasure to use and even though I'm not using it all the time, I almost have no trouble every time I begin training models/trying out new things in Pytorch.

All that being said, this is definitely a step in the right direction and hopefully the API gets a bit more coherent over time.

> yet the API is all over the place

Agreed. In particular one might have hoped that 1.0 would fix indexing. .ix (deprecated), .loc, .iloc and "[" is an example of what people mean by saying the API is (a) a mess and (b) "deeply unpythonic". Shouldn't "[" be removed entirely if .loc and .iloc are recommended, given the odd and unpredictable edge cases with "["?

> unless you use it regularly for a long period of time, it is almost impossible to get fluent with it.


I know a huge amount of valiant, voluntary, open source work has gone into it, but it is a shame that the primary data-frame library in the Python ecosystem lacks a clean, pythonic API. Having been negative I don't want to obscure the fact that it does have some great and powerful code behind its API.

Definitely agree the API can be tightened up, but just for fun and perspective take a look at pre pandas data wrangling tools like sas.

Like Example x3 vs x5 in their docs on the sum function:



That's almost exactly what they've started to do in 1.0. You say .ix is deprecated but in fact it was removed in 1.0. Now that 1.0 is out, they have a deprecation policy which will allow them to remove things like this.

Pandas' API might be a mess but that's partly because they're been really good about experimenting with the best way to do things for the last 10 (?) years. Adding newer alternatives to fiddly APIs etc. but never removing them. Now they can start the removals.

PyTorch is a project primarily funded/developed by Facebook/FAIR. Pandas is fully an open-source project, without corporate control.

Corporate control means tighter development schedules and consistent API's. It also means that if you don't like the path FAIR has chosen, too bad. As a result, there's multiple competing options in the deep learning space: Tensorflow (Google), MXNet (Amazon), CNTK (Microsoft), Paddle (Baidu), etc.

On the other hand, Pandas is something for everyone. The lack of opinioniation means that it can be easily adopted anywhere. Can you imagine what data science/analysis would feel like with multiple low-level Pandas competitors, from different corporations? Each one would feel consistent, but none would work together (and imagine building an ML platform which supported multiple dataframe sources).

I do sometimes miss working in R - yes, R takes flexibility to a fault, but there's a consistent set of primitives that mostly get reused. Perhaps R gives off that impression because of the work done by Hadley and others to build tooling according to the tidyverse principles. I wonder if Julia will combine the best of these worlds in the future.

> Can you imagine what data science/analysis would feel like with multiple low-level Pandas competitors, from different corporations? Each one would feel consistent, but none would work together (and imagine building an ML platform which supported multiple dataframe sources).

> I do sometimes miss working in R - yes, R takes flexibility to a fault, but there's a consistent set of primitives that mostly get reused. Perhaps R gives off that impression because of the work done by Hadley and others to build tooling according to the tidyverse principles. I wonder if Julia will combine the best of these worlds in the future.

Funny that you mention R, which has exactly what you criticized before (base R data.frame, tidyverse/tibble, data.table), not to mention at least 6 different packages/datastructures to represent time series.

I feel data.frame and tibble are mostly compatible (you can use tidyverse tools on dataframes), and nearly all R users use one or the other, while data.table is used by a few finance folks who grumble about how slow tidyverse is.

Tidyverse tools work on data.frame objects but they do have the unfortunate and undocumented habit of turning them into tibbles.

I started using Julia recently. It seems like Julia has been able to take the good parts of Python and iron out the quirks. For example, I'm guessing the Julia DataFrame library is a knock off of Pandas, but the syntax more intuitive and concise - and I can remember it.

For Julia itself, the syntax is very similar to Python but doesn't have the weird lambda functions. It has the Javascript style arrow for short anonymous functions and the Ruby style "do" for longer functions. And finally, Julia is fast. I have a python/pandas script that take 3 days to run. Moving it over to Julia now.

Julia doesn't feel production ready at all. Its fine to mess around in notebooks but I would never recommend it for production use. Not even at a gunpoint.

Debugger support is almost non-existent. Using Atom/Juno IDE is a D-grade experience. Julia offers little help to debug problems - errors are almost always without failure - completely tangent to what the real issue is.

Julia takes forever to start, syntax was wonderful pre 0.4 days, now the syntax just looks absolutely jarring to my eyes. Julia's speedups can be offset by using many many far better technologies - Numba, Numpy, Cython, PyPy etc.

Julia's ecosystem of libraries is a deserted land - that's expected for a new language. I hope it improves, but the core Julia experience needs to improve first.

On the other hand, I've been following Rust development and the developers made absolutely sure from the get-go to focus on debugging/errors that show what the actual problem is, provide a stack trace and figure out where the problem started. Julia absolutely sucks at this.

> Not even at a gunpoint

Really? I'd recommend assembler at gunpoint; even at stick-point for that matter. You either spin a mean hyperbole or you're one serious programmer.

I've had the same two complaints as you when I tried Julia ~2 years ago. I was told that the startup situation improved to the point of it not being a problem anymore, but even then people were just reusing Julia's processes. Don't know about the state of errors, but I'd expect noticable improvements there as well, as with most new languages.

I'm anxious to one day come back to Julia, due to its focus on numerical computation, but at the moment it's somewhat counter-balanced by Python ecosystem's maturity.

Just wanted to jump in here and say that for numerical computing Julia’s ecosystem of libraries are absolutely fantastic! DifferentialEquations.jl alone was worth switching to Julia!

Although I spent a long time optimising numerical code with Numba the speedup I got (whilst significant) wasn’t really comparable to the speed of a Julia implementation.

I love some of their libraries as you mentioned. Just that when the language itself is painful to write code and debug it, the whole value proposition is diminished.

The core developers are of Julia are very smart folks, they want to develop a great language that's fast and easy to use. They missed the opportunity to restrict syntax, provide useful exception message (just look at Rust! it is a thing of art when you get an exception, it is beautiful), and generally provide good documentation.

For example, just creating a Julia local registry requires significant overhead and time investment. Spinning up a registry should not take more than 30 mins.

All these aspects of Julia are prohibitive and in my opinion, Julia should not be used in any company or production use until perhaps version 2. People who have dealt with Julia issues will tell you the truth - not the academicians or researchers. The people that maintain infrastructure/maintenance support for Julia apps are almost ready to quit their jobs. It sucks so bad.

>The people that maintain infrastructure/maintenance support for Julia apps are almost ready to quit their jobs. It sucks so bad.

Any specific examples?

To which version of Julia are you referring? All of those things are much improved now.

I took a program that used to take multiple days to run in python/numpy/scipy/pandas moved it to julia... and got it down to just a few hours. Something like a 20x speed improvement. It seems the biggest speedups are gained if you happened to have implemented custom functions in python since those cannot take as much advantage of C numpy/scipy/pandas cores. That said, I still find numpy/scipy/pandas more natural. But that's probably simply because I'm more familiar with them and python.

> I have a python/pandas script that take 3 days to run.

Pandas .map() and .apply() get real slow on big datasets. I found it quicker to solve a problem with a million line dataset by just using base python iterables instead, so nothing needed to fit into my RAM and I didn't have to work with slow pandas mapping.

Yes, i was thinking about removing the Pandas code and using python iterables. The issue is I did a lot exploration with pandas, which it was good at. If I started with Julia, i wouldn’t need to refactor seeking performance by removing pandas, or for numba, dask, etc.

For the existing project i’m thinking the switch to Julia + DataFrames library (despite it being a completely different language) is more of a 1:1 port. In contrast would need to use brain power to build arrays or dicts to mimic Pandas (and probably get it wrong and introduce bugs)

I hear you can use python from Julia, have you tried this as, possibly temporary/transition, approach?

Yes, thanks. That ability definitely does give me a bit of comfort in case Julia is missing something. I plan not to use it though. When you use it, it’s actually running python, so same speed and such.

> yet the API is all over the place

This so much. I've answered quite a bit of pandas on SO, and I have to say the APIs are a mess. There are always multiple ways to do things, there are hidden traps that can lead to huge run times, and stuff that are just wildly un-pythonic.

It's still the best general data processing has to offer. But a smaller, cleaner package might just take the cake.

I can see a future for pytonic interface to pandas. It would run pandas in the backend but would be just as easy as manipulating list, dicts, sets and tuples...we'd have a 5th type - tables.

The danger with Pandas’ bizarre API is that it isn’t obvious when you are doing something stupid, because the right way to say things also looks weird and unpythonic.

On the other hand, I’ve gotten tremendous value from it, and I can’t aggressively criticize an open source project I can use without paying.

Couldn't agree more. Fantastically powerful library, but my goodness the syntax is often unfriendly, inconsistent, etc.

Would it be useful to have an API wrapper library around pandas that forces you to use the "right way" of doing things?

pd.read_csv(filepath, sep="\t", header=None)

>>TypeError: Passing a bool to header is invalid.

Fuck, every time!!!!

pd.read_csv(filepath, sep="\t", header=False)

And I've been using this library almost every day for years

I know I'm not the only one, but it's hard to imagine doing my job the last several year without Pandas. Even though Pandas has been used in production by many people as basically a 1.0.0 release for a long time, this an amazing milestone and I think everyone in my office smiled when they saw the release news.

I think it's worth it to acknowledge the great stewardship of the community by all the Pandas developers (and the rest of people in the PyData ecosystem). It has been an inspiration for me as I create and contribute to open source libraries for data science [0][1].

[0] https://github.com/FeatureLabs/featuretools/ [1] https://github.com/FeatureLabs/compose/

I basically owe my career to pandas - it made me the go-to resource for any data analysis that couldn't be done in Excel. Once I became useful in that respect, I was strongly encouraged to further develop my programming knowledge.

I am looking forward to a decade of fewer API breaking changes. However, 1.0 introduces a new column type for strings, recommends its use over the old "object" column type, yet says it is "Experimental and may change at any time."

How are we supposed to interpret this in light of the promise that there will be no more API breakages until 2.0? It reads as if this promise does not apply to string data, which impacts rather a lot of use cases.

It states in the deprecation policy that "API-breaking changes will be made only in major releases (except for experimental features)".

That seems like a rather large carve-out. Fortunately a lot of use cases won't really need to worry about something like that (e.g. exploratory analysis), in the context of it breaking your company.

So Pandas 1.0's recommended API for strings is unstable. Seems strange.

Could we collect some recommendations for really good books, online guides, tutorials, and recipes for current Pandas?

There are quite a few complaints here about the interface being confusing and difficult to use, and I feel like some of this is due to there being significant differences between versions. I would love to read a medium-length online free tutorial on Pandas 1.0, but it seems like most of what turns on up google are short idiosyncratic tutorials on specific tasks in various versions.

Pandas is my least favorite necessary evil. It's always changing, far too expansive API costs me about an hour a week.

Whenever one can use a utc epoch column for time indexed data in a raw numpy array instead, one should.

"We’ve added to_markdown() for creating a markdown table"

That's awesome!

No mention of vaex?


It has a cleaner, leaner API + the ability to use memory-mapped files.

I've been waiting for this release for years and I hoped for one thing and one thing only - for pandas to have a proper way of dealing with NULLs. And it does have it... OPTIONALLY.

It's great that the whole thing with extension arrays, custom types etc. has lead to this, but when the devs have, after 10+ years, the biggest chance for a backward incompatible change, this is the one to make. By making it optional, they are fixing it for the very few that know of its existence.

I love pandas and a sizeable part of my career depended on it - and while I don't use it anymore (partly because of the NULLs), I wish it the best and I hope there will be a future release with this breaking change.

Assuming you want to use python. What’s the alternative to pandas? Kind of a brew up your own code kind of thing? Csv module I guess?

Dask is one option.

Wildly off topic, for which I apologize, but whenever I see “Dask”, I think of this: https://en.wikipedia.org/wiki/DASK . You’re not going to manage any large datasets on that! :-)

Well, depends on what you want to achieve.

one of the best python | data libraries out there. For analysis alone, no tool comes close.

This is a huge accomplishment. The maintainers work very hard at keeping data science running for a substantial chunk of the field. Congratulations!

Congratulations to the Pandas team. You lot have saved my bacon so many times over the last few years, I owe you many breakfasts.

Long live the King.

"pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language."

If anyone else is wondering what this is. (Source: project homepage

The only true part is that it’s open source and BSD licensed. It’s one of those unfortunate libraries (like matplotlib) that are both very useful and not very fun to use.

I have 30K line of codes in production using pandas 0.23.4. Should I consider updating.

Do you plan to stay at 0.23.4 forever? Then no. Else, yes.

Any word for when the RC will be "properly" released?

convert_dtypes is pretty nice. I wonder how soon the new dtypes will be the defaults.

Applications are open for YC Summer 2020

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact