
Pandas 1.0 - kylebarron
https://pandas.pydata.org/pandas-docs/stable/whatsnew/v1.0.0.html
======
I've had to dive into the pandas code over the last year for a project [0],
and my attitude has shifted dramatically from...

    
    
      * old attitude: why does pandas have to make things so hard
      * new attitude: pandas has a crazy difficult job
    

I think this is most apparent in the functions that decide what dtype a
Block (the most basic unit that stores data in pandas) should be.

[https://github.com/pandas-dev/pandas/blob/4edcc5541ff3f6470f5e3c083cb83136119e6f0c/pandas/core/internals/blocks.py#L2973](https://github.com/pandas-dev/pandas/blob/4edcc5541ff3f6470f5e3c083cb83136119e6f0c/pandas/core/internals/blocks.py#L2973)

And then, for the ubiquitous object dtype, pandas often has to figure out
which of the many possible more specific types to cast it to.

If you think that is easy, ask yourself what this outputs:

    import numpy as np
    np.array([np.nan, 'a'])

Lo and behold, it produces an array where np.nan has been converted to the
string "nan".

And yet

    import pandas as pd
    pd.Series([np.nan, "a"])

knows this, has your back, and does not stringify it.

It also has a pathological fixation on _when_ it tries to convert dtypes,
since avoiding all the bad conversion outcomes is a relatively time-intensive
process (compared to, e.g., creating a numpy array).

I realize things could be much easier in pandas' user-facing interface, but I
really appreciate the sheer amount of effort that has gone into its dtype
wrangling.

[0]: [http://github.com/machow/siuba](http://github.com/machow/siuba)
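Pieces of that inference machinery are exposed publicly through
`pd.api.types`; a minimal sketch of the example above (behavior as of pandas
1.0):

```python
import numpy as np
import pandas as pd

# The Series stores the data as object dtype...
s = pd.Series([np.nan, "a"])
print(s.dtype)  # object

# ...but pandas can still infer a more specific type, skipping NaN by default
print(pd.api.types.infer_dtype(s))  # string
```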

~~~
tel
I really, really dislike all the dtype wrangling and how those choices
resonate throughout the API. I understand that a lot of work has been done to
make that API "work", but in practice it feels like that effort would have
been better avoided by changing expectations and interfaces.

Now, to be clear, that's a hard problem. Heterogeneous named bags of
homogeneous columns with a variety of data types, storage patterns, and ideas
about missingness aren't an easy domain... but instead of trying to make
everything work by hammering 6+ semi-coherent interfaces (indices, databases,
mutability, immutability/chaining, numpy, dataframes) together, I'd be
willing to pay a lot more in verbosity and explicitness for something simple.

    
    
    pd.Series(str, [np.nan, "a"])            => ["nan", "a"]  # or even an exception!
    pd.Series(nullable(str), [np.nan, "a"])  => [nan, "a"]
    
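For what it's worth, pandas 1.0's opt-in nullable string dtype lands close to
the second line of this wish list (it is still labeled experimental):

```python
import numpy as np
import pandas as pd

# Opting in to the new StringDtype keeps the missing value as <NA>
# instead of stringifying it
s = pd.Series([np.nan, "a"], dtype="string")
print(s.dtype)      # string
print(s.isna()[0])  # True
```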

Indexing is vastly over-designed. GroupBy is a very common API, yet it is
poorly documented and just _weird_, in no small part due to attempts at dtype
inference. Foundational, useful concepts like categories feel bolted on. And
there are join, merge, pivot, and pivot_table.

I'd chalk this all up to just being "hard", but at the same time I can go
pick up R's dplyr library and get a very nice existence proof of how a nicer
interface could work. Not to say dplyr has it all figured out, but it's a
night-and-day improvement over Pandas.

Pandas is great. It makes doing data science in Python vastly less of a chore
than working with straight Numpy. It steals some great ideas and tries out a
few interesting ones of its own... but it is far from a joy to work with.

~~~
antipaul
Hadley had the discipline to let go and start from scratch two, if not three,
times before getting it perfect with `dplyr`.

plyr came before, and I think there was something else.

That's how we got the amazing `dplyr`.

I think pandas is well liked by those who move from C++ or Java, but disliked
by those who move from R.

~~~
dm319
I agree - R has performant and robust dataframe functions. dplyr is great for
small-to-medium datasets, and data.table seems really performant for larger
ones.

~~~
bart_spoon
And now there is dtplyr, which simply creates a data.table back end with dplyr
syntax on the front end.

------
_coveredInBees
Great accomplishment, and kudos to the dedicated maintainers. That being
said, I've always had a love-hate relationship with pandas. It is a very
powerful library and does a ton, but the API is all over the place, and
unless you use it regularly for a long period of time it is almost impossible
to stay fluent with it. Every time I am away from it for a couple of months,
I find even the most basic things complicated/confusing and find myself on
stackoverflow way too often.

By comparison, the API of something like Pytorch is an absolute pleasure to
use, and even though I'm not using it all the time, I have almost no trouble
every time I get back to training models/trying out new things in Pytorch.

All that being said, this is definitely a step in the right direction and
hopefully the API gets a bit more coherent over time.

~~~
appleiigs
I started using Julia recently. It seems like Julia has been able to take the
good parts of Python and iron out the quirks. For example, I'm guessing the
Julia DataFrame library is a knock-off of Pandas, but the syntax is more
intuitive and concise - and I can remember it.

As for Julia itself, the syntax is very similar to Python but doesn't have
the weird lambda functions. It has the Javascript-style arrow for short
anonymous functions and the Ruby-style "do" for longer ones. And finally,
Julia is fast. I have a python/pandas script that takes 3 days to run; I'm
moving it over to Julia now.

~~~
spectramax
Julia doesn't feel production-ready at all. It's fine for messing around in
notebooks, but I would never recommend it for production use. Not even at
gunpoint.

Debugger support is almost non-existent. Using the Atom/Juno IDE is a D-grade
experience. Julia offers little help in debugging problems - errors are,
almost without fail, completely tangential to what the real issue is.

Julia takes forever to start. The syntax was wonderful in the pre-0.4 days;
now it just looks absolutely jarring to my eyes. And Julia's speedups can be
matched by using many far better technologies - Numba, Numpy, Cython, PyPy,
etc.

Julia's ecosystem of libraries is a deserted land - that's expected for a new
language. I hope it improves, but the core Julia experience needs to improve
first.

On the other hand, I've been following Rust's development, and its developers
made absolutely sure from the get-go to focus on errors that show what the
actual problem is, provide a stack trace, and point to where the problem
started. Julia absolutely sucks at this.

~~~
endoftime5
Just wanted to jump in here and say that for numerical computing, Julia's
ecosystem of libraries is absolutely fantastic! DifferentialEquations.jl
alone was worth switching to Julia for!

Although I spent a long time optimising numerical code with Numba, the
speedup I got (whilst significant) wasn't really comparable to the speed of a
Julia implementation.

~~~
spectramax
I love some of their libraries, as you mentioned. It's just that when the
language itself is painful to write and debug, the whole value proposition is
diminished.

The core developers of Julia are very smart folks who want to build a great
language that's fast and easy to use. But they missed the opportunity to
tighten the syntax, provide useful exception messages (just look at Rust! an
exception there is a thing of art, it is beautiful), and generally provide
good documentation.

For example, just creating a local Julia registry requires significant
overhead and time investment; spinning up a registry should not take more
than 30 minutes.

All these aspects of Julia are prohibitive, and in my opinion Julia should
not be used in any company or for production use until perhaps version 2.
People who have dealt with Julia issues will tell you the truth - not the
academics or researchers. The people who maintain infrastructure/maintenance
support for Julia apps are almost ready to quit their jobs. It sucks that
bad.

~~~
byt143
>The people that maintain infrastructure/maintenance support for Julia apps
are almost ready to quit their jobs. It sucks so bad.

Any specific examples?

------
kmax12
I know I'm not the only one, but it's hard to imagine doing my job the last
several years without Pandas. Even though Pandas has been used in production
by many people as basically a 1.0.0 release for a long time, this is an
amazing milestone, and I think everyone in my office smiled when they saw the
release news.

I think it's worth acknowledging the great stewardship of the community by
all the Pandas developers (and the rest of the people in the PyData
ecosystem). It has been an inspiration for me as I create and contribute to
open source libraries for data science [0][1].

[0]
[https://github.com/FeatureLabs/featuretools/](https://github.com/FeatureLabs/featuretools/)
[1]
[https://github.com/FeatureLabs/compose/](https://github.com/FeatureLabs/compose/)

~~~
alexpetralia
I basically owe my career to pandas - it made me the go-to resource for any
data analysis that couldn't be done in Excel. Once I became useful in that
respect, I was strongly encouraged to further develop my programming
knowledge.

------
jzwinck
I am looking forward to a decade of fewer API-breaking changes. However, 1.0
introduces a new column type for strings, recommends it over the old "object"
column type, yet says it is "Experimental and may change at any time."

How are we supposed to interpret this in light of the promise that there will
be no more API breakages until 2.0? It reads as if this promise does not apply
to string data, which impacts rather a lot of use cases.

~~~
jellyksong
It states in the deprecation policy that "API-breaking changes will be made
only in major releases (except for experimental features)".

~~~
s_Hogg
That seems like a rather large carve-out. Fortunately, a lot of use cases
(e.g. exploratory analysis) won't really need to worry about something like
that breaking things for your company.

------
ppod
Could we collect some recommendations for really good books, online guides,
tutorials, and recipes for current Pandas?

There are quite a few complaints here about the interface being confusing and
difficult to use, and I feel like some of this is due to there being
significant differences between versions. I would love to read a
medium-length free online tutorial on Pandas 1.0, but it seems like most of
what turns up on google is short, idiosyncratic tutorials on specific tasks
in various versions.

~~~
squaresmile
I find the pandas tutorials great:

[https://pandas.pydata.org/pandas-docs/stable/getting_started/index.html](https://pandas.pydata.org/pandas-docs/stable/getting_started/index.html)

[https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html)

------
alpineidyll3
Pandas is my least favorite necessary evil. It's always changing, and its far
too expansive API costs me about an hour a week.

Whenever you can use a utc epoch column in a raw numpy array for time-indexed
data instead, you should.
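A minimal sketch of that approach, with hypothetical data (epoch seconds kept
in a plain int64 array):

```python
import numpy as np

# Hypothetical time-indexed data: UTC epoch seconds alongside values
ts = np.array([1577836800, 1577923200, 1578009600], dtype="int64")
vals = np.array([1.0, 2.0, 3.0])

# Range selection with a boolean mask instead of a DatetimeIndex
mask = (ts >= 1577900000) & (ts < 1578100000)
print(vals[mask])  # [2. 3.]
```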

------
iandinwoodie
"We’ve added to_markdown() for creating a markdown table"

That's awesome!

------
snicker7
No mention of vaex?

[https://vaex.readthedocs.io/en/latest/](https://vaex.readthedocs.io/en/latest/)

It has a cleaner, leaner API + the ability to use memory-mapped files.

------
drej
I've been waiting for this release for years, and I hoped for one thing and
one thing only - for pandas to have a proper way of dealing with NULLs. And
it does have it... OPTIONALLY.

It's great that the whole extension-arrays and custom-types effort has led to
this, but when the devs have, after 10+ years, their biggest chance for a
backward-incompatible change, this is the one to make. By making it optional,
they are fixing it only for the very few who know of its existence.
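For context, the new behavior is opt-in per dtype; a small sketch of the
difference (as of pandas 1.0):

```python
import pandas as pd

# Default behavior: an integer column with a missing value is silently
# upcast to float64
s = pd.Series([1, 2, None])
print(s.dtype)  # float64

# The opt-in nullable integer dtype keeps integers and uses pd.NA
s2 = pd.Series([1, 2, None], dtype="Int64")
print(s2.dtype)         # Int64
print(s2.isna().sum())  # 1
```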

I love pandas and a sizeable part of my career depended on it - and while I
don't use it anymore (partly because of the NULLs), I wish it the best and I
hope there will be a future release with this breaking change.

------
mrfusion
Assuming you want to stay in Python, what's the alternative to pandas? Kind
of a brew-up-your-own-code thing? The csv module, I guess?

~~~
pottertheotter
Dask is one option.

~~~
Pinus
Wildly off topic, for which I apologize, but whenever I see “Dask”, I think of
this: [https://en.wikipedia.org/wiki/DASK](https://en.wikipedia.org/wiki/DASK)
. You’re not going to manage any large datasets on that! :-)

------
dzonga
One of the best python data libraries out there. For analysis alone, no tool
comes close.

------
natalyarostova
This is a huge accomplishment. The maintainers work very hard at keeping data
science running for a substantial chunk of the field. Congratulations!

------
lapnitnelav
Congratulations to the Pandas team. You lot have saved my bacon so many times
over the last few years, I owe you many breakfasts.

Long live the King.

------
mmahemoff
"pandas is an open source, BSD-licensed library providing high-performance,
easy-to-use data structures and data analysis tools for the Python programming
language."

If anyone else is wondering what this is. (Source: project homepage.)
~~~
smabie
The only true part of that is that it's open source and BSD-licensed. It's
one of those unfortunate libraries (like matplotlib) that are both very
useful and not very fun to use.

------
louis8799
I have 30K lines of code in production using pandas 0.23.4. Should I consider
updating?

~~~
eesmith
Do you plan to stay at 0.23.4 forever? Then no. Else, yes.

------
this_is_not_you
Any word on when the RC will be "properly" released?

------
squaresmile
convert_dtypes is pretty nice. I wonder how soon the new dtypes will be the
defaults.
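A quick sketch of what convert_dtypes does (behavior as of 1.0, where the new
extension dtypes are still opt-in):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, None], "b": ["x", None, "z"]})
print(df.dtypes["a"], df.dtypes["b"])  # float64 object

# convert_dtypes infers the best nullable extension dtype per column
converted = df.convert_dtypes()
print(converted.dtypes["a"], converted.dtypes["b"])  # Int64 string
```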

