
An Introduction to Scientific Python – Pandas - Jmoir
http://www.datadependence.com/2016/05/scientific-python-pandas/
======
mojoe
Pandas is certainly excellent -- be aware of its NA type promotion behavior
before you start designing data analysis programs, however. I learned this the
hard way:

[http://pandas.pydata.org/pandas-
docs/stable/gotchas.html#nan...](http://pandas.pydata.org/pandas-
docs/stable/gotchas.html#nan-integer-na-values-and-na-type-promotions)
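
A minimal sketch of the promotion described on that page, assuming a reasonably recent pandas. The Series and its values are made up for illustration:

```python
import pandas as pd

# Hypothetical data: an integer column with no missing values.
counts = pd.Series([1, 2, 3], dtype="int64")
print(counts.dtype)

# Reindexing adds a label (3) that has no value, so pandas stores NaN there.
# NaN is a float, so the whole column is promoted from int64 to float64.
promoted = counts.reindex([0, 1, 2, 3])
print(promoted.dtype)
print(promoted.tolist())
```

The integers survive as floats (1.0, 2.0, 3.0), which is easy to miss until something downstream expects an integer column.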

~~~
IndianAstronaut
Another gotcha is variable type inference. Reading csv files can often produce
varying column types. This can be a pain for any consistent data pipeline.
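
A small sketch of how the inferred type can change with the file contents, and one way to pin it. The CSV text, the `value` column, and the `value_dtype` helper are all hypothetical:

```python
import io

import pandas as pd


def value_dtype(csv_text, **kwargs):
    """Read CSV text and return the dtype pandas chose for the 'value' column."""
    return pd.read_csv(io.StringIO(csv_text), **kwargs)["value"].dtype


# Three hypothetical files with the same schema but slightly different contents.
samples = {
    "all_ints": "id,value\n1,10\n2,20\n",
    "one_blank": "id,value\n1,10\n2,\n",         # empty field is read as NaN
    "one_word": "id,value\n1,10\n2,unknown\n",   # stray text in a numeric column
}

# Let pandas infer the type: the result depends on what is in each file.
inferred = {name: value_dtype(text) for name, text in samples.items()}
for name, dtype in inferred.items():
    print(name, dtype)

# Passing dtype= pins the column type instead of relying on inference.
pinned = value_dtype(samples["all_ints"], dtype={"value": "float64"})
print("pinned", pinned)
```

Pinning the type with `dtype=` makes a pipeline fail loudly on unexpected values (such as the `unknown` row) instead of silently switching column types.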

------
lottin
As an R user I noticed a couple of oddities. First,

    len(df)

returns the number of rows rather than the number of columns. This strikes me
as a bad idea, because data-frames are better thought of as a collection of
columns. Typically you want to loop over the columns of a data-frame and not
so much over its rows, which is performance-wise much more costly.

Second, the apply method seems totally redundant. Why call a method that calls
a function when you can simply call the function directly?

    df['year'] = base_year(df.water_year)

Probably I'm missing something here.

~~~
bicubic
> This strikes me as a bad idea, because data-frames are better thought of as
> a collection of columns

The dataframe is a collection of _records_, so the len operator tells you how
big the dataset you're dealing with is. You also have len(df.columns) and
df.shape.
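
A quick sketch of those three, with a made-up two-column frame (the column names and numbers are illustrative only):

```python
import pandas as pd

# Hypothetical frame: three records, two columns.
df = pd.DataFrame({
    "water_year": ["1980/81", "1981/82", "1982/83"],
    "rain_mm": [1182, 1098, 1156],
})

n_rows = len(df)             # number of records
n_cols = len(df.columns)     # number of columns
shape = df.shape             # (rows, columns) in one call

# Iterating over the frame itself still walks the column labels, not the rows.
labels = list(df)

print(n_rows, n_cols, shape, labels)
```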

> Second, the apply method seems totally redundant

df.water_year refers to a column. You can certainly use the syntax you wrote,
provided you crafted a function that manipulates a whole column in some way.
E.g. if you had a function that returns the first 2 elements of what it was
given, passing a column to that function would return a view into that column
with only the first 2 rows. Passing the same function to apply would process
every element in the (string) column and return its first 2 letters, finally
returning a brand new column where each row is the first 2 letters of the
corresponding row of the input.
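
The same distinction as a sketch. `first_two` and the sample strings are hypothetical:

```python
import pandas as pd


def first_two(x):
    """Return the first two elements of whatever is passed in."""
    return x[:2]


# Hypothetical string column.
col = pd.Series(["alpha", "bravo", "charlie"])

# Called directly, the function sees the whole column and slices its rows.
direct = first_two(col)

# Through apply, the function sees one element at a time and slices each string.
per_element = col.apply(first_two)

# For this particular case pandas also has a vectorised string accessor.
vectorised = col.str[:2]

print(direct.tolist())
print(per_element.tolist())
print(vectorised.tolist())
```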

Both of these behaviours make perfect sense if you think about them in terms
of the Python and NumPy conventions that Pandas is built on.

~~~
lottin
Thanks for clearing that up, now it does make sense. In R most functions
handle vectors as well as scalars without distinction, so normally one would
use the function directly. Whereas if you wanted to process each element of a
vector individually _then_ you'd use apply(). It works the other way around.

~~~
_Wintermute
Well that's because R doesn't have scalars, just vectors containing a single
value.

------
visarga
I usually do this kind of processing with Linux pipes: head, tail, cut, sort,
uniq, and inline Perl. It is kind of similar to using monads, but you have to
handle the formatting to and from text. I also use a few tools of my own
creation: one for counting and one for generating histograms in text. I often
chain 5 or 10 of these commands together. My basic data type is similar to
CSV, but uses "|" instead of a comma as the separator because it tends not to
appear in text as much. On the other hand, not being put in a binary format,
my data is very accessible.

~~~
jeffwass
It's really too bad that the ASCII codes 29, 30, and 31 (Group, Record, and
Unit separators) never took off, as this is exactly what they were designed
for.

When implemented, they'd let you include commas, line feeds/carriage returns,
etc within your data records.
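
A rough sketch of the idea, using ASCII 31 between fields and ASCII 30 between records. The `encode` and `decode` helpers and the sample rows are hypothetical, and the scheme assumes the data never contains the separator bytes themselves:

```python
UNIT_SEP = "\x1f"    # ASCII 31, between fields
RECORD_SEP = "\x1e"  # ASCII 30, between records


def encode(rows):
    """Join fields with the unit separator and records with the record separator."""
    return RECORD_SEP.join(UNIT_SEP.join(fields) for fields in rows)


def decode(text):
    """Split text back into a list of records, each a list of fields."""
    return [record.split(UNIT_SEP) for record in text.split(RECORD_SEP)]


# Fields containing commas, pipes, semicolons and newlines need no quoting.
rows = [
    ["Smith, John", "line one\nline two"],
    ["a|b", "c;d"],
]

encoded = encode(rows)
decoded = decode(encoded)
print(decoded == rows)
```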

~~~
stinos
_they'd let you include commas, line feeds/carriage returns, etc within your
data records_

And there would also be less ambiguity as to what separator to use. I
understand the popularity of CSV, but it's really not so nice to share data
with. German customers want semicolons as a separator, the US ones claim they
are right 'because after all it is called comma-separated and else I cannot
import it in Excel' (sic). Etc.

------
kozikow
My blog post about the most popular pandas methods:
[https://kozikow.wordpress.com/2016/07/01/top-pandas-
function...](https://kozikow.wordpress.com/2016/07/01/top-pandas-functions-
used-in-github-repos/).

Pandas is a big library and it's hard to distinguish between necessary and
nice-to-have methods. I have written 1000s of lines in pandas and I have been
working "around" some things rather than using the proper API call.

~~~
nzjrs
I don't trust your data. scipy.org is not a function.

~~~
kozikow
I will explain the methodology better. My goal was to avoid false negatives.

See the methodology description:
[https://kozikow.wordpress.com/2016/07/01/top-pandas-
function...](https://kozikow.wordpress.com/2016/07/01/top-pandas-functions-
used-in-github-repos/#Methodology) .

------
cavnerj
Pandas is very good for scientific computing and data analysis, but beware:
the documentation quite frankly sucks. Stack Overflow seems to be the best way
to learn things.

~~~
manish_gill
Been using Pandas for a few weeks and I...kind of agree. The 10-minute
tutorial etc. is fine, but as soon as you start doing more complicated stuff,
you need the API docs. And they leave much to be desired.

~~~
robochat42
I also use Pandas for some of my data analysis and I found that it took me a
long time to learn how to use it. Unlike numpy, I just couldn't remember how
to do things and had to keep looking things up. Maybe this is just because
Pandas has a lot of functionality. But I might waste half an hour trying to
write one line of code although that line would do most of my analysis.

------
frumiousirc
Pandas are a reinvention (be it a conscious one or not) of PAW "ntuples" which
have been around for at least a quarter of a century. ROOT has further evolved
them into "trees" which allow structure beyond simple tables. Both provide a
selection language akin to Pandas' filtering.

I have nothing against Pandas, but the ebullience that always comes with blog
posts about them seems to be ignorant of existing systems used every day in
scientific data analysis for the past few decades.

~~~
Jmoir
Whether they are a reinvention or not, can't I be ebullient about them? I love
new technology for example, and I get pretty ebullient about whatever new
things there are. It doesn't mean that I'm ignorant of the past that has led
up to them and it certainly shouldn't affect my thoughts on them either.

------
collyw
Curious what this will give me over a relational database.

------
thanatropism
This is typical of what passes for 'exploratory data analysis' these days:

> You can also see that UK’s rainfall is significantly less than Japan’s, and
> people say UK rains a lot!

Well duh, Japan is about 50% larger than the UK.

~~~
Jmoir
Yeah it's 1.5 times larger. It was a joke though because Japanese people
always say it never stops raining in the UK, it's like the first thing they
always say.

~~~
mercurysmessage
That's what everyone seems to say about the UK.

~~~
Jmoir
That's true haha. It rains a fair amount but it's definitely over exaggerated.

