

Pandas 0.4 (Python data analysis library) released - wesm
http://pandas.sourceforge.net

======
ims
This looks really well done.

I wonder what the motivation is to do this when R is so mature (especially in
the availability of specialized packages), and available through RPy.

~~~
wesm
Here's an article I wrote about some of my motivations on the Python side of
things: <http://news.ycombinator.com/item?id=2790762>

The bigger picture reason "why not R" is that R is not very suitable for
building production systems. I started building this library while working for
AQR, a quant hedge fund, and needed to have statistical computing building
blocks integrated with a much larger system. R is a mediocre programming
language with very weak general-purpose libraries, though it does have
amazingly good data visualization and mature statistics libraries. Using R as a black box
(e.g. via RPy, Rcpp, or RJava) is a good idea in theory, but recovering from
and dealing with errors/exceptions with real world data is a very thorny
problem. Plus maintaining a big pile of R code is kind of a nightmare (believe
me, been there, done that!).

~~~
pmiller2
On a related note, I've said _exactly_ the same thing about Matlab to people
before. It (Matlab) is good for getting the algorithms and calculations
correct, but as a programming language, it's pretty terrible.

------
colanderman
Describing "DataFrame" as two-dimensional almost made me discount the
entire project. DataFrames actually seem to be multidimensional (they are
essentially OLAP relations), and you should advertise them as such. That
they can be viewed as two-dimensional tables is incidental.

~~~
wesm
Thinking of DataFrame as a 2D structure just aids mental visualization. With
hierarchical indexing they can be arbitrarily high-dimensional, so maybe I
should sell it a bit more like that. In an OLAP setting the "sparse" format
may often be better than the "dense" (truly N-D) format.
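
A minimal sketch of the "sparse" versus "dense" distinction, assuming a present-day pandas API (the 0.4 release discussed here may differ) and made-up data. The names `sparse_form` and `dense_form` and the city/year values are hypothetical, chosen only for illustration:

```python
import pandas as pd

# Hypothetical observations: only 3 of the 4 possible (city, year) cells exist.
index = pd.MultiIndex.from_tuples(
    [("NYC", 2010), ("NYC", 2011), ("SF", 2011)],
    names=["city", "year"],
)

# "Sparse" (long) form: one row per observed cell, dimensions live in the index.
sparse_form = pd.Series([1.0, 2.0, 3.0], index=index, name="value")

# "Dense" form: one axis per dimension; unobserved cells are filled with NaN.
dense_form = sparse_form.unstack("year")
```

The long form stores three values, while the dense form allocates all four cells and marks the missing one as NaN, which is the trade-off that grows with the number of dimensions.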

It would be an interesting avenue to pursue building a "big data" on-disk OLAP
engine with pandas-like semantics (e.g. expressing groupby operations with the
same syntax but operating on big data on disk or across a cluster of
computation nodes).
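
As a rough illustration of that idea (not an existing pandas feature, and not the author's design), the sketch below applies the usual in-memory groupby syntax to each chunk and then combines the partial results. It assumes a present-day pandas API; the helper `chunked_group_sum` and the sample data are hypothetical:

```python
import pandas as pd


def chunked_group_sum(chunks, key, value):
    """Sum `value` per `key` across an iterable of DataFrame chunks."""
    # Aggregate each chunk independently with the ordinary groupby syntax.
    partials = [chunk.groupby(key)[value].sum() for chunk in chunks]
    # Sums are associative, so the partial sums can be grouped and summed again.
    return pd.concat(partials).groupby(level=0).sum()


# Hypothetical data, small enough to check by hand.
df = pd.DataFrame({"key": ["a", "b", "a", "b", "c"], "value": [1, 2, 3, 4, 5]})

in_memory = df.groupby("key")["value"].sum()
chunked = chunked_group_sum([df.iloc[:2], df.iloc[2:]], "key", "value")
```

The two-step combine only works this simply for aggregations such as sum, count, min, and max; a mean, for example, would need to carry sums and counts separately. In practice the chunks could come from something like `pd.read_csv(..., chunksize=...)` rather than slices of an in-memory frame.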

~~~
colanderman
No, they are multidimensional even without hierarchical indexing. e.g.:

    x  y  z  value
    0  0  0  32
    0  0  1  64
    0  1  0  23
    0  1  1  3.14
    1  0  0  4.3
    etc.

This is essentially a 3D cube of data, no hierarchical indexing involved. The
benefit of hierarchical indexing is that you can wrap your spatial dimensions
into a single real dimension for e.g. code abstraction.
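
A minimal sketch of this equivalence, assuming present-day pandas and NumPy. It uses only the five rows quoted above (the rest are elided in the comment, so those cells stay NaN), and the 2 x 2 x 2 shape and the names `flat` and `cube` are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# The flat table: three plain coordinate columns plus a value column.
flat = pd.DataFrame(
    {
        "x": [0, 0, 0, 0, 1],
        "y": [0, 0, 1, 1, 0],
        "z": [0, 1, 0, 1, 0],
        "value": [32.0, 64.0, 23.0, 3.14, 4.3],
    }
)

# Scatter the rows into a dense 3D array, using the columns as coordinates.
# Cells with no corresponding row remain NaN.
cube = np.full((2, 2, 2), np.nan)
cube[flat["x"].to_numpy(), flat["y"].to_numpy(), flat["z"].to_numpy()] = (
    flat["value"].to_numpy()
)
```

No hierarchical index is involved here: the ordinary columns already identify each cell of the cube.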

I have actually been developing a similar library for OCaml (even with
hierarchical indexing). It is good to see our libraries share many of the same
ideas! I wonder, though, have you considered GPU acceleration? AFAIK neither
Matlab nor R does this natively yet.

------
achompas
You can also catch wesm talk about the pandas package at AOL HQ in New York
this Wednesday. I'll be there, ready to find an alternative to R.

[http://www.meetup.com/nyhackr/events/28880161/?hidePromoBar=...](http://www.meetup.com/nyhackr/events/28880161/?hidePromoBar=true)

------
phren0logy
Thanks for your work on this, it looks great. I'd love to use Python
end-to-end. Any plans to support some parallelism?

