

A Roadmap for Rich Scientific Data Structures in Python - wesm
http://wesmckinney.com/blog/?p=77

======
aklein
Scientific computing is dominated by low-level or system languages like
Fortran, C, and C++; and by domain-specific languages like Matlab and R.

There is no good high-level, GENERAL computing language that can do scientific
computing well. This leads to a huge gap between "research" and "production"
implementations.

Python stands to shine in this space if things are done right. pandas is a gem
of a library.

------
epistasis
It's great that somebody is looking to R for inspiration. I've tried to like
NumPy and SciPy on a couple of occasions, but find them lacking. That said,
there are a ton of ways to improve R, mostly to do with cleaning up and
homogenizing style.

I do hope that they avoid the BioConductor style of opaque objects and storing
data in hidden attributes.

------
monk_the_dog
I was happy to see "Hierarchical columns" on the features wish list. My app
uses an R data-frame-like structure (a table of heterogeneous columns with
possibly missing data). One "cute" thing I implemented was hierarchical
columns. Super useful for what I'm doing, but hard to expose to other
languages. The plan is to flatten the columns to 'parent.child' strings when
using Python.
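
For what it's worth, the flattening idea can be sketched in a few lines of
plain Python. This is only an illustration under assumptions: the nested-dict
layout, the function name flatten_columns, and the sample table are all
hypothetical and not taken from the app or from pandas.

```python
def flatten_columns(columns, sep="."):
    """Flatten a nested dict of columns into {'parent.child': values}."""
    flat = {}
    for name, value in columns.items():
        if isinstance(value, dict):
            # Recurse into the child columns, prefixing each with the parent name.
            for child_name, child_value in flatten_columns(value, sep).items():
                flat[name + sep + child_name] = child_value
        else:
            # A leaf column: keep its values untouched.
            flat[name] = value
    return flat


# Hypothetical table; None stands in for a missing value.
table = {
    "id": [1, 2, 3],
    "price": {"bid": [9.5, None, 9.7], "ask": [9.6, 9.8, None]},
}

flat = flatten_columns(table)
print(list(flat))  # ['id', 'price.bid', 'price.ask']
```

One caveat with this approach: a column name that already contains the
separator makes the flattened names ambiguous, so the separator probably needs
to be a character that never appears in a column name.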

I'm not quite at the point where I'll be wrapping this into Python. When I do
I'll take a close look at pandas. Any other recommendations? I took a quick
look at PyTables, and pandas looks better for my app.

~~~
wesm
To be honest I haven't really broached the hierarchical columns issue inside
pandas. If someone can look and suggest an implementation strategy that
doesn't interfere with the rest of the API I would be all for it.

If you have heterogeneous columns with possibly missing data, pandas is
basically the only game in town (did you see my diagram?? :) ). It's possible
to get a NumPy MaskedArray with a structured dtype to function like you want,
but it's relatively tricky to do.

------
changhiskhan
I've been using the pandas library for a long time for financial applications.
The data alignment and missing data handling features in pandas are far above
and beyond anything else I've used in similar applications. I think from a
data structure point of view it's already better than R (and FAR better than
Matlab). R/Matlab should fear the day that the pandas statistical packages
gain roughly equivalent features.

------
rcthompson
Data frames (and the associated functionality for reading and writing CSV
files) are one of the main features that make me use R over Python. Whenever I need
to manipulate, merge, slice, and dice table-like data, I turn to R. I would
love to have something feature-equivalent in Python. Although maybe rpy2 and
rnumpy are sufficient for now.

~~~
wesm
It _is_ my project, but I think pandas.DataFrame is already at 90+%
feature-equivalency (especially if you're using the current git version; a new
release is forthcoming). You don't have the integration with a million CRAN
libraries. pandas.DataFrame actually does a lot more for you in many places
than data.frame does; for example, data alignment is deeply intrinsic, whereas
it's very much a DIY affair in R.
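
To make "data alignment" concrete, here is a toy sketch in plain Python of
what matching on labels (rather than on position) means when adding two
series. It is not how pandas implements it; the function name align_add and
the sample dates are made up for illustration. In pandas itself, adding two
Series with different indexes gives you this kind of result automatically:
the union of the labels, with NaN wherever either side is missing.

```python
import math


def align_add(left, right):
    """Add two label -> value mappings, matching on label, not position."""
    result = {}
    # Walk the union of labels so no observation is silently dropped.
    for label in sorted(set(left) | set(right)):
        if label in left and label in right:
            result[label] = left[label] + right[label]
        else:
            # The label is missing on one side, so the sum is undefined.
            result[label] = math.nan
    return result


# Hypothetical daily series that only partly overlap.
a = {"2011-01-03": 1.0, "2011-01-04": 2.0, "2011-01-05": 3.0}
b = {"2011-01-04": 10.0, "2011-01-05": 20.0, "2011-01-06": 30.0}

total = align_add(a, b)
for label, value in total.items():
    print(label, value)
```

The two overlapping dates get real sums, and the dates present on only one
side come out as NaN instead of being paired with the wrong row.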

~~~
rcthompson
Yes, I do intend to try out pandas at some point.

------
ameasure
Wow, fantastic work and what an important project. I can't believe I haven't
used these libraries before.

------
rch
Without looking too closely, would h5/pytables suffice?

~~~
wesm
HDF5/PyTables is fantastic as a binary IO format. This is really all about
in-memory data manipulation / computation and how data and metadata get passed
around to functions that can take advantage of it.

------
pnathan
WebSense blocks this at work.

Is there a nice mirror somewhere? :-/

