

R and pandas and what I've learned about each - rouli
http://blog.yhathq.com/posts/R-and-pandas-and-what-ive-learned-about-each.html

======
dlib
I like Pandas and I'd like to move to Python for my data analysis. Python is a
beautiful language I can use for many other purposes than just data analysis.
I find it more readable and self-documenting than R.

Nonetheless I use R: ggplot2 is excellent and its graphs are better looking
than matplotlib's. I really like R Markdown for intermingling R code and
Markdown in one document. I suppose IPython is the Python equivalent. R
especially shines with its numerous libraries.

Anyway, for every project I debate whether to use R or Python. Perhaps I
should look into rmagic/rpy2 for IPython as a go-between.

~~~
dagw
Take a look at <https://github.com/ContinuumIO/Bokeh>. It's a project to create
a ggplot equivalent for Python. It's still quite beta and not feature
complete, but it's under active development by a company working in the Python
data analysis field and definitely worth a look.

~~~
hernamesbarbara
Very cool. I'd love to see ggplot grammar come to Python, so I'm excited to
check this out.

I found this link on porting ggplot2 styles into Python, but it doesn't focus
on grammar.
<http://tonysyu.github.com/mpltools/auto_examples/style/plot_ggplot.html>

------
teoruiz
To install the whole set of Python modules needed, plus IPython, in a
virtualenv (trick: there is no "pylab" module to install):

    
    
      % virtualenv --distribute --no-site-packages pandas_venv
      [blahblah]
      % . pandas_venv/bin/activate
      (pandas_venv) % easy_install readline # Probably only needed on Mac OS X for IPython to behave
      [blahblah]
      (pandas_venv) % pip install ipython
      [blah blah]
      (pandas_venv) % pip install numpy
      [lots of blahblah]
      (pandas_venv) % pip install matplotlib
      [quite a bit of blahblah]
      (pandas_venv) % pip install pandas
      [some more blah blah]
      (pandas_venv) % pandas_venv/bin/ipython --no-banner
      
      In [1]: import numpy as np
      
      In [2]: import pandas as pd
      
      In [3]: import pylab as pl
      
      In [4]:

~~~
takluyver
Installing numpy with pip isn't recommended: it might not work (if you don't
have the necessary development headers to compile it), the resulting numpy
might be slower (if it hasn't managed to compile against properly optimised
libraries) and compiling from source isn't a very quick way to install it.

For most users, the easiest way to get set up is a complete Python
distribution, like Anaconda, EPD or Python(x,y). See the Scipy Stack
installation page:

<http://scipy.github.com/install.html>

~~~
jmmcd
I have wasted so many hours failing to get everything I need to work on OSX
10.6. I tried pip, easy_install, install from source, brew, Enthought. Now
Anaconda has just saved my life, thank you! It was the best moment of my day
when I was able to type all this without error:

    
    
        >>> import numpy as np
        >>> import scipy as sp
        >>> import statsmodels as sm
        >>> import matplotlib as mpl
        >>> import pandas as pd
        >>> import networkx as nx
        >>> import sklearn as sk
        >>> import nltk
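
The same check can be run in one go by looping over the package names and reporting which ones import cleanly. This is only a sketch: the helper name `importable` is made up for illustration, and the package list is simply the one typed in the session above.

```python
import importlib

# Package names taken from the interactive session above.
STACK = ["numpy", "scipy", "statsmodels", "matplotlib",
         "pandas", "networkx", "sklearn", "nltk"]


def importable(names):
    """Return {name: True/False} depending on whether each module imports."""
    results = {}
    for name in names:
        try:
            importlib.import_module(name)
            results[name] = True
        except Exception:
            # ImportError is the usual failure; a broken binary build can
            # raise other errors, so anything raised counts as "not working".
            results[name] = False
    return results


if __name__ == "__main__":
    for name, ok in importable(STACK).items():
        print("%-12s %s" % (name, "ok" if ok else "MISSING"))
```

Run as a script inside the environment you want to test; any line marked MISSING is a package that still needs attention.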

------
pav3l
I am in the process of migrating all of my analyses from Matlab and R to
Python. I have been meaning to do this for quite some time, and pandas is
finally mature enough to completely replace both Matlab and R for
straightforward tasks. If I need something Python doesn't offer, it's still
fairly simple to do isolated tasks elsewhere. For me the biggest reasons for
the change are easy integration with the web and better language features (in
Matlab's case; the R language is great, just terribly slow for intensive
tasks).

What I miss the most:

- Matlab <--> Excel link (on Windows) - an Excel add-on that lets you send
arrays back and forth very easily. You need a spreadsheet when you work with
datasets, and interchanging data through files just isn't that convenient.

- Matlab's IDE features (debugging, documentation, publishing, variable
inspection).

- ggplot2

~~~
hernamesbarbara
Thanks for the comment! We've actually been thinking about some of these ideas
too. There's a Yhat Excel plug-in in the works, so stay tuned! Should be
available shortly.

------
jmduke
The creator of pandas wrote a book, _Python for Data Analysis_, which covers
NumPy and pandas. I found it an excellent primer.

<http://oreilly.com/shop/product/0636920023784.html>

~~~
xradionut
I'm enjoying working through this book. I admit I got it to fulfill a need: to
quickly analyze huge text files before squeezing them into an RDBMS and data
warehouse for "proper" analysis and reporting. (Which is a more time-consuming
process requiring all sorts of meetings, approvals and effort.) A couple of
fellow MS-DBAs threw a "shit-fit" when they saw it on my desk. (NIH Stockholm
syndrome?)

------
darkxanthos
One huge thing pandas does damn well is Google Analytics integration. It has
been a godsend in this way.

------
crayola
Interesting analysis, but it would really benefit from a section about
data.table. For me and many others, data.table has almost completely replaced
data.frame (of which data.table is a subclass) and completely replaced plyr.
The speed and ease of use of data.table compare much more favorably with
pandas than the R tools mentioned here do.

~~~
wesm
Comparisons with data.table on performance are much more favorable than with
vanilla R or plyr; a lot of progress has been made in the last couple of
years, too. I personally find the data.table syntax to be a bit obtuse at
times but it's a great library.

~~~
oddthink
Aside from the performance differences, data.table makes it very easy to do
interactive manipulation, at the cost of making it hard to program. Pandas
currently goes in the opposite direction.

I'd rather have R/data.table at the prompt and python/pandas in my script, but
if you have to err on one side, the python/pandas "low magic" is the side to
err on. Pandas does have its own strange corners, though. For example, it
seems like it tries hard to stick similar-typed columns into contiguous
matrices, which leads to some unexpected casting, and I have no idea what the
supposed benefit is over just keeping distinct columns.
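
The comment does not say which cast was surprising, so the following is only a guess at the kind of thing meant: two well-known cases where an integer column comes back as float. The column names and values are invented for illustration.

```python
import pandas as pd

# Hypothetical data: one integer column, one float column.
df = pd.DataFrame({"count": [1, 2, 3], "price": [0.5, 1.5, 2.5]})

# Looked at column by column, each keeps its own dtype.
column_dtypes = df.dtypes

# Pulling out one row, or the whole frame as an array, needs a single
# common dtype, so the integer values are upcast to float.
row = df.iloc[0]
as_array = df.values

# Reindexing onto a label that does not exist introduces a missing value;
# a NumPy integer column cannot hold NaN, so it is upcast as well.
widened = df.reindex([0, 1, 2, 3])

print(column_dtypes)
print(row.dtype, as_array.dtype, widened["count"].dtype)
```

Checking `df.dtypes` before and after an operation is a cheap way to spot this sort of change.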

~~~
takluyver
I'd guess the benefits are related to performance - Wes is known as something
of a speed junkie (see also his vbench project). I know there's quite a bit of
code in pandas that makes it much faster than a naive implementation of a
similar interface.

That said, if it causes unexpected behaviour, check to see whether it's a bug.

------
aidos
That's useful. Hadn't really looked at Pandas before.

Slightly OT:

I'm using in-memory sqlite3 with rtree to find objects within bounds in a 2D
space. Is there a different library people would recommend for this in Python?
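
For readers who have not seen that setup, here is a minimal sketch of it using only the standard library. It assumes "within bounds" means "overlaps the query rectangle"; the table name, column names and helper functions are hypothetical, and the fallback branch is there only because not every SQLite build includes the R*Tree module.

```python
import sqlite3


def build_index(boxes):
    """Load (id, min_x, max_x, min_y, max_y) tuples into an in-memory index."""
    conn = sqlite3.connect(":memory:")
    try:
        # R*Tree virtual table: first column is the id, then min/max pairs.
        conn.execute(
            "CREATE VIRTUAL TABLE idx USING rtree(id, min_x, max_x, min_y, max_y)"
        )
    except sqlite3.OperationalError:
        # R*Tree not compiled into this SQLite: use a plain table instead.
        # The query below still works, just without the spatial index.
        conn.execute(
            "CREATE TABLE idx (id INTEGER PRIMARY KEY, "
            "min_x REAL, max_x REAL, min_y REAL, max_y REAL)"
        )
    conn.executemany("INSERT INTO idx VALUES (?, ?, ?, ?, ?)", boxes)
    return conn


def overlapping(conn, min_x, max_x, min_y, max_y):
    """Return ids of stored boxes that overlap the query rectangle."""
    rows = conn.execute(
        "SELECT id FROM idx "
        "WHERE max_x >= ? AND min_x <= ? AND max_y >= ? AND min_y <= ? "
        "ORDER BY id",
        (min_x, max_x, min_y, max_y),
    )
    return [row[0] for row in rows]


conn = build_index([
    (1, 0.0, 1.0, 0.0, 1.0),
    (2, 5.0, 6.0, 5.0, 6.0),
    (3, 0.5, 5.5, 0.5, 5.5),
])
print(overlapping(conn, 0.0, 2.0, 0.0, 2.0))
```

Note that the R*Tree module stores coordinates as 32-bit floats, so values that are not exactly representable may be rounded slightly outward.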

------
bsg75
Are there any comments as to the maturity of Pandas as compared to R?

I am used to the Python syntax, and while R is another language to learn, my
assumption is that for data analysis its age compared to Pandas implies
stability.

I could of course be wrong.

~~~
takluyver
I've not had any problems with pandas' stability. Where the age difference
shows is in the availability of specific statistical methods. The Python
package 'statsmodels' is working on that, but it's some way behind R.

------
xyjprc
Learning data mining this semester and find it useful. Thanks!

------
sonabinu
Thanks for posting

