R and pandas and what I've learned about each (yhathq.com)
103 points by rouli 1333 days ago | 23 comments

I like Pandas and I'd like to move to Python for my data analysis. Python is a beautiful language I can use for many other purposes than just data analysis. I find it more readable and self-documenting than R.

Nonetheless, I use R. ggplot2 is excellent, and its graphs are better looking than matplotlib's. I really like R Markdown for intermingling R and Markdown in your code. I suppose IPython is the Python equivalent. R especially shines with its numerous libraries.

Anyway, for every project I debate whether to use R or Python. Perhaps I should look into rmagic/rpy2 for IPython as a go-between.

Take a look at Bokeh (https://github.com/ContinuumIO/Bokeh). It's a project to create a ggplot equivalent for Python. It's still quite beta and not feature complete, but it's under active development by a company working in the Python data analysis field and definitely worth a look.

Very cool. I'd love to see ggplot grammar come to Python, so I'm excited to check this out.

I found this link on porting ggplot2 styles into python, but it doesn't focus on grammar. http://tonysyu.github.com/mpltools/auto_examples/style/plot_...

To install the whole set of Python modules needed, plus IPython, in a virtualenv (trick: there is no "pylab" module to install):

  % virtualenv --distribute --no-site-packages pandas_venv
  % . pandas_venv/bin/activate
  (pandas_venv) % easy_install readline # Probably only needed in Mac OS X for iPython to behave 
  (pandas_venv) % pip install ipython
  [blah blah]
  (pandas_venv) % pip install numpy
  [lots of blahblah]
  (pandas_venv) % pip install matplotlib
  [quite a bit of blahblah]
  (pandas_venv) % pip install pandas
  [some more blah blah]
  (pandas_venv) % pandas_venv/bin/ipython --no-banner
  In [1]: import numpy as np
  In [2]: import pandas as pd
  In [3]: import pylab as pl
  In [4]:

Installing numpy with pip isn't recommended: it might not work (if you don't have the necessary development headers to compile it), the resulting numpy might be slower (if it hasn't managed to compile against properly optimised libraries) and compiling from source isn't a very quick way to install it.

For most users, the easiest way to get set up is a complete Python distribution, like Anaconda, EPD or Python(x,y). See the Scipy Stack installation page:


I have wasted so many hours failing to get everything I need to work on OSX 10.6. I tried pip, easy_install, install from source, brew, Enthought. Now Anaconda has just saved my life, thank you! It was the best moment of my day when I was able to type all this without error:

    >>> import numpy as np
    >>> import scipy as sp
    >>> import statsmodels as sm
    >>> import matplotlib as mpl
    >>> import pandas as pd
    >>> import networkx as nx
    >>> import sklearn as sk
    >>> import nltk
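
The same check can be scripted. This is a minimal sketch using only the standard library; `report_missing` is a hypothetical helper name, not part of Anaconda or any other distribution. It relies on `importlib.util.find_spec`, which looks up a top-level module without importing it:

```python
import importlib.util

# Top-level module names from the session above.
STACK = ["numpy", "scipy", "statsmodels", "matplotlib",
         "pandas", "networkx", "sklearn", "nltk"]


def report_missing(names):
    """Return the names that cannot be found on the current import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = report_missing(STACK)
    print("missing:", ", ".join(missing) if missing else "none")
```

Finding a module this way only shows that it is installed; a broken build can still fail at import time, so the interactive imports remain the stronger test.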

The normal convention is to avoid using pylab, and instead use matplotlib directly.

Pylab is handy if you're just transitioning from Matlab, but otherwise, there's no reason to use it. It's a gigantic namespace, and all but a couple of functions are from numpy and matplotlib.pyplot.

Just do:

    import matplotlib.pyplot as plt

Instead of:

    import pylab as pl

Of course, in the end it's personal preference. As long as you don't need to know where things come from, then using pylab is fine.

I am in the process of migrating all of my analyses from Matlab and R to Python. I have been meaning to do this for quite some time, and pandas is finally mature enough to completely replace both Matlab and R for straightforward tasks. If I need something Python doesn't offer, it's still fairly simple to do isolated tasks elsewhere. For me the biggest reasons for the change are easy integration with the web and better language features (compared with Matlab, that is; the R language is great, just terribly slow for intensive tasks).

What I miss the most:

- Matlab <--> Excel link (on Windows) - an Excel add-on that lets you send arrays back and forth very easily. You need a spreadsheet when you work with datasets, and interchanging data through files just isn't that convenient.

- Matlab's IDE features (debugging, documentation, publishing, variable inspection).

- ggplot2

Thanks for the comment! We've actually been thinking about some of these ideas too. There's a Yhat Excel plug-in in the works, so stay tuned! Should be available shortly.

The creator of pandas wrote a book, Python for Data Analysis, which covers NumPy and pandas. I found it an excellent primer.


I'm enjoying working through this book. I admit I got it to fulfill a need: to quickly analyze huge text files before squeezing them into an RDBMS and data warehouse for "proper" analysis and reporting. (Which is a more time-consuming process requiring all sorts of meetings, approvals and effort.) A couple of fellow MS-DBAs threw a "shit-fit" when they saw it on my desk. (NIH Stockholm syndrome?)

One huge thing pandas does damn well is Google Analytics integration. It has been a godsend in this way.

Interesting analysis, but it would really benefit from a section about data.table. For me and many others, data.table has almost completely replaced data.frame (of which data.table is a subclass) and completely replaced plyr. The speed and ease of use of data.table are much more favorably comparable with pandas than the R tools mentioned here.

Comparisons with data.table on performance are much more favorable than with vanilla R or plyr; a lot of progress has been made in the last couple of years, too. I personally find the data.table syntax to be a bit obtuse at times, but it's a great library.

Aside from the performance differences, data.table makes it very easy to do interactive manipulation, at the cost of making it hard to program. Pandas currently goes in the opposite direction.

I'd rather have R/data.table at the prompt and python/pandas in my script, but if you have to err on one side, the python/pandas "low magic" is the side to err on. Pandas does have its own strange corners, though. For example, it seems like it tries hard to stick similar-typed columns into contiguous matrices, which leads to some unexpected casting, and I have no idea what the supposed benefit is over just keeping distinct columns.
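
One easy-to-reproduce instance of such casting, as a minimal sketch: pulling a single row out of a frame that mixes integer and float columns returns a Series with one common dtype, so the integer value is upcast to float. The column names are made up for illustration, and this is not necessarily the exact case the comment has in mind.

```python
import pandas as pd

# Hypothetical frame: one integer column and one float column.
df = pd.DataFrame({"count": [1, 2], "price": [1.5, 2.5]})

# Inside the frame, each column keeps its own dtype.
print(df.dtypes)

# A row spans both columns, so it comes back with a single common dtype
# and the integer in "count" is converted to a float.
row = df.iloc[0]
print(row.dtype)
print(row["count"])
```

Selecting by column instead (`df["count"]`) avoids the conversion, since a column is homogeneous to begin with.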

I'd guess the benefits are related to performance - Wes is known as something of a speed junkie (see also his vbench project). I know there's quite a bit of code in pandas that makes it much faster than a naive implementation of a similar interface.

That said, if it causes unexpected behaviour, check to see whether it's a bug.

If your use case fits data.table, then you probably want to use PyTables in Python. It's much faster than pandas when dealing with very large data sets, at the cost of some features you may or may not need.

The benefit of data.table is not so much processing very large data (anything above 10M observations is mostly outside of R's comfort zone on a reasonable machine, anyway) as the ease of performing operations such as indexed joins, aggregations, and so on.
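
For readers coming from the pandas side, here is a minimal sketch of the same two kinds of operation: a join against an indexed lookup table, followed by a grouped aggregation. The tables and column names are invented for illustration; `DataFrame.join` and `groupby(...).sum()` are standard pandas calls.

```python
import pandas as pd

# Hypothetical data: orders keyed by customer id, plus a customer lookup table
# whose index is the customer id.
orders = pd.DataFrame({"customer_id": [1, 2, 1, 3],
                       "amount": [10.0, 20.0, 5.0, 7.5]})
customers = pd.DataFrame({"region": ["north", "south", "north"]},
                         index=pd.Index([1, 2, 3], name="customer_id"))

# Join each order to its customer through the lookup table's index.
joined = orders.join(customers, on="customer_id")

# Aggregate: total order amount per region.
totals = joined.groupby("region")["amount"].sum()
print(totals)
```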

That's useful. Hadn't really looked at Pandas before.

Slightly OT:

I'm using in-memory sqlite3 with rtree to find objects within bounds in a 2D space. Is there a different library people would recommend for this in Python?
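
For reference, the setup described can be sketched with the standard library alone. This is an illustration under assumptions, not the poster's actual code: the table name, column names, and `find_overlapping` helper are hypothetical, the query returns boxes that overlap the search rectangle (rather than boxes fully contained in it), and the fallback branch is there only because some SQLite builds ship without the R*Tree module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
try:
    # R*Tree virtual table: an integer id plus min/max bounds per dimension.
    conn.execute(
        "CREATE VIRTUAL TABLE objects USING rtree(id, min_x, max_x, min_y, max_y)"
    )
except sqlite3.OperationalError:
    # Assumption: fall back to a plain, unindexed table with the same columns
    # when this SQLite build lacks the rtree module.
    conn.execute(
        "CREATE TABLE objects(id INTEGER PRIMARY KEY, "
        "min_x REAL, max_x REAL, min_y REAL, max_y REAL)"
    )

# Hypothetical bounding boxes: (id, min_x, max_x, min_y, max_y).
conn.executemany("INSERT INTO objects VALUES (?, ?, ?, ?, ?)", [
    (1, 0.0, 1.0, 0.0, 1.0),
    (2, 5.0, 6.0, 5.0, 6.0),
    (3, 0.5, 2.0, 0.5, 2.0),
])


def find_overlapping(conn, min_x, max_x, min_y, max_y):
    """Return ids of objects whose bounding box overlaps the query rectangle."""
    rows = conn.execute(
        "SELECT id FROM objects "
        "WHERE max_x >= ? AND min_x <= ? AND max_y >= ? AND min_y <= ? "
        "ORDER BY id",
        (min_x, max_x, min_y, max_y),
    )
    return [row[0] for row in rows]


print(find_overlapping(conn, 0.0, 1.5, 0.0, 1.5))
```

The same SELECT works against either table; with the virtual table, SQLite can use the R*Tree index to answer the range conditions instead of scanning every row.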

Are there any comments as to the maturity of Pandas as compared to R?

I am used to the Python syntax, and while R is another language to learn, my assumption is that for data analysis its age compared to Pandas implies stability.

I could of course be wrong.

I've not had any problems with pandas' stability. Where the age difference shows is in the availability of specific statistical methods. The Python package 'statsmodels' is working on that, but it's some way behind R.

Learning data mining this semester and find it useful. Thanks!

Thanks for posting
