
This is just a series of incredibly generic operations on an already cleaned dataset in CSV format. In reality, you probably need to retrieve and clean the dataset yourself from, say, a database, and you may well need to do something non-standard with the data, which needs an external library with good documentation. Python is better equipped in both regards. Not to mention, if you're building this into any sort of product rather than just exploring, R is a bad choice. Disclaimer: I learned R before Python, and won't go back.

Exploring the data is maybe 99% of what data analysis is about. It's very much a trial and error process that can't be planned in advance, and R is in my opinion much better suited for that, with a better interactive interface, plotting system and statistical libraries.

On the other hand, if you know the exact calculations that you need to do and the results you're gonna get, then Python might be a better tool.

Personally I learned R after Python, and I use both languages, but I prefer R for anything involving statistics.

"R is in my opinion much better suited for that, with a better interactive interface"

Have you tried IPython/Jupyter?

Yes, I used IPython a lot.

What I meant by better interactive interface is that the language itself is designed with interactive use in mind.

For instance compare the R versions:

  func(x$a, x$b)
  func(a, b, data=x)
  func(a=1, b=2)

with the Python versions:

  func(x['a'], x['b'])
  func('a', 'b', data=x)
  func([1, 2], ['a', 'b'])

The R versions are easier to type and read.

If the Python version is Pandas you could replace the brackets with dot notation. (x.a, x.b)

the last one is, of course, a result of the atrocious handling of the default arguments in Python :)
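One common reading of that complaint is Python's rule that default argument values are evaluated once, at function definition time, which produces the well-known mutable-default pitfall. A minimal sketch (the function name `append_item` is made up for illustration):

```python
def append_item(item, bucket=[]):
    # the default list is created once, when the function is defined,
    # so every call that omits `bucket` shares the same list object
    bucket.append(item)
    return bucket

first = append_item(1)
second = append_item(2)
# `second` is the very same list as `first`, now [1, 2]
```

The idiomatic workaround is `bucket=None` plus `bucket = [] if bucket is None else bucket` inside the body.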

I think there are lots of good R libraries for getting data from various places: DBI (databases), haven (SPSS, Stata, SAS), readxl (xls & xlsx), httr (web apis), readr/data.table (flat files). (Disclaimer: I wrote/contributed to a lot of those).

I think tidyr also currently has the edge over pandas for making [tidy data](http://vita.had.co.nz/papers/tidy-data.html).
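For anyone unfamiliar with the term, "tidy" data means one observation per row. A rough stdlib-only sketch of the wide-to-long reshape that tidyr's gather (or pandas' melt) performs; the `melt` helper and the sample data are made up for illustration:

```python
# wide format: one row per country, one column per year
wide = [
    {"country": "A", "1999": 0.7, "2000": 2.0},
    {"country": "B", "1999": 37.0, "2000": 80.0},
]

def melt(rows, id_col, value_cols):
    """Reshape wide rows into long rows: one observation per row."""
    long_rows = []
    for row in rows:
        for col in value_cols:
            long_rows.append({id_col: row[id_col],
                              "year": col,
                              "value": row[col]})
    return long_rows

tidy = melt(wide, "country", ["1999", "2000"])
# tidy[0] == {"country": "A", "year": "1999", "value": 0.7}
```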

I'm currently using both R and Python, having previously only used Python. At first I didn't like R for general purpose data munging and web scraping. That was before I discovered a few R packages that make it a breeze. And now it's a toss up for me. If it's an interactive data product that I'm building I'll probably go with R. If I need data from an API and the supplier gives me only a Python sample script for accessing it I'll go with Python.

Could you list some of these packages?

Recently started using rvest for web scraping. Sweet bejeezus that's a pleasure. I would've never considered R for scraping before. It was always Python with BeautifulSoup.

I would also check out dplyr for data munging. Since most of the code in dplyr is written in C++ it is much faster than the munging capabilities you probably used when you were using R years ago.

Indeed. I use dplyr frequently.

data.table is another handy one here :)

Sweezyjeezy asks which are breezy.

Sorry, couldn't resist!

Hmm I am curious, how would you do data cleaning without doing data exploration first -- and in what way do you find Python superior to R for that purpose?

Also I assume that by "something non-standard" you mean something other than a way to analyze it? Because there is really no comparison wrt available analysis packages between the two...

Not trying to say that R is perfect and great for everything, definitely not, I just have a hard time imagining a data-processing task for which I would choose Python over R (I might pick SAS over either one of them though...)

How does R compare to SAS? I work in engineering and we use SAS pretty heavily for a lot of stuff (simple modelling, time series forecasting, multiple regressions, that type of thing). One thing I really like is how well integrated SQL is. Does R have something similar to PROC SQL? That is really the killer feature of SAS for me.

I use SAS professionally at my job, and R in all my academic/hobby work. R has a couple of packages that give similar functionality to PROC SQL (about 95% of my SAS workflow, since it's far nicer than data steps for a lot of things). There's an ODBC package (RODBC), as well as sqldf, which lets you use SQL queries to manipulate data frames in R.
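To give a flavour of that SQL-on-data workflow in Python terms, here is a minimal sketch using only the standard library's sqlite3 module with an in-memory table (the table name and data are made up for illustration):

```python
import sqlite3

# load a small in-memory table, then query it with SQL,
# similar in spirit to PROC SQL or sqldf on a data frame
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 100.0), ("west", 50.0), ("east", 25.0)])

rows = conn.execute(
    "SELECT region, SUM(amount) AS total FROM sales "
    "GROUP BY region ORDER BY region"
).fetchall()
# rows == [('east', 125.0), ('west', 50.0)]
```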

While there is (almost?) always a way to do a SQL query using idiomatic R, I have to admit that sometimes my brain thinks up a solution in SQL faster (a product of upbringing).

I agree. Once you incorporate the other necessary work and preparation, a well-documented, object oriented language is a better way to go.

I have to agree that Python is more powerful, and I am indeed doing more and more in Python. Python was my first language, before R.

However when the dataset is medium sized (i.e.: fits into your computer's memory / 2) R crushes Python (and Pandas) for the 80% of the time you'll be spending wrangling. The reason is that R is vector-based from the ground up. Pandas does everything that R does, but does it in a less-consistent, grafted-on way, whereas the experienced R person who "thinks vectors" is way ahead of the Python guy before the analysis has even started (i.e., most of the work). I know both really well. I use Python when I want to "get (semi) serious" production wise (I qualify with "semi" because if you're really serious about production, you're probably going to go to Scala).

But when it comes to taking a big chunk of untidy data and bashing it around till it's clean and cube-shaped, will parse, and has no obvious errors, R is miles ahead of Python. R is where you do your discovering. Python can do it too, but I would estimate the cognitive overhead as double.

By the way, that's why people who "think time series" all day long (i.e., vectors, not objects), and who want to implement their algos, not think CS, will first typically build it in R, which is why CRAN beats Python all the time and every time for off-the-shelf data analysis packages. Data people go to R, computer-people go to Python (schematizing).

R is slow. That's its main problem. And that's saying something when comparing it to Python! But the gem of vector-everything makes it a much more satisfying language than imperative, OO, Python, when it comes to the world of data first, code second.

Finally I'd add that Python 3.x is arguably distancing itself from the pragmatism which data science requires, and 2.x provided, towards a world of CS purity. It's not moving in a direction which is data science friendly. It's moving towards a world of competition with Golang and Javascript, and Java itself.

If you haven't already, you might want to take a look at Julia. It's extremely fast, and has more native support for vectors than Python. It's still immature, but I think it has great potential as the truly great language for scientific/data computing.

I heard that the vector operations were very slow though. Has this changed?

It seems that, although vectorized code in Julia is typically slower than non-vectorized code, it is still faster than in R [1] / Python [2].

[1] http://www.johnmyleswhite.com/notebook/2013/12/22/the-relati...

[2] http://blog.rawrjustin.com/blog/2014/03/18/julia-vs-python-m...

Vector operations are not slow - they are basically the same as python/R (compiled down to C).

However, devectorization (i.e. replacing vector ops with a for-loop) is sometimes a performance improvement because Julia can usually provide C-like speeds in for-loops and avoid creating intermediate arrays.

Julia's for loops are comparable to C in performance, and its vectorized operations are comparable to Numpy/R, although some cases can be optimized using https://github.com/lindahua/Devectorize.jl (see the benchmarks table)

Okay. That post had worried me a couple of years ago. JMW shows that vectorized was much slower in Julia as well (though both still faster than R, but that's not difficult).

Glad to see Julia is very fast in both cases, though it's still somewhat perplexing the extent to which vectorized code is necessarily slower. I'm thinking that the future of GPU enabled languages will mean vectorized code will be faster, so I prefer languages with a bias towards vectorisation.

"it's still somewhat perplexing the extent to which vectorized code is necessarily slower"

The vectorized code typically allocates all kinds of intermediate results (more GC, more memory accesses). Apparently, turning it into loops is less trivial than it seems.
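The allocation point can be sketched in plain Python: chaining whole-array operations materializes an intermediate result at each step, while a fused loop computes the same thing in a single pass (list comprehensions stand in for vector ops here):

```python
xs = list(range(5))

# "vectorized" style: two passes, one intermediate list allocated
doubled = [v * 2 for v in xs]
shifted = [v + 1 for v in doubled]

# "devectorized" style: one fused pass, no intermediate list
fused = [v * 2 + 1 for v in xs]

assert shifted == fused  # same result, different allocation pattern
```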

"I'm thinking that the future of GPU enabled languages will mean vectorized code will be faster, so I prefer languages with a bias towards vectorisation."

I share that concern. Julia has some libraries to support GPU programming, but I don't know of any plans to have the core compiler take advantage of it.

I think you may have misinterpreted that post. Look at the table under "Comparing Performance in R and Julia" again.

Algo people use R because it's faster, nothing to do with being 'data people'.

I am a data person, and I have to deal with a lot of text in my job. If I had to do it in R, I would quit.

Can you explain why you think it is easier to wrangle data in R? My experience is the opposite.

Do you mean they use Python because it's faster? Yes, sure. But then, just use Scala. 10x faster again. With a REPL.

Perhaps I should clarify: I'm talking mainly about time series and/or data which is vectorizable. Python is better if you're scraping the web, or if there's a lot of if/else going on, i.e. imperative programming.

R's native functional aspects (all the apply family) and multilevel vector/matrix hierarchical indexing is better built from the ground up for large wrangling of multivariate datasets, in my opinion.
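For comparison, the row/column idioms of R's apply family map roughly onto comprehensions and zip in Python; a minimal sketch with a made-up matrix:

```python
matrix = [[1, 2, 3], [4, 5, 6]]

# roughly apply(m, 1, sum) in R: apply over rows
row_sums = [sum(row) for row in matrix]        # [6, 15]

# roughly apply(m, 2, sum) in R: apply over columns
col_sums = [sum(col) for col in zip(*matrix)]  # [5, 7, 9]
```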

Working with text data in R is painful, but it's not due to limitations of the language.

I agree with your critiques of Python... Could you please post some example of code/operations which are very natural in R but unnatural in Python/Pandas? I'm curious to see what I'm missing out on.

Well, I use both, and I can do everything in Python that I can do in R. However here are some things which will give you a flavour of R's more consistent, data-first nature:

  > rollapply(some1000x10matrix, 200, function(x) eigen(cov(x))$values[1], by.column = FALSE) # get the first eigenvalue rolling 200x10 window. 
  >>> # impossible in Python unless using ultra-complex Numpy stride tricks.

  > dim(someMatrix)
  >>> someMatrix.shape
  > head(someMatrix)
  >>> someMatrix.head() # notice consistent function application in R, whereas in Python, mixed attribute / function? So we're on OO land and I must know if it's an attribute or a function.... 
  > rollapply(some1000x2matrix, 200, function(x) {linmod <- lm(x[, 1] ~ x[, 2]); last(linmod$residuals) / sd(linmod$residuals)}, by.column = FALSE) # get the z score in one multi-step function. 
  >>> Impossible in Python without a for loop, as lambdas cannot be multi-statement. 

  > native indexing using [] brackets by index number, or index value, or boolean. All vectors.
  >>> pandas loc/iloc/ix mess.

  > lists (R's analogue of a Python dict) are ordered by default, so boolean or index subsetting is easy even when data is hierarchical, not tabular
  >>> easy bugs due to the unordered nature of dicts; must import OrderedDict from collections and then still can't vector-index it. 
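To make the rollapply comparison above concrete, here is a rough pure-Python equivalent of its sliding-window semantics (a sketch only; a serious port would use Numpy for speed, and the function name is made up):

```python
def rollapply(xs, width, fn):
    """Apply fn to each length-`width` sliding window of xs."""
    return [fn(xs[i:i + width]) for i in range(len(xs) - width + 1)]

rolling_sums = rollapply([1, 2, 3, 4, 5], 3, sum)  # [6, 9, 12]
```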
It's all summed up by this:

  > c(1, 2, 3) * 3
  [1] 3 6 9

  >>> [1, 2, 3] * 3
  [1, 2, 3, 1, 2, 3, 1, 2, 3] # wrong! Need rescuing by Numpy!
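In fairness, the Python side of that example can be rescued without Numpy by a comprehension, which makes the repetition-vs-elementwise distinction explicit (a minimal sketch):

```python
triple_repeat = [1, 2, 3] * 3              # list repetition
triple_elem = [x * 3 for x in [1, 2, 3]]   # elementwise, like R's c(1, 2, 3) * 3
# triple_repeat == [1, 2, 3, 1, 2, 3, 1, 2, 3]
# triple_elem == [3, 6, 9]
```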

And then there's CRAN. Just last night someone told me about "nowcasting" which uses "MIDAS regression". A relatively new technique. Google it for R (full package available), Google it for Python (Matlab comes up ;-).

And I'm not even going to start on graphics. Seaborn and Bokeh are valiant efforts, but they're still at 80% of what ggplot and base graphics can do, especially at the multidimensional scale. That last 20% is often all the difference between meh and wow. That said, I do appreciate Matplotlib's auto-rescaling of axes when adding data. Python charts aren't as pretty nor as capable of complexity (for similar effort), but they're arguably more dynamic.

Now don't get me wrong. The converse list for Python would be much longer, because it's more general purpose, and it kills R outside of data science. I wrote 10k LOC in R for a semi-production system and it was horrible, because R does not have the CS tools for managing code complexity, and it really is slow at certain things. R is more focused on iterative, exploratory data science, where it excels.

I think this numpy successor may put some weight in favor of python: https://speakerdeck.com/izaid/dynd

R _is_ object oriented. But it uses generic function style of OO, rather than message passing, which you're probably more familiar with. (Interestingly Julia also uses generic function style OO)
