Arguably Numpy/Pandas is just as performant as Scala/Java, and it certainly beats R hands down once the data grows past, say, 10-20 gigabytes, after which I find R slows to a crawl.

Untrue about the speed of R. R and Python are always around the same speed, but there are always other options, especially with R, where there is always more than one way to do anything.

We have data.table and dplyr; data.table is maybe 50% faster on average, and on some operations several times faster, than Python [http://datascience.la/dplyr-and-a-very-basic-benchmark/]

  > mm <- matrix(rnorm(1000000), 1000, 1000)
  > system.time(eigen(mm))
     user  system elapsed 
     5.26    0.00    5.25 

  IPy [1] >>> xx = np.random.rand(1000000).reshape(1000, 1000)
  IPy [2] >>> %timeit(np.linalg.eig(xx))
  1 loops, best of 3: 1.28 s per loop

But where R really stinks is memory access:

  > system.time(for(x in 1:1000) for(y in 1:1000) mm[x, y] <- 1)
     user  system elapsed 
     1.09    0.00    1.11 

   IPy [7] >>> def do():
          ...:     for x in range(1000):
          ...:         for y in range(1000):
          ...:             xx[x, y] = 1
   IPy [10] >>> %timeit do()
  10 loops, best of 3: 134 ms per loop

Growing lists in R is even worse with all the append nonsense: every append copies the whole list, so the total cost grows quadratically.

That's why you never ever grow lists in R. Use do.call('rbind',...) or, even better, data.table::rbindlist(). You can't blame R for being slow if you don't know how to write fast R code.
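
For what it's worth, the same "collect first, combine once" rule applies on the Numpy side, since np.append also copies the whole array on every call. A minimal sketch of the two patterns; the function names grow_by_append and build_once and the toy chunks are made up for illustration, not taken from anyone's benchmark:

```python
import numpy as np

def grow_by_append(chunks):
    """Anti-pattern: np.append allocates and copies a new array on every call."""
    out = np.empty(0)
    for chunk in chunks:
        out = np.append(out, chunk)
    return out

def build_once(chunks):
    """Collect the pieces in a plain list, then concatenate a single time."""
    return np.concatenate(list(chunks))

# Hypothetical toy data: four small float arrays.
chunks = [np.full(3, i, dtype=float) for i in range(4)]

slow = grow_by_append(chunks)
fast = build_once(chunks)
print(np.array_equal(slow, fast))
```

Both give the same array; the difference is only in how many copies are made along the way, which is the same trade-off as do.call('rbind',...) versus appending in a loop.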

Obviously I use do.call all day long because R is my primary weapon, but even speaking as a happy R user, Python with Numpy is faster. I would invite you to show me a single instance where R is faster than Numpy at bog-standard memory access. My example demonstrates exactly this. Can there be anything simpler than a matrix access? And if that's 8x slower, everything built on this fundamental building block of all computing (accessing memory) will be slow too. It's R's primary weakness and everybody knows it. Let me make it abundantly clear:

  > xx <- rep(0, 100000000)
  > system.time(xx[] <- 1)
     user  system elapsed 
    4.890   1.080   5.977 

  In [1]: import numpy as np
  In [2]: xx = np.zeros(100000000)                                               
  In [3]: %timeit xx[:] = 1
  1 loops, best of 3: 535 ms per loop

If the very basics, namely changing stuff in memory, are so much slower (roughly 11x here), then the entire edifice built on them will be slower too, no matter how much you mess around with do.call. And to address the issue of (slow, but quickly expandable) Python lists, recall that all of data science in Python is built on Numpy, so the above comparisons are fair.
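
If anyone wants to rerun the Python side outside IPython, here is a rough sketch using the standard timeit module instead of %timeit. The array size n, the repeat count and the function names fill_loop and fill_slice are my own picks for illustration; I'm not claiming any particular numbers, since they depend on the machine and the Numpy build.

```python
import timeit
import numpy as np

n = 200        # assumed size, kept small so the loop version finishes quickly
repeats = 5    # assumed number of timed calls

def fill_loop(arr):
    """Element-by-element assignment, like the nested for loops above."""
    rows, cols = arr.shape
    for x in range(rows):
        for y in range(cols):
            arr[x, y] = 1.0
    return arr

def fill_slice(arr):
    """Whole-array assignment, like xx[:] = 1 above."""
    arr[:] = 1.0
    return arr

xx = np.zeros((n, n))
t_loop = timeit.timeit(lambda: fill_loop(xx), number=repeats)
t_slice = timeit.timeit(lambda: fill_slice(xx), number=repeats)

print(f"loop:  {t_loop / repeats * 1e3:.3f} ms per call")
print(f"slice: {t_slice / repeats * 1e3:.3f} ms per call")
```

The R equivalents would be the system.time() calls already shown; the point of timing both styles is to keep interpreter loop overhead separate from raw memory writes.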
