Hacker News new | past | comments | ask | show | jobs | submit login

More accurate title would be fit in RAM of single machine.

Maybe some bonus category:

0. Spreadsheet is all you need.

1. Python script is good enough.

2. Java/Scala is way to go.

3. Need to manage memory (gc doesn't cut), some custom organization.

4. Actually needs a cluster.




> 0. Spreadsheet is all you need.

I HATE when people use Spreadsheets to do anything besides simple math.

http://lemire.me/blog/archives/2014/05/23/you-shouldnt-use-a...

TL:DR your work is not reproducible and we can't see what you did to get to your numbers. A million examples of why this is bad.

Also

> 1. Python script is good enough

You mean Python with pandas and numpy?

I use R which is also a great choice

> 2. Java/Scala is way to go.

For you but the vast majority of Data Scientist don't use either and their choice for people is not universal. Julia looks like a great new comer. I again mainly use R.

> 3 & 4 are good points.


Sadly, I recall those arguments against spreadsheet computation being made in the early 1990s. People simply won't learn.

Ray Panko, University of Hawaii.

http://panko.shidler.hawaii.edu/SSR/

http://panko.shidler.hawaii.edu/SSR/

Goes back to 1993:

http://panko.shidler.hawaii.edu/SSR/auditexp.htm


There is even a European Spreadsheet Risks Interest Group: http://www.eusprig.org/


Ad 0. I agree. Your article got valid point. I wouldn't do serious research based solely on complicated spreadsheet.

Though in many non-techies things, like daily sales transactions it is a way to go.

Ad 1. pandas/numpy would put it on par with 2.

Ad 2. Would disagree. I know data scientist using Spark. Mostly they like Scala API.

In general, everyone got their favorite weapon of choice and what they feel comfortable. The point is that simpler solutions sometimes are just enough do their job.

Renting r3.4xlarge on AWS for an hour and play with your favorite tool may be an orders of magnitude easier/cheaper/faster than using big data solution.


Arguably Numpy/Pandas is just as performant as Scala/Java and it certainly beats R hands down when data becomes more than a say 10-20 gigabytes after which I find R slows to a crawl.


Untrue about the speed of R. R and Python are always around the same speed, but there are always other options specially with R, where there is always more than one way to do anything.

We have data.tables and dplyr which data.tables is maybe on average 50% faster and on some points multiple faster than Python [http://datascience.la/dplyr-and-a-very-basic-benchmark/]


  > mm <- matrix(rnorm(1000000), 1000, 1000)
  > system.time(eigen(mm))
     user  system elapsed 
     5.26    0.00    5.25 



   IPy [1] >>> xx = np.random.rand(1000000).reshape(1000, 1000)

   IPy [2] >>> %timeit(np.linalg.eig(xx))
  1 loops, best of 3: 1.28 s per loop
But where R really stinks is memory access:

  > system.time(for(x in 1:1000) for(y in 1:1000) mm[x, y] <- 1)
     user  system elapsed 
     1.09    0.00    1.11 

   IPy [7] >>> def do():
          ...:     for x in range(1000):
          ...:         for y in range(1000):
          ...:             xx[x, y] = 1
          ...:
   IPy [10] >>> %timeit do()
  10 loops, best of 3: 134 ms per loop
Growing lists in R is even worse with all the append nonsense. Exponential time slower.


That's why you never ever grow lists with R. do.call('rbind',...) or even better data.table::rbindlist(). You can't blame R for being slow if you don't know how to write fast R code.


obviously I use do.call all day long because R is my primary weapon, but even if I say so myself, a happy R user, Python with Numpy is faster. I would invite you to show me a single instance where R is faster at bog-standard memory access, than Numpy. My example demonstrates exactly this. Can there be anything simpler than a matrix access? And if that's (8x) slower, everything built on this fundamental building block of all computing (accessing memory) will be slow too. It's R's primary weakness and everybody knows it. Let me make it abundantly clear:

  > xx <- rep(0, 100000000)
  > system.time(xx[] <- 1)
     user  system elapsed 
    4.890   1.080   5.977 

  In [1]: import numpy as np
  In [2]: xx = np.zeros(100000000)                                               
  In [3]: %timeit xx[:] = 1
  1 loops, best of 3: 535 ms per loop
If the very basics, namely changing stuff in memory, is so much slower, then the entire edifice built on it will be slower too, no matter how much you mess around with do.call. And to address the issue of (slow, but quickly expandable) Python lists, recall that all of data science in Python is built on Numpy so the above comparisons are fair.


> you mean Python with pandas and numpy?

Actually, I would bet that some 50% of time people are importing numpy or pandas they really don't need it

Like for calculating the square root of a number. Or the average of a short list




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: