TL;DR: your work is not reproducible and we can't see what you did to get to your numbers. There are a million examples of why this is bad.
Also
> 1. Python script is good enough
You mean Python with pandas and numpy?
I use R, which is also a great choice.
> 2. Java/Scala is way to go.
For you, maybe, but the vast majority of data scientists don't use either, and no single choice is universal. Julia looks like a great newcomer. Again, I mainly use R.
Ad 0. I agree. Your article has a valid point. I wouldn't do serious research based solely on a complicated spreadsheet.
Though for many non-techie things, like daily sales transactions, it is the way to go.
Ad 1. pandas/numpy would put it on par with 2.
Ad 2. I'd disagree. I know data scientists using Spark, and they mostly like the Scala API.
In general, everyone has their favorite weapon of choice and what they feel comfortable with. The point is that simpler solutions are sometimes enough to do the job.
Renting an r3.4xlarge on AWS for an hour and playing with your favorite tool can be orders of magnitude easier, cheaper, and faster than standing up a big data solution.
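For example, here is a minimal sketch of that single-machine workflow in R with data.table; the file name and column names (daily_sales.csv, amount, customer_id) are hypothetical, just to show the shape of the job:
library(data.table)
# Read a multi-gigabyte CSV straight into RAM on the rented box
sales <- fread("daily_sales.csv")  # hypothetical file
# The kind of group-by aggregation people often reach for a cluster to do
by_customer <- sales[, .(total = sum(amount), n_orders = .N), by = customer_id]
If the data fits in the roughly 120 GB of RAM that instance gives you, that is usually the end of the story.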
Arguably Numpy/Pandas is just as performant as Scala/Java, and it certainly beats R hands down once the data grows beyond, say, 10-20 gigabytes, at which point I find R slows to a crawl.
That's untrue about the speed of R. R and Python are usually around the same speed, and there are always other options, especially with R, where there is more than one way to do anything.
> mm <- matrix(rnorm(1000000), 1000, 1000)
> system.time(eigen(mm))
user system elapsed
5.26 0.00 5.25
IPy [1] >>> xx = np.random.rand(1000000).reshape(1000, 1000)
IPy [2] >>> %timeit(np.linalg.eig(xx))
1 loops, best of 3: 1.28 s per loop
But where R really stinks is memory access:
> system.time(for(x in 1:1000) for(y in 1:1000) mm[x, y] <- 1)
user system elapsed
1.09 0.00 1.11
IPy [7] >>> def do():
...:     for x in range(1000):
...:         for y in range(1000):
...:             xx[x, y] = 1
...:
IPy [10] >>> %timeit do()
10 loops, best of 3: 134 ms per loop
Growing lists in R is even worse, with all the append nonsense: the copying makes it quadratic rather than linear in the number of elements.
That's why you never, ever grow lists in R. Use do.call('rbind', ...) or, even better, data.table::rbindlist(). You can't blame R for being slow if you don't know how to write fast R code.
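To make the contrast concrete, a minimal sketch (make_chunk and the sizes are invented for illustration):
library(data.table)
make_chunk <- function(i) data.frame(id = i, value = rnorm(100))
# Slow: rbind inside the loop copies the whole result on every iteration (quadratic work)
grown <- data.frame()
for (i in 1:1000) grown <- rbind(grown, make_chunk(i))
# Fast: build the pieces in a list, then bind them once at the end
pieces <- lapply(1:1000, make_chunk)
bound1 <- do.call(rbind, pieces)          # base R
bound2 <- data.table::rbindlist(pieces)   # typically faster still
Same result, but the final table is assembled once instead of being re-copied a thousand times.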
Obviously I use do.call all day long because R is my primary weapon, but even as a happy R user I have to say it: Python with Numpy is faster. I would invite you to show me a single instance where R is faster than Numpy at bog-standard memory access. My example demonstrates exactly this. Can there be anything simpler than a matrix access? And if that is about 8x slower, everything built on this fundamental building block of all computing (accessing memory) will be slow too. It's R's primary weakness and everybody knows it. Let me make it abundantly clear:
> xx <- rep(0, 100000000)
> system.time(xx[] <- 1)
user system elapsed
4.890 1.080 5.977
In [1]: import numpy as np
In [2]: xx = np.zeros(100000000)
In [3]: %timeit xx[:] = 1
1 loops, best of 3: 535 ms per loop
If the very basics, namely changing stuff in memory, are this much slower, then the entire edifice built on them will be slower too, no matter how much you mess around with do.call. And to address the issue of (slow, but quickly expandable) Python lists: recall that all of data science in Python is built on Numpy, so the above comparisons are fair.
I HATE it when people use spreadsheets to do anything besides simple math.
http://lemire.me/blog/archives/2014/05/23/you-shouldnt-use-a...
> 3 & 4 are good points.