
Alluxio is getting attention from Baidu and other data giants - StreamBright
http://readwrite.com/2016/02/22/new-fast-sql-project
======
nl
What a stupid headline.

They use Alluxio (formerly named Tachyon). People who know anything about
Spark will already know what that means. For everyone else: it is an
out-of-JVM memory layer that allows Spark to keep more data in memory
instead of on disk.
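To make that concrete, here is a minimal sketch of how Spark used Tachyon in the 1.x era this thread is about. The `externalBlockStore` settings are the Spark 1.6 names; the master URL and paths are illustrative, not from the article:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Point Spark's external block store at a Tachyon/Alluxio master
// (address and base dir are illustrative).
val conf = new SparkConf()
  .setAppName("tachyon-offheap-sketch")
  .set("spark.externalBlockStore.url", "tachyon://localhost:19998")
  .set("spark.externalBlockStore.baseDir", "/tmp/spark")

val sc = new SparkContext(conf)

// OFF_HEAP persistence stores the blocks in Tachyon rather than on the
// JVM heap, so cached data survives executor GC pressure and restarts.
val rdd = sc.textFile("hdfs:///data/events") // illustrative path
rdd.persist(StorageLevel.OFF_HEAP)
```

This is configuration working alongside Spark, which is the point of the comment: Tachyon is a storage layer Spark caches into, not a replacement engine.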

The key quote is that it is 100x faster than Spark _alone_. It isn't instead
of Spark SQL, it's as well as.

~~~
rxin
It's also not 100X faster than "Spark alone". This article is plain wrong.

~~~
eranation
To those who don't know rxin (Reynold Xin): he is a co-founder of Databricks
and a prominent Spark maintainer. I would listen to him.

------
jkot
I work on something similar. Spark is still fairly unoptimized, and there is
more room for performance improvements.

~~~
eranation
I saw a presentation from a company named Parallel Machines; they claim 800x
faster-than-Spark performance on PageRank, and you know what, having worked
with GraphX, I don't entirely disbelieve them. Spark has a long way to go in
terms of performance.

That said, this article is completely misleading and technically incorrect.
Tachyon has been documented in the Spark docs as an off-heap storage solution
since Spark SQL was called Shark. Nothing new, and nothing "competing" with
Spark. It's still Catalyst that runs the Spark SQL queries; Tachyon is just a
better memory store. This article is simply wrong.
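To illustrate the division of labor: no matter where the cached blocks live (heap, Tachyon, or disk), a Spark SQL query is still planned and optimized by Catalyst. A small sketch against a Spark 1.x `SQLContext` (the file path and column are illustrative):

```scala
// Assumes an existing SQLContext `sqlContext` on a running Spark 1.x app.
val df = sqlContext.read.json("hdfs:///data/people.json") // illustrative path

// explain(true) prints the Catalyst logical plan, the optimized plan, and
// the physical plan — the query engine is Spark's, regardless of storage.
df.filter("age > 21").explain(true)
```

Swapping the storage layer changes where `df`'s cached blocks sit; it does not change which engine produces or executes those plans.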

~~~
nl
_I saw a presentation from a company named Parallel Machines; they claim 800x
faster-than-Spark performance on PageRank_

800 times faster or 800%? 800% (i.e., 8 times) is believable, but I'd need a
lot of convincing if they are claiming 800 times.

Frank McSherry's work on graph computation is pretty good, and he gets a 16x
speedup over Spark [1]. Maybe they are claiming 800 times less resource usage
or something?

[1]
[http://www.frankmcsherry.org/pagerank/distributed/performanc...](http://www.frankmcsherry.org/pagerank/distributed/performance/2015/07/08/pagerank.html)

~~~
eranation
I'll double-check; you are probably right. In any case, they come from an HPC
background and squeeze the hardware to the limit. They also avoid the JVM:
they compile a popular data-science language to native code. (I didn't sign
an NDA, but I'm not sure how much they want revealed yet.)

Running native instead of on the JVM, using hardware capabilities, and
squeezing out every bit of performance they can think of (one of the
co-founders was, I think, a CPU designer at Intel or something) lets them get
close to the theoretical limit of linear scaling. Spark might get there
eventually... but it will take time.

