
A comparative analysis of SQL-on-Hadoop systems for interactive analytics - luu
https://arxiv.org/abs/1804.00224
======
maxxxxx
Is this generally a good idea? We have data in multiple SQL databases and one
of our departments wants to run analytics on the data. My plan was to export
the data and then import it into a big consolidated SQL database. But our IT
department talked them into using their Hadoop instance so now we export from
SQL to JSON which then gets moved to Hadoop and queried with SQL and Hive. I
still don't see the advantage of this over using a SQL database. Queries that
should run in milliseconds take minutes in Hive. The data size is only a few
terabytes so I wouldn't call it "big data".

Is there any advantage in using Hadoop for such a use case?

~~~
glogla
Depends on the data size. It's definitely not worth it for gigabytes, and for
petabytes, there's not much else. Terabytes are in the middle.

~~~
maxxxxx
Aren't there petabyte SQL databases in existence? You can't use Hadoop for
transactional data so I suppose that data still goes into SQL.

~~~
Triffids
It depends. There is Cloudera's Kudu engine, which can quite easily handle
millions of atomic inserts per second, but it doesn't support true ACID
transactions. On the other hand, SQL databases can't use true ACID in a
high-load system either: serializable transactions get replaced by read
committed transactions, a "banking day", queues + background processes, etc.

------
DandyDev
A pity that the paper doesn't include Hive LLAP in the comparison. With LLAP
enabled, Hive becomes more like Impala and Drill, with persistent daemon
processes providing much shorter response times to queries.

I'm curious how Hive LLAP does, compared to Impala and Drill.

~~~
wenc
I found that a curious omission too -- it's hardly possible to have a
meaningful study of SQL-on-Hadoop interactive analytics in 2017
(conservatively assuming there is a 1-year lag between the study and the
paper) without including Hive LLAP and Presto.

But having been part of the academic machine, I kind of understand. These
things happen.

EDIT: I notice that a short paper was submitted to IEEE in 2017, so the study
was probably done in 2016, which may explain the omission.

------
billman
I would have liked to see Presto in the analysis.

~~~
disgruntledphd2
I am willing to bet that Presto would have done better than all of the others.

I was also really surprised that neither Hive nor Presto was included.
Clearly the author is biased against FB-originated projects /s.

------
c789a123
My comments (I am experienced in using Spark, so could be biased):

1. As the worker machine has 16 vCPUs and 122G of RAM, setting
spark.executor.memory to 8GB seems too small. I would be interested to see
how it works with a 32GB setting per 8 cores.

2. CDH doesn't keep up with Spark releases; a newer release such as Spark 2.3
can be used together with CDH Hadoop for the test.

3. In table 4, Spark is the only system without failures, which confirms it is
a very robust system.

4. It is a performance test, but are the results verified?
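
A rough back-of-the-envelope sketch of point 1, not the paper's actual setup:
it estimates how many executors fit on one 16 vCPU / 122G worker and how much
RAM they would claim. The helper name `executors_per_node` and the 8 cores per
executor in the 8GB case are assumptions for illustration. The only Spark
behaviour relied on is the default per-executor memory overhead on YARN of
max(384MB, 10% of executor memory).

```python
def executors_per_node(node_mem_gb, node_vcpus, executor_mem_gb, executor_cores):
    """Estimate (executor count, total GB claimed) for one worker node.

    Hypothetical helper: ignores memory and cores reserved for the OS and
    other daemons, so real numbers would be somewhat lower.
    """
    # Default YARN overhead: max(384 MB, 10% of spark.executor.memory).
    overhead_gb = max(0.384, 0.10 * executor_mem_gb)
    container_gb = executor_mem_gb + overhead_gb

    fit_by_memory = int(node_mem_gb // container_gb)
    fit_by_cores = node_vcpus // executor_cores

    count = min(fit_by_memory, fit_by_cores)
    return count, count * container_gb


if __name__ == "__main__":
    # 8GB executors, assuming 8 cores each (the core count is an assumption).
    print(executors_per_node(122, 16, 8, 8))
    # The suggested alternative: 32GB per 8 cores.
    print(executors_per_node(122, 16, 32, 8))
```

Under these assumptions both settings are limited by cores to two executors
per node, so the 8GB setting would leave most of the 122G unused while the
32GB setting claims a bit over half of it.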

------
sgt101
EC2 vs on-prem is a big jump; does anyone have a view as to what this study
means for on-prem implementations?

~~~
dajohnson89
What else could it be, besides a local machine(s)?

~~~
sgt101
I expect that performance on-prem will be rather different...

