
Impala: a modern, open-source SQL engine for Hadoop - mrry
http://blog.acolyer.org/2015/02/05/impala-a-modern-open-source-sql-engine-for-hadoop/
======
wazokazi
Apache Phoenix: [http://phoenix.apache.org/](http://phoenix.apache.org/) is
another SQL engine on top of HBase/Hadoop. If you have an existing HBase
cluster, Phoenix is a much easier way to put a SQL layer on top: you simply
map existing HBase tables to Phoenix tables. The icing on the cake is the
ability to use JDBC tooling (SQuirreL, DbVisualizer) to query the system.
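
The mapping step looks roughly like this in Phoenix SQL (the table, column
family, and column names here are made up for illustration):

```sql
-- A Phoenix view maps an existing HBase table in place; no data is copied.
-- Hypothetical HBase table "web_stats" with column family "m".
CREATE VIEW "web_stats" (
    pk VARCHAR PRIMARY KEY,       -- the HBase row key
    "m"."hits"   UNSIGNED_LONG,   -- column family "m", qualifier "hits"
    "m"."region" VARCHAR
);

-- Once mapped, the data is queryable from any JDBC client:
SELECT "region", SUM("hits") FROM "web_stats" GROUP BY "region";
```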

~~~
mping
There's also Presto from FB, and Apache Drill. Phoenix sits on top of HBase,
which is funkier than a plain HDFS + SQL engine.

Besides, right now there's a huge need for analytical engines on top of HDFS
(the "data lake", or whatever you call it) - that's why there are all of these
(shamelessly stolen from an email on the Drill users list):

- Apache Hive <https://hive.apache.org> (SQL-like, with interactive SQL
  thanks to the Stinger initiative)

- Apache Drill <http://drill.apache.org> (ANSI SQL support)

- Apache Spark <https://spark.apache.org> (Spark SQL
  <https://spark.apache.org/sql>, queries only; add data via Hive, SchemaRDD
  <https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SchemaRDD>,
  or Parquet <http://parquet.io/>)

- Apache Phoenix <http://phoenix.apache.org> (built atop Apache HBase
  <http://hbase.apache.org>; lacks full transaction
  <http://en.wikipedia.org/wiki/Database_transaction> support, some relational
  operators <http://en.wikipedia.org/wiki/Relational_operators>, and some
  built-in functions)

- Cloudera Impala
  <http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/impala.html>
  (significant HiveQL support, some SQL language support; no support for
  indexes on its tables; importantly missing DELETE, UPDATE, and INTERSECT,
  amongst others)

- Presto <https://github.com/facebook/presto> from Facebook (can query Hive,
  Cassandra <http://cassandra.apache.org>, relational DBs, etc. Doesn't seem
  to be designed for low-latency responses across small clusters, or to
  support UPDATE operations. It is optimized for data warehousing or
  analytics <http://prestodb.io/docs/current/overview/use-cases.html>)

- SQL-Hadoop <https://www.mapr.com/why-hadoop/sql-hadoop> via the MapR
  community edition <https://www.mapr.com/products/hadoop-download> (seems to
  be a packaging of Hive, HP Vertica
  <http://www.vertica.com/hp-vertica-products/sqlonhadoop>, Spark SQL, Drill,
  and a native ODBC wrapper <http://package.mapr.com/tools/MapR-ODBC/MapR_ODBC>)

- Apache Kylin <http://www.kylin.io> from eBay (provides a SQL interface and
  multi-dimensional analysis [OLAP <http://en.wikipedia.org/wiki/OLAP>];
  "… offers ANSI SQL on Hadoop and supports most ANSI SQL query functions".
  It depends on HDFS, MapReduce, Hive, and HBase, and seems targeted at very
  large data sets while maintaining low query latency)

- Apache Tajo <http://tajo.apache.org> (ANSI/ISO SQL standard compliance with
  JDBC <http://en.wikipedia.org/wiki/JDBC> driver support; benchmarks against
  Hive and Impala
  <http://blogs.gartner.com/nick-heudecker/apache-tajo-enters-the-sql-on-hadoop-space>)

- Cascading's <http://en.wikipedia.org/wiki/Cascading_%28software%29> Lingual
  <http://docs.cascading.org/lingual/1.0/#sql-support> ("Lingual provides
  JDBC Drivers, a SQL command shell, and a catalog manager for publishing
  files [or any resource] as schemas and tables.")

------
gallamine
I started using Impala this summer on AWS Elastic MapReduce. Previously I'd
been using Hive. We got at _least_ a 100x speedup in most of the work we do
because of it. Things that would have been impossible before (near-realtime
classification) are now feasible with mini-batches. My only (major) gripe is
that AWS only offers versions several releases behind. This makes it
difficult to troubleshoot problems and find documentation. That said, the
benefits far outweigh the cost.

~~~
supergirl
impala is good but it's no match for the best analytics databases out there

~~~
anonymousDan
In terms of performance or features? They had a paper at CIDR this year that
claims they are competitive with a well-known analytics DB (presumably Oracle,
although it wasn't named explicitly).

~~~
supergirl
AFAIK Impala doesn't have multicore joins, while almost any other serious DB
does. So if it uses 1 core out of 10 or 20 to do the join, it might be
10x-20x slower than the best.

~~~
mping
...but throughput shouldn't be a problem if you run N queries at the same
time (N = number of cores)

~~~
supergirl
yeah, I guess if you want throughput then it's fine

------
smegel
I have started using Impala/Parquet in combination with Solr for
real-time/batch analytics. Impala is very fast - it basically bypasses the
entire Hadoop stack to perform its own distributed processing and IO, and
Parquet is a great data-at-rest format. I have found the best way to update
data holdings is to land data in Avro format (a great streaming format, since
it carries its schema around with it), reference it as an external table,
then schedule an Impala query to convert the current Avro bucket into
Parquet. I am also looking at using Spark Streaming for real-time
map/reduce-style processing to add value to the data flowing into
Solr/Impala. Even with a moderate-sized cluster I feel like I have a lot of
processing power at my fingertips, and it all just works. Hadoop is really
starting to shine.
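
For anyone curious, the Avro-to-Parquet step above looks roughly like this in
Impala SQL (table names and the HDFS path are made up for illustration):

```sql
-- 1. Expose the Avro landing directory as an external table.
--    Impala can pick up the schema embedded in the Avro files.
CREATE EXTERNAL TABLE events_avro
STORED AS AVRO
LOCATION '/data/landing/events';

-- 2. Create a Parquet table with the same columns.
CREATE TABLE events_parquet LIKE events_avro STORED AS PARQUET;

-- 3. Schedule this query to rewrite the current Avro bucket
--    into the Parquet table that analytic queries actually hit.
INSERT INTO events_parquet SELECT * FROM events_avro;
```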

