

Real-time SQL on Hadoop using Postgres foreign tables - spathak
http://citusdata.com/blog/63-real-time-sql-queries-on-hadoop

======
dude_abides
Not a single mention of Cloudera Impala in the article? Competition in this
space is great! Would be great to know how this offering compares to Impala.

~~~
ozgune
(Ozgun @ Citus Data)

Thanks for bringing this up. A lot of what we say in the FAQ for "How does
CitusDB's feature set compare against Apache Hive?" also applies to Impala,
and we'll update that question shortly. The fundamental difference is that
Citus builds on top of Postgres, and leverages its many features and
performance optimizations.

We are also working on getting performance numbers that compare Hive, Impala,
and Citus thoroughly, and we'll share our methodology and results in the
coming months.

~~~
monstrado
Disclaimer: Clouderan Here :D

I wouldn't necessarily agree that the feature comparison against Hive also
applies to Impala. For example, Impala uses HDFS short-circuit reads to read
data directly from disk, which gives full disk throughput. Combined with
highly efficient parallel reads, this yields some impressive numbers.

I've seen queries speed up anywhere from 2x to 100x (especially when data sets
fit in memory). Since it's designed for low-latency queries, results can come
back in under a second.

With that being said, Impala does not currently support UDFs (slated for post-
GA).

Hive does do JOIN order optimizations after 0.7.0, though
(<https://issues.apache.org/jira/browse/HIVE-1642>); you can set
"hive.auto.convert.join = true" to enable it. I believe this will be enabled
by default eventually. By GA, Impala will have a cost-based optimizer for
optimizing JOINs as well.

PS: Congrats on the release, I'm looking forward to giving it a go :)

------
mwexler
This seems very similar to Hadapt (<http://hadapt.com/>). Anyone familiar
enough to comment on the differences?

~~~
verily
This is Postgres-style query execution "above" HDFS/Hadoop-style storage.

Hadapt is Hadoop-style job execution "above" Postgres-style storage.

~~~
cbsmith
Is that really accurate? I had perceived things as the other way around, as
Hadapt has a mixed storage model and CitusDB uses external tables for
everything...

------
dcraw
Congrats to the Citus Data team on a big release. These guys know distributed
databases backwards and forwards. Excited to see how this product stacks up
against Hive.

------
DEinspanjer
It definitely seems interesting. The big problem I've been trying to find the
right tool for is simple document-based storage with solid secondary indices
that support aggregate queries. SQL syntax is a big plus here: many people
find it easy to write a SQL GROUP BY statement to get the aggregates they
want, but much harder to write an ElasticSearch or Solr facet query or a
MapReduce job, especially if they want relatively fast results.
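
To make the comparison concrete, here is a small sketch of the same aggregate
written two ways: once as a SQL GROUP BY and once as a hand-rolled
map/shuffle/reduce loop. It uses Python's built-in sqlite3 module purely as a
stand-in SQL engine (not CitusDB or Hive), and the events table and its values
are invented for the example.

```python
import sqlite3
from collections import defaultdict

# Hypothetical (browser, session duration in seconds) records.
events = [
    ("firefox", 120),
    ("chrome", 80),
    ("firefox", 30),
    ("chrome", 20),
    ("safari", 50),
]


def aggregate_with_sql(rows):
    """Count and sum per browser with a single GROUP BY statement."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (browser TEXT, duration INTEGER)")
    conn.executemany("INSERT INTO events VALUES (?, ?)", rows)
    cursor = conn.execute(
        "SELECT browser, COUNT(*), SUM(duration) "
        "FROM events GROUP BY browser ORDER BY browser"
    )
    result = cursor.fetchall()
    conn.close()
    return result


def aggregate_by_hand(rows):
    """The same aggregate written as explicit map, shuffle, and reduce steps."""
    groups = defaultdict(list)
    for browser, duration in rows:  # map: emit (key, value); shuffle: group by key
        groups[browser].append(duration)
    # reduce: collapse each group to (key, count, sum)
    return [(b, len(d), sum(d)) for b, d in sorted(groups.items())]


print(aggregate_with_sql(events))
print(aggregate_by_hand(events))
```

Both functions return the same rows; the SQL version just leaves the grouping
and ordering to the engine.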

