
Druid: fast column-oriented distributed data store - samueladam
http://druid.io/
======
tschellenbach
Druid is quickly becoming the leading open source solution for building highly
scalable analytics. We evaluated it for getstream.io. Unfortunately the setup
and maintenance are still very labour-intensive. For startups that's a concern.
Many larger companies we spoke to were extremely happy about running Druid in
production though.

~~~
threeseed
Druid isn't even close to being the leading solution. The overwhelming
majority of places doing big data analytics will be Hadoop/Spark using core
HDFS.

Then most places will augment this with a range of database solutions
depending on how structured/clean the data is and the various workloads.
Cassandra, HBase, MongoDB, Teradata, Oracle, ElasticSearch, Greenplum are all
pretty commonplace in most enterprises.

And of course the ridiculous array of SQL engines on top of HDFS/S3/whatever
else e.g. Hive, Spark SQL, Presto, Drill, SAP etc.

~~~
fangjin
Druid's main value add to the data infrastructure space is around powering
user-facing data applications at scale. The queries it is best at are OLAP/business
intelligence style queries. It isn't really designed to be a general
processing tool such as Hadoop or Spark. The open source data space is very
complex, and there are many different solutions targeted towards many
different use cases. Druid is better than other solutions at some of these use
cases, and worse than other solutions at others.

I wrote my interpretation of the current open source data landscape here for
anyone interested:
[http://imply.io/post/2015/11/04/big-data-zoo.html](http://imply.io/post/2015/11/04/big-data-zoo.html)

~~~
deepanchor
As someone who hasn't yet had the opportunity to use many of these systems,
this was a great high-level overview of how the various systems fit together.
Thanks for writing it!

------
NamTaf
The realtime ingestion is interesting especially if I can still batch import.
When processing machine data, I've found that a number of sources come in
chunks (logfiles written out every 24 hours, for example) but the eventual aim
is to migrate to realtime streaming (i.e. a data point every n
seconds/minutes/etc., where you instantly consume that data point).

If this transition is easy without reworking infrastructure, the solution is
far more attractive.

------
jnordwick
Every open source column database I've seen is very poor: text, no decent
array oriented ability (give me the previous row), slow, json output, etc.
When will somebody get it right?

~~~
ktamura
This is simply false. PostgreSQL + cstore_fdw, Presto, Impala, etc. all
support "array oriented ability" via window functions.

~~~
jnordwick
I mean like give me the row preceding the current one. Very useful for
calculating anything from a simple rate of change to properly doing a
temporal database. Arrays as a column type don't get you that.

~~~
teraflop
No, but window functions do, unless I'm not understanding what your goal is.

[http://www.postgresql.org/docs/current/static/functions-wind...](http://www.postgresql.org/docs/current/static/functions-window.html)
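For illustration, here is a minimal sketch of the "preceding row" lookup using the standard `LAG` window function. It runs against an in-memory SQLite database through Python's `sqlite3` module rather than PostgreSQL, and assumes the bundled SQLite is version 3.25 or newer (the first with window function support). The `readings` table, its columns, and the sample values are hypothetical.

```python
import sqlite3

# Hypothetical table of timestamped readings; names and values are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (ts INTEGER, value REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [(1, 10.0), (2, 13.0), (3, 12.0), (4, 18.0)],
)

# LAG(value) yields the value from the preceding row in ts order
# (NULL for the first row), so value - LAG(value) is the per-row change.
rows = conn.execute(
    """
    SELECT ts,
           value,
           value - LAG(value) OVER (ORDER BY ts) AS delta
    FROM readings
    ORDER BY ts
    """
).fetchall()

for row in rows:
    print(row)
```

The same `LAG(...) OVER (ORDER BY ...)` expression is standard SQL and should carry over to PostgreSQL largely unchanged. This sketch says nothing about how fast it is on large tables.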

~~~
jnordwick
I actually was using that yesterday. So so so so very very very very slow.

------
techwizrd
A friend of mine who interned with me at eBay used Druid and Angular to great
success to build a tool for analysts to look at trends in our data. Druid is
some seriously cool stuff.

------
Exuma
Our 2-man team set up Druid... it took 5+ months and was excruciating to
configure and get running smoothly (things were slightly more complicated
because we decided to use docker). It also took ~30 servers to make a truly
fault-tolerant setup.

With that said, it works very well, but it definitely came at the cost of a
good dose of sanity.

~~~
fangjin
FWIW, Imply repackaged Druid in such a way that should make it much much
easier to set up and evaluate. We've been porting our docs over to Druid for
0.9.0:
[http://imply.io/docs/latest/quickstart](http://imply.io/docs/latest/quickstart)

There's also a production-ready docker distribution:
[https://hub.docker.com/r/imply/imply/](https://hub.docker.com/r/imply/imply/)

------
whitegrape
Has anyone done a meaningful private benchmark comparison with
[http://www.scylladb.com/](http://www.scylladb.com/) ? I didn't find one
online.

~~~
ddorian43
It is a 100% different type of database. Druid is OLAP while ScyllaDB is OLTP.
They have nothing in common (except for the "columnar" name).

~~~
whitegrape
Yes, but that doesn't mean you can't benchmark them anyway. And I think you
could probably find some meaningful comparison. Certainly it would be more
useful than the Druid whitepaper's benchmarks against _MySQL_. (I used to work
on a DB project, we too had a benchmark against MySQL even though our DB was
OLAP-focused.)

~~~
ddorian43
No, you can't, because ScyllaDB can't do the stuff Druid can (and vice versa),
while MySQL can (even though it does it in a completely different way).

------
mrweasel
What's the advantage of a "column-oriented" data store/database?

~~~
gglanzani
Analytics like sum(column) are much faster if data is saved column-oriented.
Also every time you don't do select *, but select a, b, it's much easier on
disk because you're not even touching some columns.
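As a rough sketch of that point, the toy example below stores the same three records once row-wise and once column-wise. It uses plain Python lists rather than any real storage engine, and the column names `a`, `b`, `c` and all function names are made up for illustration.

```python
# Row-oriented layout: each record keeps all of its fields together.
rows = [
    {"a": 1, "b": 10.0, "c": "x"},
    {"a": 2, "b": 20.0, "c": "y"},
    {"a": 3, "b": 30.0, "c": "z"},
]

# Column-oriented layout: each column is stored as its own contiguous list.
columns = {
    "a": [1, 2, 3],
    "b": [10.0, 20.0, 30.0],
    "c": ["x", "y", "z"],
}


def sum_b_row_store(records):
    # Has to walk every record (on disk: read every field) just to reach "b".
    return sum(record["b"] for record in records)


def sum_b_column_store(cols):
    # Reads only the "b" column; "a" and "c" are never touched.
    return sum(cols["b"])


def select_a_b(cols):
    # SELECT a, b: stitch only the two requested columns back into tuples.
    return list(zip(cols["a"], cols["b"]))


print(sum_b_row_store(rows), sum_b_column_store(columns))
print(select_a_b(columns))
```

Both sums give the same answer. The difference is how much data has to be scanned to get it, which is what matters once the table no longer fits in memory.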

------
whitenoice
How does it compare to Vertica?

------
sspring1
Great, just what we need, a Druish database.

~~~
ciconia
Funny, it doesn't look druish...

~~~
brightball
It takes only what it needs to survive...

~~~
brightball
We're down voting Spaceballs references? Who are you monsters!?

~~~
thecatspaw
People who come here for serious discussions. Reddit is more the place for
memes, references and stuff like that, not HN.

------
rosalinekarr
Yeah, more database solutions, that's what we need.

[https://xkcd.com/927/](https://xkcd.com/927/)

~~~
Exuma
It actually is needed. It's a great enterprise-level timeseries-centric
database.

