

Metamarkets Open Sources Druid, A Real-Time Analytics Data Store - brianm
http://metamarkets.com/2012/metamarkets-open-sources-druid/

======
jandrewrogers
This is great news for the broader community, even if you do not have a use
for Druid per se.

Database engines designed for real-time analytics have a significantly
different internal structure than either traditional OLTP or popular
analytical systems like Hadoop. Most people just try to (badly) fit real-time
analytic workloads into a database engine not designed for it. Druid is the
first open source example I am aware of that has internals designed for these
types of workloads.

Few software engineers know what the inside of a real-time analytical database
looks like. This will provide a great starting point. (The only major missing
component is the non-trivial, custom I/O scheduling engine required to back
these engines to disk instead of in-memory.)

------
scanr
This is great. I've been keeping an eye on the Metamarkets blog for entries
about Druid because we've been doing something similar, and it's always useful
to see how other folks are solving a shared problem.

Regarding what's on GitHub: a lot more documentation and examples would make
working out how to use it much easier.

~~~
vosper
I too have been following these blog posts. I'm happy to see they've open-
sourced Druid, but I agree that some more documentation and examples would go
a long way. Presumably this will be coming in the future; for now it's nice
just to be able to grab the code and play.

I'd be interested to hear how you tackled the real-time analytics problem.
We're doing a shootout between HBase, Riak (with map-reduce), Hypertable and
Postgres at the moment - so far there's no clear winner.

~~~
scanr
We went old school. We have a whole bunch of MySQL instances that we've
sharded (using consistent hashing). Queries are sent to all of the shards and
then aggregated on one of our query servers.
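scanr doesn't share any code, but the scheme described - consistent hashing to pick a shard, then scatter/gather across all shards with aggregation on a query server - can be sketched roughly like this. Everything here (class names, shard names, the query callback) is hypothetical illustration, not their actual system:

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    # Stable hash so the same key always lands on the same ring position.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Maps keys onto MySQL shards using a ring with virtual nodes, so
    adding or removing a shard only remaps a fraction of the keys."""

    def __init__(self, shards, vnodes=100):
        self._ring = sorted(
            (_hash(f"{shard}#{i}"), shard)
            for shard in shards
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def shard_for(self, key: str) -> str:
        # First ring position at or after the key's hash (wrapping around).
        idx = bisect.bisect(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]

def scatter_gather(shards, query_shard):
    """Send the same query to every shard and sum per-key counts on the
    query server - the fan-out/aggregate step scanr describes."""
    totals = {}
    for shard in shards:
        for key, count in query_shard(shard):
            totals[key] = totals.get(key, 0) + count
    return totals
```

This works for additive metrics like counts and sums; the follow-up about HyperLogLog is exactly about the metric (distinct counts) that can't be summed this way.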

We also aggregate as we go, which means there's less data to store (the
disadvantage being that we need to decide up front what we want to aggregate,
but that has been less of an issue than expected). We can't do accurate unique
counts on pre-aggregated data, but we've added HyperLogLog into the mix, as
most of our use cases can tolerate a small amount of error.
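HyperLogLog is a good fit here precisely because its sketches are mergeable: each shard keeps a sketch per aggregation bucket, and the query server merges them, which raw pre-aggregated counts can't do for distinct values. A minimal sketch of the algorithm itself (my own toy version for illustration, not whatever library scanr uses):

```python
import hashlib
import math

class HyperLogLog:
    """Minimal HyperLogLog: estimates distinct counts with roughly
    1.04/sqrt(2**p) relative error, and merges across shards."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p            # number of registers
        self.registers = [0] * self.m

    def add(self, item: str):
        x = int(hashlib.sha1(item.encode()).hexdigest(), 16) & ((1 << 64) - 1)
        idx = x >> (64 - self.p)                 # first p bits pick a register
        rest = x & ((1 << (64 - self.p)) - 1)
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        # Register-wise max: equivalent to having seen both streams.
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def count(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)
        est = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if est <= 2.5 * self.m and zeros:        # small-range correction
            est = self.m * math.log(self.m / zeros)
        return est
```

With p=12 that's 4096 one-byte registers per bucket, so the "small amount of error" costs a few KB regardless of how many distinct values flow through.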

We partition our MySQL tables, which means we can archive old data easily and
don't need to run any repair jobs.

Disadvantages:

* Column-oriented storage is a better fit for this kind of data (somewhat mitigated if the data isn't sparse and we store ids instead of values, e.g. city_id rather than city_name).

* Schema changes are not pleasant. We hit a deadlock between partition operations and ALTER TABLE statements that locks up the entire server.

Advantages:

* It performs well enough for our needs

* Having SQL as a query language is very pleasant

* MySQL is well understood, which makes looking after it quite easy
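The "store ids instead of values" point in the disadvantages above is essentially dictionary encoding, which is also what column stores do internally. A tiny illustrative sketch (hypothetical names, not scanr's code):

```python
class Dictionary:
    """Dictionary-encodes low-cardinality strings (e.g. city names) as
    small integer ids, so fact rows can store city_id instead of
    city_name - narrower rows and cheaper comparisons."""

    def __init__(self):
        self._id_by_value = {}
        self._value_by_id = []

    def encode(self, value: str) -> int:
        # Assign the next id the first time a value is seen.
        if value not in self._id_by_value:
            self._id_by_value[value] = len(self._value_by_id)
            self._value_by_id.append(value)
        return self._id_by_value[value]

    def decode(self, value_id: int) -> str:
        return self._value_by_id[value_id]
```

In the MySQL setup this is just a lookup table joined at query time; the encoding only pays off when the data isn't sparse, as the comment notes.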

~~~
vosper
Thanks for sharing this info! Of all the points you listed, I think the most
important for me is "MySQL is well understood, which makes looking after it
quite easy". Personally I think our analytics requirements could be satisfied
by at least half a dozen different systems out there, so management,
optimisation and maintenance will be the major factors. For my part, these are
my thoughts on the systems we've looked at:

1. Riak: K/V stores are conceptually simple, secondary indexes look nice, and
the consensus from RICON was that scaling by adding nodes pretty much Just
Works. However, no one at RICON seemed to use MapReduce, particularly not for
real-time analytics. No SQL-like querying.

2. HBase: Built on popular technologies, and Amazon provides it as a service,
but altogether it's very complex (a lot of components, and we're not a Java
shop), and we've had trouble achieving the performance we need - although it
seems from various blogs and books that it must be possible.

3. Hypertable: Still testing it; it's hard to find much information about it
outside of the main site.

4. Postgres: Everyone knows and loves Postgres. Scaling (and performance
after scaling) are concerns - we might have to go down the sharding route, and
may never be able to store raw event data as it comes in.

Haven't looked at Cassandra yet, but most likely will do that soon.

~~~
karterk
The trick to HBase performance almost always comes down to getting the row key
and column qualifier design right. Your keys and qualifiers should be chosen
to exploit bulk scans. If you have any specific questions, shoot me an email
and I can give you some pointers (email in profile).
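One common shape of the row-key design karterk is hinting at - purely my illustration, not his scheme - is to prefix keys with an entity id and a reversed timestamp, so all events for one entity are contiguous (a single bounded scan) and sort newest-first:

```python
import struct

MAX_TS = 2**63 - 1  # largest value of a signed 64-bit timestamp

def row_key(entity_id: str, ts_millis: int) -> bytes:
    """Hypothetical HBase-style row key: entity prefix keeps one
    entity's events contiguous; the reversed big-endian timestamp
    makes newer events sort lexicographically before older ones."""
    return entity_id.encode() + b"\x00" + struct.pack(">q", MAX_TS - ts_millis)

def scan_bounds(entity_id: str) -> tuple:
    """Start/stop keys covering every event for one entity, suitable
    as the start row / stop row of a bulk scan."""
    prefix = entity_id.encode() + b"\x00"
    return prefix, prefix + b"\xff" * 8
```

The separator byte prevents one entity's scan range from bleeding into another entity whose id shares a prefix; in a real cluster you'd also salt or hash the prefix to avoid hotspotting a single region with monotonically increasing keys.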

------
amalag
Infobright made waves as a column-store analytics database. Its performance
was awesome when dealing with billions of rows. Having Druid use distributed
machines with the data in memory is a different approach, though.

------
pyrotechnick
<https://github.com/metamx/druid>

