
Introducing Druid: Real-Time Analytics at a Billion Rows Per Second - Anon84
http://metamarketsgroup.com/blog/druid-part-i-real-time-analytics-at-a-billion-rows-per-second/
======
dsl
The architecture boils down to:

    
    
      - A framework for running a query across a lot of machines (a pretty much solved problem)
      - Throw a metric fuckton of hardware at the problem
    

1b rows/sec isn't all that impressive when you find out they have 1.3 Tb of
RAM and a $30,000 a month EC2 bill [1].

1\. "Our 40-instance (m2.2xlarge) cluster can scan, filter, and aggregate 1
billion rows in 950 milliseconds."

~~~
shazow
For comparison, tineye.com (a reverse image search engine that I worked on a
few years ago) queries well over 1 billion images (and thus rows of metadata)
in under 0.5s (our goal was sub-250ms).

I also know that the size of the cluster is a fraction of 40 instances
(especially if you don't include the subcluster of web crawlers).

By comparison, it doesn't seem super impressive. Tineye's metadata is stored
in a sharded PostgreSQL database, the image index itself is a separate
proprietary engine (also sharded).

Then again, we probably dealt with a smaller set of query types so we could
optimize those scenarios much more heavily. Nonetheless, from my experience it
feels like they could do better with any of the options they've tried. It's
important to be aware where the bottleneck is. In our case, the IO of the EC2
storage disks was a bottleneck (which we mitigated by sharding). Once we moved
the cluster off of AWS and shoved some nice SSD drives into the machines, the
performance increase was dramatic (half the shards for twice the speed, pretty
much).

~~~
StavrosK
The difference is that you're scanning, they're aggregating. Scanning is
O(logn), aggregating is O(n).

~~~
barrkel
It depends on the commutitvity and associativity of your aggregation operator.
With sufficient parallelism, aggregation may also be logarithmic.

~~~
StavrosK
Are you saying you can somehow sum numbers without reading all of them?

~~~
barrkel
I'm saying that in terms of real time, if you have enough parallelism, you
don't have to wait for n units, but rather log n units. As a function of CPU
time, sure it's linear, but ignoring what parallelism gives you is just not
being realistic, IMHO.

As to the sibling comment, that's another good point. You say it's because it
knows about the queries ahead of time, but you don't get excellent real-world
performance without what an academic might consider cheating. Special-case
your data structures to expected queries, where those queries are either
empirically discovered and adjusted for manually, or learned automatically
over time. A closed-minded approach that simply shuts down creative thinking
by parroting out "linear time, no can do better, sorry" is poor form.

~~~
StavrosK
This doesn't change the fact that searching is at least as fast as
aggregating, no matter how much you cheat. Besides, how is precomputing
relevant to the discussion here? By that reasoning, I can convert any
algorithm to O(1) time and O(1) space.

The topic is about how search was faster than aggregation, and I'm saying that
searching is faster than aggregation _generally_. You can parallelize all you
want, it's still going to be faster to search than to aggregate.

------
mmaunder
It's always tempting to build it yourself. Initially we built our own data
store too, partly because it was a great geek challenge, and then went back to
InnoDB and NoSQL solutions once we figured we can't compete with 1000's of
open source devs and battle hardened products.

Our hardware bill for our entire server cluster, which we own, is only
slightly more than Druid's monthly cloud bill. [My company is the largest
real-time analytics provider on the web]

Few people realize that Redis was created by Salvatore to provide a real-time
analytics product - and then it grew into so much more. Also InnoDB's
clustered indexes are spectacular when it comes to answering questions like
the one's the OP has posed.

~~~
beagledude
chartbeat?

~~~
mmaunder
On their home page CB currently claim 1,977,762 visitors across all sites. We
see the following across our network:

397,727,080 visits per month.

1,002,792,862 impressions per month.

209,727,427 absolute uniques per month.

We use a third party to report this data to ensure objectivity when we're
doing reports for outsiders. This is from the period March 30 to April 29
2011.

~~~
benologist
ChartBeat's numbers are _concurrent_ users. No idea how that translates on web
but I'd expect it's more than 200m uniques - we do 120ish off 100k - 200k
concurrents.

~~~
mmaunder
Absolute uniques per month is probably a good objective measure of size.

------
martincmartin
Disclaimer: I work at Endeca, which just launched Latitdue, a mostly in-memory
OLAP product. I work in the performance and scale group, and I've been
experimenting with the speed of these things lately.

As an experiment, I wrote a simple, stand alone program that takes a
compressed database column and a list of which rows you care about, and does a
group by with count. (We're a column store, so every column is stored
separately.) For 1 billion records, it took about 250 ms.

This is on a 4-socket Nehalem, so I had 32 cores and used 32 threads. If
they're using Nehalems as well, and have 1 socket per box, they're using 320
cores. So they're about a factor of 40 slower than my simple experiment.

To be sure, theirs was a general system that presumably runs on their actual
data, whereas mine was just an experiment. Still, it seems quite possible to
get the same result with significantly less hardware.

------
IgorPartola
Isn't that what VoltDB is for? <http://voltdb.com/>

Edit: Also, I see no mention of whether Druid is going to be open sourced or
not.

~~~
MichaelSalib
Actually, I think this is what Vertica is for (VoltDB was a Vertica spin-off).
Volt is intended for OLTP workloads whereas Vertica is designed for OLAP,
especially giant star-schema workloads like the post described.

~~~
beagledude
Vertica is legit, we tested it against a billion rows on a single server and
it didn't blink and eye. It reads compressed data, and optimizes compression
based on the column's data type for that ec2 price you could get a few years
of vertica licenses

~~~
amalag
Vertica has EC2 instances available, i am trying to figure out pricing. They
are a distributed DB, but not all in RAM.

------
mrich
SAP's analytic appliance HANA (formerly called Business Warehouse Accelerator)
has been doing this for at least 5 years now. And people think SAP is old-
fashioned...

------
bbulkow
It took them 40 servers to run 1B aggregations per second? Great ---
Citrusleaf recently did 250M aggregations per second with one amazon large.
And, oh, we're fully clustered and reliable, so you can snap in new servers
with no maintenance.

~~~
benologist
How does pricing based on database size work? It doesn't look like you're a
hosted service which makes it feel like a really odd method.

------
yid
Or we can use modern statistics and/or randomized/sublinear/approximation
algorithms...

------
AlexC04
I'd love to see a demo. Ever since I saw hummingbird's demo of real time
traffic analysis I've wanted to write real time analytics.

I'd love to see this one rolling (even a video)

I'll be looking forward to their "part two" where they go on about
architecture. They're really managing 1B hits/s ? Or they have a historical
data of a billion hits and can parse it in a second ?

(for anyone who hasn't seen hummingbird <http://demo.hummingbirdstats.com/>
and it's on github <https://github.com/mnutt/hummingbird>)

~~~
benologist
I think they're just _reading_ 1b rows/second in their db, not writing or
receiving. Can't wait for part 2 though.

~~~
AlexC04
Oh yes! I bet you're excited to compare to playtomic

~~~
benologist
And play with, maybe even use... although I don't know if they're actually
_releasing_ the software either.

Definitely a lot of interesting stuff missing from part 1.

------
foobarbazetc
Lost after "Over the last twelve months, we tried and failed to achieve scale
and speed with relational databases (Greenplum, InfoBright, MySQL) and NoSQL
offerings (HBase)."

I know for a fact that Greenplum can easily cover this load. Did they actually
try running all these things on the same size cluster with the same amount of
RAM? Were they all tuned properly?

------
choxi
Why couldn't you use other distributed in-memory stores like Redis or
Memcached?

~~~
rorrr
Because they don't support complex queries.

------
lordmatty
They should have just used QlikView - its proprietary but blazing fast..

~~~
jens_187
This is distributed though, which I think makes it quite interesting.

~~~
beagle3
Doesn't make it interesting for me; OLAP queries are very easy to distribute.

Also, kdb+ achieves a "billion rows per second" scan on a similar dataset with
just one or two instances (rather than 40), if you know what you are doing.

Color me unimpressed.

------
mdp
I'm more impressed by those UK click through rates on bieberfever.com

------
fauigerzigerk
Sounds like a job for Vertica, but of course that's not cheap.

