- A framework for running a query across a lot of machines (a pretty much solved problem)
- Throw a metric fuckton of hardware at the problem
1. "Our 40-instance (m2.2xlarge) cluster can scan, filter, and aggregate 1 billion rows in 950 milliseconds."
I also know that our cluster is a fraction of the size of 40 instances (especially if you don't count the subcluster of web crawlers).
By comparison, it doesn't seem super impressive. Tineye's metadata is stored in a sharded PostgreSQL database, while the image index itself is a separate proprietary engine (also sharded).
Then again, we probably dealt with a smaller set of query types, so we could optimize those scenarios much more heavily. Nonetheless, from my experience it feels like they could do better with any of the options they've tried. It's important to be aware of where the bottleneck is. In our case it was the I/O of the EC2 storage disks (which we mitigated by sharding). Once we moved the cluster off AWS and shoved some nice SSDs into the machines, the performance increase was dramatic: roughly half the shards for twice the speed.
As to the sibling comment, that's another good point. You say it's only fast because it knows about the queries ahead of time, but you don't get excellent real-world performance without what an academic might consider cheating: special-casing your data structures for the queries you expect, where those queries are either discovered empirically and tuned for by hand, or learned automatically over time. A closed-minded approach that simply shuts down creative thinking by parroting "linear time, no can do better, sorry" is poor form.
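To make the "special-case your data structures" point concrete, here's a minimal, purely illustrative sketch (the class and bucketing scheme are hypothetical, not anyone's actual system): if you know queries will group counts by hour, you can pre-aggregate at write time and turn the common query into a dictionary lookup instead of a scan.

```python
from collections import defaultdict

class HourlyCounter:
    """Hypothetical sketch: pre-aggregate counts by hour bucket at
    ingest time so the expected query is O(1) instead of a full scan."""

    def __init__(self):
        self.counts = defaultdict(int)

    def ingest(self, event_ts):
        # Bucket each event into its hour as it arrives.
        self.counts[event_ts // 3600] += 1

    def count_for_hour(self, hour_bucket):
        # The "query" is now a lookup, not a pass over raw events.
        return self.counts[hour_bucket]

c = HourlyCounter()
for ts in (3601, 3725, 7300):   # two events in hour 1, one in hour 2
    c.ingest(ts)
print(c.count_for_hour(1))      # → 2
```

The trade-off is the classic one: you pay a little extra work per write to make the queries you actually run cheap, and the set of cheap queries is exactly the set you anticipated.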
The topic is about how search was faster than aggregation, and I'm saying that searching is faster than aggregation generally. You can parallelize all you want, it's still going to be faster to search than to aggregate.
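The asymmetry above can be shown in a few lines (a generic sketch, not tied to any particular engine): a search can return as soon as it finds a match, while an aggregation has no early exit and must touch every row no matter what the data looks like.

```python
def search_first(rows, predicate):
    # Search can stop at the first hit: best case O(1), and on
    # average far below a full pass when matches are common.
    for i, row in enumerate(rows):
        if predicate(row):
            return i
    return -1

def aggregate_sum(rows, key):
    # Aggregation has no early exit: every row is visited
    # regardless of the data, so it is always a full O(n) pass.
    total = 0
    for row in rows:
        total += row[key]
    return total

rows = [{"v": i} for i in range(100_000)]
idx = search_first(rows, lambda r: r["v"] == 5)  # stops after 6 rows
total = aggregate_sum(rows, "v")                 # touches all 100,000 rows
```

Parallelism shrinks the constant for both, but only search gets to skip work entirely.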
Somewhere there's an HN discussion for it too.
Our hardware bill for our entire server cluster, which we own, is only slightly more than Druid's monthly cloud bill. [My company is the largest real-time analytics provider on the web]
Few people realize that Redis was created by Salvatore to provide a real-time analytics product - and then it grew into so much more. Also, InnoDB's clustered indexes are spectacular when it comes to answering questions like the ones the OP has posed.
397,727,080 visits per month.
1,002,792,862 impressions per month.
209,727,427 absolute uniques per month.
We use a third party to report this data to ensure objectivity when we're doing reports for outsiders. This is for the period March 30 to April 29, 2011.
As an experiment, I wrote a simple, stand alone program that takes a compressed database column and a list of which rows you care about, and does a group by with count. (We're a column store, so every column is stored separately.) For 1 billion records, it took about 250 ms.
This is on a 4-socket Nehalem, so I had 32 cores and used 32 threads. If they're using Nehalems as well, and have 1 socket per box, they're using 320 cores. So they're about a factor of 40 slower than my simple experiment.
To be sure, theirs was a general system that presumably runs on their actual data, whereas mine was just an experiment. Still, it seems quite possible to get the same result with significantly less hardware.
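For anyone curious what such an experiment looks like, here's a hedged sketch of the same shape of computation (the encoding, names, and scale are my assumptions, not the commenter's actual program): a dictionary-encoded column of small integer group codes, a boolean mask of the rows you care about, and a single vectorized pass that counts selected rows per group.

```python
import numpy as np

def group_by_count(codes, selected, num_groups):
    """Count selected rows per group code in one pass over the column.
    `codes` is a dictionary-encoded column (one small int per row);
    `selected` is a boolean row mask ("which rows you care about")."""
    return np.bincount(codes[selected], minlength=num_groups)

rng = np.random.default_rng(0)
n = 1_000_000                    # scale toward 1e9 for a real benchmark
codes = rng.integers(0, 16, size=n, dtype=np.uint8)   # 16 distinct groups
selected = rng.random(n) < 0.5   # the rows we care about

counts = group_by_count(codes, selected, 16)
```

The point of the sketch is the memory-access pattern: one sequential read of a compact column per group-by, which is why a single multi-core box can chew through rows this fast.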
Edit: Also, I see no mention of whether Druid is going to be open sourced or not.
I'm not sure H-Store/VoltDB was a "spin off" of Vertica - the two are fundamentally quite different models.
I'd love to see this one rolling (even a video)
I'll be looking forward to their "part two", where they go into the architecture. Are they really handling 1B hits/s? Or do they have historical data of a billion hits that they can scan in a second?
(for anyone who hasn't seen hummingbird http://demo.hummingbirdstats.com/ and it's on github https://github.com/mnutt/hummingbird)
Definitely a lot of interesting stuff missing from part 1.
I know for a fact that Greenplum can easily cover this load. Did they actually try running all these things on the same size cluster with the same amount of RAM? Were they all tuned properly?
Also, kdb+ achieves a "billion rows per second" scan on a similar dataset with just one or two instances (rather than 40), if you know what you are doing.
Color me unimpressed.