
At a previous employer, we built a system using Druid as the primary store of reporting data. The setup worked amazingly well for the size/cardinality of the data we had, but was constantly bottlenecked on paging segments in and out of RAM. Economically, we just couldn't justify a system with enough RAM to hold the primary dataset. As a result, we had to prioritize data aggressively, focusing on the more recent transactions and locating them on the few very-high-RAM servers we did have. Historical data segments had to go through a lot of paging in and out of RAM, and user experience on YTD (year-to-date) or YOY (year-over-year) reports really suffered as a result.
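For context on what that prioritization looks like mechanically: Druid lets you pin recent data to a high-RAM tier with coordinator load rules. A minimal sketch, assuming a hypothetical "hot" tier and "transactions" datasource (illustrative only, not our actual setup):

    # Illustrative sketch only -- tier names, datasource, and coordinator host
    # are hypothetical, not the production setup described above.
    import requests

    rules = [
        # Keep the most recent 3 months double-replicated on high-RAM historicals.
        {"type": "loadByPeriod", "period": "P3M",
         "tieredReplicants": {"hot": 2}},
        # Everything older falls back to the cheaper default tier, single replica.
        {"type": "loadForever", "tieredReplicants": {"_default_tier": 1}},
    ]

    resp = requests.post(
        "http://coordinator.example:8081/druid/coordinator/v1/rules/transactions",
        json=rules,
    )
    resp.raise_for_status()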

I don't have access to the original planning calculations anymore, but 375GB at $1520 would definitely have been a game changer in terms of performance/$, and I suspect it would have been good enough to make the end user feel like the entire dataset was in memory.




Make sure you're looking at updated prices for RAM too. 16x16GB of registered ECC DDR3 is about the same price and enormously faster.
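Back-of-envelope with the numbers in this thread, assuming the 375GB drive at $1520 and 16x16GB = 256GB of registered ECC DDR3 at roughly the same price:

    $1520 / 375 GB ≈ $4.05 per GB (the 375GB SSD)
    $1520 / 256 GB ≈ $5.94 per GB (registered ECC DDR3)

So the SSD gets you roughly 45% more capacity per dollar, while the RAM is far ahead on latency and bandwidth.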


Sure, but I believe the available chassis limited us to far fewer than 16 slots.


Well, the first Google result for "1u 16 dimms" is a refurbished chassis+motherboard+PSU for a hundred bucks. Brand new costs more, but not terribly so; the main cost is the RAM whether you go 8 slots or 16.

These SSDs have situational uses, but unless you want 10+ TB in one server, you can get a system with >50% as much actual RAM for the same price.


It's not the cost. We ran standardized chassis, so whatever our ops team had standardized on is what we got...


Would you choose to run with Druid again?


For that use case, absolutely! We made do with a version that could not even support label appends (limited joins). The current version would have required far fewer workarounds.

The probabilistic HyperLogLog data type is also a game changer compared to, say, Redshift, but again it's only viable if you are counting (estimating) unique entities across billions of rows and super-wide dimension sets.
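To illustrate what that looks like in practice (datasource, metric, and broker host here are assumptions, not from a real deployment), the hyperUnique aggregator in a native timeseries query does the approximate distinct counting:

    # Rough sketch -- names and hosts are hypothetical.
    import requests

    query = {
        "queryType": "timeseries",
        "dataSource": "transactions",
        "granularity": "day",
        "intervals": ["2016-01-01/2017-01-01"],
        "aggregations": [
            # Approximate daily distinct users via the HLL-backed hyperUnique
            # metric defined at ingestion time.
            {"type": "hyperUnique", "name": "unique_users",
             "fieldName": "user_id_hll"},
        ],
    }

    resp = requests.post("http://broker.example:8082/druid/v2/", json=query)
    resp.raise_for_status()
    print(resp.json())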

If you are doing a general purpose analytics store, Redshift is hard to beat because of reliability and ease of implementation.

Druid is a purpose-built race car. Redshift is a good crossover - far less headache, and it can do almost any job well enough, but you won't get the same tunability or performance (when tuned right) at scale. Although, I'm continuously impressed with what Redshift actually can do, despite the humble feature set.

Druid's main weakness is its lack of SQL support, so it's not a great analyst datastore. You pretty much have to wrap it in a reporting app.
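As a rough sketch of what "wrap it in a reporting app" means in practice: the app translates report parameters into native JSON queries instead of handing analysts SQL. Field and datasource names below are hypothetical:

    # Hypothetical wrapper a reporting app might use: build a native topN
    # query from UI parameters, then POST it to the broker at /druid/v2/.
    def top_dimension_report(dimension, interval, limit=20):
        return {
            "queryType": "topN",
            "dataSource": "transactions",   # hypothetical datasource
            "dimension": dimension,
            "metric": "revenue",            # sort by the summed revenue below
            "threshold": limit,
            "granularity": "all",
            "intervals": [interval],
            "aggregations": [
                {"type": "doubleSum", "name": "revenue", "fieldName": "revenue"},
            ],
        }

    # e.g. top_dimension_report("campaign", "2016-01-01/2017-01-01")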


Hi sologoub - can you elaborate a bit on the Redshift tuning you're referring to? What's the pain there? Asking because we're building a performance management product for Redshift; I'd love your input! lars at intermix dot io


What do you think of ClickHouse vis-a-vis Druid and Redshift?


Don't have any experience with that tech, but from reading the marketing landing page it sounds more akin to memSQL than Redshift, in that it seems to include options for streaming ingestion.

If I were to take on a similar project, I might POC memSQL or Citus DB, and possibly BigQuery (if the project were built on Google Cloud as opposed to AWS or raw iron).



