
Hi, I work on Apache Kudu (incubating).

Analytics is what Kudu was designed for (it's not just marketing), so some tradeoffs were made. You'll get the biggest bang for your buck if your use cases are heavy on inserts and big scans with selective filters. Other use cases will perform OK, or just meh, versus other storage engines. Also note that Kudu is still pre-1.0.

Thanks for the explanation. We're currently using Elasticsearch for doing ad-hoc aggregations (e.g. count per month, filtered by a few dimensions) in real time. Indexes are some tens of millions of items per year per column; not much by big-data standards, but enough that we need a whole bunch of nodes, and small enough that even wide queries (like getting a total for the entire year) finish in a couple of seconds. A large part of ES's speed comes from being able to perform the aggregation function locally on each node that has the shard, in parallel, rather than shipping the full data set to the client first. Is this a kind of use case where Kudu would perform well? How does Kudu handle data locality? Would you actually run queries like these directly, or would you precompute periodic rollups?
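For concreteness, the kind of "count per month, filtered by a few dimensions" query described above might look roughly like the following Elasticsearch request body. This is only a sketch: the field names (`timestamp`, `country`, `product`) are made up, and the exact aggregation DSL keywords vary between ES versions.

```python
# Hypothetical ES request body: monthly counts, filtered by two dimensions.
# Each node holding a shard computes its date_histogram buckets locally;
# only the per-shard bucket counts are merged at the coordinating node.
query = {
    "size": 0,  # skip the raw hits; we only want the aggregation buckets
    "query": {
        "bool": {
            "filter": [
                {"term": {"country": "US"}},      # hypothetical dimension
                {"term": {"product": "widget"}},  # hypothetical dimension
            ]
        }
    },
    "aggs": {
        "per_month": {
            "date_histogram": {"field": "timestamp", "interval": "month"}
        }
    },
}
```

The point is that the heavy part (bucketing millions of docs) stays on the data nodes; the client only ever sees twelve-ish buckets per year.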

You seem to have a pretty typical use case that we're targeting. One thing to understand about Kudu is that it doesn't run queries; it only stores the data. You can use Impala or Drill on top: they'll figure out the locality, push the filters down to Kudu, and apply the aggregations properly.
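The win from that filter pushdown can be sketched in miniature. This toy example (plain Python, not Kudu's actual API) just illustrates the principle: applying the predicate on the storage side means only matching rows cross the network, instead of every row being pulled to the client and filtered there.

```python
# Toy illustration of predicate pushdown (not Kudu's real interface).
# 12 months x 3 countries = 36 rows living on the "storage" side.
ROWS = [{"month": m, "country": c}
        for m in range(1, 13) for c in ("US", "DE", "JP")]

def scan_with_pushdown(rows, predicate):
    """Storage side applies the filter; only matches are shipped."""
    return [r for r in rows if predicate(r)]

def scan_without_pushdown(rows):
    """Client pulls every row, then has to filter locally."""
    return list(rows)

shipped = scan_with_pushdown(ROWS, lambda r: r["country"] == "US")
naive = scan_without_pushdown(ROWS)
print(len(shipped), len(naive))  # 12 rows transferred vs. all 36
```

A query engine like Impala takes this further by also scheduling its scan fragments next to the Kudu tablet servers that hold the data, so the partial aggregates are computed locally too.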

Did you initially pick ES over systems like Impala because of the lack of real-time inserts/updates when used with HDFS?

BTW, here's a presentation that might help you understand Kudu: http://www.slideshare.net/jdcryans/kudu-resolving-transactio...

Thanks, that's helpful. We picked ES for several reasons. Firstly, we're not a Java shop, and the Hadoop ecosystem is heavily biased towards JVM languages.

Secondly, ES is easy to deploy and manage. Being on the JVM, it admittedly has a considerable RAM footprint, but at least it's just one daemon per node. With anything related to Hadoop, it seems you have this cascade of JVM processes that inevitably need management. And lots and lots of RAM.

Thirdly, as you point out it's easy to do real-time writes.

I do like the fact that Kudu is C++.

Well, TBH, you do get to pick which Hadoop-related components to run; HDFS's DataNode itself is happy with just a bit of RAM. I do understand the concern, though.

You're probably happy with what you have in prod, but if you get some time to try out Kudu, feel free to drop by our Slack channel for a chat! http://getkudu.io/community.html

Thanks — I will definitely be keeping an eye on the project.
