Is there an infographic that organizes the Apache big data and ML projects based on their use case? There seem to be many projects with overlapping use cases.
I have definitely shared that sentiment. However, Apache didn't create this project; it's from Cloudera originally, and they've got a pretty good track record. In this case, Kudu is more of a database/store on top of which analytics products can run (for example, it's a best-in-class engine for Impala, allowing fast reads and writes). I wouldn't quite say it's like HDFS, which is more of a document store/distributed file system.
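If it helps make the distinction concrete, here's a rough sketch using the kudu-python client; the master address, table name, and schema are made up for illustration:

```python
import kudu
from kudu.client import Partitioning

# Connect to a Kudu master (address is a placeholder)
client = kudu.connect(host='kudu-master.example.com', port=7051)

# Kudu stores structured tables, not files -- you define a schema
builder = kudu.schema_builder()
builder.add_column('key').type(kudu.int64).nullable(False).primary_key()
builder.add_column('value').type(kudu.string)
schema = builder.build()

# Hash-partition rows across tablets by primary key
partitioning = Partitioning().add_hash_partitions(column_names=['key'],
                                                  num_buckets=3)
client.create_table('demo', schema, partitioning)

# Individual rows can be inserted and updated in place,
# which append-only HDFS files can't do
table = client.table('demo')
session = client.new_session()
session.apply(table.new_insert({'key': 1, 'value': 'hello'}))
session.apply(table.new_update({'key': 1, 'value': 'world'}))
session.flush()

# Fast scans are what make it a good backing store for engines like Impala
scanner = table.scanner()
print(scanner.open().read_all_tuples())
```

The point being: you talk to Kudu in terms of tables and rows, whereas HDFS gives you files and blocks.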
Not just big-data projects but many other projects move from companies to the ASF. Apache Groovy was previously developed by two programmers and a project manager on VMware's dime, until they dropped it. To say VMware "donated" Groovy to the ASF would be inaccurate, though.
Actually no, there is currently no alternative to HDFS. The only comparable, compatible platform for the ecosystem is MapR-FS, which is closed source. Kudu is an interesting experiment: C++ brings a level of memory management and performance to a space where the JVM typically falls down.
Alternatively, here is a list of the storage engines for MySQL:
InnoDB
MRG_MYISAM
BLACKHOLE
CSV
MEMORY
FEDERATED
ARCHIVE
MyISAM
Hadoop used to just have HDFS... now it also has Kudu
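To stretch the MySQL analogy a bit: the engine is a per-table choice, which you can see from any standard connector. A rough sketch, assuming mysql-connector-python and made-up credentials:

```python
import mysql.connector

# Connection details are placeholders
conn = mysql.connector.connect(host='localhost', user='root',
                               password='secret', database='test')
cur = conn.cursor()

# Ask the server which storage engines it supports
cur.execute("SHOW ENGINES")
for engine, support, comment, *_ in cur:
    print(f"{engine}: {support} -- {comment}")

# The engine is picked per table, not per server
cur.execute("CREATE TABLE kv (k INT PRIMARY KEY, v TEXT) ENGINE=InnoDB")
conn.close()
```

Same idea in the Hadoop world: HDFS and Kudu are different storage options underneath the same ecosystem of query engines.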
It's true that the project was initially developed at Cloudera, and Cloudera employees continue to be the main driving force behind development. That said, we have committers and contributors from other companies as well. Roughly half the people who contributed a patch in the last three months are from outside Cloudera. Additionally, we are very strict about doing all development upstream (e.g., with the first open source release we spent a lot of effort to open the entire development history going back to 2012, including JIRA, git, etc.).
As for users, here are a couple of examples, off the top of my head, of users who aren't currently paying for any support: