
Apache Kudu – Fast Analytics on Fast Data - espeed
http://kudu.apache.org/
======
jaipilot747
Is there an infographic that organizes the Apache big data and ML projects
based on their use case? There seem to be many projects with seemingly
overlapping use cases.

~~~
buryat
Hortonworks has an infographic of the Apache projects that they support:
[http://hortonworks.com/apache/](http://hortonworks.com/apache/)

And here's an attempt to keep track of all Hadoop-related projects:
[https://hadoopecosystemtable.github.io](https://hadoopecosystemtable.github.io)

------
arnon
Oh god, /another/ Apache fast data analytics platform.

Apache is like a powerhouse in creating these short-lived projects that force
companies to get the most expensive, esoteric talents.

~~~
buryat
All the big-data-related projects were donated to Apache by companies, not
developed by the ASF, and they're still active:
[https://projects.apache.org/projects.html?category](https://projects.apache.org/projects.html?category)

~~~
vorg
Not just big-data projects but many other projects move from companies to the
ASF. Apache Groovy was previously developed by 2 programmers and a project
manager on VMware's dime, until they dropped it. But to say VMware "donated"
Groovy to the ASF would be inaccurate.

------
espeed
A talk on Kudu by Todd Lipcon (Kudu founder) - New Hadoop Storage for Fast
Analytics on Fast Data...

[https://www.youtube.com/watch?v=32zV7-I1JaM](https://www.youtube.com/watch?v=32zV7-I1JaM)

------
hapless
As far as I can tell, Kudu is dominated by Cloudera. They open sourced it as a
marketing strategy.

Is anyone who isn't paying for CDH using this?

~~~
tlipcon
Apache Kudu project founder here:

It's true that the project was initially developed at Cloudera, and employees
continue to be the main driving force behind development. That said, we have
committers and contributors from other companies as well. Roughly half the
people who contributed a patch in the last 3 months have been non-Cloudera.
Additionally, we are very strict about doing all development upstream (e.g.,
with the first open source release we spent a lot of effort opening up the
entire development history going back to 2012, including JIRA, git, etc.).

As for users, here are a couple examples off the top of my head who aren't
currently paying for any support:

- Xiaomi (world's 4th largest smartphone maker) collects ~2TB/day of event
data from >5 million phones into a cluster which simultaneously runs analytics
workloads (SQL, Spark, etc.)

- CERN is looking at using Kudu to store high energy physics experiment data
from the ATLAS detector at the LHC. You can find some code at
[https://gitlab.cern.ch/zbaranow/kudu-atlas-eventindex](https://gitlab.cern.ch/zbaranow/kudu-atlas-eventindex)
and a poster here:
[https://indico.cern.ch/event/505613/contributions/2230964/at...](https://indico.cern.ch/event/505613/contributions/2230964/attachments/1346598/2039266/poster-200.pdf)

(Of course, there are lots more whose names I don't have permission to mention.)

Feel free to join our Slack if you're interested in chatting more; there are
usually plenty of people online there:
[https://getkudu-slack.herokuapp.com](https://getkudu-slack.herokuapp.com)

-Todd

------
tlipcon
For those interested in a technical overview,
[http://kudu.apache.org/kudu.pdf](http://kudu.apache.org/kudu.pdf) is our
academic-style (but not submitted to any journal) paper.

