
Kudu – Fast Analytics on Fast Data - strlen
http://getkudu.io/
======
tlipcon
Todd from the Kudu team here. If anyone has any questions, feel free to ask
them here; I'll try to check back throughout the day.

~~~
eanews
The FAQ mentions there are no security features at the moment, but do you have
any thoughts as to what the security goals are? In particular, will there be
cell-level security a la Accumulo, or support for encrypting data at rest?

~~~
tlipcon
We haven't scoped out the security features. Cell level security can be
difficult to implement efficiently, but if we see enough demand for it, I
could imagine it happening.

My guess is that the first pass will be table and column level authorization,
plus of course strong authentication. Row/cell/predicate-based security could
be added in a later release, but it's a feature that's less commonly required.

As for encryption at rest, I imagine that will also be fairly high priority as
we move towards GA or the first few releases after GA. But again, we haven't
done the scoping exercise yet, so I'm cautious to throw out dates :)

If you're interested in helping to contribute either feature, let us know!
kudu-dev@googlegroups.com

------
bankim
Curious what's the reason for implementing Kudu in C++ and not Java/Scala?

~~~
tlipcon
I spent a lot of time in 2011 or so struggling with GC on the JVM:
http://blog.cloudera.com/blog/2011/02/avoiding-full-gcs-in-hbase-with-memstore-local-allocation-buffers-part-1/
has some of the gory details. Even hacked a bit on G1:
http://mail.openjdk.java.net/pipermail/hotspot-gc-dev/2011-April/002452.html

With a lot of effort by many folks in the community, HBase has mostly tackled
the full-GC problem, but still has occasional issues with some workloads.

So, GC was definitely one factor - not having GC means we can give 99th
percentile numbers in the single-digit milliseconds, which is pretty nice. Our
master process actually has shown <1ms 99.99th percentile for tablet location
requests on an 80 node cluster. So again, numbers that are super difficult to
get on the JVM unless you take an allocation-free approach like the HFT guys
do.

Another factor was ease of integration of platform-specific code for
performance reasons. For example, we make use of SSE prefetch instructions to
improve scan speed in our concurrent B-tree by 30% or so. The B-tree itself
would be difficult to implement in Java due to lack of control over object
layout, etc. While you can eventually get the same performance with enough
off-heaping and sun.misc.Unsafe, my feeling is that, by the time you've gone
down that road, you might as well be using C++.

I'll admit that, after many years of not writing native code, I was a bit
nervous about diving back in. Segfaults are never fun. But, we soon realized that
the native code tooling has improved a _ton_ in the last decade. We run all of
our tests precommit using the excellent Sanitizer tools from Google
(ThreadSanitizer, AddressSanitizer, LeakSanitizer) and those make it nearly
trivial to diagnose a leak or crash. We also have pretty strict guidelines
around use of pointers, based on the Google C++ guidelines. Many will complain
that this is a neutered form of C++, and they're right. But it's also a
relatively safe form of C++.

I could probably write a lengthy blog post on our experiences of C++ vs Java,
but hopefully the above gives you a taste. Overall I've been happy with the
decision. Slightly more time spent on crashes. Less time spent on chasing
hard-to-reproduce performance or memory consumption issues. And the thread
checking tools are actually far superior, so I'd say less time spent chasing
races.

~~~
random3
I think writing a long post about your experiences of C++ vs Java would be
great (I'd pay you in [choose your drinks and count] for it ;)).

------
nfa_backward
Kudu is being positioned as filling the gap between HDFS and HBase. After
reading the overview I see this more as bringing features from
HDFS+Parquet+HBase. Does that sound reasonable?

Super excited about this and even more so since it is open source. Thank you!

~~~
tlipcon
Yep, that's correct. HDFS+Parquet is more accurate but doesn't fit quite as
well on slides and short descriptions.

The idea is to get the analytic scan performance of Parquet while still
allowing for in-place updates and row-by-row access like HBase.

HDFS (with Parquet or other formats) will still be better for unstructured or
fully immutable datasets. HBase will still be better when your top priority is
ingest rate, random access, and semi-structured data. Kudu should be good when
you've got tabular data as described above.

~~~
nfa_backward
Impala has an in-memory columnar format on its road map for 2016. Is that
format being designed with Kudu in mind?

Edit: I understand that the formats, while both columnar, serve different
purposes. I am more curious about overlap if any between the two.

~~~
tlipcon
Yep, I've been taking part in those design discussions. We hope to have Kudu
tablet servers support generating this in-memory format in shared memory as
the result of scans, so the Impala server (client from Kudu's perspective) can
directly operate on the data. We're expecting a 20-30% speed boost from this
for some queries, though we haven't done any at-scale tests of the prototype.

------
vvladymyrov
Todd, any plans to add user-defined functions? Will they be only UDFs written in
C(++)? I'm curious how you think UDF support could be designed for the native
code implementation.

