
Apache Kafka, Samza, and the Unix Philosophy of Distributed Data - gmalay
http://www.confluent.io/blog/apache-kafka-samza-and-the-unix-philosophy-of-distributed-data
======
module0000
Kafka is presented as a very Unix-friendly message tool, when it is no such
thing. Kafka is (one of) the worst of the collection of message bus daemons.
It imposes small (3-6 MB) message size limits, or else performance goes to
hell. Kafka also fails to "do one thing well", and instead tries to provide
everything-and-the-kitchen-sink (clustering, compression, etc.) built in, but
does so poorly.

For some contrast, take a language like Python, which lets you write small
example programs to test out Kafka vs. RabbitMQ vs. ZeroMQ. You will quickly
find that one of the message bus daemons constantly gets in your way, fails to
reliably deliver messages, and consumes a ton of system resources compared to
the other two. Hint: it's Kafka!

I really can't say enough bad things about Kafka. Having used it, and
unfortunately been forced to implement it, at a number of "big" companies
(Juniper, Cisco, VMware), it has been a horrible and disappointing experience
for both the end users and the developers, every single time.

tl;dr: don't use Kafka; use a real messaging bus.

~~~
schmidtc
Can anyone provide a legitimate criticism of Kafka? "It sucks, don't use it"
isn't very helpful or productive. The closest thing I can find is
[http://engineering.onlive.com/2013/12/12/didnt-use-kafka/](http://engineering.onlive.com/2013/12/12/didnt-use-kafka/),
which is more critical of ZooKeeper than anything else.

I'm evaluating Kafka for a new project and it seems to be a perfect fit. I've
contemplated building something from scratch in Python, as my reliability and
performance demands are pretty minimal. However, a lot of thought seems to
have gone into Kafka's design, and its feature set is a perfect match for my
problem: specifically, the unlimited buffering, log compaction, and the
ability to replay the log from arbitrary offsets.
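
The replay and compaction features mentioned above can be illustrated with a
toy append-only log in Python. This is a sketch of the concept only, not
Kafka's actual API; the class and method names are invented:

```python
# Toy append-only log illustrating replay-from-offset and log compaction.
# Names are invented for illustration; this is not Kafka's API.

class ToyLog:
    def __init__(self):
        self.records = []              # list of (key, value) pairs

    def append(self, key, value):
        self.records.append((key, value))
        return len(self.records) - 1   # offset of the new record

    def replay(self, offset=0):
        # A consumer can start reading from any retained offset.
        return self.records[offset:]

    def compact(self):
        # Log compaction: keep only the latest record per key,
        # preserving the relative order of the survivors.
        latest = {k: i for i, (k, _) in enumerate(self.records)}
        self.records = [r for i, r in enumerate(self.records)
                        if latest[r[0]] == i]

log = ToyLog()
log.append("user:1", "alice")
log.append("user:2", "bob")
log.append("user:1", "alice-renamed")

print(log.replay(1))   # a consumer joining at offset 1 sees the tail
log.compact()
print(log.replay())    # after compaction only the latest value per key
```

Kafka does this durably on disk across partitioned, replicated brokers; the
sketch only shows why replay plus compaction gives you both a change history
and a latest-state snapshot from the same structure.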

If there are any viable alternatives to Kafka, what are they? Bonus points if
the JVM isn't involved.

------
toomanybits
There's a live stream of this talk happening now:
[https://www.youtube.com/watch?v=m-0cSdxiLLY#t=222](https://www.youtube.com/watch?v=m-0cSdxiLLY#t=222)

------
faragon
The problem with "Unix philosophy for everything" is forgetting that it
involves multiple buffer copies and parsing at every step. If you can shrink
the problem in the first step(s), the buffer copying and parsing become
insignificant (cheap early step(s) find the needle(s) in the haystack, and
only then are the higher-cost operations applied).

However, for operations that do no filtering, i.e. when every line/record is
processed, you can increase throughput by avoiding the IPC memory copies and
re-parsing at every step (the copying and parsing can account for a
significant portion of the processing time).
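
The filter-early idea can be sketched as a generator pipeline in Python. This
is a minimal illustration with invented data and stage names:

```python
# Filter-early pipeline: a cheap first stage discards most records, so
# the expensive parsing/processing stage only sees the survivors.

def cheap_filter(lines, needle):
    # Cheap substring scan; no parsing or copying yet.
    return (line for line in lines if needle in line)

def expensive_parse(lines):
    # Stand-in for the costly step (parsing, buffer copies, etc.).
    return [line.split(",") for line in lines]

haystack = ["a,1", "b,2", "needle,3", "c,4", "needle,5"]
result = expensive_parse(cheap_filter(haystack, "needle"))
print(result)   # only the two matching lines were ever parsed
```

The generator plays the role of the cheap first pipeline stage: because it is
lazy, non-matching records are dropped before the expensive stage ever
touches them, which is the space reduction described above.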

~~~
icefox
I often play a game where I first write a shell script in whatever way is
quickest and easiest. I start it running and then, in another terminal, see
if I can figure out the faster "correct" method before it finishes. I rarely
ever win.

------
rch
> you can’t just pipe one database into another, even if they have the same
> data model

> awk doesn’t know about the format of nginx logs

So the complexity of the implementation is a direct function of the
dimensionality and heterogeneity of the data[-], but it still reduces to:

pgsql notify | client daemon | process | search engine

[-] high throughput is another matter

~~~
dkhenry
Also, you can pipe from one database into another; you just need a common
format (SQL), or the ability to convert to one. Every database I know of has
a tool that takes text from stdin and can return text on stdout.

------
kragen
It's true that database servers in the RDBMS ecosystem are not analogous to
gzip or wc or even awk in the Unix ecosystem; they are analogous to Unix
itself, and they live in an uneasy détente with it. They are the "ground of
being" upon which various components can be composed, the role Kleppmann
proposes here for Kafka (and Hadoop, and all the other nonsense underpinning
Kafka). You could argue that database servers are in fact terrible at
composability, even of the things you can put into them (scripting languages,
data types, triggers), and I think that is true. More below on why that is.

The question of how to build an "operating system for the internet" is really
interesting. How can we make it as easy to write a networked program that can
be composed with other networked programs, in the way that you can compose
tail and grep?

Kafka and Samza aren’t it; they're cluster-centric, not internet-centric. They
assume a single administrator, Kafka's publish-and-subscribe model is
inherently unreliable in the presence of spammers or other sources of system
overload, and spammers are unavoidable if you can't kick them off the system.
(If you're going to run Hadoop in "secure mode", then it uses Kerberos for
authentication; I need say no more.)

With regard to system overload, this is exactly backwards:

    As long as the buffer fits within Kafka's available disk space, the slow
    consumer can catch up later. This makes the system less sensitive to
    individual slow components, and more robust overall.

You can reliably feed terabytes of data through a Unix pipeline consisting of
a fast process feeding that data to a slow process (e.g. rot13 | some slow
Perl script), because the pipe applies backpressure: the fast writer blocks
as soon as the small pipe buffer fills. Without backpressure, you have to
buffer the terabytes of intermediate data, which will fill up your disk,
causing your system to fail; or you can drop data, causing your system to
fail. Systems without backpressure cannot provide this kind of composability
reliably. (Similarly, systems that require you to name your data outputs in a
global mutable-object namespace, like Kafka topics, pose obstacles to
composability. This, more than anything else, is what limits the
composability of queries in traditional SQL databases.)
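
The backpressure a Unix pipe provides can be mimicked with a bounded queue in
Python. This is a toy sketch; in a real pipe, the kernel's fixed-size pipe
buffer plays the role of the queue:

```python
import queue
import threading

# A bounded queue provides backpressure: the fast producer blocks as
# soon as the buffer is full, instead of growing without bound.
buf = queue.Queue(maxsize=4)   # tiny buffer, like a pipe's
consumed = []

def producer():
    for i in range(100):
        buf.put(i)             # blocks while the queue holds 4 items
    buf.put(None)              # sentinel: end of stream

def consumer():
    while True:
        item = buf.get()       # the slow side drains at its own pace
        if item is None:
            break
        consumed.append(item)

t = threading.Thread(target=producer)
t.start()
consumer()
t.join()
print(len(consumed))   # all 100 items arrive; memory stayed bounded
```

Swap the bounded queue for an unbounded one and the producer never blocks:
with a consumer that is slower for long enough, the buffer (RAM here, disk in
Kafka's case) grows until it runs out, which is the failure mode described
above.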

Don't get me wrong. I think pub-sub is great, especially for loosely coupled
integration of different systems. I use pub-sub systems every day. I've even
written a few. But don't make the mistake of thinking that they're the
networked equivalent of Unix.

IPFS might be it. There's a whole panoply of other projects which, like IPFS,
are trying to build a decentralized internet OS (MaidSafe, Ethereum, and so
on) and probably sooner or later one of them will succeed.

One more minor quibble, which to my mind shows that the author wasn’t very
interested in writing true things instead of false things:

    you can pipe the output of gunzip to wc without a second thought, even
    though the authors of those two tools probably never spoke to each other

David MacKenzie and Paul Rubin are in fact listed among the contributors to
gunzip.

------
contingencies
tl;dr: the author likens message-oriented distributed architectures to the
classic Unix pipe, while lambasting us for forgetting (in RDBMS design, etc.)
the composability lessons of the 1970s.

------
vegabook
It would be even closer to the Unix philosophy if a competitive distributed
graph processing framework such as these could become available on something
_other_ than the JVM / Java ecosystem. I personally am not using either for
this exact reason, and hope that Golang's goroutines will get their own
distributed processing analogue, or that Mirage OS will take off.

I am amazed that these systems, Storm / Samza / Spark / even the new Flink are
_all_ Java / Scala based.

~~~
teacup50
Why do you find this surprising?

What other language and runtime platform provides:

1) Productive, high-level programming languages.

2) Portability to any OS.

3) High runtime performance for the server-side usage model.

4) Real shared-state SMP (necessary for memory and core efficiency).

5) Standalone, single-directory distribution of the applications and all
library dependencies.

6) A broad, stable, mature base of well-documented, well-tested libraries.

The CLR has some advantages over the JVM, but it also shares most of the JVM's
attributes.

~~~
vegabook
A mature, reliable, performant, but unadventurous platform which attracts
corporates. I am not knocking it - I think it's obvious that the JVM is well
suited, but isn't there room for at least one or two competitors on other
platforms? Do any exist that you know of? I tend to be a bit more adventurous
in my coding projects and given that embarking on projects requires one to be
enthusiastic about one's tools, I cannot use this as Java leaves me cold. This
sounds flippant but really I don't think it is. I love the idea of DAG
workflows but I don't want to do it in Java, even with all its strengths.

~~~
dm3
The other benefit of the JVM is the multiple languages that run on it - have
you tried Clojure or JRuby? Or do you mean Java as in the JVM, and you don't
want that in your stack for some reason?

~~~
vegabook
I am not a fan of any compatibility layer technology as I am a bit of a bare-
metal purist, and I believe Linux does this already. The original post waxes
lyrical about the Unix philosophy, but part of the Unix philosophy is supreme
efficiency and this is violated by the JVM. I know I know - performance is
comparable but then coding is a bit like art - you have to love what you are
doing.

As for using other languages - I played with Python on Storm, but I very
quickly found that I had to basically use Java, because the entire community
and documentation was Java-centric (last I checked - a few months ago).

I may be being obtuse here - in fact I know I am - but my point is that there
is a large community out there which is not Java-friendly (logically or not),
and so I lament the fact that all the advancement in this new, very exciting
field is JVM-based. My dislike for Java was sealed by the Bloomberg terminal
API, parts of whose basic functionality are 5 to 6 dots deep. Seriously, I
had 80+ character function calls. This.that.that.this.this.finally()!

Perhaps we'll just have to kick in, leave our biases behind and start running
with the JVM.

FWIW my use case is streaming fixed income (bond pricing) data analysis.
Python/Numpy is running out of steam fast for me and we're using quite a bit
of C. I basically come from the scientific computing set. We're applying
advanced statistical analysis to pricing (not HFT - we're operating in the
5-minute through 1-week horizon, not 5 ns).

~~~
teacup50
In one breath "I am a bit of a bare-metal purist" and in the next "As for
using other languages - I played with Python on Storm".

So which one is it? :-)

~~~
srean
I am not the OP, but it is not necessarily a contradiction.

The underlying goal in Pythonic data science is to push as much of the
computation as possible into pre-packaged loops that have already been
written in a close-to-the-metal language (C, C++, Fortran). Well, that's as
far as the intent goes; practice deviates from it by degrees.

This does not work nearly as well in Java (or on the JVM), because JNI is
just supremely god-awful. JVM semantics are overly strict, and this
over-specification kills optimization opportunities. The JVM is heavy on
memory; I would rather use that memory for loading more data than fill it
with overhead. Finally, there is an OOP culture that gets in the way.

That said, the JVM is one of the most mature and well-engineered VMs out
there (Java is another story), but it is not very well suited for number
crunching, because there is more to number crunching than calling BLAS APIs.
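
The push-the-loop-into-pre-packaged-C idea can be shown with nothing but
builtins, since sum()'s loop runs in C inside CPython; in real scientific
workloads it would be NumPy/BLAS playing that role:

```python
# Two ways to sum a large sequence: an explicit Python-level loop, and
# the built-in sum(), whose loop runs in C. Same result, but the second
# keeps the per-element work out of the interpreter.

data = list(range(1_000_000))

def python_loop_sum(xs):
    total = 0
    for x in xs:        # every iteration pays interpreter overhead
        total += x
    return total

assert python_loop_sum(data) == sum(data)   # identical result
print(sum(data))   # the C-level loop; prints 499999500000
```

The speedup comes from amortizing one C-level loop over a million elements
instead of dispatching a million bytecode iterations, which is exactly the
move that a clumsy FFI (like JNI) makes painful.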

