
Python Kafka Client Benchmarking - boredandroid
http://activisiongamescience.github.io/2016/06/15/Kafka-Client-Benchmarking/
======
dpkp
kafka-python maintainer here. Our library is designed to be correct first,
easy to use second, and fast third. It should not be surprising to anyone that
using C extensions improves python performance. I have avoided requiring C
compilation in kafka-python primarily because I've found that very few python
users care about processing >10K messages per second per core (remember in
python w/o C extensions you are generally bound to a single CPU, so spinning
up multiple processes usually improves performance. see multiprocessing). I've
also found the python infrastructure for distributing C extensions to be not
easy (see goal #2 above). But that is changing! I would definitely consider
leveraging C extensions for wire protocol decoding given the recent
improvements to wheel distribution on linux. I'm not sure whether I would go
so far as to delegate the entire client to a C extension. Part of the fun of
python is that you can play with all of the guts at runtime. I've found users
are very willing to hack up kafka-python internals to help debug issues. I
don't think I could expect the same community involvement if it was all
distributed as a compiled C extension. But I could be wrong.

Anyways, always fun to read benchmarks. I hope kafka-python makes someone out
there smile. That's the best benchmark in my book.
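
To illustrate the multiprocessing point above, here is a minimal sketch, assuming the per-message work is CPU-bound and can be split into independent batches. It does not talk to Kafka at all; `decode_batch`, `process_in_parallel`, and the JSON payloads are hypothetical stand-ins for whatever a real consumer would do.

```python
import json
from multiprocessing import Pool


def decode_batch(raw_messages):
    """CPU-bound stand-in for per-message work: parse JSON and sum one field."""
    return sum(json.loads(m)["value"] for m in raw_messages)


def process_in_parallel(batches, workers=4):
    """Fan batches out to worker processes; each has its own interpreter and GIL."""
    with Pool(processes=workers) as pool:
        return pool.map(decode_batch, batches)


if __name__ == "__main__":
    # Hypothetical payloads. In a real deployment each worker process would
    # more likely own its own client and its own set of partitions.
    batches = [
        [json.dumps({"value": i}).encode() for i in range(start, start + 1000)]
        for start in range(0, 4000, 1000)
    ]
    print(process_in_parallel(batches))
```

Whether this helps depends on the workload: if the client is waiting on the network rather than burning CPU, extra processes will not buy much.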

~~~
pwang
Distributing Python + C extensions is easy with Conda.

[https://conda-forge.github.io/](https://conda-forge.github.io/)

------
pixelmonkey
My team at Parse.ly also did a benchmark comparing pykafka (pure Python) to
pykafka with the librdkafka C extension enabled. That C module is clearly a
huge win for Kafka consumer/producer performance on Python and other dynamic
languages.

[http://blog.parsely.com/post/3886/pykafka-now/](http://blog.parsely.com/post/3886/pykafka-now/)

Unfortunately, as the OP illustrates, there are now 2 widely-used Python +
Kafka drivers (pykafka and kafka-python), and as of recently, a third,
confluent-kafka-python, which is a thin wrapper over librdkafka.

The reason for all this fragmentation is that Kafka was quite the
moving target for non-JVM languages for the past three years. We have used it
in production since Kafka 0.7, so we've had to live through it all blow-by-
blow. I'm hoping that with Kafka 0.10 recently released, we can finally unify
the community around a single driver (somehow).

~~~
pixelmonkey
@dpkp Apologies for that, I did not mean to mis-characterize. PyKafka also
goes from 0.8 => 0.10. I had assumed kafka-python recently switched to be
0.9-only due to all the changes related to consumer groups.

~~~
dpkp
No apology required. Though note that pykafka requires >=0.8.2, and is only
forwards compatible w/ newer brokers. This means that pykafka implements the
0.8.2 feature set. Newer brokers support that feature set, but you are not
taking advantage of 0.9 or 0.10 features if you connect to them. kafka-python,
on the other hand, is both forwards _and_ backwards compatible. It supports all
feature sets: from no offsets in 0.8, to zk offsets in 0.8.1, to kafka offsets
in 0.8.2, to group management in 0.9, to message timestamps and relative-
offset compressed messages in 0.10. The feature set to use is chosen based on
the broker version we're connected to. As far as I know, no other client
supports this approach -- not python, not java, etc. [Though KIP-35 should
open this up to other clients for backwards compatibility starting at 0.10]
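
The version-based selection described above can be sketched roughly as follows. This is an illustration only, not kafka-python's actual code: `select_features` and the dictionary keys are hypothetical names, and the version thresholds are transcribed from the comment itself.

```python
def select_features(broker_version):
    """Pick a feature set from a broker version tuple such as (0, 8, 2).

    Hypothetical helper; thresholds follow the list in the comment above.
    """
    if broker_version >= (0, 8, 2):
        offset_storage = "kafka"
    elif broker_version >= (0, 8, 1):
        offset_storage = "zookeeper"
    else:
        offset_storage = None  # 0.8.0: no offset storage on the broker side

    return {
        "offset_storage": offset_storage,
        "group_management": broker_version >= (0, 9),
        "message_timestamps": broker_version >= (0, 10),
        "relative_offset_compressed_messages": broker_version >= (0, 10),
    }


if __name__ == "__main__":
    for version in [(0, 8, 0), (0, 8, 1), (0, 8, 2), (0, 9), (0, 10)]:
        print(version, select_features(version))
```

Python compares tuples element by element, so `(0, 10) >= (0, 9)` holds even though the string "0.10" would sort before "0.9".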

~~~
emmett9001
Pykafka does currently have support for 0.9 group management, and we intend to
add support for message timestamps and the other new 0.10 features. We're not,
however, detecting the broker version and turning on features on that basis.
Instead we prefer to let the user explicitly enable the features they're
interested in using.

------
iamspoilt
I ran a couple of Kafka client benchmarks using Python, Jython and Java and
got pretty interesting results. Check them here:
[http://mrafayaleem.com/2016/03/31/apache-kafka-producer-benchmarks/](http://mrafayaleem.com/2016/03/31/apache-kafka-producer-benchmarks/)

~~~
yahyaheee
Would have been interesting to add the c-wrappers in there, but still cool.
Thanks

------
willvarfar
Ah, this reminds me of one of the trickiest bugs I ever tracked down:
[https://github.com/dsully/pykafka/pull/15](https://github.com/dsully/pykafka/pull/15)

~~~
DanWaterworth
You have my condolences.

------
fluential
After a quick glance, the first thing that strikes me is the use of Docker to
measure the performance of a network-bound application. Docker handles
networking differently across versions, and its default networking may have
quite a significant impact on your results. A good example comes from the
Percona guys:
[https://www.percona.com/blog/2016/02/05/measuring-docker-cpu-network-overhead/](https://www.percona.com/blog/2016/02/05/measuring-docker-cpu-network-overhead/)
I wonder what the results would be without Docker, or with Docker using
--net=host.

~~~
StreamBright
I guess some performance testers just don't know what they are measuring, in
this case: the overhead of Docker or the performance of the Python code. To be
fair, it is hard to understand a whole system's performance. I would love to
see a test without Docker though.

~~~
jdennison
Original author here. The docker network point is a good one, I'll give it a
try with host network.

There is still value in comparing different clients under the same network
constraints. Yeah, it is a contrived setup (noted in the post), but at least
it is the same contrived setup for each test.
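
As a rough sketch of the "same setup for each client" idea, a harness might time every client through one shared code path. This is not the code from the post; `measure_throughput`, `compare`, and the `send` callables are hypothetical, and a real run would pass each client's actual produce call instead of the stand-in used here.

```python
import time


def measure_throughput(send, n_messages, payload=b"x" * 100):
    """Call send(payload) n_messages times and return messages per second."""
    start = time.perf_counter()
    for _ in range(n_messages):
        send(payload)
    elapsed = time.perf_counter() - start
    return n_messages / elapsed if elapsed > 0 else float("inf")


def compare(clients, n_messages=100_000):
    """Run the identical loop against every client; clients maps name -> send."""
    return {
        name: measure_throughput(send, n_messages)
        for name, send in clients.items()
    }


if __name__ == "__main__":
    # Stand-in "client" that appends to a list; swap in real produce calls.
    sink = []
    for name, rate in compare({"list-append": sink.append}, 10_000).items():
        print(f"{name}: {rate:,.0f} msg/s")
```

Because every client runs under the same loop, payload, and environment, the absolute numbers may be skewed but the relative ordering is at least comparable.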

------
nerdwaller
Has anyone tried much with the aiokafka library for asyncio
([https://github.com/aio-libs/aiokafka](https://github.com/aio-libs/aiokafka))?

------
sheeshkebab
>I ran these tests within Vagrant hosted on a MacBook Pro 2.2Ghz i7.

Good ole laptop benchmarks

