
Show HN: KQ – Simple Job Queue for Python Using Kafka - joohwan
https://github.com/joowani/kq
======
Xorlev
One of the biggest problems with treating Kafka as a job queue is that you
suffer from head-of-line blocking. Kafka doesn't expose per-message
visibility/acknowledgement semantics the way RabbitMQ, Redis PUSH+POP, or SQS do.
Each consumer group tracks offsets into the partitions of a log (aka a topic).
This offset is just a number that points to a specific message in the Kafka
partition. If you get stuck on message 123, you either can't proceed to 124,
proceed and don't commit your offset but risk replaying 124, or skip 123.
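The trade-off between those three options can be sketched with a toy model (plain Python standing in for a real consumer; this is not actual Kafka client code, just an illustration of the single-offset constraint):

```python
def consume(log, handle, policy):
    """Toy single-partition consumer. 'committed' is the only state a
    consumer group keeps: one integer offset, no per-message acks."""
    committed = 0             # offset a restarted consumer would resume from
    failed = False
    for pos, msg in enumerate(log):
        if handle(msg):
            if not failed:
                committed = pos + 1
        elif policy == "block":
            return committed            # head-of-line blocking: stop here
        elif policy == "skip":
            committed = pos + 1         # drop the failed message, move past it
        elif policy == "no_commit":
            failed = True               # keep working, but a restart replays
    return committed
```

With a log `[123, 124, 125]` where 123 fails: "block" never reaches 124, "skip" loses 123 entirely, and "no_commit" processes 124 onward but replays everything from 123 after a restart.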

A great many of our services publish to Kafka. Those consuming services that
treat individual records as tasks (or bundles of tasks), as opposed to a
linear log, must either skip failures or push them onto SQS for background
retry. Our batching consumers have to track out-of-order completion of work
and commit up to the lowest completed offset, meaning a slow task can delay
offset commits. If a consumer is stopped before finishing that slow task, we
have to replay work which means all work has to be idempotent. In practice, it
works well enough, but it's still some gymnastics.

I suspect this is why Google invested so much into making PubSub scalable
despite per-message semantics. It's considerably simpler in many ways, even if
you have to bake in your own ordering/monotonically increasing identifiers.

~~~
joohwan
Very true. I indeed found the lack of per-message visibility very painful
when I was building this. One way I tried to alleviate the issue was
providing a consumer "callback" to make it easier for users to plug in their
own code to handle job failures (like your example of using SQS).
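For illustration, such a callback might look like this (the signature is hypothetical, not KQ's actual API; a plain list stands in for an external store such as SQS):

```python
retry_queue = []  # stand-in for an external retry store such as SQS

def on_job_done(status, job, result, exception):
    """Hypothetical failure hook: jobs that failed get handed off for
    background retry instead of blocking the consumer."""
    if status == "failure":
        retry_queue.append(job)
```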

I've also thought about reserving a topic + consumer group specifically for
failed jobs and baking the retry logic into KQ itself. But that's an area I must
explore more.

I'm not sure if I understand what you are saying about batching consumers.
What do you mean by batching in this context? Thanks for your input.

~~~
Xorlev
We have some consumers which treat log entries as tasks, and often it's handy
to debounce some of the work into larger chunks that can be executed in
parallel. The chunks can be linear or they could be grouped by some property
of the message (e.g. account id). In that case, we have batches of messages
with multiple non-consecutive offsets, e.g. [123, 145, 155], [122, 124, 144].
In practice, that means inserting each message offset into a per-partition
sorted set of pending work. When a batch completes, all the offsets in that
batch are marked as "complete" and we commit the lowest safe offset. Using the
example above, if the batch [122, 124, 144] completed, we'd still have [123,
145, 155] outstanding, which means the lowest safe offset is 122* even though
124 and 144 also completed. Until that first batch completes, 123 is still
outstanding, making it the barrier to committing a higher offset.

Our batching consumers provide pluggable behavior for handling a failing
batch, but usually it's pushed onto SQS since those can cycle around a few
times until we notice and fix whatever condition is preventing progress on
that work.

* - 123 actually, as if you commit offset 123 the consumer will fetch offset 123 again on start, but that's implementation esoterica
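A minimal sketch of that per-partition bookkeeping (pure Python, names illustrative; following the footnote's convention that you commit the lowest still-outstanding offset):

```python
import heapq

class OffsetTracker:
    """Tracks out-of-order completion of one partition's offsets and
    reports the lowest offset that is safe to commit."""

    def __init__(self):
        self._pending = []   # min-heap of in-flight offsets
        self._done = set()   # completed offsets still blocked by a lower one

    def start(self, offset):
        heapq.heappush(self._pending, offset)

    def complete(self, offset):
        self._done.add(offset)
        # Advance past every leading offset that has completed.
        while self._pending and self._pending[0] in self._done:
            self._done.remove(heapq.heappop(self._pending))

    def committable(self):
        # Commit the lowest still-outstanding offset, since Kafka
        # re-fetches the committed offset on restart.
        return self._pending[0] if self._pending else None
```

With the batches above: after [122, 124, 144] completes, `committable()` is 123; only once [123, 145, 155] also completes does the barrier clear.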

------
pfarnsworth
Anyone have any opinions on the long-term viability of Kafka? I've been
lurking on the kafka dev mailing lists and I'm fairly turned off by the
attitude of the Confluent employees I've read. There was a recent thread
about their open source status and how they backhand Apache's open source
philosophies; I'm wondering if they are thinking of moving away from open
source in the future.

~~~
winteriscoming
I am not associated with the Kafka dev team or Confluent in any way but have
been using Kafka since their 0.7.x days. As for their open source commitment
so far, I haven't seen any problems in the way they deal with the community.
They have been open to bug fixes, feature enhancements and other
contributions.

Many of the Confluent employees you see currently started off by contributing
to the open source code and still do, from what I can see.

>> There was a recent thread about their open source status

I am not sure which thread you are talking about, but if it is the thread
about adding a REST server to Kafka core, then I completely back what many of
the Confluent employees and other community members decided on that topic.
From what I could see in that thread, the whole argument for "we should bring
the REST server into Kafka core" was not technical but hypothetical:
"there's a project out there which already supports this REST feature, but
what if they don't like my contributions and don't allow me to push features
that I like into that repo". That proposal was, IMO, rightly rejected, but at
the same time the users were invited to state the technical reasons why they
want that feature in core.

Given any production-usable project that has a large user base, discussions
and decisions like these are common, and they don't necessarily mean the
project is moving away from open source. Overall, I have high respect for
many of the members of the Kafka dev team, many of whom are currently
employed at Confluent, for the way they have so far dealt with suggestions
for enhancements to the project.

Of course, Confluent builds on top of Kafka and has/will have their own
commercial interests, so some of the features that they develop might/will be
commercial.

~~~
pfarnsworth
Yes, I think that's the one but my biggest takeaway from what I remember was
"Apache is okay but..." I remember reading about half a dozen Confluent
employees toeing the party line and parroting the exact same argument, about
how being under Apache was stifling their innovation. And yet Kafka is
thriving under Apache, so their arguments smelled fishy as hell.

It sounds like they are setting up the foundation to pull something like what
Oracle did with MySQL, and try to take control after countless outside helpers
turned the product into something rock solid.

~~~
SEJeff
There are plenty of outside helpers, but the heavyweights doing the
overwhelming majority of the work are either Jay, Jun, or Neha and/or work for
one of those 3... All at Confluent. I've met both Jay and Jun in person, and
watched Neha present @ MesosCon last year. If they're doing something with
tech, it is going to be good.

------
smegel
> It is backed by Apache Kafka and designed primarily for ease of use.

That's a bold claim.

~~~
rads
He means the software requires Kafka as a dependency, not that the foundation
is sponsoring it. The claim of "designed primarily for ease of use" is a
personal one, not hard to make.

~~~
mperham
I read the comment as "simple and ease of use are not terms often associated
with Kafka".

~~~
smegel
You would be correct :)

------
stratospherein
Don't you need ZooKeeper to use Kafka? Perhaps you should mention that in
your dependencies.

------
jondot
To me, Kafka as a job queue is a painful impedance mismatch. To achieve that,
you need to:

1\. Figure out which Kafka broker version you're using. The consumer concept
and consumer APIs in 0.7 are different from 0.8, which is different from 0.9,
and different still from 0.10: ranging from non-existent, to quirky, to
finally a good design that works. But you still need to make sure offsets are
committed in time.

2\. Offset management is a thing. If you're lucky and using recent brokers
you're good, but you'll still have to make sure the timings of offset commits
and heartbeats interleave in a way that doesn't make Kafka think your
consumers are dead. If you had trouble imagining this scenario - exactly.
This was a very hard race condition to find, and it's solely up to you and
your motivation to fix it rather than write it off as "a random glitch".

3\. Kafka clients are still radically different, supporting different
versions. You need to be lucky enough to use a language and a platform that
are in harmony with the latest consumer APIs. I'm certain it'll converge, but
the ecosystem will converge more slowly.

4\. For the reasons Xorlev mentioned, you will find yourself up against the
wall, making sure each job task is idempotent. Suddenly this becomes a people
management problem too.

All of these can (and probably _will_) be solved, however I feel that (2) and
(4) will always be there, because that's part of why Kafka is so great.

In addition, I think Kafka is one product for which you _must_ read the
"whitepaper"[0] before you build consumers. First, because it's an innovative
design that might come in handy in everyday life if you're an engineer;
second, to understand the founding context in which it was created - logs -
and why so many tradeoffs were made for it to be amazing at that, and to
realize that this original founding context was _not_ transactional jobs.

Switching gears now. Many organizations find Kafka a much-needed cure for
data processing pipelines, making events and messaging a first-class citizen
in the organization from a _data_ point of view. For that, Kafka is amazing.
With it, you can realize the dream of having an "event mart" where groups and
teams consume and publish their view of the world, processed, as a message
stream, and someone can pick up that stream and build a completely different
product on top of it (not a perfect example, but one we can all relate to -
think of Twitter's firehose).

The perception problem is that once this floods the organization, it takes
little to apply the same mindset to building _operational_ and transactional
queue systems, where you don't process events or data but perform tasks.
Unfortunately, the mindset doesn't transfer. I'd be happy if there were
stronger education about this from Kafka's side.

For the kq project - I wish it the best of luck and I'd be interested to see
it unfold. The code is very clean and I feel it invites you to just read and
learn from it - kudos!

[0]
[http://www.longyu23.com/doc/Kafka.pdf](http://www.longyu23.com/doc/Kafka.pdf)

~~~
joowani
Thank you for the excellent feedback and insight. I will definitely give the
pdf a read. I agree with all of your points, which admittedly I was not fully
aware of when I first embarked on this project. As you've implied, some
nuances may forever be inevitable due to the inherent design of Kafka. But I
wouldn't want to dismiss Kafka as unsuitable for job queues so early. It
depends strongly on the use case, of course (e.g. jobs that are idempotent,
or jobs without a hard requirement to be processed), but I am hoping that
Kafka's API matures further with finer control over messages and that this
could work fairly well for the most part. For now I will take note and update
the documentation to clearly explain what KQ is (and what it is not), and
what best practices and use cases should be taken into account before using
it. Thanks again!

------
reubano
Why should I use this over the simpler RQ?

~~~
joowani
RQ is certainly useful (with finer control over messages), but we've been
having a lot of problems with it lately in production due to it being memory
bound (Redis). Improper code deploys or insufficient/stuck workers would
quickly explode the queues and make the broker go OOM in a matter of hours.
Memory is also very expensive. With KQ/Kafka I was hoping to get a lot more
buffer for human error and better scaling.

~~~
reubano
Ok, that makes sense. I've never dealt with the scale you are talking about so
have yet to run into these issues with redis. It may be best to put this info
in the readme so people can decide for themselves when to use RQ or KQ.

------
dozzie
What kind of jobs do people put into such queues? I've only seen job queues
used as poor man's RPC brokers and as (poor?) replacements for crontab. I
think I may be missing some use cases where this is a valid choice.

~~~
chrishacken
If you don't like it, don't use it. I read through your profile's comments;
all you do is complain and bash other people's ideas.

~~~
dozzie
> If you don't like it don't use it.

For our web application, we're already in the middle of a transition from
Celery to a proper RPC system, so your comment (aimed to, I don't know, make
me ashamed? make me shut up and go do some real work?) somewhat missed the
mark.

Though I really want to know if there are any sensible uses for a task queue,
specifically with web applications in mind, as most people seem to use task
queues for that.

> [...] all you do is complain and bash other peoples ideas.

Only the stupid or incomplete ones, like somebody boasting about their system
without describing what it does or how to install and use it.

------
tobych
Typo: lightweight in the README should not be hyphenated :-)

------
crucialsnippet
Pretty clean code. Nice job.

------
azundo
Getting a Permission Denied page for the docs at
[http://kq.readthedocs.io/en/master/](http://kq.readthedocs.io/en/master/)

~~~
joohwan
Fixed!

