
Pulsar vs. Kafka - ceohockey60
https://streamnative.io/blog/tech/pulsar-vs-kafka-part-1
======
sciurus
For more details on pulsar and comparisons to similar systems see this series
of blog posts from a rabbitmq developer.

[https://jack-vanlightly.com/blog/2018/10/2/understanding-how...](https://jack-vanlightly.com/blog/2018/10/2/understanding-how-apache-pulsar-works)

[https://jack-vanlightly.com/blog/2018/10/21/how-to-not-lose-...](https://jack-vanlightly.com/blog/2018/10/21/how-to-not-lose-messages-on-an-apache-pulsar-cluster)

[https://jack-vanlightly.com/blog/2018/10/25/testing-producer...](https://jack-vanlightly.com/blog/2018/10/25/testing-producer-deduplication-in-apache-kafka-and-apache-pulsar)

[https://jack-vanlightly.com/blog/2019/9/4/a-look-at-multi-to...](https://jack-vanlightly.com/blog/2019/9/4/a-look-at-multi-topic-subscriptions-with-apache-pulsar)

------
doctor_eval
> Lower end-to-end latency helps enterprises gain business insights faster.

They lost me here. I can think of plenty of situations where reduced latency
is beneficial, but not many situations where shaving a few milliseconds would
make a difference to “business insight”!

Although I suppose it is strictly correct, in the tautological sense...

~~~
willvarfar
So I've developed an in-house fancy thoroughbred 'real time' data warehouse, a
very rare beast indeed, and it's awesome.

Of course, our business is still running on nightly reports. But the tech is
cool!

So I wanna say you're wrong, but I've got man-years invested in a system that
hasn't been utilized to its full potential yet :(

Time will tell. Somewhere, some competitor will be using real-time insights to
out-compete us.

~~~
nikhilsimha
Nightly to seconds is definitely great. Seconds to milliseconds is what is
questionable I think.

~~~
throwaway_pdp09
> Nightly to seconds is definitely great

Why? Can a business mobilise in anything less than days? If a report is
minutes out of date, is that any loss? Given that some largish proportion of
reports are never used, perhaps better management is key.

Not disagreeing, but efficiency is not just a matter of quickness.

~~~
atomicity
Nightly to seconds = developers can see the data generated within seconds.

This generally has implications on data quality, because you aren't fixing
data quality issues once they occur in prod.

This also makes it far easier to develop ETL pipelines. You don't need complex
tooling to see whether your ETL pipeline works.

You can technically fix data quality and dev velocity issues without low data
freshness, but a quick glance at the data engineering landscape tells you they
aren't being solved enough.

------
polote
All those articles of kafka vs pulsar are always biased (this one is from a
company selling pulsar). There are so many of them that I can't get an opinion
on which one is good for what.

~~~
akerro
Pulsar is more flexible and fault-tolerant. For me the most important thing is
client can request log queue starting from specific log by id, it has better
retrying mechanisms for logs that failed to be processed. But it has absurdly
bad documentation. I had to learn many things about Pulsar by downloading src
of their java library and just reading the code. Documentation on starting
bookie, zookie, pulsar, pulsar-proxy cluster was non-existent, and making it
work like in their architectural diagram was a week of work and experiments,
compared to literally 2 hours spent in Kafka.

BookKeeper has/had even worse documentation and setup. When I used it, a
bookie couldn't use a mounted directory for log storage, so it's not really as
persistent as they say. Before restarting a bookie node, it has/had to be
reformatted to allow the bookie instance to re-use logs saved on disk. The
whole "logs are persisted" claim is true, but they don't say that you can't
simply mount them in Docker and restart your PC.

Pulsar is good when you get it working. Documentation is really bad and it's
hard to make it work. All the super-positive articles about pulsar are
sponsored (streamnative and yahoo) and biased to make pulsar look much, much,
MUCH SIMPLER than it actually is.

~~~
z9e
> For me the most important thing is client can request log queue starting
> from specific log by id

Kafka clients can start at a particular offset ID within a partition.
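
For illustration, here is a toy sketch of what "start at a particular offset" means, assuming an in-memory list stands in for one partition. `PartitionLog` and `read_from` are made-up names for this example, not Kafka APIs; in the Java client the corresponding call is `KafkaConsumer.seek`.

```python
class PartitionLog:
    """Toy append-only log standing in for a single partition (hypothetical)."""

    def __init__(self):
        self._records = []

    def append(self, value):
        """Append a record and return the offset it was assigned."""
        self._records.append(value)
        return len(self._records) - 1

    def read_from(self, offset):
        """Yield (offset, value) pairs starting at the requested offset."""
        if offset < 0:
            raise ValueError("offset must be non-negative")
        for i in range(offset, len(self._records)):
            yield i, self._records[i]


log = PartitionLog()
for event in ["a", "b", "c", "d"]:
    log.append(event)

# Start consuming at offset 2, skipping the first two records.
replayed = list(log.read_from(2))
print(replayed)
```

The point is only that an offset is a position in an ordered log, so a reader can resume or replay from any position that is still retained.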

------
vp8989
AWS should fork Pulsar and put out a v2 streaming product. Kinesis is kind of
crappy (IMO) and doesn't seem to be improving much. If you look at the Pulsar
architecture and feature set you can tell that it was designed very much with
this in mind (something that large scale cloud providers can integrate with
their infinitely scalable storage and compute systems).

It's not all hype either, according to this post
[https://jack-vanlightly.com/blog/2018/10/21/how-to-not-lose-...](https://jack-vanlightly.com/blog/2018/10/21/how-to-not-lose-messages-on-an-apache-pulsar-cluster)
it seems like a solid piece of tech.

~~~
oweiler
Instead of Kinesis we use AWS MSK (Managed Kafka), which is expensive but
works quite well.

~~~
isugimpy
Calling MSK expensive is maybe not being clear enough. It's more than 2x the
cost of the raw EC2 instances. Then on top of that, to get metrics at the per
broker level there's an upcharge, and a further one for per topic metrics.
These are things that are actually extremely important at scale, and it's
absurd how expensive it gets.

Then you combine that with how immature the UX is, and it really just doesn't
feel good to deal with. It's not that much work to run a Kafka cluster with
the sort of design that MSK provides, and it can be done better than that
without much cost.

------
trengrj
With Pulsar vs Kafka, I don't see a huge argument between either one
functionality wise as they have so much in common (distributed log, Java
based, avoid copying memory, use Zookeeper). Because Kafka is more supported
and well-known it seems Pulsar needs to be an order of magnitude more
performant to capture developer mindshare.

I see the same with Spark vs Flink in that similarities outweigh differences.
I wonder if this is some sort of emergent pattern in open source software.

~~~
majidazimi
There are real differences among them. Here are some painful aspects of Kafka:

1\. A single partition is stored on one node (replicas on other nodes). With
this, introducing new nodes takes a very long time to replicate large
partitions, because it can replicate one partition from only one node (the
leader of the partition). In Pulsar each segment of a partition is stored on a
different BookKeeper node.

2\. Because of 1, if two consumers read different parts of a partition that
are far from each other, they will compete over disk bandwidth. In Kafka a
consumer cannot read from a replica node. If a topic is really popular and
many consumers try to read from it (from different parts of the file, which
makes the OS page cache useless), the total consumption rate is limited to the
disk bandwidth of a single node. But in Pulsar each consumer can read from
different brokers. Catch-up consumers won't thrash streaming consumers in
Pulsar.

These are not problems that can be fixed easily. Additionally, in the realm of
streaming the difference between Flink and Spark is day and night. The low
watermark feature that Flink offers makes them behave fundamentally differently.
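
To make point 1 above concrete, here is a rough sketch of the two placement styles. It assumes a simple round-robin choice of storage nodes per segment, which is an illustration only and not BookKeeper's actual placement policy; the function names and node names are hypothetical.

```python
from collections import Counter


def colocated(num_segments, replicas):
    """Partition-per-node style: every segment lives on the same replica set."""
    return {seg: list(replicas) for seg in range(num_segments)}


def striped(num_segments, nodes, copies):
    """Segment-per-node style: each segment gets its own nodes (round-robin assumption)."""
    return {
        seg: [nodes[(seg + k) % len(nodes)] for k in range(copies)]
        for seg in range(num_segments)
    }


def segments_per_node(placement):
    """Count how many segments each node stores."""
    counts = Counter()
    for holders in placement.values():
        counts.update(holders)
    return dict(counts)


nodes = ["n1", "n2", "n3", "n4", "n5", "n6"]
kafka_like = segments_per_node(colocated(12, nodes[:3]))
pulsar_like = segments_per_node(striped(12, nodes, copies=3))
print(kafka_like)
print(pulsar_like)
```

With 12 segments and 3 copies, the colocated layout puts all 12 segments on each of three nodes, while the striped layout spreads them so each of the six nodes holds 6. Reads of distant parts of the partition therefore land on different disks in the striped case.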

~~~
toomanybits
1\. is true, but if you want that data to move to a new node, it still needs
to be replicated. Kafka's approach is to use tiered storage (which I believe
is close to completion).

2\. Kafka can read from a replica node. It's relatively new but it's there.

~~~
majidazimi
That's true, but the limitation is still not fully resolved. In order to
increase the consumption rate, we need to add replicas. In Pulsar, brokers are
merely cache nodes over BookKeeper. Adding more brokers is trivial in Pulsar.

~~~
kevstev
How does Pulsar get around the fact that, when adding a new broker, data needs
to be moved over before that broker can start serving data? This seems like a
basic law of physics type limitation to me.

~~~
majidazimi
Network is faster than disk. Once cached, then you are only bound by network
IO for subsequent uses.

~~~
kevstev
Sure- but how is this different than kafka's caching?

------
toomanybits
This smacks of being heavily one-product-focussed to me. Being a Kafka user
it's hard enough managing and understanding one system, never mind three or
four joined together.

Maybe it's a bit faster or a bit more elastic, or whatever, who knows. What I
really care about is whether I get called at 3am and in that regard the
argument seems pretty weak. Kafka for all its woes is a solid system you know
you can count on.

I'd much rather see someone come up with a truly innovative alternative that
actually pushes the boundaries, rather than just copying what's there already,
and adding a few window dressings.

~~~
Jedd
What kind of (lower level) surrogate metrics would you be interested in that
could translate to '3am phone calls' when comparing messaging systems?

~~~
toomanybits
Being used by at least one company of significant size that (a) I've heard of
and (b) isn't directly connected to the project would be a good start.

~~~
Jedd
'Directly connected' to a project might mean a user of - but I assume you mean
a major contributor to (as even small time users of free software often
contribute something - bug reports, feature requests, code contributions,
money, etc).

The page:
[https://pulsar.apache.org/powered-by/](https://pulsar.apache.org/powered-by/)
suggests there's quite some number of corporate users who are happy to confirm
they use this suite. I don't know how many of those _you've_ heard of, though.

I suspect many private & government agencies around the world would decline to
formally attach their name to any list like this, lest it be (mis)interpreted
as an endorsement.

~~~
toomanybits
I've heard of Comcast, but that's Yahoo. Not heard of the others.

------
jarym
I’ve enjoyed using Pulsar but ZooKeeper... arghh. It’s an excellent component
but a pain to manage.

Looking forward to trying Kafka again when they finally remove ZK

~~~
jpgvm
What difficulties have you had with Zookeeper and Kafka? Zookeeper can be
difficult mostly because developers don't understand it very well. But in the
case of Pulsar/BookKeeper/Kafka the usage of Zookeeper is very minimal so its
main constraint (performance) is mostly mitigated. Availability and
management wise Zookeeper 3.5+ is actually pretty great. You do need to
understand dynamic ensemble management but really it's a small price to pay
for its rock solid nature. Stuff like etcd is getting close these days but it
took 3 protocol versions and tons of bugs, performance and scalability
problems for it to get close to ZK.

------
dikei
I really like the architecture of Pulsar, it is very elegant with clear
separation of duties between components.

That said, we have used Kafka for so long, and it works well enough at our
scale that we have no reason to even test Pulsar. Kafka also has much more
integrations with other tools due to its popularity.

------
polskibus
Can anyone share their thoughts on whether, for a new project, it is worth
starting with Pulsar instead of Kafka as a distributed log/pub sub solution
with guaranteed delivery? I heard a lot of stories about Kafka's operational
complexity and TFA seems to be pointing out that Pulsar has a lower
operational upkeep (i.e. less manpower needed to keep it running).

~~~
oweiler
Because Pulsar and Kafka both use Zookeeper, the operational complexity will
largely be the same. The best thing you can do to lower op complexity is using
managed Kafka by Confluent or AWS (something similar probably exists for
pulsar).

~~~
jansenmac
Kafka is replacing Zookeeper:
[https://www.confluent.io/blog/removing-zookeeper-dependency-...](https://www.confluent.io/blog/removing-zookeeper-dependency-in-kafka/)

------
SmooL
Is there any sort of 'single node' version of these frameworks? I'm very
interested in building event-driven solutions, but I don't need the scale
offered by kafka/pulsar, and I really don't want all the complexity. Is there
any reason nobody has made a smaller, less distributed event-centric DB?

~~~
Tsarbomb
You can run kafka on a single node no problem.

Give this a try:
[https://www.digitalocean.com/community/tutorials/how-to-inst...](https://www.digitalocean.com/community/tutorials/how-to-install-apache-kafka-on-ubuntu-18-04)

~~~
oweiler
Single node Kafka still needs Zookeeper, though.

------
doonesbury
Can pulsar users here talk about exactly once message delivery from the
consumer side? I use Kafka and needed a cache based on incoming events. But if
the consumer crashes it's not easy to pick up from exactly where it left off
without manually committing offsets which hurts performance. There's also some
hand waving Kafka gossip that it's hard to commit offsets right. Any insight
here is greatly appreciated.
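
Not a Pulsar-specific answer, but one commonly described way to get an exactly-once effect on the consumer side is to store the next offset together with the derived state, so a restart resumes from whatever the state already reflects. The sketch below is a toy model under a big assumption: that the cache and the offset can be persisted in one atomic write. All names are hypothetical.

```python
def consume(log, state, start, crash_after=None):
    """Apply records from `start`; optionally stop early to mimic a crash."""
    for offset in range(start, len(log)):
        if crash_after is not None and offset > crash_after:
            return
        key = log[offset]
        state["counts"][key] = state["counts"].get(key, 0) + 1
        # The offset is kept in the same structure as the cache, so in this
        # toy model the two cannot drift apart.
        state["next_offset"] = offset + 1


log = ["x", "y", "x", "z"]
state = {"counts": {}, "next_offset": 0}

# First run handles offsets 0 and 1, then "crashes".
consume(log, state, state["next_offset"], crash_after=1)
# Restart resumes from the offset stored alongside the cache.
consume(log, state, state["next_offset"])
print(state)
```

Each record is counted once despite the simulated crash. In a real system the hard part is making the state update and the offset update atomic, which this in-memory dictionary glosses over.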

------
x87678r
Another service that "needs" zookeeper. Has anyone figured out a simple way to
manage zookeeper for a small team?

~~~
z9e
What problems have you had with it?

~~~
x87678r
It's just more complicated than it needs to be, especially as I have a hobby
project with tiny usage.

------
SergeAx
Mentioning "Fortune 100 companies" in the first paragraph is a red flag of
enterprise BS for me. Not sold.

------
abledon
I still can't believe Kafka doesn't have a good open source GUI management tool

~~~
z9e
This is great:
[https://github.com/tchiotludo/akhq](https://github.com/tchiotludo/akhq),
using this currently in our environment. There's also Kafka Manager which has
been around for a while.

I still recommend managing Kafka through the CLI tools however.

------
tcbasche
> two of the most favored messaging systems on the market

Give me a break, I'd literally never even heard of Pulsar until this article
popped up.

Of _all_ messaging systems I would have thought Kafka vs SQS, or even RabbitMQ
at the very least

~~~
bsaul
For people currently interested in building an event based architecture,
pulsar is definitely a very well known option.

~~~
ramraj07
I am squarely in the "people interested in building an event based
architecture". Not a CS background, but know tech decently. I typically know
the names of more of these apache projects than most people I've talked to IRL
(though ostensibly I'm not part of the tech elite). Yet pulsar I only came
across on HN a week back. And nothing changes due to the knowledge. It's yet
another apache product which has great design but needs either a genius or an
army of devops to deploy. Like even if I figure out what the hell this thing
is, I'm then tasked with figuring out what the hell zookeeper is (like for
real, I'll buy a beer for someone who can successfully ELI5 wth zookeeper is.
And also pig. Or impala).

~~~
sciurus
If you can't understand what Zookeeper is, I'd recommend reading Martin
Kleppmann's book Designing Data-Intensive Applications
([https://dataintensive.net/](https://dataintensive.net/)).

You don't need a CS degree to work in this field (I don't have one either!)
but there are fundamental concepts you need to understand in order to make
informed decisions when designing distributed systems.

~~~
toomanybits
You don't need a CS degree or Martin Kleppmann's book to work out it's a
GPITA.

~~~
papaf
I am also not a fan of Zookeeper but I have come to respect it. In defense of
Zookeeper, distributed systems are a GPITA. If you need to select some
components why not go for ones that are solid [1].

[1] [https://aphyr.com/posts/291-jepsen-zookeeper](https://aphyr.com/posts/291-jepsen-zookeeper)

------
RocketSyntax
How does it compare to [https://docs.prefect.io/](https://docs.prefect.io/) ?

~~~
riyadparvez
Prefect is workflow (particularly dataflow) orchestrator. Pulsar and Kafka are
general purpose distributed streaming engines.

~~~
RocketSyntax
ah, yeah. totally forgot. sorry. i deal a lot with workflows and imagery mixes
them up for me.

------
RocketSyntax
There are so many pubsubs. It seems like every company and framework has their
own.

~~~
zackkitzmiller
Kafka is the furthest thing from a library that I can even remotely think of.

