
Open-Sourcing Yahoo's Pulsar, Pub-Sub Messaging at Scale - yarapavan
https://yahooeng.tumblr.com/post/150078336821/open-sourcing-pulsar-pub-sub-messaging-at-scale
======
yarapavan
GitHub page:
[https://github.com/yahoo/pulsar](https://github.com/yahoo/pulsar)

Pulsar backs major Yahoo applications like Mail, Finance, Sports, Gemini Ads,
and Sherpa, Yahoo’s distributed key-value service.

On the scale front:

\- Deployed globally, in 10+ data-centers, with full mesh replication
capability

\- Greater than 100 billion messages/day published

\- More than 1.4 million topics

\- Average publish latency across the service of less than 5 ms

------
perryh2
I worked at Yahoo but have never heard of "Pulsar" before. Was this known as
"CMS" internally?

~~~
witchking
Yes.

------
jaytaylor
I wonder how this compares to Kafka and what tradeoffs were made.

~~~
nehanarkhede
I'm one of the Kafka authors, so admittedly my view might be slightly biased.

Here is a quick comparison of Kafka and Pulsar:

\- Kafka is a complete streaming platform, whereas Pulsar is a messaging
system. Through Kafka Connect
([http://www.confluent.io/blog/announcing-kafka-connect-buildi...](http://www.confluent.io/blog/announcing-kafka-connect-building-large-scale-low-latency-data-pipelines/)),
it supports connectors that stream data between various sources and systems.
Through Kafka Streams
([http://www.confluent.io/blog/introducing-kafka-streams-strea...](http://www.confluent.io/blog/introducing-kafka-streams-stream-processing-made-simple/)),
it supports stream processing and transformations over Kafka topics.

\- Broad adoption base: Kafka is very widely adopted across thousands of
companies worldwide.
[https://cwiki.apache.org/confluence/display/KAFKA/Powered+By](https://cwiki.apache.org/confluence/display/KAFKA/Powered+By)

\- Tunable durability and consistency knobs on the producer: The Kafka
producer API allows the application to either wait until a message is fully
committed across all replicas or just the leader. This allows applications to
make the right tradeoffs for throughput vs durability. One size does not fit
all.

\- Performance and efficiency: Kafka supports zero-copy consumption, allowing
consumers to read large amounts of data at high throughput. To the extent
that I understand, Pulsar with its ledger-broker model does not support
zero-copy consumption.

\- A lot of the reasons quoted for creating Pulsar are features that exist in
Kafka and are used in production:

\-- Kafka has multi-tenancy support through user-defined quotas (See this
[http://www.confluent.io/blog/sharing-is-caring-multi-tenancy...](http://www.confluent.io/blog/sharing-is-caring-multi-tenancy-in-distributed-data-systems))

\-- Kafka has support for authentication, authorization, user-defined ACLs
(See this
[http://www.confluent.io/blog/apache-kafka-security-authoriza...](http://www.confluent.io/blog/apache-kafka-security-authorization-authentication-encryption/))

\-- Kafka has support for geo replication. In fact, that is the most common
use case for Kafka in several companies. (See this
[https://engineering.linkedin.com/kafka/running-kafka-scale](https://engineering.linkedin.com/kafka/running-kafka-scale))

\-- Latency: The end-to-end latency from publish to consume can be very low in
Kafka (<10ms).

\- Support for millions of topics: To the extent that I understand, both
Pulsar and Kafka use ZooKeeper for metadata management. That is the main
bottleneck for supporting a large number of topics and likely the same
tradeoffs apply to both Kafka and Pulsar as a result.

\- Storage model: The length of a partition in BookKeeper and hence in Pulsar
is not bounded by the capacity of a server. So you have the ability to add
servers to accommodate a workload spike.

This is merely a quick overview. There might be more aspects of this
comparison that I'm missing.
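
To make the producer durability knob mentioned above a bit more concrete, here is a minimal sketch of how an application might pick a setting. `acks` and `bootstrap.servers` are real Kafka producer configuration keys; the profile names, the `producer_config` helper, and the `localhost:9092` address are hypothetical and only for illustration. The sketch builds a plain config dictionary and does not talk to a broker.

```python
# Hypothetical mapping from a durability preference to the Kafka producer
# "acks" setting. The profile names are made up for this example.
ACKS_BY_PROFILE = {
    "fire_and_forget": "0",  # do not wait for any broker acknowledgement
    "leader_only": "1",      # wait until the partition leader has the message
    "all_replicas": "all",   # wait until all in-sync replicas have the message
}


def producer_config(profile, bootstrap_servers="localhost:9092"):
    """Return a producer config dict for the given durability profile."""
    if profile not in ACKS_BY_PROFILE:
        raise ValueError(f"unknown durability profile: {profile!r}")
    return {
        "bootstrap.servers": bootstrap_servers,  # assumed broker address
        "acks": ACKS_BY_PROFILE[profile],
    }


if __name__ == "__main__":
    for name in ACKS_BY_PROFILE:
        print(name, producer_config(name))
```

A throughput-sensitive pipeline would likely lean toward `leader_only`, while one that cannot afford to lose acknowledged messages would likely choose `all_replicas` and accept the extra latency.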

~~~
babo
How do you compare the geo-replication capabilities of Kafka vs. Pulsar?

~~~
mmerli
Added some info on geo-replication in Pulsar at
[https://github.com/yahoo/pulsar/blob/master/docs/GeoReplicat...](https://github.com/yahoo/pulsar/blob/master/docs/GeoReplication.md)

------
johnlon
I'd like to see a blow-by-blow comparison of Twitter DistributedLog vs Kafka
vs Pulsar, particularly focusing on what each solution means by
"geo-replication", what geo consistency guarantees are made, how operable the
geo-rep features are in practice, etc. At first glance, those interested in not
losing data and interested in geo-replication and consistency are better off
going with DistLog rather than either of the other two solutions.

However, if all you care about is a single DC then Kafka is simpler, as long
as you are happy for consumers to track offsets.

But if you want broker-side delivery tracking then perhaps go with Yahoo's
Pulsar, or Kafka with lazy commits of offsets to ZooKeeper.
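
The difference between consumer-tracked offsets and broker-side delivery tracking can be sketched with a toy in-memory model. This is not the API of Kafka, Pulsar, or DistributedLog; every class and method name below is hypothetical, and the model ignores persistence, replication, and failures.

```python
class Log:
    """Append-only topic log; entries are addressed by integer offset."""

    def __init__(self):
        self.entries = []

    def append(self, message):
        self.entries.append(message)
        return len(self.entries) - 1  # offset of the new entry

    def read_from(self, offset):
        return self.entries[offset:]


class ConsumerTrackedReader:
    """The consumer remembers its own position in the log."""

    def __init__(self, log):
        self.log = log
        self.next_offset = 0

    def poll(self):
        messages = self.log.read_from(self.next_offset)
        self.next_offset += len(messages)
        return messages


class BrokerTrackedSubscriptions:
    """The broker remembers, per subscription, what has been acknowledged."""

    def __init__(self, log):
        self.log = log
        self.next_unacked = {}  # subscription name -> next offset to deliver

    def fetch(self, subscription):
        start = self.next_unacked.get(subscription, 0)
        return start, self.log.read_from(start)

    def ack_up_to(self, subscription, offset):
        current = self.next_unacked.get(subscription, 0)
        self.next_unacked[subscription] = max(current, offset + 1)


if __name__ == "__main__":
    log = Log()
    for message in ["a", "b", "c"]:
        log.append(message)

    reader = ConsumerTrackedReader(log)
    print(reader.poll())

    broker = BrokerTrackedSubscriptions(log)
    broker.ack_up_to("billing", 1)
    print(broker.fetch("billing"))
```

In the first style a restarted consumer has to restore `next_offset` from somewhere itself; in the second, any consumer that attaches to the same subscription name resumes from the last acknowledged message.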

------
withinrafael
Interested to find/read any comparisons to TIBCO EMS, software that is being
rolled out for large-scale use in government circles.

------
NikolaeVarius
Before the obligatory "What is so different about this from Kafka?"

Edit: Got the wrong product.

From the looks of it, this just seems to be a slightly different take on Kafka.
From what I gather, it looks like Pulsar allows for scaling of
producers/brokers independently?

~~~
crudbug
There is another system from eBay named Pulsar [0], for analytics. They
should do some research before selecting a project name. Quick idea: Quesar?

[0] [http://gopulsar.io/index.html](http://gopulsar.io/index.html)

~~~
nine_k
Quasar probably? Quesar is cheesy :)

