
Heroku Kafka - sixwing
https://www.heroku.com/kafka
======
mbseid
As a former user of Kafka, this is awesome and it would have been a huge help
for our company if this had been available then. I'm glad to hear that a
company is offering Kafka as opposed to proprietary alternatives (AWS Kinesis,
etc.).

One thing is odd though: there is no mention of disk space at all, only a
configuration of retention time. One of Kafka's best features is its use of
disk to store large amounts of messages, so you are not RAM-bound. Heroku seems
to only allow you to set retention times? This could be awesome if they are
giving you "unlimited" disk space, but could also be a beta oversight.
Interested to see how this progresses.

~~~
uhoh-itsmaciek
Hi, I'm Maciek and I work on the Heroku Kafka team. You don't have to think
about disk space--it's on us to make sure there's enough to satisfy the
retention settings you configure. We're excited to provide another great open-
source project as a managed service!

~~~
mbseid
Thanks for the update. That is awesome. Excited to see what people do with it.

------
jonahx
> _What is Kafka?_

> Apache Kafka is a distributed commit log for fast, fault-tolerant
> communication between producers and consumers using message based topics.
> Kafka provides the messaging backbone for building a new generation of
> distributed applications capable of handling billions of events and millions
> of transactions

Can anyone translate this into meaningful English for me?

~~~
simonw
You should read this: [https://engineering.linkedin.com/distributed-
systems/log-wha...](https://engineering.linkedin.com/distributed-systems/log-
what-every-software-engineer-should-know-about-real-time-datas-unifying) \-
it's long, but it's one of the most impactful essays I've read on software
engineering in years.

~~~
Jgrubb
I'm not able to find the one that made the light bulb go on in my head, but
Martin Kleppmann gives some good conf talks around this topic. This one looks
promising -
[https://www.youtube.com/watch?v=GfJZ7duV_MM](https://www.youtube.com/watch?v=GfJZ7duV_MM)

~~~
rakoo
I liked this one [[http://www.confluent.io/blog/turning-the-database-inside-
out...](http://www.confluent.io/blog/turning-the-database-inside-out-with-
apache-samza/)] a lot, because it highlights the strength of Kafka beyond a
simple distributed message queue.

------
franciscop
I love Heroku and everything they are doing; it's doubtless a push forward for
the web as a whole. However, the pricing for hobby sites (including SSL) is
crazy from a personal point of view, so I'm slowly moving my projects off it
[1][2]. I wish they had some kind of "Hobby Bundle".

[1] [http://umbrellajs.com/](http://umbrellajs.com/) [2]
[http://picnicss.com/](http://picnicss.com/)

~~~
sudhirj
Their pricing for hobby sites is $7 + $10 for the DB, which is very comparable
to a self-managed IaaS setup like DO or AWS. Personally I think the developer
experience is much better on Heroku and quite worth it.

SSL is a pain point, though I do empathize with them - I think they're doing
something expensive for that. What I do is use AWS CloudFront and ACM for a
free cert and site speedup - if they are personal projects the CF bill ought
to be only a few dollars anyway.

~~~
flurdy
The $7 is comparable to one app per DO or AWS micro/nano server. So Heroku
wins on convenience.

If you have, say, 10 apps then Heroku costs 10*$7, but you might still only
have used 1-3 servers depending on the memory use of the apps etc., so then
Heroku loses on cost.

Naturally I've ended up with a total mix: quite a few on Heroku's classic or
new free plan, some on their hobby plan, some on AWS, some on Docker Cloud,
most proxied behind an SSL certificate running on AWS.
([https://flurdy.com/docs/letsencrypt/nginx.html](https://flurdy.com/docs/letsencrypt/nginx.html))

------
cachemiss
Kudos to Heroku. As someone who has had to make Kafka into a managed service,
I know what a pain it is (I'm not a Kafka fan for a lot of reasons) to
administer in a cloud environment.

~~~
sethammons
Would love to hear what you don't care for in Kafka and what alternative
solution(s) you prefer.

~~~
cachemiss
To clarify, my feelings towards Kafka are from the POV of someone who has had
to build a managed service on top of it, which is not the common use case
(with which many people seem to be happy). Other people may have more positive
experiences.

In my experience, Kafka is a solid system when you work in its wheelhouse,
which is a relatively static set of servers / topics that you add to slowly
and deliberately. If you can't use something like Kinesis, then it's a good
choice.

In Kafka, programmatic administration is generally an afterthought. They have
APIs for doing things, but they generally involve directly modifying znodes.
Simple things don't work or have bugs: deleting topics didn't work at all
until 0.8.2, and even now has bugs. We've seen cases where if you delete a
topic while an ISR is shrinking or expanding, your cluster can get into an
unrecoverable state where you have to reboot everything, and even then it
doesn't always get fixed. Most of the time you are expected to use scripts to
modify everything (there's a wide variety of systems out there that try to
build management on top of Kafka).

Its dependency on Zookeeper is a pain, and limits scalability of topic /
partition counts. Rebalancing topics will reset retention periods because they
use the last-modified timestamp of the segment files to check for age, meaning
if you rebalance often, you need extra disk space lying around. ZK has some
bugs with its DNS handling, which affects Kafka if you try to use DNS.
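The retention reset described above can be illustrated with a small model. This is only a sketch of the behaviour as the comment describes it, not Kafka's actual code: the `Segment` class, the 7-day window, and the `move_to_new_broker` helper are all hypothetical names made up for the illustration.

```python
from dataclasses import dataclass

DAY = 24 * 3600
RETENTION_SECONDS = 7 * DAY  # hypothetical 7-day retention setting


@dataclass
class Segment:
    name: str
    written_at: float     # when the messages were originally produced
    last_modified: float  # file mtime, which is what the age check looks at


def is_expired(segment: Segment, now: float) -> bool:
    """Expire a segment once its mtime falls outside the retention window."""
    return now - segment.last_modified > RETENTION_SECONDS


def move_to_new_broker(segment: Segment, now: float) -> Segment:
    """Copying the file to another broker rewrites it, so its mtime becomes 'now'."""
    return Segment(segment.name, segment.written_at, last_modified=now)


now = 10 * DAY
old = Segment("00000000.log", written_at=0.0, last_modified=0.0)
moved = move_to_new_broker(old, now=9 * DAY)  # rebalanced one day ago

print(is_expired(old, now))    # True: 10 days old, past retention
print(is_expired(moved, now))  # False: same data, but the clock restarted
```

The data in `moved` is just as old as in `old`, yet it survives another retention period, which is where the extra disk space goes.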

It has throttling, but it's by client ID. What you'd like in some cases is to
say that a node has X throughput, and have the broker be able to somewhat
guarantee that throughput and create backpressure when clients are
overwhelming it. Otherwise your latency can go through the roof. You also want
replication to play nice with client requests, and it doesn't (if you add a
new broker and move a bunch of partitions to it, you'll light up all your
other brokers while it replicates, and cause timeouts).

Its replication story can cause issues when network partitions come into play.

It's highly configurable like many Apache projects, which is a blessing and a
curse, as your team has to know all the knobs, both consumer / producer /
broker side.

The alternative if you are at a company with the resources to do so (mine is),
is to build something that fits your use case better than Kafka, or to use a
hosted service like this, or Kinesis.

~~~
emfree
Thanks for the insightful comment!

> The alternative if you are at a company with the resources to do so (mine
> is), is to build something that fits your use case better than Kafka

I'd love to hear more about this :) What did you end up doing differently from
Kafka? How's it working out for you?

------
ChartsNGraffs
For anyone wanting to play with Kafka, Spotify's Kafka container was an
invaluable resource for getting me up and running with Kafka. All the
Zookeeper dependencies are taken care of allowing you to just start playing
with Kafka right away. [https://github.com/spotify/docker-
kafka](https://github.com/spotify/docker-kafka)
[https://hub.docker.com/r/spotify/kafka/](https://hub.docker.com/r/spotify/kafka/)

~~~
Jarmo
I never tried spotify's container. Tried wurstmeister's, and was able to run
it on a single server for testing purposes, but kept running into issues while
clustering on different servers. Decided to use Ambari and have it do all the
work for me instead.

------
manigandham
This will be interesting to try out. I've used all the major cloud
event/logging systems (Kinesis, Azure EventHubs, etc) and so far Google PubSub
is the best in features and performance.

The only downside with Google PubSub can be latency (which I'm working on
fixing by building a gRPC driver), but Kafka has proven to be too complicated
to maintain in-house. If Heroku can provide the speed without the ops
overhead, it'll be some good competition to Google's option.

Also want to note that Jay Kreps who helped build Kafka at LinkedIn is now
behind [http://www.confluent.io/](http://www.confluent.io/) which is like a
better/enterprise version of Kafka.

~~~
alexatkeplar
Not sure why you are comparing Google Cloud Pub/Sub to Kinesis - the former is
a MQ system, not a distributed commit log.

When creating a Kinesis consumer, I can specify whether I want to start
reading a stream from a) TRIM_HORIZON (the earliest events in the stream that
haven't yet expired, aka been "trimmed"), b) LATEST, which is the Cloud
Pub/Sub capability, c) AT_SEQUENCE_NUMBER {x}, which means from the event in
the stream with the given offset ID, d) AFTER_SEQUENCE_NUMBER {x}, which is
the event immediately after c), or e) AT_TIMESTAMP, to read records from an
arbitrary point in time.
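To make the five starting positions concrete, here is a rough in-memory model. It does not call the Kinesis API; `Record`, `start_index`, and the sample stream are hypothetical, and the only names taken from Kinesis are the iterator type strings listed above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Record:
    seq: int          # sequence number within the stream
    timestamp: float  # arrival time
    data: str


def start_index(records: List[Record], iterator_type: str,
                seq: Optional[int] = None,
                timestamp: Optional[float] = None) -> int:
    """Return the index of the first record a new consumer would read."""
    if iterator_type == "TRIM_HORIZON":
        return 0                      # oldest record still retained
    if iterator_type == "LATEST":
        return len(records)           # only records published from now on
    if iterator_type in ("AT_SEQUENCE_NUMBER", "AFTER_SEQUENCE_NUMBER"):
        position = [r.seq for r in records].index(seq)  # ValueError if unknown
        return position if iterator_type == "AT_SEQUENCE_NUMBER" else position + 1
    if iterator_type == "AT_TIMESTAMP":
        for i, record in enumerate(records):
            if record.timestamp >= timestamp:
                return i
        return len(records)
    raise ValueError(f"unknown iterator type: {iterator_type}")


# Five retained records with sequence numbers 100..104 and timestamps 0, 10, ..., 40.
stream = [Record(seq=100 + i, timestamp=10.0 * i, data=f"event-{i}") for i in range(5)]

print(start_index(stream, "TRIM_HORIZON"))                      # 0
print(start_index(stream, "LATEST"))                            # 5
print(start_index(stream, "AT_SEQUENCE_NUMBER", seq=102))       # 2
print(start_index(stream, "AFTER_SEQUENCE_NUMBER", seq=102))    # 3
print(start_index(stream, "AT_TIMESTAMP", timestamp=25.0))      # 3
```

Only `LATEST` works without the stream keeping history, which is the distinction being drawn with a consumer-tied queue.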

A Kinesis stream (like a Kafka topic) is a very special form of database - it
exists independently of any consumers. By contrast, with Google Cloud Pub/Sub
[1]:

> When you create a subscription, the system establishes a sync point. That
> is, your subscriber is guaranteed to receive any message published after
> this point.

[1]
[https://cloud.google.com/pubsub/subscriber](https://cloud.google.com/pubsub/subscriber)

So the stream is not a first class entity in Cloud Pub/Sub - it's just a
consumer-tied message queue.

~~~
nivertech
Is there something like Kinesis' AT_TIMESTAMP in Kafka?

I think the only way to replay events in Google Cloud Pub/Sub is to create
multiple subscriptions in advance, right after topic creation. But then I
think you need to pay for the storage and event traversal requests.

------
andreasklinger
For those wondering (all imo and only best guess)

The biggest advantage of Kafka is that all of the Heroku marketplace all of a
sudden becomes "plug and play".

Essentially it's the "backend data" equivalent of what Segment does for
"frontend data".

Example: What's the benefit of having a graphDB service in the marketplace if
most people don't want to / can't invest engineering in keeping the data in
(realtime) sync?

With Kafka they can establish standards that all partners can adapt to; they
will simply offer piping of all Heroku Postgres/Redis changes.

------
hmottestad
Does anyone know if Kafka has improved on their data loss issues since tested
by Aphyr? [https://aphyr.com/posts/293-jepsen-
kafka](https://aphyr.com/posts/293-jepsen-kafka)

A quote from the article: "At the end of the run, Kafka typically acknowledges
98–100% of writes. However, half of those writes (all those made during the
partition) are lost."

~~~
lars_francke
Yes, the suggestion discussed by Aphyr has been implemented. You can now set
a lower bound on the ISR size (min.insync.replicas). Together with
request.required.acks=-1 (acks=all in the newer Java producer), a write is only
acknowledged once it has been committed to at least min.insync.replicas nodes.

[https://issues.apache.org/jira/browse/KAFKA-1555](https://issues.apache.org/jira/browse/KAFKA-1555)
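As a rough illustration of what the lower bound changes, the following models the acknowledgement decision for a write that waits on the full ISR. It is a simplified sketch, not broker code: `try_commit`, `NotEnoughReplicas`, and the broker names are hypothetical.

```python
from typing import Set


class NotEnoughReplicas(Exception):
    """Raised instead of acknowledging a write the cluster cannot replicate enough."""


def try_commit(in_sync_replicas: Set[str], min_insync_replicas: int) -> int:
    """Model a write that waits for every member of the current ISR.

    Returns how many replicas hold the write once it is acknowledged.
    """
    if len(in_sync_replicas) < min_insync_replicas:
        raise NotEnoughReplicas(
            f"ISR has {len(in_sync_replicas)} member(s), need {min_insync_replicas}"
        )
    return len(in_sync_replicas)


healthy_isr = {"broker-1", "broker-2", "broker-3"}
shrunken_isr = {"broker-1"}  # e.g. the leader is cut off by a network partition

print(try_commit(healthy_isr, min_insync_replicas=2))   # 3 copies, acknowledged
print(try_commit(shrunken_isr, min_insync_replicas=1))  # 1 copy, still acknowledged

try:
    try_commit(shrunken_isr, min_insync_replicas=2)
except NotEnoughReplicas as err:
    print("rejected:", err)  # the producer sees an error instead of a false ack
```

With a minimum of 1, a leader that has lost all its followers keeps acknowledging writes that exist on a single node, which is the situation the Jepsen post describes. Raising the minimum trades availability during the partition for not acknowledging writes that could later be lost.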

------
koolba
I've wondered why there isn't a "big player" in the cloud space for this. Felt
like a hole.

My operating theory is that the people who would really make use of something
like this have grown beyond managed offerings and would take it in house. For
smaller operations Redis is more than enough for pub/sub. Ditto for SQS for
externally triggered eventing.

~~~
bjt
> For smaller operations Redis is more than enough for pub/sub.

I didn't find that to be so at my last job, one of those smaller operations.

With Redis you're forced to pick between two severely constrained options:

1\. Use PUBLISH/SUBSCRIBE. This is nice if you want to have several listeners
all receive the same message. But if a listener is down, there's no way for it
to recover a message that it missed. If there is no one listening, messages
are just dropped.

2\. Use LPUSH/BRPOP. This is nice if you want to have several workers all
pulling from the same queue, but isn't sufficient if you want to have several
queues streaming from the same topic. (E.g. one listener is responsible for
syncing to ElasticSearch and another one is syncing to your analytics DB.)
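The two options can be contrasted with a small in-memory model. This is a sketch of the delivery semantics only and does not talk to Redis; the `PubSub` and `WorkQueue` classes and the inbox names are hypothetical.

```python
from collections import deque
from typing import Callable, Deque, List, Optional


class PubSub:
    """PUBLISH/SUBSCRIBE style: fan out to whoever is listening now, store nothing."""

    def __init__(self) -> None:
        self.listeners: List[Callable[[str], None]] = []

    def subscribe(self, callback: Callable[[str], None]) -> None:
        self.listeners.append(callback)

    def publish(self, message: str) -> int:
        for callback in self.listeners:
            callback(message)
        return len(self.listeners)  # number of listeners that received it


class WorkQueue:
    """LPUSH/BRPOP style: each message is handed to exactly one worker."""

    def __init__(self) -> None:
        self.items: Deque[str] = deque()

    def push(self, message: str) -> None:
        self.items.appendleft(message)  # push at the head, like LPUSH

    def pop(self) -> Optional[str]:
        # Pop from the tail, like BRPOP, but without blocking when empty.
        return self.items.pop() if self.items else None


search_inbox: List[str] = []
analytics_inbox: List[str] = []

bus = PubSub()
dropped = bus.publish("article-1")   # nobody is listening yet, so it is lost
bus.subscribe(search_inbox.append)
bus.subscribe(analytics_inbox.append)
bus.publish("article-2")             # both listeners get a copy

queue = WorkQueue()
queue.push("article-1")
queue.push("article-2")
search_got = queue.pop()             # one worker takes "article-1"
analytics_got = queue.pop()          # the other only ever sees "article-2"

print(dropped, search_inbox, analytics_inbox)
print(search_got, analytics_got)
```

Neither model gives both listeners every message while also surviving a listener being down, which is the gap being described.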

I strongly prefer RabbitMQ. Its model of exchanges and queues supports mixing
and matching these semantics much more flexibly.

~~~
kinkdr
How stable is RabbitMQ? I've been looking into moving away from Redis
pub/sub for a bit now.

~~~
manigandham
RabbitMQ is OK on a single server and has lots of flexibility, but struggles
at high throughput ( > 100k/sec) and the clustering setup is not great. There
are also lots of edge case bugs.

If you don't need persistence, look at using nats.io which is a much more
stable and reliable pub/sub system. You can build persistence on top of it or
wait a few months until they finish their new project STAN.

~~~
kinkdr
Thanks! 100k is far more than I need, but I couldn't find something that would
fit exactly my needs, so I ended up rolling my own.

------
plunchete
Is the pricing public?

~~~
neovintage
Not yet. We're working on it during our early access program. We'll be looking
for lots of feedback from customers.

~~~
plunchete
Thanks! Looking forward to being able to try it :)

------
nodesocket
Can somebody provide a real-life use case for Kafka? I've seen comparisons
with Redis, but what specifically does Kafka solve that Redis cannot?

~~~
yolesaber
Let's say you have a CMS which pushes content to your site. You also want to
make the whole site searchable, so you index your content into (e.g.)
Elasticsearch. Kafka is great for this because you can put the content onto
Kafka's message queue and then have a service reading from it which then puts
it into Elasticsearch. It scales well, too. So let's say your site takes off
and you have hundreds of articles published a day (not to mention updates,
deletions etc.) - these events can all be sent to Kafka and it will maintain
the order as well as still be fast. You can also have many, many services
reading (consuming) from it simultaneously and it will handle it nicely.

Basically, if you want to get data from one place to another and care about
order, Kafka is a good solution. It acts as a middleman between services.

~~~
balamaci
Hm but why would you not send it directly to ElasticSearch?

~~~
sethammons
Kafka shines when you have multiple services that have data to publish and
multiple services that need to read that data stream. If you have three
services and they write to ES, publish metrics to some other store, and log
events to the db, you could instead write that all to Kafka, and individual
consumers can use the data (for instance, to put into ES). On the origin-
service side, it has one integration point; it does not need to know about ES.
Now let's say that your users want a near real-time dashboard of their data
changes on your multiple services. All you do is make a new consumer from
Kafka. You don't add it to your three services. Kafka simplifies your service
relation graph.

~~~
balamaci
Well, I definitely support the example of using Kafka for analytics with a
streaming solution like Flink or Spark, etc. However, I asked the "why not
directly to ES" question because I felt the example of using Kafka just as a
layer in front of ES kinda painted the Kafka layer as something "we do because
we can, not because we need to".

------
jbob2000
The comments in this thread are funny;

Hey, what is Kafka?

"It's a distributed logging system, not a message queue"

Ok, what's the use case?

 _describes a case when it's used as a message queue_

------
tibbon
Kafka vs Redis. I've only used Redis... what should I know?

~~~
manigandham
Redis is an in-memory (with persistence) key-value database that also
implements some basic structures like lists, sets and hashes natively.

Kafka is a distributed logging system that can ingest large amounts of data
straight to disk, then allows for multiple consumers to read this data through
a simple abstraction of topics and partitions. Consumers maintain their own
position of where they last read up to (or re-read things if they want) and
everything is sequential I/O which creates very high throughput.
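A toy version of that abstraction, to show why several consumers can read the same data independently. This is an in-memory sketch, not a Kafka client: `Topic`, `Consumer`, and the CRC32 key hashing are stand-ins chosen for the example (Kafka's own partitioner uses a different hash).

```python
import zlib
from typing import List


class Topic:
    """An append-only log split into partitions; messages are never removed on read."""

    def __init__(self, partitions: int) -> None:
        self.partitions: List[List[str]] = [[] for _ in range(partitions)]

    def append(self, key: str, value: str) -> int:
        # Same key always lands in the same partition, so its events stay ordered.
        partition = zlib.crc32(key.encode()) % len(self.partitions)
        self.partitions[partition].append(value)
        return partition


class Consumer:
    """Each consumer tracks its own read position (offset) per partition."""

    def __init__(self, topic: Topic) -> None:
        self.topic = topic
        self.offsets = [0] * len(topic.partitions)

    def poll(self) -> List[str]:
        fetched: List[str] = []
        for partition, log in enumerate(self.topic.partitions):
            fetched.extend(log[self.offsets[partition]:])
            self.offsets[partition] = len(log)
        return fetched

    def seek_to_beginning(self) -> None:
        self.offsets = [0] * len(self.topic.partitions)


topic = Topic(partitions=2)
for i in range(3):
    topic.append("user-42", f"event-{i}")

indexer = Consumer(topic)
dashboard = Consumer(topic)

first = indexer.poll()             # event-0, event-1, event-2
topic.append("user-42", "event-3")
second = indexer.poll()            # only event-3, the indexer remembers its offset
late = dashboard.poll()            # a separate consumer still sees all four events

indexer.seek_to_beginning()
replay = indexer.poll()            # re-reading is just resetting the offset

print(first, second, late, replay)
```

Reading never mutates the log, so adding a new consumer costs the producers nothing.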

------
mtw
What kind of companies or startups usually use this service?

~~~
rhodin
Companies dealing with large amounts of data. A list with some companies using
Apache Kafka can be found here:
[https://cwiki.apache.org/confluence/display/KAFKA/Powered+By](https://cwiki.apache.org/confluence/display/KAFKA/Powered+By)

~~~
mtw
thanks. I guess my sites are not big enough (yet)

------
tenismyanswer
All kafkaesque to me ;->

------
elcct
My impression of Kafka was that this thing is bloated. How does it compare to
something like NSQ?

~~~
kasey_junk
It's a completely different use case. Many times people call Kafka a "message
queue" but it's not. It's a distributed log service. It's possible to build a
message queue on top of a distributed log service but there are reasons not
to.

It's better to think of Kafka as a database for events, not as a transport
mechanism for those events.

As for being bloated, Kafka lives in a very empty space, that is it supports
fully ordered events to all consumers (and it has good HA options). The only
other tool that I've come across that gives you the same data guarantees is
Kinesis and it requires AWS.

I've found that yes, Kafka is complex, but it's complex because it's solving a
complex problem, not because it's bloated.

That said, if you want a non-ordered message queue, use NSQ instead of Kafka.

~~~
elcct
Thanks for the explanation. I didn't know those things.

