
What every software engineer should know about Apache Kafka - aloknnikhil
https://www.michael-noll.com/blog/2020/01/16/what-every-software-engineer-should-know-about-apache-kafka-fundamentals/
======
georgewfraser
This notion of “stream-table duality” might be the most misleading, damaging
idea floating around in software engineering today. Yes, you can turn a stream
of events into a table of the present state. However, during that process you
will eventually confront every single hard problem that relational database
management systems have faced for decades. You will more or less have to write
a full-fledged DBMS in your application code. And you will probably not do a
great job, and will end up with dirty reads, phantoms, and all the other
symptoms of a buggy database.

Kafka is a message broker. It’s not a database and it’s not close to being a
database. This idea of stream-table duality is not nearly as profound or
important as it seems at first.
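
The fold being criticized here looks deceptively simple; a minimal Python sketch (event shapes are made up for illustration) shows what "turning a stream into a table" amounts to, and where all the DBMS problems start:

```python
# Naive "stream-table duality": fold an event stream into current state.
# Event shapes here are hypothetical, purely for illustration.
events = [
    {"type": "account_created", "id": "a1", "balance": 0},
    {"type": "deposit", "id": "a1", "amount": 100},
    {"type": "deposit", "id": "a1", "amount": 50},
]

table = {}  # account id -> current state

for event in events:
    if event["type"] == "account_created":
        table[event["id"]] = {"balance": event["balance"]}
    elif event["type"] == "deposit":
        # No isolation, no constraints, no concurrent-writer handling:
        # everything an RDBMS normally does is now application code.
        table[event["id"]]["balance"] += event["amount"]

print(table["a1"]["balance"])  # 150
```

The fold itself is three lines; the hard part is everything around it (concurrent readers mid-fold, partial failures, constraint checks), which is the parent's point.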

~~~
jholman
Recently I watched a 50-engineer startup allocate more than 50% of their
engineering time for about two years to trying to cope with the consequences
of using Kafka as their database, and eventually try to migrate off of it. The
whole time I was wondering "but how could anyone have started down this
path?!?"

Apparently the primary reason they went out of business was sales-related, not
purely technical, but if they hadn't used Kafka, they could have had 2x the
feature velocity, or better yet 2x the runway, which might have let them
survive and eventually thrive.

Imagine, thinking you want a message bus as your primary database.

~~~
steve_adams_86
I worked with a short-lived startup which made this exact mistake. I suggested
persisting important events to a Postgres DB, but I was shot down over and
over. They were positive it would be fine to hold everything in Kafka. There
was this notion that Kafka functioned fine as a long-term data store with the
correct configuration (I'm not disputing that), but there was no solution to
the lack of... well, everything Postgres could offer. As you mentioned, the
lack of velocity really dragged them down. Customers were constantly upset
that x broke or y wasn't delivered yet, and each time it was clear that Kafka
wasn't helping them meet those needs.

Kafka is awesome for what it does, but there are a ton of people out there
using it for weird stuff.

~~~
konradb
Do you happen to know what sort of thinking leads a team down this path? It
seems a fundamental mistake. Resume driven development?

~~~
JamesBarney
In addition to what others have answered, I think many devs consistently
underestimate the amount of heavy lifting that relational databases do.

And even worse a lot of this heavy lifting isn't obvious until you are fairly
far into a project and suddenly realize that to solve a bug you need to
implement transactions or referential integrity.
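
Two examples of that heavy lifting, referential integrity and atomic transactions, using stdlib sqlite3 (the schema is made up for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")
con.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
con.execute("""CREATE TABLE transfers (
    id INTEGER PRIMARY KEY,
    account_id INTEGER REFERENCES accounts(id),
    amount INTEGER)""")
con.execute("INSERT INTO accounts VALUES (1, 100)")
con.commit()

# Referential integrity: a transfer pointing at a nonexistent account
# is rejected by the database, not by hand-rolled application checks.
try:
    con.execute("INSERT INTO transfers VALUES (1, 999, 50)")
except sqlite3.IntegrityError:
    print("rejected dangling reference")

# Atomicity: if the process dies mid-transfer, the debit is rolled back.
try:
    with con:  # transaction: commits on success, rolls back on exception
        con.execute("UPDATE accounts SET balance = balance - 50 WHERE id = 1")
        raise RuntimeError("crash mid-transfer")
except RuntimeError:
    pass

balance = con.execute("SELECT balance FROM accounts WHERE id = 1").fetchone()[0]
print(balance)  # 100: the partial update never became visible
```

Rebuilding either of these correctly on top of a log of Kafka messages is exactly the kind of surprise work the parent describes.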

~~~
commandlinefan
Not just Kafka, either. Mongo, Cassandra, couchbase, Redis, SOLR and elastic
search are all mistaken as replacements for an RDBMS.

~~~
411111111111111
Cassandra _can_ replace an RDBMS in a lot of cases if eventual consistency
works for your data.

It's not a drop-in replacement though, that's for sure.

~~~
commandlinefan
Better than the other suggestions, but my experience is that you run into
Cassandra's limitations really quickly: you can't query on non-primary
columns, and you can't join tables without pulling all the data down to the
client and merging manually.

~~~
411111111111111
> _can't query on non-primary columns_

That's incorrect, unless I'm misunderstanding what you mean by primary
columns. It's just not as efficient.

And missing joins are by design, and one of the reasons Cassandra is as fast
as it is. And as I said before: it's not a drop-in replacement. You need to
architect your application around its strengths to leverage its performance.

It is, however, usable as an RDBMS replacement if you know what you're doing
and your data is fine with eventual consistency.

And knowing what Cassandra does with your data is important as well. It's
actually a key-value store on steroids. Once you get that, its limitations
and strengths become obvious.

~~~
commandlinefan
> That's incorrect unless I'm misunderstanding what you mean with primary
> columns

I'm referring to "Cannot execute this query as it might involve data filtering
and thus may have unpredictable performance. If you want to execute this query
despite the performance unpredictability, use ALLOW FILTERING"

> It's just not as efficient.

That's what I thought, too, so I "allowed filtering". And crashed the
database. (Apparently the correct solution here is to use Spark).

~~~
411111111111111
Haha, yes, that's a real possibility.

Cassandra is at its heart a key-value store. For every queryable field, you
need to create a new in-place copy of the data, so you're basically saving
the data n times, once for each way you wish to query it.

If you however try to query on a field which hasn't been marked as queryable,
the cluster will have to basically select everything in the table and do a
for loop to filter for the 'where' clause :-)

But I haven't used it in production yet, so you've got more experience than I
do.
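
A plain-Python caricature of that modeling style (table and field names invented for illustration): one "table" per query pattern, and a full scan standing in for ALLOW FILTERING:

```python
# Cassandra-style modeling sketch: one dict per query pattern, keyed by
# the partition key. The application writes every copy itself; there is
# no secondary index doing this for you (materialized views aside).
users_by_id = {}       # serves: user by id
users_by_email = {}    # serves: user by email -- a full copy, not an index

def insert_user(user):
    users_by_id[user["id"]] = user
    users_by_email[user["email"]] = user

insert_user({"id": 1, "email": "a@example.com", "name": "Ada"})

# Fast: lookup by the key the table is organized around.
print(users_by_id[1]["name"])

# The ALLOW FILTERING equivalent: no table keyed by name, so this is a
# full scan -- exactly the "for loop over everything" described above.
matches = [u for u in users_by_id.values() if u["name"] == "Ada"]
print(len(matches))  # 1
```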

------
opportune
> With Kafka, such a stream may record the history of your business for
> hundreds of years

Do not do this. Kafka is not a database! Kafka should never be the source of
truth for your business. The source of truth should be in whatever consumes
data from Kafka downstream when messages are committed as read. Why? Because
in your middle layer you can do all the data normalization, sanity checking,
processing, and interaction with a REAL database system downstream that can
give you things like transactions, ACID, etc.

Of course Confluent _wants_ you to try to use Kafka as a DB, so your usage of
it is very high and you pay for the top support package and they have you by
the cojones, but that doesn't mean you should do it. You will miss out on
all the benefits of using a real database, and for what benefit? Having a
simple client API?

~~~
sixdimensional
So, I've been having a back and forth with a colleague on this and I'm
genuinely interested in why you so strongly suggest this.

For the record, I have good real world experience with all kinds of databases
(relational, NoSQL, and even legacy multivalue and hierarchical ones), and I
don't see why what you say has to be "always true".

One way of looking at Kafka is that it is an unbundled transaction log,
nothing more or less, so it could be used to permanently store and replay
transactional activity, if one wishes. Note that even most databases don't
store an immutable, permanent transaction log (logs often grow to be huge and
are truncated every so often, with tables holding the current state).

This article by Confluent seems to cover the topic (yes, recognizing it is
written by the very vendor you suggest is trying to lock us in):
[https://www.confluent.io/blog/okay-store-data-apache-kafka/](https://www.confluent.io/blog/okay-store-data-apache-kafka/).

Ok, so how about the idea of a persistent, immutable, never-ending transaction
log (uhoh, sounds like blockchain now!)? Setting aside Kafka for now, what do
you think about the basic design pattern? To me it sounds a bit like it could
represent a temporal database in raw transactional log form. Why not?

EDIT: After rereading your comment I see your main concern is using Kafka as a
database management system (DBMS). I would agree, that's not what Kafka is
for. But, I don't think Confluent intends that use case, do they? I look at it
more as an unbundled single component that is very useful by itself, and is
part of a more complex data platform/architecture (ex. Lambda or Kappa
architecture).

~~~
yawaramin
> Ok, so how about the idea of a persistent, immutable, never-ending
> transaction log (uhoh, sounds like blockchain now!)? Setting aside Kafka for
> now, what do you think about the basic design pattern? To me it sounds a bit
> like it could represent a temporal database in raw transactional log form.
> Why not?

Because nobody wants to be replaying events all the time to get their actual
data. They want the data to be, well, materialized. Replaying events can be
helpful if you need an audit trail but the systems which need that have mostly
all evolved their own audit trail techniques, e.g. double-entry accounting.

The people who invented these streaming event systems quickly realized that
continually replaying events from the beginning would get absurdly expensive,
so they even implemented 'checkpoint' events that snapshot the current state
of the data every once in a while so that replays can start from (a hopefully
recent) snapshot and finish quickly. At that point you have to encode the
logic of how to roll up all events into the current state into your
checkpointing code, which immediately reintroduces the notion of a global
current state of the data, which is in fact what RDBMSs solve anyway.
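
A minimal sketch of that checkpoint/replay mechanism, with made-up event and state shapes; note how the roll-up logic in `apply` is exactly the "global current state" being smuggled back in:

```python
# Event replay with periodic snapshots: replays start from the latest
# snapshot instead of offset 0. Shapes are hypothetical.
SNAPSHOT_EVERY = 3

def apply(state, event):
    # The roll-up logic that the checkpointing code must encode.
    state = dict(state)
    state[event["key"]] = state.get(event["key"], 0) + event["delta"]
    return state

log = []        # (offset, event)
snapshots = []  # (offset, state)

def replay():
    if snapshots:
        offset, state = snapshots[-1]
    else:
        offset, state = 0, {}
    for _, event in log[offset:]:
        state = apply(state, event)
    return state

def append(event):
    log.append((len(log), event))
    if len(log) % SNAPSHOT_EVERY == 0:
        snapshots.append((len(log), replay()))

for _ in range(7):
    append({"key": "x", "delta": 1})

# Final replay only touches the one event after the last snapshot.
print(replay()["x"])  # 7
```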

~~~
monadic2
RDBMSs do not generally (i.e. with a generic interface) allow access to the
backing event stream, though, meaning you'll need to write this yourself when
you need to sync changes across databases. There's no free lunch.

~~~
yawaramin
Across different databases, as in Postgres-to-Maria, sure. But people who
decide to do that should know what they are getting into. Across different
instances of, say, Postgres, it is rather simple.

~~~
gunnarmorling
Debezium (debezium.io) might be interesting for you; it provides open-source
change data capture for a variety of databases. Together with the right Kafka
Connect sink connectors (or using it via Pulsar etc.), setting up a
Postgres-to-Maria data pipeline is quite easy. (Disclaimer: I work on
Debezium)

------
Twisell
Just a remark to writers: when you redact an "introduction to smth", please
refrain from writing down the name of your product 50 or more times in the
first paragraph. It's totally frustrating, and it made me run away and just
look it up on Wikipedia instead to get a grasp of the general idea.

Example: You've probably heard of smth thing and wonder how smth differ from
smth else. In this article about smth we will dive into smth things to
discover how smth is well better suited to do things thanks to smth things and
other things that are really specific to smth! The power of smth enable things
to things in a way things do things that make smth something you need to learn
about.

So during this journey about smth will be cut in four part the first being an
introduction.

(...) --sudo click the link for first part

This is the first part of a fourth series about smth to learn more about
things and things in smth.

Smth is very specific about things, and that's specifically why smth is well
suited for things. Smth is a new way to do something that will make you think
more about things and things. Let's now dive into smth...

Smth use things as a things to do the smth things with tings on smth.........

--sudo repeat marketing ad nauseam

No matter whether the actual product is pure gold or pure garbage, you
probably just lost 50% of the readers at some point.

~~~
unphased
I totally agree with what you're saying, but you should also know that
"redact" doesn't fit in your sentence.

~~~
Twisell
Too late to edit, sorry. English is not my native language and I clearly see
my translation mistake now. Thanks for pointing it out ;)

------
ckdarby
What every software engineer should know about Kafka: it's dead.

If you're not already technically chained into it and Confluent hasn't
already upsold your poor organization, avoid it.

If you want early flexibility and a rapid PoC, just look at AWS
Kinesis/Firehose.

If you're looking at large scale (1+ Gbit ingest, 100k msgs/s, that kind of
stuff), then Apache Pulsar is where to go.

~~~
studmuffin650
I would argue that Kinesis is not the way to go for a quick POC unless you're
tied to the JVM.

Pulsar is still niche in most enterprises.

Kafka is not dead; there are many enterprises (including 2 successful ones
I've worked at in the past 5 years) that have built POCs and successful
products on Kafka. It supports all languages performantly and has tons of
community support. I would argue there is nothing better to build a POC on.

~~~
andonisus
Why would you be tied to the JVM by using Kinesis? You can write a client
library for any language. We did it for Go.

~~~
agacera
Because the official client from AWS is written in Java.

It is possible to write clients in any language; however, it is not that
simple, especially when you need to handle the logic of scaling your Kinesis
stream out or in (which will split or join shards) and when you have multiple
consumers in the same consumer group (you will need a distributed locking
mechanism and logic to steal locks if one consumer dies).

So it is not trivial.

~~~
etxm
Sure it is. AWS ships a bunch of clients for Kinesis as well as a kinesis
agent for shipping log files.

[https://docs.aws.amazon.com/sdk-for-go/api/service/kinesis/](https://docs.aws.amazon.com/sdk-for-go/api/service/kinesis/)

~~~
agacera
This is an SDK, not a full client for Kinesis.

This is the only official implementation from AWS:
[https://github.com/awslabs/amazon-kinesis-client](https://github.com/awslabs/amazon-kinesis-client)

The client handles horizontal scaling, checkpointing, shard splits and shard
merges. Using just the SDK, you have to build this yourself (unless you are
using Kinesis for use cases that don't need it done correctly).

This is the doc for developing consumers using the SDK:
[https://docs.aws.amazon.com/streams/latest/dev/developing-consumers-with-sdk.html](https://docs.aws.amazon.com/streams/latest/dev/developing-consumers-with-sdk.html)

And in the second paragraph of this documentation: "These examples discuss the
Kinesis Data Streams API and use the AWS SDK for Java to get data from a
stream. However, for most use cases, you should prefer using the Kinesis
Client Library (KCL) . For more information, see Developing KCL 1.x
Consumers."

------
gnfargbl
Something that I wish I had known about Apache Kafka a year or so ago is that
it essentially has no support for long-running tasks, i.e. tasks where
longest-possible-worker-execution-time >> longest-tolerable-group-rebalance-time.

After much angst in trying to work around this issue, I finally gave up and
switched to Pulsar. Pulsar isn't without its own issues (mostly around bugs
and general maturity) but it handles this particular scenario admirably.

~~~
ketralnis
It's true, message buses and work queues have different characteristics. It
sounds like you want a work queue, not a message bus. I've had very good
experience using RabbitMQ for work queueing, but as you mention there are
others too.

~~~
biggestlou
Pulsar works quite well as a message queue:
[https://pulsar.apache.org/docs/en/cookbooks-message-queue/](https://pulsar.apache.org/docs/en/cookbooks-message-queue/)

~~~
skyde
Pulsar can also support infinite data retention using data tiering and by
spreading data on all nodes.

It’s much better than Kafka when you don’t know when the consumer will come
back looking for what changed

~~~
biggestlou
Spreading data on all BookKeeper nodes, yes :) Pulsar brokers are themselves
stateless.

------
bonquesha99
As a Kafka alternative, has anyone attempted to use PostgreSQL logical
replication with table partitioning for async service communication?

Proof of concept (with diagrams in the comments):
[https://gist.github.com/shuber/8e53d42d0de40e90edaf4fb182b59...](https://gist.github.com/shuber/8e53d42d0de40e90edaf4fb182b59dfc)

Services would commit messages to their own databases along with the rest of
their data (with the same transactional guarantees) and then messages are
"realtime" replicated (with all of its features and guarantees) to the
receiving service's database where their workers (e.g.
[https://github.com/que-rb/que](https://github.com/que-rb/que), skip locked
polling, etc) are waiting to respond by inserting messages into _their_
database to be replicated back.

Throw in a trigger to automatically acknowledge/cleanup/notify messages and I
think we've got something that resembles a queue? Maybe make that same trigger
match incoming messages against a "routes" table (based on message type,
certain JSON schemas in the payload, etc) and write matches to the que-rb jobs
table instead for some kind of distributed/replicated work queue hybrid?
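
For what it's worth, the message flow can be simulated in-memory to sanity-check the shape of it. Everything below is an invented stand-in: real logical replication, que-rb workers, and SKIP LOCKED polling are replaced by trivial functions.

```python
from dataclasses import dataclass, field

@dataclass
class Service:
    name: str
    state: dict = field(default_factory=dict)
    outbox: list = field(default_factory=list)  # partitioned messages table
    inbox: list = field(default_factory=list)   # replicated from peers

    def commit(self, state_change, message):
        # Same "transaction": business data and the outgoing message are
        # written to the service's own database together.
        self.state.update(state_change)
        self.outbox.append(message)

def replicate(src, dst):
    # Stand-in for logical replication of the outbox partition into the
    # receiving service's database.
    while src.outbox:
        dst.inbox.append(src.outbox.pop(0))

orders = Service("orders")
billing = Service("billing")

orders.commit({"order:1": "placed"}, {"type": "order_placed", "order": 1})
replicate(orders, billing)

# Billing's worker handles the message and replies via its own outbox.
msg = billing.inbox.pop(0)
billing.commit({"invoice:1": "created"},
               {"type": "invoice_created", "order": msg["order"]})
replicate(billing, orders)

print(orders.inbox[0]["type"])  # invoice_created
```

The interesting failure modes (replication lag, duplicate delivery on worker crash, ordering across partitions) live precisely in the parts this toy elides.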

I'm looking to poke holes in this concept before sinking any more time into
exploring the idea. Any feedback/warnings/concerns would be much appreciated,
thanks for your time!

Other discussions:

* [https://old.reddit.com/r/PostgreSQL/comments/gkdp6p/logical_...](https://old.reddit.com/r/PostgreSQL/comments/gkdp6p/logical_replication_for_async_service/)

* [https://dba.stackexchange.com/questions/267266/postgresql-lo...](https://dba.stackexchange.com/questions/267266/postgresql-logical-replication-for-async-service-communication)

* [https://www.postgresql.org/message-id/CAM8f5Mi1Ftj%2B48PZxN1...](https://www.postgresql.org/message-id/CAM8f5Mi1Ftj%2B48PZxN1AbM-P%3D4YMLENY5zRaPwTbmbkFwCsTkA%40mail.gmail.com)

~~~
PopeDotNinja
Just as Kafka isn't a database, PostgreSQL isn't a queue/broker. You can use
it that way, but you'll spend a lot of time tweaking it, and I suspect you'll
find it's too slow for non-trivial workloads.

~~~
mianos
Skype at its earlier peak used Postgres as a queue at huge scale (PgQ). It
had a few tweaks and, sure, it is an anti-pattern, but it can work. It sure
is handy if you are already using Postgres and want to maintain a small
stack.

------
Rebelgecko
I'm surprised by the amount of criticism in this thread. I've used Kafka in
the past and it definitely got the job done (as a message bus, not using
stream processing or the other more whiz-bang features). What do people use
instead?

~~~
skyde
My experience is the opposite: unless you really need the whiz-bang features,
you should never use Kafka. It's the least reliable and hardest to run; it
requires active babysitting by a skilled team of admins and ops.

If you don't need low end-to-end latency (a tailing consumer blocked on long
polling), it's better to use something like HDFS or AWS S3. If you need it
but have low throughput, it's better to use something like RabbitMQ. If you
have both high throughput and need polling, then it's worth investing in
Apache Pulsar.

~~~
maxdo
Pulsar is half-baked, with a limited number of half-baked clients even for
major languages. Kafka instead is a bulletproof solution with tons of clients
in any flavor, and with instruments to detect issues and alert about them.
Don't even compare them.

~~~
manigandham
Pulsar is far better designed than Kafka and is much more reliable and
scalable. Clients in every language are an entirely different issue, mostly
down to developer bandwidth because it's a small team.

Kafka actually doesn't have that many great clients either; most are just
wrappers around the C librdkafka library.

Pulsar clients are also easy to create because of the stateless protocol,
per-message acknowledgements, and optional WebSocket API. Or you can just use
the Kafka bridge adapter and keep your existing Kafka clients.

------
skyde
What every engineer should know about Kafka is that it should not be used for
anything critical the way you would use Cassandra or HBase.

But if you are OK with partitions not being available for many hours, or with
losing all written data because the cluster did not automatically move a
partition to 3 new replicas after 2 of the replicas failed... then it's a
good, scalable (speed-wise) product.

There is also no serious multi-tenant support. So if you need multitenancy,
you've got to use Kubernetes, do one cluster per tenant, and automate that
yourself.

------
Traster
There seems to be this common problem among relatively new technologies, that
they're not _actually_ aware of what the average person knows about them. So
let me be the moron in the room. I work at a company that uses Kafka. What I
know so far is that Kafka is broken. It seems to me that this article is more
about what every software engineer who plans to re-skill as a kafka engineer
should know.

~~~
oweiler
In which way is Kafka broken?

~~~
skyde
The main one for me, after 5 years of running several Kafka clusters for
production-critical systems at Microsoft, is that it's very fragile because
of bad design.

Distributed systems are hard to design correctly. With HDFS you can create a
single 1TB file if the cluster has enough capacity. And with HDFS, HBase,
Cassandra, etc., if 5 servers permanently fail in the span of 24 hours, you
don't need manual intervention and your data is not lost.

In Kafka, your partition data needs to fit entirely on a single server, and
you might still run into issues if that server is also hosting other
partitions.

And with Kafka, if the 3 servers manually assigned as replicas for a
partition fail one by one before an engineer is woken up in the middle of the
night and has time to fix it manually, you just lost all your data.

~~~
LgWoodenBadger
Do you have examples of other systems that won't lose all your data if all
their replicas fail?

This seems unavoidable in any scenario.

~~~
skyde
Assuming a replication factor of 3, I'm not talking about 3 machines going
down at the exact same time, but in sequence, like 5 minutes apart.

All other distributed systems handle this with no issue, e.g. Cassandra,
HBase/HDFS, CockroachDB.

------
sixhobbits
So much criticism here! I've read a lot about Kafka over the last few years
and I wish I had read this article earlier -- even basic questions like "Can
Kafka store data persistently?" are not adequately answered in many intros to
it.

That said, I do find the tutorial flip-flops a bit in target audience. It's
mainly "this is what Kafka is", but sometimes has weird asides like "This is
how to optimise Kafka" (redundancy, number of partitions, etc) which are
pretty distracting from the more fundamental points.

------
nemetroid
I read the introduction to the series, and then the introduction to the first
part, and I’m still not sure exactly what Kafka is, or why I (as part of
”every software engineer”) need to know anything about it. The title suggests
that the article(s) will convey some concepts that are useful in a broad
sense, but from a skim, this looks like a lot of details about some database-
ish product, which probably are good to know if you use that product, but not
so much otherwise.

------
seemslegit
Hmm, I'm pretty sure that a software engineer developing safety-critical
firmware for embedded medical systems does not need to know anything about
Apache Kafka. Or a game developer. Or a web frontend developer. Given the
title it's surprising how many software engineers can in fact go through life
and career without ever knowing anything about Apache Kafka.

~~~
vsareto
By now, "What Every Software Engineer Should Know" headlines aren't intended
to be serious.

~~~
seemslegit
I wish, instead they are just not intended to be taken literally.

------
PaulWaldman
Is there a reason Kafka wouldn't choose to leverage an existing mature RDBMS
for their table storage instead of rolling their own?

~~~
geodel
Well, they use RocksDB internally, though that is not an RDBMS.

------
skyde
Anyone know how large-scale chat systems (Facebook Messenger, ...) are
implemented? My guess is messages are in a data store like HBase, and a very
simple notification system lets users that are online know to fetch new
entries.

~~~
gaogao
It was roughly that at one point,
[http://highscalability.com/blog/2010/11/16/facebooks-new-real-time-messaging-system-hbase-to-store-135.html](http://highscalability.com/blog/2010/11/16/facebooks-new-real-time-messaging-system-hbase-to-store-135.html),
but now does something else,
[https://engineering.fb.com/data-infrastructure/messenger/](https://engineering.fb.com/data-infrastructure/messenger/)

~~~
skyde
Thanks a lot. This seems to be a very simple but smart design.

------
kevindeasis
anyone wanna share their thoughts about deploying their own messaging system
vs using a messaging system provided by their cloud provider?

~~~
realtalk_sp
The GCP Pub/Sub API has largely replicated all the features you'd want out of
Kafka (including Consumer Groups). The primary consideration at this point is
cost. There's an inflection point in size (at some very large message volume)
where it makes sense to start running your own Kafka cluster and hire a
dedicated person or two to manage it. Most companies will never get anywhere
close.

Any project just starting out should use Pub/Sub. One thing I really like is
that GCP provides emulators of Pub/Sub et al for local testing. That used to
be a bit of an obstacle not too long ago.

In terms of lock-in, I don't see how that applies to an AMQ. The data moving
through it should only be transiently persisted, up to a week or two at most
in the usual case.

If you want to avoid cloud lock-in, have DB backups, use Postgres/MySQL/etc,
containerize your service(s), replicate data in object storage, etc. Common
sense stuff, if that's something that's of concern.

Personally, I've seen "vendor lock-in" weaponized as an excuse for a lot of
costly NIH bullshit. It's painful to reflect back on a project that could have
involved literally a tenth of the time and pain it ended up taking because of
that one choice alone.

~~~
dirtydroog
GCP Pub/Sub is insanely expensive

~~~
bootlooped
I don't have a lot of devops experience, but I was just struck by how cheap it
appeared to me. $40 / TB? I can't even imagine how much money is sunk into
managing the Kafka clusters at my employer.

------
analognoise
"Every engineer".

The whole human activity of reducing science to practical art will do just
fine without knowing Apache Kafka, thanks.

------
simonjgreen
"Your scientists were so preoccupied with whether or not they could, they
didn’t stop to think if they should."

------
cosmotic
Cool click-bait title

~~~
fmjrey
Not necessarily, it's most likely a reference to the popular article from
2013:
[https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying](https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying)

~~~
ken
Or directly from 1991's "What Every Computer Scientist Should Know About
Floating-Point Arithmetic".

~~~
pierrec
Where the "every" was actually justified, contrary to its descendants.

------
badrabbit
Sometimes you should just use Graylog (Kafka + Elasticsearch), especially if
you are already comfortable with Elastic. You get to scale, retain, and
monitor your data in addition to stream processing. If I had a fairly small
Go webapp that needed stream processing, I would just use Graylog instead of
trying to use Kafka directly.

------
est
One thing that bit me with Kafka is that for each partition there can be only
one consumer (within a consumer group). If your consumer has performance
issues (e.g. using CPython), then you are out of luck.

~~~
bootlooped
The thing this article pointed out to me, which I didn't know before, is that
this is why you should just set the partition count very high to begin with.
Then you can horizontally scale consumers.
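
The partition/consumer relationship can be sketched like this (the round-robin assignment here is a simplified stand-in for Kafka's actual assignors):

```python
# Toy model of Kafka's parallelism rule: within a consumer group each
# partition goes to exactly one consumer, so partitions cap parallelism.
NUM_PARTITIONS = 6

def assign(partitions, consumers):
    # Round-robin the partitions over the consumers in the group.
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

partitions = list(range(NUM_PARTITIONS))

# 3 consumers: each gets 2 partitions, all busy.
three = assign(partitions, ["c1", "c2", "c3"])
print(three)

# 8 consumers for 6 partitions: two of them sit idle forever.
eight = assign(partitions, [f"c{i}" for i in range(8)])
idle = [c for c, ps in eight.items() if not ps]
print(len(idle))  # 2
```

Which is why over-provisioning partitions up front is cheap insurance: you can always add consumers later, but repartitioning a live topic is painful.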

~~~
est
Yes, that's the lesson I learned the hard way. The publisher was from another
department and had only 1 partition with a large volume of data.

------
tspann
Kafka for events, Flink for continuous SQL against topic events, Kudu for
event storage. Databases for data storage and queries.

------
a_c
1st rule is you don't need kafka.

Second rule is don't forget about 1st rule.

------
xrisk
can someone give me a tldr for what problem kafka solves / what situation
calls for the usage of kafka?

------
gigatexal
this is so timely, thank you!

------
unohoo
Use pulsar - so much better than kafka

~~~
lytedev
Why is that?

~~~
dominotw
Millions of topics, no ZooKeeper, etc. Kafka is addressing these shortcomings
on its roadmap.

~~~
oweiler
For a lot of projects this is hardly a problem. On the other hand, Kafka is
more mature and has a huge ecosystem (Kafka Connect, Kafka Streams, KSQL,
...).

~~~
math
Kafka also has fewer moving parts, even today before ZooKeeper removal is
complete (2 vs Pulsar's 3).

~~~
biggestlou
But one of those moving parts of Pulsar, BookKeeper, means that you're no
longer storing data on message brokers. Worth the extra puzzle piece for a lot
of use cases.

