
Apache Kafka Goes 1.0 - pradeepchhetri
https://www.confluent.io/blog/apache-kafka-goes-1-0/
======
manigandham
Congrats to the kafka/confluent team.

Side note: [https://pulsar.apache.org](https://pulsar.apache.org) also seems
to be gaining traction and has a much better story around performance,
pub/sub, multi-tenancy, and cross-dc replication. Will be interesting to see
the evolution of both going forward.

~~~
slap_shot
Interesting. Can someone give a quick/high level list what makes it better
than Kafka and in what use cases?

~~~
manigandham
I recommend reading the Pulsar architecture page:
[https://pulsar.apache.org/docs/latest/getting-started/Concep...](https://pulsar.apache.org/docs/latest/getting-started/ConceptsAndArchitecture/)

Biggest difference is that Pulsar splits compute from storage. Pulsar
_brokers_ are stateless services that manage topics and handle clients
(publishers and consumers), while reading/writing data from a lower tier of
_bookies_, the storage nodes of the Apache BookKeeper project, which is built
for low-latency real-time storage workloads.

Better performance and easier scaling compared to Kafka's partition
rebalancing. It also natively supports data center replication and multiple
clusters working together globally. Pulsar addressing is
property/cluster/namespace/topic so multi-tenant isolation is built-in.
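That addressing scheme shows up directly in topic names. A quick sketch of pulling a (pre-2.0 style) Pulsar topic name apart; the property/cluster/namespace values here are made up:

```python
# Old-style Pulsar topic names encode tenant isolation in the address:
# persistent://property/cluster/namespace/topic
topic = "persistent://my-property/us-west/my-namespace/clicks"

scheme, rest = topic.split("://")
prop, cluster, namespace, name = rest.split("/")
print(prop, cluster, namespace, name)   # my-property us-west my-namespace clicks
```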

Anywhere you use Kafka, you can use Pulsar, and you can also consolidate
pub/sub-only systems that need lower latency than Kafka.

~~~
lmsp
I think it goes beyond partition rebalancing. People haven't realized what
making the message broker `stateless` buys you: the system is much better at
reacting to failures or shifting traffic, which is critical when running a
messaging bus for online services, because it doesn't have to wait for a
whole partition's data to be copied when an error occurs.

~~~
manigandham
Agreed, the architecture page and other blog posts do a more thorough job of
explaining the details. Having a stateless broker layer on top of a focused
data layer makes all the operations much easier. I expect BookKeeper to
further integrate with the various cloud storage APIs so it can also start to
become a stateless cache.

------
qntmfred
For the folks using Kafka or Kinesis or other products with streaming event
architectures - do you have a replay capability?

For example, I am using Kinesis and I have a lambda that processes events and
writes to a Postgres database. So say 3 months from now I want to also create
a lambda that firehoses to Redshift. How do I get the first 3 months of my
data into Redshift? Right now I have a lambda that writes all events to S3,
so I can replay them for other stream consumers (or even for debugging
existing consumers).

This seems like a reasonable (if naive) solution, but I don't see a lot of
talk out there about ensuring this capability exists in the first place, and
I definitely haven't seen any more robust or thoughtful options discussed.
Are people just not doing this, generally?

~~~
etxm
We have something overly dumb and simple. We generate about 400GB of logs a
day through Kinesis to Lambda to S3, plus a second lambda that reads off S3
and forwards the stuff to wherever it needs to go using an s3:ObjectCreated
event ...

If we need to replay something we just move the group of files in and out of
the folder on s3 (bucket/logs -> bucket/temp -> bucket/logs) and it triggers a
new objectCreated and fires the lambda.

We execute the 2 move operations with the AWS cli tool.

The one other thing we do is we have a config file in bucket/config.json
that says event type X should go to Redshift, pg, ES, wherever. So we can
tweak that before a move to send data to additional data stores.

We follow ELT instead of ETL for streaming into data stores.
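A toy in-process simulation of that move-out/move-back trick (not real S3 or Lambda code; all names are made up). Moving an object out of the watched prefix and back in fires a fresh objectCreated event, which re-runs the forwarding handler:

```python
# Toy model of an S3 bucket that fires "objectCreated" callbacks on put.
class FakeBucket:
    def __init__(self):
        self.objects = {}          # key -> payload
        self.handlers = []         # callbacks fired on objectCreated

    def on_object_created(self, fn):
        self.handlers.append(fn)

    def put(self, key, payload):
        self.objects[key] = payload
        for fn in self.handlers:
            fn(key, payload)

    def move(self, src, dst):
        payload = self.objects.pop(src)
        self.put(dst, payload)     # the put at the new key fires the event

bucket = FakeBucket()
delivered = []

# "Second lambda": forwards anything landing under logs/ downstream.
bucket.on_object_created(
    lambda key, payload: delivered.append(payload) if key.startswith("logs/") else None
)

bucket.put("logs/2017-11-01.json", "batch-1")                 # original delivery
bucket.move("logs/2017-11-01.json", "temp/2017-11-01.json")   # out of the prefix
bucket.move("temp/2017-11-01.json", "logs/2017-11-01.json")   # back in: replay!

print(delivered)   # ['batch-1', 'batch-1'] -- the move back re-triggered it
```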

~~~
angersock
What are you doing that makes a half-terabyte of logs per day? That seems
kinda nuts.

~~~
odammit
Ads ads ads!

That’s just one system. Our Nginx access logs alone are about 250GB per day
(they use that system).

------
faizshah
I'm currently comparing using Kinesis vs running a small scale Kafka cluster
on AWS. The ecosystem around Kafka is great, especially Kafka connect's stuff
like Debezium. But I don't know if it's worth the trouble to deal with the
extra operational complexity. Any opinions on administrating Kafka at small
scale?

~~~
nehanarkhede
Disclaimer: I'm one of the creators of Kafka and founders of Confluent

The best way to get the power, throughput, latency of Kafka without the
operations is to use a hosted service. The one created and supported by the
Kafka team is Confluent Cloud: [https://www.confluent.io/confluent-cloud/](https://www.confluent.io/confluent-cloud/)

~~~
ralusek
I love it when companies developing a product also offer a managed
implementation as a service, but I cannot find the pricing.

Any time it seems like I have to establish some sort of relationship and get
an enterprise tailored solution, I no longer feel like I'm part of a public
and elastic market.

~~~
mianos
I believe the phrase is "If you have to ask, you can't afford it".

~~~
samsonradu
Hehe, that only applies to luxury goods, which are decoupled from
supply/demand rules.

------
SEJeff
For workloads that aren't as important durability wise, has anyone considered
nats? [https://nats.io/](https://nats.io/)

~~~
pritambaral
I use NATS in production. Kafka and NATS are almost complete opposites. One is
a durable, partitioned queue; the other is a no-persistence message queue.

~~~
no1youknowz
Aren't they fixing that with NATS Streaming Server?

~~~
manigandham
There's nothing to "fix" - NATS is a fast, distributed pub/sub system. NATS
Streaming is actually a client: it uses NATS to communicate, and it basically
just subscribes to some topics, saves the messages it receives, and will then
replay them back to you if asked.

Imagine how you would build a persistent logging app on top of NATS, and
that's what NATS Streaming is.
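A toy in-process sketch of that idea (not the real NATS client API): a fire-and-forget bus, plus one ordinary subscriber that persists everything and can replay it on request:

```python
from collections import defaultdict

class Bus:                                # stands in for core NATS
    def __init__(self):
        self.subs = defaultdict(list)

    def subscribe(self, subject, fn):
        self.subs[subject].append(fn)

    def publish(self, subject, msg):      # no persistence, no acks
        for fn in self.subs[subject]:
            fn(msg)

class StreamingStore:                     # stands in for NATS Streaming
    def __init__(self, bus, subject):
        self.log = []
        bus.subscribe(subject, self.log.append)   # just another subscriber

    def replay(self, start_seq=0):
        return self.log[start_seq:]

bus = Bus()
store = StreamingStore(bus, "orders")
bus.publish("orders", "o1")
bus.publish("orders", "o2")

print(store.replay())    # ['o1', 'o2'] -- messages survive for late readers
print(store.replay(1))   # ['o2'] -- replay from a given sequence number
```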

~~~
SEJeff
We built a metrics relay using it (NATS) and are pushing > 500k metrics per
second in bursts just fine. I've been pretty blown away by its performance,
actually. That said, it isn't durable, so the normal here-be-dragons warnings
apply.

------
di4na
I really wish they would stop with that "exactly once" thing.

The term does not mean what they use it for, and it generates a lot of
confusion.

~~~
chairmanwow
How so?

~~~
ploxiln
"exactly once" is not possible, but you can do "at least once" and make the
processing of the message idempotent or "de-duplicated". This is true of many
messaging systems.

[https://www.confluent.io/blog/exactly-once-semantics-are-pos...](https://www.confluent.io/blog/exactly-once-semantics-are-possible-heres-how-apache-kafka-does-it/)

> Exactly-once Semantics are Possible: Here’s How Kafka Does it

> Now, I know what some of you are thinking. Exactly-once delivery is
> impossible, it comes at too high a price to put it to practical use, or that
> I’m getting all this entirely wrong! You’re not alone in thinking that.

blah blah blah ... of course it's at-least-once, with an "idempotent
producer" so that only one resulting message is published back to another
Kafka stream. Big surprise.

Now many people think "Kafka has exactly-once delivery, that's what I want, I
don't want to have to deal with this at-least-once stuff", when really it's
the same thing. Others have been doing idempotent operations of various kinds
for years, and the user still has to figure out how to do their desired thing
(which might not be sending one more Kafka message) in a mostly idempotent
way.
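A minimal sketch of the at-least-once-plus-dedup pattern being described; the message ids and fields are hypothetical:

```python
# The broker may redeliver, so the consumer keys each message by a stable id
# and skips ones it has already applied.
processed_ids = set()   # in practice: a unique index or key in your database
balance = 0

def handle(msg):
    """Apply msg's effect exactly once even if it is delivered many times."""
    global balance
    if msg["id"] in processed_ids:
        return                      # duplicate delivery: drop it
    balance += msg["amount"]        # the side effect we must not repeat
    processed_ids.add(msg["id"])    # record alongside the side effect

# At-least-once delivery: message m2 arrives twice.
deliveries = [{"id": "m1", "amount": 5},
              {"id": "m2", "amount": 7},
              {"id": "m2", "amount": 7}]
for m in deliveries:
    handle(m)

print(balance)   # 12, not 19 -- the duplicate was discarded
```

In a real system the dedup record and the side effect should be committed atomically (e.g. in one database transaction), otherwise a crash between the two reintroduces duplicates.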

~~~
apurvamehta
From the very same blog post:

> Is this Magical Pixie Dust I can sprinkle on my application?

> No, not quite. Exactly-once processing is an end-to-end guarantee and the
> application has to be designed to not violate the property as well. If you
> are using the consumer API, this means ensuring that you commit changes to
> your application state concordant with your offsets as described here.

I think that is a pretty clear statement that end-to-end exactly-once
semantics don't come for free. It states that there needs to be additional
application logic to achieve this and also specifies what must be done.

~~~
ploxiln
Right. But that's at the end of a separate article, while the post which this
HN discussion is about throws around the words "exactly once" a lot more
casually. The argument is over the use of the words "exactly once". They
should just refer to the feature as "transactions" or "idempotent producer".

------
kafkaisthatyou
Do people use Kafka in situations where data loss is not tolerable (e.g.
accepting credit card receipts)?

~~~
travisp
Yes, but the default configuration is not suitable for that situation. You
will need to make some adjustments if you cannot tolerate data loss.

~~~
kafkaisthatyou
What are the alternatives people use instead of Kafka in these situations?
Low volume but high reliability.

~~~
oskari
For low volume my recommendation would be to just use a traditional relational
database configured for high availability.

If you want to use Kafka and need disaster recovery capabilities we typically
recommend using Kafka Connect or other similar tools to replicate the data to
another cluster or persistent storage system such as S3.

~~~
theptip
+1. For use-cases which impose strict data durability requirements (either for
business or regulatory reasons), I think it's unwise to use anything fancy
like Kafka unless you've maxed out performance of your SQL database.

For example, for credit card receipts, simply by the nature of the
transaction, you're unlikely to be processing enough of them to put pressure
on a SQL database. One $1 transaction per second means you're grossing north
of $30m a year, which is easy to handle in even an unoptimized schema. Citus
reckons you can get thousands to tens of thousands of writes per second[1] on
Postgres, which would be grossing tens or hundreds of billions of dollars;
this tech stack is suitable even when "low volume" becomes quite significant.

Of course, Kafka is designed for situations where you need to process
millions of writes per second, which is "GDP of the whole world" territory if
those writes are credit card receipts, so I'd contend you're unlikely to ever
need Kafka-scale throughput for your credit card payment handling components.

[1]: [https://www.citusdata.com/blog/2017/09/29/what-performance-c...](https://www.citusdata.com/blog/2017/09/29/what-performance-can-you-expect-from-postgres/)
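The back-of-envelope numbers above check out:

```python
# Rough annual gross at $1 per transaction, for a given write rate.
seconds_per_year = 365 * 24 * 3600            # ~31.5 million

print(1 * seconds_per_year)                   # 1 tx/s  -> ~$31.5m/year
print(10_000 * seconds_per_year)              # 10k tx/s -> ~$315 billion/year
```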

------
brootstrap
ugh... i despise this crap..

" the “it just works” experience of using Kafka. "

The blog post says it just works, then shows a crazy diagram stringing
together 8 different pieces of technology (which are themselves insanely
deep). Is there any easy way to understand WTF this crap does without knowing
about RDS, Hadoop, monitoring, real-time BS, streaming, etc.? Does it "just
work" once you've spent weeks and months understanding wtf it is?

~~~
manigandham
Nothing in that diagram has anything to do with Kafka; it's just showing that
you can connect different systems together using a single message bus. If you
are new to that concept then I recommend reading the seminal blog post about
using logs as the data backbone:
[https://engineering.linkedin.com/distributed-systems/log-wha...](https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying)

After that, you can choose whichever implementation you want, of which Kafka
is just one option. And yes, it does require some work to set up and run, as
most software does, but no more or less than any other distributed system at
this point.

------
truth_seeker
Kafka has come a long way. Kudos to the Confluent team.

I came across some projects which claim to be full or partial alternatives to
Kafka with better scalability and less operational overhead.

Jocko -
[https://github.com/travisjeffery/jocko](https://github.com/travisjeffery/jocko)

SMF - [https://senior7515.github.io/smf/](https://senior7515.github.io/smf/)

MapR-ES - [https://mapr.com/blog/kafka-vs-mapr-streams-why-mapr/](https://mapr.com/blog/kafka-vs-mapr-streams-why-mapr/)

Has anyone tried them in their projects?

~~~
manigandham
None of those are alternatives. Jocko's creator now works for Confluent and
it's not production-ready or supported; it's more of a hobby project to
recreate Kafka in Go without ZooKeeper.

SMF is a prototype RPC system and set of libraries based on the C++ Seastar
framework, not a messaging system in itself. MapR-ES is part of the MapR
hadoop/data platform and is also not standalone.

If you want alternatives then look at NATS + NATS Streaming [1] or Apache
Pulsar [2].

1. [http://nats.io](http://nats.io)

2. [https://pulsar.apache.org/](https://pulsar.apache.org/)

~~~
truth_seeker
Thanks for sharing the story about Jocko's creator.

Yes, MapR-ES comes with its own file system and tool chain, but it's a closer
alternative to Kafka than NATS, since it follows the Kafka 0.9 API and can
act as a drop-in replacement. It does not support exactly-once semantics yet,
though.

~~~
manigandham
Usually alternatives are about providing similar functionality, not the exact
interface. If you just want a Kafka API, there are multiple adapters
available for most other messaging systems, including Pulsar.

MapR is an entire data platform and suite of services, one of which is the ES
(event streams) offering. It's not drop-in for Kafka unless you're already
running the MapR platform which is much more involved than just running Kafka.

------
pwdisswordfish
So can it be used comfortably now with anything other than Java?

~~~
frankmcsherry
I've been using it in Rust, but librdkafka had (has?) a bug, fixed in tree,
that cuts read throughput to about 10MB/s in many settings, which hurts
especially over loopback. With that fixed, it seemed pretty pleasant for my
use cases (framed bytestreams at line rates, nothing fancy or web-scale).

Edit: the crate I was using, and enjoy (it brings in librdkafka):
[https://github.com/fede1024/rust-rdkafka](https://github.com/fede1024/rust-rdkafka)

------
zekrioca
Is Kafka good for processing hundreds of GB-sized messages per second? Can a
Kafka stream's output also be fed dynamically as input to another Kafka
stream?

~~~
spiffytech
Kafka expects messages to be under 1MiB. If you need larger messages, they
recommend you store the payload somewhere else and use Kafka to pass around
pointers to the payload. So e.g., pass around an S3 URL.
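A minimal sketch of that pointer-passing ("claim check") pattern, with a dict standing in for S3 and a list standing in for the topic; all names are made up:

```python
import hashlib

blob_store = {}   # stands in for S3
topic = []        # stands in for a Kafka topic

def publish_large(payload: bytes):
    key = hashlib.sha256(payload).hexdigest()       # content-addressed key
    blob_store[key] = payload                       # upload the big payload
    topic.append({"s3_key": key, "size": len(payload)})  # tiny pointer message

def consume(msg):
    return blob_store[msg["s3_key"]]                # fetch payload on demand

publish_large(b"x" * 10_000_000)   # 10MB: too big to put on the topic itself
msg = topic[0]
print(msg["size"])                 # 10000000 -- but the message itself is tiny
assert consume(msg) == b"x" * 10_000_000
```

Consumers that don't need the payload never pay for downloading it; the obvious trade-off is that the blob store and the topic can now disagree about retention.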

------
banq
The Kafka transactional producer doesn't work on Windows 7; this bug has not
yet been fixed:
[https://issues.apache.org/jira/browse/KAFKA-6052](https://issues.apache.org/jira/browse/KAFKA-6052)

