RabbitMQ Streams Overview (rabbitmq.com)
253 points by DeadTrickster on July 13, 2021 | 63 comments

This is really interesting for small scale projects.

If you are a solo dev and really want/need data streams, then you can only choose between Kafka and Redis Streams at the moment. While there are a lot of client libraries for Kafka for nearly every programming language, you really don't wanna manage your own Kafka instance and, to be fair, it really feels like overkill.

So RabbitMQ seems to fit into the same spot as Redis Streams do: easily manageable and usable streams, even for smaller projects. Really hope that this will be usable in many different client libraries soon.

Redpanda [1] is fantastic if you want a solid Kafka clone. It's written in C++, with no external dependencies; it has Raft consensus built in. The API is Kafka-compatible.

There are some missing features compared to Kafka (no dynamic rebalancing of partitions being the top one), but they're also rolling out some new features that Kafka does not have, such as transformation pipelines [2].

NATS's brand new JetStream [3] feature is also looking very promising. It uses Raft internally. NATS itself is rock solid, and JetStream builds on that foundation.

NATS also has very interesting support for replication topologies, meaning you can build graphs of streams that each feed into other streams, so you can do the publishing and consumption in different locations, with different availability constraints.

[1] https://vectorized.io/redpanda

[2] https://vectorized.io/blog/wasm-architecture/

[3] https://docs.nats.io/jetstream/jetstream

Do you have actual production experience with Redpanda? If so, I would love to hear about it. We found out that there is something of a wall of 10k partitions per broker before things start failing/under-replicating without warning or any issues outside of the URPs. This appears to be a limitation of ZooKeeper and fetching metadata. We are fighting this by raising timeouts and such, but this blindsided us, and really the solution is to get people to stop creating topics with dozens of partitions when they aren't needed.

I took a look at redpanda this week, it sounded nice on paper, but them being a young company, I am concerned about what "gotchas" we are going to run into.

I've only run it as part of stress tests when evaluating it for a new application, and I found it to be a pleasure to work with.

We did push it to more than 10k partitions, but I honestly don't remember how that affected it; that's when I discovered that partitions cannot be dynamically rebalanced, which meant we'd have to change the way we would use it.

Interestingly, we also did a similar test with NATS JetStream, which did start struggling around 10,000 consumers. (A consumer in NATS is similar to a partition, as it has its own Raft group.) What I tried to do with JetStream goes against the grain a bit, mind you; I still think it's an excellent piece of software.

We have some great work coming that allows lighter weight consumers to scale to that level and beyond. Happy to chat with folks on how we can make that work today.

You might want to check out NATS[0]. It starts out with RabbitMQ-like functionality (message queue) and can also do Kafka things (distributed persistent log) via NATS JetStream[1] or Liftbridge[2].

It's riding the wave of CNCF cloud-buzzword projects but don't let that scare you off -- generally that means that it is actually really easy to set up and operate, and does most of the things you'd expect via well structured pre-inserted configuration which is a plus. The devil is still in the details though, so read the docs thoroughly to make sure it fits your use case.

[0]: https://nats.io/

[1]: https://docs.nats.io/jetstream/jetstream

[2]: https://liftbridge.io/

NATS has different characteristics. AFAIR, it disconnects clients that are not consuming or are not fast enough.

That is for regular messaging.

So there is base NATS and abstractions built upon NATS. Which are you referring to?

There's documentation on this bug/feature[0] and it looks like using NATS Streaming is a way to fix this. I will admit that it's a bit annoying to figure out the difference between NATS Streaming, JetStream and Liftbridge, but I don't think this issue affects all 3. JetStream is essentially NATS Streaming built into the NATS binary itself, so switching to it should produce the feature set you're looking for.

[0]: https://docs.nats.io/nats-server/nats_admin/slow_consumers

NATS JetStream is very different, implementation-wise, from NATS Streaming. The latter is deprecated and no longer under development.

JetStream is impressive and looks very promising. I did some stress testing recently and found some performance issues and possible bugs, but I wouldn't hesitate to put it into production.

Completely agree with that. I was referring to base NATS.

RabbitMQ plays in the field of traditional general-purpose messaging systems. NATS specialized in performance. Kafka focused on streams. Each focus includes trade-offs, which is all fine. That is all I wanted to address.

>you really don't wanna manage your own Kafka instance

So much of my life would be better right now if you didn't just describe me.

Care to provide some details on this?

Let's say you have a small team and dev, test, prod environments. You want high availability. You need Kafka and ZooKeeper servers, 2-3 of each, so let's say 5 total. Dev can be single servers, but you should probably have an environment that closely mirrors prod, so for your Kafka stack alone you have to manage 12 servers: 5 each for prod and test, 2 for dev.

Then you probably have Kafka connect running somewhere. That's another handful of servers. Maybe Kafka streams is a few more servers. Then you're going to have servers that collect events and publish them to the server. How many more servers is that per environment?

Congratulations, you now have a state of the art event streaming enterprise grade platform, and 20 servers to manage. Better hope your company gets on board with the real time model, otherwise you're now the owner of these pet servers until the end of time.

What's that, you ran into a rare Kafka bug that caused your offsets to be lost after a reboot and now the bus is pushing millions of messages down the pipe to all the consumers? Wow, that sure sucks, hope you can juggle your day job with this massive production issue.

Probably because... ZooKeeper

I find zookeeper to be the easy part. Managing Kafka’s byzantine auth systems, and applying upgrades to clients and servers that change a million things every release, and dealing with the shitstorm of gigabytes of text logs, and dealing with tuning mistakes (don’t you dare leave the file descriptor limit at the OS default!) just sucks up so much attention.

I thought Kafka stopped using zookeeper.

They're working toward that goal.

2.8.0 is out and it supports zk-less. But I'm not too sure the alternative is much better.

Confluent will tell you straight out it is not production ready. I know there are other common features not supported, but authorization for example is not supported in the current implementation. They told us 2 years before zook is dead dead.

I think there are others?

I use google pubsub as a stream - it's just me developing our tools. I find it very easy to use and just works.

But maybe that doesn't count as a 'stream' under that definition?

I use GAE to accept super fast, bursty webhook data (1000s/second) and pass it to pubsub, which triggers a cloud function to write to DBs. The cloud function retries if there is a write error or timeout or something.

It worked so well I've now just used this as a kind of micro-service for all DB writes I have to do. I'm now also splitting out other 'processing' services that don't need to respond with data to the request; for instance, a 'service' where we verify and format cell phone numbers with Twilio and then update that user profile.

For me one of the nicest things is that you can have both "traditional" messaging _and_ streaming in the same system. Feeding messages published to exchanges into both queues (for processing) and streams for archiving, auditing and analysis.

Best of both worlds :)
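
For anyone wondering what that looks like in practice, here is a minimal sketch, assuming the Python pika client and made-up names (`events`, `events.work`, `events.archive`). Per the RabbitMQ docs, a stream is declared like a durable queue with the `x-queue-type` argument set to `stream`; everything else here is illustrative and not taken from the comment above.

```python
def declare_topology(channel, exchange="events"):
    """Bind one fanout exchange to a work queue and to a stream.

    `channel` is assumed to be a pika BlockingChannel (or anything exposing
    the same methods). The exchange and queue names are hypothetical.
    """
    channel.exchange_declare(
        exchange=exchange, exchange_type="fanout", durable=True
    )

    # Regular queue: a message is removed once a worker acknowledges it.
    channel.queue_declare(queue="events.work", durable=True)

    # Stream: declared like a queue, selected via the x-queue-type argument.
    # Messages stay in the log and can be re-read later.
    channel.queue_declare(
        queue="events.archive",
        durable=True,
        arguments={"x-queue-type": "stream"},
    )

    # A fanout exchange copies every published message to all bound targets.
    for name in ("events.work", "events.archive"):
        channel.queue_bind(queue=name, exchange=exchange)
```

With a real broker you would pass something like `pika.BlockingConnection(pika.ConnectionParameters("localhost")).channel()` as `channel`.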

can you give some examples why i'd want a data stream for a 'small scale project' ?

Integrations and webhooks are very well suited for streams. Having your core product emit messages as events has many benefits:

- You can have a team working on the core product sending messages and another working on integrations triggering actions.

- If your integrations fail, your core product is not impacted. You can also replay old messages once you've fixed the stream consumer.

- Having streams allows you to do all kinds of experiments too: you can connect a new project to a stream and go through a week of data almost instantly and see the result... rinse and repeat as much as necessary

Event-based architectures, for example: https://martinfowler.com/eaaDev/EventSourcing.html For some types of applications moving to a model with a stream of events as the source of truth can solve a ton of hairy distributed computing problems.

How is that not possible in the exchanges + queues model that Rabbit has supported forever?

I was recently thinking about this: let’s say you built a Chrome extension and want to collect some basic usage analytics (with explicit user consent and knowledge, while preserving privacy) - you could batch-send activities at intervals to a REST-style API, but it would be nicer to handle as a stream (eg to respond in real-time somehow).

Something lightweight like MQTT is also well suited for this. It was originally designed for telemetry messages in IoT situations but it also supports websockets so it can be used in web applications or browser extensions.

> eg to respond in real-time somehow

Make the API call in real-time. No batching. Pipe the API request into your streaming service.

Where's the illustrated children's book? Rabbits vs otters. :)

Mh, this seems useful for our internal discussions of kafka vs rabbitmq, and might make some of the stuff we do with rabbitmq easier or more effective. Neat.

So, I'm taking bets on how many days it takes until some teams decide that any further development is impossible without this feature. Yes, it's been a long day.

Once you know that your ideal tool exists it’s super super shitty to implement half-baked workarounds that you know will get deleted in the next upgrade cycle.

I for one on the ops side will be pushing to upgrade ASAP because the sooner devs adopt this the sooner I can tear down Kafka and have one less thing to maintain.

> A RabbitMQ stream models an append-only log with non-destructive consuming semantics

> Streams in RabbitMQ are persisted and replicated. This translates to data safety and availability (in case of the loss of a node)

Let's talk about the elephant in the room: Do streams replicate the same way queues do? If you ever lost a node while you had a few GBs of data in queues you will know that bringing a node back will sync all data over the net while completely blocking ANY operation on the queue until this process finishes. Please don't recommend quorum queues, they lack important features as well.

This was (is) the case for classic queue mirroring. Quorum queues use a Raft implementation and can synchronise the delta. As of now, the two major features missing from quorum queues are message TTL and priority. The former will come soon. It is true that QQs have different runtime characteristics but they are much more stable in a clustered environment.

No, they do not replicate like classic mirrored queues do. They are much more similar to quorum queues in that they only (asynchronously) replicate the delta after a disconnection. After all, both streams and quorum queues use log replication. They are also both quorum systems in terms of availability.

W.r.t. the quorum queue feature set, we are working on message TTLs. Priorities we'll have to see. We want to provide something there but it may not be a priority queue as provided by classic queues, as this isn't the best way to do priority-based messaging.

Isn't the manual sync mode the solution for blocking for classic queues (assuming mirroring)?
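
To make "replicate the delta" concrete, here is a toy sketch. It is not RabbitMQ's implementation; `Replica` and `sync_delta` are hypothetical names, and it assumes the follower's log is a clean prefix of the leader's (real log replication also has to handle a follower whose tail diverged).

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical in-memory replica holding an append-only log."""
    log: list = field(default_factory=list)

    def last_offset(self) -> int:
        # Offset of the last entry, or -1 when the log is empty.
        return len(self.log) - 1


def sync_delta(leader: Replica, follower: Replica) -> int:
    """Copy only the entries the follower is missing; return how many were sent."""
    start = follower.last_offset() + 1
    missing = leader.log[start:]
    follower.log.extend(missing)
    return len(missing)


leader = Replica(log=["m0", "m1", "m2", "m3", "m4"])
follower = Replica(log=["m0", "m1", "m2"])  # was disconnected after m2

sent = sync_delta(leader, follower)  # only m3 and m4 cross the network
```

The contrast with a full resync is that the cost scales with what was missed during the outage, not with the total amount of data held.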

After a brief glimpse at the documentation, I am missing a feature comparable to Kafka's partitioning. Well, the protocol [1] briefly mentions it, but it does not seem to be exposed.

Getting insights into the roadmap for this would certainly be interesting...

[1]: https://github.com/rabbitmq/rabbitmq-server/blob/v3.9.x/deps...

Genuine question: what is it that you need from Kafka partitioning?

Not OP, but what I need from Kafka partitioning is guaranteed message ordering (per partition key).

I don't know about RabbitMQ, but with Apache AMQ there is message grouping, which is kind of similar, but not quite the same. With Kafka it's unavoidable, which is good.

Guaranteed ordering per partition while keeping the ability to scale across partitions.
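
A rough sketch of why keyed partitioning gives per-key ordering while still spreading load. Assumptions: CRC32 stands in for the real hash (Kafka's default partitioner uses murmur2), and the partition count and event data are made up.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 4  # illustrative value


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    # A stable hash means the same key always lands on the same partition.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# Each partition is its own append-only list, so order is kept within it.
partitions = defaultdict(list)

events = [
    ("user-1", "created"),
    ("user-2", "created"),
    ("user-1", "updated"),
    ("user-2", "deleted"),
    ("user-1", "deleted"),
]
for key, payload in events:
    partitions[partition_for(key)].append((key, payload))


def events_for(key: str) -> list:
    """Read one key's events back from the single partition that holds them."""
    return [p for k, p in partitions[partition_for(key)] if k == key]
```

Different keys may end up on different partitions and be consumed in parallel, but there is no ordering guarantee between them.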

Surprised at the focus on a proprietary, binary protocol as I'd always taken RabbitMQ to be "all in" on standards based protocols.

AFAIK there is no proper standard streaming protocol which is why we went with a dedicated protocol that works really well with our approach to streams.

Looks fascinating, kind of like Kafka. I'm sure someone will chime in with why Event Sourcing can't be done with append only logs like Kafka etc. Still don't know why.

Log retention is probably the main reason. Event sourcing typically would need some kind of snapshot to be calculated to replace the head of the log before it is deleted.

If you had unbounded storage you could perhaps.

Kafka's key based log compaction can be applied to this.
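
A simplified sketch of what key-based compaction does: keep only the latest record per key. The `compact` function and the account data are hypothetical, and tombstones (deletes) are left out. Note this only substitutes for a snapshot if each record carries the full current state for its key rather than a delta.

```python
def compact(log):
    """Keep only the most recent record per key, preserving log order."""
    # Later entries overwrite earlier ones, leaving each key's last index.
    last_index = {key: i for i, (key, _value) in enumerate(log)}
    return [(k, v) for i, (k, v) in enumerate(log) if last_index[k] == i]


log = [
    ("acct-1", 100),
    ("acct-2", 50),
    ("acct-1", 120),
    ("acct-1", 90),
    ("acct-2", 75),
]
compacted = compact(log)
```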

This seems comparable to redis streams. How is it different?

You are right that these are very comparable. A few years ago I was very experienced in both as I created Lightbus (lightbus.org).

From what I can still recall, AMQP (RabbitMQ) looked great to me until Redis Streams came along. RabbitMQ always seemed more heavyweight in various ways, whereas Redis Streams was very easy to pick up and get going with.

Redis Streams isn’t simple, but I always found RabbitMQ to be more complex and to have more gotchas.

TLDR: RabbitMQ 3.9 introduces a new data/queue type which is backwards compatible with existing AMQP 0.9.1-based clients but gives an enormous performance boost when used via the new custom streams protocol. Oh, and it's replicated/persistent too.

For example, on a 3-node test cluster (c2-standard-16) it achieved a publish rate of almost 5 million messages per second.

It’s not a queue but an append only log

Redis too has an append-only log, at least for backup and persistence. What’s the difference from RabbitMQ?

RabbitMQ already had a queue - that's its main feature. This adds something which is not a queue; it's an append-only log.

Queues let many writers put messages into a single topic. Then, many readers come to that topic, and pop messages off. Each message goes to just one reader. You could use this for background jobs. For example, if you were running YouTube, you might handle uploads this way: the uploaded videos are put into a queue, and workers process them to transcode them and make them playable on the website. You don't want the uploads to be processed more than once.

Append-only logs let many writers append messages onto a single topic. Then, many readers may replay the entire history of messages whenever they want. Each message may go to many readers, and may even go to the same reader multiple times if they want. You could use this to build a "message bus" where you want lots of things to happen after an action. For example, every time a user "likes" something on Facebook, maybe we want to notify the content-producer, notify their friends, and update some recommendation algorithms - three different things that we want to do each time, and we don't want errors in one to block the others.
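
The difference between the two models can be shown with a toy in-memory sketch. `WorkQueue` and `AppendOnlyLog` are hypothetical classes for illustration only; they say nothing about how RabbitMQ implements either.

```python
from collections import deque


class WorkQueue:
    """Each message is delivered to exactly one reader, then it is gone."""

    def __init__(self):
        self._items = deque()

    def publish(self, msg):
        self._items.append(msg)

    def pop(self):
        # Destructive read: returns None when nothing is left.
        return self._items.popleft() if self._items else None


class AppendOnlyLog:
    """Messages are never removed; each reader chooses its own offset."""

    def __init__(self):
        self._entries = []

    def append(self, msg):
        self._entries.append(msg)

    def read_from(self, offset=0):
        # Non-destructive read: returns a copy of everything from offset on.
        return list(self._entries[offset:])


queue = WorkQueue()
log = AppendOnlyLog()
for msg in ["a", "b", "c"]:
    queue.publish(msg)
    log.append(msg)

worker_1 = queue.pop()               # two workers split the messages
worker_2 = queue.pop()
notifier_view = log.read_from(0)     # every reader sees the full history
recommender_view = log.read_from(0)
replay_view = log.read_from(1)       # a reader can re-read from any offset
```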

There is already a mechanism suitable for your Facebook example use-case, namely publishing to an exchange which will send the message to multiple queues.

Right. The big feature that append only logs allow is replay. I have never really seen the point, though - and I say this as a big Kafka user!

It looks like RabbitMQ stream semantics match up better with Kafka. Redis’s streams are more lightweight.

I thought append only logs required crypto tokens to use nowadays?

I'm positive it's used in crypto, but there's nothing novel about append only logs. It just means you don't remove the old stuff.

To elaborate a bit, crypto is backed by tokens specifically to ensure integrity in a distributed network (on machines you don't own)

If you own the DB you don't need this.

Of course you can pay for each new log entry using crypto tokens, but you don't have to.

You can use an open source version of AOL (Append-Only Log) SW for free; alternatively you can use a managed cloud version, but there you will have to pay with something called USD, which is a social construct pegged 1:1 to crypto-tokens like Tether USDT, MakerDAO DAI, and other so-called "stablecoins" ;)

I guess I have to add /s explicitly nowadays!

Is RabbitMQ something the industry used to use before AWS/Azure queues?

I imagine it’s most commonly used by Python developers because it’s the broker of choice for Celery, a distributed task queue.

Your username checks out.

I've never heard of it myself. But ZeroMQ I've seen used for a few projects (anecdotally)

really!? rabbitmq is by far my favorite tech to use on a day-to-day basis.

it's absurdly reliable, it can scale super well because erlang is great (as long as you just give it enough hardware).

the only issue i had was clustering -- back in the day, it was a pain to cluster rabbitmq. nowadays it's not that hard and works really well!

I used to look after a bunch of Sensu clusters (a system that lets you define monitoring checks on clients and send keepalives/results/metrics to the server) and it used Redis to store state and RabbitMQ to handle messaging. The clients would put messages directly onto the server queues, and the server(s) would process the queues watching for missed keepalives, failures in check results etc. It was amazing until I decided to cluster it. It would work fine for weeks, even months.

Then, for reasons unknown (possibly bad hardware, noisy neighbor, reboot) one of the nodes would drop out, some fool would "fix it" without understanding how it worked and we'd get a split brain and end up with two Sensu queue systems, but only one is getting keepalives. So if there is a Sensu server instance pointing at the other one, it alerts on every single client (hundreds) because "their keepalives have expired!". That throws hundreds of additional messages per minute onto the alert queue in that part of the fracture. Since we were practicing DevOps, all teams were responsible for their own production assets, so this would end up with a P1 incident and 20 people on a call (maybe at 4am) thinking that there's a major platform outage.

And it's a nightmare to fix because you essentially have to ignore your alerting system (and it's swamped in bullshit messages anyway). I ended up writing a script that replaced the alert handlers with a dev null, nuked RabbitMQ, recreated the cluster and waited for the clients to find it (or run another script to bounce all of them). Once everything settles, enable the alert handlers.

I switched to Sensu's experimental Redis queue config and never had that issue again. Ended up running 3x small Sensu servers in each region, each running Sensu + Redis in HA mode. Bulletproof, if properly configured.

Maybe clustering has improved, but having to use it as part of a Sensu cluster put me off it.
