
Liftbridge: Lightweight, fault-tolerant message streams - tepidandroid
https://github.com/liftbridge-io/liftbridge
======
altmind
If anyone wants to hear about some experience using NATS in production (this
is based on NATS):

* standalone server is mature and very stable, crunching messages with incredible speed

* server side error handling is ok, no problems here

* very simple, text-based protocol, even simpler than STOMP

* c, rust, java, python client libraries are weak; the c library even contains some state transition errors where the library can lock up on reconnect ( [https://github.com/nats-io/nats.c/issues/217](https://github.com/nats-io/nats.c/issues/217) ). the (unofficial) rust library is unusable. some network errors will not make nats.c reconnect, which can be a surprise in prod.

* arbitrary hardcoded limit on the message size - 1M

~~~
KaiserPro
We've been using NATS in prod for a very specific use case: request reply.

We'd got tired of trying to hack around SQS. After a boatload of testing, we
settled on NATS. It was our big "innovation token" for that project. It has
paid off wonderfully.

The server clusters very well, and is very simple to peer. We've joined two
clusters across two regions, and we've not had any issues.

If you need a fast, low-latency, non-persistent, non-guaranteed message
delivery system, NATS is for you.

If you like to store messages in your queues for later consumption, or need
_every_ message, NATS isn't for you.

~~~
krz
What problems did you have with SQS?

~~~
KaiserPro
Nothing really. We were asking it to do something it's not really supposed to do.

Basically we needed a request reply semantic. (forgive the explanation) It's
where we have an instance of the front end sending a request to any of the
back end servers. A single backend replies directly to the same front end
instance.

Basically we needed something semantically similar to a load balancer, but in
messaging form.

SQS is only a 1-n Queue system. So we can deliver a message to a single
backend in a cluster, but we couldn't get it back to the producer without
making a message routing system. We used DynamoDB and repeated polling, but
that wasn't cheap to scale.

We might have been able to use SNS as a broadcast bus of some sort, but that
seemed like a backwards step.

We still use SQS for virtually everything else. NATS is kept for that one use
case.
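
A minimal sketch of the request-reply pattern described above, using a toy in-memory broker rather than the real NATS client API. The `Broker` class, the `render` subject, and the backend names are all hypothetical. The idea it illustrates: the requester subscribes to a unique reply subject, exactly one backend in the group receives the request, and that backend publishes its answer straight back to the requester's reply subject.

```python
import uuid
from collections import defaultdict


class Broker:
    """Toy in-memory stand-in for a pub/sub server (hypothetical, not the NATS API)."""

    def __init__(self):
        self._subs = defaultdict(list)   # subject -> callbacks; every one gets the message
        self._queue = defaultdict(list)  # subject -> callbacks; exactly one gets the message
        self._next = defaultdict(int)    # subject -> round-robin cursor

    def subscribe(self, subject, callback):
        self._subs[subject].append(callback)

    def queue_subscribe(self, subject, callback):
        self._queue[subject].append(callback)

    def unsubscribe(self, subject):
        self._subs.pop(subject, None)

    def publish(self, subject, data, reply_to=None):
        for callback in list(self._subs[subject]):
            callback(data, reply_to)
        workers = self._queue[subject]
        if workers:
            # Load-balance across the group, like a load balancer in messaging form.
            callback = workers[self._next[subject] % len(workers)]
            self._next[subject] += 1
            callback(data, reply_to)

    def request(self, subject, data):
        # A unique reply subject routes the answer back to this requester only.
        inbox = "_INBOX." + uuid.uuid4().hex
        replies = []
        self.subscribe(inbox, lambda reply, _reply_to: replies.append(reply))
        self.publish(subject, data, reply_to=inbox)
        self.unsubscribe(inbox)
        return replies[0] if replies else None


broker = Broker()


def make_backend(name):
    def handle(data, reply_to):
        broker.publish(reply_to, f"{name} handled {data}")
    return handle


broker.queue_subscribe("render", make_backend("gpu-1"))
broker.queue_subscribe("render", make_backend("gpu-2"))

first = broker.request("render", "tile 7")
second = broker.request("render", "tile 8")
```

Everything here runs synchronously in one process; a real broker delivers over the network, and the requester would wait on the reply subject with a timeout.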

~~~
iddqd
> Basically we needed something semantically similar to a load balancer, but in
> messaging form.

Why not stick with a load balancer? I've been struggling to find a good use
case for request-response over message queues so I'm curious.

~~~
KaiserPro
Let me try and draw a diagram:

Api-gateway -> Lambda -> NATS -> docker backend on GPU

The backend is C++ and needs to be shielded as much as possible from the web.
It also takes a fair wedge of time to deal with requests (1-5 seconds). The
rationale is that the Lambda front end validates and routes, and the backend replies.

Before we moved to NATS, we had the added bonus that using SQS meant there
was no possible link between the front end and backend apart from SQS (both
Lambda and SQS are outside of the VPC by default).

The backend has a strict schema, and to keep moving parts down it seemed wise
to avoid having to plumb in a webserver as well. They really are badly suited
to this sort of thing. They also add a boatload of extra latency.

We have a large number of machines, all of which deal with certain areas. This
allows us to intelligently cache hotter areas closer to clients.

Request reply is a reasonable pattern for interacting with backends that are
constantly changing capacity based on demand. It can also, if done correctly,
be a way to optimise for latency over capacity (depending on how you do it).

With NATS you can ask for more than one response, which might also be useful.
Some other systems use a "fastest reply wins" which guarantees a fast response
at the expense of capacity.
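
A rough sketch of the "fastest reply wins" idea, using plain threads instead of a message broker. The `fastest_reply` helper and the backend names and delays are made up for illustration. The same request goes to every backend, the first answer is returned, and the slower backends still do the work, which is where the capacity cost comes from.

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def fastest_reply(backends, request):
    """Send the same request to every backend and return the first reply."""
    pool = ThreadPoolExecutor(max_workers=len(backends))
    futures = [pool.submit(backend, request) for backend in backends]
    done, _pending = wait(futures, return_when=FIRST_COMPLETED)
    # Do not block on the slower backends; their work is simply wasted.
    pool.shutdown(wait=False)
    return next(iter(done)).result()


def make_backend(name, delay_seconds):
    def handle(request):
        time.sleep(delay_seconds)  # stands in for real processing time
        return f"{name}: {request}"
    return handle


winner = fastest_reply(
    [make_backend("slow", 0.3), make_backend("fast", 0.01)],
    "tile 7",
)
```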

------
tepidandroid
It looks like Liftbridge occupies a nice little niche between full-blown,
durable Kafka and lightweight, ephemeral NATS.

Additional reading:
[https://bravenewgeek.com/introducing-liftbridge-lightweight-...](https://bravenewgeek.com/introducing-liftbridge-lightweight-fault-tolerant-message-streams/)

------
gtirloni
While searching for comparable solutions to understand what this is about, I
found this website which has a very good summary about queue systems and the
like: [http://queues.io](http://queues.io)

------
gregwebs
This is based on NATS. NATS just had their 2.0 release:
[https://nats-io.github.io/docs/whats_new/whats_new_20.html](https://nats-io.github.io/docs/whats_new/whats_new_20.html)

------
tylertreat
Author here. Happy to answer any questions.

~~~
softwarelimits
Hi, what are the message delivery guarantees with Liftbridge? I would like to
replace a PostgreSQL based hand made queue system, but I need the
transactional safety of a database - should I use Liftbridge?

Thank you very much for your attention!

~~~
tylertreat
It has at-least-once delivery guarantees currently. I have plans for some
lightweight transactional-style publishing with an idempotent producer though.
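
To illustrate why at-least-once delivery and an idempotent producer go together, here is a generic sketch of sequence-number deduplication. It is not Liftbridge's design (the thread does not describe one); `DedupLog`, the producer IDs, and the sequence numbering are all assumptions. A producer that retries after a lost ack resends the same sequence number, and the log drops the duplicate.

```python
class DedupLog:
    """Append-only log that ignores retried publishes (hypothetical sketch)."""

    def __init__(self):
        self.messages = []
        self._last_seq = {}  # producer id -> highest sequence number accepted

    def publish(self, producer_id, seq, payload):
        # A retry carries a sequence number we have already accepted.
        if seq <= self._last_seq.get(producer_id, -1):
            return False
        self._last_seq[producer_id] = seq
        self.messages.append(payload)
        return True


log = DedupLog()
accepted = [
    log.publish("p1", 0, "a"),
    log.publish("p1", 0, "a"),  # retry after a lost ack: dropped
    log.publish("p1", 1, "b"),
    log.publish("p2", 0, "c"),  # sequence numbers are tracked per producer
]
```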

------
canadaduane
I believe NATS streaming server also fills this niche:
[https://github.com/nats-io/nats-streaming-server](https://github.com/nats-io/nats-streaming-server)

Can anyone speak to the current state of each project, pros & cons?

~~~
manigandham
NATS is a pub/sub server written in Go. It's designed for simplicity and
performance with features like request/reply and 1-hop fanout but no
persistence.

NATS Streaming (aka STAN) was an answer to persistence by basically building
acks and storage on top of the base NATS protocol, resulting in another
protocol on top, with messages stored as files or backed by a database. STAN
has limitations on topic/subs and is designed as single master with multiple
failovers but works well if you need a simpler self-contained alternative to
Kafka. Later a Raft-based clustering option was added, but it isn't a great
architecture and has a major performance impact. The docs recommend you use
the failover model instead.

Liftbridge offers a better version that uses the standard NATS protocol and
basically functions as invisible subscribers to the topics and just logs all
the messages using multiple Raft groups, which lets you scale out by using
more topics. NATS itself has recently been revamped with a 2.0 release that adds
some major new features and they're working on a better answer to the
persisted/distributed log offering.

If you need fast pub/sub for ephemeral messaging then NATS is great. If you
need persistence and a single server provides enough throughput then use NATS
Streaming (with failover if you need it). If you need more persistence
throughput then use Liftbridge. Although once you start getting to these
scales I would recommend looking at Kafka (which also has gotten much better
with v2.0+) or something like Apache Pulsar which is more advanced and
scalable than both. There are also commercial options like Solace with a free
tier that are worth checking out.

~~~
tylertreat
Just to clarify, Liftbridge relies on a single Raft group used purely for
metadata replication, i.e. control plane, not data plane. Streams themselves
are replicated using a protocol very similar to Kafka (ISR-based with
followers fetching messages from the leader's log).
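
A simplified sketch of ISR-style replication as described above: followers pull from the leader's log, and a message counts as committed once every replica in the in-sync set has it. This is a generic illustration of the Kafka-like approach, not Liftbridge's actual code; the `Leader` and `Replica` classes and their method names are hypothetical.

```python
class Leader:
    def __init__(self, follower_names):
        self.log = []
        # In-sync replica set: follower name -> log end offset it has reached.
        self.isr = {name: 0 for name in follower_names}
        self.high_watermark = 0  # offsets below this are committed

    def append(self, message):
        self.log.append(message)
        self._advance()
        return len(self.log) - 1

    def record_fetch(self, name, log_end_offset):
        if name in self.isr:
            self.isr[name] = log_end_offset
            self._advance()

    def drop_from_isr(self, name):
        # A follower that falls too far behind stops gating commits.
        self.isr.pop(name, None)
        self._advance()

    def _advance(self):
        # Committed means every in-sync replica (leader included) has the message.
        self.high_watermark = min([len(self.log), *self.isr.values()])

    def is_committed(self, offset):
        return offset < self.high_watermark


class Replica:
    """A follower that copies the log by fetching from the leader."""

    def __init__(self, name):
        self.name = name
        self.log = []

    def fetch_from(self, leader):
        # Pull everything after our current log end offset, then report progress.
        self.log.extend(leader.log[len(self.log):])
        leader.record_fetch(self.name, len(self.log))


leader = Leader(["b", "c"])
b, c = Replica("b"), Replica("c")

first = leader.append("m0")
before = leader.is_committed(first)
b.fetch_from(leader)
after_one = leader.is_committed(first)    # "c" is still in the ISR and behind
c.fetch_from(leader)
after_all = leader.is_committed(first)

second = leader.append("m1")
b.fetch_from(leader)
leader.drop_from_isr("c")                 # "c" lags, so it leaves the ISR
after_shrink = leader.is_committed(second)
```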

------
polskibus
Say I need to add messaging with persistence (at least once delivery) for my
stack (multiple services, often but not always on the same machine), and I
want the stack to not have external dependencies, so it's easy to install. Does
Liftbridge fill this niche? Would NATS streaming be a better solution or maybe
Akka with Persistence module?

~~~
paddybyers
I can't speak for NATS streaming or Akka but it does look like Liftbridge can
fit that niche very well, depending on what features and semantics you need.
Your services that publish can be assured that a message has been persisted to
a quorum of replicas (strictly, the in-sync replicas) before receiving an
acknowledgement, so you are able to assure that all onward processing to your
downstream services occurs at least once for all acknowledged messages.

If you require fanout to your stream consumers then this is supported simply
by having multiple consumers subscribe to each stream. If you require a
competing consumers model then this isn't directly supported, but you can have
load-balance groups which you could use to achieve something similar, but with
a static configuration - that is, your consumer population can't resize
dynamically, but you can distribute messages to a fixed set of streams, each
of which has a consumer.
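
One way to picture the static arrangement described above: hash each message key onto a fixed list of streams, with one consumer per stream. This is only a sketch; the stream names and the `stream_for` helper are made up, and it says nothing about how Liftbridge's load-balance groups are configured.

```python
import zlib

# Hypothetical fixed set of streams, each with exactly one consumer.
STREAMS = ["orders.0", "orders.1", "orders.2"]


def stream_for(key, streams=STREAMS):
    """Pick a stream deterministically so the same key always lands together."""
    return streams[zlib.crc32(key.encode("utf-8")) % len(streams)]


assignments = {
    key: stream_for(key)
    for key in ("user-1", "user-2", "user-3", "user-1")
}
```

Because the mapping depends on `len(streams)`, adding or removing a stream reshuffles keys, which is the "can't resize dynamically" limitation in practice.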

Liftbridge doesn't yet support durable consumer-offset checkpointing, although
this is advertised as being on the roadmap. In the meantime, if you wanted the
consumer offset to survive consumer restarts, then the consumer itself would
need to take care of persisting its current offset.
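
A minimal sketch of a consumer persisting its own offset so it survives restarts. The stream is faked as a Python list and `OffsetStore` and `consume` are hypothetical names; a real consumer would subscribe starting at the loaded offset. The offset is saved only after the handler runs, so a crash in between causes a redelivery rather than a lost message (at-least-once).

```python
import json
import os
import tempfile


class OffsetStore:
    """Keeps the next offset to read in a small JSON file (hypothetical helper)."""

    def __init__(self, path):
        self.path = path

    def load(self):
        try:
            with open(self.path) as f:
                return json.load(f)["next_offset"]
        except FileNotFoundError:
            return 0  # first run: start from the beginning

    def save(self, next_offset):
        # Write to a temp file and rename, so a crash never leaves a torn file.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"next_offset": next_offset}, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.path)


def consume(stream, store, handle, limit=None):
    offset = store.load()
    handled = 0
    while offset < len(stream) and (limit is None or handled < limit):
        handle(stream[offset])  # process first ...
        offset += 1
        store.save(offset)      # ... then checkpoint
        handled += 1
    return offset


stream = ["m0", "m1", "m2", "m3"]
seen = []
with tempfile.TemporaryDirectory() as tmpdir:
    store = OffsetStore(os.path.join(tmpdir, "offset.json"))
    consume(stream, store, seen.append, limit=2)   # consumer stops after two messages
    restarted = OffsetStore(store.path)            # simulated restart
    final_offset = consume(stream, restarted, seen.append)
```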

~~~
tylertreat
This is correct. Also, I'm planning to introduce a mode that runs NATS
"embedded" within the Liftbridge server, such that NATS basically becomes an
implementation detail of Liftbridge.

Re: consumer-offset checkpointing, this will be implemented in the form of
consumer groups, similar to how Kafka works where a consumer group checkpoints
its position in the log by writing it to the log itself.
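
A toy illustration of checkpointing a consumer group's position by appending it to a log, in the spirit of what Kafka does with its offsets topic. This is not the planned Liftbridge implementation; `OffsetsLog`, `poll`, and the group and stream names are assumptions. The position is recovered by replaying the offsets log and keeping the latest record for the group.

```python
class OffsetsLog:
    """Append-only log of (group, stream, next_offset) records (hypothetical)."""

    def __init__(self):
        self.records = []

    def commit(self, group, stream, next_offset):
        self.records.append((group, stream, next_offset))

    def position(self, group, stream):
        # Replay the log; the last record for this group and stream wins.
        pos = 0
        for g, s, next_offset in self.records:
            if (g, s) == (group, stream):
                pos = next_offset
        return pos


def poll(stream, offsets, group, stream_name, max_messages):
    start = offsets.position(group, stream_name)
    batch = stream[start:start + max_messages]
    # ... process the batch here, then checkpoint the new position.
    offsets.commit(group, stream_name, start + len(batch))
    return batch


stream = ["m0", "m1", "m2", "m3", "m4"]
offsets = OffsetsLog()

first_batch = poll(stream, offsets, "billing", "orders", 2)
second_batch = poll(stream, offsets, "billing", "orders", 2)  # any group member
other_group = poll(stream, offsets, "audit", "orders", 3)     # independent position
```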

