
Queues are Databases (1995) - anthony_barker
https://arxiv.org/abs/cs/0701158
======
jackvanlightly
One of the issues with modelling queue semantics over a database is
performance. All that locking, key lookups and mutating of B trees is
expensive.

The latest generation of durable messaging systems that offer queue semantics
do so by modelling those semantics over a distributed, replicated log, such as
Apache Pulsar and RabbitMQ's new replicated queue type called Quorum Queues.

A queue is different to a log in that reading from a queue is destructive, but
reading from a log is not. So if I have two applications (shipping and
auditing) that want a queue with all the shipping orders in, then each needs
their own queue - so they don't compete over the messages. Whereas a log can
be read by both, but both need to track their independent position in the log.

Apache Pulsar offers queue semantics to shipping and auditing by storing the
shipping orders in one distributed log (a topic) and creating two separate
subscriptions (also logs) that track the position (like Kafka consumer
offsets). The destructive read of a queue is simulated by advancing the cursor
(offset) of the subscription. The performance improvement this append-only log
data structure offers compared to a mutable B-tree of the RDBMS is massive.
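
A minimal sketch of that idea, not Pulsar's actual implementation: one append-only log plus one cursor per subscription. The `Topic` class and its method names are made up for illustration, and the cursor here advances on receive, whereas a real broker would normally advance it on acknowledgement.

```python
class Topic:
    """Append-only log; reads never mutate or remove stored messages."""

    def __init__(self):
        self._log = []
        self._cursors = {}  # subscription name -> index of next unread entry

    def publish(self, message):
        self._log.append(message)

    def subscribe(self, name):
        # In this sketch a new subscription starts at the beginning of the log.
        self._cursors.setdefault(name, 0)

    def receive(self, name):
        """Return the next message for a subscription, or None if caught up."""
        pos = self._cursors[name]
        if pos >= len(self._log):
            return None
        # The "destructive read" is only a cursor bump; the log is untouched.
        self._cursors[name] = pos + 1
        return self._log[pos]


topic = Topic()
topic.subscribe("shipping")
topic.subscribe("auditing")
for order in ["order-1", "order-2"]:
    topic.publish(order)

shipping_seen = [topic.receive("shipping"), topic.receive("shipping")]
auditing_seen = [topic.receive("auditing")]
```

Shipping has consumed both orders while auditing is still one behind, and neither read removed anything from the shared log.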

Quorum queues do it a different way, but still modelling queue semantics over
a log.

Of course, some future RDBMS storage backend wouldn't have to use B-trees,
read_past locking, etc.; it could use a log-based data structure for
message storage too.

~~~
coolgeek
Not all databases are relational. Not all relational databases are built on
b-trees.

From the summary:

> Many people are building queue managers on file systems as a transactional
> resource manager and a TP-lite monitor. An alternative approach is to evolve
> an Object-Relational database system to support the mechanisms needed to
> build a queuing system

The key word here is evolve.

The point is to think of a queuing system in the database in the conceptual
sense, rather than in the sense of a specific implementation.

------
worewood
Websphere MQ has had DB2 as a dependency for ages. Not by accident.

OTOH as a sysadmin I hate when developers use queueing systems as databases.
It's like keeping full courier vans circling around the city instead of having
a warehouse. Messaging systems are pipes, not water towers. It makes
monitoring difficult, makes browsing and modifying data difficult, and
administration difficult. They may be sides of the same coin, but that doesn't
mean you have to use only a side of it.

~~~
montroser
In the 90s, I worked in a little electronics shop where the owner had a
"creative" scheme going. He would order all sorts of stock (cameras,
computers, radios, etc) via cash-on-delivery (COD), but then he had a deal
with the UPS man.

The UPS man would just keep all the goods in his truck for weeks and weeks.
Then whenever we had a customer at the store who wanted to buy a given model
of camera or whatever it was, I would be "dispatched to the stock room", aka,
I would go out the back with cash in hand, get a money order from the post
office next door, find the UPS truck (his turf was just a few city blocks),
and then triumphantly come back with the item for the customer.

It usually worked well enough, but occasionally failed in unpredictable ways
and sometimes spectacular ways. Like using a queue as a db -- that's not
really how any of that was meant to work...

------
dsjoerg
This is like saying "Prisons are Buildings". Yes they are. So what?

EDIT: Never mind, this paper is brilliant. EDIT2: Here's the point for my own
future reference: to make a good Queue, you need a pile of parts. A reasonable
construction plan for those parts is to assemble them into a Database, and
throw a Queue thingy on top. This is the best way to build a Queue, because
then you can use the good Database for other things too. This sort of happened
and was called Redis.

------
sixdimensional
And we are still having this discussion in 2020 with Kafka [1].

A DBMS is one thing, but data platform components are something else.

And I think this is good, actually - IMHO the ability to use unbundled
components of a "database" for different purposes has been huge. A distributed
processing query engine - Apache Spark. A distributed "transaction log" that
acts like a queue and can handle real time streaming and permanent storage -
Kafka. Distributed file storage - HDFS. Efficient storage and open file
formats for compressed data - Parquet for columnar data, Snappy compression,
ORC, etc.

As a result, we have tools that represent the unbundled components, and we
still have traditional monolithic DBMS. And we still have queues that are
transient (ex. nanomsg or ZMQ). They can and do coexist and this is good. We
have lots of tools in the chest for different jobs, it's great!

[1] [https://dzone.com/articles/is-apache-kafka-a-database-the-2020-update](https://dzone.com/articles/is-apache-kafka-a-database-the-2020-update)

~~~
taywrobel
If in 2020 you still choose Kafka as your messaging infrastructure, you are
well behind the times.

I get it, I really do. Your manager has heard of Kafka, as has your PM and
CTO. Nobody has ever been fired for buying Kafka. It’s the new IBM.

“Buying Kafka? It’s open source, so it’s free!” you say, my naïve friend who
hasn’t heard of Confluent.

Yes, Kafka is free until you go to production and need things like mirror
maker, for multi-site replication. Then it’s time to pay up to the company
that has taken over and monetized the project.

But it’s a pain to run, a pain to debug, uses avro as a binary schema, because
everyone uses avro, right? And partitions are great! Until you need to change
them and then you’re in for a world of strange and potentially unnoticed bugs
from a discontinuity in partitioning as the topic grows.
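
To make the partitioning discontinuity concrete, here is a rough sketch assuming the common "hash of key modulo partition count" scheme. `partition_for` is a hypothetical helper, and CRC32 stands in for whatever hash a real client uses; the point is only that changing the partition count changes where existing keys land.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    # CRC32 is used only because it is stable across runs; it is a stand-in,
    # not the hash any particular messaging client actually uses.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


keys = [f"customer-{i}" for i in range(1000)]

# Keys whose partition changes when the topic grows from 4 to 6 partitions.
moved = [k for k in keys if partition_for(k, 4) != partition_for(k, 6)]
print(f"{len(moved)} of {len(keys)} keys now map to a different partition")
```

Any consumer relying on per-key ordering within a partition sees older messages for a moved key in one partition and newer ones in another.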

Or... you could have Pulsar. Dynamic partitioning, more expressive
subscription models, multi-active as part of the core product. No BS marketing
claiming “exactly once delivery semantics”, aka more than once with receiver
side deduplication, aka what TCP does and has always done.
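
For anyone unfamiliar with the pattern, "more than once with receiver side deduplication" can be sketched like this. `DedupingReceiver` is a made-up name, and the sketch assumes every message carries a unique ID and that the set of seen IDs fits in memory.

```python
class DedupingReceiver:
    """Applies each message ID at most once, even if it is delivered repeatedly."""

    def __init__(self):
        self._seen_ids = set()
        self.processed = []

    def handle(self, message_id, payload):
        """Return True if the message was processed, False if it was a duplicate."""
        if message_id in self._seen_ids:
            return False
        self._seen_ids.add(message_id)
        self.processed.append(payload)
        return True


receiver = DedupingReceiver()
# At-least-once delivery: messages 1 and 2 are redelivered after a retry.
deliveries = [(1, "a"), (2, "b"), (1, "a"), (3, "c"), (2, "b")]
results = [receiver.handle(mid, body) for mid, body in deliveries]
```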

There is no reason to be building something new on top of Kafka in 2020.

~~~
ramraj07
Wow, now pulsar as well. As someone who isn't full-time trying to keep up with
this tsunami of names it's just impossible to keep up. Apache (or someone,
anyone please) needs to make a matrix of all their own competing technologies
and what the actual differences are between them. It's just impossible!

~~~
taywrobel
I feel ya. Software is a world that’s constantly evolving.

Apache is great at software engineering, but sorely lacking in product design.
Because open source software is almost definitionally not a product, but a
tool.

With that comes increased bifurcation of the tooling when different
requirements arise, and increased complexity with running it. Kafka and pulsar
both have zookeeper as an external dependency, for instance. Pulsar has an
extra dependency even in bookkeeper, one of the few things I’ll readily fault
it for. It’s a stark contrast to openly commercial products like CockroachDB,
which has a single static binary, with symmetric nodes, and built in
management UI. It’s a product, not a tool.

FWIW, Apache has a project directory -
[https://projects.apache.org/projects.html](https://projects.apache.org/projects.html)

But it’s a far cry from a comparison matrix as you (and I think many other
confused and disheartened engineers) desire.

~~~
jlokier
It's not due to bifurcation.

It's because Apache "adopts" products that are created independently by other
companies who then want to open source their product, and leave it to someone
else to look after.

Kafka was created at LinkedIn and eventually donated to the Apache Foundation.

Pulsar was created at Yahoo and eventually donated to the Apache Foundation.

------
oomkiller
I really look forward to seeing what can be done with Postgres's pluggable
storage backends that were recently added. It seems that some of the issues
with treating a table as a queue could be mitigated with special storage
backends designed for such a job.

~~~
anarazel
FWIW, you already can use postgres' logical decoding / change data capture to
make queuing more efficient. Depending on what you need.

If it's the type of queue that various consumers need to see in their
entirety, then you can just use pg_logical_emit_message(transactional bool,
prefix text, payload data/bytea) to emit messages, which logical decoding
consumers then see either in time order (transactional = false) or in commit
order (transactional = true).
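
The difference between the two orderings can be illustrated with a toy simulation. This is plain Python, not a Postgres client, and `DecodedStream` is a hypothetical stand-in for what a logical decoding consumer would observe: non-transactional messages show up as soon as they are emitted, transactional ones only when their transaction commits.

```python
class DecodedStream:
    """Toy model of message visibility for a logical decoding consumer."""

    def __init__(self):
        self.visible = []    # messages in the order a consumer would see them
        self._pending = {}   # transaction id -> messages held until commit

    def emit(self, txn, transactional, payload):
        if transactional:
            self._pending.setdefault(txn, []).append(payload)
        else:
            self.visible.append(payload)  # seen immediately, in emit order

    def commit(self, txn):
        self.visible.extend(self._pending.pop(txn, []))

    def rollback(self, txn):
        self._pending.pop(txn, None)  # transactional messages are discarded


stream = DecodedStream()
stream.emit("t1", True, "t1-a")
stream.emit("t2", True, "t2-a")
stream.emit("t1", False, "t1-now")
stream.emit("t3", True, "t3-a")
stream.commit("t2")
stream.rollback("t3")
stream.commit("t1")
```

Here `t1-now` is visible first even though it was emitted third, and the messages from `t2` appear before those from `t1` because `t2` committed first.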

If it's more the job type of queue where exactly one subscriber is allowed
to see a message it's a bit more complicated, but using logical decoding will
probably still be more efficient than querying a queue table with ORDER BY
etc.

Being able to do queue submissions as part of a transaction (i.e. queue
submissions will only be visible after commit) can be really useful to
integrate with external systems.

~~~
xyzzy_plugh
You could use logical replication for queuing but there are a lot of footguns.
It's far from a general purpose queue. For a handful of consumers, fine, but
you'll have trouble scaling this to hundreds or thousands of consumers, which
other queues solve for handily.

------
cbsmith
For the modern 2019 take:
[https://youtu.be/05mVvkp6f2M](https://youtu.be/05mVvkp6f2M)

~~~
oconnor663
This sent me down a very informative rabbit hole, thank you.

~~~
cbsmith
You are most welcome.

------
nikhilsimha
Recent work from Berkeley on AnnaDB and Martin Kleppmann’s OLEP stuff is a
complete opposite of this idea. Something along the lines of - databases are
just a queue topology with synchronization.

Queues do definitely seem like a more fundamental primitive.

~~~
nikhilsimha
Just to clarify, slapping a queue on a database is still a bad idea IMO

------
Animats
I've done queuing using in-memory tables in MySQL. This allowed fair queuing
to prevent one user hogging the system.
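
One way fair queuing can work, sketched in memory rather than with MySQL tables (the setup described above isn't shown, so this is only an assumption about the general technique): keep a separate FIFO per user and serve users round-robin. `FairQueue` is a made-up name.

```python
from collections import OrderedDict, deque


class FairQueue:
    """Round-robin across users so one busy user cannot starve the others."""

    def __init__(self):
        self._per_user = OrderedDict()  # user -> deque of that user's pending jobs

    def put(self, user, job):
        self._per_user.setdefault(user, deque()).append(job)

    def get(self):
        """Return (user, job) for the next user in turn, or None if empty."""
        if not self._per_user:
            return None
        user, jobs = next(iter(self._per_user.items()))
        job = jobs.popleft()
        del self._per_user[user]
        if jobs:
            self._per_user[user] = jobs  # user goes to the back of the rotation
        return user, job


q = FairQueue()
for i in range(3):
    q.put("heavy", f"h{i}")
q.put("light", "l0")

order = [q.get() for _ in range(4)]
```

The single job from `light` is served second instead of waiting behind all three jobs from `heavy`.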

------
Hernanpm
Reminds me of: Martin Kleppmann | Kafka Summit London 2019 Keynote | Is Kafka
a Database?
[https://www.youtube.com/watch?v=BuE6JvQE_CY](https://www.youtube.com/watch?v=BuE6JvQE_CY)

------
tanilama
I mean, queues are basically sequences with IDs that can be seen as
monotonically increasing...

So they can be a database, but why is this claim surprising or insightful in
any way?

------
dsimms
I feel like [https://aphyr.com/tags/jepsen](https://aphyr.com/tags/jepsen)
really drives this home.

