Hacker News new | past | comments | ask | show | jobs | submit login
When an SQL Database Makes a Great Pub/Sub (threedots.tech)
188 points by m110 38 days ago | hide | past | web | favorite | 67 comments

There’s a perspective that the transaction log of a typical RDBMS is the canonical form and the rows & tables merely the event-sourced projection. After all, if you replay the former, you should always get exactly the same in the latter.

It’s curious that over those projections, we then build event stores for CQRS/ES systems with their own projections mediated by application code.

Let’s also mention the journaled filesystem on which the database logs reside. And the log structure that your SSD is using internally to balance writes.

It’s been a long time since we wrote an application event stream linearly straight to media, and although I appreciate the separate concerns that each of these layers addresses, I’d probably struggle to justify them from first principles to even a slightly more Socratic version of myself.

It is curious indeed. The first time I noticed this curiosity is when I saw Martin Kleppman’s Turning the Database Inside Out [1]. It’s a great watch (or read) and I really recommend it!

[1] https://www.confluent.io/blog/turning-the-database-inside-ou...

Maybe all this results in a really durable and foolproof system. I don't see how this is a bad thing. It looks like defense in depth against errors and corruption.

Also, to my knowledge, the logs in a DB are not kept forever. Instead they are trimmed as soon as reasonable. It starts to smell a little bit like a https://en.wikipedia.org/wiki/Log-structured_merge-tree

It also smells like an example of the https://en.wikipedia.org/wiki/Inner-platform_effect

Indeed, I've also puzzled over this a fair bit. Its almost as if we are lacking a lower-level interface to the transaction log that enables one to push events without going via the higher-level representation of tables etc. However, the implementation details are somewhat beyond me :-) Postgres' Bottled Water [1] was what made me think about this. I mean, why bother with exporting to Kafka at all, and instead just use the Postgres transaction log directly?

[1] https://github.com/confluentinc/bottledwater-pg

Just a build: Bottled Water was the original concept, but Debezium[1] actually provides a production-ready product.

I would update all references of the former to the latter.

[1] https://debezium.io/

It's much easier to scale kafka than a relational db. There's also an advantage to offloading the read load to another system instead of hitting the db for that, too.

Before I proceed, let me just state that I have no particular knowledge of either RDBMS' internals or Kafka, so this is nothing but my own amateur musings on the subject.

Now, I wholeheartedly agree that Kafka is more scalable, but I think the key point here is that there is no particular law of nature as to why that is the case. It may just be an historical accident of how RDBMS - and PosgreSQL in particular - have evolved. Further: many of the properties of Kafka are in fact also desirable properties for the PostgreSQL transaction log.

My take on the inopinatus observation, together with the Samza article [1] mentioned on this discussion, are as follows. You can think of Postgres as two "products" (bounded contexts, if you like):

- a stream-based, possibly replicated, transaction log;

- a projection of that transaction log into relational calculus, plus all of the associated machinery.

Thus far we never had the need to think of these as clearly separate "products", but Kafka makes it obvious that they are. In truth, the amount of tools processing WAL outside of PostgreSQL were already hinting in this direction; Kafka just made it obvious.

From this perspective, it seems a tad expensive to take the original transaction log, convert it to a RDBMS representation, then convert it to events and, in some cases, then store it as an event stream in Kafka. It would be much more efficient to simply use the original transaction log directly - and this is why, to me, even Debezium [2] / Bottled Water [3] appear to be one layer too many. To the best of my understanding, this line of reasoning is also line with the observations in the Samza article [1]. Where I believe I differ from the article is in thinking that the RDBMS representation also adds a lot of value to applications - I see both having a role (e.g. streaming vs batch processing sort of thing). I think this would derail the present discussion too much, so I won't go in to it.

In conclusion: to the untrained eye, it seems that the right thing to do is to extract the transaction log out of PostgreSQL and make it as scalable as Kafka. Then, allow for it to log "things" which are not necessarily "projectable" into the relational plane. PostgreSQL then becomes just a client of the transaction log, together with other "kinds" of clients. I suspect that this is what will ultimately happen, but the engineering work required will probably span a decade or more.

My 2 Angolan Kwanzas, at any rate.

[1] https://www.confluent.io/blog/turning-the-database-inside-ou...

[2] https://debezium.io/

[3] https://github.com/confluentinc/bottledwater-pg

Kafka can scale and distribute individual streams across the cluster. That is, the entire "database" is distributed across nodes. With postgres, your unit of scalability is the entire database. You can't natively have some tables on one node and some tables on another node, for example.

Kafka is also just more optimized for what it does. Postgres is a superset of what kafka does, so kafka is unsurprisingly better able to optimize for its usecase. It has a zero-copy protocol that can shuttle data to/from disk to/from the network without bringing it into memory (using the sendfile syscall). It doesn't wait for disk writes when doing writes, because it achieves durability via replication.

Also, don't forget the things you'd have to do when implementing consumers. How will you load balance streams between consumers? Meaning, if you have a stream and you want multiple consumers to burn it down at a time, how can you make sure they aren't duplicating work and they can handle the consumer group growing/shrinking? How will you handle checkpointing where each consumer tracks what they've done so far? What about streams' data rolling off?

All of this is doable with pg, but you'd have to implement it yourself. With kafka and its client drivers, this is handled for you.

As I said, I haven't given a lot of thought about this so please take my opinion with a grain of salt - but I believe that once you split the log out of PostgreSQL, a lot of functionality of this ilk could start to be considered. When/if added, I think it would make for a stronger PostgreSQL in the end. However, I do understand this is an insanely hard amount of work. In a way, it bears some similarities to splitting GTK out of GIMP, for instance; extremely difficult thing to do but ultimately it turned out to be a massive win for both projects. This would be even harder, but ultimately, greatly advantageous.

Musing similarly... Or just replace PostgreSQL's transaction log with Kafka, given all the Kafka's advantages. Seems to me that traditional RDBMSs cram too many seemingly independent subsystems into a single system, each one of which if separate and behind a well defined interface could be made substantially better. I think there's more room for software that is akin to SQLite - a library replacing RDBMS functionality within an application and thus allowing for composability on language-linking level as opposed to process-linking level (done with an orchestrator).

Perhaps the title of the article doesn't capture its underlying value. PubSubOnSQL Layers rarely make sense compared to dedicated Pub/Sub systems but the trade-off is acceptable in the three use cases described: 1. When your dedicated Pub/Sub is ephemeral and you need message durability, 2. When you need distributed transactions across SQL and your Pub/Sub system, and 3. When you need a poor man's heterogeneous SQL replication system.

The academic/enterprise database space has been discussing and tackling the types of questions you raise for decades. I don't think that is a useful lens to evaluate this article which is effectively a "tips on when to use our GoLang SQL Pub/Sub layer".

I'd add a #4, when you have such low volume that it just isn't worth it to put up a full system, with the accompanying need for deployment, resources, monitoring, and additional knowledge and skills added to the minimum set of knowledge and skills your team must possess.

One must be careful not to use this as an excuse, of course, and keep an eye on the scaling concerns and certain other details. SQL-as-pubsub has certain well-known issues and anyone using it this way ought to be aware of them. But it's a thought worth having.

I've got a system I'm managing where Cassandra is the backend. I've got about 5 "documents" (in the MongoDB sense, let's say) I want to store in the system. I don't put up a whole "document DB" for them, I just have a table in Cassandra. I have in my entire system, one distributed lock I'd like to have per certain resource, of which I expect there to be single digit numbers of that resource over the lifetime of the system. Cassandra is not a great distributed locker, but it does work (and as far as I can tell, done properly, is also correct), so rather than install an entire distributed lock server, I use Cassandra.

Should this ever turn out to become a mistake, all code that uses either of these functions is cleanly isolated and I can easily swap them out later. I am aware of the possibility that could happen in the future and have prepared for it. In the meantime, I've avoided two entire systems being poorly deployed and understood in favor of the one system that is well-deployed and understood by the team.

Agreed, the author did hint at this idea with the statement "SQL databases are readily available if you don’t have access to the dedicated messaging infrastructure. They also have some unique features that turn out useful while working with message-driven applications." Specialized tools are wonderful but sometimes their use represents a premature optimization when a familiar general purpose tool is good enough to get the job done in the short term.

> SQL-as-pubsub has certain well-known issues and anyone using it this way ought to be aware of them.

Such as? Other than performance.

In addition to what james_s_taylar said, another one is that you have to watch how the tables are structured internally, because for some databases having a table that writes a lot of rows and deletes them all very quickly is pessimal usage. You can get a table that fills up with tombstones internally. I've seen cases where it literally takes seconds to pull a single message off the queue from an otherwise unloaded server, because the act of "SELECT * FROM messages LIMIT 1" requires a huge table scan across a whole bunch of tombstones, which can be very frustrating to diagnose since it looks like the table is nearly empty.

I have an application at work I have to use (not something I wrote) based on an old version of Postgres that was experiencing this until quite recently when they upgraded, and I'm still not 100% sure that it's fixed because it hasn't been enough time. (I do expect it to be fixed, though.)

The aforementioned Cassandra is pretty much catastrophic in this use case, from what I've seen; I didn't try to use it as a message queue, all I did was write some unit testing that involved adding some test records during the test and removing them, then running a whole bunch of other tests that would add records to the same primary key and remove them, simply to clean up for the subsequent tests, a very similar pattern as using a table as an event queue from the database's point of view even though we humans see that as quite different, and I very quickly noticed my tests were running noticeably slower every execution, on scales that compared to what Cassandra can nominally handle were very, very tiny. For the testing, I found I could use TRUNCATE, which tells Cassandra "Hey, I'm not just happening to delete every row in this table, I'm actively nuking it, so please clear it out", resulting in Cassandra completely wiping the table tombstones and all, which was fine for this unit test case where they run synchronously relative to each other and I have clear points where I can say "just throw the table away now, tombstones and all", but that's completely unsuitable for an event queue use.

Somewhat ironically, you sometimes want the more "primitive" table types for this use care. IIRC there was a time when if you were using MySQL, you'd want to use a MyISAM-type table for the queue table, because it pretty much works as you expect and deleted rows are actually deleted and gone, but InnoDB used the "higher performance" tombstone-based approach, which is must faster, as long as deletes are relatively rare, but much slower if they aren't. Know your DBs!

The key thing about the performance aspect is SQL produces bi-modal behavior due to there being a query planner / operator execution. If all of a sudden your performance characteristics change dramatically because the query planner produced a different result than normal due to something changing about your dataset, that could be pretty bad. So, not having this risk of bi-modal behavior is a nice feature of dedicated pub/sub and queue software.

> It’s curious that over those projections, we then build event stores for CQRS/ES systems with their own projections mediated by application code.

That's really logical. From the view of the application there is no transaction log, only a table. It's an implementation detail of the database.

The application wants similar guarantees a log can provide, so they build their own.

Yep yep yep. Amazing how we keep reinventing the same thing over and over and over again.

I wouldn't say this is "reinventing the wheel", it's more like applying the same design/architectural pattern to solve a common problem.

> the transaction log of a typical RDBMS is the canonical form

Do databases keep the whole transaction log forever? It seems like that could keep growing forever even when the tables stay at constant size.

Usually you'll keep the transaction logs for a certain period, say for two weeks. That gives you point in time recovery for any moment in that two weeks (assuming you've set up everything correctly!). Logs past that are usually archived or just dumped, but you'll have full backups that go back further.

For your question about the logs growing forever, there are usually points in the process where a transaction log is saved off somewhere else and then can be overwritten, but on a transactional system the logs over a period of a week or two can sometimes be many times larger than the actual data stored at any given time, yes.

Really, I wouldn’t teach junior developers that it’s ok to use a database table when a queue is needed. Sure, you can get away with this and there are cases when it’s all you need. But I’ve been one of those juniors who forgot to limit the query, who didn’t have enough indices, who tried to order all records by date and had full table scans everywhere, who implemented the worker with a cron job and didn’t synchronize this with a lock.

It might work, but it’s not the general case and you might spend more time to debug your table then to write the code to use a real queue.

And I’ve also seen people build their own queueing engine for a few hundred tasks per day. Why don’t they just choose one of the very good open source solutions?

Like many design patterns when implemented badly, queuing will usually result in a lot of problems. But if your team is competent, you’re using an off-the-shelf library, and you don’t have crazy demands, then re-using infrastructure can be a good idea as it’s fewer things to manage.

Fewer things to manage is the key!

What's the rationale to teach databases as message queues when it requires special querying and updates when there are so many message queues services already available, easy to use, and standard compliant?

> What's the rationale to teach databases as message queues when it requires special querying and updates when there are so many message queues services already available, easy to use, and standard compliant?

If you already need a database for something else, using the DB as a Queue means you don't need to list {mqFlavorOfChoice} as a requirement for new hires. You also don't have to manage that extra infrastructure. Of course, you are putting additional load on the DB.

Mind you, I'm speaking of a pub-sub type queue and not a FIFO here. You can do FIFO queues in DB as well of course, it's just not as compelling of a story nowadays.

Also way easier to look at and 'poke' a Database queue if you need to. The queries are also not really difficult to write for a general purpose use case.

I agree that using a traditional DB as a queue has it's pitfalls, but it also has great benefits with regards to consistency, reliability and simplicity.

If I was building a new system that required a queue I'd definitely put in the same postgres db as the rest of the data until I had a good reason not to.

"Because someone might add bugs to the code, or do it wrong" isn't a reason not to take a particular approach.

Of course it is. When one approach is far more likely to introduce bugs, complicated interactions, and be more difficult to maintain, it is absolutely a reason to not take a particular approach. It’s one of factors you need to consider in everyday software engineering.

If you don’t take this into consideration then you’re detracting from the business to satisfy another need.

Sure it is. If one approach is easier to mess up than the other then take the approach with less footguns.

All the issues you mention are fairly easy to fix.

After you found them. And you find them after you become aware of problems.

And how hard it is to clean up the data at this point in time depends entirely on the kind of system you're working with.

The "database as message queue" pattern is quite common and often considered to be an antipattern, which I tend to agree with but I don't have that strong of a position on it myself. I've certainly used this pattern for expediency, but that was before we had all the messaging solutions we do today. http://mikehadlow.blogspot.com/2012/04/database-as-queue-ant... has some good points.

A lot has changed in the 7 years since that was written.

Polling isn't a huge issue to begin with, and is mitigated with LISTEN/NOTIFY (on certain DBs). Inserts with indexes are not a performance problem at the scale of most applications. A separate messaging service won't prevent you from building a "hugely coupled monster".

Personally, I almost always start with the database as a queue. The operational overhead of running, updating, and monitoring another entire service is non trivial. If the messaging rate exceeds the database's capabilities in the future, I'll migrate then.

If you need just one queue yes. If you have lots of queues it’s worth investing in a queue service of some sort and there are many of them out there which is a good thing but could turn into a bad thing quickly. In the past I worked at a place that had 3 different queueing services implemented by different developers and it became a pain to manage them or to even know what was on the queues.

I've always considered message-queues as a close cousin (if not sibling) of databases. Arguably performing the same function with different foci. Pub/sub focusing on the "oplog". DBMS focusing on "state".

(Blockchain another "oplog" that ends up caring a lot about state eventually).

It's no wonder you can use them interchangeably in many common base cases.

Seems like they fall into the same pit as many does: using primary keys with autoincrement as offset. This leads to skipping messages because there is no guarantees that primary keys will be available in monotonic order. Because, you know, transactions.

Can you expand a bit on this? From my understanding, autoincrement keys ca mn have gaps, but are always increasing. Sometimes a message might arrive "late", so you get a 3,then a 2. This problem cannot really be solved without giant locks that are not ideal. As far as I'm aware, all messaging systems are subject to this problem.

Messages will never arrive, arrive out of order and I don't remember the third one right now (messages will arrive late?)

Databases such as postgresql will effectively issue a buffer of keys to each connection, meaning in some circumstances the sequences will not be monotonic with respect to time. Also that usually long running transactions will use the timestamp the transaction was opened, regardless of how many seconds have passed between then and when the statement is executed.

Very interesting details, thanks. So the alternative is have inconsistency, or "giant locks". One is not performant, the other is inconsistent.

Tough choice, interesting nevertheless

Yes, you described the problem exactly as it is. The problem is not in arrive order to subscriber, the problem is "selecting next messages with offset > last_offset". And in this case you simply miss late messages.

Oh ok. Well, I don't believe is permanently solvable, but there can be mitigation techniques where instead the software reads messages way back every now and then, to recover some messages.

Some messages might still be too late and get missed, but most of them should get through, which is what every messaging service is currently doing

It is solvable, but not with SQL. I mean, one have to have side logic to keep all transactions, messages and such and use as simple storage (i.e. do not use database transaction/locking mechanisms for main business logic).

Yeah my point is, it's not solvable by the transport mechanism.

The application logic can indeed solve it, "eventually consistent systems" are a thing. My main goal was figuring out if this was impossible in SQL for some reason, but my understanding is just that the implementations are usually weak and do the "read-back" they need to, to recover late messages.

Off topic but does this not effect the Pagination functionality of databases as well? Using primary keys to skip first N pages and then limit the count of results seems to be the suggested way for getting items for the Nth page. If primary key is not monotonic then this is going to give jumbled results thus messing up results in the Nth page.

EDIT: More context for the above process[1]


Hummm, not sure I follow but most likely no.

What parent mean is that there may be holes in the sequence of primary keys. What you do with pagination is that you first sort the sequence, then thrown away the first N results, and finally select only the next M results.

It will work just fine.

That is one way, not necessarily the most efficient. And having gaps in the id sequence can complicate pagination.

Recommended reading: https://www.citusdata.com/blog/2016/03/30/five-ways-to-pagin...

> What parent mean is that there may be holes in the sequence of primary keys.

Are you sure that's what they mean? It's not what they said. "Monotonic" means "strictly increasing" (or decreasing) e.g. 1, 2, 5, 7 is monotonic even though it has gaps. "Contiguous" means "without gaps".

I have linked the article for the above process.

Perhaps it's not so much about pub/sub, but about store-and-forward.

When the "forward" part of "store-and-forward" is most important then Kafka is a fine solution.

However, when the "store" part - for example you want to be able to stream historical data again, or interact with the data in different ways - is most important I have recommended HBase (+ Phoenix) as a better solution in the past.

For anyone just looking for ‘plug and play’ web socket pub/sub functionality, I have been developing something that provides the functionality for PostgreSQL: https://github.com/supabase/realtime

It's an Elixir server (Phoenix) that allows you to listen to changes in your database via websockets. Basically the Phoenix server listens to PostgreSQL's replication functionality, converts the byte stream into JSON, and then broadcasts over websockets. The beauty of listening to the replication functionality is that you can make changes to your database from anywhere - your api, directly in the DB, via a console etc - and you will still receive the changes via websockets.

The article suggests Postgres’ native LISTEN/NOTIFY functionality. I tried that originally and found that NOTIFY payloads have a limit of 8000 bytes, as well a few other inconveniences.

It's still in very early stages, although I am using it in production at my company and will work on it full time starting Jan.

One way to get around the 8k NOTIFY limit is to only use the capability to notify only. It would them be incumbent on the client to go fetch the data from a table somewhere. I ran into a similar limitation with SQL Server 2005 years ago and used this approach with great success.

MessageDB was doing the rounds in reddit the other day [1]. Looks interesting for simple use cases...

[1] https://www.reddit.com/r/PostgreSQL/comments/ebu6nh/message_...

That's basically the same pattern as the "outbox pattern", e.g. listed in Chris Richardson's pattern of microservices patterns.

An alternative implementation is provided by Debezium [1], a general solution for change data capture for MySQL, Postgres, MongoDB, SQL Server and others, based on top of Apache Kafka (but can also be used with Pulsar and others).

There's support for outbox coming as part of Debezium out of the box [2].

Disclaimer: I'm working on Debezium.

[1] https://debezium.io/ [2] https://debezium.io/documentation/reference/1.0/configuratio...

If anyone is using Elixir, Oban[0] is a job processor that uses PostgreSQL for its back-end and state management.

It's incredibly well written and I am using it in a project.

[0]: https://github.com/sorentwo/oban

One fun open source software I've played with, that I don't think many have heard of, is Deepstream.io. It attempts to be a batteries included real-time web server that works with websockets, and can function as pub/sub server and client. It has a connector for using PostgreSQL as the database. The frontend JavaScript library is really easy to get working.



(I'm not affiliated with the project.)

Thanks, this does really look interesting. I'd like to find something generic, lightweight to replace Kafka or Pulsar. Not sure this could be it, but it looks like it'd be worth having a look at...

This could slightly out of context.

I'm working on a module that send notifications to a user when an alert is generated. I have PostGreSQL as the database and NodeJS is the handler and for connection pooling. Are there any good pub/sub tools that I can use. Thanks in advance.

How about simply having an after insert trigger on an alerts table that calls notify, and then you listen for that in node? It's a simple setup with less moving parts and could probably get you a long way...

Thanks, this will probably work for me. However, if the inserts are higher, I don't want to get notified that frequently. How would I add a periodic alerts to this? Thank you!

You could use a column/table to track notifications. Eg. have an alert_sent_at column on the alerts table. Then when the node service receives the notify on an insert you can defer based on any logic you need, and send notifications when needed by a query that fetches alerts with alert_sent_at = null.

Thank you! Thank you so much :)

You're welcome. To ensure you're not sending alerts twice either select and update in a transaction or use a CTE like:

  with alert as (
    select alert_id from alerts 
    where alert_sent_at is null
  update alerts set
    alert_sent_at = now()
  from alert
  returning *

Meteor provides pub/sub with a MySQL backend using the atmosphere package vlasky:mysql.

It works by following the MySQL binary log and triggering a reactive query based on event conditions specified by the programmer, e.g. a change in a field.


Oracle has a very advanced and flexible system for this. It is called Advanced Queueing.

Ever use google's firebase? While not SQL -- I've always felt `tis a nice solution to persistence+async...

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact