My colleague did some internal benchmarking and found that LISTEN/NOTIFY performs well under low to moderate load, but doesn't scale well with a large number of listeners. Our findings were pretty consistent with this blog post.
(Shameless plug [1]) I'm working on DBOS, where we implemented durable workflows and queues on top of Postgres. For queues, we use FOR UPDATE SKIP LOCKED for task dispatch, combined with exponential backoff and jitter to reduce contention under high load when many workers are polling the same table.
Would love to hear feedback from you and others building similar systems.
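For anyone curious what that polling pattern can look like, here is a minimal sketch. It is not DBOS's actual code: the table and column names (`tasks`, `status`, `created_at`), the helpers `backoff_delay` and `poll`, and the delay parameters are all assumptions made for illustration. The SQL is only shown as a string, and the loop accepts any `try_dequeue` callable so it runs without a database.

```python
import random
import time

# Hypothetical schema; a real queue table will likely look different.
# SKIP LOCKED lets concurrent workers skip rows another worker has locked,
# so each worker claims a different pending task instead of blocking.
DEQUEUE_SQL = """
WITH next_task AS (
    SELECT id FROM tasks
    WHERE status = 'pending'
    ORDER BY created_at
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
UPDATE tasks SET status = 'running'
FROM next_task
WHERE tasks.id = next_task.id
RETURNING tasks.id;
"""


def backoff_delay(attempt, base=0.1, cap=5.0, rng=random.random):
    """Exponential backoff with full jitter.

    Returns a random delay between 0 and min(cap, base * 2**attempt) seconds.
    """
    return rng() * min(cap, base * (2 ** attempt))


def poll(try_dequeue, handle, max_polls, sleep=time.sleep):
    """Poll up to max_polls times; back off while the queue looks empty.

    try_dequeue() returns a task or None (in practice it would run DEQUEUE_SQL).
    Returns the number of tasks handled.
    """
    attempt = 0
    handled = 0
    for _ in range(max_polls):
        task = try_dequeue()
        if task is None:
            # Nothing to do: wait longer each time, with jitter so that
            # many workers do not hit the table at the same moment.
            sleep(backoff_delay(attempt))
            attempt += 1
        else:
            handle(task)
            handled += 1
            attempt = 0  # work was found, so poll eagerly again
    return handled
```

The jitter matters as much as the backoff: without it, workers that went idle together would all wake up and query the table together.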
Nice! I'm using DBOS and am a little active on the Discord. I was just wondering how y'all handled this under the hood. Glad to hear I don't have to worry much about this issue.
We considered using WAL for change tracking in DBOS, but it requires careful setup and maintenance of replication slots, which may lead to unbounded disk growth if misconfigured. Since DBOS is designed to bolt onto users' existing Postgres instances (we don't manage their data), we chose a simpler, less intrusive approach that doesn't require a replication setup.
Plus, for queues, it's so much easier to leverage database constraints and transactions to implement global concurrency limits, rate limits, and deduplication.
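To make the deduplication point concrete, here is a small sketch of letting a unique constraint do the work. It uses Python's built-in `sqlite3` purely as a stand-in so the example runs anywhere; the `tasks` table, the `dedup_key` column, and the `enqueue` helper are hypothetical and not DBOS's schema. The same `INSERT ... ON CONFLICT DO NOTHING` statement shape is also valid in Postgres.

```python
import sqlite3


def make_queue():
    """Create an in-memory queue table with a unique deduplication key."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tasks ("
        " id INTEGER PRIMARY KEY,"
        " dedup_key TEXT NOT NULL UNIQUE,"
        " payload TEXT NOT NULL)"
    )
    return conn


def enqueue(conn, dedup_key, payload):
    """Insert a task unless one with the same dedup_key already exists.

    Returns True if the task was enqueued, False if it was a duplicate.
    """
    with conn:  # one transaction: commit on success, roll back on error
        cur = conn.execute(
            "INSERT INTO tasks (dedup_key, payload) VALUES (?, ?) "
            "ON CONFLICT (dedup_key) DO NOTHING",
            (dedup_key, payload),
        )
    # rowcount is 1 when a row was inserted and 0 when the insert was skipped.
    return cur.rowcount == 1
```

Because the database enforces the constraint, two workers racing to enqueue the same key cannot both succeed, and no application-level locking is needed.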
I’ve seen several blog posts trying to analyze HN data on the best time to post. However, the results are all over the place. For example, the ones below make different recommendations (weekend vs. weekday).
It's cool to build a database in 3000 lines, but for a real production-ready database you'll need testing. Would love to see some coverage on correctness and reliability tests. For example, SQLite has about 590 times more test code than the library itself. (https://www.sqlite.org/testing.html)
DBOS always uses transactions to perform database operations.
If you're writing a function that performs database operations, you can use the @DBOS.transaction() decorator to wrap the function so that DBOS's bookkeeping records commit in the same transaction as your operation.
However, if you're interfacing with a third-party API, then that wouldn't be part of a database transaction (you'll use @DBOS.step instead). The reason is that you don't want to hold database locks when you're not performing database operations.
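Here is a toy sketch of the idea behind committing the bookkeeping record in the same transaction as the operation. To be clear, this is not how DBOS is implemented and it does not use the DBOS API: `run_once`, the `step_outputs` table, and the `accounts` table are hypothetical, and `sqlite3` stands in for Postgres so the example is runnable.

```python
import json
import sqlite3


def make_db():
    """Create an in-memory database with app data and a bookkeeping table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
    conn.execute("CREATE TABLE step_outputs (step_id TEXT PRIMARY KEY, output TEXT)")
    conn.execute("INSERT INTO accounts VALUES ('alice', 100)")
    conn.commit()
    return conn


def run_once(conn, step_id, fn):
    """Run fn(conn) at most once per step_id.

    The function's writes and the record of its output commit together,
    so a crash can never leave one without the other.
    """
    row = conn.execute(
        "SELECT output FROM step_outputs WHERE step_id = ?", (step_id,)
    ).fetchone()
    if row is not None:
        return json.loads(row[0])  # already done: return the recorded output
    with conn:  # commits on success, rolls back if fn raises
        result = fn(conn)
        conn.execute(
            "INSERT INTO step_outputs VALUES (?, ?)", (step_id, json.dumps(result))
        )
    return result


def deposit(conn):
    """Example operation: add 10 to alice's balance and return the new balance."""
    conn.execute("UPDATE accounts SET balance = balance + 10 WHERE name = 'alice'")
    return conn.execute(
        "SELECT balance FROM accounts WHERE name = 'alice'"
    ).fetchone()[0]
```

Calling `run_once(conn, "s1", deposit)` twice applies the deposit once; the second call returns the recorded result. A call to a third-party API cannot join this transaction, which is why it is handled as a separate step.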
Hi! How does it perform under heavy load and with thousands of workflows trying to run concurrently since it relies on Postgres for a lot of things (including using a transaction)? In the end it seems that if I have an application with lots of distributed workers trying to run workflows, I'll still be limited by the CPU/memory of the DB.
Hi there, I think I might have found a typo in your example class in the GitHub README. In the class's `workflow` method, shouldn't we be `await`-ing those steps?
It's not recommended--the assumed model is that every workflow finishes on the code version it started. This is managed automatically in our hosted version (DBOS Cloud) and there's an API for self-hosting: https://docs.dbos.dev/typescript/tutorials/development/self-...
That said, we know sometimes you have to do surgery on a long-running workflow, and we're looking at adding better tooling for it. It's completely doable because all the state is stored in Postgres tables (https://docs.dbos.dev/explanations/system-tables).
The main use case is to build reliable programs. For example, orchestrating long-running workflows, running cron jobs, and orchestrating AI agents with human-in-the-loop.
DBOS makes external asynchronous API calls reliable and crashproof, without needing to rely on an external orchestration service.
How do you persist execution state? Does it hook into the Python interpreter to capture referenced variables/data structures etc, so they are available when the state needs to be restored?
About workflow recovery: if I'm running multiple instances of my app that uses DBOS and they all crash, how do you divide the work of retrying pending workflows?
Each workflow is tagged by the executor ID that runs it. You can command each new executor to handle a subset of the pending workflows. This is done automatically on DBOS Cloud. Here's the self-hosting guide: https://docs.dbos.dev/typescript/tutorials/development/self-...
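As a rough sketch of how tagging by executor ID lets you split up recovery, the hypothetical function below hands each crashed executor's pending workflows to one live executor, round-robin. The name `plan_recovery`, the input shape, and the round-robin policy are assumptions for illustration; the self-hosting guide describes the actual mechanism.

```python
from collections import defaultdict
from itertools import cycle


def plan_recovery(pending, live_executors):
    """Decide which live executor recovers which orphaned workflows.

    pending: list of (workflow_id, executor_id) pairs for pending workflows.
    live_executors: IDs of executors that are currently running.
    Returns {live_executor_id: [workflow_id, ...]}.
    """
    if not live_executors:
        raise ValueError("need at least one live executor")
    live = set(live_executors)

    # Group pending workflows by the executor that was running them,
    # keeping only those whose executor is no longer alive.
    orphaned = defaultdict(list)
    for workflow_id, executor_id in pending:
        if executor_id not in live:
            orphaned[executor_id].append(workflow_id)

    # Give each dead executor's whole batch to one live executor, so no
    # workflow is recovered by two executors at once.
    plan = {executor_id: [] for executor_id in live_executors}
    targets = cycle(live_executors)
    for dead_id in sorted(orphaned):
        plan[next(targets)].extend(orphaned[dead_id])
    return plan
```

Workflows still owned by a live executor are left alone, since that executor is presumably still running them.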
I was originally looking at the docs to see if there was any information on multi-instance (horizontally scaled) apps. Is this supported? If so, how does that work?
Yeah, DBOS Cloud automatically (horizontally) scales your apps. For self-hosting, you can spin up multiple instances and connect them to the same Postgres database. For fan-out patterns, you may leverage DBOS Queues. This works because DBOS uses Postgres for coordination, rate limiting, and concurrency control. For example, you can enqueue tasks that are processed by multiple instances; DBOS makes sure that each task is dequeued by one instance.
The article nicely explains how to build a minimalist OS, and it works great as intro material. I think understanding basic OS concepts is essential for performance tuning and debugging.
I noticed a bunch of downvotes. Apologies for being unfamiliar with the rules here (I've always been reading HN, but I'm new to commenting). I should've added a lot more details to my previous comment and been more specific. Any other guides would be helpful too. I'll be careful in the future.
When I learned OS, I followed MIT 6.828 (https://pdos.csail.mit.edu/6.828/2017/overview.html) and implemented a small OS called JOS based on Xv6. So if you're looking for some teaching OS in x86, check it out.
Exactly, you have to (vaguely) know what you’re looking for and have some basic ideas of what algorithms would work. AI is good at helping with syntax stuff but not really good at thinking.
Great article — it clearly explains “The devil is in the details” :) Would love to see another one for LSM-Tree, and the comparison between B-Trees and LSM-Trees.
[1] https://github.com/dbos-inc/dbos-transact-py