related: McSherry's preceding blog post was all about demonstrating how binary joins can achieve worst-case optimal runtime, given suitable adjustments to the query plan.
For materialization-heavy workloads (program analysis, etc.), we often find that optimized binary join plans (profile-optimized, hand-optimized, etc.) beat worst-case optimal plans, because they scale better (less locking) and don't require a trie-based representation. Within the space of worst-case optimal plans there are still lots of choices, but a bad worst-case optimal plan can often beat a bad (randomly-chosen) binary plan. And of course (the whole point of this exercise), there are some queries where every binary plan explodes and you do need WCOJ. There's also some work on making more traditional binary joins robust (https://db.in.tum.de/people/sites/birler/papers/diamond.pdf), among other interesting work (https://arxiv.org/html/2502.15181v1). Effectively parallelizing WCOJs is still an open problem as far as I am aware (at least, this is what folks working on it tell me), but I believe there are some exciting directions that several folks are working on.
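For concreteness, here's a toy sketch of the generic-join idea on the classic triangle query (my own illustrative code, not any particular system's; it also elides the size-based intersection ordering that the actual worst-case guarantee depends on):

    import           Data.Map.Strict (Map)
    import qualified Data.Map.Strict as Map
    import           Data.Set (Set)
    import qualified Data.Set as Set

    -- Index a binary relation by its first attribute (a poor man's trie level).
    index :: [(Int, Int)] -> Map Int (Set Int)
    index = Map.fromListWith Set.union . map (\(x, y) -> (x, Set.singleton y))

    -- Triangle query Q(a,b,c) = R(a,b), S(b,c), T(a,c), variable order a, b, c.
    -- A binary plan must materialize an intermediate such as R join S, which
    -- can be quadratic even when the final output is tiny; generic join
    -- instead binds one variable at a time, intersecting its candidate sets.
    triangles :: [(Int, Int)] -> [(Int, Int)] -> [(Int, Int)] -> [(Int, Int, Int)]
    triangles r s t =
      [ (a, b, c)
      | (a, bs) <- Map.toList rIdx                   -- bind a (from R's keys)
      , let csFromT = Map.findWithDefault Set.empty a tIdx
      , b <- Set.toList bs                           -- extend with b from R(a,_)
      , let csFromS = Map.findWithDefault Set.empty b sIdx
      , c <- Set.toList (csFromT `Set.intersection` csFromS)  -- c in S(b,_) and T(a,_)
      ]
      where
        rIdx = index r
        sIdx = index s
        tIdx = index t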
Algebraic effects are in delimited continuation territory, operating on the program stack. No amount of monad shenanigans is going to allow you to immediately jump to an effect handler 5 levels up the call stack, update some local variables in that stack frame, and then jump back to execution at the same point 5 levels down.
Quite the opposite: that's exactly what continuation monads do, for example `ContT`, and more structured versions such as `freer`. Those essentially simulate a stack rather than using the actual RTS stack. For implementations that do use the RTS stack, there are `eff` and `bluefin-algae` (the latter very much a work in progress). So yes, in Haskell at least, monads are the right API for delimited continuations.
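A minimal sketch of the trick using plain `Cont` from transformers (not the actual `freer`/`eff` API, just the underlying idea): the effect reifies its continuation, the handler answers the request while updating its own local state, and then jumps back in exactly where the computation stopped:

    import Control.Monad.Trans.Cont (Cont, cont, runCont)

    -- The computation either finishes, or suspends with a request plus the
    -- captured continuation -- i.e. "the rest of the call stack below".
    data Status a = Done a | Asking (Int -> Status a)

    -- The "effect": capture the current continuation and hand it outward.
    ask :: Cont (Status a) Int
    ask = cont Asking

    -- The handler "up the stack": it keeps local state (a counter), answers
    -- each request, and re-enters the suspended computation via k.
    handle :: Int -> Status a -> a
    handle _ (Done x)   = x
    handle n (Asking k) = handle (n + 1) (k n)

    example :: Int
    example = handle 0 (runCont prog Done)
      where
        prog = do
          x <- ask            -- suspend; handler resumes us with 0
          y <- ask            -- suspend again; handler resumes us with 1
          pure (x + y)        -- example == 1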
> Like goto, but you don't even need to name a label.
That's what exceptions are.
But effects don't leave you with huge stack traces in errors, because the whole point is that you provide the expected effect and values and the code goes on running.
>> Like goto, but you don't even need to name a label.
> That's what exceptions are.
In many contexts I agree. However, the difference here is that exceptions are one-way control transfers: once the stack unwinds, there is no jumping back.
Remember what was originally said:
No amount of monad shenanigans is going to allow you to
immediately jump to an effect handler 5 levels up the call
stack, update some local variables in that stack frame, and
then jump back to execution at the same point 5 levels down.
Jumping "5 levels up the call stack", modifying some state, then jumping back "5 levels down" is to me pretty much the definition of a nightmare to reason about.
I think the article was getting at this at the end - different use cases naturally call for either a point-in-time snapshot (optimally serviced by pull) or a live-updating view (optimally serviced by push). If I am gauging the health of a system, I'll probably want a live view. If I am comparing historical financial reports, snapshot. Note that these are both "read-only" use cases. If I am preparing updates to a dataset, I may well want to work off a snapshot (and when it comes time to commit the changes, compare-and-swap if possible, else pull the latest snapshot and reconcile conflicts). If I am adjusting my trades for market changes, live view again.
If I try to service a snapshot with a push system, I'll have to either buffer an unbounded number of events, discard events, or apply back-pressure all the way up to the source to prevent events from being created. And with push alone, my snapshot would still only be ephemeral; once I open the floodgates and start processing more events, the snapshot is gone.
If I try to service a live view with a pull system, I'll have to either pull infrequently and sacrifice freshness, or pull more frequently and waste time and bandwidth reprocessing unchanged data. And with pull alone, I would still only be chasing freshness; every halving of refresh interval doubles the resource cost, until the system can't keep up.
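To make that cost asymmetry concrete, a toy sketch (illustrative names; a running sum stands in for a materialized view):

    import Data.IORef (IORef, modifyIORef', readIORef)

    -- Pull: each refresh recomputes the view from the full dataset, so the
    -- cost is O(n) per poll whether or not anything changed; halving the
    -- refresh interval doubles the total work.
    pullRefresh :: IORef [Int] -> IO Int
    pullRefresh source = sum <$> readIORef source

    -- Push: the view is maintained incrementally, so the cost is O(1) per
    -- event and independent of how often anyone looks at the view.
    pushEvent :: IORef Int -> Int -> IO ()
    pushEvent view delta = modifyIORef' view (+ delta)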
The complicating real-world factor that this article alludes to is that, historically, push systems lacked the expressiveness to model complex data transformations. (And to be fair, they're up against physical limitations: Many transformations simply require storing the full intermediate dataset in order to compute an incremental update.) So the solution was to either switch wholesale to pull at some point in the pipeline (and try to use caching, change detection, etc to reduce the resource cost and enable more frequent pulling), or, introduce a pulling segment in the pipeline ("windowing" joins, aggregations, etc) and switch back to push after.
It's pretty recent that push systems are attempting to match the expressiveness of pull systems (e.g. Materialize, Readyset), but people are still so used to assuming pull-based compromises, asking questions like "How fresh does this data feed really _need_ to be?". It's analogous to asking "How long does this snapshot really _need_ to last?" - a relevant question to be sure, but maybe not one that needs to be the basis for massive architectural lifts.
I made this "expert-friendly" doc to orient all who find themselves probing the Java Streams source code in despair. It culminates in the "Stream planner" - a little tool I made to simulate how (parallel) stream operations affect memory usage and execution paths.
Go forth, use (parallel) streams with confidence, and don't run out of memory.
If I could ask, independently of the sentiment of this thread - I am genuinely curious: why is marking the difference good? (Sorry, this is only tangential to the article.)
The behavior and performance of asynchronous functions depend on the behavior and performance of concurrent processes, which makes them more complex to reason about, to test, and to be confident in their correctness and suitability.
Cool use case. They're just called "non equi-joins", because the join condition is an inequality. In general, a join produces a row in the output table for each (left, right) pair of rows from the input tables that satisfies the join condition. It's just so common for joins to use a simple equality condition, where one or both sides is a unique id for its table, that people don't as often encounter joins where one input row can contribute to multiple output rows.
- https://github.com/frankmcsherry/blog/blob/master/posts/2025...
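To make that concrete, a tiny sketch in list-comprehension form (illustrative data; a real engine would use an index rather than scanning all pairs):

    -- A join is just the set of (left, right) pairs satisfying the join
    -- condition; nothing requires that condition to be an equality on keys.
    joinOn :: (l -> r -> Bool) -> [l] -> [r] -> [(l, r)]
    joinOn cond lefts rights = [ (l, r) | l <- lefts, r <- rights, cond l r ]

    -- Non equi-join: each event matches every window whose range contains
    -- it, so one input row can produce several output rows.
    example :: [(Int, (Int, Int))]
    example = joinOn (\t (lo, hi) -> lo <= t && t < hi)
                     [3, 7]                      -- event timestamps
                     [(0, 5), (5, 10), (2, 8)]   -- half-open windows
    -- == [(3,(0,5)), (3,(2,8)), (7,(5,10)), (7,(2,8))]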