related: McSherry's preceding blog post was all about demonstrating how binary joins can achieve worst-case optimal runtime, given suitable adjustments to the query plan.
For materialization-heavy workloads (program analysis, etc.), we often find that optimized binary join plans (profile-optimized, hand-optimized, etc.) beat worst-case optimal plans, because they scale better (less locking) and don't require a trie-based representation. Within the space of worst-case optimal plans there are still lots of choices, but a bad worst-case optimal plan can often beat a bad (randomly-chosen) binary plan. And of course (the whole point of this exercise), there are some queries where every binary plan explodes and you do need WCOJ. There's also some work on making more traditional binary joins robust (https://db.in.tum.de/people/sites/birler/papers/diamond.pdf), among other interesting work (https://arxiv.org/html/2502.15181v1). Effectively parallelizing WCOJs is still an open problem as far as I am aware (at least, this is what folks working on it tell me), but I believe there are some exciting directions that several folks are working on.
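For concreteness, here's a toy sketch of the generic-join idea on the classic triangle query (my own illustrative code, not any particular system's; it also elides the size-based intersection ordering that the actual worst-case guarantee depends on):

    import           Data.Map.Strict (Map)
    import qualified Data.Map.Strict as Map
    import           Data.Set (Set)
    import qualified Data.Set as Set

    -- Index a binary relation by its first attribute (a poor man's trie level).
    index :: [(Int, Int)] -> Map Int (Set Int)
    index = Map.fromListWith Set.union . map (\(x, y) -> (x, Set.singleton y))

    -- Triangle query Q(a,b,c) = R(a,b), S(b,c), T(a,c), variable order a, b, c.
    -- A binary plan must materialize an intermediate such as R join S, which
    -- can be quadratic even when the final output is tiny; generic join
    -- instead binds one variable at a time, intersecting its candidate sets.
    triangles :: [(Int, Int)] -> [(Int, Int)] -> [(Int, Int)] -> [(Int, Int, Int)]
    triangles r s t =
      [ (a, b, c)
      | (a, bs) <- Map.toList rIdx                   -- bind a (from R's keys)
      , let csFromT = Map.findWithDefault Set.empty a tIdx
      , b <- Set.toList bs                           -- extend with b from R(a,_)
      , let csFromS = Map.findWithDefault Set.empty b sIdx
      , c <- Set.toList (csFromT `Set.intersection` csFromS)  -- c in S(b,_) and T(a,_)
      ]
      where
        rIdx = index r
        sIdx = index s
        tIdx = index t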
Algebraic effects are in delimited continuation territory, operating on the program stack. No amount of monad shenanigans is going to allow you to immediately jump to an effect handler 5 levels up the call stack, update some local variables in that stack frame, and then jump back to execution at the same point 5 levels down.
Quite the opposite: that's exactly what continuation monads do, for example `ContT`, and more structured versions such as `freer`. Those essentially simulate a stack rather than using the actual RTS stack. For implementations that do use the RTS stack, there are `eff` and `bluefin-algae` (the latter very much a work in progress). So yes, in Haskell at least, monads are the right API for delimited continuations.
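A minimal sketch of the trick using plain `Cont` from transformers (not the actual `freer`/`eff` API, just the underlying idea): the effect reifies its continuation, the handler answers the request while updating its own local state, and then jumps back in exactly where the computation stopped:

    import Control.Monad.Trans.Cont (Cont, cont, runCont)

    -- The computation either finishes, or suspends with a request plus the
    -- captured continuation -- i.e. "the rest of the call stack below".
    data Status a = Done a | Asking (Int -> Status a)

    -- The "effect": capture the current continuation and hand it outward.
    ask :: Cont (Status a) Int
    ask = cont Asking

    -- The handler "up the stack": it keeps local state (a counter), answers
    -- each request, and re-enters the suspended computation via k.
    handle :: Int -> Status a -> a
    handle _ (Done x)   = x
    handle n (Asking k) = handle (n + 1) (k n)

    example :: Int
    example = handle 0 (runCont prog Done)
      where
        prog = do
          x <- ask            -- suspend; handler resumes us with 0
          y <- ask            -- suspend again; handler resumes us with 1
          pure (x + y)        -- example == 1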
> Like goto, but you don't even need to name a label.
That's what exceptions are.
But effects don't leave you with huge stack traces in errors, because the whole point is that you provide the expected effect and values and the code goes on running.
>> Like goto, but you don't even need to name a label.
> That's what exceptions are.
In many contexts I agree. However, the difference here is that exceptions are one-way control transfers: once the stack unwinds, there is no jumping back.
Remember what was originally said:
No amount of monad shenanigans is going to allow you to
immediately jump to an effect handler 5 levels up the call
stack, update some local variables in that stack frame, and
then jump back to execution at the same point 5 levels down.
Jumping "5 levels up the call stack", modifying some state, then jumping back "5 levels down" is to me pretty much the definition of a nightmare to reason about.
I think the article was getting at this at the end - different use cases naturally call for either a point-in-time snapshot (optimally serviced by pull) or a live-updating view (optimally serviced by push). If I am gauging the health of a system, I'll probably want a live view. If I am comparing historical financial reports, snapshot. Note that these are both "read-only" use cases. If I am preparing updates to a dataset, I may well want to work off a snapshot (and when it comes time to commit the changes, compare-and-swap if possible, else pull the latest snapshot and reconcile conflicts). If I am adjusting my trades for market changes, live view again.
If I try to service a snapshot with a push system, I'll have to either buffer an unbounded number of events, discard events, or apply back-pressure all the way up to the source to prevent events from being created. And with push alone, my snapshot would still only be ephemeral; once I open the floodgates and start processing more events, the snapshot is gone.
If I try to service a live view with a pull system, I'll have to either pull infrequently and sacrifice freshness, or pull more frequently and waste time and bandwidth reprocessing unchanged data. And with pull alone, I would still only be chasing freshness; every halving of refresh interval doubles the resource cost, until the system can't keep up.
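To make that cost asymmetry concrete, a toy sketch (illustrative names; a running sum stands in for a materialized view):

    import Data.IORef (IORef, modifyIORef', readIORef)

    -- Pull: each refresh recomputes the view from the full dataset, so the
    -- cost is O(n) per poll whether or not anything changed; halving the
    -- refresh interval doubles the total work.
    pullRefresh :: IORef [Int] -> IO Int
    pullRefresh source = sum <$> readIORef source

    -- Push: the view is maintained incrementally, so the cost is O(1) per
    -- event and independent of how often anyone looks at the view.
    pushEvent :: IORef Int -> Int -> IO ()
    pushEvent view delta = modifyIORef' view (+ delta)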
The complicating real-world factor that this article alludes to is that, historically, push systems lacked the expressiveness to model complex data transformations. (And to be fair, they're up against physical limitations: Many transformations simply require storing the full intermediate dataset in order to compute an incremental update.) So the solution was to either switch wholesale to pull at some point in the pipeline (and try to use caching, change detection, etc to reduce the resource cost and enable more frequent pulling), or, introduce a pulling segment in the pipeline ("windowing" joins, aggregations, etc) and switch back to push after.
It's pretty recent that push systems are attempting to match the expressiveness of pull systems (e.g. Materialize, Readyset), but people are still so used to assuming pull-based compromises, asking questions like "How fresh does this data feed really _need_ to be?". It's analogous to asking "How long does this snapshot really _need_ to last?" - a relevant question to be sure, but maybe not one that needs to be the basis for massive architectural lifts.
I made this "expert-friendly" doc to orient all who find themselves probing the Java Streams source code in despair. It culminates in the "Stream planner" - a little tool I made to simulate how (parallel) stream operations affect memory usage and execution paths.
Go forth, use (parallel) streams with confidence, and don't run out of memory.
If I could ask, independently of the sentiment of this thread - I am genuinely curious: why is marking the difference good? (Sorry, this is only tangential to the article.)
The behavior and performance of asynchronous functions depend on the behavior and performance of concurrent processes, which makes them more complex to reason about, to test, and to be confident in their correctness and suitability.
Cool use case. They're just called "non equi-joins", because the join condition is an inequality. In general, a join produces a row in the output table for each (left, right) pair of rows from the input tables that satisfies the join condition. It's just so common for joins to use a simple equality condition, where one or both sides is a unique id for its table, that people don't as often encounter joins where one input row can contribute to multiple output rows.
- https://github.com/frankmcsherry/blog/blob/master/posts/2025...
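To make that concrete, a tiny sketch in list-comprehension form (illustrative data; a real engine would use an index rather than scanning all pairs):

    -- A join is just the set of (left, right) pairs satisfying the join
    -- condition; nothing requires that condition to be an equality on keys.
    joinOn :: (l -> r -> Bool) -> [l] -> [r] -> [(l, r)]
    joinOn cond lefts rights = [ (l, r) | l <- lefts, r <- rights, cond l r ]

    -- Non equi-join: each event matches every window whose range contains
    -- it, so one input row can produce several output rows.
    example :: [(Int, (Int, Int))]
    example = joinOn (\t (lo, hi) -> lo <= t && t < hi)
                     [3, 7]                      -- event timestamps
                     [(0, 5), (5, 10), (2, 8)]   -- half-open windows
    -- == [(3,(0,5)), (3,(2,8)), (7,(5,10)), (7,(2,8))]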