However, if somebody is interested, this is a great paper on how to architect a query compiler: https://www.cs.purdue.edu/homes/rompf/papers/tahboub-sigmod1...
I found it very enlightening and simple to implement. The paper also delves into the details of why the push-based architecture has more room for optimizations.
People shy away from this conclusion because doing everything both ways is quite hard and requires fancier abstractions to make nice than people are familiar with.
That really is a simple and intuitive summary. Thank you for that insight!
http://reactivex.io/documentation/operators/backpressure.htm... describes ReactiveX's approach to the problem in great detail. To quote the section that essentially reaches the same conclusion as the grandparent post, pointing out how "forks" (using the same pull-style data source twice) inherently require buffering/materialization or sampling:
> When a cold Observable is multicast (when it is converted into a connectable Observable and its Connect method is called), it effectively becomes hot and for the purposes of backpressure and flow-control it should be treated as a hot Observable. Cold Observables are ideal for the reactive pull model of backpressure implemented by some implementations of ReactiveX (which is described elsewhere). Hot Observables typically do not cope well with a reactive pull model, and are better candidates for other flow control strategies, such as the use of the operators described on this page, or operators like Buffer, Sample, Debounce, or Window.
There are great visuals if you expand the RxJS examples in the link above.
I feel like there's some wisdom here, but I don't think I grokked it.
With joins, pushing only works if both sides push at the same time.
In general, joins work by a push on one side triggering a pull on the other side, and then both are combined.
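A minimal Python sketch of that push-triggers-pull shape, using a hash join (all names here are made up for illustration): the build side is pulled and materialized up front, then each row pushed from the probe side triggers a lookup against it.

```python
# Hypothetical sketch: a push-based hash join. Each probe-side row
# pushed through triggers a lookup ("pull") against the build side.
def hash_join(build_rows, probe_rows, build_key, probe_key, emit):
    # Build phase: pull the build side and materialize it in a hash table.
    table = {}
    for row in build_rows:
        table.setdefault(build_key(row), []).append(row)
    # Probe phase: each pushed probe row pulls its matches and combines.
    for row in probe_rows:
        for match in table.get(probe_key(row), []):
            emit((match, row))

out = []
hash_join(
    [("a", 1), ("b", 2)],            # build side
    [("a", 10), ("a", 11)],          # probe side
    build_key=lambda r: r[0],
    probe_key=lambda r: r[0],
    emit=out.append,
)
# out == [(("a", 1), ("a", 10)), (("a", 1), ("a", 11))]
```

Note the asymmetry: one input has to be fully materialized before the other can stream through, which is the "happens after" relationship discussed below.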
For forks there should be some subscribe mechanism to turn pulls into pushes that can cheaply fan out to all subscribers.
(Side note: the mitigations ought to be completely symmetric. This is a sign there is more to the problem that I am missing.)
“Trigger” is not necessarily the relationship between the build and probe sides, though it could be. The DAG paths that produce/push the probe can be triggered by the completion of the build. “Happens after” is more strictly apt.
Is the author younger than 25, not picturing a world of databases before 1994? Did the world even exist?
I kid but regardless of when a paper describing some process was published the “oldest” query execution model is decidedly a lot older than 26 years.
> This is the oldest and most well-known query execution model, and is named for the paper which standardized its conventions in 1994.
Since the paper references previous databases, I'd rather assume incorrect expression from the blog author than ignorance.
With push-based you'd still usually use lossless transports, which could also have a "closing packet" telling the consumer "this stream is done", so I don't see how pull-based is better here.
Re queue full: If I understand you correctly, queues are only a problem when you start having multiple parallel stages one after the other, with queues in-between. Then you have queueing both with push-based and pull-based architectures.
Through this lens, a pull-based architecture is your conventional N-tier application where the stateless application issues queries and consumes results. The individual consumer (and producer) agents are independent and decoupled, allowed to handle failures with simple restart/retry while in control of their own transactions.
Conversely, a push-based architecture is often rendered into some enterprise message bus with queues and brokers. The applications can only subscribe to the queues and pub/sub system, but lack any real control over fault recovery or transactional boundaries.
> Noria observes that, by having developers provide the set of queries their application will make in advance, the database can be smarter about how to execute those queries. In particular, it can choose to pre-compute, and incrementally maintain, the results for queries. This allows Noria to answer those queries quickly, and essentially obviates the need for application caches.
> Crossing a boundary from a pull system to a push system requires polling its state, and crossing a boundary from a push system to a pull system requires materialization of its state.
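Both boundary crossings from that quote can be sketched in a few lines of Python (function names are made up for illustration): going push-to-pull means buffering everything the producer emits; going pull-to-push means a loop that polls the source and forwards each item.

```python
# Sketch of the two boundary crossings described above.

def push_to_pull(run_producer):
    # Push -> pull: materialize the producer's output into a buffer,
    # then hand the consumer an iterator it can pull at its own pace.
    buffer = []
    run_producer(buffer.append)   # producer pushes into the buffer
    return iter(buffer)

def pull_to_push(iterator, emit):
    # Pull -> push: poll the pull source and push each item onward.
    for item in iterator:
        emit(item)

def producer(push):
    for i in range(3):
        push(i)

pulled = list(push_to_pull(producer))      # [0, 1, 2]
collected = []
pull_to_push(iter([4, 5]), collected.append)  # collected == [4, 5]
```

The real versions would poll incrementally or bound the buffer, but the shape of the cost is the same: one direction needs storage, the other needs a polling loop.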
When you look at it from the perspective of a language designer or compiler, things start to look even more similar:
* Values that live across iterations are stored in a closure (push) or generator object (pull)
* In the first half of an iteration, before producing a result, operators dispatch to their producers via a return address on the call stack (push) or callbacks (pull)
* In the second half of an iteration, while building up a result, operators dispatch to their consumers via callbacks (push) or a return address on the call stack (pull)
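The correspondence in the bullets above can be made concrete with a running-sum operator written both ways; this is a Python sketch, not taken from any particular compiler:

```python
# A running-sum operator in both styles. State that lives across
# iterations sits in a closure (push) or in a generator frame (pull).

def running_sum_push(downstream):
    total = 0
    def on_item(x):           # our producer calls this callback...
        nonlocal total        # ...state is captured in the closure
        total += x
        downstream(total)     # ...and we push to our consumer
    return on_item

def running_sum_pull(source):
    total = 0                 # state lives in the generator object
    for x in source:          # we pull from our producer
        total += x
        yield total           # our consumer resumes us via next()

out_push = []
step = running_sum_push(out_push.append)
for x in [1, 2, 3]:
    step(x)

out_pull = list(running_sum_pull([1, 2, 3]))
# out_push == out_pull == [1, 3, 6]
```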
The difference the article mentions between "unrolling" the two approaches comes from the level of abstraction where inlining is performed. The push model produces natural-looking output when inlining the closures. The less-natural result of inlining the pull model's `next` functions is where the control flow in loops shows up. (See for example the performance benefits of adding a push-model complement to Rust's usual pull-model Iterator trait: https://github.com/rust-lang/rust/pull/42782#issuecomment-30...)
But, you can get the same natural-looking output from the pull model if you perform inlining at the level of generators! This is a less familiar transformation than function inlining, but it's still fundamentally the same idea, of erasing API boundaries and ABI overhead- which in this case is a more straightforward formulation of the various loop optimizations that can sometimes get rid of the control-flow-in-loops produced by inlining `next` functions.
The difference around DAGs is similar. In the push model, a DAG just means a producer calling into more than one consumer callback. In the pull model, this is tricky because you can't just yield (i.e. return from `next`) to more than one place- return addresses form a stack! (And thus the dual problem for the push model shows up for operations with multiple inputs, like the merge join mentioned in the article.)
Overall I think a more flexible approach might be to generalize from `yield` and iterator APIs, to algebraic effects (or, lower level, delimited continuations)- these let you build both push and pull-like APIs for a single generator-like function, as well as hybrid approaches that combine aspects of both, by removing the requirement for a single privileged call stack.
It's better to think of this stuff with constantly changing underlying data, which adds a lot of "color" to the domain.
LINQ's syntax is just syntactic sugar on top of chained method calls to various interface extension methods.
Thanks to the monadic abstractions, they are actually agnostic as to whether the implementation is push or pull or a hybrid. The query is only evaluated on demand, at which point the entire graph is available and can be planned however.
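The same build-a-plan-then-evaluate idea can be sketched in Python (this is a toy stand-in for LINQ's deferred execution, not its actual implementation): the chained calls only record operators, and nothing runs until iteration, at which point the whole graph is in hand.

```python
# Hypothetical sketch of LINQ-style deferred evaluation: chaining
# builds a plan; execution is delayed until the query is iterated,
# when the entire operator graph is available to plan however.
class Query:
    def __init__(self, source, ops=()):
        self.source, self.ops = source, tuple(ops)

    def where(self, pred):
        return Query(self.source, self.ops + (("where", pred),))

    def select(self, fn):
        return Query(self.source, self.ops + (("select", fn),))

    def __iter__(self):       # evaluation happens here, on demand
        items = iter(self.source)
        for kind, f in self.ops:
            # filter/map bind f eagerly, avoiding late-binding bugs
            items = filter(f, items) if kind == "where" else map(f, items)
        return items

q = Query(range(10)).where(lambda x: x % 2 == 0).select(lambda x: x * x)
# nothing has executed yet; iterating runs the plan
result = list(q)   # [0, 4, 16, 36, 64]
```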