
When Parallel: Pull, Don't Push - signa11
https://nullprogram.com/blog/2020/04/30/
======
fluffything
> If we’re using concurrency, that means multiple threads of execution may be
> mutating the same cell, which requires synchronization — locks, atomics,
> etc. That much synchronization is the death knell of performance.

You can do this without synchronization by exploiting the fact that all
"lanes" of a SIMD vector execute simultaneously: if you do vector scatters
"to the left" and then "to the right", those writes will not race with each
other and don't require locks or atomics.
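
A minimal scalar sketch of the idea (my illustration, not from the article): do all the "left" pushes in one pass and all the "right" pushes in a second pass, so that within each pass every write targets a distinct cell:

    /* Push each cell's contribution to both neighbors, in two passes. */
    void push_neighbors(const float *in, float *out, int n)
    {
        /* Pass 1: "left" pushes. Each iteration writes a distinct cell,
           so the loop vectorizes as a plain shifted store, race-free. */
        for (int i = 1; i < n; i++)
            out[i - 1] += in[i];
        /* Pass 2: "right" pushes, likewise race-free within the pass. */
        for (int i = 0; i < n - 1; i++)
            out[i + 1] += in[i];
    }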

Whether pull or push is faster heavily depends on the algorithm - there is no
silver bullet here. To write efficient parallel algorithms you need to know
how to do both.

The author only appears to know how to do pull (if that's wrong, they at
least didn't do push justice in the article), so from their POV push just
doesn't work.

The article does show something very important though: pull does give you
reasonable performance and is quite easy to reason about and get right. In
contrast, push is quite hard due to the write-write data-races the author
mentions, so I don't blame the author for arriving at their conclusion that
one should always do pull. It might not always be true, but it is definitely a
safe place to start, particularly if you aren't super familiar with the
algorithm yet and don't know how or if push would work well.

For example, many of the convolutions that neural networks use are actually
implemented using the "push" approach because that's more efficient, and they
are quite similar to what the blog post author is doing.

~~~
sitkack
What does push gain you over pull? Why would someone want to use it? Does it
help with in-place updates vs having to duplicate storage?

~~~
fluffything
The problem it solves is temporal cache locality; whether pull or push is
better often depends only on what you want to keep in cache for the next
operation.

For example, suppose you are doing a 3x3 convolution over a matrix, and you
want to keep the result of the convolution in cache because the next operation
consumes it.

So your cache is pretty much filled with the output matrix, and you only have
space in cache to pull in one row from global memory for reading.

Now consider what happens if you process your input matrix row-by-row, first
with the pull approach and then with the push approach.

With the pull approach you need to load 3 rows from memory into cache before
you can write one output row. No matter how you do this, as you advance
through the input, you are going to be loading some rows twice. For example,
suppose that from the 3x3 window, we load the rows in this order: top,
bottom, mid (and then write the convolution result to the mid row of the
output). In one iteration we load rows 0, 2 (evicts row 0) and 1 (evicts row
2). The next iteration loads 1 (hit) and 3 (evicts 1), and then 2 (reloads
2!). That's bad!

With the push approach, you load one row, and push partial contributions to
the 3 rows of the output matrix. Then you load the next row, and do the same.
The output matrix is always in cache, so those writes always hit, and you
never need to load a row from memory twice.

So in a sense, pull optimizes for bringing the rows of the output into cache
once, which is good if you want to continue working with the input afterwards,
while push optimizes for keeping the output in cache, which is good if you
want to continue working with the output afterwards.
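
In code, the two traversal orders look like this (an illustrative scalar sketch of my own, not the post's code; k is a hypothetical 3x3 kernel, out is assumed zeroed, and boundary rows are handled loosely):

    enum { H = 1024, W = 1024 };

    /* Pull: iterate over outputs; each output row reads 3 input rows. */
    void conv_pull(float out[H][W], float in[H][W], float k[3][3])
    {
        for (int r = 1; r < H - 1; r++)
            for (int c = 1; c < W - 1; c++)
                for (int dr = -1; dr <= 1; dr++)
                    for (int dc = -1; dc <= 1; dc++)
                        out[r][c] += k[dr + 1][dc + 1] * in[r + dr][c + dc];
    }

    /* Push: iterate over inputs; each input row scatters partial sums
       into 3 output rows (edge rows only receive partial sums here). */
    void conv_push(float out[H][W], float in[H][W], float k[3][3])
    {
        for (int r = 1; r < H - 1; r++)
            for (int c = 1; c < W - 1; c++)
                for (int dr = -1; dr <= 1; dr++)
                    for (int dc = -1; dc <= 1; dc++)
                        out[r + dr][c + dc] += k[1 - dr][1 - dc] * in[r][c];
    }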

One isn't really better than the other. Which one to choose depends on what
you want to do. Maybe you have one input matrix, and multiple
convolution windows, and want to compute multiple outputs that you don't need
right away. In that case, you want to keep the input matrix in cache, and
compute the rows of the output one by one, evicting them from the cache after
each computation.

~~~
grogers
If you can't keep 3 rows of the input matrix in cache, why would you be able
to keep 3 rows of the output matrix in cache? If the matrices are large
enough that they don't fit in cache, shouldn't they be processed in squarish
blocks recursively to maximize cache reuse (not in rows)?

~~~
fluffything
> If you can't keep 3 rows of the input matrix in cache, why would you be
> able to keep 3 rows of the output matrix in cache? If the matrices are
> large enough that they don't fit in cache, shouldn't they be processed in
> squarish blocks recursively to maximize cache reuse (not in rows)?

Your cache size is fixed by the computer and you can't change it. If you are
doing a convolution, you want "(N+1)xN = cache size" to determine the size of
the "blocks" (an NxN matrix, plus a 1xN row) of the convolution you process.
Those blocks are also matrices, so by decomposing your matrix into smaller
matrices you don't avoid this problem.

So sure, you could also decompose your problem by doing "(N+3)xN = cache
size", but that means that the size of the "square blocks" that you can now
properly fit in cache is smaller, and you need to decompose your original
problem into more sub-problems to get it done. You could also do: "2Nx2N =
cache size", and just load a whole input block and an output block into
cache.

In both cases, you are bringing memory into cache that you only touch once,
yet that keeps occupying cache space after you no longer need it. You could
have used that space to process larger blocks instead, and decomposed your
problem into fewer sub-problems.
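
To put toy numbers on it (my own example): if the cache holds 72 cells, an
NxN output block plus one input row fits with N = 8 (8x9 = 72), while an NxN
block plus three input rows only fits with N = 7 (7x10 = 70), so the latter
decomposition needs more, smaller blocks to cover the same matrix.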

------
saagarjha
A surprisingly simple way to remove synchronization in certain cases. It’s
mentioned halfway through, but it’s still worth noting: you can
only use this if your operations are commutative, or you don’t mind the lack
of ordering. So it’s good for things like cellular automata when you have
double buffering but less good when you’re touching the input directly.

~~~
pvillano
an important note

------
chrisbennet
Shaders are a great place to use "pull".

Take an image that is 1024x1024 that you want to display as a 256x256 image.

The simple way to do this is to visit every pixel in the source and write it
to the display buffer [x'=x/4, y'=y/4]. 3 out of 4 of those writes will just
get written over in the display buffer.

A more efficient way is to step through display buffer space and calculate
where each pixel came from (reverse transform) [x'=4x, y'=4y].

~~~
heavenlyblue
No, they’re equivalent. Nobody stops you from stepping through pixels in
multiples of 4 and then writing that to the display buffer.

~~~
ajuc
You can write code both ways, but it's much easier to ensure every
destination pixel is handled exactly once when you iterate over the
destination pixels than when you iterate over the source pixels. It becomes
especially visible when you add perspective distortion and other
complications.

Which is why every graphics API out there does it pull-style.

~~~
heavenlyblue
Drawing triangles on a screen is a push-style operation, isn’t it?

~~~
ajuc
Yes, from the POV of lists of polygons. But from the point of view of
textures it is pull (you go through destination pixels and fill them with
source pixels fetched through a transformation).

~~~
heavenlyblue
In the same way you could apply the reverse transform to the polygon and get
the coordinates of display pixels in texture coordinates and then do a push
from there.

They are equivalent, as I am trying to say. The only difference is semantics
or whatever makes it easier to think about.

~~~
ajuc
You would have to skip a variable number of pixels in the source texture
according to the transformation, then calculate the reverse transformation
to know where to push.

Notice how you not only change x*4 into x/4, you also change x++ into x+=4:

    /* Pull: iterate over destination pixels (width and height are the
       destination's dimensions). */
    for (int x = 0; x < width; x++)
        for (int y = 0; y < height; y++)
            destination[x][y] = source[x*4][y*4];

vs

    /* Push: iterate over source pixels (width and height are now the
       source's dimensions). */
    for (int x = 0; x < width; x += 4)
        for (int y = 0; y < height; y += 4)
            destination[x/4][y/4] = source[x][y];

If it's as simple as scaling, x+=4 isn't much harder than x++. But for
complex transformations, calculating how many pixels to skip in any
particular row will be slow and complicated. Meanwhile, when doing pull, you
don't need to do it.
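
For example (a sketch of my own; invert() and sample() are hypothetical helpers), pull stays trivial even under a perspective warp, because each destination pixel independently computes where it came from:

    /* Pull with an arbitrary transform: walk the destination and
       fetch each pixel through the inverse mapping. */
    for (int x = 0; x < dst_width; x++)
        for (int y = 0; y < dst_height; y++) {
            float sx, sy;
            invert(x, y, &sx, &sy);      /* e.g. inverse perspective */
            destination[x][y] = sample(source, sx, sy);
        }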

This is the reason every graphical API I know of does these transformations
by iterating over destination pixels.

------
slashvar2701
That's a somewhat misleading wording for a classic rule of parallel
algorithms: a thread is the owner of its computations.

So, each actor owns a group of cells and is the only one able to modify them.
This greatly simplifies most algorithms and removes a lot of synchronization
issues.

This is also why in parallel algorithms you should split the outputs rather
than the inputs.
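
A minimal sketch of that ownership rule (illustrative names, not from any particular codebase): split the output array across threads, so writes never overlap while reads may be shared freely:

    #include <pthread.h>

    #define N 1024
    #define NTHREADS 4

    static double in[N], out[N];

    static void *worker(void *arg)
    {
        long t = (long)arg;
        long lo = t * N / NTHREADS, hi = (t + 1) * N / NTHREADS;
        for (long i = lo; i < hi; i++)  /* this thread owns out[lo..hi) */
            out[i] = 2.0 * in[i];       /* reads may be shared; writes never */
        return NULL;
    }

    int main(void)
    {
        pthread_t th[NTHREADS];
        for (long t = 0; t < NTHREADS; t++)
            pthread_create(&th[t], NULL, worker, (void *)t);
        for (long t = 0; t < NTHREADS; t++)
            pthread_join(th[t], NULL);
        return 0;
    }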

~~~
cogman10
Often, I find that "if you are synchronizing, you are wrong". That doesn't
mean you can't or shouldn't use synchronization in cases where message
passing is impractical. However, it does mean that it should be a last choice
rather than a first choice. It is simply far too easy to get locks wrong.

------
jillesvangurp
It's also an argument against using queues in some cases. Instead of
publishing data on a queue, you can make consumers of the data pull it.
Especially if you have a 1-to-many relationship between the producer and the
consumers of the data, that moves the problem of ensuring data is received
out of the infrastructure and onto the consumer side. Staying up to date is
now their problem, and they can ensure they are up to date by simply pulling
regularly.

A good example would be the twitter firehose and consuming that. Other
examples would be websites publishing RSS feeds. You can actually combine the
two and push the information that "something changed" which would cause the
receiver to pull. Caching and CDNs make publishing like this cheap and
scalable.

~~~
kqr
Well, what do the consumers pull _from_, if not a queue? What is an RSS feed
if not a FIFO queue of published pages?

Queues inherently lead to a pull-based model. The producer does not publish
stuff directly to the consumer, but to the queue. The consumers then pull
stuff at their own pace/terms from this queue.

~~~
jillesvangurp
A REST API. An RSS feed is a simple HTTP GET that returns a document. You
can cache it, put it in a CDN. There are no queues involved here.

~~~
elteto
Can you expand on this? So each worker gets its own REST endpoint to pull
from?

Because if all workers pull from a single endpoint then that endpoint is a
queue, no?

------
w0utert
This is pretty much the analogue of gather vs scatter in GPGPU programming.
It's a well-known fact that in GPU programming, in almost any case a gather
approach (threads map to outputs) works better than a scatter approach
(threads map to inputs). The challenge is to transform the algorithm to still
have some data locality for the reads to allow for caching, coalesced reads
into local memory, etc., which can be very hard or infeasible.
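
In loop form (my sketch; idx is an arbitrary index mapping), the difference is which side of the assignment carries the indirection:

    /* Gather (threads map to outputs): each out[i] is written exactly
       once regardless of idx, so parallel writes never race. */
    void gather(float *out, const float *in, const int *idx, int n)
    {
        for (int i = 0; i < n; i++)
            out[i] = in[idx[i]];
    }

    /* Scatter (threads map to inputs): two iterations may write the
       same out[idx[i]], so parallel writes can collide. */
    void scatter(float *out, const float *in, const int *idx, int n)
    {
        for (int i = 0; i < n; i++)
            out[idx[i]] = in[i];
    }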

------
jedberg
Sysadmins have been using this principle for decades.

When running a fleet of machines, it was always better to have them pull their
config rather than push to them. It was far more reliable because you didn’t
have to worry about whether a machine missed a push.

It would just grab its config whenever it came back online.

~~~
jabl
Well yes, but like another comment here on HN points out, it's not universal,
and there are trade-offs, and different systems have chosen differently. E.g.

Monitoring: mostly push, except Prometheus, which is pull.

Config management: mostly pull, except Ansible, which is push (yes, there is
ansible-pull, but most Ansible code out there assumes push).

~~~
eeZah7Ux
In both monitoring and config management, the majority is right.

Monitoring: you don't want to have all your applications and even scripts run
a thread to listen on a dedicated port to allow pulling. It's bad for
security, code complexity, configuration complexity (all those ports) and
reliability (when an application stops the datapoints are lost).

Config management: with a push model you can easily end up having 2 developers
push to the same system (or to systems that should be updated in sequence) in
an unmanaged manner.

~~~
longcommonname
Push-based monitoring has its own flaws. Mainly, service discovery is
coupled with service health.

Pull-based monitoring makes it so you can remove the service and
independently recognize failures.

~~~
eeZah7Ux
You can do the same easily with push.

------
lmeyerov
EDIT: Totally forgot we are giving a webinar tutorial on this on Friday as
part of the Security (Analytics) Jupyterthon. See the Python GPU one-liners
talk @
[https://infosecjupyterthon.com/agenda.html](https://infosecjupyterthon.com/agenda.html)
!

The hardware optimizations desired here are vector data parallelism, both in
bulk data movement and execution. When you realize that those are the
instructions the device is running, yet you are using languages or libraries
that hide that, you can easily end up going against the grain.

In contrast, I've been doing a lot of GPU dataframe stuff for the last 6
years, which is basically a specialized and embeddable form of the functional
nested data parallel libs from the early '90s. (See NESL and later Haskell
generalizations -- in the top 5 most beautiful papers I've read!)

Writing code this way basically comes down to writing closed form / point-free
solutions as pure maps & scans (~= materialized views) that in turn have
easily automatically optimized multicore + SIMD / GPU implementations
underneath. Put differently again, columnar analytics where you try to only do
pure functions over columns (tbl.col_x + tbl.col_y; tbl.col_z.max()) yet get
all your rich functional language features.

Another poster correctly said to focus on vector gather over vector scatter.
That is the assembly code primitive starting point for this journey. NESL's
nested functional data abstractions were invented precisely for automatically
compiling into this kind of extreme vectorization -- and before GPUs were even
a thing! By sticking to libraries built entirely for that, it's easy to write
data parallel code that can run on say 8 GPUs, with way less brain damage, and
lots of automatic invisible optimizations missed here.

The fun part to me is that parallelizing this kind of cellular automata
simulation maps to a class of parallel compute called 'grid codes' or 'stencil
codes'. There are DSLs with specializing compilers precisely for all sorts of
weird tile shapes - imagine writing a weather simulator. Think autotuning,
synthesis, etc. It's a dangerously niche space. I suspect GPU dataframe libs
can help put them to rest, would be cool to see what extensions it'd take!

~~~
fluffything
Which languages are you using?

In my experience, the main problem with expressing your algorithms in that
way is: what do you do when the primitives you are using (e.g. maps and
scans) do not produce optimal code?

For example, suppose that you are doing two scans and two maps, whose
efficient implementation does a single memory traversal with the two scans
and maps fused into a single kernel. If whatever abstraction you are using
isn't able to do this type of kernel fusion, then you can easily end up using
4x the memory bandwidth, which can translate into 4x slower performance.
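
Concretely (a schematic single-threaded sketch of my own; f and g stand in for the mapped functions):

    double f(double), g(double);  /* stand-ins, defined elsewhere */

    /* Unfused: four traversals of the data, roughly 4x the traffic. */
    void map_scan_unfused(double *a, double *b, const double *x, int n)
    {
        for (int i = 0; i < n; i++) a[i] = f(x[i]);    /* map 1  */
        for (int i = 1; i < n; i++) a[i] += a[i - 1];  /* scan 1 */
        for (int i = 0; i < n; i++) b[i] = g(x[i]);    /* map 2  */
        for (int i = 1; i < n; i++) b[i] += b[i - 1];  /* scan 2 */
    }

    /* Fused: a single traversal; both running sums stay in registers. */
    void map_scan_fused(double *a, double *b, const double *x, int n)
    {
        double sa = 0, sb = 0;
        for (int i = 0; i < n; i++) {
            sa += f(x[i]); a[i] = sa;
            sb += g(x[i]); b[i] = sb;
        }
    }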

I don't know of any language or optimizing runtime that's able to do these
transformations and match the performance of the CUDA code one would have
written by hand.

~~~
lmeyerov
I agree in the small and disagree in the large.

Something like single-GPU cudf (think pandas on GPUs) will have that issue
because it is eager/concrete. When going to multi-GPU with dask-cudf or say
BlazingSQL, the deferred symbolic compute means they can add in fusion,
deforestation, etc. That is exactly what happened with Spark's CPU-era
compiler over the years.

In the large, we have a giant code base for GPU visual graph analytics. We
hand write maybe 1% of the frontend and 1% of the backend. The libs underneath
likewise have handwritten code where it matters, but are also built on the
abstraction stack. So yes, we absolutely handwrite... in the tiny minority of
cases :)

In practice, I care less about the 4x thing here -- I'd rather get more of
our code vectorized and more devs productive on it. In the small, what ends up
mattering (for us) is stuff like making sure they do not do excessive copying,
which is more of an artifact of trying to match Python Pandas's API design
bug-for-bug. Likewise, stuff like testing for performance regressions across
upgrades. We have been building a style guide and tools to ensure this, and
are able to use practices similar to any big non-GPU software project (lint,
CI, etc).

In contrast, a codebase built on handwritten CUDA would be incredibly small
and quickly outdated. We have enough stress in our lives as is :)

~~~
fluffything
That makes perfect sense, and it appears that you always have the option to
hand-roll something if you discover a bottleneck worth fixing.

------
qubex
This reminds me of an oddball comment I read somewhere to the effect that
“_come-from_, despite all the humour, turns out to be a very sensible
parallel-safe alternative to _go-to_”. Can’t remember the exact wording nor
the source, though (unfortunately, because I’ve taken it to heart as very
deep architectural wisdom).

~~~
dehrmann
If it helps, I've heard "return-from" used to describe Aspect-oriented
programming.

------
defenestration
This seems counterintuitive as it is the opposite of the programming ‘rule’:
Tell, don’t ask.

~~~
ajuc
Yes. And it's one of the most important reasons OOP is so bad at
multithreading.

You want "pretty" OOP? Tell Don't Ask.

You want performance and reliable multithreading? Command-Query Separation.
Which is basically the same as functional programming :) Make all the
functions pure except for a few clearly marked ones with side effects.

It's especially visible in gamedev. After a decade of trying to make OOP
work with multithreading, people moved on to Entity Component Systems (which
is basically imperative programming with some relational and functional
aspects).

It's also why functional programming is getting more popular - because it is
pull by default. It's also why almost all graphical shaders are functional
even if they look like C :) Many inputs one output and no side effects :)

~~~
MaxBarraclough
> graphical shaders are functional even if they look like C :) Many inputs one
> output and no side effects :)

That's not functional programming, it's just pure functions. Shaders make use
of sequential execution of statements, and they don't support recursion.

~~~
alecmg
You are splitting hairs over definitions.

It is the pure-function nature of functional programming that makes it easy
for concurrent and parallel tasks. Recursion is neither here nor there.

~~~
MaxBarraclough
It's not hair-splitting. As I understand it, functional shader languages are
an active area of research. Mainstream shader languages such as HLSL are most
certainly not functional.

~~~
ajuc
Languages are functional if they support higher order functions, recursion,
closures, etc.

Code is functional if it separates side-effects to small, clearly marked parts
of the codebase.

------
ball_of_lint
The author claims that their code is optimized, yet it still took 10 days to
run. Unfortunately, they missed a key observation that would have cut their
runtime to roughly a quarter, or better.

When stabilizing a sandpile that begins in a symmetric configuration, it can
never enter a non-symmetric configuration. Therefore instead of computing the
whole 10000x10000 pile, you can compute just one 5000x5000 quarter of it, and
add a special case for the internal edges: those cells should not subtract 4
when overflowing, but instead subtract 3 (or 2 for the corner), because they
will always trade grains with their mirror image.
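
A sketch of how those mirror edges might look (my own reading of the idea, not the post's code; the quarter's internal edges sit at index N-1, and the topple order doesn't affect the final state because the sandpile is abelian):

    enum { N = 5000 };
    static int pile[N][N];   /* one quarter of the symmetric pile */

    /* One toppling sweep; returns nonzero if any cell toppled. */
    int topple(void)
    {
        int changed = 0;
        for (int y = 0; y < N; y++)
            for (int x = 0; x < N; x++)
                if (pile[y][x] >= 4) {
                    /* each mirrored edge sends one grain straight back,
                       so the net loss drops from 4 to 3 (2 in the corner) */
                    pile[y][x] -= 4 - (x == N - 1) - (y == N - 1);
                    if (x > 0)     pile[y][x - 1]++;
                    if (x < N - 1) pile[y][x + 1]++;
                    if (y > 0)     pile[y - 1][x]++;
                    if (y < N - 1) pile[y + 1][x]++;
                    changed = 1;
                }
        return changed;
    }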

Although it's a bit more complex, you can also improve upon that by tracking
not a quarter of the grid, but instead 1/8th by using a triangle. When
tracking a triangle the cells on the main diagonal of the square are not
mirrored, but instead their neighbors are. This may not work quite as well
with their parallelism setup though.

~~~
allover
Very interesting optimisations, but I would not say the author "missed" these
points, since this wasn't the point of their post.

------
loeg
This is kind of silly. The author has reduced the number of mutations per
entry by a factor of up to 4. That's why contention is reduced (less M(O)ESI
eviction of shared lines). It happens that this problem is amenable to that
kind of computation, but that isn't always the case.

------
polskibus
Seems a bit similar to the Volcano model in databases, where iterators pull
the next tuple.

------
lmilcin
It is much easier to build a reliable distributed system when each component
consumes (pulls) its input and then publishes its results.

This is mainly because the consumer knows when it is ready to consume.

------
code-faster
Designs based on pulling are also faster to code than their push equivalents,
especially when parallel or in a client server model.

