You can do this without synchronization by exploiting the fact that all "lanes" of a SIMD vector execute simultaneously: if you do the vector scatters "to the left" and "to the right" as separate phases, the writes within each phase cannot race with each other, and you don't need locks or atomics.
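A minimal scalar sketch of that two-phase idea (a hypothetical 1-D stencil; the names and weights are mine). Each loop corresponds to one vectorized pass, and within a pass all writes go to distinct addresses (out[i-1] for distinct i), so the lanes cannot race; the two phases just must not be interleaved:

    // Phase 1: every cell pushes a contribution to its LEFT neighbour.
    // All writes in this pass hit distinct addresses, so SIMD lanes
    // (or a vectorizing compiler) can execute them without atomics.
    void push_two_phase(const float* in, float* out, int n,
                        float w_left, float w_right) {
        for (int i = 1; i < n; ++i)
            out[i - 1] += w_left * in[i];      // scatter "to the left"
        // Phase 2: only after phase 1 completes, push to the RIGHT.
        for (int i = 0; i + 1 < n; ++i)
            out[i + 1] += w_right * in[i];     // scatter "to the right"
    }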
Whether pull or push is faster heavily depends on the algorithm - there is no silver bullet here. To write efficient parallel algorithms you need to know how to do both.
The author only appears to know how to do pull (or, if that's wrong, they at least didn't do push justice in the article), so from their POV push just doesn't work.
The article does show something very important though: pull does give you reasonable performance and is quite easy to reason about and get right. In contrast, push is quite hard due to the write-write data races the author mentions, so I don't blame the author for arriving at their conclusion that one should always do pull. It might not always be true, but it is definitely a safe place to start, particularly if you aren't super familiar with the algorithm yet and don't know how or whether push would work well.
For example, many of the convolutions that neural networks use are actually implemented using the "push" approach because that's more efficient, and they are quite similar to what the blog post author is doing.
* decompose the problem data into the subsets handled by each machine
* within each node, use pushing; this allows walking the node-internal data structure sequentially, which is very cache friendly
* for the boundary cells, use pulling (a sketch of this hybrid scheme follows below)
edit: I should add that it may also make sense to do pulling between the cores; otherwise you need locks, which could hurt performance as well.
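A rough sketch of that hybrid scheme for a hypothetical 1-D three-point stencil (the names, weights, and ownership convention are mine). Each node owns the output range [lo, hi), inputs are globally readable, and out starts zeroed; the reads of in[lo-1] and in[hi] are the "pull at the boundary":

    // Computes out[j] = 0.25*in[j-1] + 0.5*in[j] + 0.25*in[j+1] for j in [lo, hi).
    void update_node(const float* in, float* out, int lo, int hi, int n) {
        // Push within the node: stream the owned inputs once, writing
        // only into the owned output range (sequential, cache friendly).
        for (int i = lo; i < hi; ++i) {
            if (i - 1 >= lo) out[i - 1] += 0.25f * in[i];
            out[i] += 0.50f * in[i];
            if (i + 1 < hi)  out[i + 1] += 0.25f * in[i];
        }
        // Boundary cells: pull the one contribution that would otherwise
        // require the neighbouring node to write into our range.
        if (lo > 0) out[lo]     += 0.25f * in[lo - 1];
        if (hi < n) out[hi - 1] += 0.25f * in[hi];
    }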
Well, SIMD is implicit synchronization. Within a SIMD wavefront, all instructions have an implicit barrier between them (because all instructions begin, and finish, simultaneously).
This implicit synchronization may be all you need.
> Whether pull or push is faster heavily depends on the algorithm - there is no silver bullet here. To write efficient parallel algorithms you need to know how to do both.
I agree. The terminology in HPC seems to be "gather" and "scatter" operations.
It should be noted that "gather" can be converted into "scatter" with some tricks. But the programmer should be prepared to use both methods, depending on the circumstance.
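One such trick, sketched for the reverse direction (turning a scatter into a gather), under the assumption that the index map is a permutation: invert it once up front, and the scattered writes become gathered reads (function names are illustrative):

    #include <vector>

    // Scatter form: each input element writes to wherever idx sends it.
    std::vector<float> apply_scatter(const std::vector<float>& in,
                                     const std::vector<int>& idx) {
        std::vector<float> out(in.size());
        for (size_t i = 0; i < in.size(); ++i)
            out[idx[i]] = in[i];                  // scattered writes
        return out;
    }

    // Gather form of the same operation: invert the permutation once,
    // then every output element reads from exactly one place.
    std::vector<float> apply_as_gather(const std::vector<float>& in,
                                       const std::vector<int>& idx) {
        std::vector<int> inv(idx.size());
        for (size_t i = 0; i < idx.size(); ++i)
            inv[idx[i]] = (int)i;                 // one-time inversion
        std::vector<float> out(in.size());
        for (size_t i = 0; i < out.size(); ++i)
            out[i] = in[inv[i]];                  // gathered reads
        return out;
    }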
For example, suppose you are doing a 3x3 convolution over a matrix, and you want to keep the result of the convolution in cache because the next operation consumes it.
So your cache is pretty much filled with the output matrix, and you only have space in cache to pull in one row from global memory for reading.
Now consider what happens if you process your input matrix row-by-row, first with the pull approach and then with the push approach.
With the pull approach you need to read 3 input rows from memory to produce one output row. No matter how you do this, as you advance through the input, you are going to be loading some rows twice. For example, suppose that from the 3x3 window we load the rows in this order: top, bottom, mid (accumulating into the mid row of the output). In one iteration we load rows 0, 2 (evicts row 0) and 1 (evicts row 2). The next iteration loads 1 (hit) and 3 (evicts 1), and then 2 (reloads 2!). That's bad!
With the push approach, you load one row, and push partial contributions to the 3 rows of the output matrix. Then you load the next row, and do the same. The output matrix is always in cache, so those writes always hit, and you never need to load a row from memory twice.
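A minimal sketch of that push-style pass (single channel, border columns skipped for brevity; K and all the names are illustrative, not the article's code):

    #include <algorithm>

    // Stream the input one row at a time; row r scatters its partial
    // contributions into output rows r-1, r and r+1, all of which stay
    // hot in cache, so no input row is ever loaded twice.
    void conv3x3_push(const float* in, float* out, int H, int W,
                      const float K[3][3]) {
        std::fill(out, out + H * W, 0.0f);
        for (int r = 0; r < H; ++r)                // one input row per pass
            for (int dr = -1; dr <= 1; ++dr) {     // the 3 output rows it feeds
                int o = r + dr;
                if (o < 0 || o >= H) continue;
                for (int c = 1; c + 1 < W; ++c)
                    for (int dc = -1; dc <= 1; ++dc)
                        out[o * W + c] += K[1 - dr][1 + dc] * in[r * W + c + dc];
            }
    }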
So in a sense, pull optimizes for touching each row of the output only once, which is good if you want to continue working with the input afterwards, while push optimizes for keeping the output in cache, which is good if you want to continue working with the output afterwards.
One isn't really better than the other. Which one to choose depends on what it is that you want to do. Maybe you have one input matrix, and multiple convolution windows, and want to compute multiple outputs that you don't need right away. In that case, you want to keep the input matrix in cache, and compute the rows of the output one by one, evicting them from the cache after each computation.
Your cache size is fixed by the hardware and you can't change it. If you are doing a convolution, you want "(N+1)xN = cache size" to determine the size of the "blocks" (an NxN matrix, plus a 1xN row) that you process. Those blocks are also matrices, so decomposing your matrix into smaller matrices doesn't avoid this problem.
So sure, you could also decompose your problem by doing "(N+3)xN = cache size", but that means the "square blocks" that you can properly fit in cache are smaller, and you need to decompose your original problem into more sub-problems to get it done. You could also do "2Nx2N = cache size" and just load a whole input block and an output block into cache.
In both cases, you are bringing memory into cache that you only touch once, yet it continues occupying cache space after you no longer need it. You could have used that space to process larger matrices instead, and decomposed your problem into fewer blocks.
Take an image that is 1024x1024 that you want to display as a 256x256 image.
The simple way to do this is to visit every pixel in the source and write it to the display buffer [x'=x/4, y'=y/4].
Most of those writes will just get written over in the display buffer.
A more efficient way is to step through display-buffer space and calculate where each pixel came from (the reverse transform) [x'=4x, y'=4y].
PS: 256x256 is 1/16th of 1024x1024, so you're wasting 15/16 of the writes.
Which is why every graphics API out there does it pull-style.
They are equivalent, as I am trying to say. The only difference is semantics or whatever makes it easier to think about.
Notice how you not only change x * 4 into x/4, you also change x++ into x+=4
    // Pull: iterate over the destination, read the source pixel it maps to.
    for (int x=0; x<dst_width; x++)
      for (int y=0; y<dst_height; y++)
        destination[x][y] = source[x*4][y*4];

    // Push: iterate over the source; stepping by 4 skips the redundant writes.
    for (int x=0; x<src_width; x+=4)
      for (int y=0; y<src_height; y+=4)
        destination[x/4][y/4] = source[x][y];
This is the reason every graphics API I know of does these transformations by iterating over destination pixels.
The last time I did something like this, I needed to take a rectangular buffer and transform it to a circle. Each buffer line became a "spoke" in the circle. Much easier/simpler [I always strive for simple] to figure out the reverse transform and step through every "display" pixel.
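For what it's worth, a sketch of that kind of reverse transform (the exact mapping, names, and orientation are made up for illustration, not the original code):

    #include <cmath>
    #include <cstdint>

    // For every display pixel inside the circle, compute which buffer
    // line ("spoke") and offset along it the pixel came from.
    void rect_to_circle(const uint8_t* buf, int lines, int len,
                        uint8_t* display, int size) {
        const float PI = 3.14159265f;
        float cx = size / 2.0f, cy = size / 2.0f, rmax = size / 2.0f;
        for (int y = 0; y < size; ++y)
            for (int x = 0; x < size; ++x) {
                float dx = x - cx, dy = y - cy;
                float r = std::sqrt(dx * dx + dy * dy);
                if (r >= rmax) continue;                     // outside the circle
                float a = std::atan2(dy, dx) + PI;           // angle in [0, 2*pi]
                int line = int(a / (2 * PI) * lines) % lines; // which spoke
                int off  = int(r / rmax * len);              // position along it
                display[y * size + x] = buf[line * len + off];
            }
    }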
So, each actor owns a group of cells and is the only one able to modify them. This greatly simplifies most algorithms and removes a lot of synchronization issues...
This is also why in parallel algorithms you should split the outputs rather than the inputs.
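A minimal sketch of splitting the outputs (a hypothetical element-wise map; every thread may read any input, but each output element has exactly one writer, so no locks are needed):

    #include <algorithm>
    #include <thread>
    #include <vector>

    void parallel_double(const std::vector<float>& in, std::vector<float>& out,
                         int nthreads) {
        std::vector<std::thread> workers;
        size_t chunk = (out.size() + nthreads - 1) / nthreads;
        for (int t = 0; t < nthreads; ++t)
            workers.emplace_back([&, t] {
                size_t lo = t * chunk;
                size_t hi = std::min(out.size(), lo + chunk);
                for (size_t i = lo; i < hi; ++i)
                    out[i] = 2.0f * in[i];   // each element has exactly one writer
            });
        for (auto& w : workers) w.join();
    }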
A good example would be the Twitter firehose and consuming that. Other examples would be websites publishing RSS feeds. You can actually combine the two and push the information that "something changed", which would cause the receiver to pull. Caching and CDNs make publishing like this cheap and scalable.
Queues inherently lead to a pull-based model. The producer does not publish stuff directly to the consumer, but to the queue. The consumers then pull stuff at their own pace/terms from this queue.
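A minimal sketch of that pull side (a standard blocking queue; the names are illustrative):

    #include <condition_variable>
    #include <mutex>
    #include <queue>
    #include <utility>

    template <typename T>
    class WorkQueue {
        std::queue<T> q;
        std::mutex m;
        std::condition_variable cv;
    public:
        void publish(T item) {                    // producer side
            { std::lock_guard<std::mutex> l(m); q.push(std::move(item)); }
            cv.notify_one();
        }
        T pull() {                                // consumer side: blocks until ready
            std::unique_lock<std::mutex> l(m);
            cv.wait(l, [&] { return !q.empty(); });
            T item = std::move(q.front());
            q.pop();
            return item;
        }
    };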
Because if all workers pull from a single endpoint then that endpoint is a queue, no?
When running a fleet of machines, it was always better to have them pull their config rather than push to them. It was far more reliable because you didn’t have to worry about whether a machine missed a push.
It would just grab its config whenever it came back online.
Monitoring: Mostly push, except Prometheus, which is pull.
Config management: Mostly pull, except Ansible, which is push (yes, there is ansible-pull, but most Ansible code out there assumes push).
Monitoring: you don't want all your applications and even scripts to run a thread listening on a dedicated port to allow pulling. It's bad for security, code complexity, configuration complexity (all those ports), and reliability (when an application stops, the datapoints are lost).
Config management: with a push model you can easily end up having 2 developers push to the same system (or to systems that should be updated in sequence) in an unmanaged manner.
Pull-based monitoring makes it so you can remove the service and independently recognize failures.
The hardware ops desired here are vector data parallelism, in both bulk data movement and execution. When you realize that those are the instructions the device is running, yet you are using languages or libraries that hide that, you can easily end up going against the grain.
In contrast, I've been doing a lot of GPU dataframe stuff for the last 6 years, which is basically a specialized and embeddable form of the functional nested data parallel libs from the early 90's. (See NESL and the later Haskell generalizations. In the top 5 most beautiful papers I've read!)
Writing code this way basically comes down to writing closed form / point-free solutions as pure maps & scans (~= materialized views) that in turn have easily automatically optimized multicore + SIMD / GPU implementations underneath. Put differently again, columnar analytics where you try to only do pure functions over columns (tbl.col_x + tbl.col_y; tbl.col_z.max()) yet get all your rich functional language features.
Another poster correctly said to focus on vector gather over vector scatter. That is the assembly code primitive starting point for this journey. NESL's nested functional data abstractions were invented precisely for automatically compiling into this kind of extreme vectorization -- and before GPUs were even a thing! By sticking to libraries built entirely for that, it's easy to write data parallel code that can run on say 8 GPUs, with way less brain damage, and lots of automatic invisible optimizations missed here.
The fun part to me is that parallelizing this kind of cellular automata simulation maps to a class of parallel compute called 'grid codes' or 'stencil codes'. There are DSLs with specializing compilers precisely for all sorts of weird tile shapes - imagine writing a weather simulator. Think autotuning, synthesis, etc. It's a dangerously niche space. I suspect GPU dataframe libs can help put them to rest, would be cool to see what extensions it'd take!
In my experience, the main problem with expressing your algorithms in that way is "What do you do when the primitives you are using (e.g. maps and scans) do not produce optimal code?".
For example, suppose that you are doing two scans and two maps whose efficient implementation does a single memory traversal, with the two scans and maps fused into a single kernel. If whatever abstraction you are using isn't able to do this kind of kernel fusion, then you easily end up with 4x the memory traffic, which can translate into 4x slower performance.
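To make that concrete, a toy sequential version of the fusion (illustrative names; the hard part is a compiler or runtime doing this automatically, and doing it for parallel scans):

    #include <vector>

    // Four separate passes: map, scan, map, scan -- four full traversals.
    void unfused(std::vector<float>& a) {
        for (auto& x : a) x *= 2;                                // map 1
        for (size_t i = 1; i < a.size(); ++i) a[i] += a[i - 1];  // scan 1
        for (auto& x : a) x += 1;                                // map 2
        for (size_t i = 1; i < a.size(); ++i) a[i] += a[i - 1];  // scan 2
    }

    // The same computation fused into a single traversal: both running
    // sums live in registers, so each element is loaded and stored once.
    void fused(std::vector<float>& a) {
        float s1 = 0.0f, s2 = 0.0f;
        for (auto& x : a) {
            s1 += x * 2.0f;   // map 1 + scan 1
            s2 += s1 + 1.0f;  // map 2 + scan 2
            x = s2;
        }
    }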
I don't know of any language or optimizing runtime that's able to do these transformations and match the performance of the CUDA code one would have written by hand.
Something like single-gpu cudf (think pandas on GPUs) will have that issue because it is eager/concrete. When going to multi-GPU with dask-cudf or say BlazingSQL, the deferred symbolic compute means they can add in fusion, deforestation, etc. That is exactly what happened with Spark's CPU-era compiler over the years.
In the large, we have a giant code base for GPU visual graph analytics. We hand-write maybe 1% of the frontend and 1% of the backend. The libs underneath likewise have handwritten code where it matters, but are also built on the abstraction stack. So yes, we absolutely handwrite... in the tiny minority of cases :)
In practice, I care less about the 4x thing here -- rather get more of our code vectorized and more devs productive on it. In the small, what ends up mattering (for us) is stuff like making sure they do not do excessive copying, which is more of an artifact of trying to match Python Pandas's API design bug-for-bug. Likewise, stuff like testing for performance regressions across upgrades. We have been building a style guide and tools to ensure this, and are able to use practices similar to any big non-GPU software project (lint, CI, etc).
In contrast, a codebase of handwritten CUDA would be incredibly small and quickly outdated. We have enough stress in our lives as is :)
You want "pretty" OOP? Tell Don't Ask.
You want performance and reliable multithreading? Command-Query Separation. Which is basically the same as functional programming :) Make all the functions pure except for a few clearly marked ones with side effects.
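A tiny illustration of that split (hypothetical type): the queries are pure and safe to call from any thread, and the single clearly marked command is the only place state changes:

    #include <vector>

    struct Inventory {
        std::vector<int> counts;

        // Query: pure, no side effects, safe to run concurrently.
        int total() const {
            int s = 0;
            for (int c : counts) s += c;
            return s;
        }

        // Command: the one clearly marked place where state changes.
        void add(int item, int n) { counts[item] += n; }
    };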
It's especially visible in gamedev. After a decade of trying to make OOP work with multithreading people moved on to Entity Component Systems (which is basically imperative programming with some relational and functional aspects).
It's also why functional programming is getting more popular - because it is pull by default. It's also why almost all graphical shaders are functional even if they look like C :) Many inputs one output and no side effects :)
That's not functional programming, it's just pure functions. Shaders make use of sequential execution of statements, and they don't support recursion.
It is the pure-function nature of functional programming that makes it easy to run tasks concurrently and in parallel. Recursion is neither here nor there.
Code is functional if it separates side-effects to small, clearly marked parts of the codebase.
I have a background in databases and it's always interesting how ECS faces some of the same problems the DB field faced; they just haven't realised it yet. They require super fast joins to figure out which components to feed to the systems, for example.
I'd predict massive cool strides in the DB and ECS/Gamedev/Simulation world, once the two realise how much they can learn from one another.
UUIDs are entities, for easy distribution.
Components are just entity attribute subsets in the DB.
Systems are external components subscribing to queries.
This wouldn't have all the nice memory layout properties of ECS, but the thing is that "normal" doesn't need fast. "Normal" needs correct, and provenance, and accountability.
If you wanted subparts in such an architecture that are high speed, you'd make them regular ECS components and have them log into the append-only store at intervals/checkpoints/whatever.
When stabilizing a sandpile that begins in a symmetric configuration, it can never enter a non-symmetric configuration. Therefore, instead of computing the whole 10000x10000 pile, you can compute just one 5000x5000 quarter of it and add a special case for the internal edges: they should not subtract 4 when overflowing, but instead subtract 3 (or 2 for the corner), because they will always trade grains with their mirror image.
Although it's a bit more complex, you can also improve upon that by tracking not a quarter of the grid, but instead 1/8th by using a triangle. When tracking a triangle the cells on the main diagonal of the square are not mirrored, but instead their neighbors are. This may not work quite as well with their parallelism setup though.
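A sketch of the simpler quarter-grid version (assuming the usual topple-at-4 rule, a pile symmetric about both axes, and the stored quarter's right and bottom edges lying on the mirror axes; a naive stabilization loop, for clarity rather than speed):

    #include <vector>

    void stabilize_quarter(std::vector<std::vector<long>>& g) {
        int N = (int)g.size();                       // quarter side, e.g. 5000
        bool changed = true;
        while (changed) {
            changed = false;
            for (int y = 0; y < N; ++y)
                for (int x = 0; x < N; ++x) {
                    if (g[y][x] < 4) continue;
                    changed = true;
                    bool mx = (x == N - 1), my = (y == N - 1); // on a mirror edge?
                    // Mirror neighbours send a grain back, so lose 4, 3 or 2.
                    g[y][x] -= 4 - (mx ? 1 : 0) - (my ? 1 : 0);
                    if (x > 0) g[y][x - 1] += 1;     // grains past x=0/y=0 fall off
                    if (!mx)   g[y][x + 1] += 1;
                    if (y > 0) g[y - 1][x] += 1;
                    if (!my)   g[y + 1][x] += 1;
                }
        }
    }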
For instance, programmers of reciprocal space density-functional codes are certainly familiar with symmetry groups and (ir)reducible k-point grids etc.
This is mainly due to the fact that the consumer knows when it is ready to consume.