Hacker News new | past | comments | ask | show | jobs | submit login
When Parallel: Pull, Don't Push (nullprogram.com)
245 points by signa11 26 days ago | hide | past | web | favorite | 61 comments

> If we’re using concurrency, that means multiple threads of execution may be mutating the same cell, which requires synchronization — locks, atomics, etc. That much synchronization is the death knell of performance.

You can do this without synchronization by exploiting the fact that all "lanes" of a SIMD vector execute simultaneously, e.g. by doing vector scatters "to the left" and then "to the right" those writes will not race with each other, and don't require locks or atomics.

Whether pull or push is faster heavily depends on the algorithm - there is no silver bullet here. To write efficient parallel algorithms you need to know how to do both.

The author only appears to know how to do pull (if that's wrong, they didn't really do push any justice in the article), so from their POV push just doesn't work.

The article does show something very important though: pull does give you reasonable performance and is quite easy to reason about and get right. In contrast, push is quite hard due to the write-write data-races the author mentions, so I don't blame the author from arriving at their conclusion that one should always do pull. It might not always be true, but it is definitely a safe place to start, particularly if you aren't super familiar with the algorithm yet and don't know how or if push would work well.

For example, many of the convolutions that neural networks use are actually implemented using the "push" approach because that's more efficient, and they are quite similar to what the blog post author is doing.

Having done quite a bit of HPC in my research career, the usually best approach is (IMO) to:

* decompose the problem data to the subsets handled by each machine

* within each node, use pushing. this allows walking the node-internal data structure sequentially, thus is very cache friendly

* for the boundary cells, use pulling

edit: I should add, it may also make sense to do pulling between the cores, otherwise you need locks which could hurt performance as well.

> You can do this without synchronization by exploiting the fact that all "lanes" of a SIMD vector execute simultaneously, e.g. by doing vector scatters "to the left" and then "to the right" those writes will not race with each other, and don't require locks or atomics.

Well, SIMD is implicit synchronization. Within a SIMD wavefront, all instructions have an implicit barrier between them (because all instructions begin, and finish, simultaneously).

This implicit synchronization may be all you need.

> Whether pull or push is faster heavily depends on the algorithm - there is no silver bullet here. To write efficient parallel algorithms you need to know how to do both.

I agree. The terminology in HPC seems to be "gather" and "scatter" operations.

It should be noted that "gather" can be converted into "scatter" with some tricks. But the programmer should be prepared in using both methods, depending on the circumstance.

Convolution is a great illustration of the difference between the push and pull approaches. You can implement (one-dimensional, but generalizes easily to any dimension) convolution either by assigning each point a weighted average centered at that point, computed over all points (pull), or by sliding the weight function along the interval and accumulating (summing) the weighted value of the current center of the weight function at each point (push). The push approach is optimized for reads (linear reads, quadratic writes), and the pull approach is optimized for writes (quadratic reads, linear writes).

What does push gain you over pull? Why would someone want to use it? Does it help with inplace updates vs having to duplicate storage?

The problem it solves is temporal cache locality, and whether pull or push is better often only depends on what is it that you want to keep in cache for the next operation.

For example, suppose you are doing a 3x3 convolution over a matrix, and you want to keep the result of the convolution in cache because the next operation consumes it.

So your cache is pretty much filled with the output matrix, and you only have space in cache to pull in one row from global memory for reading.

Now consider what happens if you process your input matrix row-by-row, first with the pull approach and then with the push approach.

With the pull approach you need to load 3 rows from memory into cache before you can push. No matter how you do this, as you advance through the input, you are going to be loading some rows twice. For example, suppose that from the 3x3 window, we load the rows in this order: top, bottom, mid (and then push the convolution to the mid row of the output). In one iteration we load rows 0, 2 (evicts row 0) and 1 (evicts row 2). The next iteration loads 1 (hit) and 3 (evicts 1), and then 2 (reloads 2!). That's bad!

With the push approach, you load one row, and push partial contributions to the 3 rows of the output matrix. Then you load the next row, and do the same. The output matrix is always in cache, so those writes always hit, and you never need to load a row from memory twice.

So in a sense, pull optimizes for bringing the rows of the output into cache once, which is good if you want to continue working with the input afterwards, while push optimizes for keeping the output in cache, which is good if you want to continue working with the output afterwards.

One isn't really better than the other. Which one to choose depends on what is it that you want to do. Maybe you have one input matrix, and multiple convolution windows, and want to compute multiple outputs that you don't need right away. In that case, you want to keep the input matrix in cache, and compute the rows of the output one by one, evicting them from the cache after each computation.

If you can't keep 3 rows of the input matrix in cache, why would you be able to keep 3 rows of the output matrix in cache? If the matrices are large enough that they don't fit in cache, shouldn't they be processed in squareish blocks recursively to maximize cache reuse (not in rows)?

> If you can't keep 3 rows of the input matrix in cache, why would you be able to keep 3 rows of the output matrix in cache? If the matrices are large enough that they don't fit in cache, shouldn't they be processed in squareish blocks recursively to maximize cache reuse (not in rows)?

Your cache size is fixed by the computer and you can't change it. If you are doing a convolution, you want "(N+1)xN = cache size" to determine the size of the "blocks" (NxN matrix, and an 1xN row) of the convolution you process. Those blocks are also matrices, so by decomposing your matrix into smaller matrices you don't avoid this problem.

So sure, you could also decompose your problem by doing "(N+3)xN = cache-size", but that means that the size of the "square blocks" that you can now properly fit in cache is smaller, and need to decompose your original problem into more sub-problems to get it done. You could also do: "2Nx2N = cache size", and just load a whole input block and an output block into cache.

In both cases, you are bringing memory into cache that you only touch once, yet continue occupying cache space after you don't need them anymore. You could have used that space to process larger matrices instead, and have to decompose your problem into fewer blocks.

A surprisingly simple way to be able to remove synchronization in certain cases. It’s mentioned halfway through, but it’s still worth noting: you can only use this if your operations are commutative, or you don’t mind the lack of ordering. So it’s good for things like cellular automata when you have double buffering but less good when you’re touching the input directly.

an important note

Shaders are are a great place to use "pull".

Take an image that is 1024x1024 that you want to display as a 256x256 image.

The simple way to do this is to visit every pixel in the the source and write it to the display buffer [x'=x/4, y'=y/4]. 3 out 4 of those writes will just get written over in the display buffer.

A more efficient way is step though display buffer space and calculate where that pixel came from (reverse transform) [x'=4x, y'=4y]

Both push and pull are used in graphics. Creating the classic z-buffer is used when pushing a lot of polygons is the fastest option even if that seems wasteful. Face culling on the other hand is classic pull technique where you first figure out if you want something and only then compute the details.

PS: 256x256 is 1/16th 1024x1024 so your wasting 15/16 pixels.

If you want to be general polygon rasterization is push and raytracing is pull :)

Oh you're right!

No, they’re equivalent. Nobody stops you from from stepping through pixels in multiples of 4 and then writing that to the display buffer.

You can write code both ways but it's much easier to ensure every destination pixel is handled exactly once when you interate over the destination pixels, than if you do it iterating over the source pixels. It becomes especially visible when you add perspective distortion and other complications.

Which is why every graphic api out there does it pull-style.

Drawing triangles on a screen is a push-style operation, isn’t it?

Yes, from the POV of lists of polygons. But from the point view of textures it is pull (you go through destination pixels and fill them with source pixels fetched through transformation).

In the same way you could apply the reverse transform to the polygon and get the coordinates of display pixels in texture coordinates and then do a push from there.

They are equivalent, as I am trying to say. The only difference is semantics or whatever makes it easier to think about.

You would have to skip variable number of pixels in source texture according to the transformation, then calculate the reverse transformation to know where to push.

Notice how you not only change x * 4 into x/4, you also change x++ into x+=4

    for (int x=0; x<width; x++)
        for (int y=0; y<height; y++)
           destination[x][y] = source[x*4][y*4];

    for (int x=0; x<width; x+=4)
        for (int y=0; y<height; y+=4)
           destination[x/4][y/4] = source[x][y];
If it's as simple as scaling: x+=4 isn't much harder than x++. But for complex transformations calculation of how many pixels to skip in any particular row will be slow and complicated. Meanwhile when doing pull you don't need to do it.

This is the reason every graphical API i know of does these tranformations iterating over destination pixels.

Yes, you are of course correct for this very simple example.

The last time I did something like this, I needed to take a rectangular buffer and transform it to a circle. Each buffer line became a "spoke" in the circle. Much easier/simpler [I always strive for simple] to figure out the reverse transform and step through every "display" pixel.

Also, what if you want to use a better scaling method.

That's a somewhat misleading wording for a classic rule of parallel algorithm: a thread is the owner of its computations.

So, each actor own a group of cells and is the only one able to modify them. This highly simplifies most algorithms, removes a lot of synchronization issues ...

This is also why in parallel algorithm you should split the outputs rather than the inputs.

Often, I find that "if you are synchronizing, you are wrong". Doesn't mean you can't or shouldn't use synchronizing in cases where message passing is impractical. However, it does mean that it should be a last choice rather than a first choice. It is simply far too easy to get locks wrong.

It's also an argument against using queues in some cases. Instead of publishing data on a queue, you can make consumers of the data pull the data. Especially if you have a 1 to many relation ship between producer and consumer of the data that moves the problem of ensuring data is received out of the infrastructure to the consumer side. Staying up to date is now their problem and they can ensure they are up to date by simply pulling regularly.

A good example would be the twitter firehose and consuming that. Other examples would be websites publishing RSS feeds. You can actually combine the two and push the information that "something changed" which would cause the receiver to pull. Caching and CDNs make publishing like this cheap and scalable.

Well, what do the consumers pull from, if not a queue? What is an RSS feed if not a FIFO queue of published pages?

Queues inherently lead to a pull-based module. The producer does not publish stuff directly to the consumer, but to the queue. The consumers then pull stuff at their own pace/terms from this queue.

A rest API. An rss feed is a simple http GET that returns a document. You can cache it, put it in a CDN. There no queues involved here.

Can you expand on this? So each worker gets its own REST endpoint to pull from?

Because if all workers pull from a single endpoint then that endpoint is a queue, no?

This is pretty much the analogue of gather vs scatter in GPGPU programming. It's a well known fact that in GPU programming, in almost any case a gather approach (threads map to outputs) works better than a scatter approach (threads map to inputs). The challenge is to transform the algorithm to still have some data locality for the reads to allow for caching, coalesced reads into local memory etc, which can be very hard or infeasible.

Sysadmins have been using this principle for decades.

When running a fleet of machines, it was always better to have them pull their config rather than push to them. It was far more reliable because you didn’t have to worry about whether a machine missed a push.

It would just grab its config whenever it came back online.

That's a completely different problem than what the author is solving, which is about synchronization overhead and performance, not reliability.

The two go hand in hand. The less synchronization that is necessary the more reliable a system is.

Well yes, but like another comment here on HN points out, it's not universal, and there are trade-offs, and different systems have chosen differently. E.g.

Monitoring: Mostly push, except prometheus which is pull.

Config management: Mostly pull, except ansible which is push (yes, there is ansible-pull, but most ansible code out there assumes push).

In both monitoring and config management, the majority is right.

Monitoring: you don't want to have all your applications and even scripts run a thread to listen on a dedicated port to allow pulling. It's bad for security, code complexity, configuration complexity (all those ports) and reliability (when an application stops the datapoints are lost).

Config management: with a push model you can easily end up having 2 developers push to the same system (or to systems that should be updated in sequence) in an unmanaged manner.

Push based monitoring has its own flaws. Mainly service discovery is coupled with service health.

Pull based monitoring makes it so you can remove the service and independently recognize failures.

You can do the same easily with push.

And now InfluxDB 2.0 which uses push for Telegraf, and pull to scrape Prometheus-style metrics.

EDIT: Totally forgot we are giving a webinar tutorial on this on Friday as part of the Security (Analytics) Jupyterthon. See the Python GPU one-liners talk @ https://infosecjupyterthon.com/agenda.html !

The hardware opts desired here are vector data parallelism, both in bulk data movement and execution. When you realize that those are the instructions the device is running, yet are using languages or libraries that hide that, you end up easily going against the grain.

In contrast, I've been doing a lot of gpu dataframe stuff for the last 6 years, which is basically a specialized and embeddable form of functional nested data parallel libs from the early 90's . (See NESL and later Haskell generalizations. In the top 5 for most beautiful papers I've read!)

Writing code this way basically comes down to writing closed form / point-free solutions as pure maps & scans (~= materialized views) that in turn have easily automatically optimized multicore + SIMD / GPU implementations underneath. Put differently again, columnar analytics where you try to only do pure functions over columns (tbl.col_x + tbl.col_y; tbl.col_z.max()) yet get all your rich functional language features.

Another poster correctly said to focus on vector gather over vector scatter. That is the assembly code primitive starting point for this journey. NESL's nested functional data abstractions were invented precisely for automatically compiling into this kind of extreme vectorization -- and before GPUs were even a thing! By sticking to libraries built entirely for that, it's easy to write data parallel code that can run on say 8 GPUs, with way less brain damage, and lots of automatic invisible optimizations missed here.

The fun part to me is that parallelizing this kind of cellular automata simulation maps to a class of parallel compute called 'grid codes' or 'stencil codes'. There are DSLs with specializing compilers precisely for all sorts of weird tile shapes - imagine writing a weather simulator. Think autotuning, synthesis, etc. It's a dangerously niche space. I suspect GPU dataframe libs can help put them to rest, would be cool to see what extensions it'd take!

Which languages are you using?

In my experience, the main problem with expressing your algorithms in that way is "What do you do when the primitives you are suing (e.g. maps and scans) do not produce optimal code?".

For example, suppose that you are doing two scans with two maps, whose efficient implementation does a single memory traversal with the two scans and maps fused into a single kernel. If whatever abstraction you are using isn't able to do this type of kernel fusion, then you easily bump to a 4x larger memory bandwidth, which might be translated to 4x slower performance.

I don't know of any language or optimizing runtime that's able to do these transformations and match the performance of the CUDA code one would have written by hand.

I agree in the small and disagree in the large.

Something like single-gpu cudf (think pandas on GPUs) will have that issue because it is eager/concrete. When going to multi-GPU with dask-cudf or say BlazingSQL, the deferred symbolic compute means they can add in fusion, deforestation, etc. That is exactly what happened with Spark's CPU-era compiler over the years.

In the large.. we have a giant code base for GPU visual graph analytics. We hand write maybe 1% of the frontend and 1% of the backend. The libs underneath likewise have handwritten code where it matters, but are also built on the abstraction stack. So yes, we absolutely handwrite... in the tiny minority of cases :)

In practice, I care less about the 4x thing here -- rather get more of our code vectorized and more devs productive on it. In the small, what ends up mattering (for us) is stuff like making sure they do not do excessive copying, which is more of an artifact of trying to match Python Pandas's API design bug-for-bug. Likewise, stuff like testing for performance regressions across upgrades. We have been building a style guide and tools to ensure this, and are able to use practices similar to any big non-GPU software project (lint, CI, etc).

In contrast, a codebase on handwritten CUDA would be incredibly small and quickly out-dated. We have enough stress in our lives as is :)

That makes perfect sense, and it appears that you always have the option to handrol something if you discover it as a bottle neck worth fixing.

This reminds me of the oddball comment I read somewhere to the effect that “come-from, despite all the humour, turns out to be a very sensible parallel-safe alternative to go-to”. Can’t remember the exact wording nor the source, though (unfortunately, because I’ve taken it to heart as very deep architectural wisdom).

If it helps, I've heard "return-from" used to describe Aspect-oriented programming.

This seems counterintuitive as it is the opposite of the programming ‘rule’: Tell, don’t ask.

Yes. And it's one of the most important reasons OOP is so bad at multithreading.

You want "pretty" OOP? Tell Don't Ask.

You want performance and reliable multithreading? Command-Query Separation. Which is basically the same as functional programming :) Make all the functions pure except for a few clearly marked ones with side effects.

It's especially visible in gamedev. After a decade of trying to make OOP work with multithreading people moved on to Entity Component Systems (which is basically imperative programming with some relational and functional aspects).

It's also why functional programming is getting more popular - because it is pull by default. It's also why almost all graphical shaders are functional even if they look like C :) Many inputs one output and no side effects :)

> graphical shaders are functional even if they look like C :) Many inputs one output and no side effects :)

That's not functional programming, it's just pure functions. Shaders make use of sequential execution of statements, and they don't support recursion.

you are splitting hair on definitions.

It is the pure function nature of functional programming that makes it easy for concurrent and parallel tasks. Recursion is neither here nor there

It's not hair-splitting. As I understand it, functional shader languages are an active area of research. Mainstream shader languages such as HLSL are most certainly not functional.

Languages are functional if they support higher order functions, recursion, closures, etc.

Code is functional if it separates side-effects to small, clearly marked parts of the codebase.

Man I love entity component systems. They're elegant, yet super hardware friendly. Extremely simple and easy to understand.

I have a background in Databases and it's always interesting how ECS face some of the same problems the DB field faced, they just haven't realised it yet. They require super fast joins to figure out which components to feed to the systems for example.

I'd predict massive cool strides in the DB and ECS/Gamedev/Simulation world, once the two realise how much they can learn from one another.

My dream is marry relation + array + functional + bit of imperative for a programming language (building slowly at https://github.com/Tablam/TablaM). But the ECS is one paradigm that must fit here but don't see how actually apply in more "normal" scenarios like erp/ecommerce programming. Do you have any hint?

My hunch would be to have a RDF-like (not RDF cause that sucks, but flexible self describing triples) graph database, that is append only (similar to a persistent datastructure). With systems subscribing to query streams and appending new facts. So like an event sourcing / rdf / ecs hybrid.

Uuid are Entities, for easy distribution. Components are just entity attribute subsets in the DB. Systems are external components subscribing to queries.

This wouldn't have all the nice memory layout properties of ecs, but the thing is that "normal" doesn't need fast. "normal" needs correct, and provenance, and accountability.

If you wanted subparts in such an architecture that are high speed you'd make them regular ecs components, and have them log into the append only store at intervals/checkpoints/whatever.

These can be complementary. Eg in actor systems you “tell” an actor what to do by sending a message, which sits in the actor’s inbox until it pulls work. This allows the actor to pull work at a rate it can manage.

The author claims that their code is optimized, yet still took 10 days to run. Unfortunately, they missed a key observation that would have cut their runtime by 1/4 or more.

When stablizing a sandpile that begins in a symmetric configuration, it can never enter a non-symmetric configuration. Therefore instead of computing the whole 10000x10000 pile, you can compute just one 5000x5000 quarter of it, and add a special case for the internal edges. They should not subtract 4 when overflowing, but instead subtract 3 (or 2 for the corner) because they will always trade grains with their mirror image.

Although it's a bit more complex, you can also improve upon that by tracking not a quarter of the grid, but instead 1/8th by using a triangle. When tracking a triangle the cells on the main diagonal of the square are not mirrored, but instead their neighbors are. This may not work quite as well with their parallelism setup though.

Very interesting optimisations, but I would not say the author "missed" these points, since this wasn't the point of their post.

Well yes, using symmetry to reduce the amount of computation is a common trick.

For instance, programmers of reciprocal space density-functional codes are certainly familiar with symmetry groups and (ir)reducible k-point grids etc.

Could you have a memoization function that calculates if a point is in a set of points and return the same result?

This is kind of silly. The author has reduced the number of mutations per entry by a factor of up to 4. That's why contention is reduced (less M(O)ESI eviction of shared lines). It happens to be that this problem is amenable to that kind of computation but it isn't always the case.

Seems a bit similar to volcano model in databases, where iterators pull next tuple.

It is much easier to build reliable distributed system when each component consumes (pulls) input and then publishes results.

This is mainly due to the fact that consumer knows when it is ready to consume.

Designs based on pulling are also faster to code than their push equivalents, especially when parallel or in a client server model.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact