Hacker News
Implementing Fast Barriers for a Shared-Memory Cluster of 1024 RISC-V Cores (semiengineering.com)
60 points by stevefan1999 10 months ago | 10 comments



I skimmed the linked technical paper until I got to the implementation details. It's a tree of integers. This is similar to my idea of a latch grid, which is basically an atomic increment plus an if statement against a grid to see what is ready.

I was imagining a programming language that has "latches" or "barriers" as a primitive of the language.

Barriers and CountdownLatches are, I think, essentially just an if statement that checks whether a counter has reached a target value.

But I want barriers or latches that can schedule "something else" if the barrier is not ready.

How many times have you been deep in the depths of some code and realised you need to wait for something in another part of the system to become true?

I think lifecycles and state machines and barriers and latches are related, so I'm trying to define a syntax that looks like this:

  substate1(a) substate2(b) = substate3(a) substate4(a) | substate5(b) substate6(b)
When substate1 and substate2 are "fired", the state machine waits for substate3 and substate4; when those fire, it waits for substate5 and substate6.

I can either rely on barriers or latches or rely on explicit scheduling to implement this state machine.

EDIT: It occurred to me that Go's channels and CSP and its send/receive operations are like barriers.


You can have barriers that schedule "something else" when the barrier is not ready by using services provided by the operating system.

For instance, in Linux you can use the syscalls FUTEX_WAIT and FUTEX_WAKE.

Instead of waiting in a loop for the barrier variable to reach the right value, invoking FUTEX_WAIT will schedule "something else". Whoever modifies the barrier variable then invokes FUTEX_WAKE, so that every waiter re-checks whether the new value is what it is waiting for; if not, it yields execution again.

In Windows there is the analogous function WaitOnAddress().

Without using the operating system, on Intel or AMD CPUs it is possible to use the instructions MONITOR and MWAIT, or their unprivileged variants UMONITOR and UMWAIT (the latter are available only in some recent models).

By invoking MWAIT/UMWAIT, a thread sleeps until another thread executes a store instruction to the monitored barrier variable, changing its value.


FWIW, that line you've written looks a lot like Verilog (or any other hardware description language).


This is literally just the abstract for the paper at https://arxiv.org/abs/2307.10248 (at least they provided the link?).

There is no commentary or discussion, just the abstract and the link.

The url should be updated to point to https://arxiv.org/abs/2307.10248


> To our knowledge, this is the first work where shared-memory barriers are used for the synchronization of a thousand processing elements tightly coupled to shared data memory.

Did they just forget about GPUs?


> By fine-tuning our tree barriers, we achieve 1.6x speed-up with respect to a naive central counter barrier and just 6.2% overhead on a typical 5G application

Sigh, has 5G advertising reached technical papers too? :(


Disclaimer: I don't know what kind of hardware 5G towers and routing run on.

If it's a typical application for these high CPU count clusters, why not measure real-world impact and mention it?


Not sure if this is on topic, but couldn't a processor designed to run cellular automata scale better?

I doubt that it is possible to scale an architecture that by design attempts to hide space.


What do you mean by hiding space?


CPU cache, main RAM & disk are all treated the same at the software layer. All memory accesses are (in theory) expected to be O(1).

The best you can do with such a model is try to predict space (access patterns etc.). Cellular automata, on the other hand, would make the cost of space much more explicit, benefiting both small & big programs.

Parallelism is ultimately the duplication of space. If one bucket isn't enough, you get a second one. But with our space-unaware ISAs you need cooperation from the OS and all the layers above it, and you still have to predict access patterns as best you can.



