
Reference Capabilities for Safe Parallel Array Programming - mpweiher
https://programming-journal.org/2020/4/1/
======
heinrichhartman
This is a very nice journal! [https://programming-journal.org/](https://programming-journal.org/)

> Diamond open access, free-of-charge to authors. The articles are accessible
> to everyone, forever, and the authors don’t need to pay any fee.

> All articles are published under a Creative Commons license.

> The journal is an overlay on arXiv, meaning that we upload all articles to
> arXiv as a long-term storage system.

Just as it should be. If I write a relevant publication, this will be at the
top of my list. At this point, publishing with folks like Elsevier should be
considered hostile and anti-science, e.g. [1].

[1] [http://heinrichhartmann.com/blog/2013/01/17/Why-Openness-benefitts-research.html](http://heinrichhartmann.com/blog/2013/01/17/Why-Openness-benefitts-research.html)

------
anaphor
Check out the Pony tutorial for a good introduction to reference capabilities:

[https://tutorial.ponylang.io/reference-capabilities.html](https://tutorial.ponylang.io/reference-capabilities.html)

They can be tricky to wrap your head around at first, but the idea (combined
with object capabilities) is very powerful IMO.

~~~
saagarjha
> There aren’t currently any mainstream programming languages that feature
> reference capabilities.

Does Rust not have something similar built into the borrow checker?

~~~
foota
I don't think you can hold multiple mutable non-overlapping slices though,
which this would appear to allow.

Edit: I see you're talking about Pony, not the OP. It would seem your comment
is correct, although Pony seems to have a slightly different perspective on
it.

~~~
dbaupp
Rust allows multiple mutable non-overlapping slices. They cannot be
constructed by direct array[i..j] slicing, but there is a variety of
convenience functions for it.

Sequential: [https://doc.rust-lang.org/std/primitive.slice.html#method.split_at_mut](https://doc.rust-lang.org/std/primitive.slice.html#method.split_at_mut)

Parallel:
[https://docs.rs/rayon/1.2.0/rayon/slice/trait.ParallelSliceM...](https://docs.rs/rayon/1.2.0/rayon/slice/trait.ParallelSliceMut.html#method.par_chunks_mut)
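
For anyone who hasn't seen these APIs, here's a minimal sketch of both (std's split_at_mut plus rayon's par_chunks_mut from the links above; the array contents are just for illustration):

        // Two disjoint mutable views of one array, checked at compile time.
        use rayon::prelude::*; // rayon = "1" in Cargo.toml

        fn main() {
            let mut data = [1, 2, 3, 4, 5, 6];

            // split_at_mut: two non-overlapping &mut slices of the same array.
            let (left, right) = data.split_at_mut(3);
            left[0] += 10;   // mutate through the first slice
            right[0] += 100; // mutate through the second; no aliasing is possible
            assert_eq!(data, [11, 2, 3, 104, 5, 6]);

            // par_chunks_mut: non-overlapping mutable chunks, mutated in parallel.
            data.par_chunks_mut(2).for_each(|chunk| {
                for x in chunk {
                    *x *= 2;
                }
            });
            assert_eq!(data, [22, 4, 6, 208, 10, 12]);
        }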

~~~
kzrdude
And Rust's ndarray crate implements strided mutable "slices" too, interleaved.

------
dragontamer
Here's a common GPU / SIMD trick for parallel programming, which also happens
to be my favorite array trick I've ever learned. Since SIMD is implemented on
Intel x86, AMD x86, ARM, and POWER9... as well as NVidia GPUs and AMD GPUs...
I think anyone who is into parallel processing of arrays should learn it.
This is truly universal on modern hardware (although the precise instructions
differ).

First, a helper function called "active_lane":

    
    
        active_lane():
            return exclusive_parallel_prefix_sum(current_lane_is_active);
    

This "parallel_prefix_sum" operation is a GPU intrinsic. On AMD, it is
"__ockl_activelane_u32()", which compiles into two assembly instructions with
a throughput of one instruction per clock tick. I assume NVidia's equivalent
is similarly efficient.

AVX can implement active_lane by performing parallel-prefix-sum over the
execution mask:
[https://en.wikipedia.org/wiki/Prefix_sum](https://en.wikipedia.org/wiki/Prefix_sum)

Imagine the following execution mask in AVX: 0 1 0 1 0 1 1 1. The exclusive
parallel prefix sum is 0 0 1 1 2 2 3 4. Lane #1 gets index 0, lane #3 gets
index 1, and lane #7 gets index 4.
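
In scalar form the exclusive scan is just a running sum that each lane reads before adding its own flag. A quick Rust sketch of that idea (the mask is the one from the example above):

        // Exclusive prefix sum over an execution mask: lane i receives the
        // number of active lanes strictly before it, i.e. its output slot.
        fn exclusive_prefix_sum(mask: &[u32]) -> Vec<u32> {
            let mut running = 0;
            mask.iter()
                .map(|&m| {
                    let slot = running; // read before update = "exclusive"
                    running += m;
                    slot
                })
                .collect()
        }

        fn main() {
            let mask = [0, 1, 0, 1, 0, 1, 1, 1];
            assert_eq!(exclusive_prefix_sum(&mask), [0, 0, 1, 1, 2, 2, 3, 4]);
        }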

Intel 128-bit SIMD can implement this for 8-bit ints as follows:

    
    
        ; Conceptual purposes only, untested. Using AVX-style "dest, src0, src1" three-operand syntax
        ; Assume xmm0 is the execution mask / input (one byte per lane, 1 = active)
        vpslldq xmm1, xmm0, 1    ; byte-shift left by 1 lane
        vpaddb  xmm0, xmm1, xmm0 ; 8-bit integer add: x[i] += x[i-1]
        vpslldq xmm1, xmm0, 2    ; 2-lane shift
        vpaddb  xmm0, xmm1, xmm0 ; x[i] += x[i-2]
        vpslldq xmm1, xmm0, 4    ; 4-lane shift
        vpaddb  xmm0, xmm1, xmm0 ; x[i] += x[i-4]
        vpslldq xmm1, xmm0, 8    ; 8-lane shift
        vpaddb  xmm0, xmm1, xmm0 ; x[i] += x[i-8]
        ; xmm0 now holds the inclusive prefix sum (lane 0 = lowest byte);
        ; one more vpslldq by 1 lane turns it into the exclusive scan
    

Extending it beyond 128 bits requires vpbroadcastd due to AVX2's 128-bit
"lane" split, but otherwise it continues to follow the pattern. In any case,
active_lane is still a very efficient operation in AVX, as it doesn't even
touch the L1 cache (register space only).
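
If the assembly is hard to follow, here is the same log-step (Hillis-Steele) ladder in scalar Rust, purely to show the concept (16 lanes of u8, as in the snippet above):

        // Each round adds a copy of the vector shifted up by 1, 2, 4, 8 lanes,
        // which is exactly what the shift+add pairs above do in registers.
        fn inclusive_scan_16(mut v: [u8; 16]) -> [u8; 16] {
            let mut shift = 1;
            while shift < 16 {
                let prev = v; // stands in for the byte-shifted register copy
                for i in shift..16 {
                    v[i] = v[i].wrapping_add(prev[i - shift]);
                }
                shift *= 2;
            }
            v // inclusive scan; shift one more lane for the exclusive version
        }

        fn main() {
            let mask = [0u8, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0];
            assert_eq!(&inclusive_scan_16(mask)[..8], &[0, 1, 1, 2, 2, 3, 4, 5]);
        }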

\------

Once active_lane() is figured out, treating an array as a 16-way parallel SIMD-
stack (or bigger) is actually quite simple:

    
    
        SIMD-Push(X)
            array[tail + __active_lane()] = X
            toAdd = popcount(exec_mask);
            if(__active_lane() == 0){ // Tail is shared. Only have one lane update "tail".
                tail += toAdd;
            }
    
        SIMD-Pop()
            toRemove = popcount(exec_mask);
            if(__active_lane() == 0){
                tail -= toRemove;
            }
            // Lanes in a wave run in lockstep, so "tail" is already decremented
            // here; the popped elements sit at tail .. tail+toRemove-1.
            return array[tail + __active_lane()];
    

The above is GPU-style pseudocode (ispc, ROCm, OpenCL, or CUDA). While all
threads read "tail" together, you only want one lane updating the "tail"
variable at a time.
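
To make the push concrete, here is a scalar model of one SIMD-Push step in Rust (hypothetical names; the mask and values stand in for the execution mask and each lane's X):

        // Scalar model of SIMD-Push: each active lane writes at
        // tail + active_lane(), then the shared tail advances by popcount(mask).
        fn simd_push(array: &mut [i32], tail: &mut usize, mask: &[bool], values: &[i32]) {
            let mut active_lane = 0; // exclusive prefix sum, accumulated as we go
            for (&active, &v) in mask.iter().zip(values) {
                if active {
                    array[*tail + active_lane] = v; // unique, densely-packed slot
                    active_lane += 1;
                }
            }
            *tail += active_lane; // active_lane == popcount(mask) at loop exit
        }

        fn main() {
            let mut array = [0; 8];
            let mut tail = 0;
            simd_push(&mut array, &mut tail, &[false, true, true, false], &[10, 20, 30, 40]);
            simd_push(&mut array, &mut tail, &[true, false, false, true], &[50, 60, 70, 80]);
            assert_eq!(&array[..tail], &[20, 30, 50, 80]); // active values, packed
        }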

\----------

With that being said: the work in the paper is interesting. "Split" and
"merge" (especially the "interleaved" forms) are a nice way to formally
specify Array-of-Structs / Struct-of-Arrays layouts, which is one of the
problems you commonly deal with in SIMD or parallel compute.

The "split" and "merge" primitives are also put to use in a cute mergesort
implementation at the end of the paper.

Arrays are surprisingly efficient and effective data structures today, even
more so now that CPUs rely on prefetchers to fill caches fast enough.

~~~
lostmsu
Can you summarize what this does for you?

~~~
dragontamer
Hmmm... perhaps an example would be best? I wrote a FizzBuzz a few days ago
as a proof of concept.

    
    
        // AMD cards are 64 SIMD-lanes
        // Untested, but I think you can understand the concept from this.
        // Assumes fizzTable/buzzTable and fizzTail/buzzTail are shared and zero-initialized.
        for(int i=0; i<10000; i+=64){
            int myIdx = i + hipThreadIdx_x;
            bool fizz = myIdx % 3 == 0;
            bool buzz = myIdx % 5 == 0;
            if(fizz){
                fizzTable[fizzTail + active_lane()] = myIdx; // compact the fizz hits
            }
            if(buzz){
                buzzTable[buzzTail + active_lane()] = myIdx; // compact the buzz hits
            }
    
            int fizzNum = popcnt(ballot(fizz));
            int buzzNum = popcnt(ballot(buzz));
            if(active_lane() == 0){ // only one lane advances the shared tails
                fizzTail += fizzNum;
                buzzTail += buzzNum;
            }
            __syncthreads(); // Thread barrier, this is the only synchronization needed.
        }
    

The above implements parallel FizzBuzz with 100% utilization and only one
synchronization primitive. The __syncthreads() call is extremely efficient for
SIMD programming: if you have a workgroup of size 64 on AMD platforms, it is
nearly a no-op (aside from a relatively cheap L1 memory barrier to ensure all
threads are reading the same value).

In practice, the above methodology scales to 1024+ threads at a time on AMD
and NVidia GPUs if you increase the work-group or CUDA-block size to the max.

If you tried to write a 64-way parallel FizzBuzz, what would it look like? I
think the above implementation, using active_lane(), is pretty elegant.
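
As a sanity check on the semantics: sequentially, the kernel above computes nothing more than this (plain Rust, with the counts implied by the loop bound):

        // What fizzTable/buzzTable should contain after the kernel runs:
        // all multiples of 3 (resp. 5) below 10000, densely packed in order.
        fn main() {
            let fizz_table: Vec<u32> = (0..10000).filter(|i| i % 3 == 0).collect();
            let buzz_table: Vec<u32> = (0..10000).filter(|i| i % 5 == 0).collect();
            assert_eq!(fizz_table.len(), 3334); // fizzTail's final value
            assert_eq!(buzz_table.len(), 2000); // buzzTail's final value
        }

The GPU version produces the same packed tables, except each batch of 64 indices is filtered and written in a single SIMD step.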

\---------

I picked this pattern up from some high-speed GPU raytracing code on GitHub.
It seems that a lot of GPU programmers already know it, and use it to
increase utilization of the SIMD cores.

