Ask HN: Where to run embarrassingly parallel, Integer, no SIMD workloads?
11 points by freemint on Aug 16, 2022 | 14 comments
Many modern CPUs (e.g. Intel, AMD), GPUs and TPUs include SIMD instructions. Following the motto "don't pay for what you do not use", I would rather have more cores. Even the ARM-based servers (Graviton, Ampere Altra) have NEON vector instructions; however, those smaller vectors seem to allow them to have more cores (among other trade-offs). Even on consumer chips, vector instructions take up a noticeable amount of die space [1].

Throughput over latency. The problem is embarrassingly parallel and can be divided to suit any number of cores. Cores could be register or stack machines. Cores need to have access to >100 MB. The algorithm does not depend on the order of communication between cores; however, inter-core communication is required (shared memory or message passing is fine). The workload is branch rich, so SIMD/SIMT is not possible. The workload is mostly integer instructions. The workload is memory bound, not compute bound.

Can you think of any off-the-shelf or rentable hardware better suited to this workload than many-core ARM chips, which have some vector instructions I won't use?

[1] https://www.igorslab.de/wp-content/uploads/2021/12/alder_lak...

Notes: GPUs tend to have huge vector widths. Xeon Phi has huge vectors. The cores from https://www.greenarraychips.com/ are too small, have too little RAM and are not available in large machines; Adapteva chips are not taped out on a competitive node; Sunway SW26010 has huge vectors; Graphcore has huge vectors.



> The workload is branch rich so SIMD/SIMT is not possible

Just because there are many branches doesn't mean SIMD/SIMT is impossible or too slow.

I recently needed to solve a problem where I wanted to traverse a tree, computing something for every node, and deciding whether to descend or skip nodes based on results of some computations. Obviously branch rich, but modern GPUs weren’t too bad at that use case. My compute shader has a main loop like the following HLSL.

    ByteAddressBuffer tree: register( t0 );
    // state is the byte offset of the current node; 0xFFFFFFFFu means "done"
    for( uint state = 0; state != 0xFFFFFFFFu; )
    {
        const uint2 nodeHeader = tree.Load2( state ); // [ child, next ]
        // Payload starts 8 bytes past the header; shouldDescend() is
        // the branch-rich per-node computation.
        if( shouldDescend( state + 8 ) )
        {
            // Need to go deeper, compute something for that node
            state = nodeHeader.x;
        }
        else
        {
            // Skipping the child, compute something else here
            state = nodeHeader.y;
        }
    }


This is a helpful distinction. I need to update a bunch of randomly accessed bookkeeping structures which influence future branches, so it cannot be implemented as a reduction over tree nodes.


> which influence future branches

If you mean “future branches on other threads”, I recommend re-thinking your algorithm and/or data structures.

If the hardware has a strong memory order like AMD64, on a massively multicore CPU your code will be very slow because of cache coherency protocols. Reading from a cache line recently modified by another core is typically even more expensive than an L3 cache miss and a round trip to system RAM.

Or if the hardware has a weak memory order (like GPUs, unless you use interlocked instructions, which are handled by a dedicated piece of hardware shared across the complete chip), recent changes to your state will be ignored by other cores. Other cores may load old data from their local caches. The performance is better, but the results are incorrect.
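
For illustration, a minimal C++ sketch of that cache-line ping-pong (hypothetical per-thread counters, not your actual workload): padding each counter to 64 bytes keeps every core on its own cache line, while the unpadded layout forces the line to bounce between cores on every write.

    #include <atomic>
    #include <cstdint>
    #include <thread>
    #include <vector>

    // Hypothetical illustration: several ~8-byte counters share one cache
    // line, so concurrent writers keep stealing the line from each other.
    struct Unpadded { std::atomic<uint64_t> value{0}; };
    // Padded to a full cache line: each core writes its own line.
    struct alignas(64) Padded { std::atomic<uint64_t> value{0}; };

    int main() {
        constexpr int kThreads = 4;
        constexpr uint64_t kIters = 10'000'000;
        std::vector<Padded> counters(kThreads); // try Unpadded to see the slowdown

        std::vector<std::thread> pool;
        for (int t = 0; t < kThreads; ++t)
            pool.emplace_back([&counters, t] {
                for (uint64_t i = 0; i < kIters; ++i)
                    counters[t].value.fetch_add(1, std::memory_order_relaxed);
            });
        for (auto& th : pool) th.join();
    }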


Future branches of the same "thread", luckily: the entire bookkeeping data structure is local to a single thread of execution of the algorithm. Access and modification of the bookkeeping structure is essentially random access for any notion of "nearby" threads I can think of, and future branch decisions can depend on pretty arbitrary parts of that data structure. See my other long reply for more context. Synchronisation between threads can happen at pretty arbitrary times and involves exchanging new rows of the bookkeeping structure.


I’ve read that comment but I still don’t understand what it is you’re computing.

It seems you have a substantial amount of complicated C++ written without much thought about performance, and now you want to improve performance without spending too much time reworking things.

If that’s the case, I’m not sure this is gonna work. You can spend hours with a profiler trying micro-optimizations, but if the data structures involved aren’t good (for instance, too many pointers across things), these micro-optimizations won’t necessarily help much.

Also, I believe no modern hardware has a general way to efficiently synchronize fine-grained stuff across cores. Shared memory works, but the latency is not great for fine-grained tasks measured in microseconds or less.


I appreciate you challenging my ideas as far as your knowledge took you. I appreciate that you stopped when my explanation was insufficient. Your diagnosis:

> It seems you have a substantial amount of complicated C++ written without much thought about performance, and now you want to improve performance without spending too much time reworking things.

is correct, except it was someone else who wrote a "substantial amount of complicated C++ written without much thought about performance" and now I "want to improve performance".


Well, I’m afraid for your case there’s no silver bullet, neither hardware nor software. If you really need to improve things, you should refactor the code, and especially the data structures, for performance. A couple of tips:

I recommend staying away from GPGPU, at least for now. Porting things from CPU to GPU, especially in a cross-platform code base, is relatively hard on its own. Unless you’re fine with vendor lock-in to nVidia: CUDA is easier to integrate than the others.

Viewing SIMD lanes as equivalents of GPU threads is only one possible approach. It’s also possible to leverage these fixed-length vectors as they are, within a single logical thread. A trivial example: all modern implementations of memcpy() use SSE2 or AVX instructions to sequentially move bytes without any parallelism. You obviously don’t need to implement memcpy because it’s already in the standard library, but the SSE2 and AVX2 sets have hundreds of instructions for manipulating integer lanes in these vectors.
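
As a concrete example of that style, here is a small sketch (my own toy function, not taken from any particular library) that sums 32-bit integers eight lanes at a time with AVX2 inside a single thread:

    #include <immintrin.h>
    #include <cstddef>
    #include <cstdint>

    // Sum a buffer of int32 values 8 lanes at a time; no threads involved,
    // just wider registers. Assumes per-lane partial sums fit in int32.
    int64_t sum_i32_avx2(const int32_t* data, size_t n) {
        __m256i acc = _mm256_setzero_si256();
        size_t i = 0;
        for (; i + 8 <= n; i += 8) {
            __m256i v = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(data + i));
            acc = _mm256_add_epi32(acc, v);
        }
        // Horizontal reduction of the 8 accumulator lanes.
        alignas(32) int32_t lanes[8];
        _mm256_store_si256(reinterpret_cast<__m256i*>(lanes), acc);
        int64_t total = 0;
        for (int32_t lane : lanes) total += lane;
        // Scalar tail for the leftover elements.
        for (; i < n; ++i) total += data[i];
        return total;
    }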


Processors with vector support are ubiquitous. There's no point in trying to escape them, even if you don't want them or use them. There's simply not enough demand for non-vector processors.


If you explain more about the problem, I can probably write a SIMD implementation. Most "branch rich integer instruction" algorithms, such as running a billion different Miller-Rabin tests, can be made into SIMD algorithms that use almost no branches. This conversion will give you the throughput benefits of not having a bunch of hard-to-predict branches constantly causing pipeline flushes *and* the throughput benefits of SIMD.
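
To show what that conversion looks like, a hypothetical example: a per-element if/else rewritten as a compare plus blend, so every SIMD lane executes the same instruction stream and no branch predictor is involved.

    #include <immintrin.h>

    // Branchless per-lane version of
    //   out[i] = (a[i] > b[i]) ? a[i] - b[i] : b[i] - a[i];
    // The comparison yields an all-ones mask in lanes where a > b,
    // and the blend picks between the two precomputed results per lane.
    __m256i abs_diff_i32(__m256i a, __m256i b) {
        __m256i gt  = _mm256_cmpgt_epi32(a, b); // 0xFFFFFFFF where a > b
        __m256i amb = _mm256_sub_epi32(a, b);
        __m256i bma = _mm256_sub_epi32(b, a);
        return _mm256_blendv_epi8(bma, amb, gt); // take a-b where a > b
    }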

> The workload is memory bound not compute bound.

I don't think so? How many gigabytes per second per core are you processing?

Edit: If for some reason you can talk about this problem to random SIMD programmers online privately but you cannot post about this problem publicly, please add your contact information to your profile or your post.


>> The workload is memory bound not compute bound.

> I don't think so? How many gigabytes per second per core are you processing?

That's what the Intel VTune profiler tells me: 39.2% memory bound = 21.7% of clock ticks L1 bound (execution stalled on data that was in L1) + 12.4% L3 bound, on a Haswell 4-core Xeon.

> If for some reason you can talk about this problem to random SIMD programmers online privately but you cannot post about this problem publicly

I can talk about it publicly. I just did not want to distract from the actual hardware question. I recently started to contribute to this open source project: https://gitlab.com/JoD/exact. The algorithm tries to find a valid assignment for a bunch of equations of the form 4x1 - 3x57 + 1*not(x1232) <= 4 (there are special cases that are already accelerated). We guess an assignment for a certain variable and check all constraints; sometimes constraints imply assignments to other variables (if x1 is true and x1232 is false, x57 has to be true), and then those get propagated too. One technique is called watched propagation and can be done for the SAT family of clauses. This technique (among others) is incompatible with vectorisation over different assignments. I find SIMD over clauses dubious, as they are mostly randomly accessed, of different lengths, and sparse. The embarrassing parallelization comes from being able to work on different parts of the parameter space and exchange clauses learned from conflicts. We are currently not doing that yet but plan to do something HordeSAT-like over MPI (there is a slightly cleverer tree-exchange variant over MPI all-to-all but I do not have that reference handy).
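
To make that concrete, here is a toy sketch of this kind of propagation (illustrative only, with my own names; it is not how Exact actually implements it, Exact uses watched propagation and other tricks):

    #include <cstdlib>
    #include <optional>
    #include <utility>
    #include <vector>

    // Toy propagation for a constraint  sum(coef_i * lit_i) <= bound,
    // e.g.  4*x1 - 3*x57 + 1*not(x1232) <= 4.
    // Assignments: -1 = unassigned, 0 = false, 1 = true.
    struct Term { int coef; int var; bool negated; };

    static int litValue(const Term& t, int varAssign) {
        return t.negated ? 1 - varAssign : varAssign;
    }

    // Returns the variable assignments implied by the constraint under
    // `assign`, or std::nullopt if the constraint can no longer be satisfied.
    std::optional<std::vector<std::pair<int, int>>>
    propagate(const std::vector<Term>& terms, int bound, const std::vector<int>& assign) {
        // Smallest left-hand side still reachable: each unassigned literal
        // takes whichever value minimizes its term.
        int minLHS = 0;
        for (const Term& t : terms) {
            int a = assign[t.var];
            if (a >= 0)          minLHS += t.coef * litValue(t, a);
            else if (t.coef < 0) minLHS += t.coef; // literal = 1 minimizes
            // positive coef + unassigned: literal = 0 contributes nothing
        }
        if (minLHS > bound) return std::nullopt;   // conflict

        std::vector<std::pair<int, int>> implied;  // (var, forced value)
        for (const Term& t : terms) {
            if (assign[t.var] >= 0) continue;
            // If this literal took its "expensive" value, the minimum LHS
            // would grow by |coef|; if that overshoots the bound, the cheap
            // value is forced.
            if (minLHS + std::abs(t.coef) > bound) {
                int cheapLit = (t.coef < 0) ? 1 : 0;
                implied.push_back({t.var, t.negated ? 1 - cheapLit : cheapLit});
            }
        }
        return implied;
    }

With x1 = true and x1232 = false, this derives x57 = true for the example constraint above, and that kind of data-dependent, randomly ordered work is exactly what is hard to vectorise over assignments.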

We have some horrible sins (such as virtual method table lookups in loops, no -march=native compiler flags in the main branch, though that did not make a huge difference, ...) which the main developer created and we have not cleaned up. If I could nerd-snipe you into running some experiments with that codebase and contributing some SIMD loops (with -march=native -mtune=native only 4 functions are currently SIMD, none of them significant to the performance), that would be great. For all the divisibility checking I currently plan this: https://www.reddit.com/r/exact/comments/wokfhl/resource_on_f... (we spend 3% of compute time in the GCC standard library's modulo). Use the reddit or the Issues page to get in contact async.
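
For reference, this is the kind of divisor-specific divisibility test I mean: a sketch of the Granlund-Montgomery / Hacker's Delight trick, odd divisors only, and not code from Exact.

    #include <cassert>
    #include <cstdint>

    // Test "n % d == 0" for a fixed odd divisor d without a div instruction:
    // precompute the multiplicative inverse of d modulo 2^64; n is divisible
    // by d exactly when n * inv <= floor((2^64 - 1) / d).
    struct OddDivTester {
        uint64_t inv;   // d * inv == 1 (mod 2^64)
        uint64_t limit; // floor((2^64 - 1) / d)

        explicit OddDivTester(uint64_t d) {
            assert(d > 1 && d % 2 == 1);
            uint64_t x = d;              // correct to the 3 low bits for odd d
            for (int i = 0; i < 5; ++i)  // Newton's iteration doubles precision
                x *= 2 - d * x;
            inv = x;
            limit = ~uint64_t(0) / d;
        }
        bool divides(uint64_t n) const { return n * inv <= limit; }
    };

    // e.g. OddDivTester(7).divides(63) is true, .divides(64) is false.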


If you are worried about branching, then use perf to profile it and see how many branch misses you have. You might be guessing 90%+ correctly and not actually have anything to worry about.

https://en.m.wikipedia.org/wiki/Branch_predictor

You can probably use SIMD just fine.

You can get an EPYC with a shit ton of CPU cache; 100MB is easy. Faster & higher-IPC cores can be better than more cores. Even at 100MB per core, you can do some work on the code to give it better cache locality so that the CPU cache prefetcher(s) handle that constraint well. I am guessing C/C++? Are you compiling with -O3 and profile-guided optimization (GCC)?

Worst case scenario throw it on an FPGA :)


A GPU is almost certainly what you want. Modern GPUs have some interesting tricks up their sleeves that make them much better at branching code than previously.


This isn’t off the shelf, but you can rent enormous FPGAs from Amazon. You can dispatch ~6800 DSP slices at 500 MHz for a hot ~1.7 Tu32op/s. Each FPGA can have 4 DDR4 controllers for you to saturate with memory ops. You can cram ~800 VexRiscv cores in there (possibly more, I’m extrapolating from less featured FPGA architectures) for a minimum of 220 GIPS, or about 1/7th of a Threadripper 3990X (Dhrystone IPS).

I think that those vector functional units more than pay for themselves. If you combined them all on that Alder Lake die you linked, you would get _one_ extra core of space. Every recent manycore mesh I know of packs in a vector or tensor unit because compute circuits are actually pretty small for the punch they pack! You vastly reduce dispatch overhead, which is the main source of slowdown once you saturate your cache hierarchy. Distributing memory access in a manycore is frustratingly slow.


If you are doing few compute steps per memory access, then you are memory-bandwidth limited, and >100MB of RAM per core is too much to not be bottlenecked by RAM access. Use a GPU or FPGA with HBM.



