I think you misunderstand how "processing a node" works. It's a recursive algori...

dragontamer · on May 31, 2019

> I think you misunderstand how "processing a node" works. It's a recursive algorithm. Even if you correctly identify an ALL node (which is not always easy) and say "hey process all these moves at the same time" then each move will result in a radically divergent search path that a GPU won't be able to handle.

I don't think there's as much divergence as you think there is. Someone down-thread already created a chess move-generator and evaluator: https://news.ycombinator.com/item?id=20035835

Its not the "chess" part that results in divergent thread execution, its maybe the "search part", at least with current algorithms.

Pawn, king, and knight movements won't have any thread divergence: these are pure bitboard manipulations without if-statements. Only sliding piece attacks (Bishop, queen, rook) seem complicated enough to cause divergence... but its mostly about iteration. One bitboard with 6 "bishops" (2x white bishop, 2x black bishop, 1x white queen, 1x black queen) will run diagonal sliding piece move generation 6x. But a bitboard with only 3 "bishops" will only run it 3 times. But I don't expect this kind of thread divergence to cause major issues, and most boards in a SIMD thread probably will have similar numbers of pieces.

> You're likely looking at simplified diagrams of a search tree and seeing a circle that has lines that connect it to a layer of more circles and thinking "hey just do all those circles in parallel" but the way chess engines work is 100% different from that.

Yes, that was my initial idea. But... the more I look into it, the more it seems possible. I'm just a little bit beyond that point, although its all just theory in my mind.

----------

Lets look at this implementation of alpha-beta: https://www.chessprogramming.org/Alpha-Beta#Negamax_Framewor...

Lets change it to the following:

    int alphaBeta( future<int> alpha, future<int> beta, int depthleft ) {
       if( depthleft == 0 ) return quiesce( alpha, beta );
       for ( all moves)  {
          score = Spawn(-alphaBeta( -beta, -alpha, depthleft - 1 )); // <--- Issue #1
          if( score >= beta ) // <---- Issue #2
             return beta;   //  fail hard beta-cutoff
          if( score > alpha ) // <---- Issue #3
             alpha = score; // alpha acts like max in MiniMax
       }
       return alpha;
    }

Issue #0: Futures/Promises -- No one has implemented futures / promises on GPGPUs yet. How can this be done efficiently on the GPU? Can the SIMD architecture be leveraged? Atomics could work, but they probably won't scale... I think something "innate" to the GPGPU architecture has to be done.

Issue #1: "Spawning" threads is pretty simple (but unconventional), even on a GPGPU. Create tasks that can be load-balanced on the SIMD GPU through the use of SIMD-Stacks and SIMD-queues to load-balance. Basically the same thing that GPGPU raytracing engines do (rays don't always "bounce", most rays miss all targets. Figuring out which rays hit a target and need to bounce... vs rays that miss and need to be terminated... is a solved problem for GPGPU programmers. Its basically task spawning). "Negation" (the -alpha, and -beta) shows that the future<int> needs to support negation, so that we don't block on that line. This line should execute in parallel.

Issue #2: score >= beta is a "hard synchronization" point, the only point where future::wait() needs to be called. However, most checks will be against a beta of -infinity (PV and ALL nodes would have a -infinity value). So this step could be skipped in a huge number of cases. In all other cases, a "blocked" thread will have to wait for its tree of future<ints> "score" value to be all finished executing.

Issue #3: This is effectively score = max(alpha, score). Across multiple iterations, this is max(alpha, score0, score1, score2, score3), etc. etc. Alpha starts as negative infinity: meaning a large number of nodes will always spawn.

The PV-nodes, CUT nodes, and ALL nodes don't need to be analyzed. You can see that this "task-based" definition spawns parallel-threads on PV-nodes and ALL nodes, while blocking on CUT nodes. This clearly becomes a (parallel SIMD) depth-first search.

---------

Of these issues, #0 (creating a future / promise) on the GPGPU is probably the hardest issue to solve. It is clear to me that this "future" needs symbolic execution (able to handle max(future, future), as well as negation -future). A difficult programming challenge, but not really insurmountable. It should be noted that if depth-first search is prioritized (if we can somehow guarantee that all "searchers" are on the left-most edge of all nodes), futures will take up a constant space roughly proportional to O(depth + size-of-parallel-cluster). (well, it seems clear to me anyway...)

The overall GPGPU task scheduling algorithm has been done in GPGPU Raytracers, even if people don't really realize it yet. http://www.cse.chalmers.se/~uffe/streamcompaction.pdf

GPGPU Raytracers have their rays either hit a target, or miss a target. If they hit a target, they spawn a new ray (!!), if they miss, then they die. Those silly GPGPU Raytracer programmers have solved a very, very difficult issue but haven't been telling everyone about it. Lol. This clearly can serve as the basis of a task-scheudling / load-balancing core on GPGPUs.

-------

EDIT: Don't get me wrong. I recognize that I'm basically proposing that a GPU recursive task scheduler / mini-SIMD operating system with futures needs to be built to accomplish the job. Yeah, that seems difficult, but it doesn't seem impossible to me.

mot524 · on May 31, 2019

I know exactly how much divergence there is because I'm the author of a chess engine and I've spent weeks/months looking at traces of my engine's tree searches.

I think what you're not getting is what a chess engine search tree looks like. You're thinking it has nice layers of nodes, but really, most of the time, the engine is in the quiescence search, searching sequences of captures. The branching factor is extremely low and each branch terminates at an unpredictable depth. Instead of nice layers of nodes, imagine a lightning bolt that's constantly changing. You can't predict what you're going to have to do next.

GPUs would excel if you have to do the same thing to different boards all at the same time. But that's never what happens. Imagine you're in the q-search and you're at a node with two captures. Might as well search both at the same time? Okay, a GPU can make two moves at the same time and evaluate the resulting positions simultaneously no problem. Maybe one evaluation causes the branch to terminate and then that thread needs to undo the move, whereas the other branch needs to keep going and generate and search subsequent moves. This sort of divergence is the norm, not the exception, and it's impossible to handle in any sort of efficient way with a GPU.

It's great that people have made GPU software to visit all possible positions to depth N, and it's great that somebody is making a MCTS engine run on a GPU. Both are tricky to do, and significant accomplishments if they work correctly. But in terms of making an effective engine these are just the absolute first initial baby steps and they don't signify anything about the viability of the rest of the project.

dragontamer · on June 1, 2019

I appreciate the discussion.

> most of the time, the engine is in the quiescence search

That seems to be the best case for GPUs. Whenever a GPU SIMD thread finishes a quiescence search, just grab another position that is under quiescence search.

Frankly, the quiescence search part seems most ideally suited to GPUs out of all, as long as it is appropriate batched.

> Imagine you're in the q-search and you're at a node with two captures. Might as well search both at the same time.

Put them onto the q-search queue, which will be checked as GPU threads go idle.

The q-search problem is exactly identical to raytracers. GPUs don't know if a ray will bounce zero, once, or even 10 times. But that doesn't cause divergence because it's just about sticking rays onto a queue and evaluating them

Rays in raytracers can even spawn multiple rays in a bounce, such as subsurface scattering. This pattern is solved!

mot524 · on June 1, 2019

Like I've been saying, you don't know what you want to q-search next until you've finished the last q-search. So you can't make a q-search queue because you couldn't add anything to the queue until the previous addition was evaluated. I feel like I've said this a few times but for some reason you're not getting it, not really sure what the hangup is?

Alpha-beta in general is so sequential that basically all the algorithms to parallelize it focus on making each thread as independent and sequential as possible. For example, the PV-split algorithm (which generalizes into Young Brothers Wait) splits the tree as low as possible (eventually to the root node) so that all threads can act as independently as possible. There is no local parallelism to be exploited because every branch is radically different from the last branch, and dependent on the result of the last branch. Local parallelism is necessary for a GPU to be effective and conventional game tree search algorithms have none.

dragontamer · on June 1, 2019

> I feel like I've said this a few times but for some reason you're not getting it, not really sure what the hangup is?

I think the hangup is that you aren't understanding what a C++ future is, and why it solves this problem. Remember, I'm operating under the assumption that the GPU-programmer has figured out how to write a C++ future on the GPU.

I kinda have an overall idea of how one would be written, but I recognize that no one has done it yet. But I don't see any reason why a C++ Future couldn't be done. The real question I have is if C++ Futures can be implemented efficiently to make all of this worthwhile. (I've gone through a bunch of ideas in my brain... I think I've got one idea that is "efficient enough"... but many obvious implementations... like atomics or mutexes won't work well on a GPU. An implementation that is innately "task based cooprerative switching" is the only efficient implementation I can think of)

> Like I've been saying, you don't know what you want to q-search next until you've finished the last q-search. So you can't make a q-search queue because you couldn't add anything to the queue until the previous addition was evaluated.

Future<int> f = spawn(new q-search);

f.wait(); <---- Really simple actually. If "f" is blocked, then the current task goes into the "blocked" list. The first task from the "unblocked list" gets pulled off. If the "unblocked list" is empty, then this SIMD-thread idles until someone else in the NVidia warp (or AMD workgroup) calls f.wait() for another opportunity.

Yes, there is some divergence here, but the overall process of f.wait() is efficient enough that I don't think its a big penalty to have the warp stall.

You only run Q-searches that are on the "Running" list. You only need 32-running Q-searches to keep an NVidia warp busy, or 64-running Q-searches to keep an AMD workgroup busy.

Surely, there are at least 32 Q-searches that are unblocked at a time in a parallel chess search? Sure, not at ply 1, but the number of Q-searches that could be performed in parallel grows exponentially per ply, and also remember... an NVidia Warp will operate at 100% utilization with as little as 32-Q searches.

An AMD Workgroup is less efficient, needing 64 Q-searches before operating at 100% utilization. But this is still a small number in the scope of the exponentially increasing chess game tree.

dragontamer · on June 1, 2019

> Alpha-beta in general is so sequential that basically all the algorithms to parallelize it focus on making each thread as independent and sequential as possible. For example, the PV-split algorithm (which generalizes into Young Brothers Wait) splits the tree as low as possible (eventually to the root node) so that all threads can act as independently as possible. There is no local parallelism to be exploited because every branch is radically different from the last branch, and dependent on the result of the last branch. Local parallelism is necessary for a GPU to be effective and conventional game tree search algorithms have none.

I forgot to respond to this half.

Have you heard of Task-based parallelism? Pretty much any recursive algorithm can easily become parallel through the technique.

Alpha-beta tree searches very easily follow the task-based parallelism. Give this page a look: https://software.intel.com/en-us/node/506102

-------

The key element that ties it all together is the future / promise construct. https://en.wikipedia.org/wiki/Futures_and_promises

If you have a task-based parallelism model, with symbolically executed futures/promises (that... instead of waiting immediately on max(alpha, score)... where "alpha" and "core" are both futures... it can immediately return a symbol representing "max(alpha, score)" and carry on with the execution / evaluation. Lazily evaluating the future at a later time when alpha and/or score is ready.

Yeah, its a few advanced threading concepts here and there. But I've seen the primitives written in various frameworks. And from my understanding of GPU-programming, C++ Futures / Promises CAN be implemented in GPUs. Its just that no one has really done it yet.

From those perspectives, I keep looking at something as "sequential" as alpha-beta pruning and I don't see any major thread-divergence issues. Honest. I just don't think anyone has combined these 4 or 5 techniques together in one program yet.

In short:

* GPGPU programming / SIMD

* Task-based parallelism (like TBB): https://software.intel.com/en-us/node/506102

* Futures

* Symbolically-lazily executed symbols associated with the futures (only "Beta-cuttoff" absolutely requires a yield or stall. Everything else can be "lazily" executed at a later point in time, which should discover latent parallelisms in the program).

* Alpha-Beta pruning

-------

You keep saying "it cannot be done", but what I'm saying is "I've seen some new parallel primitives invented in the last 10 years. I think it can be done now". As long as someone ports these primitives over to the GPU.

EDIT: I have looked through the AMD Vega instruction set. Modern GPUs allow the programmer to know the program-counter, to set the program counter (aka a jump), and all that good stuff. Function-pointers, as long as a full AMD workgroup (64-threads) or NVidia warp (32-threads) do in fact work (!) on a modern GPU.

Yeah, there are issues like the CPU Stack (it won't translate to the GPU. So everything talked about here would have to be converted into iterative form). So I haven't solved all the issues yet. But the general thought process looks promising to my eyes

mot524 · on June 1, 2019

I've already spent a bunch of time explaining to you why this won't work. But hey, apparently you're the expert. If you want to reimplement Stockfish on a GPU such than it's faster than a CPU then nobody's stopping you. The source code is only ~10k lines and very well-written, clear, well-documented, and easy to understand. Go for it. You'll be famous as the first person to effectively implement alpha-beta on a GPU and as the author of the strongest conventional chess engine in history. So go for it.

dragontamer · on June 1, 2019

Regardless, I appreciated the discussion! Maybe I'll give it a whack. It all comes down to whether or not I could build "future" efficiently on a GPU.

I don't want to pretend I'm an expert at all. I've looked at symbolic manipulation and chess mostly as a hobby (since the Chess community has so many cool bitwise algorithms they've invented). And the GPU-community is small, but interesting to me. HPC programmers / C++ programmers in another bunch... information really isn't flowing as well between the "styles of compute", I'm just trying to connect a bunch of ideas that other people have thought up together.

mot524 · on June 2, 2019

Okay, as long as we're not arguing, I will tell you that any sort of fine-grained intra-thread communication is to be avoided like the plague in computer chess. Engines are able to search millions of positions per second... most multithreading frameworks (e.g., the ones that give you your "futures" and "promises") rely on locks that can halt threads for milliseconds at a time. It's absolutely disastrous for efficiency.

I believe Stockfish uses the idea of "Lazy SMP" for parallel search which literally means no intra-thread communication other than locks on hash table entries to avoid data corruption of the hash table. Anything more sophisticated has proven to be a loss. So any sort of discussion of tasks, futures, promises, or even Young Brothers Wait is going to be fairly pointless. Also, anything that relies on offloading data to the GPU for processing for more than a few milliseconds is going to be a net loss for obvious reasons. Meaning that even if you could evaluate a million positions in parallel, and you already (somehow) knew which positions you needed to evaluate, doing the evaluation on the GPU is probably still going to be a loss due to the time it takes to copy the positions to/from the GPU.

Really, if you were tasked to come up with an algorithm that was poorly suited for GPGPU computation, it might be computer chess.

That being said, it doesn't really matter. How much were you hoping to speed up Stockfish? AlphaZero still crushes Stockfish at 10:1 time odds. Even if you were able to make Stockfish 10x faster it wouldn't change the dynamic of CNN chess engines being superior to conventional engines.

dragontamer · on June 2, 2019

> Engines are able to search millions of positions per second... most multithreading frameworks (e.g., the ones that give you your "futures" and "promises") rely on locks that can halt threads for milliseconds at a time. It's absolutely disastrous for efficiency.

Yeah, that's what I mean as "efficiently create a future" on the GPU.

GPUs absolutely cannot spinlock, and atomics are way slower than CPU atomics. So mutex / spinlock / atomic based synchronization absolutely won't work.

So... what's my plan? Well... as I said, task-based cooperative multithreaded futures. I've talked about some ideas for how to implement it above... but I won't post another wall of text. Also: some advanced synchronization primitives are free on SIMD-based GPUs (ex: thread-barriers within a warp are literally free and compile down to a no-op). I think there's a rich set of efficient GPU synchronization primitives.

Lazy SMP works for Stockfish, but clearly won't work with 163,840 threads (Vega 64 at max thread occupancy). So I'd have to do something else. (163840 threads locking the RAM one-at-a-time to access a shared hash table? Probably not going to work)

> How much were you hoping to speed up Stockfish?

The Vega64 GPU has 4096 shaders at 1.5 GHz. 5k to 50k shader-cycles per node would give 122 Million nodes/second to 1220 Million nodes/second. Vega64 GPU also has 500 GBps of memory bandwidth of 8GBs of HBM2 RAM, with "64kB Shared Memory" per CU having an aggregate bandwidth of 9,000 GBps (which should lead to potentially better cross-thread synchronization structures within a workgroup)

Hard to say where a project would land, but probably within that range somewhere. It'd HAVE to land there to be successful (and hopefully closer to the 5k shader-cycles per node)

The only major instruction GPUs are missing is the pext and pdep instructions from x86. Sliding piece attacks may need a rework to be efficient on a GPU due to the very low amounts of L1 RAM available. Otherwise, GPUs support popcount, AND, XOR, etc. etc. that chess bitboard use.

BTW: Threadripper 1950x running Stockfish processes 2-million Nodes/Second, or roughly 32k core-cycles per node (4GHz 16-cores)). Which served as the basis of my estimate (+/- a magnitude)

mot524 · on June 2, 2019

I already told you that intra-thread synchronization is to be avoided but you're still talking about it (promises, futures, tasks, etc.) so I don't know what to say. Nothing you're saying is going to work like you think it is, but it seems like you're going to have to figure that out for yourself.

Clearly you want to talk more than you want to program. Maybe reimplementing Stockfish on a GPU is too daunting of a task. Why don't you get something simple like TSCP running on a GPU and see how far you get with that? TSCP is only about 2k lines of code and probably half of those lines are comments. With the state of GPU cores and compilers these days maybe you can get the C code running on a GPU core directly without much modification. TSCP also uses less than 64k of RAM so it should fit nicely in an Nvidia SM.

dragontamer · on June 2, 2019

> Clearly you want to talk more than you want to program.

And clearly you want to insult me, to the point where you've basically created an account on this website just to post what you've been posting. This isn't the first unnecessary swipe you've levied either. But... insults are fine as long as I'm getting useful information out of you.

Anyway, the insult to information ratio is off whack now. I guess its time for the discussion to die. I'm not here to argue dude, just discuss, but you keep wanting to turn this into an argument for some reason.

mot524 · on June 2, 2019

I told you that intra-thread communication will ruin chess engine efficiency, and in your reply, all you do is talk more about stuff that requires intra-thread communication (tasks, futures, promises, etc.). You might think that all I'm here to do is insult you, but from my point of view it seems like all you're here to do is ignore me, even though I'm the one with domain-specific knowledge. It's not a "discussion" if you're not going to consider anything I have to say.

dragontamer · on June 3, 2019

Look, I appreciate domain expertise and all. But seriously, level with me here.

Do you know how GPGPU raytracers work?

Rays are recursively defined: a ray starts at the screen (usually ~1000 points per pixel. A 1920 x 1080 picture with 1000 points per pixel is ~2-billion starting rays). When a Ray hits an object, it will spawn a new ray. When a Ray hits a light source (and the background is the most common light source) it will light up all the objects that were in its path (or its parent's rays path). The life of one ray may be very short (it hits the background), while the life of another ray may have 20+ bounces (if it hits a mirror, then a shiny object, then is subsurface-scattered to a 4th object, etc. etc.)

So we have an example that is both recursively defined (rays bouncing and spawning new rays), requires a bit of synchronization (each ray causes the 1920x1080 picture to light up in its own way), and is efficiently computed by GPUs.

You claim that I'm not listening to you. And of course from my perspective, you aren't listening to me. So give me a bone here. I've talked a LOT about GPGPU raytracers in the posts above, and its synchronization algorithm above already. Did you follow any of what I was talking about?

Or are you just deflecting to Chess Engines because that's what your expertise is? Because if you're honest about this discussion, you would have been talking about GPGPU Raytracers, and how its Raytracing Synchronization algorithm is applicable to the (similarly recursive) alpha-beta minimax tree (or I guess... from your perspective, why the two recursive definitions fail to line up).

--------

Summary: do you know how GPU Raytracers work. If so, then please reply in a way that demonstrates your knowledge. Because from my perspective, you've been avoiding the key point of my argument this entire time.