
> Engines are able to search millions of positions per second... most multithreading frameworks (e.g., the ones that give you your "futures" and "promises") rely on locks that can halt threads for milliseconds at a time. It's absolutely disastrous for efficiency.

Yeah, that's what I mean by "efficiently create a future" on the GPU.

GPUs absolutely cannot spinlock, and GPU atomics are far slower than CPU atomics. So mutex-, spinlock-, or atomic-based synchronization absolutely won't work.

So... what's my plan? Well... as I said, task-based cooperative multithreaded futures. I've talked about some ideas for how to implement it above, but I won't post another wall of text. Also: some advanced synchronization primitives are free on SIMD-based GPUs (e.g., thread barriers within a warp are literally free and compile down to a no-op). I think there's a rich set of efficient GPU synchronization primitives to draw from.
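To make the warp-level point concrete, here's a tiny CUDA sketch (CUDA rather than GCN assembly just for familiar syntax). __shfl_xor_sync, __ballot_sync, and __popc are real intrinsics; the reduction itself is only an illustration of cross-lane cooperation that costs essentially nothing:

    // Warp-wide reduction and vote with no locks, no shared memory, and no
    // futures: every lane exchanges registers directly with other lanes.
    // Launch with a single warp, e.g. warpReduceDemo<<<1, 32>>>(in, out);
    __global__ void warpReduceDemo(const int *in, int *out) {
        unsigned mask = 0xffffffffu;    // all 32 lanes participate
        int v = in[threadIdx.x];

        // Butterfly reduction across the warp: no memory traffic at all.
        for (int offset = 16; offset > 0; offset >>= 1)
            v += __shfl_xor_sync(mask, v, offset);

        // Warp-wide vote: which lanes loaded a positive value?
        unsigned positives = __ballot_sync(mask, in[threadIdx.x] > 0);

        if (threadIdx.x == 0) {
            out[0] = v;                 // sum of all 32 inputs
            out[1] = __popc(positives); // how many inputs were positive
        }
    }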

Lazy SMP works for Stockfish, but it clearly won't work with 163,840 threads (a Vega 64 at maximum occupancy), so I'd have to do something else. (163,840 threads locking RAM one at a time to access a shared hash table? Probably not going to work.)
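For the shared hash table in particular, the usual answer in the chess world isn't a lock at all. One well-known approach is Hyatt-style "lockless hashing" (I'm not claiming Stockfish does exactly this): each entry stores key XOR data next to the data, so a torn write from a racing thread simply fails the XOR check and reads as a miss. A rough sketch, with a made-up entry layout:

    #include <cstdint>

    struct TTEntry {
        uint64_t keyXorData;   // zobrist key XOR'd with the packed data
        uint64_t data;         // packed score / depth / best move / bound
    };

    __device__ void tt_store(TTEntry *table, uint64_t numEntries,
                             uint64_t key, uint64_t data) {
        TTEntry *e = &table[key % numEntries];
        e->keyXorData = key ^ data;   // two plain stores: no locks, no atomics
        e->data       = data;
    }

    __device__ bool tt_probe(const TTEntry *table, uint64_t numEntries,
                             uint64_t key, uint64_t *dataOut) {
        const TTEntry *e = &table[key % numEntries];
        uint64_t d = e->data;
        if ((e->keyXorData ^ d) != key)
            return false;             // torn or stale entry: treat as a miss
        *dataOut = d;
        return true;
    }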

> How much were you hoping to speed up Stockfish?

The Vega 64 GPU has 4096 shaders at 1.5 GHz. At 5k to 50k shader-cycles per node, that would give 122 million to 1220 million nodes/second. The Vega 64 also has 500 GB/s of memory bandwidth to its 8 GB of HBM2 RAM, and the 64 kB of shared memory per CU has an aggregate bandwidth of roughly 9,000 GB/s (which should allow better cross-thread synchronization structures within a workgroup).

Hard to say where a project would actually land, but probably somewhere in that range. It would HAVE to land there to be successful (and hopefully closer to the 5k-shader-cycles-per-node end).

The only major instructions GPUs are missing are x86's pext and pdep. Sliding-piece attacks may need a rework to be efficient on a GPU due to the very small amount of L1 cache available. Otherwise, GPUs support popcount, AND, XOR, and everything else chess bitboards use.
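And if pext turns out to matter, the software fallback is short (or you sidestep it entirely with magic bitboards, which only need a multiply, a shift, and a small table lookup). A sketch of the fallback; nothing about it is GPU-specific beyond the __device__ tag:

    #include <cstdint>

    // Extract the bits of x selected by mask and pack them into the low bits,
    // i.e. a software stand-in for x86's pext.
    __device__ uint64_t pext_sw(uint64_t x, uint64_t mask) {
        uint64_t result = 0;
        for (uint64_t bit = 1; mask; bit += bit) {
            if (x & mask & -mask)     // lowest set bit still left in the mask
                result |= bit;
            mask &= mask - 1;         // clear that bit and continue
        }
        return result;
    }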

BTW: a Threadripper 1950X running Stockfish processes about 2 million nodes/second, or roughly 32k core-cycles per node (16 cores at 4 GHz), which served as the basis of my estimate (+/- an order of magnitude).
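For anyone who wants to check the arithmetic behind those numbers (rounding aside):

    Threadripper 1950X: 16 cores x 4 GHz              = 64e9 core-cycles/sec
                        64e9 / 2e6 nodes/sec          = 32,000 core-cycles/node
    Vega 64:            4096 shaders x 1.5 GHz        = 6.144e12 shader-cycles/sec
                        6.144e12 / 50,000 cycles/node = 122.88 million nodes/sec
                        6.144e12 /  5,000 cycles/node = 1228.8 million nodes/sec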



I already told you that inter-thread synchronization is to be avoided, but you're still talking about it (promises, futures, tasks, etc.), so I don't know what to say. Nothing you're describing is going to work the way you think it will, but it seems like you're going to have to figure that out for yourself.

Clearly you want to talk more than you want to program. Maybe reimplementing Stockfish on a GPU is too daunting of a task. Why don't you get something simple like TSCP running on a GPU and see how far you get with that? TSCP is only about 2k lines of code and probably half of those lines are comments. With the state of GPU cores and compilers these days maybe you can get the C code running on a GPU core directly without much modification. TSCP also uses less than 64k of RAM so it should fit nicely in an Nvidia SM.


> Clearly you want to talk more than you want to program.

And clearly you want to insult me, to the point where you've basically created an account on this site just to post what you've been posting. This isn't the first unnecessary swipe you've taken, either. But... insults are fine as long as I'm getting useful information out of you.

Anyway, the insult-to-information ratio is out of whack now, so I guess it's time for this discussion to die. I'm not here to argue, dude, just to discuss, but you keep wanting to turn this into an argument for some reason.


I told you that inter-thread communication will ruin chess engine efficiency, and in your reply, all you do is talk about more things that require inter-thread communication (tasks, futures, promises, etc.). You might think all I'm here to do is insult you, but from my point of view, all you're here to do is ignore me, even though I'm the one with domain-specific knowledge. It's not a "discussion" if you're not going to consider anything I have to say.


Look, I appreciate domain expertise and all. But seriously, level with me here.

Do you know how GPGPU raytracers work?

Rays are recursively defined: a ray starts at the screen (usually ~1000 samples per pixel; a 1920x1080 image at 1000 samples per pixel is ~2 billion starting rays). When a ray hits an object, it spawns a new ray. When a ray hits a light source (and the background is the most common light source), it lights up all the objects along its path (and its parent rays' paths). The life of one ray may be very short (it hits the background immediately), while another ray may take 20+ bounces (it hits a mirror, then a shiny object, then gets subsurface-scattered to a fourth object, and so on).

So we have an example that is recursively defined (rays bouncing and spawning new rays), requires a bit of synchronization (each ray lights up the 1920x1080 image in its own way), and is computed efficiently by GPUs.
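Concretely, the synchronization pattern those raytracers use looks roughly like this (CUDA for concreteness; every name here is invented for the sketch, and a real wavefront renderer has a lot more machinery):

    // Wavefront-style ray processing: a terminated ray accumulates into its
    // pixel; a bouncing ray appends one child ray to a global queue with a
    // single atomicAdd. The host relaunches the kernel on the new queue until
    // it comes back empty. No locks, no futures, no blocked threads.
    struct Ray { float ox, oy, oz, dx, dy, dz; int pixel; float throughput; };
    struct Hit { bool hitLight; bool bounced; Ray child; float radiance; };

    // Stand-in for BVH traversal + shading so the sketch compiles; a real
    // renderer does the actual intersection and BSDF sampling here.
    __device__ Hit intersectScene(const Ray &r) {
        Hit h{};
        h.hitLight = true;        // pretend every ray escapes to the sky light
        h.radiance = 1.0f;
        return h;
    }

    __global__ void traceWave(const Ray *inRays, int inCount,
                              Ray *outRays, int *outCount, float *image) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= inCount) return;

        Ray r = inRays[i];
        Hit h = intersectScene(r);

        if (h.hitLight) {
            // Ray terminated on a light source: light up its pixel.
            atomicAdd(&image[r.pixel], h.radiance * r.throughput);
        } else if (h.bounced) {
            // Ray spawned a child: reserve a slot in the next wave's queue.
            int slot = atomicAdd(outCount, 1);
            outRays[slot] = h.child;
        }
    }
    // Host side: launch traceWave over the ~2 billion primary rays in chunks,
    // then keep relaunching it on the bounce queue (swapping in and out)
    // until the queue is empty.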

You claim that I'm not listening to you. And of course, from my perspective, you aren't listening to me. So throw me a bone here. I've talked a LOT about GPGPU raytracers, and their synchronization algorithm, in the posts above. Did you follow any of it?

Or are you just deflecting to chess engines because that's where your expertise is? Because if you were engaging with this discussion honestly, you'd be talking about GPGPU raytracers and how their synchronization algorithm applies to the (similarly recursive) alpha-beta minimax tree (or, I guess, from your perspective, why the two recursive definitions fail to line up).

--------

Summary: do you know how GPU raytracers work? If so, then please reply in a way that demonstrates that knowledge. Because from my perspective, you've been avoiding the key point of my argument this entire time.



