> Another scenario: say you are working on a 3D game and you have some tricky physics math where you need to crunch numbers, maybe adding realistic building physics to Minecraft. But separate threads are already handling procedural generation of new chunks, rendering, networking, and player input. If those are keeping most of the system's cores busy, then parallelizing your physics code isn't going to help overall.
Generally, game engines have been migrating towards work-stealing, task-based architectures. Monolithic per-system threads (one each for physics, rendering, gameplay, etc.) were great for migrating from the single-threaded games of old, but they quite often lead to idle threads.
This was even more critical in the PS3 era, where you had SPUs with just ~256 KB of local store each. Overall, it leads to an architecture that scales well to whatever platform you end up targeting, since the CPU/compute capabilities of various platforms can be pretty disparate.
How are Golang goroutines for implementing work-stealing?
For a game engine where a task-based structure is important, Go would be a pretty poor choice. You don't want a GC in your inner engine loop. In addition, Go has pretty poor support for explicit memory layout, something that can be critical for performance. It's very common to issue cache prefetches for the next task as one is spinning down, for instance.
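For what it's worth, the Go runtime's scheduler already does work-stealing across its per-P run queues, so in Go you typically don't implement stealing yourself: you submit tasks as goroutines (or through a worker pool) and let the runtime balance them. A minimal sketch of that style, with `task` and `runAll` being my own names rather than any engine's API:

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// A task is just a closure. The Go scheduler steals runnable
// goroutines between per-P run queues, so no explicit stealing
// logic appears here; the runtime does the balancing.
type task func()

func runAll(tasks []task) {
	workers := runtime.GOMAXPROCS(0) // one worker per logical CPU
	queue := make(chan task)
	var wg sync.WaitGroup
	wg.Add(workers)
	for i := 0; i < workers; i++ {
		go func() {
			defer wg.Done()
			for t := range queue {
				t()
			}
		}()
	}
	for _, t := range tasks {
		queue <- t
	}
	close(queue)
	wg.Wait()
}

func main() {
	tasks := make([]task, 100)
	for i := range tasks {
		n := i
		tasks[i] = func() { _ = n * n } // stand-in for real work
	}
	runAll(tasks)
	fmt.Println("all tasks done")
}
```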
I have a game server written in Go.
> You don't want a GC in your inner engine loop.
Why not? I expect my typical GC pause times to be 1 or 2 ms under Go 1.7. But if my server had pause times of 20ms, I wouldn't care, so long as it didn't happen more often than once a second or so.
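If you want to verify that expectation rather than guess, the runtime exposes the most recent stop-the-world pause durations; a minimal sketch:

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	// Generate some garbage so the collector has work to do.
	for i := 0; i < 1000000; i++ {
		_ = make([]byte, 1024)
	}
	runtime.GC()

	var stats runtime.MemStats
	runtime.ReadMemStats(&stats)

	// PauseNs is a circular buffer of the last 256 GC
	// stop-the-world pause durations, in nanoseconds.
	n := int(stats.NumGC)
	if n > 10 {
		n = 10 // print only the ten most recent pauses
	}
	for i := 0; i < n; i++ {
		pause := stats.PauseNs[(int(stats.NumGC)-1-i)%256]
		fmt.Printf("pause -%d: %v\n", i, time.Duration(pause))
	}
}
```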
> It's very common to issue cache prefetches for the next task as one is spinning down, for instance.
Yeah, I've had Go guys say I should give every agent its own goroutine. As it turns out, this can cost 150-300 microseconds every time one of them wakes up. I have a traditional game loop right now.
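A crude way to measure that wakeup cost yourself (the numbers depend heavily on the machine, load, and Go version):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	const rounds = 1000
	ping := make(chan time.Time)  // unbuffered: each send wakes the receiver
	done := make(chan time.Duration)

	go func() {
		var total time.Duration
		for i := 0; i < rounds; i++ {
			sent := <-ping
			total += time.Since(sent) // time from send until this goroutine ran
		}
		done <- total
	}()

	for i := 0; i < rounds; i++ {
		ping <- time.Now()
		time.Sleep(time.Millisecond) // let the receiver park again
	}
	fmt.Printf("mean wakeup latency: %v\n", <-done/rounds)
}
```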
> Why not? I expect my typical GC pause times to be 1 or 2 ms under Go 1.7. But if my server had pause times of 20ms, I wouldn't care, so long as it didn't happen more often than once a second or so.
We're talking about client-side latency here. You've got 16.6ms to do everything in a game engine. Even 2ms is 1/8th of your frame and a significant amount of work. 20ms is two missed frames which is really bad.
The types of engines where work stealing and task-based architectures are prevalent are usually of the AAA variety, where performance is critical and the scenes are either very complex or very large. In this space, GC-based languages have been limited in scope to gameplay scripting, and even then they are usually very fast (Lua or equivalent).
And this is targeting a mere 60 Hz. Oculus overclocked the DK2's display panel to run higher than spec (75 Hz) and shipped even higher refresh rates in the CV1 (90 Hz) because of its importance in reducing nausea. To say nothing of the enthusiasts running 120 Hz or 144 Hz displays - at 144 Hz the frame budget is only ~6.9 ms, so a 2 ms pause blows through more than a quarter of it 'at random' (read: you must assume it could happen in any frame) unless you can control when GC occurs.
> 20ms is two missed frames which is really bad.
If you're doing VR, this is bad enough that some of your customers may hurl. I've gone after fixing single frame hitches which were caused by as little as an extra 1ms spike at exactly the wrong time in non VR games. Unfixable 20ms spikes would be a total non-starter.
There's a lot of applications where you can tolerate an extra 1-2ms spike. A game server, where network latency is an order of magnitude or two larger, almost certainly counts. And if the improved productivity from using the language lets you optimize away an additional 1-2ms cost elsewhere that you wouldn't have had time for otherwise, it can even be a net win.
For me, so much of the stuff I use to boost my productivity is missing from Go - by design, no less - that my productivity would be going the wrong way.
There was a design document about this from 4 years ago, but I'm not sure how well it reflects the current implementation:
For the uninitiated -- there's probably a large list of "don'ts" that would be good here, too. But maybe a prerequisite question is the right starting point: why are you trying to write [this code] to execute in parallel? "It's slow" isn't good enough. You can save yourself a lot of time and headache if you use a profiler to understand what the system is doing that makes it slow.
Using that profiler will at least help partition the problem into one of two big domains: compute-resource bound (CPU/bus/memory/cache, etc.), or just waiting on results from some other async task in the system. The latter happens much more often than you might expect. Filesystems, contended locks/mutual exclusion, networks, databases, IPC -- there's lots of stuff for which running multiple tasks in parallel might not help much.
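In Go, for example, getting that first split can be as simple as a CPU profile: if the hot samples land in your own compute code you're resource-bound, and if the time vanishes into syscalls or waits you're pending on something else. A minimal sketch, where `busyWork` stands in for whatever you're investigating:

```go
package main

import (
	"log"
	"os"
	"runtime/pprof"
)

// busyWork is a placeholder for the code you suspect is slow.
func busyWork() {
	total := 0
	for i := 0; i < 1000000000; i++ {
		total += i % 7
	}
	_ = total
}

func main() {
	f, err := os.Create("cpu.prof")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Sample the CPU while the workload runs; inspect afterwards
	// with `go tool pprof cpu.prof`.
	if err := pprof.StartCPUProfile(f); err != nil {
		log.Fatal(err)
	}
	defer pprof.StopCPUProfile()

	busyWork()
}
```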
- Always profile first! Then go for low-hanging single-threaded fruit. If you also parallelise later, the two speedups multiply.
- Use task manager/activity monitor/top frequently to check what CPU usage your program is actually getting. If your code is using N * 99% (for N cores) continuously then you're doing fine. You should be suspicious if your usage has spikes, implying e.g. an IO bottleneck somewhere.
- Rule of thumb: parallelisation alone will at best give you a linear speedup. Super-linear speedup is possible, but only for very specific problems like linear algebra, and even then it's usually a memory effect: e.g. the per-core working set now fits in L1 cache instead of spilling to RAM.
- Build some intuition for how fast your code ought to run. I've chased down many bottlenecks only to realise that the thing I thought was the culprit was actually extremely fast (I'm sure we've all done this). See http://norvig.com/21-days.html#answers, but also make your own measurements of the time required to save files, modify images, zip things, etc. You can catch this definitively by profiling, of course, but it's nice to be able to guess where to start prodding.
- On the flipside, learn to recognise when things are embarrassingly parallel and act accordingly. Here a profiler won't help you. Sometimes you're just lucky and you can get linear gains for free.
- Think about what level you want to parallelise at. If you know that your single-threaded code is as good as it gets, can you get away with running multiple instances of it? Often it's far simpler to write a wrapper than to explicitly parallelise low-level code (see the sketch after this list).
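As a concrete (and entirely hypothetical) example of that last point, the wrapper can be a few lines of Go; `./process-shard` and the shard file names below are placeholders for your real single-threaded tool and its inputs:

```go
package main

import (
	"fmt"
	"os/exec"
	"sync"
)

func main() {
	// One independent input per instance: embarrassingly parallel.
	shards := []string{"shard-0.dat", "shard-1.dat", "shard-2.dat", "shard-3.dat"}

	var wg sync.WaitGroup
	for _, shard := range shards {
		shard := shard // capture loop variable (pre-Go 1.22)
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Run one instance of the existing single-threaded binary
			// per shard instead of parallelising its internals.
			out, err := exec.Command("./process-shard", shard).CombinedOutput()
			if err != nil {
				fmt.Printf("%s failed: %v\n", shard, err)
				return
			}
			fmt.Printf("%s: %s", shard, out)
		}()
	}
	wg.Wait()
}
```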
> What you forgot was that the web server was already parallelizing things at a higher level, using all 24 production cores to handle multiple requests simultaneously.
The reason you didn't see improved performance is that your server's load is high; it's maxed out on the work it can do. Obviously no amount of parallelism can fix that.
Double the number of servers processing requests, and then your nifty in-memory parallelism can make the difference you intended. Note: had you doubled the servers _without_ adding the extra parallelism, the extra hardware would not necessarily have made a difference.
You did a good thing; you were just dealing with overloaded hardware.
At the same time, one bit that the post didn't talk about (and for good reason, it's not the point of the post) is that a strength of Rust is that Rayon enjoys the same safety guarantees as the threading stuff that _is_ built into the standard library, thanks to the trait system. Rayon will still prevent data races at compile time, just like all Rust code.
SIMDArray for F#/C# will be similarly easy once it is a NuGet package. It just puts extension methods on Array.
I wrote a gradient descent solver a while back, before I realized Vowpal Wabbit existed.
Vowpal Wabbit had a dedicated thread to read in records and pass them through a queue to another thread doing the math. I didn't, because it was a quick POC impl and I couldn't be arsed. My impl was faster.
This advice probably applies to most uses of the auto-parallel collection implementations the author was pointing out.
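For reference, the reader-plus-queue structure described above looks roughly like this in Go (a sketch, not Vowpal Wabbit's actual code); the point of the anecdote is that the hand-off itself has a cost that a plain single-threaded loop avoids:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"sync"
)

func main() {
	records := make(chan float64, 1024) // the queue between the two "threads"
	var wg sync.WaitGroup

	// Consumer: does the math.
	wg.Add(1)
	go func() {
		defer wg.Done()
		sum := 0.0
		for x := range records {
			sum += x * x // stand-in for the real per-record math
		}
		fmt.Println("sum of squares:", sum)
	}()

	// Producer: reads one number per line from stdin.
	sc := bufio.NewScanner(os.Stdin)
	for sc.Scan() {
		if x, err := strconv.ParseFloat(sc.Text(), 64); err == nil {
			records <- x
		}
	}
	close(records)
	wg.Wait()
}
```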
The loop abstractions don't give you the control to ensure that, though I would hope the good ones are smart enough to at least keep each thread on one core and on its own chunk of the array.
That would be interesting to explore.
Linux has controls to set the affinity of threads for cores.
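That's sched_setaffinity(2). From Go you can reach it through golang.org/x/sys/unix, though you must also pin the goroutine to its OS thread for the affinity to mean anything; a Linux-only sketch:

```go
package main

import (
	"fmt"
	"runtime"

	"golang.org/x/sys/unix"
)

func main() {
	// Goroutines migrate between OS threads, so lock this one down
	// first or the affinity setting is meaningless.
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()

	var set unix.CPUSet
	set.Zero()
	set.Set(0) // allow CPU core 0 only

	// pid 0 means "the calling thread".
	if err := unix.SchedSetaffinity(0, &set); err != nil {
		fmt.Println("sched_setaffinity failed:", err)
		return
	}
	fmt.Println("pinned to core 0")
}
```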
I love the brief interjection in the otherwise objective discussion.
Generally this is extremely bad advice, especially with:
> What you forgot was that the web server was already parallelizing things at a higher level
Apache parallelized like that, and it was bad.
Nginx then introduced an event loop, which was way better. However, your application still needs to deal with it: if one request blocks (takes too much time), that nginx worker blocks too, and if that happens on all workers your website will be extremely slow.
An event-loop mechanism always means your code should be non-blocking, and that is hard. It's especially hard when dealing with external resources, e.g. databases, external services, caches, file I/O. If you just say "well, my web server will handle that", you will be in really, really bad shape.
> An event-loop mechanism always means your code should be non-blocking, and that is hard.
So you are saying you need to think about the system your code is running in?
Even on an event-loop server, if CPU utilization is sufficiently high, parallelizing a function can be counterproductive due to overhead and cache thrashing.
That won't be the case in an event-loop system; it could be the case if you simply have too little hardware. But then even your non-parallelized code will fail.
Though, one thing I wish was included is memory utilization. I would expect Rust and C++ to be significantly better than Java in this realm, due to the overhead of the JVM.
> For some reason people give it a bad rap,
Java has been fast for at least ten years.
Except start-up time. Java still has way too long a startup time, which you notice in desktop apps implemented in Java.
However, there are lots of crap desktop apps written in Java that don't care about startup time. You can write apps with bad startup time in any language: look at OpenOffice (C++).
- https://github.com/icodeforlove/task.js