This depends a lot on how you define "concurrent tasks", but the article provides a definition:
> Let's launch N concurrent tasks, where each task waits for 10 seconds and then the program exits after all tasks finish. The number of tasks is controlled by the command line argument.
Leaving aside semantics like "since the tasks aren't specified as doing anything with side effects, the compiler can remove them as dead code", all you really need here is a timer and a continuation for each "task" -- i.e. 24 bytes on most platforms. Allowing for allocation overhead and a data structure to manage all the timers efficiently, you might use as much as double that; with some tricks (e.g. function pointer compression) you could get it down to half that.
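To make that accounting concrete, here's a minimal sketch (mine, not the article's or its Rust appendix's) of the "timer plus continuation" idea in Go: a small per-task record plus a min-heap to order the deadlines. Note that a Go func value is a single pointer, so this record is 16 bytes on 64-bit; the 24-byte figure above budgets a separate code pointer and environment pointer for the continuation.

```go
// Minimal sketch (not the article's code): each "task" is a deadline plus a
// continuation, kept in a min-heap ordered by deadline.
package main

import (
	"container/heap"
	"fmt"
	"time"
	"unsafe"
)

type task struct {
	deadline int64  // 8 bytes: when to fire (unix nanoseconds)
	cont     func() // 8 bytes: a Go func value is a single pointer
}

type taskHeap []task

func (h taskHeap) Len() int           { return len(h) }
func (h taskHeap) Less(i, j int) bool { return h[i].deadline < h[j].deadline }
func (h taskHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *taskHeap) Push(x any)        { *h = append(*h, x.(task)) }
func (h *taskHeap) Pop() any {
	old := *h
	t := old[len(old)-1]
	*h = old[:len(old)-1]
	return t
}

func main() {
	fmt.Println("bytes per task:", unsafe.Sizeof(task{})) // 16 on 64-bit

	const n = 1_000_000
	tasks := make(taskHeap, 0, n)
	deadline := time.Now().Add(10 * time.Second).UnixNano()
	for i := 0; i < n; i++ {
		tasks = append(tasks, task{deadline: deadline, cont: func() {}})
	}
	heap.Init(&tasks)

	// Run each continuation once its deadline has passed.
	for tasks.Len() > 0 {
		t := heap.Pop(&tasks).(task)
		if d := time.Duration(t.deadline - time.Now().UnixNano()); d > 0 {
			time.Sleep(d)
		}
		t.cont()
	}
}
```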
Eyeballing the graph, it looks like the winner is around 200MB for 1M concurrent tasks, so about 4x worse than a reasonably efficient but not heavily optimized implementation would be.
I have no idea what Go is doing to get 2500 bytes per task.
> I have no idea what Go is doing to get 2500 bytes per task.
TFA creates a goroutine (green thread) for each task, using a WaitGroup to synchronise them (a rough reconstruction follows below). IIRC goroutines default to 2 KiB stacks, so that's about right.
One could argue that's not fair and it should use timers, which would be much lighter. There's no "efficient wait" for them, but that's essentially the same as the appendix Rust program.
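For reference, roughly what the goroutine-per-task version looks like (my reconstruction from the description, not the article's exact code): one goroutine per task, each sleeping 10 seconds, synchronised with a sync.WaitGroup.

```go
// One goroutine per task; every goroutine gets its own stack (~2 KiB to start).
package main

import (
	"os"
	"strconv"
	"sync"
	"time"
)

func main() {
	numTasks, _ := strconv.Atoi(os.Args[1]) // task count from the command line
	var wg sync.WaitGroup
	for i := 0; i < numTasks; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			time.Sleep(10 * time.Second)
		}()
	}
	wg.Wait()
}
```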
The argument, then, is: what if we DO load 2K worth [0] of randomized data into each of those 1M goroutines (and the equivalents in the other languages) and do some actual processing (see the sketch below)? Would we still see the same ~10x memory "bloat", or whatever the math works out to be? And what about performance?
We, as devs, have "4" such resources available to us: memory, network, I/O, and compute. It behooves us not to prematurely optimize for just one.
[0] I can see more arguments/discussions now, "2K is too low, it should be 2MB" etc...!
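A sketch of that variant (hypothetical, not from the article): each goroutine holds its own 2 KiB of random data and does a token amount of processing on it after the sleep.

```go
// Hypothetical variant: 2 KiB of random data and a little work per task.
package main

import (
	"crypto/rand"
	"os"
	"strconv"
	"sync"
	"time"
)

func main() {
	numTasks, _ := strconv.Atoi(os.Args[1])
	results := make([]byte, numTasks)
	var wg sync.WaitGroup
	for i := 0; i < numTasks; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			buf := make([]byte, 2048) // 2K of data held per task
			rand.Read(buf)
			time.Sleep(10 * time.Second)
			var sum byte
			for _, b := range buf { // "some actual processing"
				sum ^= b
			}
			results[i] = sum
		}(i)
	}
	wg.Wait()
}
```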
So the argument is “if you measure something completely different from and unrelated to the article you do not get the same result”?
I guess that’s true.
And to be clear, I do agree with the top comment (which seems to be by you): TFA uses timers in the other runtimes, and Go does have timers, so using goroutines is unwarranted and unfair. And I said as much a few comments up (although I'd forgotten about AfterFunc, sketched below, so I'd have looped and waited on timer.After, which would still have been a pessimisation).
And after thinking more about it, the article is also outright lying: technically it's only measuring tasks in Go. Timers are futures / awaitables, but they're not tasks: they're not independently scheduled units of work, and they're pretty much always special-cased by runtimes.
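A sketch of the AfterFunc variant mentioned above (mine, not the article's): the callbacks sit in the runtime's timer heap, so no goroutine is parked per task for the 10-second wait; a goroutine only runs briefly when each timer fires.

```go
// Timer-based variant: no per-task goroutine sleeping, just registered timers.
package main

import (
	"os"
	"strconv"
	"sync"
	"time"
)

func main() {
	numTasks, _ := strconv.Atoi(os.Args[1])
	var wg sync.WaitGroup
	wg.Add(numTasks)
	for i := 0; i < numTasks; i++ {
		// time.AfterFunc registers a timer; the callback runs in its own
		// goroutine only once the duration has elapsed.
		time.AfterFunc(10*time.Second, wg.Done)
	}
	wg.Wait()
}
```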
You know what I mean. If this was a real world program where those million tasks actually performed work, then this stack space is available for the application to do that work.
It’s not memory that’s consumed by the runtime, it’s memory the runtime expects the program to use - it’s just that this program does no useful work.
I am not u/masklinn - but I don't know what you mean. Doesn't the runtime consume memory by setting it aside for future use? Like what else does "using" ram mean other than claiming it for a time?
I think he means that if the Go code had done something more useful, it would use about the same amount of memory. Compare that to another implementation, which might allocate nearly no memory when the tasks don't do anything significant but would quickly catch up to Go if they did.
If the example was extended to, say, once the sleep is completed then parse and process some JSON data (simulating the sleep being a wait on some remote service), then how would memory use be affected?
In the Go number reported, the majority of the memory is the stack space Go allocated for the application code, anticipating that processing will happen. In the Node example, that processing would instead need heap allocation.
Point being that the two numbers are different - one measures just the overhead of the runtime, the other adds the memory reserved for the app to do work.
The result then looks wasteful for Go because the benchmark... doesn't do anything. In a real app, though, preallocating stack can often be faster than doing just-in-time heap allocation.
Not always of course! Just noting that the numbers are different things; one is runtime cost, one is runtime cost plus an optimization that assumes memory will be needed for processing after the sleep.
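One way the question above could be made concrete (hypothetical code, not from the article): after the sleep, each task unmarshals a small JSON payload, so both the preallocated goroutine stacks and the heap actually get exercised.

```go
// Hypothetical extension: sleep stands in for waiting on a remote service,
// then each task parses a small JSON reply.
package main

import (
	"encoding/json"
	"sync"
	"time"
)

type reply struct {
	ID    int    `json:"id"`
	Value string `json:"value"`
}

func main() {
	payload := []byte(`{"id": 1, "value": "hello"}`)
	const numTasks = 100_000
	var wg sync.WaitGroup
	for i := 0; i < numTasks; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			time.Sleep(10 * time.Second) // stand-in for waiting on a remote service
			var r reply
			_ = json.Unmarshal(payload, &r) // stack for call frames, heap for the decoded value
		}()
	}
	wg.Wait()
}
```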
I mean, so I guess you are saying that other languages are better at estimating the memory usage than Go, since Go will never need this memory it has allocated? Like, Go knows everything that will happen in that goroutine; it should be able to right-size it. I don't think it "looks" wasteful for Go to allocate memory it should know it doesn't need at compile time - I think it is wasteful. Though it's probably not a meaningful waste most of the time.
I agree that it would also be interesting to benchmark some actual stack or heap usage and how the runtimes handle that, but if you are running a massively parallel app you do sometimes end up scheduling jobs to sleep (or, perhaps more likely, to prepare to act but then never do and get cancelled). So I think this is a valid concern, even though it's not the only thing that matters.
Allocating virtual memory is distinct from actually consuming physical memory (RAM).
If a process allocates many pages of virtual memory, but never actually reads or writes to that memory, then it's unlikely that any physical memory backs those pages. In this sense, allocating memory is really just bookkeeping in the operating system. It's when you try to read or write that memory that the operating system will actually allocate physical memory for you.
When you first try to access the virtual memory you've allocated, there will be a page fault, causing the OS to determine if you're actually allowed to read or write to it. If you've previously allocated it, then all is good, and the OS will allocate some physical memory for you, and make your virtual memory pointers point to that physical memory. If not, well, then that's a segfault. It's not until you first try to use the memory that you actually consume RAM.
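A Linux-only sketch of this point (my example; exact behaviour depends on the OS and allocator, and it assumes enough address space for a 1 GiB slice): a large allocation barely moves resident memory until the pages are actually written.

```go
// Demonstrates virtual allocation vs. resident memory by reading VmRSS
// from /proc/self/status before and after touching the pages (Linux only).
package main

import (
	"fmt"
	"os"
	"strings"
)

// rss reports the VmRSS line from /proc/self/status.
func rss() string {
	data, _ := os.ReadFile("/proc/self/status")
	for _, line := range strings.Split(string(data), "\n") {
		if strings.HasPrefix(line, "VmRSS:") {
			return strings.TrimSpace(strings.TrimPrefix(line, "VmRSS:"))
		}
	}
	return "unknown"
}

func main() {
	fmt.Println("RSS before allocation:", rss())

	buf := make([]byte, 1<<30) // ~1 GiB of virtual memory, not yet touched
	fmt.Println("RSS after allocation: ", rss())

	for i := 0; i < len(buf); i += 4096 { // write one byte per page to fault them in
		buf[i] = 1
	}
	fmt.Println("RSS after touching:   ", rss())
	fmt.Println("check:", buf[0]) // keep buf live
}
```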
I'll concede it would be interesting to have a similar benchmark where, instead of sleeping (which indeed by itself is nonsense), each task computes a small Fibonacci sequence, or writes a small file; something like that.
If that memory isn't being used and other things need the memory then the OS will very quickly dump it into swap, and as it's never being touched the OS will never need to bring it back in to physical memory. So while it's allocated it doesn't tie up the physical RAM.
Aha, 2k stacks. I figured that stacks would be page size (or more) so 2500 seemed both too small for the thread to have a stack and too large for it to not have a stack.
2k stacks are an interesting design choice though... presumably they're packed, in which case stack overflow is a serious concern. Most threading systems will do something like allocating a single page for the stack but reserving 31 guard pages in case it needs to grow.
Goroutines being Go structures, the runtime can cooperate with itself, so it doesn't need to do any sort of probing: function prologues check whether there's enough stack space for their frame, and grow the stack if not (demonstrated below).
In reality it does use a guard area (technically I think it's more of a redzone? It doesn't cause access errors and functions with known small static frames can use it without checking).
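A small demonstration of that (my example): a goroutine starts with a ~2 KiB stack, yet a frame far larger than that works fine, because the prologue check triggers the runtime to grow (copy) the stack rather than fault into a guard page.

```go
// A goroutine's initial stack is tiny, but a 64 KiB frame still works:
// the prologue check makes the runtime grow the stack on demand.
package main

import (
	"fmt"
	"sync"
)

//go:noinline
func bigFrame() int {
	// A 64 KiB local, far beyond the initial ~2 KiB goroutine stack.
	var buf [64 * 1024]byte
	for i := range buf {
		buf[i] = byte(i)
	}
	return int(buf[len(buf)-1])
}

func main() {
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		fmt.Println("result:", bigFrame()) // no stack overflow
	}()
	wg.Wait()
}
```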
Yeah it’s the drawback, originally it used segmented stacks but that has its own issues.
And it's probably not the worst issue, because deep stacks and stack pointers will mostly be relevant for long-running routines, which will stabilise their stack use after a while (even if some are likely subject to threshold effects if they're at the edge; I would not be surprised if some codebases ballasted stacks ahead of time). Also, stack pointers get promoted to the heap if they escape, so the number of stack pointers is not unlimited, and the pointer has to live downwards on the stack.
A goroutine stack can grow. (EDIT: With stack copying AFAICT... so no virtual pages reserved for a stack to grow... probably some reason for this design?)