
Fibers aren’t useful for much any more - mappu
https://devblogs.microsoft.com/oldnewthing/20191011-00/?p=102989
======
dang
Some of the comments refer to the paper at [http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2018/p136...](http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2018/p1364r0.pdf), which was the original submitted URL.

~~~
johannkokos
"Response to 'Fibers under the magnifying glass'" from the authors of Boost.Fiber, at

[http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p086...](http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p0866r0.pdf).

And "Response to response to 'Fibers under the magnifying glass'", at

[http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p152...](http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1520r0.pdf)

~~~
gpderetta
The saga of stackful coroutines vs. stackless coroutines vs. zero-overhead
coroutines has been going on in C++ land for a while now.

Gor so far seems to be ahead, as stackless coroutines are part of the standard.

------
sseth
It is best to treat the linked paper (Fibers under the magnifying glass) as a
review of some implementations of fibers, not a repudiation of fibers in
general. In particular, for this paper to apply to golang, several things
would need to change:

Memory footprint: the paper says the fiber user stack is 1 MB, and so fibers
have a memory footprint comparable to threads. This is not true for goroutines,
which typically use a 4K stack.

Context switching overhead: the paper gives numbers for the architecture, but
goroutines do not use the expensive switching instructions listed in the
paper. Instead, golang basically saves just the PC, SP, and DX registers,
significantly reducing the overhead.

Dangers of the N:M model: the dangers mentioned, such as corrupting memory,
are specific to C++ libraries and do not apply to golang.

The dangers of the 1:N model do not apply to goroutines either.

My conclusion from the paper is as follows: fibers are bound to fail as an OS
feature or as a library. To make fibers work, you need to do what golang does,
i.e. make them part of the language, with compiler support to reduce the
context switching overhead and the memory footprint. You will, however, pay a
price in higher FFI cost. That tradeoff may or may not work for you, depending
on the nature of your application.

~~~
bullen
But if coroutines can't share memory without copying, they don't allow you to
scale (running a single task in parallel across cores efficiently).

What is then the purpose of coroutines as opposed to just having more
machines?

OS threads can scale, but then you need proper concurrent data structures and
a complex memory model below to support them.

Does go have those?

~~~
dickeytk
If you watch just one video on Go, even if you don't plan to ever write Go and
just want to learn about it, make it this one:
[https://blog.golang.org/concurrency-is-not-parallelism](https://blog.golang.org/concurrency-is-not-parallelism)

It explains in detail how this is possible.

------
dmytroi
Using the heap as the sole savior of stackless coroutines just doesn't scale.
I'm working on a game engine; we can have 10-100k+ jobs running per frame, and
modern monitors have refresh rates of 144/240/300 Hz, so worst case: 30
million jobs per second! There is no time for the heap; it is just too slow.

So we need to preallocate everything for all systems. But hey, if you
preallocate everything for the worst case, you are out of RAM (on
PS4/Xbox/etc.).

What we need is a mix: scratchpad memory that lives longer than the stack but
is as cheap as the stack. A tagged heap with a stack/arena allocator comes
close performance-wise, but not ergonomics-wise.

The ergonomics of writing code under such constraints are very painful. The
stack is the most ergonomic and fastest scratchpad you can have, and as soon
as you have async/await/etc. in the middle of your function, you need to think
about unwinding/rewinding every stack variable.
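The stack/arena allocator idea above can be sketched in a few lines. This is a
toy illustration, not engine code; the `FrameArena` name and interface are
invented for the example:

```rust
/// A minimal per-frame bump arena: it hands out slices from one
/// preallocated buffer and is reset wholesale at the end of the frame,
/// so there are no per-object frees and no heap traffic mid-frame.
struct FrameArena {
    buf: Vec<u8>,
    offset: usize,
}

impl FrameArena {
    fn with_capacity(bytes: usize) -> Self {
        // The only heap allocation happens once, up front.
        FrameArena { buf: vec![0; bytes], offset: 0 }
    }

    /// Bump-allocate `n` bytes: O(1), just a bounds check and an add.
    fn alloc(&mut self, n: usize) -> Option<&mut [u8]> {
        if self.buf.len() - self.offset < n {
            return None; // arena exhausted for this frame
        }
        let start = self.offset;
        self.offset += n;
        Some(&mut self.buf[start..start + n])
    }

    /// Discard everything from this frame at once; cost is one store.
    fn reset(&mut self) {
        self.offset = 0;
    }
}
```

Because everything dies together at frame end, lifetimes are trivial; the
ergonomic pain described above shows up when such allocations have to survive
an await point.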

~~~
bullen
Ah, thanks, so is the problem cache misses?

Java had to rewrite the memory model of the whole JVM when they introduced the
concurrent package.

Even C++ has a hard time making good use of concurrency and threads because of
locking.

My hunch is that because of the way memory works there is no benefit for non
GC languages when implementing threads and concurrent memory over many cores.

~~~
ekidd
> _My hunch is that because of the way memory works there is no benefit for
> non GC languages when implementing threads and concurrent memory over many
> cores._

I've been pretty happy with Rust on multi-core servers, probably because Rust
never allows mutable memory to be seen by more than one function.

I've been using async Rust at work, which turns out to be particularly
interesting, because most of the async executors for Rust can actually move
suspended async functions between cores. So you might have dozens and dozens
of async routines running across an 8-CPU pool.

There was definitely a bit of a learning curve involved, but we've only ever
encountered a single concurrency-related bug, which was a deadlock. (Rust
protects against memory corruption, undefined behavior and data races, but not
deadlocks.)
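A minimal std-only sketch of that guarantee, using plain threads rather than
an async executor (the `parallel_sum` function is invented for the example):
shared mutable state has to be wrapped in thread-safe types such as
`Arc<Mutex<_>>` before the compiler will let it cross thread boundaries at
all.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Sums `values` on up to `workers` threads (`workers >= 1` assumed).
// Removing the Mutex and mutating `total` directly from several
// threads would be rejected at compile time, not discovered at runtime.
fn parallel_sum(values: Vec<i32>, workers: usize) -> i32 {
    let total = Arc::new(Mutex::new(0));
    let chunk = ((values.len() + workers - 1) / workers).max(1);
    let mut handles = Vec::new();
    for part in values.chunks(chunk) {
        let part = part.to_vec();
        let total = Arc::clone(&total);
        handles.push(thread::spawn(move || {
            let s: i32 = part.iter().sum();
            *total.lock().unwrap() += s; // the only shared-state access
        }));
    }
    for h in handles {
        h.join().unwrap();
    }
    let result = *total.lock().unwrap();
    result
}
```

The deadlock class of bug mentioned above is the one thing this doesn't rule
out: the compiler checks aliasing, not lock ordering.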

~~~
bullen
Ok, but when you say that they can move between cores, isn't that the OS
moving them?

If rust does not allow concurrent memory reads, then that's a big problem in
my world.

It seems to me we have a collision between "data driven" and "parallelism on
the same memory" in terms of progressing on performance. In that case, since I
can get parallelism with memory and execution safety, plus hot-deployment, on
a VM, the data-driven approach does not fit my server-side performance needs,
to the point where I'm able to give those other features up.

On the client, I'm all for C with arrays though.

~~~
btschaegg
Disclaimer: I've done some reading on Rust's concepts but not yet done any
substantial coding. Corrections welcome :)

I'm not informed enough to comment on how async works exactly in Rust.
However:

> If rust does not allow concurrent memory reads, then that's a big problem in
> my world.

Note that this is not the goal of the borrow checking system of Rust. You can
read memory concurrently just fine (given immutable references to something),
you're just not allowed to write to it while you're doing that.

Basically, references in Rust come as immutable (like `const` in C) and
mutable, and you're only allowed to have multiple immutable _or_ one mutable
reference to the same thing at a time. If you have a mutable reference, you
can derive multiple immutable ones from that, but the borrow checker will
prevent you from accessing the mutable reference as long as one of the
immutable ones is still active (which Rust manages with the concept of
"lifetimes").
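A small compilable illustration of those rules (the function names are made up
for the example): concurrent reads through shared references are fine, while a
mutable borrow is exclusive.

```rust
// Two threads read `data` at the same time through shared references;
// scoped threads may borrow non-'static data because the scope joins
// them before returning.
fn concurrent_sum(data: &[i32]) -> i32 {
    let mid = data.len() / 2;
    std::thread::scope(|s| {
        let lo = s.spawn(|| data[..mid].iter().sum::<i32>());
        let hi = s.spawn(|| data[mid..].iter().sum::<i32>());
        lo.join().unwrap() + hi.join().unwrap()
    })
}

// The one-writer-or-many-readers rule, in a single function.
fn borrow_rules_demo() -> usize {
    let mut data = vec![1, 2, 3, 4];
    let a = &data; // first shared borrow
    let b = &data; // second shared borrow: fine
    // data.push(5); // would NOT compile: `a` and `b` are still used below
    assert_eq!(a[0] + b[3], 5);
    // The shared borrows are no longer used, so mutation is allowed again.
    data.push(5);
    data.len()
}
```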

------
newnewpdro
I only skimmed the article, but is the argument basically that fibers suck
because library/DLL authors suck and use TLS instead of passing around an
explicitly caller-initialized context pointer? So you can't freely use DLLs
from your fibers because they might rely on TLS that isn't fibers-aware?

If so, that's a pretty weak argument. I use coroutines/fibers quite a bit in
personal projects, but learned long ago for a variety of reasons to avoid
depending on third party libraries I didn't have source for - especially ones
that try to do too much magic like TLS behind the scenes just to save me the
trouble of supplying an instance/context pointer to every call.

Usually when I'm using fibers it's so I can have _many_ of them, which means
I'm using tiny stack sizes, which means I'm not casually calling into third
party libraries I can't easily audit and control _anyways_. If I weren't
making many, I'd just use full-blown threads.

~~~
fluffy87
> especially ones that try to do too much magic like TLS behind the scenes
> just to save me the trouble of supplying an instance/context pointer to
> every call.

I really hate that this is the solution that Rust ended up pursuing. There are
claims that you can work around TLS by just not using TLS when it is not
available, but I have yet to see someone remove TLS and still be able to use a
multi-threaded executor.

~~~
steveklabnik
The real solution without TLS will come, in my understanding, after generators
improve. Specifically, they cannot currently take arguments on resume.

------
new_realist
This paper is rather biased, in that the downsides of stackless coroutines are
not mentioned, namely more complicated control flow and the associated
increased difficulty of debugging.

------
jchw
So I read through this, and the conclusion was to not use fibers. However,
most of the reasoning seems to surround issues with things like TLS,
allocators, and stack memory usage in C++. There is no explicit recommendation
here to not use goroutines for scalable, concurrent software as far as I can
tell, just to not use fibers.

~~~
pcwalton
Goroutines and fibers are the same thing: M:N threading.

It is true that many of the fiber issues presented here are C++-specific.
However, what a lot of the comments here are missing is that C++ issues have a
way of becoming your issues whenever you use an FFI, even if you aren't using
C++. Go's solution is generally to try to avoid using cgo as much as possible,
because of these performance issues. That can work for the areas Go is
generally used in today. But, as the article points out, that does not work
for all applications. For example, I would not want to write graphics code in
any system with M:N threading due to FFI cost, including Go.

~~~
jchw
Please give an example from the document that applies to Goroutines. (The best
I can see is the bits about issues with split stacks, but it was resolved.) I
think my reading of the document holds up.

~~~
pcwalton
Section 3.6, page 8, talks about the FFI overhead of Go.

~~~
jchw
Sure, but if that is truly the only part of the document that contains
reasoning not to use goroutines, I can't imagine how one could read the
conclusion as suggesting goroutines are unsuitable for scalable software. In
fact, I've now worked at multiple companies doing exactly this in Go. With
Docker it was often preferable to explicitly disable cgo. It would be abnormal
in say, C#, to dock points because of C interop.

It’s also worth noting that FFI is not the only way to have Go and C++
interop. For many use cases a lightweight RPC layer between two apps will give
better throughput, something that also is done in production to great effect.

~~~
pcwalton
> It would be abnormal in say, C#, to dock points because of C interop.

Not really. WinForms is a lot of the reason for C#'s existence, and WinForms
is just a wrapper around pinvoke'd Win32. You're crossing the boundary a lot.

> For many use cases a lightweight RPC layer between two apps will give better
> throughput, something that also is done in production to great effect.

I have a hard time believing that RPC can possibly be faster than cgo. You
have the overhead of message serialization and deserialization, two message
copies (into the kernel and out of the kernel), two context switches, and a
trip through the OS scheduler.

~~~
mwcampbell
> Not really. WinForms is a lot of the reason for C#'s existence, and WinForms
> is just a wrapper around pinvoke'd Win32. You're crossing the boundary a
> lot.

I wonder if that's why WPF does so very much on the managed side. And I wonder
if using UWP XAML from C# is less efficient than WPF in some scenarios because
of this FFI overhead.

~~~
pjmlp
According to the React Native for Windows team, it is hardly noticeable.

In their benchmarks comparing XAML/C++, XAML/C#, RN, and Electron, it is only
a few percent slower than C++.

It is Electron whose performance loss goes sky high.

------
flakiness
The headline is inaccurate. The paper doesn't claim the goroutine is
inappropriate; it just says that what Go does for goroutines doesn't fit what
C++ needs. The author has studied Go and knows more than average about it, but
is still modest enough not to make any judgment. Let us readers respect that.

------
bigdubs
It seems like the determination was made based on how fibers interop with C++
libraries, not that they are inappropriate for every situation.

------
mehrdadn
Question for Win32 experts: does anybody know _how_ a DLL receives thread
notifications? Is there any way to make an EXE get the same thing directly
(even if it's undocumented)? It's a little weird for me because DLLs are
loaded in user-mode -- why can't an EXE request the same notifications?

~~~
barrkel
DllMain gets called with a reason code. The OS loader calls it by enumerating
the loaded DLLs. It won't call anything in the EXE because that's how it was
coded, and there's nothing the EXE can do about the loader specifically short
of patching OS code.

~~~
mehrdadn
I'm wondering what happens before all this -- how is the OS loader even
notified about the new thread? Is it the thread entrypoint itself that tells
the OS loader about the new thread's creation (and destruction)?

~~~
barrkel
Both the loader and thread management are part of the OS. It's an
implementation detail, but I'd expect CreateThread to do it - perhaps by
delegating to the loader, perhaps by navigating the loader's list of loaded
modules, whatever.

See these pages:

[https://docs.microsoft.com/en-us/windows/win32/api/processth...](https://docs.microsoft.com/en-us/windows/win32/api/processthreadsapi/nf-processthreadsapi-createthread)

[https://docs.microsoft.com/en-us/windows/win32/dlls/dllmain](https://docs.microsoft.com/en-us/windows/win32/dlls/dllmain)

~~~
mehrdadn
The thing about CreateThread doing it is that a thread created in a different
manner (CreateRemoteThread from another process, RtlCreateUserThread, etc.)
would then cause a missed notification. I feel like it has to be the
entrypoint, but I'm not sure...

~~~
barrkel
Sure, but when I say CreateThread I mean the implementation of CreateThread,
not the function CreateThread.

(This feels like a weird autistic conversation, I'm going to step out now.)

------
toolslive
I don't think people using monadic concurrency in Haskell or OCaml will agree
with this.

~~~
maxdamantus
IO in Haskell (and presumably Lwt in OCaml, though I'm less familiar with it)
doesn't have much to do with fibers.

An _IO_ value is just something that the Haskell runtime is able to invoke
somehow. Haskell functions cannot directly run _IO_ values (ignoring
_unsafePerformIO_). A fairly elegant implementation would probably simply make
_IO_ values asynchronous operations (procedures that take an "on complete"
callback that receives the result). Again, since there's no way for a Haskell
function to actually run the operation, all it can do is return such an
operation to the runtime to be called.

~~~
ben0x539
Monads don't have much to do with it, but the GHC runtime uses fibers for
concurrency, no?

------
tptacek
This is a wildly editorialized and misleading title. I clicked through just to
see how it was going to rationalize the fact that people demonstrably have
been building scalable concurrent Go software, with goroutines, at truly huge
scale. But of course, the paper says nothing of the sort; it makes an aside
about how an earlier design of the Go runtime was less scalable than the
current one, and that's it.

This is a textbook example of why people shouldn't editorialize titles.

The right title here is "Fibers Under A Microscope".

~~~
mappu
Sorry, I agree it's not a great title. I was worried the paper's title was
misleading (just sounds like textiles), so I chose a representative statement
from the author's abstract + conclusion.

The link comes via
[https://devblogs.microsoft.com/oldnewthing/20191011-00/?p=10...](https://devblogs.microsoft.com/oldnewthing/20191011-00/?p=102989)
which summarized the PDF as """a fantastic write-up of the history of fibers
and why they suck. Of particular note is that nearly all of the original
proponents of fibers subsequently abandoned them [...] fibers are basically
dead""".

By restricting itself to the TIOBE top 10, the paper also misses a discussion
of BEAM which successfully offers N:M threading.

~~~
dang
Oh, in that case let's just switch the URL to that blog post and let it make
its point directly.

Changed from [http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2018/p136...](http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2018/p1364r0.pdf) above.

~~~
eloff
I had to scroll through all the comments to find this in order to understand
why the comments don't match the article. I think you should have left it; now
it's very confusing.

~~~
dang
That's why I posted
[https://news.ycombinator.com/item?id=21230286](https://news.ycombinator.com/item?id=21230286)
and pinned it to the top.

Experience has shown that switching to a better URL is generally better for
discussion, though there can be a lag before the thread catches up.

