
Fibers: An Elegant Windows API - ingve
https://nullprogram.com/blog/2019/03/28/
======
jepler
Ouch. POSIX did an awful job when it wrote the API for "makecontext". You
can't even pass a pointer to the context's function in the natural way,
because only 'int's can be passed, and on any real computer, sizeof(int) <
sizeof(void*)!

And anyway, POSIX removed makecontext in POSIX.1-2008 so new code shouldn't be
using it. This is too bad, as there are legitimate reasons to use these kinds
of cooperative threading abstractions. I guess we still have gnu pth, which
hasn't seen a release in 13 years, but it also has a much bigger (pthreads-
like) interface compared to the mere 3 functions in POSIX for cooperative
threading.
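
One common workaround for the int-only arguments is to split the pointer into two halves and reassemble it inside the entry function. A minimal sketch, assuming `unsigned int` is 32 bits and pointers fit in 64 bits; `struct task`, `entry`, and `run_task` are hypothetical names used only for illustration:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <ucontext.h>

struct task { int input; int result; };

/* Entry point: receives the task pointer as two 32-bit halves. */
static void entry(unsigned int hi, unsigned int lo)
{
    /* Shift in two steps so the code stays defined on 32-bit targets. */
    uintptr_t bits = ((uintptr_t)hi << 16 << 16) | (uintptr_t)lo;
    struct task *t = (struct task *)bits;
    t->result = t->input * 2;
    /* Returning here resumes the context named in uc_link. */
}

/* Runs entry() on its own stack; returns 0 on success, -1 on failure. */
int run_task(struct task *t)
{
    assert(sizeof(unsigned int) * 2 >= sizeof(uintptr_t));

    ucontext_t caller, callee;
    size_t stack_size = 64 * 1024;
    void *stack = malloc(stack_size);
    if (stack == NULL)
        return -1;
    if (getcontext(&callee) == -1) {
        free(stack);
        return -1;
    }
    callee.uc_stack.ss_sp = stack;
    callee.uc_stack.ss_size = stack_size;
    callee.uc_link = &caller;

    uintptr_t bits = (uintptr_t)t;
    makecontext(&callee, (void (*)(void))entry, 2,
                (unsigned int)(bits >> 16 >> 16),
                (unsigned int)(bits & 0xffffffffu));

    int rc = swapcontext(&caller, &callee);
    free(stack);
    return rc == -1 ? -1 : 0;
}
```

Calling `run_task` with a task whose `input` is 21 should leave 42 in `result`. The cast of `entry` to `void (*)(void)` is exactly the awkwardness being complained about.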

~~~
elteto
That's too bad indeed. How would one implement coroutines in Linux/Unix today
then?

~~~
spc476
Usually using C's `setjmp()` and `longjmp()` functions (but I consider that an
ugly hack). I did an implementation for both x86-32 and x86-64 [1] that works
for Linux and Mac OS-X that's for C (written in assembly). I doubt it will
work for C++, and I'd be a bit hesitant to use it for production, but it's
surprisingly simple.

[1]
[https://github.com/spc476/C-Coroutines](https://github.com/spc476/C-Coroutines)

Edit: clarification of what language my coroutines are implemented in.
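
For readers who have not used them, here is a minimal sketch of the primitive itself: `setjmp()` records a point to come back to and `longjmp()` unwinds to it. This shows only the jump back up the stack, not a full coroutine (that additionally needs a separate stack per coroutine, which is where the hackery comes in). `deep_worker` and `jump_demo` are hypothetical names:

```c
#include <assert.h>
#include <setjmp.h>

static jmp_buf resume_point;

/* Recurses a few frames down, then jumps straight back to the setjmp site. */
static void deep_worker(int depth)
{
    assert(depth >= 0);
    if (depth == 0)
        longjmp(resume_point, 7);   /* setjmp() will appear to return 7 */
    deep_worker(depth - 1);
}

/* Returns the value delivered by longjmp(), or a negative value on surprise. */
int jump_demo(void)
{
    switch (setjmp(resume_point)) {
    case 0:                 /* first return: the point was just saved */
        deep_worker(3);
        return -1;          /* not reached: deep_worker never returns */
    case 7:                 /* second return: arrived via longjmp */
        return 7;
    default:
        return -2;
    }
}
```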

------
CodesInChaos
Fibers have some annoying edge cases related to thread local storage, which
can bite fiber-unaware code. Which is why Microsoft recently added User-Mode
Scheduling, which behaves much closer to actual threads, except that it still
has the manual yielding behaviour of fibers.

[https://docs.microsoft.com/en-us/windows/desktop/procthread/...](https://docs.microsoft.com/en-us/windows/desktop/procthread/user-mode-scheduling)

~~~
jstimpfle
That's because what you need is fiber-local storage, no?

~~~
dblohm7
What happens when you call an API that isn't fiber-aware and uses thread-local
storage in its implementation?

Basically your entire stack needs to be fiber-aware, or you can't use them.
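
The same hazard can be reproduced without Windows. A rough sketch using POSIX `ucontext` as a stand-in for fibers: two contexts on one thread share a single thread-local variable, so one silently overwrites the other's value. `last_error` stands in for the private state of a hypothetical fiber-unaware library; all names here are made up for illustration:

```c
#include <assert.h>
#include <stdlib.h>
#include <ucontext.h>

/* Per-thread state, as a fiber-unaware library might keep it. */
static _Thread_local int last_error;

static ucontext_t main_ctx, other_ctx;

/* A second "fiber" on the same thread makes its own library call. */
static void other_coroutine(void)
{
    last_error = 99;
    /* Returning resumes main_ctx through uc_link. */
}

/* Returns the value the first context sees after the switch, or -1. */
int observe_clobber(void)
{
    size_t stack_size = 64 * 1024;
    void *stack = malloc(stack_size);
    if (stack == NULL)
        return -1;
    if (getcontext(&other_ctx) == -1) {
        free(stack);
        return -1;
    }
    other_ctx.uc_stack.ss_sp = stack;
    other_ctx.uc_stack.ss_size = stack_size;
    other_ctx.uc_link = &main_ctx;
    makecontext(&other_ctx, other_coroutine, 0);

    last_error = 1;                       /* "our" value */
    assert(last_error == 1);
    if (swapcontext(&main_ctx, &other_ctx) == -1) {
        free(stack);
        return -1;
    }
    free(stack);
    return last_error;                    /* no longer 1 */
}
```

With real threads each would have kept its own `last_error`; here the value set before the switch is gone afterwards.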

------
blattimwind
> POSIX certainly has its own ugly corners, but those are the exceptions. In
> the Windows API, elegance is the exception.

Makes me wonder how seriously OP used POSIX APIs, and across how many
platforms... because large parts of POSIX are pretty bad. And as an OS
interface, the Windows API is clearly superior in almost every regard (if
there wasn't this tiny portability issue).

~~~
jstimpfle
I don't know what parts of Win32 you mean, but one task I did recently that
should be dead simple was listing the contents of a directory. Why did I end
up appending "* . *" (spaces inserted to avoid triggering HN's italic
formatting) patterns to the directory name, and why do I need to find the
first file with FindFirstFile() and all the remaining files with
FindNextFile()? Maybe I was particularly dumb that day, but it's the best I
could figure out and I don't think that's elegant.

~~~
magicalhippo
> Why did I end up appending "* . *" (spaces inserted to avoid triggering HN's
> italic formatting) patterns to the directory name

Because sometimes you just want to get all the XML files in a directory, and
it's nice to not have to filter yourself.

> why do I need to find the first file with FindFirstFile() and all the
> remaining files with FindNextFile()?

Why do you consider this a poor API and what would you consider a better API?

~~~
humblebee
What is the importance of distinguishing between the first file and files that
follow it?

Why not FindFile()?

~~~
magicalhippo
How would FindFile() work then?

As far as I can see, either you have some separate call to initialize the find
handle (just like FindFirstFile does), or you pass the directory/filename mask
redundantly over and over again, introducing the potential for suddenly
calling it with an already initialized find handle but with a different
directory.
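
For comparison, POSIX takes the "separate call to initialize the handle" route: `opendir()` creates the handle, a single `readdir()` is called repeatedly, and pattern matching is left to the caller (here via `fnmatch()`). A minimal sketch; `count_matching` is a hypothetical helper, not part of any library:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <dirent.h>
#include <fnmatch.h>
#include <string.h>

/* Counts entries in `dir` whose names match the shell-style `pattern`.
   Returns -1 if the directory cannot be opened. */
int count_matching(const char *dir, const char *pattern)
{
    assert(dir != NULL && pattern != NULL);

    DIR *handle = opendir(dir);           /* the one-time initialization */
    if (handle == NULL)
        return -1;

    int count = 0;
    struct dirent *entry;
    while ((entry = readdir(handle)) != NULL) {   /* same call every time */
        if (strcmp(entry->d_name, ".") == 0 ||
            strcmp(entry->d_name, "..") == 0)
            continue;
        if (fnmatch(pattern, entry->d_name, 0) == 0)
            count++;
    }
    closedir(handle);
    return count;
}
```

Something like `count_matching("docs", "*.xml")` would cover the XML case mentioned earlier, at the cost of doing the filtering in user code.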

~~~
blattimwind
The classic UNIX approach would of course be making FindFile non-reentrant and
not thread-safe either, then later adding a FindFile_r which takes a void **
where FindFile_r will store some intermediate state that you need to free
yourself except when an error occurred, because when you free it then you get a
double free. (However, not all UNIX systems have FindFile_r, and some of those
instead have FindFile2 with slightly different semantics).

~~~
jstimpfle
You are snarky, but I don't think it's warranted. Any examples of non-
reentrant or thread-unsafe system calls in POSIX?

There are a few _library_ calls that have a cousin with an _r suffix in POSIX,
but these are not system interfacing calls. Just don't use them if they don't
match your requirements. Most of them aren't even non-reentrant or thread-
unsafe, but simply lack an additional parameter for some use cases (the use
cases mostly come when you program in object-oriented style).

The one offender (not a system call, either) that I know from the top of my
head is strtok() which is of course terrible, but why would you use it?
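
To make the `_r` distinction concrete, here is a small sketch of nested tokenizing, which `strtok()` cannot do because it keeps one hidden cursor. `strtok_r()` takes the cursor as an explicit parameter, so two scans can be in flight at once. `count_fields` is a hypothetical helper:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <string.h>

/* Splits "k=v;k=v;..." on ';' and then each piece on '='.
   Returns the total number of inner fields. Modifies `text` in place. */
int count_fields(char *text)
{
    assert(text != NULL);

    int fields = 0;
    char *outer_state;                    /* cursor for the ';' scan */
    for (char *pair = strtok_r(text, ";", &outer_state);
         pair != NULL;
         pair = strtok_r(NULL, ";", &outer_state)) {
        char *inner_state;                /* independent cursor for '=' */
        for (char *field = strtok_r(pair, "=", &inner_state);
             field != NULL;
             field = strtok_r(NULL, "=", &inner_state))
            fields++;
    }
    return fields;
}
```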

~~~
Someone
_”There are a few library calls that have a cousin with an _r suffix in POSIX,
but these are not system interfacing calls.”_

Neither are most Windows functions. _FindFirstFile_ , for example, is defined
in _kernel32.dll_ , and calls through to _ntdll.dll_ , and eventually makes
system calls such as _NtOpenFile_ and _NtQueryDirectoryFile_.

What IMO matters is the usability of the stable API. It doesn’t matter whether
that is the kernel interface or not.

~~~
jstimpfle
Yes. The distinction I made was meant to be between calls that end up calling
into the system and ones that do not (like strtok()). I agree with what you
say and my statement was exactly that the stable APIs into the system offered
by POSIX are mostly unproblematic.

------
newnewpdro
It's not that elegant.

Win32 fibers can't return like UNIX/ucontext.h supports with the uc_link
member. So you end up having to wrap your win32 fiber functions with your own
trampoline to let them return without exiting the program and Do The Right
Thing.
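
A minimal sketch of the `uc_link` behaviour described above: the coroutine yields three values with `swapcontext()` and then simply returns, at which point control lands back in the linked context with no trampoline. The globals and the names `generator` and `sum_generated` are hypothetical, and errors are checked with `assert` only to keep the sketch short:

```c
#include <assert.h>
#include <stdlib.h>
#include <ucontext.h>

static ucontext_t main_ctx, gen_ctx;
static int yielded;     /* value handed from the generator to the caller */
static int finished;    /* set just before the generator returns */

static void generator(void)
{
    for (int i = 1; i <= 3; i++) {
        yielded = i;
        swapcontext(&gen_ctx, &main_ctx);     /* yield to the caller */
    }
    finished = 1;
    /* Plain return: uc_link resumes main_ctx. */
}

/* Drives the generator to completion; returns the sum of yielded values. */
int sum_generated(void)
{
    size_t stack_size = 64 * 1024;
    void *stack = malloc(stack_size);
    if (stack == NULL)
        return -1;

    finished = 0;
    int rc = getcontext(&gen_ctx);
    assert(rc == 0);
    gen_ctx.uc_stack.ss_sp = stack;
    gen_ctx.uc_stack.ss_size = stack_size;
    gen_ctx.uc_link = &main_ctx;              /* where a return ends up */
    makecontext(&gen_ctx, generator, 0);

    int sum = 0;
    for (;;) {
        rc = swapcontext(&main_ctx, &gen_ctx); /* resume the generator */
        assert(rc == 0);
        if (finished)
            break;
        sum += yielded;
    }
    free(stack);
    return sum;
}
```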

Furthermore there's this thread-global fibers-enabled state win32 toggles with
ConvertThreadToFiber() requiring any transparent coroutines library supporting
win32 fibers to carefully initialize once per thread while also supporting
reentrant use. It's mildly annoying when you want to generically support
things like nested bundles or hierarchies of coroutines.

This last bit seems completely strange considering a fiber's state should be a
subset of what forms a thread, and you're always running on a thread. There
should be nothing to do when "converting" to a fiber. Just make the fiber
state always accessible to userspace via GetCurrentFiber() and always allow
calling CreateFiber() and SwitchToFiber().

Another strange wart in win32 fibers is the C srand()/rand() state is per-
fiber for some reason. It's undefined by C since C says nothing of coroutines,
but it's very strange and generally unexpected behavior to see your coroutines
producing identical sequences of values coming out of rand().

On *NIX the rand() state is per-process, pthreads share user memory so they
all share the rand() state, and ucontext shares the thread so of course its
contexts share the single state as well.

I much prefer the UNIX/ucontext API, they just fucked up the operands for
makecontext(). Instead of deprecating ucontext POSIX should have just fixed
makecontext() to explicitly take a pointer. C programs have no supported
system-supplied coroutine API on UNIX now, which is bullshit.

------
uasm
Hmm, let's read this one.

> "March 28, 2019"

> "Fibers: the Most Elegant Windows API"

> "The Windows API — a.k.a. Win32 — is notorious for being clunky, ugly, and
> lacking good taste. Microsoft has done a pretty commendable job with
> backwards compatibility, but the trade-off is that the API is filled to the
> brim with historical cruft. Every hasty, poor design over the decades is
> carried forward forever, and, in many cases, even built upon, which
> essentially doubles down on past mistakes. POSIX certainly has its own ugly
> corners, but those are the exceptions. In the Windows API, elegance is the
> exception."

I'm glad OP just discovered in mid-2019 an API that has been there since
Windows XP (ca. 2001). Why then, does OP start by crapping all over the
Windows codebase?

You think the Fibers API is elegant? Have a look at IOCP. IO in Windows in
general actually, since the days of Windows 95. Event management. Threading
and parallelism. The kernel subsystem. The APIs in Windows have remained
stable, consistent, forward and backwards compatible - from pretty much day
one.

~~~
PaulHoule
IOCP is particularly strong when you compare it to select, poll, epoll, kpoll,
and all of the other attempts to make asyncio work in Linux.

~~~
kentonv
Eh. I used to think that, until I actually wrote event systems based on all of
the above. epoll turns out to be my favorite.

Buffer allocation with IOCP feels incredibly ugly. You must allocate a buffer
before you initiate a read(), and then you must leave that buffer alone until
the read() completes. You would _think_ that this has the advantage that the
kernel doesn't need to allocate its own buffer, and you get some sort of zero-
copy magic where bytes land directly in userspace. But that's not really the
case. The socket still needs a buffer on the kernel side in case bytes arrive
while no read() is pending. So now you have to allocate a buffer for each
socket _and_ the kernel also has to allocate a buffer for each socket, which
seems like a waste.

This rabbit hole gets deeper. What happens if the socket receives two packets
in rapid succession? In a naive implementation, the first packet completes the
read(), and signals the completion on the IOCP. The app now has to process
that event and start a new read() when it's ready. But in the meantime, the
second packet arrives. Whoops, no read() is pending, so now the packet has to
go into a kernel-side buffer only to be copied later.

But that's a naive implementation. Apparently, in reality, the kernel
implements some sort of nagle-like algorithm where it tries to wait a bit for
additional packets before it actually signals completion of a read(). But this
introduces delay, and many projects (e.g. Chrome) have discovered this delay
is rather harmful to certain kinds of performance. I read somewhere that
Chrome and others have given up on IOCP and use WSAPoll instead -- but I can't
remember where I read this because all the lore about Windows event handling
is hidden in random forum threads and Github gists rather than proper
documentation.

It seems to me that the _right_ way to do what Windows was trying to do here
would be for userspace to allocate a ring buffer for the kernel to use, and
then the app and the kernel would coordinate the start and end pointers of the
ring buffer. Then the kernel needs no buffer of its own; it can always deliver
to userspace. If the buffer fills up, the kernel can do exactly what it would
do in the case of a regular kernel-side buffer filling up -- apply
backpressure and force the peer to retransmit later.
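
A rough sketch of the ring-buffer idea, purely hypothetical and not modelled on any real kernel interface: the producer (standing in for the kernel) advances `tail`, the consumer (the app) advances `head`, and a full ring simply refuses more bytes, which is the backpressure case. This version is single-threaded; a real shared ring would need atomic indices and memory barriers:

```c
#include <assert.h>
#include <stddef.h>

#define RING_CAP 8      /* hypothetical capacity, kept tiny for the example */

struct ring {
    unsigned char data[RING_CAP];
    size_t head;        /* next byte the consumer ("app") will read */
    size_t tail;        /* next slot the producer ("kernel") will write */
};

/* Producer side: stores as many bytes as fit, returns how many were taken. */
size_t ring_put(struct ring *r, const unsigned char *src, size_t len)
{
    assert(r != NULL && src != NULL);
    size_t stored = 0;
    while (stored < len && r->tail - r->head < RING_CAP) {
        r->data[r->tail % RING_CAP] = src[stored++];
        r->tail++;
    }
    return stored;      /* less than len means the ring is full */
}

/* Consumer side: copies out up to len bytes, returns how many were read. */
size_t ring_get(struct ring *r, unsigned char *dst, size_t len)
{
    assert(r != NULL && dst != NULL);
    size_t taken = 0;
    while (taken < len && r->head != r->tail) {
        dst[taken++] = r->data[r->head % RING_CAP];
        r->head++;
    }
    return taken;
}
```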

~~~
kentonv
I should add: Even what I wrote above is not the whole story. The deeper you
get into kernel internals, the more you realize that the story is nuanced.

On Linux, apparently, the kernel does not actually allocate static socket
buffers. Instead, the "socket buffer" is actually something like an array of
pointers to packets. When the read buffer is empty, it's not taking any space.

So on Linux, an idling socket is very cheap (by my understanding, at least).

The Windows kernel might do something similar internally (I don't know), but
there's no way to avoid the redundant buffer on the userspace side when using
IOCP.

So basically, it seems that IOCP turns out to be a rather-poor interface for
networking in practice, despite at first appearing theoretically superior.

~~~
gpderetta
I've never used IOCP, but my understanding is that they can sort of emulate
readiness notification (as opposed to completion notification) by passing a 0
sized buffer. On the callback then you would perform the (non-blocking) read
yourself.

In general I think that readiness notification is superior to completion
notification for network reads. Completion notification works fine for network
writes and it is superior for disk IO (where readiness is ill defined).
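
Readiness notification can be sketched with Linux's epoll: nothing is read, and no buffer is committed, until the kernel reports that data is waiting. `read_when_ready` is a hypothetical helper; creating an epoll instance per call is only for brevity, a real event loop would keep one around:

```c
#include <assert.h>
#include <stddef.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Waits up to timeout_ms for fd to become readable, then reads once.
   Returns the number of bytes read, or -1 on error or timeout. */
long read_when_ready(int fd, char *out, size_t cap, int timeout_ms)
{
    assert(out != NULL && cap > 0);

    int ep = epoll_create1(0);
    if (ep == -1)
        return -1;

    long n = -1;
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
    if (epoll_ctl(ep, EPOLL_CTL_ADD, fd, &ev) == 0) {
        struct epoll_event ready;
        /* Readiness: the kernel only says "there is something to read". */
        if (epoll_wait(ep, &ready, 1, timeout_ms) == 1 &&
            (ready.events & EPOLLIN))
            n = (long)read(fd, out, cap);   /* the buffer is used only now */
    }
    close(ep);
    return n;
}
```

With completion notification the buffer would have to be handed over before the wait instead of after it.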

The biggest advantage of IOCP though is that it basically acts as a
dynamically sized threadpool, as it keeps the amount of running threads to a
minimum. MacOS does something similar with GCD. I still haven't found a
non-hacky way to do the same in Linux (you can sort-of emulate it either with
SCHED_FIFO realtime threads or by using SCHED_IDLE to detect idleness).

------
supernes
I recently had an idea about an app and a clever way to hook it into fairly
modern, user-facing parts of Windows and macOS (Search/Cortana and Spotlight,
respectively.)

I did some feasibility research on Windows first, and very soon hit some
ancient win32 interfaces I had to implement, used by parts of the shell that
have existed since the earliest days, something I have no intention of wasting
my time on.

I was under the impression that macOS would be much better, but was surprised
to find that there's no public or documented API, just a 3rd party open source
framework implemented as a hack that combines javascript and python somehow,
but seems to get the job done.

This got me thinking that you need a serious amount of expertise to implement
new experiences on modern operating systems, never mind crossing platforms.
Puts the appeal of technologies like Electron that do platform integration for
you in perspective.

~~~
Crinus
On Windows, you do need some effort if you have no idea what is going on, but
it isn't like those systems were sloppily put together, there are some design
ideas and you get used to them... eventually. But once you do, things more or
less fall into place.

The biggest issue with working on older Windows APIs is that modern web-based
MSDN documentation is simply awful and the best you can do is to find old MSDN
CDs/DVDs that often contained more information than the site (especially
articles), better organization and -especially- a _much_ faster interface
(being local and all, but also the CD-ROMs around the late 90s / early 2000s
were in CHM format, which is lightweight). And 99.999% of the documentation
you'll find on these CDs will have the same text as the documentation (if it
is still there) on MSDN (and very often the same text that is in the much
older WinHelp WIN32SDK.HLP files too).

Like most things Microsoft on desktop, everything went downhill since .NET
people took over and decided to redo things their way (yes, .NET brought some
nice things too, but it created an internal schism - before .NET became a
thing, desktop tech in Windows was mostly coherent) and now history is
repeated with UWP.

~~~
maxxxxx
Totally agree. I worked for a while with pure Win32. Once you get the hang of
it, it’s quite consistent and pretty well thought out. Also agree about the
documentation. Seems MS is determined to make it worse every year, break more
links and provide less useful information. The old MSDN CDs were a piece of
beauty.

~~~
scarface74
Consistency?

Let’s start with strings...

\- the standard null terminated single byte char C string

\- the double wide null terminated C string required for some frameworks -
especially Windows CE

\- the COM BSTR, which carries a four-byte length prefix

\- the ATL bstr_t

\- CString that isn’t a null terminated C string

\- of course the C++ string class. I’m not sure if this is ever used by the
Win32 APIs

I’m sure I’m missing one.

~~~
Impossible
If you count various flavors of UWP (C++/CX, C++/WinRT, WRL) you can add
std::wstring and Platform::String^

~~~
asveikau
Don't forget HSTRING, which Platform::String^ is built on. They added those
two in Windows 8 because there was no one standard string type. (Not joking.)

------
gpderetta
I wouldn't call it elegant at all; ConvertThreadToFiber() is a pain for a
library. I like swapcontext() much more in theory.

In practice the issue with swapcontext is that it does not integrate well with
the rest of posix (what happens if you move a fiber to another thread?) and
the requirements to restore the signal mask pretty much require a very slow
implementation.

On the other hand, Windows Fibers are somewhat efficient (one can still do
significantly better), but most importantly, the interaction with the rest of
the system is well understood.

I wrote my thoughts on the API here about (/checks calendar) 13 years ago:
[http://www.crystalclearsoftware.com/soc/coroutine/coroutine/...](http://www.crystalclearsoftware.com/soc/coroutine/coroutine/fibers.html)

------
dblohm7
Fibers still have a lot of problems. They're incompatible with any APIs that
have thread affinity. Windowing, single-threaded apartment COM,
CRITICAL_SECTIONs, etc...

------
DarkWiiPlayer
I just fell in love with that API. That's seriously the first thing windows
has which I really REALLY wish linux would just copy as is. It's simple and to
the point, but there's nothing missing either. Perfect.

~~~
jstimpfle
I once built something on it and remember it as quite a nice experience.

------
amelius
How do you manage stack memory (allocation, increasing capacity, freeing,
moving, etc.)? Is it possible to suspend a fiber to disk?

~~~
CodesInChaos
A Fiber has its own OS stack and cannot outlive the process (just like a
thread). So the memory cost is the same as a normal thread. The main
difference is that it uses cooperative scheduling.

------
rwmj
The big news seems to be that makecontext et al are deprecated. Does anyone
know why that happened? Lots of "green threads" / coroutines implementations
are based on it.

~~~
gpderetta
Basically their signature can no longer be expressed in ISO C, so instead of
fixing it POSIX decided to just deprecate them.

------
aidenn0
FWIW Russ Cox wrote a similar API on top of the -context() calls named
libtask.

