Fibers: An Elegant Windows API (nullprogram.com)
136 points by ingve 24 days ago | 112 comments



Ouch. POSIX did an awful job when it wrote the API for "makecontext". You can't even pass a pointer to the context's function in the natural way, because only 'int's can be passed, and on any real computer, sizeof(int) < sizeof(void*)!

And anyway, POSIX removed makecontext in POSIX.1-2008 so new code shouldn't be using it. This is too bad, as there are legitimate reasons to use these kinds of cooperative threading abstractions. I guess we still have gnu pth, which hasn't seen a release in 13 years, but it also has a much bigger (pthreads-like) interface compared to the mere 3 functions in POSIX for cooperative threading.


> You can't even pass a pointer to the context's function in the natural way, because only 'int's can be passed, and on any real computer, sizeof(int) < sizeof(void*)!

I checked the API. It looks like a variadic function, so you can pass any arguments to it?


Yes, IIRC the trick is to split the pointer into two integers and pass the halves separately.
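Something like this (a sketch that assumes 32-bit ints and 64-bit pointers, so it is exactly as unportable as you'd expect):

    #include <stdint.h>
    #include <stdio.h>
    #include <ucontext.h>

    static ucontext_t main_ctx, co_ctx;

    /* makecontext() only forwards int arguments, so the pointer is split
       into two 32-bit halves and reassembled inside the entry function. */
    static void entry(int hi, int lo)
    {
        char *msg = (char *)(((uintptr_t)(uint32_t)hi << 32) | (uint32_t)lo);
        printf("coroutine got: %s\n", msg);
        /* returning here resumes uc_link, i.e. main_ctx */
    }

    int main(void)
    {
        static char stack[64 * 1024];
        char *msg = "hello";
        uintptr_t p = (uintptr_t)msg;

        getcontext(&co_ctx);
        co_ctx.uc_stack.ss_sp = stack;
        co_ctx.uc_stack.ss_size = sizeof stack;
        co_ctx.uc_link = &main_ctx;
        makecontext(&co_ctx, (void (*)(void))entry, 2,
                    (int)(p >> 32), (int)p);

        swapcontext(&main_ctx, &co_ctx);  /* run the coroutine once */
        return 0;
    }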


Which would imply that you know that sizeof(int) * 2 == sizeof(void*). That's a good assumption for x86-64 architectures but doesn't sound very portable at all.


There are many ways to recover portability.


POSIX seems like the kind of API that it shouldn't be necessary to "recover" portability from...


For those unaware of the finer nuances of native platform APIs, could you help shed some light on the difference between a “regular” thread and a pthread?


A "regular" thread I guess is whatever structure the OS has for parallel execution contexts. Sometimes they are subordinate to processes, i.e. a thread dies when its containing process dies and such.

pthreads is a standardized API (part of POSIX) that exposes a particular notion of threads. On each platform, it's naturally implemented as a wrapper around whatever the OS can offer. Since it must be widely implementable and have an easy to use API, that also means that it's rather limited.

As an example, if you don't want to use the pthreads API, I think on Linux you would use the Linux-specific clone() system call to create a thread (clone() is not part of POSIX). If you go down on that level you will find a wealth of additional Linux-specific thread-related tuning knobs.
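For illustration, a bare-bones clone() call might look roughly like this (a sketch only; a real threading library passes more flags, e.g. CLONE_THREAD, and maps the stack with mmap):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>

    static int worker(void *arg)
    {
        printf("hello from the clone()d child: %s\n", (char *)arg);
        return 0;
    }

    int main(void)
    {
        size_t stack_size = 1024 * 1024;
        char *stack = malloc(stack_size);       /* the child needs its own stack */
        if (!stack) return 1;

        /* These flags make the child share memory, open files and signal
           handlers with the parent, i.e. behave much like a thread. */
        int flags = CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND | SIGCHLD;
        pid_t tid = clone(worker, stack + stack_size, flags, "argument");
        if (tid == -1) { perror("clone"); return 1; }

        waitpid(tid, NULL, 0);                  /* SIGCHLD makes waitpid() work */
        free(stack);
        return 0;
    }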


Thanks for the detailed answer!

So basically if I’m used to threads as a general cross-platform concept in higher level languages, I’m probably relating to pthreads?


I suppose it depends on what your concept of threads is.

If you're designing a complex system, you're more likely to look at what platforms you must support, and then make up your own (less abstract) notion of threads such that it can be implemented on these systems and that it lets you take more advantage of the capabilities of these systems.

I'd be surprised for example if Postgres used pthreads and not its own abstraction, supported by platform-specific implementations.


Most threading related functions in posix are named with the pthread_* pattern, so pthread (short for posix threads) has become the name of regular (usually OS level) threads on a posix platform.


What should new code be using instead?


That's too bad indeed. How would one implement coroutines in Linux/Unix today then?


Usually using C's `setjmp()` and `longjmp()` functions (but I consider that an ugly hack). I did an implementation for both x86-32 and x86-64 [1], written in assembly, that works on Linux and Mac OS X and is meant for C. I doubt it will work for C++, and I'd be a bit hesitant to use it for production, but it's surprisingly simple.

[1] https://github.com/spc476/C-Coroutines

Edit: clarification of what language my coroutines are implemented in.


I guess wait for libstdc++ to implement the Coroutines TS that got merged into the C++20 standard?


Python's greenlet extension implements coroutines using inline assembly, which has been widely ported (at least ARM and x86, both 32-bit and 64-bit). It's nicely packaged so that you can use it without the Python part.


Boost has a coroutine API. It provides both low-level (untyped, C-like) and high-level APIs.


ints are really the best. They are handles, i.e. identities. All my programs use ints almost exclusively. I have a few pointers in function signatures and for global arrays, but the rest is int.

When you need a void pointer or whatever information from some fiber, just have an array to map from the handle to that information.

Of course, if you are not willing to make a global array for lookup, you are screwed because some additional context is missing.
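A tiny sketch of what I mean (all names made up):

    #include <stdio.h>

    #define MAX_FIBERS 256

    /* The handle is just an index; whatever per-fiber context you need
       lives in parallel global arrays indexed by that handle. */
    static void *fiber_userdata[MAX_FIBERS];
    static int   fiber_count;

    static int fiber_register(void *userdata)
    {
        int handle = fiber_count++;
        fiber_userdata[handle] = userdata;
        return handle;
    }

    int main(void)
    {
        int h = fiber_register("per-fiber state");
        printf("handle %d -> %s\n", h, (char *)fiber_userdata[h]);
        return 0;
    }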


... and now you have to synchronize access to your handle table, grow it to meet demand -- let me tell you, that is a fun thing to do when there are thousands of things in flight, all expecting high performance -- and so on. And what NUMA node was that big global array on? Hmmm, I guess you could partition it, but it's kind of awkward and you have to recompute the geometry on every launch and where did you stash that information? TLS?

The natural thing for a context type is something that can be used to directly refer to any object. In today's common architectures this is a pointer. Context APIs that don't support "intptr" or something like it are broken.


> The natural thing for a context type is something that can be used to directly refer to any object. In today's common architectures this is a pointer. Context APIs that don't support "intptr" or something like it are broken.

I want to note the huge disadvantage of using pointers, which is that they aren't mathematical identities, but physical identities. It's place vs value. If you use pointers you are absolutely required to have stable addresses, which means you cannot treat data as what it is: information - and you cannot copy it, move it, duplicate and modify it and such. Because the best handle to it that you have is the memory address where it's supposed to live.

You cannot even have parallel arrays (which give modularity) but will end up with god objects amassing unwashed, unsorted, unrelated, incoherent junk data, because that's the best way to get from an address to the actual data.

While there are valid uses for pointers (and I don't claim to have any relevant experience with threaded applications), today's popular idea that "something that can be used to refer to any object" is best implemented by a pointer is what really leads to terrible, unmaintainable software.


Why are stable addresses a problem when you have virtual memory? And even if you didn't, couldn't you just use pointers to pointers?


It's an idea I've had for a long time for my applications, where (almost) all dynamically allocated data lives in global arrays. 64-bit applications have a huge address space, and I could simply place each array in a huge subrange of this address space. When an array needs to grow I simply mmap() more memory at the end, so I never have to move memory, and all pointers stay valid.
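Concretely, something like this (a 64-bit Linux sketch; the sizes are made up):

    #define _DEFAULT_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    /* Reserve a huge range of address space up front and commit pages as
       the array grows. The base pointer never moves, so pointers into the
       array stay valid. */
    #define RESERVE (1ULL << 36)    /* 64 GiB of address space, no memory yet */
    #define CHUNK   (1ULL << 20)    /* grow by 1 MiB at a time */

    static char  *base;
    static size_t committed;

    static int grow(size_t needed)
    {
        while (committed < needed) {
            if (mprotect(base + committed, CHUNK, PROT_READ | PROT_WRITE) != 0)
                return -1;
            committed += CHUNK;
        }
        return 0;
    }

    int main(void)
    {
        base = mmap(NULL, RESERVE, PROT_NONE,
                    MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
        if (base == MAP_FAILED) return 1;

        if (grow(4096) != 0) return 1;
        strcpy(base, "the base pointer never moves");
        printf("%s\n", base);
        return 0;
    }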

I try to avoid pointers to pointers because of the redirection. They are not really solving a problem but only delaying it, at the cost of performance.

But first and foremost the biggest problem with pointers is the identity thing. Using a pointer to access various kinds of information either requires that all this information is clumped together (loss of modularity) or at least that we maintain additional lookup structures that bind the disparate information together through indirection (loss of performance, more memory use, more code, and still some loss of modularity).


> I try to avoid pointers to pointers because of the redirection. They are not really solving a problem but only delaying it, at the cost of performance.

Passing an array index and doing an array lookup is also a form of indirection. How is that any different than passing a pointer and doing a pointer dereference? Either way you are not passing the value itself, you are passing a value that represents some kind of memory offset where the real data is stored. It's just that with the array approach, it is a relative offset rather than an absolute one.

> Using a pointer to access various kinds of information either requires that all this information is clumped together (loss of modularity)

Why does putting all the information together imply a loss of modularity? You could have structs inside structs, defined in separate files, and there would still be no need to dereference twice. Their values will just be stored as part of the enclosing struct's values automatically. You could even reference the specific sub-structs with pointers if you want to keep your modules unaware of the enclosing struct type.

Example: https://onlinegdb.com/BJLxqWhuV


Just plain global arrays are technically an indirection, but I suppose the performance impact is hard to measure in practice. You have a small number of global pointers (most projects will have < 100) that can be expected to be in the L1 cache. The compiler can easily optimize the code by not loading the value of the pointer on each access to an element of the array.

Pointers to pointers add one more level of indirection, and both of the properties I just described are probably not achievable in practice.

> Why does putting all the information together imply a loss of modularity? You could have structs inside structs, defined in separate files,

Because there is a place that must know them all, bind them all together. And all the client code that wants to use any of the things in there now depends on EACH of them (for a liberal but very real definition of "depends").


Unless the objects you're working with happen to be contiguously arranged in the arrays, it might be 100 cache lines for those 100 pointers. That's another reason why the "all the data together" approach would probably win performance-wise: you'd get 100 whole objects for those 100 cache lines.

> Because there is a place that must know them all, bind them all together. [...]

True, that could be a problem in some circumstances which might warrant a map-like structure like you describe. But I don't think it's the most obvious pattern otherwise


Oops, yeah, I misread your post, and while my sibling comment is technically correct, it wasn't really a reply to your comment.

Yes, the way I write it is absolutely that my global buffers are all allocated sequentially, so it's very cache friendly. And I partly agree that this is kind of "all the data together" in this instance. But note that there is not a struct that binds them together, and so there's no code that depends on them being laid out sequentially in memory.

Actually I can achieve the effect without losing modularity at all, by having each module specify its global data categorized as types of data (one category of which would be "global buffer pointers"). I'll then just have the build go over all my modules and put them sequentially in the built image.

This is exactly what ECS (entity component systems) do, by the way. What you want is aspect-oriented programming, performance-wise (and maintainability-wise in general), but sometimes you still want to maintain parts as optional "packages". So what ECS do is weave each aspect of each package into the final product.

By the way, one very simple way to achieve this in C is to declare the data like

    BUFFERDATA int *myIntArray;
in the header file of each module. And I define BUFFERDATA as "extern" for all modules except one data module, which "implements" all data declared BUFFERDATA by defining BUFFERDATA to nothing and including all of the project's header files.
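Spelled out, the pattern looks roughly like this (file names made up):

    /* module_a.h */
    #ifndef BUFFERDATA
    #define BUFFERDATA extern   /* ordinary modules only see declarations */
    #endif
    BUFFERDATA int *myIntArray;
    BUFFERDATA int  myIntCount;

    /* data.c - the one module that "implements" all BUFFERDATA globals */
    #define BUFFERDATA          /* expands to nothing, so these become definitions */
    #include "module_a.h"
    #include "module_b.h"       /* ...and every other module header */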


> it might be 100 cache lines for those 100 pointers

No, the usual recommendation is that you should split your structures into smaller structures such that each contains only fields that are likely to be accessed together. In other words, you should err on the side of SOA (parallel arrays) rather than AOS.

That's because the unused fields put pressure on the cache.
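A toy illustration of the difference:

    #include <stdio.h>

    #define N 10000

    /* AOS: a loop that only needs positions still drags every other
       field of each particle through the cache. */
    struct particle { float x, y, z; int flags; char name[48]; };
    static struct particle aos[N];

    /* SOA / parallel arrays: the same loop streams through densely
       packed position data and never touches the flags or names. */
    static float xs[N], ys[N], zs[N];

    int main(void)
    {
        float s1 = 0, s2 = 0;
        for (int i = 0; i < N; i++) s1 += aos[i].x;  /* ~64 bytes pulled in per particle */
        for (int i = 0; i < N; i++) s2 += xs[i];     /* 4 bytes pulled in per particle */
        printf("%f %f\n", s1, s2);
        return 0;
    }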


I don't disagree. I might be in the minority but personally I don't do OS threads as far as possible. They are hard. (And I don't do fibers, either, if I have a choice).

But if you do threads, then you could still put that "handle table" into TLS, couldn't you?


Fibers have some annoying edge cases related to thread-local storage, which can bite fiber-unaware code. Which is why Microsoft later added User-Mode Scheduling, which behaves much closer to actual threads, except that it still has the manual yielding behaviour of fibers.

https://docs.microsoft.com/en-us/windows/desktop/procthread/...


That's because what you need is Fiber-local storage no?


What happens when you call an API that isn't fiber-aware and uses thread-local storage in its implementation?

Basically your entire stack needs to be fiber-aware, or you can't use them.


Yes, this is actually a significant issue, since TLS is used in many cases to implement locks and mutexes.


> POSIX certainly has its own ugly corners, but those are the exceptions. In the Windows API, elegance is the exception.

Makes me wonder how seriously OP used POSIX APIs, and across how many platforms... because large parts of POSIX are pretty bad. And as an OS interface, the Windows API is clearly superior in almost every regard (if there wasn't this tiny portability issue).


I don't know what parts of Win32 you mean, but one task I did recently that should be dead simple was listing the contents of a directory. Why did I end up appending "* . *" (spaces inserted to avoid triggering HN's italic formatting) patterns to the directory name, and why do I need to find the first file with FindFirstFile() and all the remaining files with FindNextFile()? Maybe I was particularly dumb that day, but it's the best I could figure out and I don't think that's elegant.


FindFirstFile, FindNextFile are an iterator pattern.

In C++, you have

    auto iterator = container.begin();
    while(iterator != container.end()){
      ++iterator;
    }
In Java you have

    Iterator<String> iterator = container.iterator();
    while(iterator.hasNext()) {
        iterator.next(); 
    }
Similarly with FindFile you have (leaving off some parameters)

  auto handle = FindFirstFile();
  while(FindNextFile(handle)){}
  FindClose(handle);
Separating the creation of the iterator from advancing the iterator is a good thing.


You are correct. The actual pattern you want to use in practice with this API is more of a do ... while:

    WIN32_FIND_DATA data;
    HANDLE handle;

    handle = FindFirstFile(str, &data);
    // TODO - error check

    do
    {
        // TODO: process "data" which is a file.
    } while (FindNextFile(handle, &data));

    FindClose(handle);
So the loop body is the same for the "first" and "next" files.


> Why did I end up appending "* . *" (spaces inserted to avoid triggering HN's italic formatting) patterns to the directory name

Because sometimes you just want to get all the XML files in a directory, and it's nice to not have to filter yourself.

> why do I need to find the first file with FindFirstFile() and all the remaining files with FindNextFile()?

Why do you consider this a poor API and what would you consider a better API?


Yeah, but please let me just do the filtering myself and don't force me to allocate a new string to append a silly pattern that makes me wonder if matching things without a dot in their name is supported.

Yes, "sometimes" the API consumer just wants to match XML files. For example, when creating one of these silly file dialogs that never show what I'm looking for.

> Why do you consider this a poor API and what would you consider a better API?

POSIX gets that just right. No fuss; it exposes the technical reality in straightforward abstractions. You call opendir() and get a DIR* handle (representing an open directory). Then you call readdir() with that handle in a loop to get all the files in the directory.

I'm not saying I'm strictly against bells and whistles, but often these only work for a first prototype, and if API designers don't offer a straightforward way so I can simply do what needs to be done, they really failed at their job.
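For comparison, the whole loop is just this (a sketch, with the filtering done by the caller):

    #include <dirent.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        DIR *dir = opendir(".");
        if (!dir) { perror("opendir"); return 1; }

        /* readdir() hands back one entry per call; filtering (here: only
           .xml files) is the caller's business, not the API's. */
        struct dirent *entry;
        while ((entry = readdir(dir)) != NULL) {
            const char *dot = strrchr(entry->d_name, '.');
            if (dot && strcmp(dot, ".xml") == 0)
                printf("%s\n", entry->d_name);
        }

        closedir(dir);
        return 0;
    }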


> Yeah, but please let me just do the filtering myself and don't force me to allocate a new string

I will concede that I've often wondered why the file mask was not a separate parameter, so that it could be NULL if you just wanted to iterate over all the files.

As for "sometimes" wanting to filter, the majority of the times I use FindFirstFile, I do with a mask different from "all entries". Guess it depends what kind of programs one writes.

It certainly is a good thing that the capability is there, as it leads to consistency. But I agree it would be better if they had split the mask into a separate parameter.

> POSIX gets that just right.

It certainly is cleaner to separate the two. Though I've never had the need to split the code opening vs iterating, so at least for me, the practical difference is minor.


What is the importance of distinguishing between the first file and the files that follow it?

Why not FindFile()?


In theory there could be some FindFiles() that just returns an array of files. But how would you handle the memory to pass that data? The convention is that the caller allocates and owns the memory (for example you pass FindFirstFile a FIND_DATA structure to write the result into). But you don't know in advance how many files FindFiles() will return. It's not even trivial to have another API that tells you how many files FindFiles will return, because between two API calls files can be created and deleted by anyone on the system. The other option is to allocate some space, and if it wasn't enough ask for the rest of the data. And that's basically what FindFile() and FindNextFile() implement.


How would FindFile() work then?

As far as I can see, either you have some separate call to initialize the find handle (just like FindFirstFile does), or you pass the directory/filename mask redundantly over and over again, introducing the potential for suddenly calling it with an already initialized find handle but with a different directory.


The classic UNIX approach would of course be making FindFile non-reentrant and not thread-safe either, then later adding a FindFile_r which takes a void** where FindFile_r will store some intermediate state that you need to free yourself, except when an error occurred, because then freeing it gets you a double free. (However, not all UNIX systems have FindFile_r, and some of those instead have FindFile2 with slightly different semantics).


You are snarky, but I don't think it's warranted. Any examples of non-reentrant or thread-unsafe system calls in POSIX?

There are a few library calls that have a cousin with an _r suffix in POSIX, but these are not system interfacing calls. Just don't use them if they don't match your requirements. Most of them aren't even non-reentrant or thread-unsafe, but simply lack an additional parameter for some use cases (the use cases mostly come when you program in object-oriented style).

The one offender (not a system call, either) that I know off the top of my head is strtok(), which is of course terrible, but why would you use it?


Probably true, but there is in fact the apropos readdir function, which does have a readdir_r variant to address thread safety issues (now apparently deprecated due to other unrelated design issues). So it seems the joke’s on us.

Getting API design right is _hard_.


“There are a few library calls that have a cousin with an _r suffix in POSIX, but these are not system interfacing calls.”

Neither are most Windows functions. FindFirstFile, for example, is defined in kernel32.dll, and calls through to ntdll.dll, and eventually makes system calls such as NtOpenDirectoryObject and NtQueryDirectoryObject.

What IMO matters is the usability of the stable API. It doesn’t matter whether that is the kernel interface or not.


Yes. The distinction I made was meant to be between calls that end up calling into the system and ones that do not (like strtok()). I agree with what you say and my statement was exactly that the stable APIs into the system offered by POSIX are mostly unproblematic.


In POSIX, there are two sets of functions related to directories. The first set is `opendir()` (open a directory for reading), `readdir()` (return the next filename upon each call) and `closedir()` (close the directory). `readdir()` does not guarantee an ordering of filenames, and any filtering must be done by the application.

Then there's `wordexp()` and `wordfree()`. The first takes a file glob (file pattern?) and returns an array of filenames matching the pattern. Any memory allocated by `wordexp()` is freed by calling `wordfree()`.

Both have their uses.
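A minimal sketch of the `wordexp()` path:

    #include <stdio.h>
    #include <wordexp.h>

    int main(void)
    {
        wordexp_t we;

        /* Expand the glob; we.we_wordv holds the matching names. */
        if (wordexp("*.xml", &we, 0) != 0)
            return 1;

        for (size_t i = 0; i < we.we_wordc; i++)
            printf("%s\n", we.we_wordv[i]);

        wordfree(&we);
        return 0;
    }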


I don't know, I was just asking for clarification of the question, because you did not provide any information on the distinction between the two methods in the question "why do I need to find the first file with ...".

Your reply here could have been a great reply to the original question.


I misunderstood the poster I was replying to. To me it sounded like he wanted a singular call to be able to iterate over all the files in the directory.

I then assumed you proposed something similar with your FindFile call, and I didn't and still don't see how to do that in a clean way, hence asking how you would imagine it working.

In a reply the OP clarified that it was the fact that FindFirstFile does two things (initializing the handle, and getting the first result if any), rather than cleanly separating the handle initialization from getting the results. This is different, and a point I can agree with.


Is there such a thing? What does it do? I can only find a perl implementation that seems to use FindFirstFile() / FindNextFile().


No, it was a question.

Edit: I should clarify. It was a low-effort question to another low-effort question. I should state that I have no experience with the API in question. I was just asking about the difference between the two, and the poster above me answered in another reply that FindFirstFile is used to initialize / reset the query, whereas FindNextFile is used to continue the query.

To further answer the question I was replying to above, I think another function that could be paired with a hypothetical FindFile would be a method to initialize the query.

I do not know what would be a better API.


> To further answer the question I was replying to above, I think another function that could be paired with a hypothetical FindFile would be a method to initialize the query.

This is the obvious and correct way to do it, and it's the way it's done in POSIX with opendir() / readdir().


Agreed, if you're comparing apples to apples, it should be Win32 vs. POSIX and X Window.

> (if there wasn't this tiny portability issue)

Win32 is impressively portable given the sheer breadth of hardware it's supporting, especially all the kiosks and displays and such that run it.


Hmm, let's read this one.

> "March 28, 2019"

> "Fibers: the Most Elegant Windows API"

> "The Windows API — a.k.a. Win32 — is notorious for being clunky, ugly, and lacking good taste. Microsoft has done a pretty commendable job with backwards compatibility, but the trade-off is that the API is filled to the brim with historical cruft. Every hasty, poor design over the decades is carried forward forever, and, in many cases, even built upon, which essentially doubles down on past mistakes. POSIX certainly has its own ugly corners, but those are the exceptions. In the Windows API, elegance is the exception."

I'm glad OP just discovered in mid-2019 an API that has been there since Windows XP (ca. 2001). Why then, does OP start by crapping all over the Windows codebase?

You think the Fibers API is elegant? Have a look at IOCP. IO in Windows in general actually, since the days of Windows 95. Event management. Threading and parallelism. The kernel subsystem. The APIs in Windows have remained stable, consistent, forward and backwards compatible - from pretty much day one.


> I'm glad OP just discovered in mid-2019 an API that has been there since Windows XP

Can I bring to your attention the second para:

"That’s why, when I recently revisited the Fibers API, I was pleasantly surprised."

I don't think it's the author's first rodeo. The article reads like a retrospective look at the fibers API and how well it's aged.


It's been there since NT 3.51, according to https://www.geoffchappell.com/studies/windows/win32/kernel32... and my copy of Win32.hlp (from back when MS documentation was actually proofread before release...)


IOCP is particularly strong when you compare it to select, poll, epoll, kpoll, and all of the other attempts to make asyncio work in Linux.


Eh. I used to think that, until I actually wrote event systems based on all of the above. epoll turns out to be my favorite.

Buffer allocation with IOCP feels incredibly ugly. You must allocate a buffer before you initiate a read(), and then you must leave that buffer alone until the read() completes. You would think that this has the advantage that the kernel doesn't need to allocate its own buffer, and you get some sort of zero-copy magic where bytes land directly in userspace. But that's not really the case. The socket still needs a buffer on the kernel side in case bytes arrive while no read() is pending. So now you have to allocate a buffer for each socket and the kernel also has to allocate a buffer for each socket, which seems like a waste.

This rabbit hole gets deeper. What happens if the socket receives two packets in rapid succession? In a naive implementation, the first packet completes the read(), and signals the completion on the IOCP. The app now has to process that event and start a new read() when it's ready. But in the meantime, the second packet arrives. Whoops, no read() is pending, so now the packet has to go into a kernel-side buffer only to be copied later.

But that's a naive implementation. Apparently, in reality, the kernel implements some sort of nagle-like algorithm where it tries to wait a bit for additional packets before it actually signals completion of a read(). But this introduces delay, and many projects (e.g. Chrome) have discovered this delay is rather harmful to certain kinds of performance. I read somewhere that Chrome and others have given up on IOCP and use WSAPoll instead -- but I can't remember where I read this because all the lore about Windows event handling is hidden in random forum threads and Github gists rather than proper documentation.

It seems to me that the right way to do what Windows was trying to do here would be for userspace to allocate a ring buffer for the kernel to use, and then the app and the kernel would coordinate the start and end pointers of the ring buffer. Then the kernel needs no buffer of its own; it can always deliver to userspace. If the buffer fills up, the kernel can do exactly what it would do in the case of a regular kernel-side buffer filling up -- apply backpressure and force the peer to retransmit later.


I should add: Even what I wrote above is not the whole story. The deeper you get into kernel internals, the more you realize that the story is nuanced.

On Linux, apparently, the kernel does not actually allocate static socket buffers. Instead, the "socket buffer" is actually something like an array of pointers to packets. When the read buffer is empty, it's not taking any space.

So on Linux, an idling socket is very cheap (by my understanding, at least).

The Windows kernel might do something similar internally (I don't know), but there's no way to avoid the redundant buffer on the userspace side when using IOCP.

So basically, it seems that IOCP turns out to be a rather-poor interface for networking in practice, despite at first appearing theoretically superior.


I've never used IOCP, but my understanding is that they can sort of emulate readiness notification (as opposed to completion notification) by passing a 0 sized buffer. On the callback then you would perform the (non-blocking) read yourself.

In general I think that readiness notification is superior to completion notification for network reads. Completion notification works fine for network writes and it is superior for disk IO (where readiness is ill defined).

The biggest advantage of IOCP though is that it basically acts as a dynamically sized threadpool, as it keeps the number of running threads to a minimum. macOS does something similar with GCD. I still haven't found a non-hacky way to do the same on Linux (you can sort of emulate it either with sched_fifo realtime threads or by using sched_idle to detect idleness).


IOCP in Windows is a good API (although a bit non-obvious), but to be fair, they did have a few tries and missteps before they got there.


It's not. IOCP is the worst of all of them.


I think this is a must read for new developers: https://www.joelonsoftware.com/2004/06/13/how-microsoft-lost... (yes it's from 2004)


I had a hard time with IOCP and eventually ditched it because I didn't find a way to avoid creating a separate read buffer for each and every handle.

Suppose I want to watch 10K connections, or files, or such. 4K should be a reasonable read buffer size. Do I really need to spend 40M of memory just to read from the handles simultaneously?


What would you have the OS do?

If you got connections with inbound data, the OS has to store it somewhere. Might as well be in your read buffer right away.

If you ask the OS to read 4k from 10k files simultaneously, why are you surprised that the OS wants the buffer to stay around until the read is completed?


It should let me choose which buffer I want to use next, from a pool of buffers that I provide.

There is no point in having a 1:1 relationship between buffers and read handles.


That doesn't sound very performant: having to switch back into userspace just to get another buffer, then back into kernel space to continue completing the read.

Though as far as I can see you could get close to what you wish by doing 1 byte reads.


Your snarkiness is not appreciated and I never said you should do constant switching. It's not required.

> If you ask the OS to read 4k from 10k files simultaneously

There is a difference between copying inbound data from 10k handles simultaneously (which does require as many buffers) on the one hand, and on the other hand being open to read from any M-sized subset of 10k handles at once, which requires only M buffers.

I also never said to have only 4k of (combined) buffer data for 10k handles. "4k should be a reasonable read buffer size" means 4k per buffer. I don't think I've been unclear, but with a tiny bit of good will you could have figured out what I meant anyway.


FWIW I'm not trying to be snarky.

> I never said you should do constant switching. It's not required.

I didn't see any other way to interpret your "It should let me choose which buffer I want to use next".

Though perhaps what you meant was that you want to be able to tell the OS "yo, here's a bunch of buffers, use these to complete my reads"?

> [...] and on the other hand being open to read from any M-sized subset of 10k handles at once, which requires only M buffers.

I interpret that as saying "yo OS, I want to read 4k each from these 10k files, but only use these M 4k buffers when doing so", where M << 10k.

If a request is slow to be fulfilled won't it hog a buffer, reducing the maximum throughput of the fast requests? Why would this be a performance win?


> Though perhaps what you meant was that you want to be able to tell the OS "yo, here's a bunch of buffers, use these to complete my reads"?

Thank you for clarifying, and yes, that would be the best way to do it I think.

> If a request is slow to be fulfilled won't it hog a buffer, reducing the maximum throughput of the fast requests? Why would this be a performance win?

So someone else here said that with IOCP the kernel doesn't have its own read buffers but writes directly to userspace memory (in cases where this is possible, anyway). So there might be a possibility for buffer hogging, I don't know. On the other hand, when the request really is slow to be fulfilled, the OS could just call it a partial write and notify the program of a (partial) completion. In which case the buffer is free for other uses again.


Thanks for the follow-up. Would be interesting to see if it made much difference in the end, guess we'll have to wait a bit longer for all of Windows to be open sourced :P


> I also never said to have only 4k of (combined) buffer data for 10k handles. "4k should be a reasonable read buffer size" means 4k per buffer.

I never thought the former, only the latter.


Oh I see, it was the other commenter saying "less than 1 byte per buffer" and you only said "by doing 1 byte reads". It seems I really was a little unclear and should try to hold my horses.


No harm done :)

My idea was to exploit the caching mechanism in the OS. Though clearly it wouldn't be as efficient as providing a set of buffers for the OS to use.


I'm confused (but I'm also not familiar with the API). It sounds like you're saying you want to be able to read from 10k connections simultaneously while allocating less than a byte of buffer space to each, so I assume I'm misunderstanding some aspect of what you're trying to say (or some aspect of the API that puts this into the correct context).


On unixy systems the kernel manages buffers by itself, using as little memory as possible and not allocating anything when no data is received, and userspace doesn't need to allocate any memory either until it is notified and decides to non-blockingly read the data from the kernel. Meaning that waiting for data from connections costs nothing, while on Windows it requires preallocating a buffer per connection. Basically the same problem that synchronous APIs have.

The myth that IOCP is elegant has to die.


In the user mode piece, a single thread can get told "these 10K descriptors are ready for you to call read(2) upon them", then that thread can read them all using the same buffer. [In practice you might allocate more depending on what's going on, but this is a crude example.] In the IOCP model you need to allocate space for all read requests upfront, even if they're not writing you any data currently.

Kernel mode would have its own buffers, so memory cost per client is not small in this example... except in terms of user-mode buffers.


No I'm saying I want to read from N handles simultaneously using M buffers, where M <= N. Of course, the OS shouldn't write to one of my read buffers that currently contains other buffered data. But that doesn't mean that it has to be N = M, since that is a waste.

I guess M = 10 * number of OS threads should usually be plenty to achieve good performance. But I'm happy to be corrected by people more experienced with systems performance.


I believe I understand. By not allocating in the API call itself, you can pre-allocate your own buffers, keep track yourself of which are in use and which are not, and use some subset of the total amount that would otherwise be allocated per API call, by reusing buffers after a particular connection is done with them. That makes sense.


> Do I really need to spend 40M of memory just to read from the handles simultaneously?

of virtual memory... if it isn't being used it can be paged out.


Paging? I don't do paging because I value my mental sanity. I don't think there is a good reason to use paging in 2019 for the vast majority of applications, and the rest is probably simply badly designed software. And I only have 8G of RAM like everybody else, 4G of which have probably never been written to.

And if you do use paging, it's still 40M wasted disk space without a good reason.


but at some point you will need it (when the socket becomes ready to read) and it would be extremely costly to bring it back in.


I recently had an idea about an app and a clever way to hook it into fairly modern, user-facing parts of Windows and macOS (Search/Cortana and Spotlight, respectively.)

I did some feasibility research on Windows first, and very soon hit some ancient win32 interfaces I had to implement, used by parts of the shell that have existed since the earliest days, something I have no intention of wasting my time on.

I was under the impression that macOS would be much better, but was surprised to find that there's no public or documented API, just a 3rd party open source framework implemented as a hack that combines javascript and python somehow, but seems to get the job done.

This got me thinking that you need a serious amount of expertise to implement new experiences on modern operating systems, never mind crossing platforms. Puts the appeal of technologies like Electron that do platform integration for you in perspective.


On Windows, you do need some effort if you have no idea what is going on, but it isn't like those systems were sloppily put together, there are some design ideas and you get used to them... eventually. But once you do, things more or less fall into place.

The biggest issue with working on older Windows APIs is that the modern web-based MSDN documentation is simply awful, and the best you can do is to find old MSDN CDs/DVDs, which often contained more information than the site (especially articles), better organization and -especially- a much faster interface (being local and all, and the CHM format used on the late-'90s / early-2000s CD-ROMs is lightweight). And 99.999% of the documentation you'll find on these CDs will have the same text as the documentation (if it's still there) on MSDN (and very often the same text that is in the much older WinHelp WIN32SDK.HLP files too).

Like most things Microsoft on desktop, everything went downhill since .NET people took over and decided to redo things their way (yes, .NET brought some nice things too, but it created an internal schism - before .NET became a thing, desktop tech in Windows was mostly coherent) and now history is repeated with UWP.


Totally agree. I worked for a while with pure Win32. Once you get the hang of it it’s quite consistent and pretty well thought out. Also agree about the documentation. Seems MS is determined to make it worse every year, break more links and provide less useful information. The old MSDN CDs were a piece of beauty.


Consistency?

Let’s start with strings...

- the standard null terminated single byte char C string

- the double-wide null terminated C string required for some frameworks - especially Windows CE

- the COM BSTR, which is prefixed with its length

- the ATL bstr_t

- CString that isn’t a null terminated C string

- of course the C++ string class. I’m not sure if this is ever used by the Win32 APIs

I’m sure I’m missing one.


TCHAR probably counts because, even though it's just a typedef for char or wchar_t, it has its own t-prefixed APIs and mental overhead of having to write code that can work with either character size.


If you count various flavors of UWP (C++/CX, C++/WinRT, WRL) you can add std::wstring and Platform::String^


Don't forget HSTRING, which Platform::String^ is built on. They added those two in Windows 8 because there was no one standard string type. (Not joking.)



So there is also the OEM_STRING and the ANSI_STRING


.NET and WPF are an abstraction of win32, I view it as a step in the right direction. How did .NET people cause everything to go downhill?


Especially WPF is not an abstraction of Win32. Maybe Winforms could be called that but certainly not any of the XAML dialects.


WPF is deeply integrated with Win32. For example it abstracts away the win32 message pump design. I'd call that an abstraction.


Yes, it abstracts it to the extent that it doesn't resemble Win32 anymore. It's something else.


I would have to agree. Win32 is immediate mode, while WPF is retained scene graph.

It's not an abstraction, it's a complete overhaul.

MFC is an abstraction to me.


As far as I know WPF is built on DirectX, whereas MFC and WinForms are based on Win32.


That's my understanding as well. Point Spy++ at a WPF app and you see one top-level HWND which is just hosting a DX context.


WPF is a lot closer to an abstraction over Direct2D than Win32. If you use both of those APIs a lot, you start to realize they have a lot of classes with the same name that do the same thing.


The difference is that WinForms uses some of the actual higher-level UI widgets and layout system from Win32/USER, WPF just uses its message pump and other low-level windowing/IO routing mechanisms, and builds all its own bespoke composition, layout and controls on top of that.


I wonder if there is a real black market for bootleg MSDN cds from the early 2000s.


I wouldn't call it elegant at all; ConvertThreadToFiber is a pain for a library. I like swapcontext much more in theory.

In practice the issue with swapcontext is that it does not integrate well with the rest of POSIX (what happens if you move a fiber to another thread?) and the requirement to restore the signal mask pretty much forces a very slow implementation.

On the other hand, Windows Fibers are somewhat efficient (one can still do significantly better), but most importantly, the interaction with the rest of the system is well understood.

I wrote my thoughts on the API here about (/checks calendar) 13 years ago: http://www.crystalclearsoftware.com/soc/coroutine/coroutine/...


It's not that elegant.

Win32 fibers can't return like UNIX/ucontext.h supports with the uc_link member. So you end up having to wrap your win32 fiber functions with your own trampoline to let them return without exiting the program and Do The Right Thing.

Furthermore there's this thread-global fibers-enabled state win32 toggles with ConvertThreadToFiber() requiring any transparent coroutines library supporting win32 fibers to carefully initialize once per thread while also supporting reentrant use. It's mildly annoying when you want to generically support things like nested bundles or hierarchies of coroutines.

This last bit seems completely strange considering a fiber's state should be a subset of what forms a thread, and you're always running on a thread. There should be nothing to do when "converting" to a fiber. Just make the fiber state always accessible to userspace via GetCurrentFiber() and always allow calling CreateFiber() and SwitchToFiber().
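A rough sketch of the kind of trampoline I mean (error handling omitted):

    #include <windows.h>
    #include <stdio.h>

    static void *main_fiber;

    struct task { void (*fn)(void *); void *arg; };

    /* A fiber function must never return (that exits the thread), so the
       trampoline runs the user's function and then switches back to the
       scheduling fiber instead of returning. */
    static void CALLBACK trampoline(void *param)
    {
        struct task *t = param;
        t->fn(t->arg);
        SwitchToFiber(main_fiber);
    }

    static void hello(void *arg)
    {
        printf("hello from a fiber: %s\n", (char *)arg);
    }

    int main(void)
    {
        main_fiber = ConvertThreadToFiber(NULL);   /* the per-thread dance */

        struct task t = { hello, "argument" };
        void *f = CreateFiber(0, trampoline, &t);
        SwitchToFiber(f);                          /* runs hello(), then comes back */
        DeleteFiber(f);
        return 0;
    }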

Another strange wart in win32 fibers is the C srand()/rand() state is per-fiber for some reason. It's undefined by C since C says nothing of coroutines, but it's very strange and generally unexpected behavior to see your coroutines producing identical sequences of values coming out of rand().

On *NIX the rand() state is per-process, pthreads share user memory so they all share the rand() state, and ucontext shares the thread so of course its contexts share the single state as well.

I much prefer the UNIX/ucontext API, they just fucked up the operands for makecontext(). Instead of deprecating ucontext POSIX should have just fixed makecontext() to explicitly take a pointer. C programs have no supported system-supplied coroutine API on UNIX now, which is bullshit.


I just fell in love with that API. That's seriously the first thing windows has which I really REALLY wish linux would just copy as is. It's simple and to the point, but there's nothing missing either. Perfect.


I once built something on it and remember it as quite a nice experience.


Fibers still have a lot of problems. They're incompatible with any APIs that have thread affinity. Windowing, single-threaded apartment COM, CRITICAL_SECTIONs, etc...


How do you manage stack memory (allocation, increasing capacity, freeing, moving, etc.)? Is it possible to suspend a fiber to disk?


A Fiber has its own OS stack and cannot outlive the process (just like a thread). So the memory cost is the same as a normal thread. The main difference is that it uses cooperative scheduling.


The big news seems to be that makecontext et al are deprecated. Does anyone know why that happened? Lots of "green threads" / coroutines implementations are based on it.


Basically their signature can no longer be expressed in ISO C, so instead of fixing it POSIX decided to just deprecate them.


FWIW Russ Cox wrote a similar API, named libtask, on top of the *context() calls.



