How the Windows Subsystem for Linux Redirects Syscalls (microsoft.com)
359 points by jackhammons on June 8, 2016 | 266 comments

> The real NtQueryDirectoryFile API takes 11 parameters

Curiosity got the best of me here: I had to look this up in the docs to see how a Linux syscall that takes 3 parameters could possibly take 11. Spoiler alert: the extra parameters are used for async callbacks, filtering by name, allowing only partial results, and the ability to progressively scan with repeated calls.
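On the Linux side, those extras aren't syscall parameters at all; they're composed in userspace on top of a minimal getdents(2)-style interface. A toy Python sketch (file names invented for the example) of filtering and progressive scanning layered above the syscall:

```python
import fnmatch
import os
import tempfile

d = tempfile.mkdtemp()
for name in ("a.txt", "b.txt", "c.log"):   # invented file names
    open(os.path.join(d, name), "w").close()

# Progressive scan: os.scandir is an iterator over getdents(2) batches,
# so a caller can stop early ("partial results" for free).
# Name filtering is plain userspace code, not a syscall parameter.
matches = sorted(
    e.name for e in os.scandir(d) if fnmatch.fnmatch(e.name, "*.txt")
)
print(matches)  # ['a.txt', 'b.txt']
```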

This is a recurring pattern in Windows development. Unix devs look at the Windows API and go "This syscall takes 11 parameters? GROAN." But the NT kernel is much more sophisticated and powerful than Linux, so its system calls are going to be necessarily more complicated.

Curiosity got the better of me recently when I re-read Russinovich's [NT and VMS - The Rest Of The Story](http://windowsitpro.com/windows-client/windows-nt-and-vms-re...), and I bought a copy of [VMS Internals and Data Structures](http://www.amazon.com/VAX-VMS-Internals-Data-Structures/dp/1...).

Side by side, comparing VMS to UNIX, VMS's approach to a few key areas like I/O, ASTs, and tiered interrupt levels is simply more sophisticated. NT inherited all of that. It was fundamentally superior to UNIX, as a kernel, from day 1.

I haven't met a single person who has understood NT and Linux/UNIX and still thinks UNIX is superior as far as the kernels go. I have definitely alienated myself as I've discovered that, though, as it's such a wildly unpopular sentiment in open source land.

Cutler got a call from Gates in 89, and from 89-93, NT was built. He was 47 at the time, and was one of the lead developers of VMS, which was a rock-solid operating system.

In '93, Linus was 22 and was starting to "implement enough syscalls until bash ran" as a fun project to work on.

Cutler despised the UNIX I/O model. "Getta byte getta byte getta byte byte byte." The I/O request packet approach to I/O (and tiered interrupts) is one of the key reasons behind NT's superiority. And once you've grok'd things like APCs and structured exception handling, signals just seem absolutely ghastly in comparison.

Since we're going into the history of Windows NT, VMS and Dave Cutler, I'd like to highlight this classic book on the history of all three of the above[1].

It follows the same line of narrative as The Soul of a New Machine

[1] https://www.amazon.com/Showstopper-Breakneck-Windows-Generat...

I freaking love Showstopper! Such a great book. It's probably the most information available on David Cutler anywhere.

The author also e-mailed me saying thanks when I tweeted him how much I liked the book, which I thought was super nice.

I've never met a single person who understood what they were talking about and referred to a "UNIX kernel". It may be true that Linux was once less advanced than NT - this is no longer the case, despite egregious design flaws in things like epoll. It has simply never been true (for example) for the Illumos (née Solaris) kernel.

Which design faults do you think epoll specifically has?

I know there are lots of file descriptors not usable with epoll or rather async i/o in general and that sucks (e.g. regular files).

For networking, I find epoll/sockets nicer to work with than Windows' IOCP, because with IOCP you need to keep your buffers around until the kernel deems your operation complete. I think you have 3 options:

1) Design the whole application to manage buffers like IOCP likes (this propagates to client code because now e.g. they need to have their ring buffer reference-counted).

2) You handle it transparently in the socket wrapper code by using an intermediate buffer and expose a simple read()/write() interface which doesn't require the user to keep a buffer around when they don't want the socket anymore.

3) You handle it by synchronously waiting for I/O to be cancelled after using CancelIo. This sounds risky with potential to lock up the application for an unknown amount of time. It's also non-trivial because in that time IOCP will give you completion results for unrelated I/Os which you will need to buffer and process later.

On the other hand, with Linux such issues don't exist by design, because data is only ever copied in read/write calls which return immediately (in non-blocking mode).
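A minimal sketch of that buffer-lifetime point, using Python's non-blocking sockets: the buffer only has to exist for the duration of the recv() call itself, unlike a buffer posted to an IOCP receive.

```python
import socket

a, b = socket.socketpair()
b.setblocking(False)                 # O_NONBLOCK

try:
    b.recv(4096)                     # nothing pending: fails immediately,
except BlockingIOError:              # and no buffer had to outlive the call
    pass

a.sendall(b"hello")
data = b.recv(4096)                  # buffer is filled and returned in one call
print(data)  # b'hello'
a.close(); b.close()
```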

The cool thing is that by gifting buffers to the kernel as policy, the kernel gets to make that choice.

In Linux, you don't, so userspace and kernelspace both end up doing unnecessary copying; and with shared buffers, something might poll as non-blocking but actually block by the time you get around to using the buffer.

This is annoying, and it generally means you need more than two system calls on average for every IO operation in performance servers.

As a rule, you can generally detect "design faults" by the number of competing and overlapping designs (select, poll, epoll, kevent, /dev/poll, aio, sigio, etc, etc, etc). I personally would have preferred a more fleshed out SIGIO model, but we got what we got...

One specific fault of epoll (compared to near-relatives) is that the epoll_data_t is very small. In Kqueue you can store both the file descriptor with activity (ident) as well as a few bytes of user data. As a result, people use a heap pointer which causes an extra stall to memory. Memory is so slow...

> something might poll and not-block, but actually block by the time you get around to using the buffer

On Linux if a socket is set to non-blocking it will not block. I don't really understand your point with shared buffers. You wouldn't typically share a TCP socket since that would result in unpredictable splitting/joining of data.

> In Linux, you don't, and userspace and kernelspace both end up doing unnecessary copying

I'm not so sure the Linux design where copies are done in syscalls must be inherently less efficient. I'm pretty sure with either design, you generally need at least one memcpy - for RX, from the in-kernel RX buffers to the user memory, and for TX, from user memory to the in-kernel TX buffers. I think getting rid of either copy is extremely hard and would need extremely smart hardware, especially the RX copy (because the Ethernet hardware would need to analyze the packet and figure out where the final destination of the data is!). Getting rid of TX copy might be easier but still hard because it'd need complex DMA support on the Ethernet card that could access potentially unaligned memory addresses. On the other hand, I also don't think you need more than one copy, if you design the network stack with that in mind.

> you need more than two system calls on average for every IO operation in performance servers.

True but it's not obvious that this is a performance bottleneck. Consider that a single epoll wait can return many ready sockets. I think theoretically it would hurt latency rather than throughput.
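That amortization can be sketched in a few lines (Linux-only, since it uses epoll): one wait call reports several ready sockets at once.

```python
import select
import socket

pairs = [socket.socketpair() for _ in range(4)]
ep = select.epoll()
for _, rd in pairs:
    ep.register(rd.fileno(), select.EPOLLIN)

for wr, _ in pairs:
    wr.sendall(b"x")                  # make all four sockets readable

events = ep.poll(1)                   # ONE syscall, four ready sockets reported
print(len(events))  # 4

ep.close()
for wr, rd in pairs:
    wr.close(); rd.close()
```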

> As a rule, you can generally detect "design faults" by the number of competing and overlapping designs.

On Linux, I think for sockets, there are only: blocking, select, poll, epoll. And the latter three are just different ways to do the same thing. On Windows, it's much more complicated - see this list of different methods to use sockets (my own answer): http://stackoverflow.com/questions/11830839/when-using-iocp-...

> One specific fault of epoll (compared to near-relatives) is that the epoll_data_t is very small.

Pretty much universally when you're dealing with a socket, you have some nontrivial amount of data associated with it that you will need to access when it's ready for I/O, typically a struct which at least holds the fd number. Naturally you put a pointer to such a struct into the epoll_data_t. I don't see how one could do it more efficiently outside of very specialized cases.

> I'm not so sure the Linux design where copies are done in syscalls must be inherently less efficient.

Windows overlapped IO can map the user buffer directly to the network hardware, which means that in some situations there will be zero copies on outbound traffic.

> especially the RX copy ... I also don't think you need more than one copy, if you design the network stack with that in mind.

When the interrupt occurs, the network driver is notified that the DMA hardware has written bytes into memory. On Windows, it can map those pages directly onto the virtual addresses where the user is expecting it. This is zero copies, and just involves updating the page tables.

This works because on Windows, the user space said when data comes in, fill this buffer, but on Linux the user space is still waiting on epoll/kevent/poll/select() -- it has only told the kernel what files it is interested in activity on, and hasn't yet told the kernel where to deposit the next chunk of data. That means the network driver has to copy that data onto some other place, or the DMA hardware will rewrite it on the next interrupt!

If you want to see what this looks like, I note that FreeBSD went to a lot of trouble to implement this trick using the UNIX file API[0]

> On Linux, I think for sockets, there are only: blocking, select, poll, epoll. And the latter three are just different ways to do the same thing.

Linux also supports SIGIO[1], and there are a number of aio[2] implementations for Linux.

epoll is not the same as poll: Copying data in and out of the kernel costs a lot, as can be seen by any comparison of the two, e.g. [3]

Also worth noting: Felix observes[4] SIGIO is as fast as epoll.

> I don't see how one could do it more efficiently

Dereferencing the pointer causes the CPU to stall right after the kernel has transferred control back into user space, while the memory hardware fetches the data at the pointer. This is a silly waste of time and of precious resources, considering the process is going to need the file descriptor and its user data in order to schedule the IO operation on the file descriptor.

In fact, on Linux I get more than a full percent improvement out of putting the file descriptor there, instead of the pointer, and using a static array of objects aligned for cache sharing.
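A sketch of that fd-as-index trick, assuming a Linux host: since Unix fds are small dense integers, per-connection state can live in a flat array indexed by fd. (In C this would be the cache-aligned static array described above; a Python list stands in for it here.)

```python
import select
import socket

MAX_FDS = 1024
conn_state = [None] * MAX_FDS        # fd -> per-connection state, indexed directly

a, b = socket.socketpair()
fd_b = b.fileno()
conn_state[fd_b] = {"sock": b, "bytes_in": 0}

ep = select.epoll()
ep.register(fd_b, select.EPOLLIN)

a.sendall(b"ping")
for fd, _ in ep.poll(1):
    state = conn_state[fd]           # direct index, no pointer chase
    state["bytes_in"] += len(state["sock"].recv(4096))

print(conn_state[fd_b]["bytes_in"])  # 4
ep.close(); a.close(); b.close()
```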

For more on this subject, you should see "what every programmer should know about memory"[5].

[0]: http://people.freebsd.org/~ken/zero_copy/

[1]: http://davmac.org/davpage/linux/async-io.html#sigio

[2]: http://lse.sourceforge.net/io/aio.html

[3]: http://lse.sourceforge.net/epoll/dph-smp.png

[4]: http://bulk.fefe.de/scalability/

[5]: https://www.akkadia.org/drepper/cpumemory.pdf

Thanks for the info. Yes I suppose zero-copy can be made to work but surely one needs to go through a LOT of trouble to make it work.

I'm curious about sending data for TCP though: don't you need to have the original data available anyway, in case it needs to be retransmitted? Do the overlapped TX operations (on Windows) complete only once the data has also been acked? Are you expected to do multiple overlapped operations concurrently to prevent bad performance due to waiting for ack of pending data?

Windows was designed for an era where context switches were around 80µsec (nowadays they're closer to 2µsec), so a lot of that hard work has already paid for itself.

> don't you need to have the original data available anyway, in case it needs to be retransmitted? Do the overlapped TX operations (on Windows) complete only once the the data has also been acked?

I don't know. My knowledge of Windows is almost twenty years old at this point.

If I recall correctly, the TCP driver actually makes a copy when it makes the packet checksums (since you have the cost of reading the pages anyway), but I think this behaviour is for compatibility with Winsock, and it could have used a copy-on-write page, or given a zero page, in other situations.

I qualified it as "Linux/UNIX kernel" because I wanted to emphasize the kernel and not userspace.

Solaris event ports are good, but they're still ultimately backed by a readiness-oriented I/O model, and can't be used for asynchronous file I/O.

Solaris event ports most certainly can be and are used for async I/O. I'm not sure how you can claim otherwise:



And Solaris, (unlike Linux historically at least), supports async I/O on both files and sockets. Linux (historically) only supported it for sockets. I have no idea if Linux generally supports async I/O for files at this point.

Let me rephrase it: there is nothing on any version of UNIX that supports an asynchronous file I/O API that integrates cleanly with the file system cache -- you can do signal based asynchronous I/O, but that isn't anywhere near as elegant as having a single system call that will return immediately if the data is available, and if not, sets up an overlapped operation and still returns immediately to the caller.
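This isn't the NT API, but the shape of that single call can be sketched as a toy Python analogy (the cache dict, file names, and slow_read helper are all invented): a hit completes synchronously, a miss starts the work and returns a pending handle immediately.

```python
from concurrent.futures import ThreadPoolExecutor

cache = {"hot.txt": b"cached data"}     # invented stand-in for the fs cache
pool = ThreadPoolExecutor(max_workers=2)

def slow_read(name):
    # Stand-in for the actual device read.
    return b"data for " + name.encode()

def read_async(name):
    if name in cache:                   # hit: complete synchronously, right now
        return ("done", cache[name])
    # Miss: start the work and return a pending handle immediately.
    return ("pending", pool.submit(slow_read, name))

status, result = read_async("hot.txt")
print(status)        # done

status, fut = read_async("cold.txt")
print(status)        # pending
print(fut.result())  # b'data for cold.txt'
```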

This isn't a terrible recap of async file I/O issues on contemporary operating systems: http://blog.libtorrent.org/2012/10/asynchronous-disk-io/

They're called threads. Overlapped I/O on Windows is based on a thread pool, which is why you can't cancel and destroy your handle whenever you want--a thread from the pool might be performing the _blocking_ I/O operation, and in that critical section there's no way to wake it up to tell it to cancel. Just... like... on... Unix.

The difference between Windows Overlapped I/O and POSIX AIO is that on Windows it's a black box, so people can pretend it's magical. Whereas on Linux there hasn't been interest (AFAIK) to merge patches that provide a kernel-side pool of threads for doing I/O, and the decades-long debates have spilled out onto the streets. If you view userspace code as somehow inelegant or fundamentally slow, then of course all the blackbox Windows API and kernel components look appealing. What has held Linux back regarding AIO is that Linux (and Unix people in general) have historically preferred to keep as much in userspace as possible.

This is why NT has syscalls taking 11 parameters. In the Unix world you don't design _kernel_ APIs that way, or any APIs, generally. In the Unix world you prefer simple APIs that compose well. read/write/poll compose much better than overlapped I/O, though which model is most elegant and useful in practice is highly context-dependent. As an example, just think how you'd abstract overlapped I/O in your favorite programming language. C# exposes overlapped I/O directly in the language, but doing so required committing to very specific constructs in the language.

As for performance, both Linux and FreeBSD support zero-copy into and out of userspace buffers. The missing piece is special kernel scheduling hints (e.g. like Apple's Grand Central Dispatch) to optimize the number of threads dedicated per-process and globally to I/O thread pools. But at least not too long ago the Linux kernel was much more efficient at handling thousands of threads than Windows, so it wasn't really an issue. That's another thing Linux prefers--optimizing the heck out of simpler interfaces (e.g. fork), rather than creating 11-argument kernel syscalls. In other words, make the operation fast in all or most cases so you don't need more complex interfaces.

> The difference between Windows Overlapped I/O and POSIX AIO is that on Windows it's a black box, so people can pretend it's magical.

No. The difference is that in Windows you can check for completion and set up an overlapped I/O operation in one system call. Requiring multiple system calls to do the same thing means more unnecessary context switches, and the possibility of race conditions especially in multithreaded code. That and, as trentnelson stated, the Windows implementation is well integrated with the kernel's filesystem cache. Linux userspace solutions? Hahaha.

Supplying this capability as a primitive rather than requiring userland hacks is the right way to do it from an application developer's perspective.

As load increases, polling in Unix asymptotically approaches one additional syscall per thread for any number of sockets. That's because a single poll returns all the ready sockets--you're not polling each socket individually before each I/O request[1]. That means if your thread has 10,000 sockets, the amortized cost per I/O operation is <= 1/10,000th of a syscall.

As for IOCP being "well integrated", what does that even mean? In Windows when file I/O can't be satisfied from the buffer cache, Windows uses a thread pool to do the I/O (presuming it just doesn't block your thread; see https://support.microsoft.com/en-us/kb/156932), just like you'd do it in Unix. There's nothing magical about that thread pool other than that the threads aren't bound to a userspace context. Maybe you mean that the kernel can adjust the number of slave threads so that there aren't too many outstanding synchronous I/O requests? But the Linux I/O scheduler can implement similar logic when queueing and prioritizing requests. It's six of one and a half-dozen of the other.

[1] At least, assuming you're doing it correctly. But sadly many libraries do it incorrectly. For example, I once audited Zed Shaw's C-based non-blocking I/O and coroutine library for a startup. IIRC, he had devised an incredibly complex hack to fall back to poll(2) instead of epoll(2) because in his tests epoll(2) didn't scale when sockets were heavily used; he only saw epoll scale for HTTP sockets where clients were long-polling. But the problem was that every time he switched coroutine contexts, he was deleting and re-adding descriptors, which completely negated all the benefits of epoll. Why did he do this? Presumably because to use epoll properly you need to persist the event polling. But if application code closes a descriptor, the user-space event management state will fall out of sync with the kernel-space state, which is bad news. He tried to design his coroutine and yielding API to be as transparent as possible. But you can't do that. Performant use of epoll requires sacrificing some abstraction, similar to the hassles IOCP causes with buffer management.
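The correct pattern being contrasted above - register once, poll many times - looks like this in a minimal Linux-only Python sketch:

```python
import select
import socket

a, b = socket.socketpair()
ep = select.epoll()
ep.register(b.fileno(), select.EPOLLIN)   # register ONCE, keep it persistent

total = 0
for _ in range(3):                        # three wakeups, zero re-registrations
    a.sendall(b"x")
    for fd, _ in ep.poll(1):
        total += len(b.recv(4096))

print(total)  # 3
ep.close(); a.close(); b.close()
```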

The benefit of IOCP isn't performance--whether it's more performant or not is context-dependent. The biggest benefit of IOCP, IMO, is that it's the de facto standard API to use. You don't need to choose between libevent, libev, Zed Shaw's library, or the other thousands of similar libraries. On Windows everybody just uses IOCP and they can expect very good results.

The myth that IOCP is intrinsically better, or intrinsically faster, is a result of what I like to call kernel fetishism--that things are always faster and better when run in kernel space. But that's just a myth. IOCP nails down a very popular and very robust design pattern for highly concurrent network servers, but it's not necessarily the best pattern. And sticking to IOCP imposes many unseen costs. For example, it makes it more difficult to mix and match libraries each doing I/O because when you're having to juggle callbacks from many different libraries your code quickly becomes obtuse and brittle. It also demands a highly threaded environment with lots of shared state, but that likewise leads to very complex and bug prone code.

It's only the right way to do it from an application developer's perspective if the behaviour of the official OS-approved interface matches what their application needs, which is unlikely to be the case here. For example, according to https://support.microsoft.com/en-us/kb/156932 there's a fixed-sized pool of threads used to fetch data into cache to fill async I/O requests. If you try to have too many async I/O requests for the number of threads (which can be as small as 3) the excess are automatically converted into synchronous, blocking requests. This renders the API useless for something like an event-driven web server, because any file I/O call could block the main thread until it completes.

(Also, curiously when the data's in the cache that page shows a performance penalty for async reads that complete synchronously from the cache compared to sync reads. Wonder why.)

I comment on the advice given on that page here: https://news.ycombinator.com/item?id=11867375

I've never had a single issue with "async things suddenly becoming synchronous which blocks the main thread" -- if you architect things properly that just never happens, blocking operations are off the hot path, and when you absolutely must block, IOCP's concurrency self-awareness kicks in and another thread is scheduled to run on the core, ensuring that each core has one (and only one) runnable thread.

> (Also, curiously when the data's in the cache that page shows a performance penalty for async reads that complete synchronously from the cache compared to sync reads. Wonder why.)

Because a synchronous operation is always faster than an overlapped operation if it can be completed synchronously.

Lots of stuff happens behind the scenes when an overlapped operation occurs.

Running out of places to upvote you.

Application developer is not the only perspective in the world however.

Your rephrasing still doesn't matter, Solaris event ports are not signal-based.

My understanding is that Solaris event ports were intended to offer equivalent functionality to Windows' I/O completion ports, so this should not be surprising.

Solaris also has its own native async I/O API in addition to supporting POSIX async.

Here's a nice graphical comparison of syscalls between Linux and Windows


Are you saying the Windows flow looks like spaghetti only because the tested software (Apache) wasn't designed for Windows?

Heh, 10 years old, original link doesn't work, image is tiny. And it sounds like they were comparing Linux and Apache to IIS and Windows.

It's hard to evaluate this in any way more than "yeah that's a cute spaghetti diagram". If I wanted to drag Linux through the mud visually I'd depict how much time every socket I/O op spends in vfs/fsync stuff. (i.e. you can depict anything to make your point)

That might be a cool diagram to show! I honestly know nothing about kernel programming, just happened to come across that a long time ago and bookmarked it.

The second image is of IIS running on Windows, not Apache. Different software on different OSes. Regardless, it doesn't seem like parent is making the argument that the NT kernel isn't complicated - just that it is superior.

It is more complicated because -- surprise -- I/O usage patterns do not fit in one neat little box, and you have to provide special handling of special cases that do occur frequently in the real world.

The joke used to be: VMS++ --> WNT

Yep, that'd be the spiritual predecessor to Bing Is Not Google.

First there was HAL, superior to IBM.

Can I ask what else is on your reading list? I ended up buying the VMS Internals book.

Also do you have an opinion on BeOS?

I was fascinated by BeOS in the late 90s when I had a lot of enthusiasm (and little clue). All their threading claims just sounded so cool. I was also really into FreeBSD from around 2.2.5 so I got to see how all the SMPng stuff (and kqueue!) evolved, as well as all the different threading models people in UNIX land were trying (1:1, 1:m, m:n).

NT solves it properly. Efficient multithreading support and I/O (especially asynchronous I/O) are just so intrinsically related. Trying to bend UNIX processes and IPC and signals and synchronous I/O into an efficient threading implementation is just trying to fit a square peg in a round hole in my opinion.

As for reading list... I've bought so many old books lately. Here's my "makes the short list" bookshelf: http://imgur.com/DfTUVQx

And the more ridiculous one that I use as a cover page on my resume: http://imgur.com/0u9OZcN

What things in particular are you interested in?

Lol, I've been saying something like this for some time.

Thanks for the answer. I guess what I'm interested in is somewhat obscure/historical operating systems, and also HW that is in some way superior to currently popular solutions. The more comparative the better.

Also your reading list has quite a few Oracle SQL entries so I'm guessing it's your preferred DB of choice. What features are you using that aren't available in MySQL or Postgres?

As far as commercial vendors go I really preferred SQL Server from about version 2000 onwards. I got into Oracle again recently for a consulting project with a finance client (who basically have unlimited Oracle licenses) and really quite enjoyed it since the Oracle 7/8 days.

You can do some phenomenally sophisticated things... I extensively leveraged things like partitioning, parallel execution (dbms_parallel_execute!), lots of PL/SQL using the pipelined table cursor stuff, data mining stuff (dbms_frequent_itemset!), index-organized tables, and my god, bitmap indexes were a godsend, direct insert tricks for bulk data loading, external tables were fantastic (you can wrap a .csv in an external table and interact with it in parallel just like any other table -- great for ingesting large amounts of janky .csv data from other parts of the business).

The parallel execution and robust partitioning options were probably the most critical pieces that have no particularly good counterpart in open source land.

They also have a text mining engine which lets you dump almost any filetype in the database and do sophisticated full text queries, including generating html snippets with highlighted matches. And they have a native column type for storing geometry, so you can do geometric queries and aggregates, for which they then also have a web viewer engine with tiled maps support, so you basically have the core of a GIS engine. There's a ton of cool stuff in there, I haven't even looked at half of it.

Oh man how did I forget the geo stuff! I used that extensively! The custom indexes that could do lat/lon proximity joins! And yeah the text mining stuff is great as well. (Ugh, I've forgotten so much already -- such a shame.)

I always thought the external table functionality was a terrible idea, but if the use case is just for import that's more reasonable. Don't you like SQL*Loader or something ;)

It's great when you need to whip something up quickly. My workflow was to create foo_ext, poke around at the data and get some sensible column defaults, then `create table foo as select * from foo_ext`. Worked really well, especially for once-off or infrequent stuff.

For batch-oriented stuff where you're getting pretty consistent data at a regular interval I'd go with SQL*Loader.

Curious if you've ventured into (good/modern? maybe QNX?) microkernels at some point and have some thoughts on them by chance?

I have not I'm afraid. Hard enough keeping how all the parts of NT work in my head at the same time ;-)

> But the NT kernel is much more sophisticated and powerful than Linux

That does not follow from the example. All it shows is that Microsoft prefers to put a lot of functionality in one interface, while Linux probably prefers low-level functions to be as small as possible, and probably offers things like filtering on a higher level (in glibc, for example).

Neither explanation has anything to do with sophistication. I personally believe that small interfaces are a better design.

Actually it does, as it mentioned that the extra parameters are for things like async callbacks and partial results.

The I/O model that Windows supports is a strict superset of the Unix I/O model. Windows supports true async I/O, allowing process to start I/O operations and wait on an object like an I/O completion port for them to complete. Multiple threads can share a completion port, allowing for useful allocation of thread pools instead of thread-per-request.

In Unix all I/O is synchronous; asynchronicity must be faked by setting O_NONBLOCK and buzzing in a select loop, interleaving bits of I/O with other processing. It adds complexity to code to simulate what Windows gives you for real, for free. And sometimes it breaks down; if I/O is hung on a device the kernel considers "fast" like a disk, that process is hosed until the operation completes or errors out.
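The pattern being described (O_NONBLOCK plus a select loop) in a minimal, portable Python sketch:

```python
import select
import socket

a, b = socket.socketpair()
b.setblocking(False)                          # O_NONBLOCK

a.sendall(b"ready")
received = b""
while not received:                           # the "select loop"
    readable, _, _ = select.select([b], [], [], 1.0)
    for sock in readable:                     # readiness, then non-blocking read;
        received = sock.recv(4096)            # other work can interleave here

print(received)  # b'ready'
a.close(); b.close()
```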

I wrote the Windows bits for libuv (node.js's async I/O library), so I have extensive experience with asynchronous I/O on Windows, and my experience doesn't back up parent's statement.

Yes, it's true that many APIs would theoretically allow kernel-level asynchronous I/O, but in practice the story is not so rosy.

* Asynchronous disk I/O is in practice often not actually asynchronous. Some of these cases are documented (https://support.microsoft.com/en-us/kb/156932), but asynchronous I/O also actually blocks in cases that are not listed in that article (unless the disk cache is disabled). This is the reason that node.js always uses threads for file I/O.

* For sockets, the downside of the 'completion' model that Windows uses is that the user must pre-allocate a buffer for every socket that it wants to receive data on. Open 10k sockets and allocate a 64k receive buffer for each of them - that adds up quickly. The unix epoll/kqueue/select model is much more memory-efficient.

* Many APIs may support asynchronous operation, but there are blatant omissions too. Try opening a file without blocking, or reading keyboard input.

* Windows has many different notification mechanisms, but none of them are both scalable and work for all types of events. You can use completion ports for files and sockets (the only scalable mechanism), but you need to use events for other stuff (waiting for a process to exit), and a completely different API to retrieve GUI events. That said, unix uses signals in some cases which are also near impossible to get right.

* Windows is overly modal. You can't use asynchronous operations on files that are open in synchronous mode or vice versa. That mode is fixed when the file/pipe/socket is created and can't be changed after the fact. So good luck if a parent process passes you a synchronous pipe for stdout - you must special case for all possible combinations.

* Not to mention that there aren't simple 'read' and 'write' operations that work on different types of I/O streams. Be ready to ReadFileEx(), Recv(), ReadConsoleInput() and whatnot.

IMO the Windows designers got the general idea to support asynchronous I/O right, but they completely messed up all the details.

You're completely missing how the NT I/O subsystem works, and how to use it optimally.

> * Asynchronous disk I/O is in practice often not actually asynchronous. Some of these cases are documented (https://support.microsoft.com/en-us/kb/156932), but asynchronous I/O also actually blocks in cases that are not listed in that article (unless the disk cache is disabled). This is the reason that node.js always uses threads for file I/O.

The key to NT asynchronous I/O is understanding that the cache manager, memory manager and file system drivers all work in harmony to allow a ReadFile() request to either immediately return the data if it is available in the cache, and if not, indicate to the caller that an overlapped operation has been started.

Things like extending a file, opening a file, that's not typically hot-path stuff. If you're doing a network oriented socket server, you would submit such a blocking operation to a separate thread pool (I set up separate thread pools for wait events, separate to the normal I/O completion thread pools), and then that I/O thread moves on to the next completion packet in its queue.

> * For sockets, the downside of the 'completion' model that windows is that the user must pre-allocate a buffer for every socket that it wants to receive data on. Open 10k sockets and allocate a 64k receive buffer for all of them - that adds up quickly. The unix epoll/kqueue/select model is much more memory-efficient.

Well that's just flat out wrong. You can set your socket buffer size as large or as small as you want. For PyParallel I don't even use an outgoing send buffer.
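For what it's worth, the kernel-side buffer is also tunable per socket via SO_RCVBUF/SO_SNDBUF on any platform; a hedged sketch (the 4096 figure is arbitrary, and this is the kernel buffer, not the application's posted receive buffer):

```python
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 4096)   # arbitrary small size
size = s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
print(size)   # Linux reports a doubled bookkeeping value, e.g. 8192
s.close()
```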

Also, the new Registered I/O (RIO) model in Windows 8+ is a much better way to handle socket buffers without the constant memcpy'ing between kernel and user space.

> IMO the Windows designers got the general idea to support asynchronous I/O right, but they completely messed up all the details.

I disagree. Write a kernel driver on Linux and NT and you'll see how much better the NT I/O subsystem is.

> The key to NT asynchronous I/O is understanding that the cache manager, memory manager and file system drivers all work in harmony to allow a ReadFile() request to either immediately return the data if it is available in the cache, and if not, indicate to the caller that an overlapped operation has been started.

The Microsoft article cited above (https://support.microsoft.com/en-us/kb/156932) directly contradicts you:

> Be careful when coding for asynchronous I/O because the system reserves the right to make an operation synchronous if it needs to. Therefore, it is best if you write the program to correctly handle an I/O operation that may be completed either synchronously or asynchronously.

Microsoft is directly saying that it reserves the right to violate the guarantee you are counting on at any time, and it documents several known cases of this. You can try to guess when this will happen and put those I/O operations on a different thread pool, but you're just playing whack-a-mole. And you're violating Microsoft's own recommendations.

That's not a particularly good article with regards to high performance techniques.

You wouldn't be using compression or encryption for a file that you wanted to be able to submit asynchronous file I/O writes to in a highly concurrent network server. Those have to be synchronous operations. You'd do everything you can to use TransmitFile() on the hot path.

If you need to sequentially write data, wanted to employ encryption or compression, and reduce the likelihood of your hot-path code blocking, you'd memory map file-sector-aligned chunks at a time, typically in a windowed fashion, such that when you consume the next one you submit threadpool work to prepare the one after that (which would extend the file if necessary, create the file mapping, map it as a view, and then do an interlocked push to the lookaside list that the hot-path thread will use).

I use that technique, and also submit prefaults in a separate threadpool for the page beyond the next page as I consume the records I'm writing. Before you can write to a page, it needs to be faulted in, and that's a synchronous operation, so you architect it to happen ahead of time, before you need it, such that your hot-path code doesn't get blocked when it writes to said page.

That works incredibly well, especially when you combine it with transparent NTFS compression, because the file system driver and the memory manager are just so well integrated.
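A rough, portable sketch of that windowed-mapping idea, using Python's mmap on an ordinary file (the window size and the single background "prepare" worker are illustrative assumptions, not the real implementation):

```python
import mmap
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

CHUNK = mmap.ALLOCATIONGRANULARITY  # windows must be mapped at aligned offsets

prep_pool = ThreadPoolExecutor(max_workers=1)

def prepare_window(path, index):
    """Extend the file and map the next window -- the blocking steps --
    off the hot path."""
    size = (index + 1) * CHUNK
    with open(path, "r+b") as f:
        if os.path.getsize(path) < size:
            f.truncate(size)  # extend ahead of time (a blocking operation)
        return mmap.mmap(f.fileno(), CHUNK, offset=index * CHUNK)

fd, path = tempfile.mkstemp()
os.close(fd)

# Window 0 is prepared synchronously once at startup...
window = prepare_window(path, 0)
# ...and while the hot path writes into it, window 1 is prepared in the background.
next_window = prep_pool.submit(prepare_window, path, 1)

window[:5] = b"hello"          # hot-path writes never block on file extension
window.close()
window = next_window.result()  # swap in the pre-prepared window
window[:5] = b"world"
window.close()
os.unlink(path)
```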

If you wanted to do scatter/gather random I/O asynchronously, you'd pre-size the file ahead of time, then simply dispatch asynchronous writes for everything, possibly leveraging SetFileIoOverlappedRange such that the kernel locks all the necessary sections into memory ahead of time.
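The pre-sizing part of that is portable enough to sketch (Python, with pwrite/pread standing in for the overlapped WriteFile calls; SetFileIoOverlappedRange itself has no portable analogue here):

```python
import os
import tempfile

# Pre-size the file up front so no write needs to extend it (extension is the
# part that would otherwise block), then scatter writes at arbitrary offsets
# with positioned I/O -- no shared file pointer, so calls can be dispatched
# concurrently.
fd, path = tempfile.mkstemp()
os.truncate(fd, 1 << 20)           # fix the size ahead of time: 1 MiB

for offset in (0, 65536, 524288):  # scattered offsets, no seeking
    os.pwrite(fd, b"record", offset)

assert os.pread(fd, 6, 524288) == b"record"
os.close(fd)
os.unlink(path)
```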

And finally, what's great about I/O completion ports in general is they are self-aware of their concurrency. The rule is always "never block". But sometimes, blocking is inevitable. Windows can detect when a thread that was servicing an I/O completion port has blocked and will automatically mark another thread as runnable so the overall concurrency of the server isn't impacted (or rather, other network clients aren't impacted by a thread's temporary blocking). The only service that's affected is that of the client that triggered whatever blocking I/O call there was -- it would be indistinguishable (from a latency perspective) to other clients, because they're happily being picked up by the remaining threads in the thread pool.

I describe that in detail here: https://speakerdeck.com/trent/pyparallel-how-we-removed-the-...

> > Be careful when coding for asynchronous I/O because the system reserves the right to make an operation synchronous if it needs to. Therefore, it is best if you write the program to correctly handle an I/O operation that may be completed either synchronously or asynchronously.

That's not the best wording they've used given the article is also talking about blocking. If you've followed my guidelines above, a synchronous return is actually advantageous for file I/O because it means your request was served directly from the cache, and no overlapped I/O operation had to be posted.

And you know all of the operations that will block (and they all make sense when you understand what the kernel is doing behind the scenes), so you just don't do them on the hot path. It's pretty straightforward.

> Write a kernel driver on Linux and NT and you'll see how much better the NT I/O subsystem is.

I wrote Windows drivers and file systems for about 10 years, and Unix drivers and file systems also for about 10 years.

I'd rather practice subsistence agriculture for the rest of my life than deal with Windows drivers again.

Yeah it's not a simple affair at all. It's a lot easier these days though, and the static verifier stuff is very good.

I disagree. Write a kernel driver on Linux and NT and you'll see how much better the NT I/O subsystem is.

Can programming against the userspace interface to the I/O subsystem really be compared to programming against the kernel driver interface to the I/O subsystem? In Linux, kernel drivers have access to structures, services, and layers that userspace doesn't. And can these be compared between a monolithic and a micro-kernel approach, beyond what has been debated ad nauseam for micro/monolithic kernels in general (not just for I/O)?

I didn't make my point particularly well there, to be honest. Writing an NT driver is incredibly more complicated than an equivalent Linux one, because your device needs to be able to handle different types of memory buffers, support all the IRP layering quirks, etc.

I just meant that writing an NT kernel driver will really give you an appreciation of what's going on behind the scenes in order to facilitate awesome userspace things like overlapped I/O, threadpool completion routines, etc.

I have never worked with systems-level Windows programming, so I don't know the answer to this...but, how is what you're describing better than epoll or aio in Linux or kqueues on the BSDs?

I'm guessing you're coming from the opposite position of ignorance to mine (i.e. you've worked on Windows, but not Linux or other modern UNIX), though, since "setting O_NONBLOCK and buzzing in a select loop, interleaving bits of I/O with other processing" doesn't describe anything developed in many, many years. 15 years ago select was already considered ancient.

I think IO Completion Ports [1] in Windows are pretty similar to kqueue [2] in FreeBSD and Event Ports [3] in Illumos & Solaris. All of them are unified methods for getting change notifications on IO events and file system changes. Event Ports and kqueue also handle unix signals and timers.

Windows will also take care of managing a thread pool to handle the event completion callbacks by means of BindIoCompletionCallback [4]. I don't think kqueue or Event Ports have a similar facility.

[1]: https://msdn.microsoft.com/en-us/library/windows/desktop/aa3... [2]: https://www.freebsd.org/cgi/man.cgi?query=kqueue&sektion=2 [3]: https://illumos.org/man/3C/port_create [4]: https://msdn.microsoft.com/en-us/library/windows/desktop/aa3...

BindIoCompletionCallback is very old, the new threadpool APIs should be used, e.g.: https://github.com/pyparallel/pyparallel/blob/branches/3.3-p...

Regarding the differences between IOCP and epoll/kqueue, it all comes down to completion-oriented versus readiness-oriented.
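A toy illustration of the two orientations (Python on a Unix socketpair; the "completion" half is only an emulation of the IOCP shape, not how Windows actually implements it):

```python
import queue
import select
import socket
import threading

# --- Readiness-oriented (epoll/kqueue style): you are told the fd is
# readable, then you perform the read yourself.
a, b = socket.socketpair()
ep = select.epoll()
ep.register(b.fileno(), select.EPOLLIN)

a.send(b"ping")
events = ep.poll(timeout=1)            # "fd b is ready"
assert events and events[0][0] == b.fileno()
data_readiness = b.recv(64)            # the read itself is still your job

# --- Completion-oriented (IOCP style, emulated): you post a read up front
# and are later handed the finished result on a completion queue.
completions = queue.Queue()

def post_read(sock, nbytes):
    # The operation is posted before the data exists; a worker completes it.
    threading.Thread(target=lambda: completions.put(sock.recv(nbytes))).start()

post_read(b, 64)                       # operation (and buffer) posted in advance
a.send(b"pong")
data_completion = completions.get(timeout=1)  # the completed read is delivered
```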


The documentation for the new thread pool APIs in Windows say that it is more performant than the old ones. Do you have any experience transitioning from the old thread pool to the new one, and if so did you notice performance differences?

Last decade I wrote a server based on the old API and BindIoCompletionCallback. I can't comment on the performance differences, but the newer threadpool API is much better to work with.

IMO one of the biggest problems with the old API is that there is only one pool. This can cause crazy deadlocks due to dependencies between work items executing in the pool. The new API allows you to create isolated pools.

The old API did not give you the ability to "join" on its work items before shutting down. You could roll your own solution if you knew what you were doing, but more naive developers would get burned.

You also get much finer grained control over resources.

I'm really lucky to have decided to dig into this stuff (for PyParallel) around ~2011, by which stage Windows Vista (where all the new APIs had been introduced) had been out for like, 5-6 years, so no, I never had to deal with the old APIs, thankfully.

I really sympathize with Len Holgate though... he's had to nurse Server Framework through all of that transition, which must have been painful: http://www.serverframework.com/asynchronousevents/2016/06/an...

To quote myself:

> The “Why Windows?” (or “Why not Linux?”) question is one I get asked the most, but it’s also the one I find hardest to answer succinctly without eventually delving into really low-level kernel implementation details.

> You could port PyParallel to Linux or OS X -- there are two parts to the work I’ve done: a) the changes to the CPython interpreter to facilitate simultaneous multithreading (platform agnostic), and b) the pairing of those changes with Windows kernel primitives that provide completion-oriented thread-agnostic high performance I/O. That part is obviously very tied to Windows currently.

> So if you were to port it to POSIX, you’d need to implement all the scaffolding Windows gives you at the kernel level in user space. (OS X Grand Central Dispatch was definitely a step in the right direction.) So you’d have to manage your threadpools yourself, and each thread would have to have its own epoll/kqueue event loop. The problem with adding a file descriptor to a per-thread event loop’s epoll/kqueue set is that it’s just not optimal if you want to continually ensure you’re saturating your hardware (either CPU cores or I/O). You need to be able to disassociate the work from the worker. The work is the invocation of the data_received() callback, the worker is whatever thread is available at the time the data is received. As soon as you’ve bound a file descriptor to a per-thread set, you prevent thread migration.

> Then there’s the whole blocking file I/O issue on UNIX. As soon as you issue a blocking file I/O call on one of those threads, you have one thread less doing useful work, which means you’re increasing the time before any other file descriptors associated with that thread’s multiplex set can be served, which adversely affects latency. And if you’re using the threads == ncpu pattern, you’re going to have idle CPU cycles because, say, only 6 out of your 8 threads are in a runnable state. So, what’s the answer? Create 16 threads? 32? The problem with that is you’re going to end up over-scheduling threads to available cores, which results in context switching, which is less optimal than having one (and only one) runnable thread per core. I spend some time discussing that in detail here: https://speakerdeck.com/trent/parallelism-and-concurrency-wi.... (The best example of how that manifests as an issue in real life is `make –jN world` -- where N is some magic number derived from experimentation, usually around ncpu X 2. Too low, you’ll have idle CPUs at some point, too high and the CPU is spending time doing work that isn’t directly useful. There’s no way to say `make –j[just-do-whatever-you-need-to-do-to-either-saturate-my-I/O-channels-or-CPU-cores-or-both]`.)

> Alternatively, you’d have to rely on AIO on POSIX for all of your file I/O. I mean, that’s basically how Oracle does it on UNIX – shared memory, lots of forked processes, and “AIO” direct-write threads (bypassing the filesystem cache – the complexities of which have thwarted previous attempts on Linux to implement non-blocking file I/O). But we’re talking about a highly concurrent network server here… so you’d have to implement userspace glue to synchronize the dispatching of asynchronous file I/O and the per-thread non-blocking socket epoll/kqueue event loops… just… ugh. Sure, it’s all possible, but imagine the complexity and portability issues, and how much testing infrastructure you’d need to have. It makes sense for Oracle, but it’s not feasible for a single open source project. The biggest issue in my mind is that the whole thing just feels like forcing a square peg through a round hole… the UNIX readiness file descriptor I/O model just isn’t well suited to this sort of problem if you want to optimally exploit your underlying hardware.

> Now, with Windows, it’s a completely different situation. The whole kernel is architected around the notion of I/O completion and waitable events, not “file descriptor readiness”. This seems subtle but it pervades every single aspect of the system. The cache manager is tightly linked to the memory management and I/O manager – once you factor in asynchronous I/O this becomes incredibly important because of the way you need to handle memory locking for the duration of the I/O request and the conditions for synchronously serving data from the cache manager versus reading it from disk. The waitable events aspect is important too – there’s not really an analog on UNIX. Then there’s the notion of APCs instead of signals which, again, are fundamentally different paradigms. The deeper you dig, the more you appreciate the complexity of what Windows is doing under the hood.

> What was fantastic about Vista+ is that they tied all of these excellent primitives together via the new threadpool APIs, such that you don’t need to worry about creating your own threads at any point. You just submit things to the threadpool – waitable events, I/O or timers – and provide a C callback that you want to be called when the thing has completed, and Windows takes care of everything else. I don’t need to continually check epoll/kqueue sets for file descriptor readiness, I don’t need to have signal handlers to intercept AIO or timers, I don’t need to offload I/O to specific I/O threads… it’s all taken care of, and done in such a way that will efficiently use your underlying hardware (cores and I/O bandwidth), thanks to the thread-agnosticism of Windows I/O model (which separates the work from the worker).

> Is there something simple that could be added to Linux to get a quick win? Or would it require architecting the entire kernel? Is there an element of convergent evolution, where the right solution to this problem is the NT/VMS architecture, or is there some other way of solving it? I’m too far down the Windows path now to answer that without bias. The next 10 years are going to be interesting, though.


"The problem with adding a file descriptor to a per-thread event loop’s epoll/kqueue set is that it’s just not optimal if you want to continually ensure you’re saturating your hardware (either CPU cores or I/O)"

Both epoll and kqueue permit multiple threads to poll the same event set. Normally you do this in tandem with edge-triggered readiness (EPOLLET on Linux, EV_CLEAR on BSD) so that only one thread will dequeue an event.
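A minimal single-threaded demonstration of those edge-triggered semantics (Linux, Python):

```python
import select
import socket

a, b = socket.socketpair()
ep = select.epoll()
ep.register(b.fileno(), select.EPOLLIN | select.EPOLLET)  # edge-triggered

a.send(b"x")
first = ep.poll(timeout=1)   # the empty->readable edge is reported once...
second = ep.poll(timeout=0)  # ...and NOT re-reported while the data sits unread
# With the default level-triggered mode, `second` would report the fd again.
# This "fires once per edge" behavior is what lets one thread of several
# pollers own a wakeup without the others seeing it too.
```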

How do you think IOCP is implemented in Windows? There's a thread pool in the kernel which _literally_ polls a shared event queue. It's just hidden so you can pretend it's magical. But conceptually it works almost identically to how you would do it in Unix.

The benefit of IOCP is that it's a native API. Its warts and shortcomings notwithstanding, developers never even need to think about how it's actually implemented. Whereas with epoll and kqueue you either have to roll your own framework, or select from various third-party options. Seeing how the sausage is made can turn some people off. But just because you don't see the gory details doesn't mean it's implemented using magical fairy dust.

There's much to recommend Windows, and many things the NT kernel conceptually gets right. But IOCP vs polling? The only real difference architecturally is how much of the stack sits in user-space vs kernel-space, and how much of the stack is delivered by Microsoft (all of it in the case of IOCP) vs other sources (in Linux, glibc does AIO, while all the event loop and callback code is provided by various libraries or written yourself).

Putting more of the stack in kernel-space doesn't magically make it easier to perform optimizations. That's marketing speak and kernel fetishism. You have to first show why those optimizations can't be achieved elsewhere, like in the I/O or process scheduler. Various Linux components traditionally are more performant (e.g. process scheduling) than in Windows, so many of the optimizations wrt IOCP are arguably clawing back performance lost elsewhere in the system.


According to your website pretty much every other technology runs better on Linux than it does on Windows, and of course pyparallel runs better than everything you tested.

How can I run these tests myself? I specifically want to test it against golang.

I don't see where you take advantage of goroutines, and I don't see a real world use case unless it's operating on concurrent connections.



Patches welcome?

I believe Linux since 2.5 has 'proper' asynchronous system calls, io_getevents(2) and co.

Further information: https://www.fsl.cs.sunysb.edu/~vass/linux-aio.txt

Based on what you describe here, I'd say your comparative understanding is about two decades out of date. Async I/O has been a capability in various Unix and Unix-alike kernels for that long.

As in signal based AIO? Have you ever tried to use it in a high performance network server, where you want to have reads also satisfied from the cache if possible?

Because that is like pulling teeth on UNIX. See: https://groups.google.com/forum/#!topic/framework-benchmarks...

I think the problem here is not a syscall taking 11 parameters, it's a syscall that merely lists what is inside a directory taking 11 parameters. ataylor_284 explained the reasons (how convincingly is arguable), but at first sight that surely smells of bloat.

I'd also object to the NT kernel being called more "powerful". Sure, unixy kernels and NT have their differences, but I don't think either one is superior.

11 parameters may seem bloated, but in some cases Unix syscalls weren't designed with enough parameters, which caused a bunch of pain, necessitating things like

dup -> dup2 -> dup3, pipe -> pipe2, rename -> renameat -> renameat2

Best practice nowadays in linux is to allow overloading syscalls via a flags parameter.

see https://lwn.net/Articles/585415/

So modern linux syscalls may be bloated too.
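That flags-parameter pattern is easy to see from userspace; e.g. on Linux, pipe2() is just pipe() plus a flags argument, so new behavior (like atomic close-on-exec) rides on a flag instead of yet another single-purpose syscall (Python sketch):

```python
import fcntl
import os

# pipe() -> pipe2(): the same operation, extended via a flags parameter
# rather than a new syscall per behavior.
r, w = os.pipe2(os.O_CLOEXEC)

# The flag took effect atomically at creation time: FD_CLOEXEC is already
# set, with no racy fcntl() window between pipe() and fcntl(F_SETFD).
assert fcntl.fcntl(r, fcntl.F_GETFD) & fcntl.FD_CLOEXEC
os.close(r)
os.close(w)
```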

I remember the struct I had to populate to start a new process in 1997 or 1998...

It may be more "sophisticated" (sounds like a more positive synonym of "complex" to me), but I certainly don't think it's more powerful.

Since when is kernel complexity a measure of quality...? :)


Usage of the adjective "sophisticated" always precedes an outpouring of either ignorance or straight BS.

Also UNIX devs seem to forget how cumbersome the X11, Xlib and Motif APIs are.

This was maybe the case at Linux 2.0, but is not the case now.

Also, Windows development is infinitely more painful than Unix/Linux.

"... so its system calls are going to be necessarily more complicated."

Are you implying that an increase in "power" can never be achieved through increasing simplicity?

That's the thing. Just by glancing at the API docs, Windows looks more complicated but where the rubber meets the road in terms of real high-performance application development, Windows is way simpler. In Windows you can do in one syscall what would take several in Linux. You can schedule I/O calls across multiple threads in a completely thread-safe manner without having to manage the synchronization yourself -- and since threads go to sleep entirely while waiting for I/O operations to complete, there is no chewing up CPU cycles in a select/epoll loop. So yes, writing "hello world" or simple filters is simpler in Unix -- but writing multithreaded server applications that maximize throughput is simpler in Windows.

Unix is bristling with features designed to "allow you to save me some time". It was designed to make it easy to write quick, "one-off" programs in C. VMS -- the predecessor to Windows NT -- was designed to run long-lasting, high-performance, high-reliability business applications for real users with money on the line (i.e., not just hackers) and Windows NT inherits this legacy.

"It was designed to make it easy to write quick, "one-off" programs in C."

OK, so it just so happens this is what I love to do. I like writing small programs and continually trying to improve them.

So I guess I should be a UNIX user?

Is NT not good for this too?

BTW, I do like VMS. But despite the NT kernel, using NT feels nothing like using VMS.

I love writing NT-style C-level (no CRT, just pure C and whatever the CNF/Cutler Normal Form style is).

>the NT kernel is much more sophisticated and powerful than Linux


It's not sophisticated enough or powerful enough to be the most used kernel on supercomputers (and in the world). Windows pretty much only dominates the desktop market. Servers, supercomputers, mainframes, etc. mostly use Linux.

A few years ago there was even a bug in Windows that caused degradation in network performance during multimedia playback, directly connected with mechanisms employed by the Multimedia Class Scheduler Service (MMCSS), which is used in a lot of audio setups. If they can't even get audio setups right, how can people consider anything Windows releases "sophisticated"?

It's made to do anything you throw at it I guess, it's definitely complicated, but powerful and sophisticated aren't words I would use to describe NT.

If you're arguing in favor of linux, you probably shouldn't use any arguments that deal with getting audio setups right.


I would go so far as to say that a large part of why audio is such a CF under Linux is -- wait for it -- lack of real asynchronous I/O.

Audio is asynchronous by nature, and to do that right under Linux you need a "sound server" with all the additional overhead, jank, and shuffling of data between kernel and at least two different user spaces that implies. Audio under Linux was best with OSS, which was synchronous in nature and not suitable for professional applications. JACK mitigated that somewhat, but for an OS to do audio right you need a kernel-level async audio API like Core Audio or whatever Windows is doing these days.

Windows has a sound server too, you know. I believe Core Audio on Mac does too. A large part of why audio is such a CF under Linux is that PulseAudio is incredibly badly written and poorly maintained. My favourite was the half-finished micro-optimization that broke some of the resamplers because the author got bored and never modified them to handle the change, which somehow made it all the way into a release. I dread to think what they'd do with powerful tools like asynchronous I/O.

Audio on Linux works fine in my experience.

Hey everyone, we found him!

I don't want to take sides in this discussion but share an anecdote. Hey, maybe even someone knows a solution for this.

I have a PC connected via on-board HDMI to a Denon AVR solely for the purpose of getting the audio to the amplifier. Windows doesn't let me use that audio interface without extending or mirroring my desktop to that HDMI port. Since there is no display connected to the AVR I don't want to extend the desktop, and mirroring heavily decreases performance of the system.

On Debian Sid the computer by default allows me to use the HDMI audio without doing anything to my desktop display. It seems the system realizes that there is no display connected to the AVR but it's still a valid sink for audio.

If I remember correctly, NVIDIA did a decent job with their on-board HDMI driver audio-wise; what brand is yours? I'm 100% sure the functionality is dependent on the driver rather than the OS.

Can you use optical instead of HDMI?

TOSLINK doesn't support uncompressed surround audio, while HDMI does, and I do use it.

Well for various definitions of fine I guess.

It works fine as in "I can listen to audio on my laptop from multiple programs at once, with a centralized way to control audio volume on a per-application or per-sound-device basis." I literally cannot imagine any audio system doing better than that given the hardware I have to work with.

BeOS doesn't run on my hardware, I'm pretty sure, even though it was ported to x86 eventually.

Works very well for me too. Don't know why you got downvoted. It's like it's 1994 in here...

But this is exactly what made me switch, windows was preventing me from accessing my sound card directly in order to record a remote interview.

I use Linux regularly to record and edit audio, it's free , it works, and I dont have to worry my OS is active reducing the functionality of my equipment.

They got audio setups right. The reason the network degradation happened is that video and audio playback were given realtime priority so background processes couldn't cause pops, stutters, etc. At the time Vista was released, most home users didn't have a gigabit network, so the performance degradation would only happen for a small number of users, and most would prefer good audio and video performance to a slowdown in network performance for a small percentage of users. With today's massively multicore systems, it's even less of an issue, while Linux still has a problem with latency in applications like pro audio.

The reason the network degradation happened is that Microsoft couldn't figure out how to stop heavy network activity causing audio glitches on some systems even after giving audio realtime priority, so they hacked around it by adding a fixed 10,000 packets-per-second cap on network activity regardless of system speed or CPU usage (less if you had multiple network adapters). See https://blogs.technet.microsoft.com/markrussinovich/2007/08/... This was just as much of an issue on multicore systems because the cap was unaffected by the system speed and chosen based on a slow single-core machine.

I don't know why you're getting downvoted since the parent is basically stating random opinions about "power" and "sophistication" without anything to actually back it up.

11 param functions don't say "power" to me. They say "poorly thought out API design". Much can be said for most Windows APIs in general.

Overloaded system call entrypoints are a fact of life on all mainstream platforms. Consider for instance "ioctl".

I've heard that Plan 9 doesn't have ioctl. But I guess that doesn't count as mainstream.

Plan 9 replaces ioctl with special files that require writing magic incantations.

It's similar to the various knobs in Linux /proc which require reading and writing specially formatted data. ioctl is simpler in that you don't need to worry as much about formatting the data (the struct declarations take care of that for you), but a file-oriented interface is nicer in that it's a higher-level abstraction--for example, it maps better to different languages, similar to how ioctl requires C or C-like shims whereas /proc can be used from any language that understands open/read/write/close, including the shell.
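For instance, reading a /proc knob needs nothing beyond open/read/close (Linux-specific paths here, shown only to illustrate the file-oriented style, not Plan 9 itself):

```python
import os

# A /proc "knob" is consumed with plain open/read/close -- no C struct
# declarations or ioctl() request numbers needed, which is why this style
# maps cleanly onto shells and high-level languages.
fields = {}
with open("/proc/self/status") as f:
    for line in f:
        key, _, value = line.partition(":")
        fields[key] = value.strip()

# The kernel formats the data as text; the reader just parses it.
assert fields["Pid"] == str(os.getpid())
```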

A lot of Microsoft APIs and subsystems are similarly bloated. There are probably tons of factors at play, but I believe being closed-source and having to support many individual use cases is one fundamental reason. (See for example CreateProcess vs. fork...)

When it comes to system call interfaces, it's because Dave Cutler has forgotten more than many modern "kernel hackers" will ever know about how to design an OS.

I appreciate the name-dropping.

Dave Cutler's skills aside, Unix predates Windows by decades, and to anyone remotely familiar with kernel development it is clear that the sheer quantity and complexity of subsystems stem from the fact that nobody but Microsoft can actually see, modify, and redistribute Windows' source code.

Unless you can actually say "here's why Windows is qualitatively better" and point out specific tasks Windows does better, I'll just point you to the fact that the internet infrastructure and most of the servers on it, along with every Apple desktop and pretty much every mobile device, run Unix.

> I'll just point you to the fact that the internet infrastructure and most of the servers on it, along with every Apple desktop and pretty much every mobile device, run Unix.

I wonder how much of the internet infrastructure would run Unix if free (as in beer) clones like *BSD and GNU/Linux did not exist in first place.

How much internet infrastructure would actually run Unix if ISPs had to choose between AIX, HP-UX, Solaris, Digital UX, Tru64 and Windows licenses?

Free is always more valued than quality.

I'd guess that, even more important than the cost, up until recently, interacting with Windows in a headless mode was next to useless. Most sysadmins in my experience avoid GUIs like the plague when managing servers.

Are you too young to remember "the network is the computer"?

And of course this worked in reverse, when Netscape released their commercial webserver Microsoft rushed to give away IIS.

I started coding in the 80's and I remember how the UNIX market was afraid of Windows NT workstations, before they actually started losing market share to the free (beer) UNIX versions in form of BSD and GNU/Linux.

Why is it valid to conflate every kernel that runs a *nix OS together under "Unix"? Is there any meaningful overarching "Unixy" way in which, say, Xnu and Linux are similar?

NT's approach to async IO is at least somewhat empirically better as it does not require an extra context switch between receiving a "ready" event and actually performing the IO operation.

Completion notification requires committing a (potentially cache-hot) buffer for an operation that might not complete until some time in the future. With readiness notification you only need to have the buffer ready when you know you will use it.

Also, on a high-performance poll/epoll/kevent based system you only need to poll cold fds, while you can do speculative direct reads/writes to hot fds, so no need for extra syscalls in the fast case.
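That speculative-read pattern looks roughly like this (a Python sketch; the fall-back-to-epoll path is a simplification of what a real event loop would do):

```python
import select
import socket
import threading

a, b = socket.socketpair()
b.setblocking(False)

ep = select.epoll()
ep.register(b.fileno(), select.EPOLLIN)

def speculative_recv(sock, nbytes):
    """Hot fd: try a direct non-blocking read, skipping the poll syscall.
    Cold fd: on EAGAIN (BlockingIOError), wait for readiness, then read."""
    try:
        return sock.recv(nbytes)
    except BlockingIOError:
        ep.poll(timeout=1)
        return sock.recv(nbytes)

a.send(b"hot")
hot = speculative_recv(b, 64)    # data already queued: served with zero polling

threading.Timer(0.05, lambda: a.send(b"cold")).start()
cold = speculative_recv(b, 64)   # EAGAIN first, then served via the epoll set
```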

That doesn't mean that completion notification doesn't have its advantages, especially when coupled with NT builtin auto-sizing thread pool, but it is not strictly better.

You could do that if you really wanted on Windows; just set a 0 byte user space send and receive socket buffer.

You can do dual synchronous/asynchronous socket I/O in Windows. I use this very approach with PyParallel (and 0 byte send buffers): https://github.com/pyparallel/pyparallel/blob/branches/3.3-p...

Depending on current load, that will either immediately do an asynchronous operation, or attempt synchronous non-blocking ones up to a certain number, then fall back to asynchronous.

Described here: https://speakerdeck.com/trent/pyparallel-how-we-removed-the-...

You must mean "theoretically" because "empirically" implies that you're basing your statement on observations. Where are the numbers? :)

To level the playing field, and if we are to take your opinion seriously, it would be beneficial to know what books or articles or whitepapers you have read to inform yourself about the design of the NT kernel.

I'm sorry, who's "we"? You replied to a comment that presented fact. The fact is that Windows use is mostly limited to desktops. As far as you're concerned, I could be entirely illiterate and my argument would still hold because it's based on fact. If you want to claim Windows is superior, please don't point us to design documentation. Show us actual numbers and use cases where Windows outperforms Unix, or is used in critical infrastructure, etc.

(FYI, I've read the "internals" book for a relatively old version of Windows, along with plenty of books about attacking the Windows kernel through its huge attack surface that exists to accommodate various needs of various software vendors...)

>You replied to a comment that presented fact.

False. You ASSUMED facts based on correlation - "It's used everywhere" doesn't mean anything. "But people must have a reason to use them" STILL doesn't mean anything. "Well, so why don't they use Windows" STILL doesn't get you anywhere.

> I could be entirely illiterate and my argument would still hold because it's based on fact.

No. You can't enter into an argument when you know nothing about the subject. That is not how it works.

Why don't YOU present actual facts about the design? Show us you actually understand the internals of the kernel or have at least some rudimentary knowledge. Otherwise you'd just be wasting everyone's time.

There are more server deployments of Windows than there are of Linux.

This is because Linux's server workload is mainly the Web. But every departmental office needs an Exchange server...

Well, that's an (unbacked) opinion, and I don't share it. NT's design is not too bad (especially in contrast with consumer Windows), particularly given what was achieved in the first few releases (made easier by Cutler's serious experience in the area), but now it is far from brilliant, and it has its (huge) share of problems that every serious user of both Windows and Unix-based OSes knows.

At one point, way in the past, NT was far above Linux, and some Linux fanboys who did not even know what they were talking about nevertheless had strong opinions about the superiority of the kernel they used. Now we are ironically in the opposite situation: Linux has basically caught up on all the things that matter (preemptive kernel, stability, versatility, scalability) and then quickly overtook NT, yet some people like to talk endlessly about the supposed architectural superiority of NT, which has not provided anything concrete, long-term and widely used, in the real world, and which MS had to work around and/or redo with another approach (while keeping vestiges of all the old ones) to do all its modern stuff.

What kernel hackers know how to do is detect problems in architectures that look neat on paper. Brilliant ones are able to anticipate them. I don't even have to: history has shown where NT has been held back by its original design.

You, I like you. You get it.

>As of this article, lxss.sys has ~235 of the Linux syscalls implemented with varying level of support.

Is there a list of these syscalls somewhere? It would be cool to check it against the recent Linux API compatibility paper [0, 1].

[0]: http://oscar.cs.stonybrook.edu/api-compat-study/ [1]: http://www.oscar.cs.stonybrook.edu/papers/files/syspop16.pdf

You piqued my curiosity - just made one by extracting the syscall dispatch table from lxcore.sys and placing it alongside the Linux syscall list: https://goo.gl/QHGe1U

A lot of coverage there, but interesting to see which ones aren't yet implemented, at least in the recent build 14342.

(I used Filippo Valsorda's work from https://filippo.io/linux-syscall-table as the Linux syscall data source.)

A list of (at least partly) supported syscalls is here: https://msdn.microsoft.com/en-us/commandline/wsl/release_not...

No details on which ones are fully or partly supported, though.

I have installed the current fast-ring build and have tried installing several packages on Windows. Some do install and work (compilers, build environments, node, redis server), but packages that use more advanced socket options (such as Ethereum) or that configure a daemon (most databases) still end with an error. Compatibility is improving with every new build, and you can ditch/reset the whole Linux environment on Windows with a single command, which is nice for testing.

They've said the initial intent is for developers to use it, not for running servers / etc (which is why they only target Windows 10 client and not Windows Server OSs).

There is "running servers" in production and there is "running servers" in dev.

If I can't run the entire stack I use for dev under the subsystem then I will go the other route, which is to continue using VMs. I am excited about the initial release, and the prospect of being able to use Windows for all of the regular things I do, but it's clear that this isn't ready for primetime even as a dev tool.

Yup, when I'm developing I need to run pretty much most stuff. I guess I can install, say, the Windows-native version of postgres, but then we are back at square one.

Installing postgres on lxss still ends in a 'syscall not implemented' error.

Since NT syscalls follow the x64 calling convention, the kernel does not need to save off volatile registers since that was handled by the compiler emitting instructions before the syscall to save off any volatile registers that needed to be preserved.

Say what? The NT kernel doesn't restore caller-saved registers at syscall exit? This seems extraordinary, because unless it either restores them or zaps them then it will be in danger of leaking internal kernel values to userspace - and if it zaps them then it might as well save and restore them, so userspace won't need to.

I think that's referring to the prolog/epilog convention and "homing" of parameter registers, e.g.

    Frame struct
        ReturnAddress dq ?
        HomeRcx dq ?
        HomeRdx dq ?
        HomeR8 dq ?
        HomeR9 dq ?
    Frame ends


    mov Frame.HomeRcx[rsp], rcx
    mov Frame.HomeRdx[rsp], rdx
    mov Frame.HomeR8[rsp], r8
    mov Frame.HomeR9[rsp], r9

    alloc_stack 64

    ; *do stuff*


    add rsp, 64

    NESTED_END Foo, _TEXT$00

I can't think of much that would benefit from this except for, perhaps, headless command-line-type applications. The one that comes to mind is rsync. Being able to compile the latest version/protocol of rsync on a Linux machine and then run the same binary on a Windows host would be nice, but the fun seems to end there; plus, with Cygwin this is largely a no-brainer without M$'s help.

What about applications that hook into X Windows or do things like opening the framebuffer device? I've got a messaging application that can be compiled for both Windows and Linux, and depending on the OS I compile a different transport layer. Under Linux, heavy use of epoll is made, which is very different from how NT handles async I/O - especially with sockets. So my application's "transport driver" is either compiling an NT code base using Winsock & OVERLAPPED I/O or a Linux code base using epoll and pthreads.

Overall it seems like a nice-to-have, but I'm struggling to extract any real benefit.

Can anyone offer up some real good use cases I may be overlooking?

There are both free and commercial X servers for Windows, and you can get a linux app running under WSL to work with one of those X servers very easily. I played with it a little bit and it worked fine.

With this feature, if you're a Linux developer, you're automatically a Windows developer as well. Almost like being able to run all Android or iOS apps on Windows phones.[1][2]

[1] http://www.pcworld.com/article/3038652/windows/microsoft-kil... [2] https://developer.microsoft.com/en-us/windows/bridges/ios

Edit: Now I am puzzled as to why this got downvoted?

If you disassemble lxcore.sys you can still see hints of the Android subsystem project that it grew from: the \Device\adss and /dev/adss devices, the application name Microsoft.Windows.Subsystem.Adss, various function names containing "Adss", and some other textual references to Android.

It's too bad that x86 hardware doesn't do virtualization as well as IBM hardware. You can't stack VMs. That's exactly what's needed here - a non-kernel VM that runs above NT but below the application.

Windows also now supports nested virtualization.


I thought that a) the conclusion of VMware's "Comparison of techniques" paper [1] was that x86 and possibly everything is Popek-and-Goldberg-virtualizable [2] via binary translation, and b) the last several years of Intel and AMD chips all have hardware virtualization support, including nested virtualization, that made their architectures Popek-and-Goldberg-virtualizable in the obvious way?

[1] https://www.vmware.com/pdf/asplos235_adams.pdf

[2] https://en.wikipedia.org/wiki/Popek_and_Goldberg_virtualizat...

Looking at the way mainframes work, with their higher level languages, JIT compilers at kernel level, object databases, type 1 hypervisors, ....

It is quite interesting to see mainstream OSes increasingly adopting all those features.

Very, very slowly. Microprocessors still have DMA instead of mainframe-like "channels", although we're starting to see MMUs on the I/O side. With channels, devices can't blither all over memory and neither driver nor device need be trusted.

> the Linux fork syscall has no documented equivalent for Windows

Emphasis is mine. I wonder if this is something that cygwin could (ab)use. Also I wonder why they would need this undocumented call.

It's been tried and failed. See [1].

[1] https://cygwin.com/ml/cygwin-developers/2011-04/msg00036.htm...

> Also I wonder why they would need this undocumented call.

To implement the first NT Posix subsystem, which was a FIPS requirement.

Cygwin is layered above Win32. Win32 has no provision to nicely handle forks. So even if there were an NT API fork syscall (I don't think there is on Windows 10; WSL does not use the NT API, and there is no longer a Posix/SFU/{whatever Unix NT classic subsystem of the day} as far as I know), this would not go anywhere.

> So even if there was an NT API fork syscall

You can do it with NtCreateProcess: https://groups.google.com/d/msg/microsoft.public.win32.progr...

(The Win32 userland won't understand what you did, but you can still do it.)

Well, you can do it on some versions of Windows. On Windows 10, and even future versions of Windows 10, I'm not so sure...

Windows 10 is still Windows NT. The NT native API is widely used these days. It would be a huge departure for MS to stop supporting it in future versions of Windows.

Most parts of the NT API have never been officially documented, officially supported, or stable (in the "won't change" sense), and the tiny parts that have actually been documented come with caveats that they are susceptible to change. Supporting fork through the NT API forever makes no sense if there are no users anymore. They could continue to do it for no particular reason - just because fork is internally needed by WSL, for example, and is therefore easy to export through the NT API - but I really don't see why they would necessarily do that.

MS has historically maintained compatibility with even undocumented APIs. Of course that could change.

Cygwin programs technically run under the Win32 subsystem, but they're not that cleanly layered. The runtime calls into a lot of Nt* functions, including undocumented ones. midipix (which another commenter mentioned) is another Unix-like environment for Windows that also runs under the Win32 subsystem, and apparently it has successfully implemented a real copy-on-write fork() on top of undocumented NT syscalls, so it's definitely possible.

midipix is trying to use it, and advertises copy-on-write fork as an advantage over Cygwin, but I don't know how well it works yet. http://midipix.org/#sec-midipix http://midipix.org/git/cgit.cgi/ntapi/tree/src/process

Does anybody know how fork() is implemented? This blog post kind of sounds like fork() would do the slow emulation of it through CreateProcess().

fork() is properly implemented by the NT kernel. WSL is not layered above Win32.

Excellent post, Jack.

Interesting, I wonder how much overhead is added to syscalls to look up the process type. Does NT still do this check when no WSL processes are running?

Pretty sure these are different entry points, so you wouldn't need to do anything different for normal Windows processes whether WSL is running or not.

I don't think so... both linux and windows binaries are using the same SYSCALL cpu instruction, and thus must be going to the same handler in the NT kernel.

Does Microsoft document all system calls?

They document the WinAPI, but how that talks to the kernel is not documented. You can talk to it directly if you want, but there is nothing from Microsoft on how to do that. So if you see those as the true system calls, they are not documented at all.

Well, tiny parts of the NT API (callable from userspace) are documented, but then often with the caveat that they are not stable (in practice, even some undocumented ones can be considered stable if used by enough programs in the wild, especially if they are simple and standalone and have no Win32 equivalent)

The very precise mechanism, though, is extremely unstable. For example, virtually every release of Windows (sometimes even an SP) changes the syscall numbers. You have to go through ntdll, which is kind of a more heavyweight version of the Linux vDSO. (The ntdll approach was invented way before the vDSO, though.)

Ntdll is similar to VDSO in the sense that it is loaded into the memory space of every userspace process. Even that I think might have exceptions on the Linux side. Either way, unlike VDSO, Ntdll actually does export functions potentially useful when called from the program. Here is an interesting read. http://undocumented.ntinternals.net/

What do you think the VDSO is used for? It also exports "functions potentially useful when called from the program".

The approach is a little different though; ntdll exports all of the NT API, and you need to go through it to reach the NT API in a somehow more stable way than using syscall numbers. OTOH, the VDSO exports only virtual syscalls that gain (or have gained in the past) from being performed in userspace, and even then corresponding syscalls still exist in the kernel, with both stable numbers and even a stable API.
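The "mapped into every process" aspect of the comparison above is easy to observe on the Linux side. A Linux-only sketch: the vDSO mapping shows up directly in the process's own memory map.

```python
# Linux-only sketch: like ntdll on Windows, the vDSO is mapped into the
# address space of essentially every userspace process; its mapping is
# visible directly in /proc/self/maps.
with open("/proc/self/maps") as f:
    maps = f.read()

vdso_lines = [line for line in maps.splitlines() if "[vdso]" in line]
# On typical configurations exactly one such mapping exists (the vDSO can
# be disabled at boot, hence the guard).
print(vdso_lines[0] if vdso_lines else "no [vdso] mapping found")
```

Unlike ntdll, though, the vDSO exposes only a handful of virtual syscalls; it is not the mandatory gateway to the whole kernel API.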

Yes, yes, but can we run Wine on it?

wtf is "pico process" and "pico driver"?

Step 1: embrace

Funny they don't mention ioctl.

Next step is Microsoft basically needs to turn Windows into a flavour of Linux. If they don't, they're under massive pincer threat from Android and Chrome, which are rapidly becoming the consumer endpoints of the future. Windows is about to "do an IBM" and throw away a market that it created. See PS/2 and OS/2.

They should probably just buy Canonical. That would put the shivers into Google, properly.

Funny, years ago I would have been reflexively flabbergasted at the thought of Microsoft buying Canonical (or any Linux distro producer)... but actually thinking on that concept, and seeing the recent (perhaps less-than-hostile) approach that Microsoft has taken toward open source and Linux, it wouldn't be a bad idea. I mean, if Microsoft could have both offerings - for Windows servers and Ubuntu-installed servers - I suppose that would be a very smart business move. Assuming they don't actually butcher or deny resources to whatever Linux company they would buy, I could see several benefits - not only to Microsoft but to developers, system integrators, etc. worldwide. Hey, if a side benefit is that it would spur the market (a la Google, Apple, etc.) a little - to the benefit of us civilians - that's cool too.

I think Microsoft should do what Apple did with BSD Unix aka Nextstep and merge it with their old OS.

Microsoft should take the Windows GUI and put it over Linux as a desktop manager. Microsoft could sell the Windows GUI for Linux users that want to run Windows apps.

Could not agree more. Windows WM as an option on Linux is a clear and logical strategy.

I've been heavily downvoted for the view, but the facts are, there are hundreds of billions of dollars being spent in the Linux ecosystem by corporations. Microsoft cannot afford not to be present in it. It's as simple as that. Canonical is starting to hit Red Hat a bit on corporate support contracts, so that's why I suggested that, but as you say, it could be another big and credible Linux distro (though Ubuntu all over the cloud must surely be tempting). Generally, the idea that Microsoft wants to/must go big into Linux is uncontroversial, for me.


We detached this subthread from https://news.ycombinator.com/item?id=11866402 and marked it off-topic.

I'm incredibly biased. But that bias has come from assessing the technical details and concluding that NT really is superior, if that's any consolation.

PyParallel flogs the Windows versions of things like Go, Node and Tornado because none of those were implemented in a way that allows the NT completion-oriented I/O facilities to be optimally exploited.

It's depressing, honestly. In the sense that open source software never really comes close to taking advantage of NT because there are no such paradigms on UNIX. It's also complex as hell... I came from a UNIX background and completion ports were just a bizarre black box of overcomplicated engineering -- but after taking the time to understand the why, that was just a blub paradox reaction. And it's been a couple of years now of concerted study to really start appreciating the little details.

> things like Go, Node and Tornado [do not allow] the NT completion-oriented I/O facilities to be optimally exploited.

nodejs is built on libuv[1], which uses IOCP on Windows (and epoll, kqueue, etc. on other platforms). What's non-optimal about it?

[1] https://github.com/libuv/libuv

"The things that bothers me about all the 'async I/O' libraries out there... is that the implementation -- single-threaded, non-blocking sockets, event loop, I/O multiplex via kqueue/epoll, is well suited to Linux/BSD/OSX, but there's nothing asynchronous about it, it's technically synchronous non-blocking I/O, and it's inherently single-threaded."


Now I'm curious; have you played with reactOS at all? Do they implement the same VMS paradigm? That is, if given decent hw drivers for io on ReactOS, could one expect performance on the same order of magnitude as with a new nt-derived kernel?

I haven't actually. Although now I'm kind of curious.

It's pretty much a complete implementation, down to the kernel APIs.

One of the (former?) main kernel developers on ReactOS, Alex Ionescu, is even a co-author on the latest editions of Windows Internals.

Indeed, he's got a talk at Black Hat this year about Bash on Windows on Linux, too :-)

Why did you two have this same conversation 383 days ago?


I dunno', I don't remember everyone I reply to, certainly not over a year ago.

You and me both.

Let's definitely reconvene in another 384 days though!

Should give you enough time to test on ReactOS and report back! :-)

Haha. Well, you know what they say, if it is a good idea today, it's a good idea tomorrow too ;-)

So no one has replicated your findings and you're better than everyone else? K.


TBH it looks like you're making multiple accounts to comment on your own posts to make the comments that disagree with you appear lower.

You and the child above mine had this same conversation 383 days ago:


This is pathetic.

for posterity:



We've banned this account for repeatedly violating the HN guidelines.

Well now I have no idea what you're talking about.

I can barely remember the password to this hacker news account let alone managing multiple identities.

I only have superficial experience with it, but from what I can tell there is a huge mismatch between IOCP and the UNIX readiness/poll model, and from my experience most server programs are written primarily for the latter.

You need to design your server code somewhat differently to take advantage of IOCP. What many UNIX-like programs do instead is bend IOCP or WaitForMultipleObjects() to behave like Linux. It works, but the performance is not there.

Note that I haven't checked the code for any of the software in your chart, so I could be wrong.

Oh man, seeing UNIX people use WaitForMultipleObjects() as if it were select() (or, gasp, trying to use it for socket server stuff) just kills me.

It's actually a really sophisticated multi-semaphore behind the scenes; but that sophistication comes at an inevitable performance penalty, because the wait blocks are in a linked list and it's generally expensive (in comparison to say, a completion packet being queued to a port) to satisfy that wait (the kernel needs to check lots of things and do locking and unlocking and linked list manipulation).

Here's what I don't understand: if IOCP gives us such a great performance boost, why don't I see people using it, even on Windows systems? The first thing that comes to mind is that IOCP can likely only maintain high performance under certain edge cases that aren't pertinent to the real world; the second is that the other I/O models are needed not for Unix, but for other aspects of the language itself where IOCP is not appropriate. So IOCP likely causes overhead. I don't know much about it, so maybe someone can explain. It sounds revolutionary if it can be applied to the real world and not just edge cases.

People are used to the POSIX readiness model, that's all. And you can kind of make it work on Windows, even though the performance is not great. OTOH, porting an IOCP-oriented application to Linux will give you catastrophic performance because you need to use many threads to replicate the async model on top of the synchronous Linux I/O calls.

And for some reason, people get very defensive when you point out some of the advantages the NT kernel has over Linux. I mean, I use Linux, I don't like to use Windows but I have no problem admitting the IOCP model is better and async I/O on Linux is a sad broken mess. Denying it only serves to keep it that way.

IOCP is the top item on my (short) list of things Windows simply does better. The performance boost you see from designing a server for IOCP from the ground up is jaw-dropping. However, it's hard to grok because it's a very different model, most servers are proprietary code, and the MSDN documentation is barely sufficient (at least, back when I was writing IOCP-based servers in '09).

Yeah, I had to figure out how to leverage all the new threadpool stuff with async sockets from scratch for PyParallel, none of the MSDN docs detailed how AcceptEx() and ConnectEx() and DisconnectEx() and all that jazz were meant to be used with CreateThreadpoolIo(), StartThreadpoolIo() etc.

Even bought a couple of Microsoft books and they only covered it at the file I/O level, too. And the last Winsock book is reaaaaaally getting old.

Also, registered I/O is a similar jump in jaw-dropping performance. Disable completion notifications, pre-register all of your buffers on large pages, and use timers to schedule bulk de-queuing of all available completion events.

You definitely need a feedback loop in there to fine-tune the timings, but holy hell does it fly. You basically get a guaranteed-max-latency state machine (given bandwidth, computational cost of request processing, and number of clients connected).
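The timer-driven bulk dequeue described above can be sketched abstractly. This is a hypothetical illustration (not the actual Registered I/O API): completions pile up in a queue and a periodic drain pulls them all at once, amortizing per-event overhead at the cost of bounded added latency.

```python
import queue

# Completions accumulate here between timer ticks; a real RIO server would
# be draining a registered completion queue instead.
completion_queue = queue.Queue()
batches = []

def drain():
    # Dequeue every completion currently available in one pass.
    batch = []
    while True:
        try:
            batch.append(completion_queue.get_nowait())
        except queue.Empty:
            break
    if batch:
        batches.append(batch)

# Simulate ten completed I/O requests arriving between timer ticks.
for i in range(10):
    completion_queue.put(("request", i))

# A real server would invoke drain() from a recurring, feedback-tuned
# timer; here we call it once directly.
drain()
print(len(batches), len(batches[0]))
```

Ten events cost one drain pass instead of ten wakeups, which is where the "guaranteed-max-latency state machine" framing above comes from: the timer interval bounds the latency, and the batch size amortizes the overhead.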

Inertia and the fact that select/epoll/kqueue loops are "good enough".


Not only that but exploiting NT optimally takes a lot of knowledge about a lot of very low-level kernel details, which you sort of need to make a concerted effort to really learn, which is made harder by no official access to (recent) source code.

So I genuinely think there are just fewer programmers out there who specialize in this sort of stuff. There's definitely a correlation between well-above-average programmers and open source contribution, but not really the same level of intellectual curiosity about NT/Windows.

One thing I've noticed is that all the people that grok really low level NT kernel details usually end up as reverse engineering experts.

Oh, except for the OSR internals list, that is usually overflowing with clue (both now and historically).

It also requires enjoying programming on Windows and buying all the Microsoft Press books that have tons of healthy information how the whole stack works.

Which doesn't work in the free (as in beer) culture world of modern UNIX.

So only those of us that feel like rewarding the work of others do get the information how everything works.

The simplest explanation is that the performance benefit of using Windows and implementing just about any Windows-specific design is outweighed by the cost of the Windows licensing fees, when compared to a measurably worse-performing Linux or FreeBSD solution that costs nothing. So very few bother to treat Windows versions of "backend" software as anything but an afterthought.

It's not even the licencing fees in my opinion. It's that it has had the equivalent of 200 Potterings run amok in it for 30 years. There are so many things that will eclipse any advantage from IO completion ports.

Not really... the license model provides support and creates fairly standard releases that you can more easily leverage, if licensing or selling software is your thing. If all you need is something good enough and cheap, then licensing could get in the way.

I guess you should also mention on each of your comments that you are extremely biased against and paranoid about Microsoft and that you actively target any popular Microsoft article posted here in order to inject something negative, whether it's relevant or not.


I used to run Linux in a VM on Windows and use Chocolatey for package management and Cygwin and PowerShell etc., then I realized I was just trying to make Windows into Linux. That seems to be the way things are going, and with the addition of the Linux subsystem it kind of proves that Windows really isn't a good OS on its own, especially not for developers.

I wish Windows/MS would abandon NT and just create a Linux distro. I don't know anyone who particularly likes NT and jamming multiple systems together seems like an awful idea.

Windows services and Linux services likely won't play nice together (think long file paths created by Linux services, and other incompatibilities). For them to be 100% backward compatible, they need to not only make Windows compatible with the things Linux outputs, but Linux compatible with the things Windows services output, and to keep the Linux people from figuring out how to use Windows on Linux systems they'd need to make a lot of what they do closed source.

So I don't see a Linux+Windows setup being deployed in production. It's cool for developers, but even then you can't do much real-world stuff that utilizes both Windows and Linux. If you're only taking advantage of one system, then what's the point of having two?

I went ahead and made the switch to Linux since I was trying to make Windows behave just like Linux.

> I wish Windows/MS would abandon NT and just create a Linux distro. I don't know anyone who particularly likes NT and jamming multiple systems together seems like an awful idea.

I do. The NT kernel is pretty clean and well architected. (Yes, there are mistakes and cruft in it, but Unix has that in spades.) It's not "jamming multiple systems together"; an explicit design goal of the NT kernel was to support multiple userland APIs in a unified manner. Darwin is a much better example of a messy kernel, with Mach and FreeBSD mashed together in a way that neither was designed for.

It's the Win32 API that is the real mess. Having a better officially supported API to talk to the NT kernel can only be a good thing, from my point of view.

Well, large parts of the NT API are very close to the Win32 API for obvious reasons, and so are often in the realm of dozens of params and even crazier Ex functions. Internally there are redundancies that do not make much sense (like multiple versions of mutexes or spinlocks depending on which parts of kernel space use them, IIRC), and some whole-picture aspects of Windows make no sense at all given the architectural cost they induce (Winsock split in half between userspace and the obviously needed kernel support is just completely, utterly crazy, beyond repair; it makes so little sense you want to go back in time and explain to the designer of that mess how stupid it is).

The initial approach of NT subsystems was absolutely insane (a hard dependency on an NT API core, so classic NT subsystems can't do emulation - they were either limited to OSes with some technical similarities, like OS/2, or to very small communities when doing a new target, as with the Posix subsystem or SFU). WSL makes complete sense, though, but it is maybe a little late to the party. Classic NT subsystems are of so little use that MS did not even use them for their own Metro and then UWP things, even though they would very much like to distinguish those more from Win32 and make the world consider Win32 legacy. I've read the original paper motivating putting Posix in an NT subsystem, and it contained no real strong point, only repeated incantations that this would be better in an NT subsystem and worse if done otherwise (well, for fork this is obvious, but the paper was not even focused on that), with none of the limitations I've explained above ever considered.

Still, considering the whole system, an unstable user-kernel interface has few advantages and tons of drawbacks. MS is extremely late to the chroot and then container party because of it (and let's remember that the core technology behind WSL emerged because they wanted to solve the chroot/isolated-userspace problem on their OS in the first place, NOT because they wanted to run Linux binaries) - so that's yet another point against classic NT subsystems.

Back to core kernel stuff: the IRQL model is shit. It does not make any sense when you consider what really happens, and you can't really use arbitrarily many levels. It seems cute and clean and all that, but the Linux approach of top and bottom halves plus kernel and user threads might seem messy yet is actually far more usable. Another point: now everybody uses multiprocessor computers, but back in the day the multiple HALs were also a false good idea. MS recognizes it now and only wants to handle ACPI computers, even on ARM. Other OSes handle all kinds of computers... Cutler claimed not to like the "everything is a file" approach, but NT does basically the same thing with "everything is a handle". And soon enough you hit exactly the same conceptual limitations (except not in the same places) - not everything is actually the same, so that cute abstraction leaks soon enough (well, it does in any OS).

Taking a more results-oriented approach, one of the things WSL makes clear is that file operations are very slow (just compare an identical file-heavy workload under WSL and then under real Linux).

So of course there are (probably) some good parts, like in any mainstream kernel, but there are also some quite dark corners. I am not an expert on all of NT's architectural design, but I'm not a fan of the parts I know, and I strongly prefer the Linux way of doing equivalent things.

> Cutler pretended to not like the "everything is a file" approach, but NT does basically the same thing with "everything is a handle". And soon enough, you hit exactly the same conceptual limitations (except not in the same places) that not everything is actually the same, so that cute abstraction leaks soon enough (well, it does in any OS).

Explain? Pretty much the only thing you can do with a handle is to release it. That's very different from a file, which you can read, write, delete, modify, add metadata to, etc... handles aren't even an abstraction over anything, they're just a resource management mechanism.

You are right, but those points are details. FDs under modern Unixes (esp. Linux, but probably others) serve exactly the same purpose (resource management). FDs where read/write can't be used just don't define those operations (same principle for other syscalls); similarly, if you try NtReadFile on an incompatible handle, it will also give you an error back. Both live in a single numbering space per process. NT makes heavy use of NtReadFile/NtWriteFile to communicate with drivers, even in quite core Windows components (Winsock and AFD). And NT handles do serve at least one abstraction (that I know of): they can be signaled and waited on with WaitFor*Objects.

So the naming distinction is quite arbitrary.

> You are right, but those points are details.

Uh, no, they are very crucial details. For example, it means the difference between letting root delete /dev/null like any other "file" on Linux, versus an admin not being able to delete \Device\Null on Windows because it isn't a "file". The nonsense Linux lets you do because it treats everything like a "file" is the problem here. It's not a naming issue.

Linux has plenty of file descriptor types that do not correspond to a path, along with virtual file systems where files cannot be deleted...

Your example of device files is hardly universal, and the way it works is useful.
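As a concrete Python sketch of a file descriptor with no corresponding path: a pipe exists only in the fd table, with no name anywhere in the filesystem, yet it uses the same read/write interface as any file.

```python
import os

# Anonymous descriptors: a pipe has no name in any filesystem,
# but read/write work on its fds exactly like on an open file.
r, w = os.pipe()
os.write(w, b"hello")
print(os.read(r, 5))    # b'hello'
os.close(r)
os.close(w)
```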

And to give you another example, look at how many people bricked their computers because Linux decided EFI variables were files. You can blame the vendors all you want, but the reality is this would not have happened (and, mind you, it would have been INTUITIVE to every damn user) if the OS were sane and just let people use efibootmgr instead of treating every bit and its mother as a file. Just because you have a gun doesn't mean you HAVE to try shooting yourself, you know? That holds even if the manufacturer was supposed to have put a safety lock on the trigger, by the way. Sometimes some things just don't make sense, if that makes sense.

How many people really did this, compared to e.g. Windows users attacked by CryptoLocker?

"The way it works is useful?"! When was the last time you found it useful to delete something like /dev/null via command line? And how many poor people do you think have screwed up their systems and had to reboot because they deleted not-really-files by accident? Do you think the two numbers are even comparable if the first one isn't outright zero?

It literally doesn't make any sense whatsoever for many of these devices to behave like physical files, e.g. be deletable or whatnot. Yes there is an exception to every nonsense like this, so yes, some devices do make sense as files, but you completely miss the point when you ignore the widespread nonsense and justify it with the exceptions.

Your complaint is with the semantics of the particular file. There's no reason why files in /dev need be deletable using unlink. That's an historical artifact, and one that's being rectified.

"Everything is a file" is about reducing most operations to 4 very abstract operations--open, read, write, and close. The latter three take handles, and it's only the former that takes a path. But you're conflating the details of the underlying filesystem implementation with the relevant abstraction--being a file implies that it's part of an easily addressable, hierarchical namespace. Being a file doesn't imply it needs to be deletable. unlink/remove is not part of the core abstraction. But they are hints that the abstraction is a little more leaky than people let on. Instantiating and destroying the addressable character of a file poses difficult questions regarding what the proper semantics should be, though historically they're deletable simply because using major/minor device nodes sitting atop the regular persistent storage filesystem was the simplest and most obvious implementation at the time.

Hm, I was thinking more about open FDs, not just special file entries on the FS. Well, I agree with you: it's a little weird, and in some cases dangerous, to have the char/block devices in the FS, and it has already been worked around multiple times in different ways, in some cases even with multiple different work-arounds simultaneously. NT is better on that point. But not once the "files" are open and you've got FDs.

> IRQL model is shit. Does not make any sense when you consider what really happens,

On the contrary. It's only when one considers what happens, especially in the local APIC world as opposed to the old 8259 world, that what the model actually is finally makes sense.

* http://homepage.ntlworld.com./jonathan.deboynepollard/FGA/ir...

I don't care about the 8259, and I don't see why anybody should except PIC driver programmers. It's doubtful anybody designing NT cared, given that the very first architecture it was designed against was not a PC, and this is typically the kind of thing that goes through the HAL.

IRQL is a purely software abstraction, in the same way top/bottom halves and the various threads are under Linux (in some versions of the Linux kernel it even transparently switches to threaded IRQs for the handlers; there is no close relationship with any interrupt controller at that point). IRQL is shit because most of the arbitrary levels it provides cannot be used to distinguish anything continuously from an application point of view (application in the historical meaning, no "App" bullshit intended here), even in seemingly continuous areas (DIRQL). So there is no value in providing so many levels with nothing really distinguishing them -- or, at some level transitions, too many completely different things changing at once. It's even highly misleading, to the point that the article you link is needed (but it does not even provide the whole picture). I see potential for misleading people with PIC knowledge, people used to real-time OSes (if you try to organize application priority based on IRQL, you will fail miserably), people with backgrounds in other kernels -- well, pretty much everybody.

Having a better officially supported API to talk to the NT kernel can only be a good thing, from my point of view.

That's particularly interesting now that SQL Server has been ported to Linux. It would be funny if they used the Linux subsystem on Windows too.

Although I suspect SQL Server already talks to the kernel directly.

No, they have a sophisticated user-mode library but use only public kernel APIs. That user-mode library also helped them migrate SQL Server to Linux relatively painlessly.

> Having a better officially supported API to talk to the NT kernel can only be a good thing, from my point of view.

This is what I am looking forward to with WinRT, hence why Rust should make it as easy as C++/CX and C# to use those APIs. :)

Well I've personally seen Microsoft employees themselves complaining about the state of NT while saying it's "fallen behind Linux".

An old HN commenter once wrote (mrb)

> There is not much discussion about Windows internals, not only because they are not shared, but also because quite frankly the Windows kernel evolves slower than the Linux kernel in terms of new algorithms implemented. For example it is almost certain that Microsoft never tested I/O schedulers, process schedulers, filesystem optimizations, TCP/IP stack tweaks for wireless networks, etc, as much as the Linux community did. One can tell just by seeing the sheer amount of intense competition and interest amongst Linux kernel developers to research all these areas.

>The net result of that is a generally acknowledged fact that Windows is slower than Linux when running complex workloads that push network/disk/cpu scheduling to its limit: https://news.ycombinator.com/item?id=3368771 A really concrete and technical example is the network throughput in Windows Vista which is degraded when playing audio! https://blogs.technet.microsoft.com/markrussinovich/2007/08/...

>Note: my post may sound like I am freely bashing Windows, but I am not. This is the cold hard truth. Countless multi-platform developers will attest to this, me included. I can't even remember the number of times I have written a multi-platform program in C or Java that always runs slower on Windows than on Linux, across dozens of different versions of Windows and Linux. The last time I troubleshot a Windows performance issue, I found that the MFT of an NTFS filesystem was fragmented; this is to say that I am generally regarded as the one guy in the company who can troubleshoot any issue, yet I acknowledge I can almost never get Windows to perform as well as, or better than, Linux when there is a performance discrepancy in the first place.


"As Brother Francis readily admitted, his mastery of pre-Deluge English was far from masterful yet. The way nouns could sometimes modify other nouns in that tongue had always been one of his weak points. In Latin, as in most simple dialects of the region, a construction like servus puer meant about the same thing as puer servus, and even in English slave boy meant boy slave. But there the similarity ended. He had finally learned that house cat did not mean cat house, and that a dative of purpose or possession, as in mihi amicus, was somehow conveyed by dog food or sentry box even without inflection. But what of a triple appositive like fallout survival shelter? Brother Francis shook his head."

What is this from?

Walter Miller's A Canticle for Leibowitz.

It's a subsystem of Windows for running part of Linux, so Windows Subsystem for Linux. :)

This title would be clearer: Windows subsystem for Linux apps.

That's the output though, they're not changing anything in Linux. The "Windows subsystem" they're developing will act as the translator to get there.

Linux Subsystem for Windows could also be misread the same way. Both are ambiguous.

I initially thought this one was less ambiguous but I have to admit, I think Microsoft's phrasing is right. Let's try some substitution:

"Russia Factory for England" most likely exists inside of Russia and is for the English.

"John's mail for Sally [try: who is out of town]" even with the addition, I presume that John has authored mail for Sally and is not collecting the parcels to give to her.

Here's a trickier one:

"Sampsons' Dinner for Two". This could be the following:

1. A product named "Sampsons' Dinner for Two" bought from a retail store

2. An item "Dinner for Two" on a menu from a restaurant named "Sampsons"

3. A place named "Sampsons' Dinner for Two" with only two-person tables.

4. A product "Sampsons' Dinner" which comes in multiple sizes, one of them being designed for two people. (which is the ambiguous form - presuming there's also say Annie's Dinner for One/Two and Martha's Dinner for One/Two - each with a brand specific cuisine). Even here though, the ownership of which "Dinner for Two" product is still clear - it's the "Sampsons'" or "Martha's" brand.

Regardless of what kind of substitution, we go back to "Windows Subsystem for Linux" for the most part parsing as

"Windows [Subsystem for Linux]" like "Windows [Media Player]". I don't assume that it's "[Windows Media] Player" - as in some multi-platform software that is tasked with playing the proprietary windows media formats.

It seems weird, but I think it's unarguably the right choice.

Maybe because, the way they stated it, it would be a much more attractive technology. Seems like an attempt to regain ground in the server market.

This would be really useful for distributing Windows apps as Linux binaries. It would make it easier to develop from Linux and target Windows. Need the same for OSX.

Pretty neat stuff. I think that MS should just create their own Linux Distribution & port all MS products. Get rid of the Windows NT Kernel. I believe it's outdated & doesn't have the same update cycle that the Linux Kernel has.

Why run a Linux application/binary on a Windows server OS when you can just run it on a Linux OS and get better performance and stability?

What makes you believe it's outdated?

Can you show me the source so I can check?

Actually there was a leak for 2000, most critics said it was surprisingly good.

There are leaks galore. NT4, 2000, and more recently, the Windows Research Kit. Just google something like 'apcobj.c' and see. (Hah, first link was a github repo!)

> Get rid of the Windows NT Kernel. I believe it's outdated & doesn't have the same update cycle that the Linux Kernel has.

Curious why you claim this? What's outdated about the NT Kernel?

Here are some, though maybe these are not part of the NT kernel...

1. The use of drive letters A-Z for file system access.

2. Creating symbolic links to files and folders, like you can in Unix/Linux. You have to enable a setting somewhere to allow this, and there's a security risk.

3. A standard, functional/usable non-GUI terminal application like Unix/Linux ssh. PowerShell doesn't come close.

4. The ability to sudo or su to Admin like in Unix/Linux.

Maybe these are not kernel-related, but part of the OS-specific layer.

1. The use of a multi-root hierarchy vs. a single root hierarchy is pretty arbitrary. Drive letters in turn are just an arbitrary way to define the multi-root hierarchy.

2. `mklink` [0] has existed since Windows Vista for NTFS file system. No settings toggling required.

3. What is your argument against PowerShell? In what ways does it fall short? I have been pretty successful with using it for various tasks.

4. This is about the only legit claim. Windows always requires full credentials to execute as another user. Windows does provide `runas.exe`, but you must provide the target user's full credentials.

[0] https://technet.microsoft.com/en-us/library/cc753194%28v=ws....

> `mklink` [0] has existed since Windows Vista for NTFS file system. No settings toggling required.

There are several limitations in the Windows/NTFS implementation, however:

1. You have to specify the target type (file or directory) at link creation time.

2. Creating symbolic links requires either being an Administrator user, or having the "Create symbolic links" group policy enabled for your account.

3. No real directory hard links. Junction points are close, but still distinguishable. (EDIT: For some reason, I forgot that Linux doesn't have these, either. Maybe I was thinking of bind mounts.)
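On point 2, a hedged cross-platform sketch in Python (the paths are illustrative): os.symlink needs no privilege on Linux, while on Windows it requires the "Create symbolic links" right discussed above, and the link type is chosen via target_is_directory.

```python
import os
import tempfile

# Creating a symbolic link portably. On Linux this needs no privilege;
# on Windows it needs the "Create symbolic links" right, and the link
# type (file vs. directory) is fixed at creation time via the
# target_is_directory parameter.
d = tempfile.mkdtemp()
target = os.path.join(d, "target.txt")
with open(target, "w") as f:
    f.write("hi")

link = os.path.join(d, "link.txt")
os.symlink(target, link)     # target_is_directory=False by default
print(open(link).read())     # 'hi'
```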

Windows isn't the only OS that disallows creating directory hard links.


Huh, I wonder why I misremembered that. Thanks for the correction.

> 3. What is your argument against PowerShell? In what ways does it fall short? I have been pretty successful with using it for various tasks.

PowerShell is an acceptable scripting language. It's a horrible interactive shell. They shouldn't have made "shell" part of its name if they weren't going to at least include basic interactive features like readline compatibility and usable tab completion.

I'm not familiar with what you mean by readline compatibility. I see a GNU Readline library with history functions, but I'm not sure what that means being integrated into a shell, and what features you would expect to see.

My experience with tab completion in PowerShell is great. It completes file paths, command names, command parameter names, and even command parameter values if they're an enumeration. Could you describe what else you would expect to see?

PowerShell's completion is worthless for the most common use case of saving keystrokes. Since it fills in the entire remaining command name and cycles through the possibilities in alphabetical order, instead of completing just what's unambiguous and presenting a list of the possibilities from there, it doesn't help with typing out common prefixes. And if you find yourself cycling through an unreasonably large number of possibilities, you have to backspace to get rid of the unwanted completion (often very long, given the naming conventions) before you can refine your search.

They took a code completion technique that works alright for an IDE and put it on the command line, losing some key usability in the process when they could have just implemented the paradigm that has been standard in the Unix world for decades.

Yes, it does a great job of identifying what the completion possibilities are in almost every context. That's good enough for an IDE, but only half the job when you're making an interactive command line shell.

As for other readline features: it's really annoying to only partially implement a well-known set of keyboard shortcuts.
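The Unix-style behavior being described can be sketched in a few lines of Python (the cmdlet names are just illustrative):

```python
import os.path

def readline_complete(prefix, candidates):
    """Bash/readline-style completion: extend the input to the longest
    unambiguous common prefix; if several candidates remain, return them
    for display as a list instead of cycling through them one by one."""
    matches = [c for c in candidates if c.startswith(prefix)]
    if not matches:
        return prefix, []
    common = os.path.commonprefix(matches)
    return common, (matches if len(matches) > 1 else [])

cmds = ["Get-ChildItem", "Get-Command", "Get-Content"]
print(readline_complete("Get-Ch", cmds))  # ('Get-ChildItem', [])
print(readline_complete("Get-C", cmds))   # ambiguous: lists all three
```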

See, and I personally dislike the system you mention. It bugs me to no end to have only a couple of options but have the system fill in only the common prefix. Now I have to look, figure out what's there already, figure out what the next letter is, hit it, then hit tab again. If I want to save keystrokes, that's what aliases and symlinks are for, not tab completion.

Also, at least in PowerShell 5, if you run the ISE instead of the cmd-based terminal, it shows an IDE-like overlay of completions while you're typing.

> if you run the ISE instead of the cmd-based terminal

PowerShell depends (or is based) in no way on cmd.

I am fairly certain that they use the same console system. For instance, right-clicking on the title bar gives the exact same options as in `cmd.exe`. I was perhaps a little overzealous in calling it cmd-based; that's a slip-up on my part simply because `cmd.exe` was the only program to use that system before. Thanks for keeping me honest.

Both use the Windows console host, effectively what a terminal emulator is on Unix-likes. It provides the window and a bunch of other functionality (character grid with attributes, history, aliases, selection, drag&drop of files into the window, etc.).

It's just that every console application on Windows uses that host. This includes cmd, PowerShell, Far Manager, or even vi. Sorry, I may have seen it conflated with cmd too often. It just nags me. For Linux users it's probably when everyone starts calling a terminal (emulator) "bash".

https://github.com/lzybkr/PSReadLine perhaps. Personally I find it nice, though it requires a bunch of tweaking to feel comfortable, maybe less so for people who are used to readline.

Tab completion works nicely and you can even rotate through the different values a parameter can take. Also, ctrl+space lists all the options (parameters, values, files, folders) available at your cursor's position.

I would call it something more than an acceptable scripting language. It is object-oriented and can easily use C# libraries, which is pretty neat.

a) That's nothing to do with the kernel.

b) http://mridgers.github.io/clink/ is bloody fantastic.

> 1. The use of drive letters A-Z for file system access.

NT has a root directory like Unix does. Drive letters are symbolic links inside a directory called \DosDevices.

Granted, this is not user-visible but an implementation detail. The needs of Win32 applications dictate a lot of user-visible behavior.

> 2. Creating symbolic links to files and folders,

NT supports symbolic links. Open cmd and type "mklink".

You are confusing the win32 subsystem with the NT kernel. They are not the same, the win32 layer acts as a translation. Also symbolic and hard links are supported by NTFS, they are just not exposed in the UI. There are utilities to create them if you really want to.

The shell itself and the rest of userland have very little to do with the kernel. It seems it's the userland you are upset with. Swapping out the kernel won't fix that.

None of these are kernel-related.

You can create hard and soft links. PowerShell is great, just different. UAC is not sudo, but works very well. It's a different OS.

> UAC is not sudo, but works very well.

For some values of well. If you log on interactively and start a PowerShell session, you do not have administrative powers and cannot get them without opening a new shell. If you log on via PS remoting, you have administrative powers by default and cannot drop them.

UAC is a GUI kludge, and is very grating especially in Powershell.

My biggest pet peeve about Windows is the way it accesses files. I'm not sure if this is a kernel or filesystem issue, but when a remote user has a file open, other users are prevented from updating or replacing that file for as long as it stays open. It happens all the time at my work, and I know of no obvious way to find out who has the file open, because as far as I can tell nothing like lsof exists.

This is probably the number one cause of me banging my head against the desk and wishing Windows behaved more like Linux.

What you can do while it's open is partially defined by the dwShareMode parameter in CreateFile. Unfortunately a lot of people look at the daunting documentation, shrug, and put 0 there, which is the least permissive mode. A lot of libraries do that too.

OTOH there are other limits that are not dictated by dwShareMode. Such as deleting files while handles are open - this blocks a new file with the same name from being created until all handles are closed. That's probably the worst one. There are some other crappy ones involving directory handles that I don't care to enumerate.
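For contrast, a small Python sketch of the Unix semantics this comment alludes to: unlinking an open file removes the name immediately, the data stays reachable through the open descriptor, and the name is free for reuse at once.

```python
import os
import tempfile

# Unix delete-while-open: the directory entry and the open file are
# independent, so unlink() frees the name even while a descriptor
# still refers to the old file's contents.
path = os.path.join(tempfile.mkdtemp(), "log.txt")
fd = os.open(path, os.O_RDWR | os.O_CREAT)
os.write(fd, b"old contents")

os.unlink(path)                      # name gone; fd still valid
os.lseek(fd, 0, os.SEEK_SET)
print(os.read(fd, 12))               # b'old contents'

new_fd = os.open(path, os.O_RDWR | os.O_CREAT)  # same name, fresh file
os.close(fd)
os.close(new_fd)
```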

> I'm not sure if this is a kernel or filesystem issue.

It's neither. It's a common misconception. See https://news.ycombinator.com/item?id=11415366 .

Sane OSes don't let people modify files without permission from other users.

Windows isn't the only OS having file locking implemented at kernel level.

> or maybe this is not part of the NT Kernel

> 1. The use of drive letters A-Z for file system access.

Indeed it is not. The kernel sees a single root for the object namespace, not drive letters.

2. Filesystem feature, supported by NTFS in different styles for quite a while. It has a security policy because a lot of userland software doesn't know about them. I use them all the time and they work, even if you move things like /Users/.

3. Userland issue. You can compile bash and an ssh server for Windows if you want. PowerShell is quite different, yes.

4. exists, both in GUI and commandline.

A-Z drive letters are just a Win32 thing.

Symbolic links are supported by NTFS, just not exposed to normal users.

That's just your opinion about powershell...

UAC, and runas?

> 1. The use of drive letters A-Z for file system access.

Why is this a problem? As a user, I've always preferred to have drive letters - it makes it immediately clear if, for example, I'm moving files between different physical drives.

Considering mount points, reparse points, and things like subst, I doubt you can ever really know that. Granted, the deviations from the normal scheme are of your own making as a user, but so are the places where you mount volumes from different physical drives in a single-root hierarchy.

4. Shift+Right click or CPAU http://www.joeware.net/freetools/tools/cpau/ for command line
