
Emulating Windows system calls in Linux - mfilion
https://lwn.net/Articles/824380/
======
mrpippy
There was a follow up a few weeks later:
[https://lwn.net/Articles/826313/](https://lwn.net/Articles/826313/)

~~~
Ericson2314
Let's make this the URL? I think the resolution makes for a more interesting
discussion, and it does still talk about the provisional design in brief.

------
nxc18
This seems like it would be a lot of maintenance effort to get right,
considering Windows shuffles their syscall table every release, including
releases within the Windows 10 family.

[https://j00ru.vexillium.org/syscalls/nt/64/](https://j00ru.vexillium.org/syscalls/nt/64/)
Has an excellent table of syscalls by version.

~~~
gizmo686
If applications are calling these syscalls directly, then there needs to be
some form of stability; otherwise, the applications would only work on the
particular version of Windows they were compiled for.

~~~
kazinator
That stability comes from the C interfaces of the Win32 API implemented by a
DLL like kernel32.dll.

I think when the article is talking about Windows system calls, it's about the
interface between the kernel32.dll and the kernel, not about the Win32 API.

~~~
poizan42
The syscall api is in ntdll.dll (and user32.dll for win32k.sys calls).
kernel32 does not know the syscall numbers, it just calls functions in ntdll.
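
A toy sketch of why that indirection matters, assuming a kernel that dispatches purely by number and an ntdll that ships with the matching table. The build names and every number below are invented placeholders, not real Windows values; only the exported names (NtClose and friends) are real.

```python
# Toy model: syscall numbers differ per build, exported names do not.
# All numbers are invented placeholders, not real Windows values.
SYSCALL_TABLES = {
    "build_A": {"NtClose": 10, "NtCreateFile": 11, "NtReadFile": 12},
    "build_B": {"NtClose": 12, "NtCreateFile": 10, "NtReadFile": 11},
}


class Kernel:
    """Dispatches purely by number, with no idea what the caller intended."""

    def __init__(self, build):
        self.by_number = {num: name for name, num in SYSCALL_TABLES[build].items()}

    def syscall(self, number):
        return self.by_number.get(number, "invalid system service")


class Ntdll:
    """Ships with the OS, so its stubs always match the running kernel."""

    def __init__(self, build, kernel):
        self.numbers = SYSCALL_TABLES[build]
        self.kernel = kernel

    def call(self, name):
        return self.kernel.syscall(self.numbers[name])


def run_app(build, hardcoded_number):
    """Return what the app reaches via the ntdll stub and via a raw number."""
    kernel = Kernel(build)
    ntdll = Ntdll(build, kernel)
    via_stub = ntdll.call("NtClose")
    via_raw = kernel.syscall(hardcoded_number)
    return via_stub, via_raw
```

An app that baked in the number it saw on build_A still reaches NtClose through the stub on build_B, but its raw call lands on a different service.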

~~~
my123
Before, it used to be user32.dll/gdi32.dll calling directly while also having
tons of logic inside.

Nowadays, it's split up as it should always have been, with the path being
user32.dll -> win32u.dll -> win32k.sys.

~~~
poizan42
Ah yes forgot about that. The kernel side has also been split up into
win32kbase.sys and win32kfull.sys/win32kmin.sys (I think win32kmin may have
only been used for the now discontinued Windows 10 Mobile), and win32k.sys
just calls into those.

------
magicalhippo
Despite having coded for Windows for quite some time, I realized I had never
actually peeked that low before, so didn't have a good understanding of how
system calls really worked.

Found this article[1] which explained it nicely (and hopefully correctly).

[1]:
[https://www.codeguru.com/cpp/w-p/system/devicedriverdevelopm...](https://www.codeguru.com/cpp/w-p/system/devicedriverdevelopment/article.php/c8035/How-Do-Windows-NT-System-Calls-REALLY-Work.htm)

------
CodesInChaos
Since calling syscalls directly instead of through the wrapper dll is
unsupported, I wouldn't be surprised if MS prevented syscalls from outside the
expected libraries at some point.

OpenBSD already does this to improve security:
[https://marc.info/?l=openbsd-tech&m=157488907117170&w=2](https://marc.info/?l=openbsd-tech&m=157488907117170&w=2)

~~~
guenthert
I'm a bit at a loss how this improves security or why it is necessary. If the
kernel needs an external library as gatekeeper, something fundamentally went
wrong.

And why would, say a Lisp compiler use the C library (or to use a more popular
example, why would a Java VM do the same)? It might be common practice (to
ease FFI/JNI), but making this a requirement seems heading the wrong
direction.

~~~
CodesInChaos
> I'm a bit at a loss how this improves security or why it is necessary. If the
> kernel needs an external library as gatekeeper, something fundamentally went
> wrong.

It's not about defending the kernel from the user mode process. It's a
mitigation to make exploiting usermode RCEs like buffer overflows a bit harder
by possibly preventing some forms of ASLR bypass.

MS might do it in part for security, but likely also to prevent people from
relying on unstable implementation details, which makes changing those details
harder for MS because they try to avoid breaking popular applications.

> And why would, say a Lisp compiler use the C library

On unix based systems libc doubles as (part of) the OS's API. BSD in
particular has no stable syscall interface, code must go through libc. Its
libc doesn't even have a stable ABI across major versions, which is annoying
if you're not using C. Non C languages are clearly second class citizens on
BSD.

Linux on the other hand does have a stable documented syscall interface, so
you are allowed to call the kernel directly, without relying on any OS
provided usermode libraries.

On Windows it's similar to BSD, but you have to go through different OS
provided libraries (at minimum ntdll, usually the Win32 API). At least these
have a stable ABI. On Windows libc is little more than just another dll.

> but making this a requirement seems heading the wrong direction.

I disagree. IMO an OS provided dynamically linked library is a great
abstraction boundary. It allows the OS to change its kernel interface or even
turn operations that previously needed to transition into the kernel into
purely usermode functions.

For example the OS might provide a mutex which in the first version
transitions to the kernel on every lock attempt, while later it's replaced by
a futex based implementation which only transitions on contended locks.
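
A rough single-threaded model of that difference, counting how often each design would enter the kernel. The class names are made up for illustration, the plain `if` stands in for an atomic compare-and-swap, and a failed `try_lock` marks the point where a real thread would go to sleep.

```python
class KernelMutex:
    """First version: every lock attempt and unlock is a kernel transition."""

    def __init__(self):
        self.locked = False
        self.kernel_entries = 0

    def try_lock(self):
        self.kernel_entries += 1
        if self.locked:
            return False
        self.locked = True
        return True

    def unlock(self):
        self.kernel_entries += 1
        self.locked = False


class FutexStyleMutex:
    """Later version: user-space fast path, kernel only under contention."""

    def __init__(self):
        self.state = 0  # 0 = free, 1 = locked, 2 = locked with waiters
        self.kernel_entries = 0

    def try_lock(self):
        if self.state == 0:  # stands in for an atomic compare-and-swap
            self.state = 1
            return True
        self.state = 2
        self.kernel_entries += 1  # a real thread would do a futex wait here
        return False

    def unlock(self):
        had_waiters = self.state == 2
        self.state = 0
        if had_waiters:
            self.kernel_entries += 1  # a real thread would do a futex wake here


def kernel_entries_for(mutex, ops):
    """Replay a string of operations: 'L' = lock attempt, 'U' = unlock."""
    for op in ops:
        if op == "L":
            mutex.try_lock()
        else:
            mutex.unlock()
    return mutex.kernel_entries
```

Callers see the same `try_lock`/`unlock` surface in both versions, which is the point: the library can swap implementations without the application noticing.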

Or more relevant to the linked article, it allows implementing the same API on
a completely different kernel (like Wine does), without increasing the
kernel's API surface.

~~~
rstuart4133
> IMO an OS provided dynamically linked library is a great abstraction
> boundary.

In theory, yes that's true. In practice it's a question of whether it actually
reduces anybody's workload. If everybody respected the boundary I'm sure it
would reduce Microsoft's workload.

But as has been reported repeatedly here, that isn't how it's worked out in
practice. Not only do applications not respect it, Microsoft finds itself in
the position of having to ensure some of those applications keep working on
newer kernel versions despite the fact they've wilfully violated the rules. So
we end up with the Windows GUI having to expose "compatibility levels", and
users having to deal with it. That doesn't sound like a win to me.

There are other ways of doing it that wouldn't suffer from that fate.
Trampolines (a page the kernel creates in user space that intercepts some
syscalls) are one way that Linux uses. I don't think Linux enforces it, but it
wouldn't be hard to insist that any syscall come from those pages. Do that and
you have your cake (low overhead user space mediating some calls), and get to
eat it too (you can safely assume everyone uses it).

~~~
ChrisSD
Except that the Windows applications that do use syscalls directly (i.e.
anti-cheat software) are intended to break when something changes. It's part of
their way of defending against tampering. It's also part of why they need to
be continually updated and why some old games won't work at all on newer OSes
until a patch is applied to hack out the anti-cheat engine.

If syscalls were stable, they'd just try to use some other means to obfuscate
their OS calls and ensure it's sufficiently fragile as to break easily if
something changes.

tl;dr the lack of stability is a feature not a bug for this use case.

------
kyberias
> Windows applications are increasingly executing system calls directly rather
> than going through the API

I didn't know this. What applications are doing this and why?

~~~
wtallis
Anti-cheating libraries used in games seem to be the primary culprits, and
they're doing it this way because their purpose is to break things while
resisting the user's attempts to make things work nicely.

~~~
mehrdadn
A bit off-topic, but do you happen to know if there is a way _on Windows_ to
intercept syscall instructions? (aside from dynamic
analysis/recompilation/etc.?) If there isn't, I assume that's why cheat
engines do this?

~~~
wtallis
I think it's inherent to the SYSCALL instruction that control gets transferred
directly to the kernel and ring 0, with no possibility of intercepting in
userspace. So on Windows or Linux, you need kernel help to redirect/intercept
syscalls.

~~~
mehrdadn
I asked this as a Windows question rather than an x86 question. I was
wondering if there's any Windows API that could help with this. (This includes
kernel APIs.)

~~~
monocasa
The PicoProcess model used by the first WSL worked by intercepting syscalls
(among other mechanisms). It wasn't generally available for consumption by
developers other than Microsoft, though.

The gnarlier antivirus will do a similar thing by just patching the IDT and
syscall MSRs, but that's a bit frowned upon.

------
badrabbit
Interesting. I would also look into whether the usermode-helper function could
be used to re-map syscalls and proxy them through a wrapper library, just like
LD_PRELOAD but done at the kernel level. After this, every process would load
the wrapper lib (not just every glibc process), much like Windows programs load
user32/kernel32.dll.

Is this feasible or am I misunderstanding the feature?:

[https://kernelnewbies.org/KernelProjects/usermode-helper-enh...](https://kernelnewbies.org/KernelProjects/usermode-helper-enhancements)

[https://elixir.bootlin.com/linux/latest/source/security/Kcon...](https://elixir.bootlin.com/linux/latest/source/security/Kconfig#L198)

~~~
cryptonector
Oh, nice, so start an app that doesn't use libc (e.g., a golang app) using
ptrace(2) to control it, inject code to intercept system calls, mark the app's
text as requiring syscall dispatch, then stop ptracing. Presto: LD_PRELOAD for
libc-non-using apps. Granted, to make this easy to use the interceptor should
be able to use normal LD_PRELOAD objects, which means bootstrapping ld.so and
libc, which is going to be very difficult to do.

~~~
monocasa
This is how User Mode Linux and gvisor work more or less (albeit with the
'injected' code running in another process entirely). It's also how a kernel
I'm writing works in a port that works like user mode linux (in addition to
the ports on native hardware).

The issue is that ptrace is unreasonably slow (orders of magnitude slower than
just regular syscalls), so they'd want the fast normal path to not go through
ptrace. But they don't have a way to communicate that to the kernel at the
moment.

~~~
cryptonector
And this scheme gets you a fast path.

------
Ericson2314
If the first comment,
[https://lwn.net/Articles/824494/](https://lwn.net/Articles/824494/), is
correct, and this can be made secure, I'm excited that this could mean a much
more maintainable/easy-to-upstream version of Capsicum/CloudABI.

~~~
trasz
Not really. Capsicum provides an actual security architecture
(capability-based security), as opposed to an ad-hoc bandaid mechanism.

~~~
Ericson2314
Yeah to be clear I would rather see a proper implementation. From the LWN
thread I thought Linux didn't have a personality system so this might be a
stop-gap way to make a lighter weight implementation, but somebody commented
that it in fact does (if barely), so that should be used instead.

------
jhallenworld
Linux used to be able to run Xenix and SCO UNIX applications- I have not tried
this in years, but I'm wondering if the mechanism can be shared such that any
OS can be emulated.

~~~
rstuart4133
Linux still can:
[https://sourceforge.net/projects/ibcs-us/](https://sourceforge.net/projects/ibcs-us/)

That's my project. I also did the port of iBCS to amd64, which is what ibcs-us
is based on.

I moved the support out of the kernel because the rate of change of the
internal kernel APIs makes maintaining invasive kernel modules like this
just far too much work. Every time I looked at the kernel the way syscalls
worked internally changed in some way, and broke ibcs64 as a consequence.

Decades ago Linux had full on support for other syscall interfaces via its
personality mechanism. See personality(2). Most of the functionality has been
ripped out as kernel internal API evolved. I'm guessing maintainers noticed
the personality stuff wasn't used and so it was easier to delete it than port
it to their new shiny API.

At the time I thought that would be a disaster for iBCS. Now after porting
ibcs64 to user space, I think personality(2) was the wrong way to do it. Turns
out iBCS is mostly glue that emulates foreign syscalls using Linux
equivalents, and as ibcs-us demonstrates that kernel space glue runs equally
well in user space. It should not be in the kernel, where it bloats things and
could introduce security holes.

Once you realise the glue code can run anywhere, there is only one problematic
area remaining: directing the foreign syscalls to your user space glue code.
Here ibcs-us could really use some kernel help to make it both
fast and safe, but it's a use case that apparently hasn't crossed the Linux
kernel devs' minds.

This is exactly the same problem the Wine developers are having, and are
trying to solve in these LWN articles. To me their solution looks like a
horrible kludge. That isn't a criticism. I don't see how anything that tries
to make minimal changes to the kernel is going to be anything other than a
kludge.

If you want to do it properly, you need to add some mechanism to the kernel
that lets user space redirect whatever syscall method is in use to a user
space trampoline page. There are any number of syscall methods out there:
int's, sysenter's, lcalls. The kernel has to provide a way of trapping each
and redirecting them to the trampoline, but once they are there they are the
emulator's problem. Mostly. The emulator then needs a second mechanism to make
the real Linux kernel syscall that isn't redirected. One way to do that is to
say "syscalls from within this address range aren't to be emulated". I’m
guessing that’s what OpenBSD does now, but replace “aren’t to be emulated”
with “are allowed”.
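
A minimal sketch of that two-part mechanism as a pure model, not real kernel code. It assumes one exempt address range per process; the `Process` fields, the `glue` function and the "foreign_*" call names are all hypothetical, and calls are identified by name rather than number to avoid implying any real syscall table.

```python
from dataclasses import dataclass


@dataclass
class Process:
    native_start: int  # start of the range allowed to make real syscalls
    native_end: int    # end of that range (exclusive)


# Hypothetical mapping from foreign calls to their Linux equivalents.
FOREIGN_TO_NATIVE = {"foreign_write": "write", "foreign_exit": "exit"}


def on_syscall(proc, caller_addr, name, args):
    """Kernel side: decide where a trapped syscall goes based on its origin."""
    if proc.native_start <= caller_addr < proc.native_end:
        # Issued from inside the exempt range: not to be emulated.
        return ("native", name, args)
    # Anything else is redirected to the user-space trampoline.
    return glue(proc, name, args)


def glue(proc, name, args):
    """Emulator side: translate, then re-issue from inside the exempt range."""
    translated = FOREIGN_TO_NATIVE[name]
    return on_syscall(proc, proc.native_start, translated, args)
```

The same range check reads as "are allowed" instead of "aren't to be emulated" if the fallback branch rejects the call rather than redirecting it.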

This could be thought of as a way of the kernel providing a way to virtualise
its own syscall interface. There is no need to stop at one level either, as
there is no good reason one virtualised interface shouldn't end in
another trampoline, unbeknownst to it. So a Xenix 286 emulator could be
running inside of a Xenix 386 emulator running on Linux. It would be lovely
for ibcs-us of course, but LD_PRELOAD's, virtual machines and yes even Wine
all want to do the same thing, so it is a broadly applicable use case.
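
To make the nesting concrete, here is a small model where each layer only knows how to translate its own calls and hand them to whatever sits below. The layer names follow the example above, but the call names ("x286_write" and so on) are invented for illustration and do not correspond to real Xenix syscalls.

```python
def make_layer(name, translation, lower):
    """Build a trampoline that translates a call and passes it downward."""

    def handle(call, trace):
        trace.append(name)
        # Calls this layer does not recognise pass through unchanged.
        return lower(translation.get(call, call), trace)

    return handle


def linux(call, trace):
    """Bottom of the stack: the real kernel services the call."""
    trace.append("linux")
    return call


# Each layer is unaware of whether the one below is a kernel or an emulator.
xenix386 = make_layer("xenix386", {"x386_write": "write"}, linux)
xenix286 = make_layer("xenix286", {"x286_write": "x386_write"}, xenix386)


def run_nested(call):
    """Send one call through the whole stack and record which layers saw it."""
    trace = []
    result = xenix286(call, trace)
    return result, trace
```

Calling `run_nested("x286_write")` passes through both emulators before reaching the bottom layer as a plain `write`.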

~~~
jhallenworld
So I remember a big use case for SCO UNIX was small governments (like at the
town level). Do you still see such users of iBCS?

Anyway, this seems like a very good idea. It provides a very general
light-weight VM mechanism.

~~~
rstuart4133
> So I remember a big use case for SCO UNIX was small governments (like at the
> town level). Do you still see such users of iBCS?

I don't know who they are, which is the normal situation in the open source
world. The few I've interacted with seem to use Informix. I, well the company
I work for, is the only exception I've met to that.

