Unix Syscalls (john-millikin.com)
287 points by rspivak 6 months ago | 13 comments



There is a great deal more information about how syscalls are implemented on Linux in this excellent free GitHub book, linux-insides:

https://github.com/0xAX/linux-insides/tree/master/SysCall

I can highly recommend it to anyone who wants to learn more about the kernel. This book is particularly great because it does not assume any kernel programming background - a superficial understanding of C should be enough.



It appears that Linux x86-64 does not support more than 6 arguments for syscalls[0]. So if you want to pass more than 6 arguments, use a struct... or modify nearly every step of the syscall process in the kernel.

[0]: https://elixir.bootlin.com/linux/v4.18-rc8/source/arch/x86/i...
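To make the six-argument limit concrete, here is a minimal sketch that assumes Linux x86-64 and glibc's generic syscall() wrapper. mmap is one of the calls that uses all six slots. The helper names raw_mmap_anon and demo_raw_mmap are made up for illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: map `len` bytes of anonymous memory through the raw
 * mmap syscall. Assumes Linux x86-64, where SYS_mmap takes six scalar
 * arguments. Each argument is cast to a 64-bit type because syscall() is
 * variadic and reads full registers. */
void *raw_mmap_anon(size_t len)
{
    assert(len > 0);
    long ret = syscall(SYS_mmap,
                       NULL,                                /* 1: address hint */
                       len,                                 /* 2: length       */
                       (long)(PROT_READ | PROT_WRITE),      /* 3: protection   */
                       (long)(MAP_PRIVATE | MAP_ANONYMOUS), /* 4: flags        */
                       (long)-1,                            /* 5: fd (unused)  */
                       (long)0);                            /* 6: offset       */
    /* glibc's syscall() returns -1 and sets errno on failure. */
    return ret == -1 ? MAP_FAILED : (void *)ret;
}

/* Usage: map a page, write a byte, read it back, unmap.
 * Returns the byte read, or -1 if the mapping failed. */
int demo_raw_mmap(void)
{
    char *p = raw_mmap_anon(4096);
    if (p == MAP_FAILED)
        return -1;
    p[0] = 42;
    int v = p[0];
    munmap(p, 4096);
    return v;
}
```

A call that needs more inputs than this has to bundle them behind a pointer, which still occupies only one of the six slots.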


It’s because args to syscalls are passed in registers rather than the stack. This is a security mechanism I believe, but I’m mostly guessing based on xv6.

Basically, if you want a kernel space and a user space, you have to ensure users can’t breach kernel space. But this is the part where my logic runs dry: could a malicious caller control the return address that’s pushed to the stack? If so, could you redirect the kernel’s execution to an arbitrary physical address? Or does the kernel switch back into user mode just before calling RET?

Sigh... time to re-read xv6. I think interrupts are involved.


> It’s because args to syscalls are passed in registers rather than the stack. This is a security mechanism I believe, but I’m mostly guessing based on xv6.

It's probably for speed reasons. Marshaling arguments from user space is expensive because of all the checks you have to make to keep the user from crashing the kernel.

> Basically, if you want a kernel space and a user space, you have to ensure users can’t breach kernel space. But this is the part where my logic runs dry: could a malicious caller control the return address that’s pushed to the stack? If so, could you redirect the kernel’s execution to an arbitrary physical address? Or does the kernel switch back into user mode just before calling RET?

Return from interrupt uses the special iret instruction. That makes sure that the return happens in a user context if need be by atomically setting the flags and ip registers at the same time.


Yes, exactly this. Once upon a time (V6/V7 on the PDP-11), when I was younger, syscall parameters were on the stack. I worked for a company that ported Unix to various CPUs/MMUs; we'd knock one out every 6 weeks or so. On some MMUs, accessing user space from kernel space (safely) was extremely slow. We discovered that passing syscall parameters on the stack was a real performance hog, and benchmarking showed that passing them in registers was far faster on all systems. Our systems supported both sorts of system calls. When I wrote the original 68k System V ABI, I included register passing as the default.


I'm learning PDP-11 assembly and would like to play around with some OS stuff. This was very helpful to know. Thanks.


To be utterly pedantic, on Intel Linux only the "legacy" 32-bit int 0x80 mechanism uses iret to return.

32-bit "fast" syscalls use sysenter/sysexit. 64-bit "fast" syscalls use syscall/sysret.

I haven't really looked, but I suspect sysexit and sysret are somewhat special-cased versions of iret.

https://blog.packagecloud.io/eng/2016/04/05/the-definitive-g...


In x86 and x86_64 it doesn't matter whether the caller has control or not over the return address, because there's a change of privilege when the kernel returns from the interrupt (IRET instruction). So at that point it would be equivalent for the userspace app to just jump to whatever address it wants.

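The caller does not have control over the return address. When an int n instruction is executed, it's the processor that pushes the current context onto the kernel stack (pointed to by ss0:esp0), so when you run iret, everything goes back to normal. The syscall instruction pushes nothing itself: the processor saves the return address in rcx and the flags in r11, and the kernel then stores them on its own stack.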

Even if the caller had control over this return address, the CR3 does not change [without taking KPTI into consideration], so the memory mappings will still be the same, and everything would be handled with paging enabled, so there's no "arbitrary physical address". You would only be allowed to jump to anything that you have already mapped, and given that there's a privilege change, you would only be able to access userspace memory.

This has nothing to do with whether the syscall parameters are passed on the stack or not. On x86 and x86_64, when you make a syscall and the kernel handles it, the stacks change, so if you were to pass parameters via the stack, the kernel would need to read the userspace stack, which sounds like a mess (though it is possible). The registers, on the other hand, are directly available to the syscall handler, so it's easier to just put the parameters there.
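For anyone who wants to see the register convention spelled out, here is a rough sketch that assumes Linux x86-64 and GCC/Clang extended inline assembly. raw_syscall6 and check_against_libc are made-up names, not part of any library. The syscall number goes in rax, the arguments in rdi, rsi, rdx, r10, r8 and r9, and the instruction itself overwrites rcx and r11.

```c
#include <assert.h>
#include <errno.h>
#include <sys/syscall.h>   /* SYS_* numbers */
#include <unistd.h>        /* getpid(), only for the comparison below */

/* Hypothetical helper: issue a raw syscall with up to six arguments.
 * Assumes Linux x86-64. Unused argument slots can simply be passed as 0. */
long raw_syscall6(long nr, long a1, long a2, long a3,
                  long a4, long a5, long a6)
{
    long ret;
    /* r10, r8 and r9 have no single-letter constraints, so pin them with
     * explicit register variables. */
    register long r10 __asm__("r10") = a4;
    register long r8  __asm__("r8")  = a5;
    register long r9  __asm__("r9")  = a6;

    __asm__ volatile("syscall"
                     : "=a"(ret)                      /* result comes back in rax */
                     : "a"(nr), "D"(a1), "S"(a2), "d"(a3),
                       "r"(r10), "r"(r8), "r"(r9)
                     : "rcx", "r11", "memory");       /* clobbered by syscall */
    return ret;   /* the kernel reports errors as a negative errno value */
}

/* Usage: the raw path should agree with libc, and a bad fd should come back
 * as -EBADF rather than as -1 plus errno. */
void check_against_libc(void)
{
    assert(raw_syscall6(SYS_getpid, 0, 0, 0, 0, 0, 0) == (long)getpid());
    assert(raw_syscall6(SYS_close, -1, 0, 0, 0, 0, 0) == -EBADF);
}
```

Note that the fourth argument travels in r10 rather than rcx (where the userspace calling convention would put it), precisely because syscall overwrites rcx with the return address.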


It depends on the exact architecture; on x86 the exact logic of the INT instruction is quite involved (see e.g. https://x86.puri.sm/html/file_module_x86_id_142.html for details) but when changing privilege levels, the CPU automatically switches stacks too.


Is there a nice xv6 equivalent for x86_64? How do MIT students learn about 64-bit arch?

https://aaronbloomfield.github.io/pdr/book/x86-64bit-ccc-cha...

This was good, but it leaves a lot out. No mention of kernel space.


You will probably find an x86_64 port of xv6 on GitHub. IMHO, there is nothing terribly special about x86_64. The goal of xv6 is not to teach 64-bit computers, but to cover operating system basics (primarily multitasking, virtual memory and filesystems).


> if you want to pass more than 6 arguments

Isn't the number of arguments already determined by the nature of the syscall? Are you talking about a situation where one creates new syscalls?



