
Tutorial – Write a System Call - zerognowl
https://brennan.io/2016/11/14/kernel-dev-ep3/
======
warriorkitty
I decided to read ep1[0] too and I saw a picture "use all the memory". I don't
know if it's funnier that I checked if you have an "alt" HTML tag or that you
actually wrote the text from the picture. People with alt tags are MVPs. :)

[0] - [https://brennan.io/2016/10/13/kernel-dev-ep1/](https://brennan.io/2016/10/13/kernel-dev-ep1/)

~~~
phillc73
Do you remember when mousing over a picture would show the alt tag in
something resembling a tooltip? (Maybe it can still be set like this, but
isn't default in Chrome or Firefox anymore.) I recall being quite amused
sometimes at what people would write as their alt text.

~~~
prashnts
I think the `title` attribute is used to show the tooltips.

> I recall being quite amused sometimes at what people would write as their
> alt text.

XKCD :)

------
thirdreplicator
Thoroughly enjoyed the tutorial, but why would one want to make a custom
system call? What superpowers does this give you? Thanks in advance for your
answers.

~~~
geofft
It's your best interface with the kernel. It's simple and high-performance.
It's specifically what you want if you want to pass structured data in-memory
to the kernel.

In a strict technical sense, there's nothing you _need_ a syscall for; you can
just read/write data (or maybe do an ioctl) on a new device node or something.
In fact, OpenAFS supports routing its "syscall" on Linux through ioctls on
/proc/fs/openafs/syscall, because Linux makes it deliberately annoying to
patch the syscall table from a kernel module so as to make life harder for
rootkits.

However, it's simpler to pass data structures if you can use a syscall. It's
much higher-performance than opening a file node. And if you expect to run in
an environment where you don't know if a particular file will exist (e.g., a
chroot), it's useful to use a syscall directly, because that's always
available. For instance, getrandom was added in July 2014 partly for this
reason, and partly so that if you ran out of file descriptors to open
/dev/urandom you could still get randomness.
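
To make the getrandom point concrete, here is a rough sketch in Python (my choice of language for illustration, not something from the tutorial). It assumes Linux and a Python new enough to expose `os.getrandom`; the helper names are made up for this example.

```python
import os


def random_bytes_via_syscall(n):
    """Ask the kernel directly with getrandom(2); no path and no fd involved."""
    return os.getrandom(n)


def random_bytes_via_device(n, path="/dev/urandom"):
    """Older route: needs the device node to exist and a free file descriptor."""
    # os.open can fail with ENOENT in a chroot that lacks /dev,
    # or with EMFILE when the process has run out of descriptors.
    fd = os.open(path, os.O_RDONLY)
    try:
        return os.read(fd, n)
    finally:
        os.close(fd)


def random_bytes(n):
    """Prefer the syscall and fall back to the device node if it is unavailable."""
    try:
        return random_bytes_via_syscall(n)
    except (AttributeError, OSError):
        # AttributeError: this Python/platform has no os.getrandom.
        # OSError: e.g. ENOSYS on a kernel that predates the syscall.
        return random_bytes_via_device(n)
```

The syscall path has one failure mode to think about, while the device path has three separate calls that can each fail.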

Here are all the syscalls added in the last two years:

* pkey_mprotect, pkey_alloc, pkey_free: support for a new Intel processor feature, Memory Protection Keys [https://lwn.net/Articles/643797/](https://lwn.net/Articles/643797/)

* preadv2, pwritev2: add a flags argument so you can do a non-blocking preadv or pwritev without opening the file in non-blocking mode [https://lwn.net/Articles/670231/](https://lwn.net/Articles/670231/)

* copy_file_range: copy data between two file descriptors, using filesystem support for efficient copies if possible [https://lwn.net/Articles/659523/](https://lwn.net/Articles/659523/)

* mlock2: add a flags argument so you can mlock memory when it's next accessed [https://lwn.net/Articles/650538/](https://lwn.net/Articles/650538/)

* membarrier: force a memory barrier on all running threads to help with userspace RCU, garbage collection, etc. [http://man7.org/linux/man-pages/man2/membarrier.2.html](http://man7.org/linux/man-pages/man2/membarrier.2.html)

* userfaultfd: implement userspace paging [https://www.kernel.org/doc/Documentation/vm/userfaultfd.txt](https://www.kernel.org/doc/Documentation/vm/userfaultfd.txt)

* execveat: a version of execve that takes a file descriptor (or a fd and relative path) instead of a string to execute [http://man7.org/linux/man-pages/man2/execveat.2.html](http://man7.org/linux/man-pages/man2/execveat.2.html)
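
As a small illustration of the "add a flags argument" pattern in the preadv2 entry, here is a hedged Python sketch. I am assuming `os.RWF_NOWAIT` is the per-call flag that matches the non-blocking behaviour described, and that `os.preadv` is available (Linux, recent Python). `read_at` and `demo` are names invented for this example.

```python
import os
import tempfile


def read_at(fd, size, offset, flags=0):
    """Read up to `size` bytes at `offset`, passing per-call flags to preadv."""
    buf = bytearray(size)
    try:
        n = os.preadv(fd, [buf], offset, flags)
    except (OSError, NotImplementedError):
        if flags == 0:
            raise
        # The kernel, filesystem or Python build rejected the flag (or the
        # read would have blocked): retry as an ordinary blocking read.
        n = os.preadv(fd, [buf], offset)
    return bytes(buf[:n])


def demo():
    """Write a small file, then read part of it back with a per-call flag."""
    with tempfile.TemporaryFile() as f:
        f.write(b"hello, syscalls")
        f.flush()
        nowait = getattr(os, "RWF_NOWAIT", 0)  # 0 if this build lacks the flag
        return read_at(f.fileno(), 8, 7, nowait)
```

The file itself is opened in normal blocking mode; only this one call asks for different behaviour, which is what the extra argument buys you.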

------
eximius
Hmm... This is certainly very interesting. Can anyone think of any neat
kernel-only things that one might implement for kicks as a learning project?
Particularly for someone who hasn't done kernel programming? It could
definitely be a silly thing, but probably more useful than printing to the
kernel log.

~~~
jevinskie
Providing guaranteed access to random numbers has been a recent example of a
new, badly needed, but fairly simple syscall. With getrandom(), you avoid the
complexities of open/read/close and its associated error handling.

[https://lwn.net/Articles/606141/](https://lwn.net/Articles/606141/)

------
voltagex_
Great tutorial. Just a tip - if you change www.kernel.org to cdn.kernel.org
you'll get a closer mirror site.

~~~
brenns10
Oh very nice, thanks for the tip!

------
dezgeg
One correction to the strncpy_from_user part, specifically this:

> The process could try to read another process’s memory by giving a pointer
> that maps into another process’s address space.

This cannot happen: there is no such thing as "a pointer that maps into
another process's address space". A virtual address in Linux (on x86 and
probably almost all arches) accesses either the process's own memory map
(where access to unmapped addresses causes a fault even when done from ring 0)
or the kernel virtual mapping.
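
One way to see this from userspace is a quick fork experiment, sketched here in Python under the assumption of a Unix-like system with `os.fork`. The function name is made up. After the fork, parent and child hold the same numeric address, yet a write in the child never shows up in the parent, because each process resolves that address through its own memory map.

```python
import ctypes
import os


def same_address_different_memory():
    """Return (parent_addr, parent_value, child_addr, child_value)."""
    value = ctypes.c_int(1)
    parent_addr = ctypes.addressof(value)
    read_end, write_end = os.pipe()

    pid = os.fork()
    if pid == 0:
        # Child: same virtual address, but the page is now private to us.
        try:
            value.value = 2
            report = "%d %d" % (ctypes.addressof(value), value.value)
            os.write(write_end, report.encode())
        finally:
            os._exit(0)  # skip the parent's cleanup handlers in the child

    # Parent: collect the child's report and reap it.
    os.close(write_end)
    child_addr, child_value = map(int, os.read(read_end, 64).split())
    os.close(read_end)
    os.waitpid(pid, 0)
    return parent_addr, value.value, child_addr, child_value
```

The two addresses come back equal while the values differ, so a raw pointer value alone cannot name memory in another process.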

------
rogerb
Really cool tutorial, thanks for writing this up!

------
xenadu02
I thought Linux uses sysenter/sysexit, not int 0x80/iret?

Still a good tutorial; there is no magic, it's all just software.

~~~
conductor
You are right, Linux uses INT 0x80 on x86 only when the SYSCALL/SYSENTER and
SYSRET/SYSEXIT instructions are not available.
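
If you want to check which of these a given x86 machine advertises, here is a rough Python sketch that scans `/proc/cpuinfo`-style text. It assumes the x86 Linux flag names `sep` (SYSENTER/SYSEXIT) and `syscall` (SYSCALL/SYSRET); the function name and the sample text are made up for illustration, and other architectures report different fields.

```python
def syscall_instructions(cpuinfo_text):
    """List the fast system-call instruction pairs named in the CPU flags."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, colon, value = line.partition(":")
        if colon and key.strip() == "flags":
            flags.update(value.split())

    found = []
    if "sep" in flags:
        found.append("SYSENTER/SYSEXIT")
    if "syscall" in flags:
        found.append("SYSCALL/SYSRET")
    return found


# Hypothetical, shortened sample; a real flags line is much longer.
SAMPLE = (
    "processor\t: 0\n"
    "flags\t\t: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr syscall nx lm\n"
)
```

On a real machine you would pass it `open("/proc/cpuinfo").read()`. An empty list on x86 would suggest the INT 0x80 fallback is the only route.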

