
Writing a Simple Linux Kernel Module - daftpanda
https://blog.sourcerer.io/writing-a-simple-linux-kernel-module-d9dc3762c234
======
dottrap
"And finally, you’ll need to know at least some C. The C++ runtime is far too
large for the kernel, so writing bare metal C is essential."

That line reminded me that NetBSD added Lua for writing kernel modules.
([https://news.ycombinator.com/item?id=6562611](https://news.ycombinator.com/item?id=6562611))

Anybody have any experiences to share from this?

~~~
alxlaz
There are plenty of operating systems that use C++ on the kernel side of
things. Some things, like exceptions, are frowned upon and rarely used -- if
at all -- but there is no shortage of C++ kernel code. Just not in Linux.

When it comes to Linux, one could say that most reasons to avoid it are
historical, but this does not quite capture the awkward truth -- namely that,
for most of the kernel's lifetime (going back to 1991), C++ compilers simply
did not have the level of maturity and stability across the breadth of
platforms that Linux required. Linus Torvalds' stance on this matter is pretty
well-known:
[http://harmful.cat-v.org/software/c++/linus](http://harmful.cat-v.org/software/c++/linus)
.

Today, when x86-64 and ARM are the only two families you need to care about
for the next ten years or so ( _maybe_ RISC-V, but I rather doubt it), it
probably makes sense to look at C++ for operating systems work, but
the runtime is certainly heavier than back when Linus was writing about it,
too. A modern C++ compiler has a lot of baggage; C++ was huge back in 1998,
now it's bloody massive. IMHO, all the reasons why you would want to use C++
(templating support without resorting to strange hacks, useful pointer
semantics and so on) are reasonably well-served by cleaner languages with less
hefty runtimes, like Rust. What these alternatives do lack is the amazing
level of commercial support that C++ has.

~~~
jcelerier
> C++ was huge back in 1998, now it's bloody massive.

I don't think any "run-time" feature has been added since, though. It's all
either OS support (<thread>, etc., which you wouldn't use in-kernel anyway) or
template machinery that has zero impact at runtime (and actually sometimes
helps decrease code size).

[https://istarc.wordpress.com/2014/07/18/stm32f4-oop-with-embedded-systems/](https://istarc.wordpress.com/2014/07/18/stm32f4-oop-with-embedded-systems/)

[https://www.embedded.com/design/programming-languages-and-tools/4438660/Modern-C--in-embedded-systems---Part-1--Myth-and-Reality](https://www.embedded.com/design/programming-languages-and-tools/4438660/Modern-C--in-embedded-systems---Part-1--Myth-and-Reality)

[https://hackaday.com/2015/12/18/code-craft-embedding-c-templates/](https://hackaday.com/2015/12/18/code-craft-embedding-c-templates/)

If people are able to run C++ on 8 KB microcontrollers, there's hardly a
non-political reason it couldn't be used in-kernel.

See also IncludeOS: [http://www.includeos.org/](http://www.includeos.org/)

~~~
alxlaz
Additions have certainly been made since 1998 (things like smart pointers are
relatively new on this scale, as far as I know). Many runtimes for
resource-constrained embedded systems do not support all of C++'s features;
exceptions are the most common omission.

You can certainly strip things down to a subset that fits in 128K of flash and
needs only 1 or 2K of RAM at runtime, but the question is not only one of
computational resources used for the library itself. Additional code always
means additional bugs, the semantics sometimes "hide" memory copying or
dynamic allocation in ways that many C++ programmers do not understand (and
the ones who do are more expensive to hire than the ones who do not), and so
on. You can certainly avoid these things _and_ use C++, but you can also avoid
them by using C.

I agree that mistrust and politics definitely play the dominating role in this
affair though. I have seen good, solid, well-performing C++ code. I prefer C,
but largely due to a vicious circle effect -- C is the more common choice, so
I wrote more C code, so I know C better, so unless I have a good reason to
recommend or write C++ instead of C, I will recommend or write C instead. I do
think (possibly for the same reason) that it is harder to write correct C++
code than it is to write correct C code, but people have sent things to the
Moon and back using assembly language for very weird machines, so clearly
there are valid trade-offs that can be made, and they include using languages
far quirkier than C++.

~~~
cmrdporcupine
I absolutely do not understand your point. Anybody doing OS development in C++
is doing so with absolutely no C++ standard library support, same as if you
were using C++ to develop for your microcontroller. If C++ binaries are
compact enough for Arduino or Parallax Propeller development (<32KB RAM), they
are absolutely fine for kernel development.

The real answer is historical, and cultural. On the latter: Unix is a product
of C (well, and BCPL) and C is a product of Unix; the two are heavily
intertwined. The former is, as was mentioned, a product of the relative
crappiness of early C++ compilers (and perhaps the overzealous OO gung-ho
nature of its early adopters as well...)

C++ without exceptions, RTTI, etc. has a lot to offer for OS development.
Working within the right constraints it can definitely make a lot of tasks
easier and cleaner.

It won't happen in Linux, tho.

~~~
pjmlp
And let's not forget that C++ got created because Bjarne didn't want to touch
C after his experience going from Simula to BCPL, but it had to be compatible
with AT&T's official tooling.

------
chatmasta
Does anyone have a story or two of a time you’ve created a kernel module to
solve a problem? I would be interested in hearing real world use cases.

~~~
mbrumlow
I wrote one so I could keep my job...

I work at a company that provides backups for both Linux and Windows. The
entire concept was built around block-level backups. You could just open up
the block device and copy the data directly, but it would quickly become out
of sync by the time you finished copying it. We did not want to require LVM to
be able to use snapshots to solve the sync problem. On top of that, we had a
strong requirement of being able to delete data from the backup.

This resulted in me learning how to build a kernel module and then, slowly
over about 6 months, creating a kernel driver that allowed us to take
point-in-time snapshots of any mounted block device with any FS sitting on top.

Other requirements also dictated that we keep track of every block that
changed on the target block device after the initial backup (or full block
scan, after reboot).

I wish I could release the source but my employers would not like that :( So
at least for me, learning how to write kernel modules and digging into some of
the lower-level stuff has kept me gainfully employed over the years. It is
still in use on about 250k to 300k servers today (it fluctuates).

The hardest part was not writing the module, but getting others interested in
it enough so I don't have to be the sole maintainer. I like working on all
parts of the product and don't want to just be the "kernel guy".

One other time I wrote a very poorly done network block device driver in about
8 hours. You can find it here:
[https://github.com/mbrumlow/nbs](https://github.com/mbrumlow/nbs) -- Note I
am not proud of this code; it was something I did really quickly and wanted on
hand to show to a prospective employer. I did not get the job. I am also
fairly sure they did not even look at the driver, so I don't think the crappy
code there affected me.

~~~
throwaway613834
So you implemented something like Windows's volume snapshots/shadow copies?
(Why couldn't you use the existing feature?)

EDIT: Thanks for both replies! :)

~~~
mbrumlow
Yes, and no.

Windows Volume Shadow Copy has the advantage of being integrated a bit more
closely with the FS. So on Windows, VSS can avoid some overhead by skipping
the 'copy' part and just allocating a new block and updating the block list
for the file.

For the Linux systems we had the requirement to work with all file systems
(including FAT). So we could not simply modify the file system to do some
fancy accounting when data in the snapshot was about to be nuked. So that
resulted in me writing a module that sits between the FS and the real block
driver. From there I can delay a write request long enough to ensure I can
submit a read request for the same block (and wait for it to be fulfilled)
before allowing the write to pass through.

> (Why couldn't you use the existing feature?)

We did on Windows; VSS is used with a bit of fancy stuff added on top. For
Linux there is no VSS equivalent (other than the one I wrote, and maybe
something somebody working on a similar product may have written). And even if
one did come about (or exists and I am just not aware of it), it for sure was
not available when I started this project.

~~~
ploxiln
It sounds like you implemented something very similar to the LVM snapshot
feature. You didn't want to require LVM but ... is that really worse than
requiring your custom module which is roughly equivalent?

EDIT: OK, so it looks like the management of the snapshot space is a _bit_
different. Still, you could probably have wrapped LVM management enough to
make it palatable, in less time than it took to write a custom module.

~~~
mbrumlow
The problem is that requiring LVM is a no-go if they did not already have LVM.
We could have standardized on LVM, but then all the people who were not
already using it at the time would not be able to use the product. At the time
many hosting providers -- who we sold to -- just were not using LVM. To this
day we still have far fewer people using LVM than raw block devices, or some
sort of RAID.

Also, at the time, LVM snapshots were super slow. I don't have the numbers,
but even with the overhead my driver created I was able to have less impact on
system performance.

I was able to do some fancy stuff to optimize for some of the more popular
file systems by making inline look-ups into the allocation map (a bitmap on
ext3). This allowed me to skip COWing blocks that were not allocated before
the snapshot. This was a huge saving, because most of the time on ext3 your
writes will be to newly allocated blocks.

Wrapping LVM would probably not work, and would still require a custom module:
the user-space tools don't do much. LVM really is a block management system
that needs to manage the entire block device, so existing file systems not
sitting on top of LVM would get nuked if you attempted to let LVM start
managing those blocks, and you would still have the issue that reads and
writes were coming in on a different block device. Asking people to change
mount points was not an option.

There were also some other requirements, like block change tracking, which
LVM has no concept of. This is for incremental snapshots. Without this sort of
tracking you would have to checksum every block after every snapshot if you
wished to copy only the changes. This module was also responsible for
reporting back to a user-space daemon that kept a map of what blocks changed.
So when backup time arrived we could use this list (and a few other lists) to
create a master list of blocks that we needed to send back. This significantly
cuts down on incremental backup time. Some companies call this
"deduplication", but I feel that is disingenuous -- to me, deduplication is on
the storage side and would span across all backups.

So yes, requiring a module is much easier than telling a customer they can't
trial or use the product until they take their production system offline and
reformat it with LVM. Many people hated LVM at the time; it was considered
slow and caused performance problems. This was like 8 years ago... LVM has
vastly changed and does not draw these kinds of complaints any more. But I can
tell you people would still scream bloody murder if we told them they had to
redo their production images and redeploy a fleet of 200+ servers just to
switch to LVM so they could get a decent backup solution.

Also shout out to aseipp! Miss working with you. Have yet to find a bug in the
code you wrote :p

------
spapas82
Replying after checking my notes from university (NTUA): around 14 years ago
(~2004), one of the exercises we had in the Operating Systems Lab was to
actually implement a char device driver for Linux, called ... lunix, as a
kernel module. The actual device was just a bunch of bytes in memory.

Those times were so happy -- implementing a device driver felt like owning
the world for a student!

If anybody wants, I'd be happy to put the source code on GitHub -- mainly for
historical reasons, because I'm not sure the code would still work today.

~~~
abstractbeliefs
Please do. One of the issues with TFA is how little it actually covers. While
it quite fairly says module development is more like writing an API than an
application, it then proceeds to entirely ignore any hooks other than init and
exit!

~~~
spapas82
Here you go:

[https://github.com/spapas/lunix](https://github.com/spapas/lunix)

Note: I tried compiling it but was not able to (and I don't have time to
research modernizing this code).

------
ramzyo
>> A Linux kernel module is a piece of compiled binary code that is inserted
directly into the Linux kernel, running at ring 0, the lowest and least
protected ring of execution in the x86–64 processor.

The author seems to be implying that rings are implemented at the processor
level on x86-64. If I'm interpreting the wording correctly, that's
interesting! Coming from the ARM world, I'd always thought that rings were an
OS construct.

~~~
joosters
Most chips have different privilege levels... ARM processors have 'modes' like
User, FIQ, IRQ and Supervisor (sort of the equivalent to ring 0). Different
modes can have different access controls, e.g. varying access to memory.

Edit: See
[http://www.heyrick.co.uk/armwiki/Processor_modes](http://www.heyrick.co.uk/armwiki/Processor_modes)

~~~
ramzyo
Right - out of curiosity do you know if the Linux kernel uses the term "ring",
or some other terminology to map to the underlying hardware implementation, be
they rings for x86 or modes for ARM? Maybe it's just "privilege level" or
something similar?

~~~
JoshTriplett
The Linux kernel just talks about "kernel mode" and "user mode" (or
"userspace"). Those then map to ring 0 and ring 3 on x86, or other mechanisms
on other platforms.

~~~
ramzyo
Cool, thanks!

------
djohnston
Kernel newb w/ a question on this. When you register the device with

    register_chrdev

you also have to create the file in userspace with

    mknod

?

~~~
JoshTriplett
You never need to use mknod yourself anymore. Based on the metadata you
provide to the kernel, the kernel will create the device itself in devtmpfs,
and even set appropriate baseline permissions (e.g. root-only, world-readable,
world-writable). udev will then handle any additional permissions you might
need, such as ACLs or groups.

~~~
anitil
Out of historical interest, why was it necessary before? Was it an intentional
separation between userspace/kernel responsibilities?

~~~
JoshTriplett
A few reasons. Prior to the existence of udev, all devices had to be created
with mknod and have permissions set; a script called "MAKEDEV" did that. That
didn't handle dynamically created devices, so (skipping "hotplug" and an old
attempt called "devfs" which was filled with race conditions) the kernel
started sending events to userspace for dynamic devices, and udev created
those. Then, it made sense to "coldplug" all the existing devices and have
udev create those, to use the same unified path for everything, and use udev
to set groups and permissions. But that did take some time at boot time, and
it meant that you needed either udev or a manually created /dev for tiny
embedded systems. devtmpfs handled that use case, automatically creating
devices from the kernel. But then, if the kernel knows how to create devices,
why not have it do so all the time? So, udev jettisoned the code to create
devices itself, and started requiring devtmpfs.

------
vagab0nd
netcat released their album as a Linux kernel module [0]. Browsing their code
might be a more fun way to learn how to write a kernel module, with a real
"use case".

0: [https://github.com/usrbinnc/netcat-cpi-kernel-module](https://github.com/usrbinnc/netcat-cpi-kernel-module)

------
qsdf38100
Kernel code does not operate at "incredible speed". This guy doesn't know what
he is talking about.

~~~
madez
You are correct about the speed, but the short response combined with the
unjustified, overly general attack on the author's knowledge is unnecessary.

~~~
qsdf38100
This is somewhat unnecessary, but this article is spreading a false claim
about kernel mode's "incredible speed", and this should be denounced.

------
fmela
In the device_read function,

    
    
    len — ;

should be

    len--;

~~~
megous
Probably a CMS thing. printk is probably also discouraged these days;
pr_warn/info/..() or dev_warn/info/err...() should be used instead.

Also, for anyone writing kernel code, this is indispensable:
[http://elixir.free-electrons.com/linux/latest/source](http://elixir.free-electrons.com/linux/latest/source)

~~~
uep
It doesn't exactly apply here, since the author doesn't seem to be allocating
resources dynamically, but another thing that's "newer" and less widely used
is the devm_ family of functions.

Nowadays, lots of code could be using devm_kmalloc, devm_ioremap, etc., which
release the resources automatically when the driver detaches from the device.

