
Snap: A Microkernel Approach to Host Networking - sandGorgon
https://ai.google/research/pubs/pub48630/
======
nimrody
"While early microkernel work saw significant performance overheads attributed
to inter-process communication (IPC) and address space changes [15,16,20], such
overheads are less significant today. Compared to the uniprocessor systems of
the 80s and 90s, today’s servers contain dozens of cores, which allows
microkernel invocation to leverage inter-core IPC while maintaining application
cache locality. This approach can even improve overall performance when there
is little state to communicate across the IPC (common in zero-copy networking)
and in avoiding ring switch costs of system calls. Moreover, recent security
vulnerabilities such as Meltdown [43] force kernel/user address space
isolation, even in monolithic kernels [25]. Techniques like tagged-TLB
support in modern processors, streamlined ring switching hardware made necessary
with the resurgence of virtualization, and IPC optimization techniques such as
those explored in the L4 microkernel [29], in FlexSC [58], and in SkyBridge
[44], further allow a modern microkernel to essentially close the performance
gap between direct system calls and indirect system calls through IPC."

~~~
snvzz
You're right to quote some text that addresses the idea that "microkernels are
slow", which keeps popping up.

To that, I add this article[0], which is sufficient to destroy that idea.

[0][https://blog.darknedgy.net/technology/2016/01/01/0/](https://blog.darknedgy.net/technology/2016/01/01/0/)

~~~
zaarn
IPC is a tad slower than not IPC.

However, stuff like WebAssembly and other sandboxing methods can be used to
put two processes into the same address space. Your filesystem driver
then simply lives as a module in that address space and it's a normal process. The
IPC turns into a simple jump using a pointer value provided by the kernel
(which, depending on the trust level, can provide parameter validation or can be
a plain pointer to the correct function).
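
A minimal sketch of that idea, assuming a toy "kernel" lookup that hands out either the raw entry point or a validating wrapper depending on trust. The names `fs_read`, `validated` and `resolve_entry` are hypothetical and only illustrate the shape of the call path; nothing here is a real kernel or WebAssembly API.

```python
from typing import Callable

Entry = Callable[[str, int], str]


def fs_read(path: str, size: int) -> str:
    """Hypothetical driver entry point living in the shared address space."""
    return f"read {size} bytes from {path}"


def validated(fn: Entry) -> Entry:
    """Wrap an entry point with the parameter checks an untrusted caller needs."""
    def gate(path: str, size: int) -> str:
        if not isinstance(path, str) or not path.startswith("/"):
            raise ValueError("path must be an absolute path string")
        if not isinstance(size, int) or size < 0:
            raise ValueError("size must be a non-negative integer")
        return fn(path, size)
    return gate


def resolve_entry(trusted: bool) -> Entry:
    """Hypothetical kernel lookup: return the callable the caller jumps to."""
    # Trusted callers get the plain function; everyone else gets the gate.
    return fs_read if trusted else validated(fs_read)


direct = resolve_entry(trusted=True)    # plain pointer to the function
checked = resolve_entry(trusted=False)  # same function behind validation
result = checked("/etc/hosts", 16)
```

Either way the caller ends up with something it can call directly; the only difference is whether a validation layer sits in front of the driver code.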

~~~
hvidgaard
How would that work? The entire premise of microkernels is that everything
runs in a dedicated process, such that if one crashes, the system can
continue all other processes and only needs to recover that single process.

The cost of switching processes is not a problem of address space; the MMU
makes sure that is not a problem, as shared memory is a reasonable tradeoff for
certain domains where performance is necessary. Normally you'd rather want to use
messages, but that is a different topic.

In either case, the cost of a process switch is handling the registers and
cache - that will not go away no matter how you do it, which is why multicore
implementation with messages can actually turn out being faster. Less
switching and more locality.

~~~
zaarn
Well, with WebAssembly the process can run under ring0 but still be isolated
as if it was running in ring3. The crash mechanism isn't different; if the
driver crashes then the kernel can kill it and deallocate its resources, a
device manager process can then restart the driver. The advantage is that you
can include small, audited and verified binaries that can run code as ring 0
to remove abstraction from accessing the raw hardware.

The cost of switching processes is significantly reduced when you don't need
to switch privilege levels. Not having to invalidate the TLB, and not
having to change address space at all, makes a context switch not
significantly more expensive than a function call.

You get compile-time isolation and can still take advantage of the MMU when
needed.

And using such an implementation does not prevent you from implementing your
driver such that it can run on each core and take advantage of it or even
pass messages. An ethernet driver could still, for example, pass a message
when the "send data to tcp socket" function is called while allowing another
program to use the same function without any message passing, depending on
what is better for your use case.

~~~
hvidgaard
I'm not sure I believe it's a good idea to give up on the hardware protection,
as that leaves it to the software implementation to ensure it's secure. If you
compromise an application in user mode, it will not get you far without
another exploit in something that runs in supervisor mode. The hardware makes
that certain, and it's well verified. We've seen time and time again that
simple buffer overflows exploit things, and the more that runs in supervisor
mode, the larger the attack surface is.

If a driver runs in user mode, an exploit needs to exploit the hardware as
well - and that is for all intents and purposes something that we see very
rarely.

If the same driver runs in "software user mode", but executing as supervisor
(basically inside a VM environment), we need constant security checks in
software, and an exploit now has the VM code to further exploit; if
successful, that will automatically grant it supervisor access.

In both cases it's assumed that neither implementation has access to more
interfaces than necessary for it to do its work. For instance, a driver for a
mouse does not need access to the disks.

~~~
zaarn
The thing is, you don't have to give up hardware protection. If you don't
trust code, you can still run it in ring 3 with all the associated overhead.
The point is being able to choose how close an application runs to "no
overhead" until you're at a level where the driver is a function call away.

From my experience, a lot of hardware is terribly insecure against exploits.
Not necessarily the CPU but stuff like your GPU or HBAs, ethernet cards, etc.

With software containment, the advantage is that you can set it up so that
drivers need to declare their interfaces and privileges beforehand. In an ELF
or WASM binary you have to declare imported and linked functions, so it should not be
difficult to leverage that to determine what a driver can effectively do. With
WASM you get the added benefit that doing anything but using declared
interfaces results in a compile-time error.
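
As a rough sketch of what such a load-time check could look like: the import lists below are plain sets of made-up names standing in for whatever a loader would read out of a binary's import table. `ungranted_imports`, `may_load` and the interface names are hypothetical, and this does not parse real ELF or WASM files.

```python
from typing import FrozenSet


def ungranted_imports(declared: FrozenSet[str],
                      granted: FrozenSet[str]) -> FrozenSet[str]:
    """Return the imports a driver declares that the loader has not granted."""
    return declared - granted


def may_load(declared: FrozenSet[str], granted: FrozenSet[str]) -> bool:
    """A driver may load only if every declared import is granted."""
    return not ungranted_imports(declared, granted)


# Hypothetical interface names; a mouse driver should not need disk access.
MOUSE_GRANTS = frozenset({"hid.read_report", "irq.wait"})
mouse_driver = frozenset({"hid.read_report", "irq.wait"})
rogue_driver = frozenset({"hid.read_report", "disk.read_block"})

mouse_ok = may_load(mouse_driver, MOUSE_GRANTS)
rogue_ok = may_load(rogue_driver, MOUSE_GRANTS)
rogue_extra = ungranted_imports(rogue_driver, MOUSE_GRANTS)
```

The check itself is a set difference; the hard part in practice would be making sure the declared list really is the only way the driver can reach the outside world.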

A driver can be written so that a minimal, audited interface exists to talk to
the hardware almost directly with some security checks, while the WASM part
handles the larger logic parts and provides the actual functionality.

WASM isn't a supervisor, so exploits on VM code aren't that relevant.
Exploiting the WASM compiler/interpreter/JIT is more interesting, but those are
exposed to the daily internet shitstorm of exploits, so I think they are fairly
safe.

~~~
hvidgaard
I suppose it remains to be seen if someone can make a PoC. I'm skeptical but
ultimately I do not know enough to decide either way.

> it should not be difficult to ...

Famous last words.

------
Y_Y
There are already so many software projects called Snap.

~~~
jacquesm
For Google that's a non-issue, they will simply promote their own use of the
term over the previous uses.

~~~
tartrate
Even Google has a limit though, they could've named their programming language
"The".

~~~
jacquesm
I would say taking the '+' character and attempting to make it their own is
proof positive that they don't have limits.

------
mkj
Wonder if their MicroQuanta microsecond granularity Linux scheduler would be
useful for other near realtime things like audio or electronics interfacing. I
guess the low latency scheduler is a key part of making Snap work in
userspace.

[https://lkml.org/lkml/2019/9/6/177](https://lkml.org/lkml/2019/9/6/177)

------
sandGorgon
"A change to the kernel stack takes 1-2 months to deploy; a new Snap release
gets deployed on a weekly basis."

~~~
persistent
At many companies rolling out a new kernel in a month would be an outright
miracle.

------
persistent
The actual paper, where their TLS certificate is not yet expired, is
[https://storage.googleapis.com/pub-tools-public-
publication-...](https://storage.googleapis.com/pub-tools-public-publication-
data/pdf/36f0f9b41e969a00d75da7693571e988996c9f4c.pdf)

------
vkaku
Unpopular opinion ahead:

The deal with user facing libraries like this is that I'd rather they
generalize this and expose the existing network library drivers with a uring
interface, and the user processes can take care of packet decap the way they
want it.

Of course, helpful to have the stub to map the PCI BARs to userspace, and
hopefully without any message signalled interrupts.

These two alone may be hard to do with all the existing network drivers out
there. Engineering feats like these are good but not helpful to most people
unless they are simple and generic enough to work on most devices.

I hope the guys who wrote this take note and eventually layer out and open
source this library.

------
Gonzih
looks like certificate for ai.google expired today, what a coincidence!

~~~
njhartwell
And with HSTS, modern browsers will make sure you can't ignore it :) In the
meantime [https://storage.googleapis.com/pub-tools-public-
publication-...](https://storage.googleapis.com/pub-tools-public-publication-
data/pdf/36f0f9b41e969a00d75da7693571e988996c9f4c.pdf) works.

~~~
judge2020
Not sure about Firefox but in Chrome you can type "thisisunsafe" on the page
to bypass even HSTS (you can test at [https://subdomain.preloaded-
hsts.badssl.com/](https://subdomain.preloaded-hsts.badssl.com/) )

------
skywhopper
Oops, the TLS cert on this site expired just nine minutes ago.

~~~
ratsimihah
Is it still safe?

------
ZeroCool2u
"Snap has been running in production for over three years, supporting the
extensible communication needs of several large and critical systems."

Are there many systems using microkernels in production environments? I mean
at least at this, or a somewhat similar, scale?

3 years seems like a long time, while up until this moment microkernels seemed
fairly niche to me and something reserved for more experimental systems.

The only one I can think of off the top of my head is Fuchsia.

~~~
mnem
QNX has been around for quite a while and used across many areas:
[https://blackberry.qnx.com/en](https://blackberry.qnx.com/en)

Edit: The Wikipedia entry has a little less fluff than the official homepage:
[https://en.wikipedia.org/wiki/QNX](https://en.wikipedia.org/wiki/QNX)

------
beefhash
I'm surprised that microkernel-like IPC still hasn't found its way into any
*NIX system. The closest we've gotten is System V IPC, which has never really
taken off.

Is this just difficult to design well or are people genuinely okay with
socket(AF_UNIX, SOCK_SEQPACKET, 0)?
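
For reference, a small sketch of what that socket type gives you, assuming Linux (AF_UNIX with SOCK_SEQPACKET is not available on every platform). It uses Python's `socket.socketpair` to get two connected endpoints and shows that each `send` arrives as its own record rather than as a merged byte stream.

```python
import socket

# Two already-connected endpoints of a sequenced-packet Unix socket.
left, right = socket.socketpair(socket.AF_UNIX, socket.SOCK_SEQPACKET)

with left, right:
    # Two separate sends...
    left.send(b"first")
    left.send(b"second message")

    # ...come back as two separate records, even though the receive
    # buffer is large enough to hold both at once.
    first = right.recv(4096)
    second = right.recv(4096)
```

The connection is reliable and ordered like a stream socket, but message boundaries are preserved, so no length prefix or framing layer is needed on top.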

~~~
Matthias247
I don't think it's difficult to design an IPC mechanism. But I DO think it's
difficult to design an IPC mechanism that everyone is happy with. Some people
want only synchronous IPC, others also asynchronous (out of order responses),
some want multicast, some want events without associated request, some want
pub/sub, etc.

If at the end people continue building their own IPC mechanisms on top of
TCP/IP or unix domain sockets then the in-kernel mechanism will just be
another thing to maintain.

There have been some endeavors to bring new IPC mechanisms into the Linux
kernel (AF_BUS, KDBUS, BUS1). I think those failed for similar reasons
(although I'm not sure where BUS1 is now - the others are definitely
discontinued).

~~~
jacquesm
KISS: QNX MsgSend, MsgReceive, MsgReply. That's really all it takes.

[http://www.qnx.com/developers/docs/6.5.0/index.jsp?topic=%2F...](http://www.qnx.com/developers/docs/6.5.0/index.jsp?topic=%2Fcom.qnx.doc.neutrino_lib_ref%2Fm%2Fmsgsend.html)

You can go all the way from the lowest level kernel uses to very high level
application constructs with that. Re-inventing existing wheels badly is
something the software world excels at.
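
To illustrate the send/receive/reply pattern, here is a toy in-process simulation using threads and queues. It is only a sketch of the blocking semantics (the sender blocks until the receiver replies); `Channel`, `msg_send`, `msg_receive` and `msg_reply` are hypothetical names and not the QNX API or its signatures.

```python
import queue
import threading


class Channel:
    """Toy stand-in for a synchronous message channel."""

    def __init__(self) -> None:
        self._inbox: queue.Queue = queue.Queue()

    def msg_send(self, payload):
        """Client side: block until the server replies, then return the reply."""
        reply_box: queue.Queue = queue.Queue(maxsize=1)
        self._inbox.put((payload, reply_box))
        return reply_box.get()

    def msg_receive(self):
        """Server side: block until a message arrives."""
        payload, reply_box = self._inbox.get()
        return reply_box, payload

    @staticmethod
    def msg_reply(reply_box, result) -> None:
        """Server side: unblock the client that sent the message."""
        reply_box.put(result)


def server(chan: Channel, count: int) -> None:
    """Handle a fixed number of requests by upper-casing the payload."""
    for _ in range(count):
        reply_box, msg = chan.msg_receive()
        Channel.msg_reply(reply_box, msg.upper())


chan = Channel()
worker = threading.Thread(target=server, args=(chan, 2))
worker.start()
replies = [chan.msg_send("ping"), chan.msg_send("pong")]
worker.join()
```

Because the sender is blocked for the whole round trip, there is no queue of outstanding requests for the client to manage, which is a large part of why the model stays simple.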

~~~
als0
> That’s really all it takes.

Not quite. Those functions need established connections, so you need
ConnectAttach and ConnectDetach as well. But none of that is useful unless you
can identify clients so you also need ConnectClientInfo.

This isn’t a dig, having worked on a custom IPC system I found the QNX
approach to be the best of all worlds.

~~~
jacquesm
Sure, but the essence of the actual transfer, the part where performance
matters is send/receive/reply.

------
alexnewman
Projects need to stop calling themselves snap

~~~
Gibbon1
ans to dead question.

> Do they pay people to do this?

Yes you fucking bet they do.

~~~
Gibbon1
> 0 points by Gibbon1

Makes my point

