
A new Linux memory controller promises to save lots of RAM - MilnerRoute
https://thenewstack.io/a-new-linux-memory-controller-promises-to-save-lots-of-ram/
======
boulos
Off-topic: it's too bad that The New Stack reproduced verbatim the term
"Memory Controller" instead of "Memory Allocator" (as mm/slab.h agrees). I was
really confused as to how there was going to be a _Linux_ Memory Controller,
and thought some new-fangled AMD/Intel memory controller thing had moved into
software...

~~~
derefr
My first mental image was of an MMU that was running Linux internally—like the
Xeon Phi, but even more ridiculous.

~~~
mlyle
I've used a couple DMA controllers that are Turing complete-- it's pretty cool
what kinds of things you can offload.

~~~
nine_k
Turing completeness means undecidability of termination, not something you'd
enjoy in DMA operations!

~~~
gizmo686
Turing completeness does not mean you cannot reason about the termination (or
performance) of any particular program.

------
xyzzy_plugh
I've taken a brief read through the patch series. It's a noble effort, but
from the light initial feedback there will be a few rounds of rework before
this is a candidate for a release. This also feels unlikely to be backported,
given the numerous subtle changes (e.g. cgroups v1 incompatibilities, byte vs.
page accounting).

This really seems like a somewhat obvious caveat of SLUB, in hindsight at
least, given how memory cgroups work. The accounting overhead here seems like
it opens a can of worms, complexity-wise.

For memory-pressure-intensive, short-lived applications (e.g. Hadoop jobs with
no strict NUMA affinity), this could net some real benefits when processes
jump across logical cores, let alone physical ones.

Most serious folks should be setting at least some CPU affinity anyways,
discouraging the scheduler from bouncing processes between cores, which
otherwise exasperates the issue mitigated by this patch set.
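
For reference, pinning looks roughly like this with sched_setaffinity(2) (a
minimal sketch of one way to do it; taskset(1) does the same from the shell):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    /* Pin the calling process to CPU 2 so the scheduler stops migrating it;
       this keeps its allocations and caches warm on one core. */
    int main(void)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(2, &set);

        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            return 1;
        }

        /* ... run the actual workload here ... */
        return 0;
    }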

If you've ever wondered why you struggle to make use of all of your RAM on a
large many-core host, this is potentially a big reason why.

~~~
akdor1154
> exasperates

exacerbates? :)

------
ckdarby
> With this new controller, it’s possible to gain anywhere from 35-42% better
> memory usage in Linux.

Considering the gain I'm surprised there wasn't more discussion or interest in
this.

~~~
makapuf
It is apparently only slab memory usage, not main memory usage.

~~~
sh-run
Slab would be part of main memory, not to be confused with swap.

~10% of the utilization on my system is slab.

    [bjames@lwks1 ~]$ grep "Slab\|MemTotal\|Active:" /proc/meminfo
    MemTotal: 65976620 kB
    Active: 6139820 kB
    Slab: 605988 kB

That's hardly insignificant.

It sounds to me like slab allocation is specifically used to allocate small
blocks of memory (less than a page). I.e., a slab might be set aside for
integers; then if an application (edit: not applications, see rayiner's
comment below) needs to store an integer, it gets stored in that slab instead
of somewhere else in memory. I'm not a Linux developer, so I hope I'm not
spreading bad info. Maybe someone else can chime in.
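
For a rough idea of the kernel-side API this refers to, here's a minimal
sketch using kmem_cache (the struct and cache name below are made up for
illustration):

    #include <linux/init.h>
    #include <linux/module.h>
    #include <linux/slab.h>
    #include <linux/types.h>

    /* A small kernel object, well under a page in size. */
    struct my_record {
        u64 id;
        u32 flags;
    };

    static struct kmem_cache *my_cache;

    static int __init my_module_init(void)
    {
        struct my_record *r;

        /* Create a dedicated slab cache for my_record objects. */
        my_cache = kmem_cache_create("my_record", sizeof(struct my_record),
                                     0, SLAB_HWCACHE_ALIGN, NULL);
        if (!my_cache)
            return -ENOMEM;

        /* Allocations come out of that slab rather than a generic one. */
        r = kmem_cache_alloc(my_cache, GFP_KERNEL);
        if (r) {
            r->id = 42;
            kmem_cache_free(my_cache, r);
        }
        return 0;
    }

    static void __exit my_module_exit(void)
    {
        kmem_cache_destroy(my_cache);
    }

    module_init(my_module_init);
    module_exit(my_module_exit);
    MODULE_LICENSE("GPL");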

I assume this would largely be short lived allocations so I can see where
optimizations here would lead to power savings.

~~~
rayiner
The slab allocator is for sub-page-size _kernel objects_. User space programs
allocate entire virtual memory pages and use a different, user-level allocator
to allocate sub-page-size objects.
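
Roughly, the userspace side of that split looks like this (a minimal sketch;
exactly how malloc gets its pages from the kernel varies by allocator):

    #include <stdlib.h>
    #include <sys/mman.h>

    int main(void)
    {
        /* The kernel only hands out whole pages, e.g. anonymous memory via mmap. */
        void *pages = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (pages == MAP_FAILED)
            return 1;

        /* Sub-page objects are carved out by the user-level allocator (malloc),
           which itself asks the kernel for pages via mmap/brk as needed. */
        int *x = malloc(sizeof(int));
        free(x);

        munmap(pages, 4096);
        return 0;
    }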

~~~
yellowapple
Memory savings are still memory savings, whether they happen in kernel-space
or user-space.

~~~
coryrc
Most memory usage is not in the kernel. It's not a big effect overall.

~~~
derefr
However, you kind of _want_ most memory usage to be in the kernel. Having
something happen entirely in kernelspace is always better than having half of
it happen in userspace. sendfile(2), for example.
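
(A minimal sketch of that idea, assuming an already-connected socket; a large
file would need a loop around sendfile:)

    #include <fcntl.h>
    #include <sys/sendfile.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Push a file to a socket without the usual read()-into-userspace /
       write()-back-out round trip: the kernel moves the bytes internally. */
    static int serve_file(int sock_fd, const char *path)
    {
        struct stat st;
        off_t off = 0;
        ssize_t sent = -1;
        int fd = open(path, O_RDONLY);

        if (fd < 0)
            return -1;
        if (fstat(fd, &st) == 0)
            sent = sendfile(sock_fd, fd, &off, st.st_size);
        close(fd);
        return sent < 0 ? -1 : 0;
    }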

To the degree that you can achieve "getting the kernel to do your work for
you", the kernel memory allocator becomes one of the main determinants of your
scaling requirements. IIRC WhatsApp hit this point with the FreeBSD kernel,
and had to tune the heck out of the kernel to keep scaling.

(Tangent: why is it that we hear a lot about bespoke unikernels, and a lot
about entirely-in-userspace designs like Snabb, but nobody talks about
entirely-in-kernelspace designs? Linux itself makes a pretty good "unikernel
framework": just write your business logic as a kernel driver (in e.g. Rust),
compile it into a monolithic kernel, and then run no userland whatsoever.)

~~~
Matthias247
> However, you kind of want most memory usage to be in the kernel

I disagree on that one. More memory in the Kernel means more chances of going
OOM, more fragmentation of kernel memory, and less isolation and stability.

In the ideal case I would rather want the least possible amount of memory in
the Kernel (and maybe have that all statically allocated) in order to maximize
stability and determinism.

~~~
derefr
But what’s the difference in stability between an autoscaling+autohealing VM
running a 100% kernel-mode application, and a VM hosting a 100% user-mode
process (presumably under an init daemon that will autoheal it)?

And does “fragmenting kernel memory” mean anything, if you preallocate a
memory arena in your kernel driver (taking “everything the base kernel isn’t
using”, like a VM memory-balloon driver), and then plop a library like
jemalloc into the kernel to turn the arena into your driver’s shiny new in-
kernel heap? You’re not messing with the kernel’s own allocation tables, any
more than a KVM VM’s TLB interferes with the dom0 kernel’s TLB.

See also: capability-based operating systems (like Microsoft’s ill-fated
Midori), where there’s no such thing as a per-process address space, just one
big shared heap where processes can touch any physical memory, but only
through handles they own and cannot forge. If your OS has exactly one process
anyway, you don’t even need the capabilities. (This also being how managed-
runtime unikernels like Erlang-on-Xen work.)

Also, another example of what I’m talking about re: benefits of this approach:
would you rather that a VM operating as an iSCSI disk server ran as a userland
iSCSI daemon managing kernel block devices, or—as it is currently—as an
entirely in-kernel daemon that can manage and serve block devices with no
context switches or memory copies required?

~~~
speedplane
> But what’s the difference between in stability between an
> autoscaling+autohealing VM running a 100% kernel-mode application, and a VM
> hosting a 100% user-mode process (presumably under an init daemon that will
> autoheal it)?

Running everything (or as much as possible) in Kernel mode has obvious
performance benefits, hard to argue against that.

The counter-argument isn't about performance, it's about flexibility. As far
as I'm aware, no major cloud providers allow you to run apps in Kernel mode.
So if you develop an app on your own hardware, but may eventually deploy it in
the cloud, you had better put everything in user mode. If you want to switch
from one cloud provider to another, it's (relatively) easy if everything is in
user mode.

Also, while running in Kernel mode is almost certainly faster than user mode,
it's probably _not that much faster_. If your app relies on the network or is
heavy on disk I/O, that's where your bottleneck will be, not OS user/kernel
mode switching.

In short, running things in Kernel mode may sometimes be a good performance
decision, but it's often a bad business decision.

~~~
derefr
I’m confused on what you mean by cloud providers not allowing you to run “in
kernel mode.” I know for sure that both MirageOS and Erlang-on-Xen
(unikernels) can be easily deployed as AWS AMIs. And even on compute providers
less friendly to custom image booting (like DigitalOcean), you can always
bootstrap off of a standardized Linux image by just telling it to fetch and
kexec(2) your custom kernel binary in the instance’s cloud-init script.

~~~
speedplane
> I know for sure that both MirageOS and Erlang-on-Xen (unikernels) can be
> easily deployed as AWS AMIs.

I’m not familiar with this particular deployment process, so don’t want to
speak out of line... but this being the internet, why not.

I’m a bit skeptical that Amazon would let you run anything in pure kernel
mode; it’s likely a VM/sandbox wrapping an OS that’s operating in kernel mode,
which likely negates much of the performance benefits.

Second, you mentioned two specific images, and I’ll assume they work fine on
AWS, but they are just two, and if you’re working with them, you probably have
very specific needs, not suitable for general development.

Third, who knows if these images will work on other cloud providers. Once you
get your kernel-mode app working on one, you’re locked in.

Fourth, what are you doing that requires this level of local machine
performance operating in the cloud? It’s probably almost always better to
invest your optimization time/dollars elsewhere.

Fifth, if this was a good / easy idea, many people would be doing it, but they
aren’t. Either you’ve stumbled upon some secret enlightened approach, or
you’re probably wrong.

~~~
derefr
I think you're misunderstanding: I'm talking about instances that are
operating in kernel mode _in a VM_ (which _is_ ring 0, just with the MMU and
IOMMU pre-configured by the dom0 to not let the domU have complete control
over memory or peripherals.) For most IaaS providers, VM instances are all
they'll let you run _anyway_ †. Normally, people are running userland
processes on these VMs. That's _two_ context switches for every system call:
domU user process → syscall to domU kernel → hypercall to dom0 kernel. And
it's _two_ address space mappings you have to go through, making a mess of the
cache-coherence of the real host-hardware TLB.

Writing your code to be run _as_ the kernel _of_ the VM, on the other hand,
reduces this to _one_ context switch and _one_ page translation, as your
application is just making hypercalls directly and directly using "physical
memory" (≣ dom0 virtual memory.)

Think of it this way: from the hypervisor's perspective, its VMs are a lot
like processes. A hypervisor offers its VMs all the same benefits of stability
and isolation that an OS offers processes. In fact, the only reason they
_aren't_ just regular OS processes (containers, essentially) is that IaaS compute
has been set up with the expectation that users will want to run complete
boot-images of existing OSes as their "process", and so a process ABI (the
hypercall ABI) is exposed that makes this work.

But, _if_ you are already getting the stability+isolation benefits just from
how the IaaS compute provider's hypervisor is managing your VM-as-
workload—then why would you add any more layers? You've already got the right
abstraction! A kernel written against a hypercall interface is effectively
equivalent to a userland process of the hypervisor, just one written against a
strange syscall ABI (the hypercall ABI.)

(And, of course, it's not like you can choose to run directly as a host OS
userland process instead. IaaS compute providers don't bother to provide such
a service, for several reasons‡.)

> Third, who knows if these images will work on other cloud providers.

Hypercall ABIs are part of the "target architecture" of a compiler. You don't
have to take one into account in your source code; compilers handle this for
you. You just tell clang or ocamlcc or rustc or whatever else that you're
targeting "the Xen hypercall ABI", or "the ESXi ABI", and it spits out a
binary that'll run on that type of hypervisor.

(Admittedly, it's a bit obtuse to figure out which hypervisor a given cloud
provider is using for a given instance-type; they don't tend to put this in
their marketing materials. But it's pretty common knowledge floating around
the internet, and there are only four-or-so major hypervisors everyone uses
anyway.)

> Fifth, if this was a good / easy idea, many people would be doing it, but
> they aren’t.

I'm from a vertical where this _is_ common (HFT.) I'm just here trying to
educate you.

---

† there are in fact "bare-metal clouds", which _do_ let you deploy code
directly on ring 0 of the host CPU, with the same "rent by the second" model
of regular IaaS compute. (They accomplish this by relying on the server's
BMC—ring -1!—to provide IaaS lifecycle functions like wiping/deploying images
to boot disks.) It's on these providers where a Linux-kernel-based (or other
FOSS-kernel-based) unikernel approach would shine, actually, as you would need
specialized drivers for this hardware that Linux has and the "unikernel
frameworks" don't. See [http://rumpkernel.org/](http://rumpkernel.org/) for a
solution targeting exactly this use-case, using NetBSD's kernel.

‡ Okay, this is a white lie. _Up until recently_ none of the big IaaS
providers wanted to provide such a service, because they didn't trust
container-based virtualization technology to provide _enough_ isolation.
Google built gVisor to increase that isolation, though, and so you _can_ run
"ring-3 process on shared direct-metal host" workloads on their App Engine,
Cloud Functions, and Cloud Run services. But even then, gVisor—despite
avoiding ring-0 context switches—still has a lot of overhead _from the user's
perspective_, almost equivalent to that of a ring-0 application in a VM. The
only benefits come from lowered per-workload _book-keeping_ overhead on the
_host_ side, meaning Google can overprovision more workloads per host, meaning
that "vCPU hours" are cheaper on these services.

~~~
speedplane
> I'm talking about instances that are operating in kernel mode in a VM....
> Normally, people are running userland processes on these VMs. That's two
> context switches for every system call... Writing your code to be run as the
> kernel of the VM, on the other hand, reduces this to one context switch and
> one page translation

Thanks for the clarification, this does indeed make sense. If your app is
already sandboxed by the VM, introducing a second kernel/userland sandbox
within the existing sandbox doesn't make as much sense.

That said, I think there are better ways to fix this issue than putting all of
your code into a VM's kernel space. For instance, imagine there were a way for
a hypervisor to lock down and "trust" the code running in a VM's kernel space,
and thus put the VM's kernel space into the same address space as the
hypervisor. This could also potentially reduce the two memory translations
down to one.

Another solution is to rely more on special hypervisor hardware that could
conceivably do the two memory translations (VM user -> VM kernel ->
hypervisor) as fast as a single translation.

The main reason these alternative approaches may be desirable is that asking
developers to move their programs from userland to the kernel is a big ask.
There's a lot of configuration that needs to be done, and few general
software developers have experience working within unprotected kernel space.
Simple bugs that would normally just crash a single process could bring down
the entire VM, and could potentially affect other VMs on a network (for
example, imagine a bug that accidentally overwrote a network driver's memory).

I'm sure there are performance gains to be had here, but they may be
insignificant. Projects like these are cool, but they raise big red flags of
potential over-optimization and premature optimization.

------
speedplane
This article is so full of Linux jargon, it’s impossible to follow unless
you’re deeply steeped in it. Is there another source? Memory controllers have
been developed for 30+ years, I’m surprised and slightly skeptical there are
still major breakthroughs, would love to read an explanation that I can
actually understand.

~~~
smcl
I randomly stumbled blindly into the world of slab allocation and wrote a
little bit about it here:
[https://blog.mclemon.io/discover-a-linux-utility-slabtop](https://blog.mclemon.io/discover-a-linux-utility-slabtop)

That might help at least with the slab part.

------
aloknnikhil
The optimization is primarily aimed at systems with multiple cgroups, so think
Docker-like container platforms.

~~~
cbarrick
Traditional servers should be positively affected as well [1].

> Also, there is nothing fb-specific. You can take any new modern distributive
> (I've tried Fedora 30), boot it up and look at the amount of slab memory.
> Numbers are roughly the same.

The explanation given in the article is that most distributions with systemd
spin up a bunch of cgroups even without a container-oriented workload.

[1]:
[https://lkml.org/lkml/2019/9/19/628](https://lkml.org/lkml/2019/9/19/628)

------
bcaa7f3a8bbc
> _much-improved memory utilization between multiple memory cgroups_

There will be no improvement if I don't use memory cgroups, which means it's
not relevant to a typical desktop or server without containers. But it's still
good news; the use of containers can only expand.

~~~
nicolaslem
Systemd uses cgroups heavily, so chances are your typical distribution runs
hundreds of them under the hood.

~~~
bcaa7f3a8bbc
Thanks for reminding me about systemd. I just used "systemd-cgtop", and yes, I
see 38 cgroups, one cgroup for each service/user. But 88% of the memory is
used by one cgroup: user.slice, for all the programs started by me under the
desktop. So I'm not sure it can save a lot of memory on a container-less
desktop.

Surely good for all servers.

------
meanderer
Not familiar with this, but I assume measures are taken so that we don’t
accidentally share memory content between different processes through slabs?
That seems a pretty straightforward security consideration.

------
droopyEyelids
Kind of interesting to think how much money/energy this can save across the
planet

~~~
TylerE
Seems like near zero.

RAM is specced based on worst case, generally - and it's not like this is
going to save that much anyway since it's not user-space memory.

~~~
vbezhenar
Unused RAM is usually utilized for filesystem caches. So more cache means less
disk usage and less power usage, though the difference must be truly
negligible. These computing-ecology trends terrify me: people are talking
about truly insignificant things. Yes, at large scale even a 0.001% energy
saving seems like a significant number. But large scale is planetary scale,
and even on that scale it's again a rounding error.

~~~
TylerE
If you want to save energy, do everything you can to kill cryptocurrency. Now
THAT is some obscene waste of energy.

------
Nicci00
I wonder if these improvements will trickle down to desktop Linux users.

~~~
doubleunplussed
Basically if it is accepted into the mainline kernel then yes, otherwise not
unless people decide to install custom kernels. Distros backport a little bit,
but generally don't include anything not accepted into mainline. Then there
are a few popular custom kernels that include changes not included in mainline
because they are not generally useful, and only make sense for specific
workloads.

------
AtlasBarfed
So they are sharing common code segments like libraries? I hope this has some
code/data awareness.

~~~
sverige
Security isn't as important as performance.

~~~
sachdevap
Is this a serious comment? I'm genuinely uncertain whether you mean this
seriously.

~~~
badsectoracula
Not the person you asked, but yeah, personally i also find performance
more important than security for my personal computer. The chances of me being
attacked are practically zero (theoretical attacks do not count, most of these
security issues you read focus on the possibility and totally ignore
probability) while the chances of me wanting my computer to be faster are
100%. I hate waiting for my computer to do things.

Note BTW that there is a difference between "i find performance more
important" and "i do not care about security at all". I do care about
security, but i am not willing to sacrifice my computer's performance for it.
I simply consider performance more important.

~~~
holy_city
>the chances of me being attacked are practically zero

That's because the OS developers have placed value on security over
performance. Whooping cough is rare too, we still vaccinate against it.

If you want a classic example, bounds checking an array is important to avoid
RCE and sandbox escape attempts. It can also have a hefty performance penalty;
under some scenarios it trashes the branch predictor/instruction pipeline. But
I'm glad that my browser isn't as fast as machine-ly possible when streaming
video, because I'd prefer there wasn't a risk of my emails from various banks,
stored passwords in the browser, etc. being collected and sent to a bad actor.
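
(The trade-off in miniature, as a hypothetical sketch: the checked accessor
pays a compare-and-branch per access that the unchecked one skips.)

    #include <stdbool.h>
    #include <stddef.h>

    /* Unchecked: fast, but an attacker-controlled index reads out of bounds. */
    static int get_unchecked(const int *buf, size_t idx)
    {
        return buf[idx];
    }

    /* Checked: one extra compare-and-branch per access, which in hot loops
       can cost real throughput, but out-of-range indices are rejected. */
    static bool get_checked(const int *buf, size_t len, size_t idx, int *out)
    {
        if (idx >= len)
            return false;
        *out = buf[idx];
        return true;
    }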

~~~
badsectoracula
> That's because the OS developers have placed value on security over
> performance.

No, that is mainly because nobody knows nor cares about me personally.

As for your example, i already addressed it with that last part in my message:

> Note BTW that there is a difference between "i find performance more
> important" and "i do not care about security at all". I do care about
> security, but i am not willing to sacrifice my computer's performance for
> it. I simply consider performance more important.

The browser is a case where i'd accept less performance for better security,
because it is the primary way things can get into my computer outside of my
control. However, that doesn't mean i'd accept less performance in, e.g., my
image editor, 3d renderer, video encoder or whatever else.

In other words, i want my computer to be reasonably secure, just not at all
costs.

~~~
monocasa
> No, that is mainly because nobody knows nor cares about me personally.

I mean, they do care about you. I assume you have a bank account, or personal
information that can be used to open a credit card under your name?

> However that doesn't mean i'd accept less performance in, e.g., my image
> editor, 3d renderer, video encoder or whatever else.

Most of that is specifically designed with security in mind. For instance, the
GPU has its own MMU, so you can't use it to break the boundaries between user
mode and kernel mode.

~~~
badsectoracula
> I mean, they do care about you. I assume you have a bank account, or
> personal information that can be used to open a credit card under your name?

That is not caring about _me_ though. Honestly at that point you are spreading
the same sort of hand-wavy FUD that is used to take away user control "because
security".

> Most of that is specifically designed with security in mind. For instance,
> the GPU has its own MMU, so you can't use it to break the boundaries between
> user mode and kernel mode.

Again, i'm not talking about not having security at all.

~~~
monocasa
> That is not caring about me though. Honestly at that point you are spreading
> the same sort of hand-wavy FUD that is used to take away user control
> "because security".

I legitimately don't understand your argument here. Do you not lock your car?
An opportunistic car thief doesn't have to "care about you", and going through
the process of unlocking your car could slow you down.

~~~
badsectoracula
Those comparisons miss important details so they aren't helpful - and also i
do not have a car. Though if you want a comparison that does apply to me - i
lock my apartment's door, though i do not bother with installing a metal door
and window bars despite knowing how easy the door would be to break for
someone who insists in entering my place as the chances of this happening are
simply not worth the cost.

I already repeated that several times, i'm not sure how else to convey it: i
care about security (lock my door), but it isn't at the top of my priorities
(do not have a metal door and window bars).

~~~
monocasa
The problem is you're speaking in metaphors. Which security is getting in your
way that you can't trivially disable?

------
timwaagh
What's the point when we can just download more RAM?

------
denton-scratch
"it could find its way into the mainline kernel as early as 2020"

Gosh. So soon?

~~~
progval
I'm not a kernel dev, but this looks like a pretty big change to a critical
system, with very diverse non-trivial effects on performance depending on
workloads.

It needs to be properly reviewed and tested before landing, and that takes
time.

