You can also flip it around and sample GPIOs at close to 1 MHz. Piscope is an example of that. Once you start pushing 1 MHz, jitter starts to become nontrivial as you’re saturating the bus or competing with other memory transfers.
So yes: a capable DMA controller can do some pretty mind-blowing things!
Perhaps the single biggest advantage of the DMA solution for something like piscope is that it means people who want to use the Pi as a scope need only download a piece of software: no soldering or flashing of MCUs required. The initial setup is more beginner-friendly, with the drawback that contributing might be more intimidating because the software is doing more complex work.
There was a post (this year?) mapping Gödel's incompleteness theorem to the halting problem. I found it inspiring, not least because the link revealed they were both in fact statements about the properties of infinity - something that was not obvious by just looking at either problem on its own. It is one of those few cases where the whole was obviously bigger than the sum of the parts, for me at least.
> NOTE: The Memory Resource Controller has generically been referred to as the memory controller in this document. Do not confuse memory controller used here with the memory controller that is used in hardware.
This really seems like a somewhat obvious caveat of SLUB, in hindsight at least, given how memory cgroups work. The accounting overhead here seems like opening a complexity can of worms.
For memory-pressure intensive short-lived applications (e.g. Hadoop jobs with no strict NUMA affinity), this could net some real benefits when processes jump across cores, let alone physical cores.
Most serious folks should be setting at least some CPU affinity anyway, to discourage the scheduler from bouncing processes between cores, which otherwise exacerbates the issue mitigated by this patch set.
If you've ever wondered why you struggle to make use of all of your RAM on a large many-core host, this is potentially a big reason why.
Considering the gain I'm surprised there wasn't more discussion or interest in this.
Slab is ~10% of the actively used memory on my system.
[bjames@lwks1 ~]$ grep "Slab\|MemTotal\|Active:" /proc/meminfo
MemTotal: 65976620 kB
Active: 6139820 kB
Slab: 605988 kB
That's hardly insignificant.
It sounds to me like slab allocation is specifically used to allocate small blocks of memory (less than a page). I.e., a slab might be set aside for integers, then if an application (edit: not applications, see rayiner's comment below) needs to store an integer, it gets stored in that slab instead of somewhere else in memory. I'm not a Linux developer, so I hope I'm not spreading bad info. Maybe someone else can chime in.
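To make that a bit more concrete, here's a rough sketch of the in-kernel side of it (the kmem_cache API from <linux/slab.h>); "struct foo" and the function names are made up for illustration, and this is the general slab idea rather than anything specific to this patch set:

    #include <linux/init.h>
    #include <linux/slab.h>
    #include <linux/errno.h>

    /* One cache per object type; many same-sized objects share pages. */
    struct foo {
        int value;
        char tag[16];
    };

    static struct kmem_cache *foo_cache;

    static int __init foo_cache_init(void)
    {
        foo_cache = kmem_cache_create("foo_cache", sizeof(struct foo),
                                      0, SLAB_HWCACHE_ALIGN, NULL);
        return foo_cache ? 0 : -ENOMEM;
    }

    static struct foo *foo_alloc(void)
    {
        /* Carves a struct foo out of an existing slab page instead of
         * taking a whole page for a ~20-byte object. */
        return kmem_cache_alloc(foo_cache, GFP_KERNEL);
    }

    static void foo_free(struct foo *f)
    {
        kmem_cache_free(foo_cache, f);
    }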
I assume this would largely be short lived allocations so I can see where optimizations here would lead to power savings.
To the degree that you can achieve "getting the kernel to do your work for you", the kernel memory allocator becomes one of the main determinants of your scaling requirements. IIRC WhatsApp hit this point with the FreeBSD kernel, and had to tune the heck out of the kernel to keep scaling.
(Tangent: why is it that we hear a lot about bespoke unikernels, and a lot about entirely-in-userspace designs like Snabb, but nobody talks about entirely-in-kernelspace designs? Linux itself makes a pretty good "unikernel framework": just write your business logic as a kernel driver (in e.g. Rust), compile it into a monolithic kernel, and then run no userland whatsoever.)
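Purely as a sketch of what I mean (every name here is made up, and a real setup would compile this in as a built-in driver via Kconfig rather than loading it as a module) - the "business logic" is just a kernel thread:

    #include <linux/init.h>
    #include <linux/module.h>
    #include <linux/kthread.h>
    #include <linux/sched.h>
    #include <linux/delay.h>
    #include <linux/err.h>

    static struct task_struct *busylogic_task;

    static int busylogic_run(void *unused)
    {
        while (!kthread_should_stop()) {
            /* ... the actual application logic would live here ... */
            msleep(1000);
        }
        return 0;
    }

    static int __init busylogic_init(void)
    {
        busylogic_task = kthread_run(busylogic_run, NULL, "busylogic");
        return PTR_ERR_OR_ZERO(busylogic_task);
    }

    static void __exit busylogic_exit(void)
    {
        kthread_stop(busylogic_task);
    }

    module_init(busylogic_init);
    module_exit(busylogic_exit);
    MODULE_LICENSE("GPL");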
I disagree on that one. More memory in the Kernel means more chances of going OOM, more fragmentation of kernel memory, and less isolation and stability.
In the ideal case I would want as little memory as possible in the Kernel (and maybe have it all statically allocated) in order to maximize stability and determinism.
And does “fragmenting kernel memory” mean anything, if you preallocate a memory arena in your kernel driver (taking “everything the base kernel isn’t using”, like a VM memory-balloon driver), and then plop a library like jemalloc into the kernel to turn the arena into your driver’s shiny new in-kernel heap? You’re not messing with the kernel’s own allocation tables, any more than a KVM VM’s TLB interferes with the dom0 kernel’s TLB.
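Roughly the shape I have in mind (a sketch only; the size, the names, and the dumb bump allocator are placeholders for something real like the ported jemalloc mentioned above):

    #include <linux/vmalloc.h>
    #include <linux/kernel.h>
    #include <linux/errno.h>
    #include <linux/types.h>

    #define ARENA_SIZE (64UL << 20)   /* 64 MiB, arbitrary for the example */

    static void *arena_base;
    static size_t arena_used;

    /* One up-front allocation; the kernel's own allocator never sees
     * the individual objects we carve out of this region later. */
    static int arena_init(void)
    {
        arena_base = vzalloc(ARENA_SIZE);
        return arena_base ? 0 : -ENOMEM;
    }

    /* Dead-simple bump allocation out of the private arena. */
    static void *arena_alloc(size_t size)
    {
        void *p;

        size = ALIGN(size, 16);
        if (arena_used + size > ARENA_SIZE)
            return NULL;
        p = (char *)arena_base + arena_used;
        arena_used += size;
        return p;
    }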
See also: capability-based operating systems (like Microsoft’s ill-fated Midori), where there’s no such thing as a per-process address space, just one big shared heap where processes can touch any physical memory, but only through handles they own and cannot forge. If your OS has exactly one process anyway, you don’t even need the capabilities. (This also being how managed-runtime unikernels like Erlang-on-Xen work.)
Also, another example of what I’m talking about re: benefits of this approach: would you rather that a VM operating as an iSCSI disk server ran as a userland iSCSI daemon managing kernel block devices; or—as it is currently—as an entirely in-kernel daemon that can manage and serve block devices with no context switches or memory copies required?
Running everything (or as much as possible) in Kernel mode has obvious performance benefits, hard to argue against that.
The counter-argument isn't about performance, it's about flexibility. As far as I'm aware, no major cloud providers allow you to run apps in Kernel mode. So if you develop an app on your own hardware, but may eventually deploy it in the cloud, you better put everything in user mode. If you want to switch from one cloud provider to another, it's (relatively) easy if everything is in user mode.
Also, while running in Kernel mode is almost certainly faster than user mode, it's probably not that much faster. If your app relies on the network or is heavy on disk I/O, that's where your bottleneck will be, not OS user/kernel mode switching.
In short, running things in Kernel mode may sometimes be a good performance decision, but it's often a bad business decision.
I’m not familiar with this particular deployment process, so don’t want to speak out of line... but this being the internet, why not.
I’m a bit skeptical that Amazon would let you run anything in pure kernel mode; it’s likely a VM/sandbox wrapping an OS that’s operating in Kernel mode, likely negating much of the performance benefits.
Second, you mentioned two specific images, and I’ll assume they work fine on AWS, but they are just two, and if you’re working with them, you probably have very specific needs, not suitable for general development.
Third, who knows if these images will work on other cloud providers. Once you get your kernel mode app working on one, you’re locked in.
Fourth, what are you doing that requires this level of local machine performance operating in the cloud? It’s probably almost always better to invest your optimization time/dollars elsewhere.
Fifth, if this was a good / easy idea, many people would be doing it, but they aren’t. Either you’ve stumbled upon some secret enlightened approach, or you’re probably wrong.
And it's quite possible to create new images for these instances for your needs.
You're right about the complexity and lock-in. Using such services is often a trade-off of dev complexity vs speed. If you're following a well defined process that doesn't change often and that's going to peg the CPU of a 72-core machine for days, it might be worth it to eke out every speed improvement you can. If, on the other hand, you're constantly iterating and updating your code, and responding to a handful of user requests that barely cause a blip on a single-core VM, CPU speed is probably not your primary concern.
People are doing it, I'm one of them. It's a niche though, and most assuredly not for every case.
Writing your code to be run as the kernel of the VM, on the other hand, reduces this to one context switch and one page translation, as your application is just making hypercalls directly and directly using "physical memory" (≣ dom0 virtual memory.)
Think of it this way: from the hypervisor's perspective, its VMs are a lot like processes. A hypervisor offers its VMs all the same benefits of stability and isolation that an OS offers processes. In fact, the only reason they aren't just regular OS processes (containers, essentially), is that IaaS compute has been set up with the expectation that users will want to run complete boot-images of existing OSes as their "process", and so a process ABI (the hypercall ABI) is exposed that makes this work.
But, if you are already getting the stability+isolation benefits just from how the IaaS compute provider's hypervisor is managing your VM-as-workload—then why would you add any more layers? You've already got the right abstraction! A kernel written against a hypercall interface is effectively equivalent to a userland process of the hypervisor, just one written against a strange syscall ABI (the hypercall ABI.)
(And, of course, it's not like you can choose to run directly as a host OS userland process instead. IaaS compute providers don't bother to provide such a service, for several reasons‡.)
> Third, who knows if these images will work on other cloud providers.
Hypercall ABIs are part of the "target architecture" of a compiler. You don't have to take one into account in your source code; compilers handle this for you. You just tell clang or ocamlcc or rustc or whatever else that you're targeting "the Xen hypercall ABI", or "the ESXi ABI", and it spits out a binary that'll run on that type of hypervisor.
(Admittedly, it's a bit obtuse to figure out which hypervisor a given cloud provider is using for a given instance-type; they don't tend to put this in their marketing materials. But it's pretty common knowledge floating around the internet, and there are only four-or-so major hypervisors everyone uses anyway.)
> Fifth, if this was a good / easy idea, many people would be doing it, but they aren’t.
I'm from a vertical where this is common (HFT.) I'm just here trying to educate you.
† there are in fact "bare-metal clouds", which do let you deploy code directly on ring 0 of the host CPU, with the same "rent by the second" model of regular IaaS compute. (They accomplish this by relying on the server's BMC—ring -1!—to provide IaaS lifecycle functions like wiping/deploying images to boot disks.) It's on these providers where a Linux-kernel-based (or other FOSS-kernel-based) unikernel approach would shine, actually, as you would need specialized drivers for this hardware that Linux has and the "unikernel frameworks" don't. See http://rumpkernel.org/ for a solution targeting exactly this use-case, using NetBSD's kernel.
‡ Okay, this is a white lie. Up until recently none of the big IaaS providers wanted to provide such a service, because they didn't trust container-based virtualization technology to provide enough isolation. Google built gVisor to increase that isolation, though, and so you can run "ring-3 process on shared direct-metal host" workloads on their App Engine, Cloud Functions, and Cloud Run services. But even then, gVisor—despite avoiding ring-0 context switches—still has a lot of overhead from the user's perspective, almost equivalent to that of a ring-0 application in a VM. The only benefits come from lowered per-workload book-keeping overhead on the host side, meaning Google can overprovision more workloads per host, meaning that "vCPU hours" are cheaper on these services.
Thanks for the clarification, this does indeed make sense. If your app is already sandboxed by the VM, introducing a second kernel/userland sandbox within the existing sandbox doesn't make as much sense.
That said, I think there are better ways to fix this issue than putting all of your code into a VM's kernel space. For instance, imagine there was a way for a hypervisor to lock down and "trust" the code running in a VM's kernel space, and thus put the VM's kernel space into the same address space as the hypervisor. This could also potentially reduce the two memory translations down to one.
Another solution is to rely more on special hypervisor hardware that could conceivably do the two memory translations (VM user -> VM kernel -> hypervisor) as fast as a single translation.
The main reason that these alternative approaches may be desirable is that asking developers to move their programs from userland to the kernel is a big ask. There's a lot of configuration that needs to be done, and few general software developers have experience working within unprotected kernel space. Simple bugs that would normally just crash a single process could bring down the entire VM, and could potentially affect other VMs on a network (for example, imagine a bug that accidentally overwrote a network driver's memory).
I'm sure there are performance gains to be had here, but they may be insignificant. Projects like these are cool, but raise big red flags of potential premature over-optimization.
A while back, I worked for an embedded systems company that developed real-time OSes, compilers, and debuggers. Everything in the Kernel had to be statically allocated; malloc was simply not allowed. Even user-land programs, which could dynamically allocate memory, had to check the return value of malloc to make sure the memory was actually allocated (malloc returns null if it can't allocate memory).
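(In plain ISO C the discipline looks something like this - not that company's actual code, just the pattern: every allocation is checked before use, and failure has an explicit path.)

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        size_t n = 1024;
        char *buf = malloc(n);

        if (buf == NULL) {          /* malloc returns NULL on failure */
            fprintf(stderr, "out of memory, refusing to continue\n");
            return EXIT_FAILURE;    /* or degrade gracefully */
        }

        memset(buf, 0, n);
        /* ... use buf ... */
        free(buf);
        return EXIT_SUCCESS;
    }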
It was painful to program in this environment, but over time, I grew to love it. Software developers did not feel like they were developing "Soft"-ware; memory, CPU, I/O, interrupts, and even heat dissipation were not infinite resources you could use without care. These restrictions elevated the development process, and made the "developers" feel more like "engineers".
But, you can always write a five-line "trivial init" in C that just starts and then goes to sleep forever. If you didn't know, init processes don't have any special kernel-imposed requirements. You can run /bin/bash as init; this is what many distros' single-user rescue modes do!
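Something like this, say (a minimal sketch; build it statically, e.g. with gcc -static, and pass init=/path/to/it on the kernel command line):

    #include <unistd.h>

    int main(void)
    {
        /* PID 1: do nothing, forever. */
        for (;;)
            pause();    /* sleep until a signal arrives, then sleep again */
        return 0;
    }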
Or, if you're even lazier than that, I believe Busybox has a configuration wizard that allows you to just deselect all the features. That'd probably spit out a "trivial init" binary.
¹ On my machine it's currently at only about 165MB, so not quite as extreme, but still a decent chunk of memory.
That might help at least with the slab part.
> Also, there is nothing fb-specific. You can take any new modern distributive (I've tried Fedora 30), boot it up and look at the amount of slab memory. Numbers are roughly the same.
The explanation given in the article is that most distributions with systemd spin up a bunch of cgroups even without a container-oriented workload.
So there will be no improvement if I don't use memory cgroups, which means it's not relevant to a typical desktop or server without containers. Still, it's good news: the use of containers can only expand.
Surely good for all servers.
RAM is specced based on worst case, generally - and it's not like this is going to save that much anyway since it's not user-space memory.
I make use of VFIO on my home Threadripper, and while it's "only" 12 cores and 64GB RAM, it's NUMA, so I have to use thread pinning to keep a VM's threads on cores on the same die so they're not reaching across the Infinity Fabric to the other memory controllers.
With better memory allocation, I could assign >12 vCPUs to performance-oriented VMs or use more than half my memory without incurring a latency penalty.
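For anyone curious, the pinning I mean lives in the libvirt domain XML; the cpuset numbers below are invented and depend on your topology (check lscpu -e or numactl --hardware), so treat this as a sketch rather than a recipe:

    <vcpu placement='static'>6</vcpu>
    <cputune>
      <!-- pin each vCPU to a host CPU on the same die -->
      <vcpupin vcpu='0' cpuset='0'/>
      <vcpupin vcpu='1' cpuset='1'/>
      <vcpupin vcpu='2' cpuset='2'/>
      <vcpupin vcpu='3' cpuset='3'/>
      <vcpupin vcpu='4' cpuset='4'/>
      <vcpupin vcpu='5' cpuset='5'/>
    </cputune>
    <numatune>
      <!-- keep guest memory on the same NUMA node as those CPUs -->
      <memory mode='strict' nodeset='0'/>
    </numatune>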
The big challenge would be making sure that the different containers' resource limits are respected, but it looks like they're addressing that head on.
Note BTW that there is a difference between "i find performance more important" and "i do not care about security at all". I do care about security, but i am not willing to sacrifice my computer's performance for it. I simply consider performance more important.
That's because the OS developers have placed value on security over performance. Whooping cough is rare too, but we still vaccinate against it.
If you want a classic example, bounds checking an array is important to avoid RCE and sandbox escape attempts. It can also have a hefty performance penalty; under some scenarios it trashes the branch predictor/instruction pipeline. But I'm glad that my browser isn't as fast as machine-ly possible when streaming video, because I'd prefer that there wasn't a risk of my emails from various banks, stored passwords in the browser, etc. being collected and sent to a bad actor.
No, that is mainly because nobody knows nor cares about me personally.
As for your example, i already addressed it with that last part in my message:
> Note BTW that there is a difference between "i find performance more important" and "i do not care about security at all". I do care about security, but i am not willing to sacrifice my computer's performance for it. I simply consider performance more important.
The browser is a case where i'd accept less performance for better security because it is the primary way where things can get into my computer outside of my control. However that doesn't mean i'd accept less performance in, e.g., my image editor, 3d renderer, video encoder or whatever else.
In other words, i want my computer to be reasonably secure, just not at all costs.
I mean, they do care about you. I assume you have a bank account, or personal information that can be used to open a credit card under your name?
> However that doesn't mean i'd accept less performance in, e.g., my image editor, 3d renderer, video encoder or whatever else.
Most of that is specifically designed with security in mind. For instance the GPU has its own MMU so you can't use it to break the boundaries between user mode and kernel mode.
That is not caring about me though. Honestly at that point you are spreading the same sort of hand-wavy FUD that is used to take away user control "because security".
> Most of that is specifically designed with security in mind. For instance the GPU has its own MMU so you can't use it to break the boundaries between user mode and kernel mode.
Again, i'm not talking about not having security at all.
I legitimately don't understand your argument here. Do you not lock your car? An opportunistic car thief doesn't have to "care about you", and going through the process of unlocking your car could slow you down.
I already repeated that several times, i'm not sure how else to convey it: i care about security (i lock my door), but it isn't at the top of my priorities (i do not have a metal door and window bars).
I don't think there are many such machines anymore. I am not seeing this as a common case at all. And in the present context of the Linux memory controller, they would never sacrifice general kernel security for performance.
Gosh. So soon?
It needs to be properly reviewed and tested before landing, and that takes time.