
Common Systems Programming Optimizations and Tricks - wheresvic3
https://paulcavallaro.com/blog/common-systems-programming-optimizations-tricks/
======
kabdib
"Repurposing Top Bits" \- don't do that. Honest.

The IBM 360 shipped with 32-bit addresses but only 24 bits decoded. "Hey,
there's a whole byte up top that nobody's using today, let's put some stuff
there!" When they wanted the address space IBM found themselves
architecturally hamstrung, and the cost to dig out was significant.

The 128K Macintosh used a 68000; it had 32-bit addresses but only 24 bits were
decoded. "Hey, there's a whole byte up top that nobody's using today, let's
put some stuff there!" When Apple needed the address space they found
themselves hamstrung by pieces of MacOS that _did_ use those bits, and many
applications that did, too. The cost to dig out was significant.

It is practically guaranteed that "Hey, there's 16 whole bits up there that
nobody's using today" will wind up the same, because this industry just never
learns.

You can do things with lower bits and usually get away with it; many systems
put GC tags and type bits down there. But those upper address bits do not
belong to you.

~~~
muricula
Armv8 has an opt-in feature you can turn on to ignore the top byte:
[https://en.wikichip.org/wiki/arm/tbi](https://en.wikichip.org/wiki/arm/tbi)
This is also where the pointer authentication code goes for arm pointer
authentication:
[https://lwn.net/Articles/718888/](https://lwn.net/Articles/718888/)

On x86_64 and arm without those features enabled, the top bits of the pointer
must be sign extended. This means that x86_64 by default gives you the top two
bytes to play with as long as you don't use the values 0xffff or 0x0000.
Attempting to access a pointer whose top 16 bits aren't a sign extension of
bit 48 will fault. You can still safely play this game as long as you fix up
the pointer before dereferencing it.

~~~
jnwatson
"Attempting to access a pointer whose top 16 bits aren't a sign extension of
bit 48 will fault."

Currently. Coming soon to an Intel chip near you is 57-bit virtual addressing
and 5-level page tables [1]. It would be quite a bug that would only crash on
new Intel hardware, with probably quite full memory maps, where your pointer
fix-up code didn't restore bits 48-56 correctly.

[1] [https://www.phoronix.com/scan.php?page=news_item&px=Linux-Default-5-LVL-Paging-Def](https://www.phoronix.com/scan.php?page=news_item&px=Linux-Default-5-LVL-Paging-Def)

~~~
vardump
It'll probably be quite a while before operating systems rush to turn on an
extra level of page-walk fun, increasing the TLB miss penalty even more.

If only x86 could have 64 kB pages...

Of course you're right it's not a great idea to use those bits. Eventually
they will be in use, although it's probably 10+ years.

~~~
simcop2387
The linux kernel has patches for it submitted now, ready for when the hardware
arrives: [https://www.phoronix.com/scan.php?page=news_item&px=Linux-Default-5-LVL-Paging-Def](https://www.phoronix.com/scan.php?page=news_item&px=Linux-Default-5-LVL-Paging-Def)

~~~
vardump
My point was that very few need more than 48 bits of virtual address space.
Having an extra layer of lookup will reduce page-walk performance.

52-bit physical & 57-bit virtual address space is a no brainer if you have
more than 128 TB of RAM installed, of course. :-)

------
SlySherZ
For everyone that enjoyed this, there's an entire free online MIT course
called Performance Engineering of Software Systems[1] where you'll learn
plenty more tricks and common pitfalls like these. You'll also learn how to
use tools to debug the low level performance of your programs: looking at
cache misses, cpu utilization, time spent per assembly operation and so on.
It's pretty cool :)

[1] [https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-172-performance-engineering-of-software-systems-fall-2010/](https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-172-performance-engineering-of-software-systems-fall-2010/)

~~~
kgs42
Thanks for this, first lecture looks cool!

------
vardump
Very good article, facts looked correct and it had useful advice.

I'd add, keep things local. Don't access memory (or cache) outside core (L1 &
L2), NUMA region or processor socket boundary unnecessarily.

Keep networking, GPU, etc. code in same NUMA region where the physical
adapters are.

Use memory like tape: stream through it. CPU prefetchers love that kind of
access pattern.

Oh, and perhaps most importantly: use a profiler that can access CPU internal
performance counters. Do this on different system types, from low power
laptops to servers with 2 or more CPU sockets.

One annoying thing, though. Remember that the fastest thing in a
microbenchmark might not be the fastest thing on a real system when different
code modules fight for shared limited resources, like memory bandwidth, caches
and inter-core communication links.

~~~
ibrault
Could you elaborate on what it means to "use memory like tape"?

~~~
vardump
Sequential access patterns, forward or backward. Repeating, predictable gaps
are OK, but do remember that the minimum unit that can be read from memory is
a cache line. So if you read one byte, you'll actually read 64 bytes on modern
x86.

------
ncmncm
Last time this came up
([https://news.ycombinator.com/item?id=20808778](https://news.ycombinator.com/item?id=20808778))
and disappeared almost instantly, I wrote:

A discussion of systems programming optimization that doesn't start with
single-writer ring buffers starts on the wrong foot.

Those other tricks are excellent, and I use all of them, in cases where they
work at all. But, e.g., seeking a way not to need to take a lock at all should
come before discovering a low-contention locking protocol.

Readers should note that packing spare bits into the bottom bits of suitably
aligned pointers is more portable than using high bits. Any page-aligned
pointer has at least 12 bits free at the bottom, and any malloc'd pointer has
at least 2, more often 4.

Ring buffer counters into a power-of-2 sized buffer can be incremented without
bound, enabling use of ordinary arithmetic on them, and high bits masked off
cheaply on each use. [But use 64 bits!]

Probably the most neglected primitive data structure is the bitmapped set. A
`uint32_t` gives you a universe of 32 elements; a byte is enough for the days
of the week. The popcount native instruction is very valuable here, usually
expressed as `__builtin_popcount` in source code. C++98's `std::bitset`
provides Standard, portable access to it, but C++20 offers `std::popcount`
directly.

[I add here that storing things in high bits of addresses is very likely to
fail on systems with ASLR, and that I have learned MSVC bitsets have a very
slow popcount.]

~~~
mamcx
Bitmapped set is one I don't remember at all. Does it have another name? A
quick google doesn't give me a clear idea of what it is or how it could be
useful...

~~~
ncmncm
std::bitset is one. But you can just use an unsigned int, or an array of them.
In C++ or C, set intersection is &, union is |, complement is ~. Cardinality
is __builtin_popcount, or std::bitset<>::count(). The membership test for m is
(s & (1 << m)). For a larger set, represented as an array of unsigned, it's
(s[m >> 5] & (1 << (m & 0x1f))). std::bitset encapsulates all this, optimally.
There are similarly efficient implementations for lots of other useful
operations: consult HAKMEM.

Bitsets are useful in more places than you might guess, because they can be
used to filter out, with typically a single instruction, a majority of
uninteresting cases in searches, leaving fewer cases to be examined with more
precise and expensive operations. You record interesting properties of
elements of a collection upon insertion as bits in a word stored alongside
(or, better, in an index); and then first check the bits during any search. It
is easy to speed up an amazing variety of programs by an order of magnitude,
sometimes much more, with a tiny change.

For example, you can store an int value for each of a (possibly very large)
collection of lines of text, representing the set of less-common letters that
appear in the line. Searching for lines that contain a string, you first check
it against the set for the string. Any line that doesn't have them all
certainly doesn't have the string.

Leibniz (inventor of calculus) used this method very heavily in his own work.
Before computers--and even into the 1960s--it was the most important way of
automating data processing. Back then, you used cards that had a row of either
holes or notches along one edge, and selected matching cards from a stack by
inserting a rod through the stack at a selected spot, and lifting.

~~~
pasabagi
>Leibniz (inventor of calculus) used this method very heavily in his own work.
Before computers--and even into the 1960s--it was the most important way of
automating data processing. Back then, you used cards that had a row of either
holes or notches along one edge, and selected matching cards from a stack by
inserting a rod through the stack at a selected spot, and lifting.

Hey, do you have a reference for that? I've been doing some research into
Leibniz's calculators, and I've been finding few sources.

~~~
ncmncm
There is quite a lot about Leibniz's calculating machine designs on Wikipedia.

I think I found out about Leibniz's bitwise activities in Neal Stephenson's
_Quicksilver_, but he invented the modern notions of both sets and digital
logic, according to Wikipedia. He would have used the cards in cataloguing
libraries.

------
dbcurtis
Good article. Basics that everyone can benefit from knowing.

Just one nit/warning... breaking coarse locks into fine-grained locks _can_ be
taken too far. There is a point of diminishing returns where you end up
spending increased time acquiring/releasing/waiting-for locks. At some point
you want to clump together under a single lock resources that tend to often be
used together, even if you often end up locking an extra resource or two
unnecessarily. As always, benchmark workloads are your friends.

~~~
jacobush
Taken to the extreme, ONE lock in Python. :)

~~~
josalhor
I know this is a joke, but you still need locks in Python

~~~
ajross
I didn't read it as a joke, you're just operating at a different abstraction
level. The CPython interpreter famously uses a single global interpreter lock
to protect the language internals and runtime, so it has trouble scaling
beyond a single CPU on interpreter-heavy workloads. You're saying that threads
in python scripts can be arbitrarily preempted, and so locking is required to
protect them against each other, which is also true.

------
Symmetry
In most cases your compiler should do the clever work of turning your division
or modulo operation into easier-to-do bit banging... but only if you're
operating on a reasonable constant. Powers of two are best, but you can do
other constant divisions by stringing together 1-latency operations in ways
that are still far faster than division.

~~~
caf
Yes - as long as x is unsigned, then (x % 1024) will be compiled to the same
thing as (x & 1023) with any reasonable compiler.

If your hashtable dynamically resizes though, (x % size) will use a full
divide. You could keep the log2 of the size around instead, and rely on (x %
(1U << size_shift)) being optimised (this works on gcc:
[https://godbolt.org/z/HbGnJH](https://godbolt.org/z/HbGnJH)) but (x & (size -
1)) might be easier at that point.

------
legulere
Instead of repurposing top bits you can also repurpose the bits freed up by
alignment. E.g. 32-bit integers are aligned to 4 bytes, so you can use the
lower two bits of pointers to them instead.

~~~
eschneider
As someone who's worked on old Macs and has also done lots of 32 -> 64-bit
porting, this is the sort of trick that works wonderfully...until it doesn't.
And then you've got a nightmare on your hands.

I'm not saying never do that (ok, maybe I am...) But definitely think long and
hard about how long your code will be around before you do it.

~~~
jcelerier
> As someone who's worked on old Macs and has also done lots of 32 -> 64-bit
> porting, this is the sort of trick that works wonderfully...until it
> doesn't. And then you've got a nightmare on your hands.

That's why you hide the trick behind a zero-cost abstraction which checks at
compile-time if the platform supports this

~~~
notacoward
One of my favorite system programming tricks is to never believe that a "zero
cost abstraction" lives up to the name.

~~~
jjuhl
Modern optimizing C++ compilers (especially with Link Time Optimization
enabled) are pretty amazing and can _very often_ actually achieve that
abstraction collapsing. But, of course, _always_ measure.

~~~
notacoward
While it's true that modern compilers are wondrous things, checking whether
they're clever enough to optimize away a particular construct - and to do so
correctly, and to continue doing so in the next release - still takes time. If
the same optimization can be done at a higher level, such that it will apply
for any _correct_ (but not necessarily clever) compiler, that's preferable. In
my experience that's practically all the time. The best compiler optimizations
IMO are the ones that _can 't_ be easily done at the source level.

------
ufo
> Part of why the change couldn’t be enabled by default is because various
> high performance programs, notably various JavaScript engines and LuaJIT,
> use this repurposing trick to pack some extra data into pointers.

Does any one know if this sentence can be backed up by a citation?

I know that the NaN-tagging trick assumes that pointers have 48 bits (small
enough to fit inside a floating point mantissa), but was this ever a factor
for deciding whether 5-level page tables should be added to the Linux kernel
or not?

~~~
caf
5-level page tables have actually been in the kernel for a couple of years
now.

The issue listed was definitely a concern, but was worked around by having the
kernel only allocate userspace linear addresses that aren't 48-bit-canonical
in response to a mmap() call that supplies such an address as the hint
argument. See the commit message in this commit for example:
[https://lore.kernel.org/patchwork/patch/796025/](https://lore.kernel.org/patchwork/patch/796025/)

So for a program to get an address above the old 47-bit user-space limit, its
memory allocator has to specifically indicate to the kernel that it supports
that.

------
vymague
Interesting article. Is there a reason why there isn't a book/article that has
a more comprehensive list?

~~~
groby_b
I guess partially because it can quickly become very dependent on the system
you work on. But if you're specifically interested in caching issues, one good
keyword to look for is "cache aware algorithms" - or "cache oblivious
algorithms".

On the locking side, the counterpart would probably be "lock-free algorithms",
though I still believe their complexity means that in most cases you shouldn't
go there :)

------
holy_city
The false-sharing macro in the example expands to __attribute__((aligned(/*
etc */))) or __declspec(align(/* etc */)). Is there a reason these are
preferred over the alignas specifier introduced in C++11?

~~~
saagarjha
I believe there's a note that recommends the use of alignas when available:
[https://github.com/abseil/abseil-cpp/blob/fa00c321073c7ea40a4fc3dfc8a06309eae3d025/absl/base/optimization.h#L107](https://github.com/abseil/abseil-cpp/blob/fa00c321073c7ea40a4fc3dfc8a06309eae3d025/absl/base/optimization.h#L107)

------
y7
> Now, to support multiple processors on a single machine reading and writing
> from the same memory in a coherent way, only one processor on a machine can
> have exclusive access to a given cache line.

Does this also apply when multiple processors are only reading memory?

~~~
atq2119
No. That's what MESI-based cache protocols are all about. Multiple
cores/processors can have the same cache line in a shared/read-only state.
Only writes require exclusive access.

------
floatboth
> The Magic Power of 2: Division Is Slowaloo

Is LLVM not smart enough to optimize this?

~~~
BeeOnRope
Yes, for constant divisors.

However, for signed division, the C semantics (round towards zero) are
different than the semantics when you apply an arithmetic shift (round towards
negative infinity).

If you are fine with the latter behavior, explicit shifts remove several
extraneous instructions dealing with the difference.

~~~
caf
Or you can do the division in unsigned types - reasonable for something like
an array index.

------
SomaticPirate
Are there any golang implantation a of high performance hash maps? Does the
the standard lib do this?

~~~
bboreham
Go’s built-in ‘map’ is very good.

But “high performance” is not a single dimension - if you say a bit more about
what you care about, maybe another choice would fit.

