The IBM 360 shipped with 32-bit addresses but only 24 bits decoded. "Hey, there's a whole byte up top that nobody's using today, let's put some stuff there!" When they wanted the address space IBM found themselves architecturally hamstrung, and the cost to dig out was significant.
The 128K Macintosh used a 68000; it had 32-bit addresses but only 24 bits were decoded. "Hey, there's a whole byte up top that nobody's using today, let's put some stuff there!" When Apple needed the address space they found themselves hamstrung by pieces of MacOS that did use those bits, and many applications that did, too. The cost to dig out was significant.
It is practically guaranteed that "Hey, there's 16 whole bits up there that nobody's using today" will wind up the same, because this industry just never learns.
You can do things with lower bits and usually get away with it; many systems put GC tags and type bits down there. But those upper address bits do not belong to you.
On x86_64 and arm without those features enabled, the top bits of the pointer must be sign extended. This means that x86_64 by default gives you the top two bytes to play with as long as you don't use the values 0xffff or 0x0000. Attempting to access a pointer whose top 16 bits aren't a sign extension of bit 48 will fault. You can still safely play this game as long as you fix up the pointer before dereferencing it.
Currently. Coming soon to an Intel chip near you is 57-bit virtual addressing and 5-level page tables . It would be quite a bug that would only crash on new Intel hardware on probably quite full memory maps where your pointer fix up code wouldn't restore bits 48-57 correctly.
If only x86 could have 64 kB pages...
Of course you're right it's not a great idea to use those bits. Eventually they will be in use, although it's probably 10+ years.
52-bit physical & 57-bit virtual address space is a no brainer if you have more than 128 TB of RAM installed, of course. :-)
Yeah, I'd hope that code that uses such tricks would have an error check for pointers that use the bits it wants to use. Better a clean crash than corruption.
It'd be neat if a program could tell the OS what address range is acceptable. Linux has mmap(..., MAP_32BIT, ...) but obviously that's pretty limited. Maybe something like map(addr, ..., MAP_MAXADDR, ...) which would tell it addr represents the maximum acceptable address to return. So if you intend to use the top 16 bits, you could tell it this mapping can't use those.
Edit: oh, actually, I see they do something kind of like this. https://lwn.net/Articles/717293/ "An application that needs [virtual address beyond 48-bits], and which does not play games with virtual addresses, can provide an address hint above the boundary in a call to mmap(), at which point the kernel will understand that mappings in the upper range are accessible." Still not quite as flexible as I was imagining but not bad.]
I don't often get to do work as cool as that, but I've really enjoyed pursuing more researchy things in my spare time. Right now I'm working on adding SQL:2011 temporal features to Postgres, so I guess I'm still getting my C fix from somewhere. :-)
I once shrugged this of as paranoid BS and got bitten by it in the very same project. It did cost me some all-nighters to fix the results of my arrogance.
With x86_64, the top bits have to be all 0, or all 1. You can still reuse them as long as you clear them before actually using them as pointers. Of course that might break at a later date is memory spaces increase in size and they are used, but I'll deal with that when it happens.
"We know this is wrong, but we'll deal with it when it happens" is a demonstrably expensive road. I can't think of many clearer cases of refusal to learn from history in computing.
Also, your suggestions are a bit strict, in my opinion. Every VM or interpreter I know about uses this kind of trick, usually to squish things like integers. Not doing this requires adding an extra allocation for every integer.
Well, I'm going from the processor reference manuals from Intel and AMD. Existing implementations are 48 bits, and future implementations WILL have more. I trust these folks to make good on their promise. (Similarly, the 68000 had 24 bits, and when the 68020 came out it had full 32 addressing, and an MMU, and badly written software broke).
There are pretty safe ways to do tagging; common techniques usually distinguish limited range integers and pointers with a low bit, and pointers often have extra bits since those are relatively safe to manipulate. I have not seen runtimes that use high-order bits for tags in quite some time, probably the late 80s; you haven't seen that software recently either, because it doesn't work anymore.
The x86 segmented architectures (which is all we had from Intel until the 386's flat address space came out) had several variations on pointers, basically a menu of terrible tradeoffs between fast and small code versus small and large address spaces. Far pointers were formed with a segment and an offset, and the game was all about making these multipart pointers efficient for things like comparison. In the face of pointer normalization, you mucked with high bits at your peril.
The one thing I'm sure of in this industry is that software is always going to be getting bigger. :-)
So to use these dirty tricks successfully, you have to mask out those bits before you dereference your pointer.
Seriously, with 64 bit addresses available on the iPhone I’m typing this into, this is an excellent trick, especially for applications. Even ARM’s TBI leaves 56 bits which is more address space than a data center’s DRAM.
log2(16 billion * 100,000) is about 50
See also: survivor bias.
OTOH DEC placed a bet on new and cool Alpha architecture, and died switching (bought by HP).
DEC was acquired by Compaq not HP. Compaq was later acquired by HP but before that happened Compaq sold the Alpha line to Intel because it would have competed with the Itanium. The Alpha was then licensed by Intel to the Chinese (according to Gustafson) and was used as the basis for the Sunway Taihu Light.
Can I help you with anything else?
I don't remember how many Mac applications were '32-bit clean' and ran without modification when VM was turned on. I do know that some apps had to do significant rework, and that there was a whole generation of abandonware where developers didn't think it was worth the effort and just walked away from their customers.
I imagine that the guy on the MacOS team who stuffed some desk accessory bits into the high byte of a pointer parameter thought he was being clever. Four or five years down the line he was paying for it, and it took years to fix his original moment of convenience. He could have added a second parameter on the call in question, or allocated another byte in a struct, and the original system would have been slightly bigger but he wouldn't have hatched a nightmare. Go read about '32-bit clean' Mac applications if you don't believe me. It sucked real hard.
A bunch of companies with a bunch of smart people have made the decision to use "unused" high bits in pointers and have later regretted it, to the tune of a ton of expensive remedial engineering. I see a lot of talk here along the lines of "b-b-b-but we know what we're doing" and "oh, 48 bits is enough, and if it's not then we'll deal with it later" and I'm thinking, "This is precisely how the software industry just never learns."
I figure that I have about 20 years left in my career writing software. I've seen address spaces grow from 16 bits (and we used to think "wow, that's BIG") to 64 bits [okay, 48 bits if you're still in denial :-) ] and wow, that's big. But I'll bet I'll see 64 bits generally considered kind of tight before I retire, and I'll bet there are at least half a dozen people on the planet who are up against a 64-bit wall today. (High probability at least a couple of those folks are on HN. Any hands?).
Sorry if this is a silly question but is the "upper" in these examples the most significant bit end of the address then?
(The world numbers these from 0 to N, least significant to most significant. IBM numbers them from 1 to 32, with 1 being most significant. I guess that made sense to accountants).
I'd add, keep things local. Don't access memory (or cache) outside core (L1 & L2), NUMA region or processor socket boundary unnecessarily.
Keep networking, GPU, etc. code in same NUMA region where the physical adapters are.
Use memory like tape, stream through. CPU branch predictors love that kind of access pattern.
Oh, and perhaps most importantly: use a profiler that can access CPU internal performance counters. Do this on different system types, from low power laptops to servers with 2 or more CPU sockets.
One annoying thing, though. Remember that the fastest thing in a microbenchmark might not be the fastest thing on a real system when different code modules fight for shared limited resources, like memory bandwidth, caches and inter-core communication links.
My brain's "branch predictor" had some sort of issue. CPU prefetchers of course. :-)
A discussion of systems programming optimization that doesn't start with single-writer ring buffers starts on the wrong foot.
Those other tricks are excellent, and I use all of them, in cases where they work at all. But, e.g., seeking a way not to need to take a lock at all should come before discovering a low-contention locking protocol.
Readers should note that packing spare bits into the bottom bits of suitably aligned pointers is more portable than using high bits. Any page-aligned pointer has at least 12 bits free at the bottom, and any malloc'd pointer has at least 2, more often 4.
Ring buffer counters into a power-of-2 sized buffer can be incremented without bound, enabling use of ordinary arithmetic on them, and high bits masked off cheaply on each use. [But use 64 bits!]
Probably the most neglected primitive data structure is the bitmapped set. A `uint32_t` gives you a universe of 32 elements; a byte is enough for the days of the week. The popcount native instruction is very valuable here, usually expressed as `__builtin_popcount` in source code. C++98's `std::bitset` provides Standard, portable access to it, but C++20 offers `std::popcount` directly.
[I add here that storing things in high bits of addresses is very likely to fail on systems with ASLR, and that I have learned MSVC bitsets have a very slow popcount.]
Bitsets are useful in more places than you might guess, because they can be used to filter out, with typically a single instruction, a majority of uninteresting cases in searches, leaving fewer cases to be examined with more precise and expensive operations. You record interesting properties of elements of a collection upon insertion as bits in a word stored alongside (or, better, in an index); and then first check the bits during any search. It is easy to speed up an amazing variety of programs by an order of magnitude, sometimes much more, with a tiny change.
For example, you can store an int value for each of a (possibly very large) collection of lines of text, representing the set of less-common letters that appear in the line. Searching for lines that contain a string, you first check it against the set for the string. Any line that doesn't have them all certainly doesn't have the string.
Leibniz (inventor of calculus) used this method very heavily in his own work. Before computers--and even into the 1960s--it was the most important way of automating data processing. Back then, you used cards that had a row of either holes or notches along one edge, and selected matching cards from a stack by inserting a rod through the stack at a selected spot, and lifting.
Hey, do you have a reference for that? I've been doing some research into Leibniz's calculators, and I've been finding few sources.
I think I found out about Leibniz's bitwise activities in Neal Stephenson's Quicksilver, but he invented the modern notions of both sets and digital logic, according to Wikipedia. He would have used the cards in catalogging libraries.
But they used to be essential for automating chess. A 64-bit word represents a board, and a set of words represents the positions of each type of piece, one word for white pawns, etc. Another word shows places threatened by the white pawns, and ORing them shows all the threatened places.
Just one nit/warning... breaking coarse locks into fine-grained locks can be taken too far. There is a point of diminishing returns where you end up spending increased time acquiring/releasing/waiting-for locks. At some point you want to clump together under a single lock resources that tend to often be used together, even if you often end up locking an extra resource or two unnecessarily. As always, benchmark workloads are your friends.
My favorite too-many-locks story goes back quite a long time, a former coworker is an old-time Unix guru, who was at Sequent back in the day when 8 CPU's was kind of a big deal. He spent about 9 months on a project putting lots of fine-grained locks into their Unix kernel. After shipping that, his next project was 6 months taking about 25% of them out :)
If your hashtable dynamically resizes though, (x % size) will use a full divide. You could keep the log2 of the size around instead, and rely on (x % (1U << size_shift)) being optimised (this works on gcc: https://godbolt.org/z/HbGnJH) but (x & (size - 1)) might be easier at that point.
I'm not saying never do that (ok, maybe I am...) But definitely think long and hard about how long your code will be around before you do it.
That's why you hide the trick behind a zero-cost abstraction which checks at compile-time if the platform supports this
† Offer only valid in the contiguous United States. Offer cannot be used in combination with any other offers. See stores for details.
‡ When not in use, return Happy Fun Ball™ to its protective lead case.
And if you're willing to check your optimizations by reading disassembly, tricks like stuffing tag bits into the bottom of aligned pointer values is pretty routine.
Cache thrashing is a thing. And perf analysis tools have bookkeeping errors.
What problems did lower bit use cause? You always had to mask it away even on the old boxes, no?
As for struct members, the alignment is (of course) "implementation defined", which is the fancy way of throwing up your hands and saying "whatever". (Since C++11 we actually have alignas(), which at least gives manual control)
For instance, jemalloc(3) on amd64 platforms will give back 16-byte aligned pointers:
[...] The allocated space is suitably aligned
(after possible pointer coercion) for storage of any type of object.
But that sort of trick is only worth it if you are writing a compiler and its runtime, an interpreter, a memory allocator (in particular GCs) or at the very last some sort of high performance library (you would return your "featured pointer" as an opaque type to the user).
Does any one know if this sentence can be backed up by a citation?
I know that the NaN-tagging trick assumes that pointers have 48 bits (small enough to fit inside a floating point mantissa), but was this ever a factor for deciding whether 5-level page tables should be added to the Linux kernel or not?
The issue listed was definitely a concern, but was worked around by having the kernel only allocate userspace linear addresses that aren't 48-bit-canonical in response to a mmap() call that supplies such an address as the hint argument. See the commit message in this commit for example: https://lore.kernel.org/patchwork/patch/796025/
So for a program to get an address above the 56-bit limit, its memory allocator has to specifically indicate to the kernel that it supports that.
On the locking side, the counterpart would probably be "lock-free algorithms", but I still don't believe their complexity means that in most cases you shouldn't look at those :)
Does this also apply when multiple processors are only reading memory?
Is LLVM not smart enough to optimize this?
However, for signed division, the C semantics (round towards zero) are different than the semantics when you apply an arithmetic shift (round towards negative infinity).
If you are fine with the latter behavior explicit shifts remove several extraneous instructions dealing with the difference.
But “high performance” is not a single dimension - if you say a bit more about what you care about, maybe another choice would fit.