
Memory Tagging and how it improves C/C++ memory safety - ingve
https://arxiv.org/abs/1802.09517
======
BeeOnRope
It seems pretty weird to me that a paper submitted in 2018 doesn't mention
Intel's MPX
([https://en.wikipedia.org/wiki/Intel_MPX](https://en.wikipedia.org/wiki/Intel_MPX)),
which tackles pretty much exactly the same problem, in hardware (and
requiring software cooperation).

I'm not (yet) taking a position on the relative merits of the
bounds-register-based MPX approach versus the tagged-pointer one presented
here, but the comparison should at least be made, if only to explain why MT
would succeed where MPX has not (MPX has hardly taken the world by storm).

Also, since they appear to be positioning this as something that could be fast
enough to run in production (with hardware assist), they should at least take
a shot at analyzing the attack surface presented by the probabilistic
protection: a figure like 99.6% chance of catching a rogue access makes sense
for testing scenarios, but if the adversary is an attacker, can they
manipulate those odds to their advantage? The exact tag allocation mechanism
becomes pretty important in this case.

------
maxxxxx
A long time ago I found some memory problems in a C++ codebase we had
inherited that crashed from time to time. On Windows you can protect memory
blocks with a function call, so I overloaded new and malloc and added some
protected blocks before and after the actual memory. Then we ran the code for
a while and it would fault right at the point of the memory overwrite instead
of some place later. It made the code very slow, but we found problems that
were incredibly hard to find any other way.
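
For illustration, here is a minimal sketch of that guard-page idea (my
reconstruction, not the original code), using the Win32 VirtualAlloc /
VirtualProtect calls; alignment handling and the matching operator delete are
omitted for brevity:

    #include <windows.h>
    #include <new>
    
    // Allocate whole pages and make the page just past the block
    // inaccessible, so an overrun faults at the offending access
    // instead of corrupting memory silently.
    void* operator new(size_t size) {
        SYSTEM_INFO si;
        GetSystemInfo(&si);
        const size_t page = si.dwPageSize;
        const size_t pages = (size + page - 1) / page;
        char* base = static_cast<char*>(VirtualAlloc(
            nullptr, (pages + 1) * page,
            MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE));
        if (!base) throw std::bad_alloc();
        DWORD old;
        VirtualProtect(base + pages * page, page, PAGE_NOACCESS, &old);
        // Place the object flush against the guard page so even an
        // off-by-one overrun trips the protection immediately.
        return base + pages * page - size;
    }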

~~~
jjnoakes
These days the gcc and clang memory/address sanitizers can do this quite
nicely, and they tend to have low overhead as well (in my limited experience
so far).
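
For reference, a tiny example of the kind of bug ASan reports at the exact
faulting line (compile with -fsanitize=address -g):

    #include <cstdlib>
    
    int main() {
        char* p = static_cast<char*>(std::malloc(16));
        p[16] = 1;  // heap-buffer-overflow: ASan aborts and points here
        std::free(p);
        return 0;
    }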

~~~
xxpor
When running a test suite at work with ASan + UBSan turned on, the test goes
from taking ~30 seconds to ~20 minutes. So there are certainly some
pathological cases.

~~~
martincmartin
Try it with just ASan. Some of the UBSan checks can be quite expensive.

Also, check how much free RAM your computer has when running with ASan. It
may be running out of RAM and, e.g., spending a lot of time looking for other
memory to free, or even swapping.

------
mhandley
Another great resource on hardware extensions for security, including
efficient tagged memory, is the CHERI work at Cambridge:

[http://www.cl.cam.ac.uk/research/security/ctsrd/cheri/](http://www.cl.cam.ac.uk/research/security/ctsrd/cheri/)

------
dis-sys
I've seen quite a lot of RISC-V discussion lately. At first I was confused
and asked myself why people are so interested in hardware that isn't ranked
anywhere near the top in performance. Now I get it: it is close to impossible
to evaluate new ideas/proposals like memory tagging on CPUs with closed
designs and tight patent "protections".

------
dahart
This somewhat reminds me of the Microsoft debugger convention of initializing
memory with various tags so that when you crash with a pointer bug, you can
tell what kind of memory was trod upon. They even made the hex codes semi-
mnemonic so it was easier to remember, e.g., 0xCD=clean, 0xDD=dead,
0xFD=fence, etc.

[https://stackoverflow.com/a/370362/1424242](https://stackoverflow.com/a/370362/1424242)

It's interesting to note that with the method in the paper, "Temporal and
spatial bugs are detected probabilistically" - and the probability depends on
how many tag bits you choose, which trades addressable space for more tags
(and a higher probability of catching errors).
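
Roughly, assuming tags are assigned uniformly at random and a wild access
lands on an independently tagged region, the chance of catching it is:

    P(caught) = 1 - 2^(-tag_bits)
    
    4 tag bits:  1 - 1/16  = 93.75%
    8 tag bits:  1 - 1/256 ≈ 99.6%  (the figure quoted upthread)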

> Memory safety in C and C++ remains largely unresolved.

I read the paper's description, and still don't quite understand exactly what
they mean by "memory safety". This is not referring to safer programming
practices like using the STL with iterators, correct?

~~~
masklinn
> This somewhat reminds me of the Microsoft debugger convention of
> initializing memory with various tags so that when you crash with a pointer
> bug, you can tell what kind of memory was trod upon. They even made the hex
> codes semi-mnemonic so it was easier to remember, e.g., 0xCD=clean,
> 0xDD=dead, 0xFD=fence, etc.

BSD's malloc.conf framework does something similar, though the BSDs have
managed to be incompatible with one another.

On FreeBSD, activating junking (default off) will fill allocated memory with
0xa5 and deallocated memory with 0x5a.

On OpenBSD, allocated memory uses 0xdb, deallocated memory uses 0xdf, junking
is default-enabled for "small chunks" & the start of freed pages (which are
also checked for write-after-free on a small delay).

~~~
jabl
... and on glibc (the usual libc on Linux), the MALLOC_PERTURB_ environment
variable can be used for similar stuff
([http://man7.org/linux/man-pages/man3/mallopt.3.html](http://man7.org/linux/man-pages/man3/mallopt.3.html)).
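
For example, a quick sketch of the effect on glibc (byte values per
mallopt(3): fresh allocations are filled with the complement of the perturb
value, freed memory with the value itself):

    #include <cstdio>
    #include <cstdlib>
    
    // Run as: MALLOC_PERTURB_=42 ./a.out
    int main() {
        unsigned char* p = static_cast<unsigned char*>(std::malloc(64));
        std::printf("fresh: %02x\n", p[32]);  // 0xd5 (~42) with perturb set
        std::free(p);
        // Reading freed memory is undefined behavior; done here purely to
        // show the fill pattern (the first bytes may be overwritten by
        // allocator metadata, so look past them).
        std::printf("freed: %02x\n", p[32]);  // 0x2a (42) with perturb set
        return 0;
    }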

------
nickpsecurity
Here's related work for those interested in the subject. Two old ones:

Burroughs Architecture
[https://www.smecc.org/The%20Architecture%20%20of%20the%20Bur...](https://www.smecc.org/The%20Architecture%20%20of%20the%20Burroughs%20B-5000.htm)

Capability-based Computer Systems (book)
[https://homes.cs.washington.edu/~levy/capabook/](https://homes.cs.washington.edu/~levy/capabook/)

Some new ones using a range of techniques:

SAFE Architecture
[http://www.crash-safe.org/papers.html](http://www.crash-safe.org/papers.html)

CHERI Architecture
[http://www.cl.cam.ac.uk/research/security/ctsrd/cheri/](http://www.cl.cam.ac.uk/research/security/ctsrd/cheri/)

Note: CHERI runs a modified version of FreeBSD on an FPGA board. Quite
practical if it can handle that!

Watchdog Architecture
[https://www.cs.rutgers.edu/~santosh.nagarakatte/softbound/](https://www.cs.rutgers.edu/~santosh.nagarakatte/softbound/)

Hardware-assisted Secure Allocator
[http://jin.ece.ufl.edu/papers/HASP17.pdf](http://jin.ece.ufl.edu/papers/HASP17.pdf)

------
flavio81
Memory tagging in hardware? Yay, they're implementing 1970s Lisp Machines
again!

~~~
rurban
It would be a godsend for GC, much better than malloc/free safety checks.
There are small boards with HW-assisted GC support, 2 bits per ptr, but no
big ones as in a lispm.

------
protomyth
Does anyone know how this relates to the way the Burroughs B5500 and B6500
operated?

~~~
gumby
This is very similar to the Burroughs approach that later influenced Multics
and the lispm architectures (and is available in some RISC-V versions).
Notably Intel used this in the unfortunate iapx 432 and cleverly built some of
this approach in their segmented memory model).

However, by the late 1970s it had been figured out that you could do some of
these tricks without dedicated hardware support (mask-boxing of integers, for
example: on processors that ignored unaligned access you could use the least
significant bits for tagging; on ones with address spaces smaller than the
word length you could often do the same with the upper bits). These bits
could be used by GC and/or for fast type systems, and by the mid 1980s this
had pretty much wiped out the advantage of dedicated hardware for Lisp
implementations (the lispms started out much faster at tag manipulation, but
the weight of work on the general-purpose machines caused the latter to get
better, and cheaper, much faster).
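
A minimal sketch of the low-bit trick (assuming allocations are at least
8-byte aligned, so the bottom three bits of a pointer are free to hold a
type tag):

    #include <cassert>
    #include <cstdint>
    
    constexpr uintptr_t TAG_MASK   = 0x7;  // low 3 bits, free by alignment
    constexpr uintptr_t TAG_FIXNUM = 0x1;  // example tag values
    constexpr uintptr_t TAG_CONS   = 0x2;
    
    inline uintptr_t tag_ptr(void* p, uintptr_t tag) {
        // Alignment guarantees the low bits are zero and thus reusable.
        assert((reinterpret_cast<uintptr_t>(p) & TAG_MASK) == 0);
        return reinterpret_cast<uintptr_t>(p) | tag;
    }
    
    inline void* untag_ptr(uintptr_t v) {
        return reinterpret_cast<void*>(v & ~TAG_MASK);  // strip before deref
    }
    
    inline uintptr_t tag_of(uintptr_t v) { return v & TAG_MASK; }

A GC or type dispatcher can then read tag_of(v) without touching memory at
all.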

The bigger advantage of hardware support IMHO is the security opportunity for
capability management.

------
Sean1708
Am I correct in thinking that this is a similar idea to page faults but per
allocation rather than per process?

~~~
DSMan195276
After skimming the paper, I think it's actually closer to segmentation. They
just split the address space into X number of segments. The allocator then
ensures every allocation lies completely within one of these segments and
attaches that segment's ID to the returned pointer. Fundamentally, paging and
segmentation are fairly similar though, so you're not wrong, there is just no
concept of a 'page-table' here.

That said, I feel like the paper is missing a few details on the
implementation. They give the example of a segment size of 64 bytes, and a tag
size of 4 bits. Obviously, that doesn't even cover a full 4K page of data. So
I presume the segments must repeat, meaning after segment 15 there is another
segment 0. This seems to fit into their probability description (An OOB access
_might_ hit the next segment with the same ID, but it's unlikely).

The approach still seems a bit limited though. I would assume objects larger
than 64 bytes would have to be untagged, since normally you would access them
via offsetting from the same base pointer, and that base pointer would always
have the same tag. But also, a big problem is that most OOB accesses are
likely to hit the region of memory just past the allocated region, so with
larger segments you are likely to miss many of these OOB accesses since the
tag is still correct.

All that said, I don't think this idea is mentioned in the paper, but you
could also vary the size of every set of tags in the address space. So have,
say, 20 sets of tags for 64-byte regions, then after that 20 sets of tags for
1024-byte regions, etc., so there is a mix of small and large regions.
There's no guarantee you would have the right amount of different-sized
regions, but you can always fall back to untagged memory if you need to, so
it is not a disaster if the numbers are a little off (besides, of course, the
fact that untagged memory loses the security advantages of the tagged
memory).

~~~
BeeOnRope
> After skimming the paper, I think it's actually closer to segmentation. They
> just split the address space into X number of segments. The allocator then
> ensures every allocation lies completely within one of these segments and
> attaches that segment's ID to the returned pointer.

It's sort of the other way around. Replacing "tags" with "colors" since
somehow it seems a bit more intuitive, the allocator itself just "colors"
different allocations with a different color, rather than splitting up the
address space beforehand. The returned pointers are also colored and a fault
occurs if any pointer is used to access memory of a different color.

> Obviously, that doesn't even cover a full 4K page of data. So I presume the
> segments must repeat.

Definitely - every color will be used many times, so they only
probabilistically catch any fault, based mostly on the total number of colors
(as long as the colors are randomized from run to run, it seems like any
fault would be quickly caught with high probability). They mention in the
paper improvements such as ensuring that adjacent regions always have
different colors, so that linear overruns are caught with 100% probability.

> I would assume objects larger than 64 bytes would have to be untagged,
> since normally you would access them via offsetting from the same base
> pointer, and that base pointer would always have the same tag.

Not at all. The 64 bytes is only the _granularity_ of the coloring (actually
they suggest 16 bytes is a better option): an arbitrarily large region can
have the same color.

Most of the rest of your post follows I think from that misunderstanding:
there is no need to change the color every 16 or 64 bytes (they call this
quantity TG) - you color the whole region with the same color. The TG value is
just the granularity at which it can change, so it's a tradeoff between
overhead for small allocations (favoring smaller TG) and total tag storage
cost (favoring larger TG).
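
To make that concrete, here is a toy software model of the coloring (my
sketch following the paper's general description, not its actual hardware
interface): one tag per TG-byte granule kept in a shadow array, the same tag
mirrored into the unused top byte of the returned pointer, and a check on
every access:

    #include <cassert>
    #include <cstdint>
    #include <cstdlib>
    
    constexpr size_t HEAP_SIZE = 1 << 20;
    constexpr size_t TG = 16;               // tag granularity in bytes
    static uint8_t heap[HEAP_SIZE];
    static uint8_t shadow[HEAP_SIZE / TG];  // one 4-bit tag per granule
    
    // Color the whole region [offset, offset+size) with one random tag
    // and return a "pointer" carrying that tag in its top byte.
    uintptr_t color_alloc(size_t offset, size_t size) {
        uint8_t tag = std::rand() & 0xF;
        for (size_t g = offset / TG; g * TG < offset + size; ++g)
            shadow[g] = tag;
        return (static_cast<uintptr_t>(tag) << 56) | offset;
    }
    
    // Every load/store compares the pointer's tag to the shadow tag.
    uint8_t* checked_access(uintptr_t p) {
        uint8_t tag = static_cast<uint8_t>(p >> 56);
        size_t offset = p & ((uintptr_t{1} << 56) - 1);
        assert(shadow[offset / TG] == tag && "tag mismatch: invalid access");
        return &heap[offset];
    }

An OOB or use-after-free access only goes undetected if it happens to land
on memory that currently carries the same color, which is where the
probabilistic detection comes from.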

~~~
DSMan195276
Reading it again, I guess I missed the "shadow memory" concept they talked
about, but they only mention it five times and none of those mentions is in
the actual explanation of how they do the memory tagging. They do mention it
one more time in the AArch64 implementation though, so it does look like
you're correct on this. I don't understand why they didn't make this detail
much clearer, since like you've pointed out it changes a lot about how things
work.

Now, if that is the case, it complicates the setup a fair amount, since you
now need a global 'page-table'-like structure holding this information that
the allocator configures. And from reading the paper, don't they intend for
the kernel to do this checking as well on userspace pointers, meaning the
kernel also needs access to this information? It's doable, and better
overall, but more complicated.

~~~
BeeOnRope
Yes, the memory->tag mapping has to be stored somewhere, but that's true even
for a segmented system, right? You need some place to store the segment offset
and length: either some dedicated registers if you only allow a few of them,
or some in-memory tables for more general schemes, right?

Yeah they don't really get into how the shadow memory is managed. If you only
had limited or no hardware support, it seems like it would be tricky,
especially when you are arbitrarily mapping various chunks of the address
space. The ideal for performance is that there is some straightforward
relationship between a pointed-to region and the associated tag bits, e.g.,
just shift the pointer down by lg(TG) bits and then index into a single
contiguous region which stores the tag bits, pointed to by some global.

That avoids the need to have any kind of page-table like structure, but it
would waste a lot of virtual address space if the range of the pointers was
very large or very sparse.
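
In code, that direct-mapped lookup is just a shift and a load (a sketch;
shadow_base is a hypothetical global, not anything named in the paper):

    #include <cstdint>
    
    // Hypothetical global set up by the allocator/runtime to point at one
    // contiguous shadow region covering the whole address range in use.
    static uint8_t* shadow_base;
    
    constexpr unsigned LG_TG = 4;  // lg(16) for 16-byte granularity
    
    // Finding the tag for any address is one shift and one load.
    inline uint8_t tag_for(uintptr_t addr) {
        return shadow_base[addr >> LG_TG];
    }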

If you had hardware support though, maybe this problem goes away: the
hardware could store the bits in "holes" right alongside the actual memory,
similar to how ECC works (they suggest SPARC does this), or it could include
the information in the TLB alongside the regular mappings (I guess this would
need kernel support too).

I didn't see much about kernel-side checks in the paper? Maybe I missed it.
The kernel already checks that user pointers are valid in the context of the
process that provided them, but of course that is a very coarse check (can the
user process as a whole read/write here?), and could still violate semantic
memory safety of the process depending on what was passed.

~~~
DSMan195276
> Yes, the memory->tag mapping has to be stored somewhere, but that's true
> even for a segmented system, right? You need some place to store the segment
> offset and length: either some dedicated registers if you only allow a few
> of them, or some in-memory tables for more general schemes, right?

Not if your segments/tags are static and defined beforehand, which is what I
thought they were going for (since that doesn't require any extra memory).
But obviously that approach is pretty limited, which was my thinking.

> Yeah they don't really get into how the shadow memory is managed. If you
> only had limited or no hardware support, it seems like it would be tricky,
> especially when you are arbitrarily mapping various chunks of the address
> space. The ideal for performance is that there is some straightforward
> relationship between a pointed-to region and the associated tag bits, e.g.,
> just shift the pointer down by lg(TG) bits and then index into a single
> contiguous region which stores the tag bits, pointed to by some global.

This is exactly what I'm thinking as well. If the platform has paging
support, in theory this is cheaper than it seems because (besides using up
tons of virtual memory) the array can be sparsely allocated, with only the
regions actually in use taking up physical memory. But a flat array at
64-byte granularity is still really big; it might be necessary to use a
multi-level approach to make the size a bit saner. But since it is just
virtual memory, depending on the architecture it may not matter.

Storing it in the page-table mappings themselves is an interesting idea. Of
course, some architectures have more 'free' space in there than others. And
at that point you're talking about likely 4K granularity, not 64 bytes.
Still better than nothing, of course, but it would definitely decrease the
effectiveness for smaller objects.

> I didn't see much about kernel-side checks in the paper? Maybe I missed it.
> The kernel already checks that user pointers are valid in the context of the
> process that provided them, but of course that is a very coarse check (can
> the user process as a whole read/write here?), and could still violate
> semantic memory safety of the process depending on what was passed.

Someone else asked about this as well; it seems I was led astray. The paper
has this line about ASan:

    Does not currently detect when the kernel accesses invalid user-space memory

From reading that, I thought they were going to use their memory tagging
system to do such checking (and in theory you could, as long as the kernel
knew about the mappings). But they never actually bring this up again in the
paper.

~~~
BeeOnRope
> This is exactly what I'm thinking as well. If the platform has paging
> support, in theory this is cheaper than it seems because (besides using up
> tons of virtual memory) the array can be sparsely allocated, with only the
> regions actually in use taking up physical memory.

Yes, exactly. The worst case still seems like a doubling of (actual) memory
use though: imagine a very sparsely accessed region of memory. Even today
each access would bring in a whole 4K page (on x86, for example), but now
you'd bring in a whole 4K shadow page as well (yes, the access has to be
really sparse to get this 1:1 ratio: with 4-bit tags at 64-byte granularity,
one 4K shadow page covers 512KB of data).

> And at that point you're talking about likely 4K granularity, not 64
> bytes. Still better than nothing, of course, but it would definitely
> decrease the effectiveness for smaller objects.

I think a 4K granularity is next to useless - you already get that from the
hardware/OS paging support. If you want to detect problems at the 4K level,
you only need to change your allocator to mmap a page at a time, which some
allocators already let you do via tunables.

~~~
DSMan195276
> Yes, exactly. The worst case still seems like a doubling of (actual)
> memory use though: imagine a very sparsely accessed region of memory. Even
> today each access would bring in a whole 4K page (on x86, for example), but
> now you'd bring in a whole 4K shadow page as well (yes, the access has to
> be really sparse to get this 1:1 ratio).

In theory you should be partially saved by the allocator, just because it is
going to try to place as much as it can into every page it allocates, as
well as allocate multiple pages at once (so they all end up in a row). An
allocator that was aware of the 'shadow' pages should be able to allocate
chunks of pages that correspond to a single 'shadow' page, ensuring you
avoid creating tons of them. The higher alignment requirements mean more
wasted memory though, meaning you need to allocate more pages and then need
more 'shadow' pages overall.

The paper suggests anywhere between 14% and 18% RAM overhead (discounting
Android apps, which ranged from 6% to 42%) for 64-byte granularity, and that
the overhead actually goes down with a smaller granularity like 16 bytes,
due to the lower alignment requirement.

And of course, you're also talking about doubling memory accesses, assuming
this is done in software: every access has to first check the 'shadow' page.
Unfortunately both of their implementations are at least partially
hardware-assisted, so I'm not sure you can extrapolate those numbers to a
pure-CPU implementation, but they suggest around a 2x slowdown.

> I think a 4K granularity is next to useless - you already get that from
> the hardware/OS paging support. If you want to detect problems at the 4K
> level, you only need to change your allocator to mmap a page at a time,
> which some allocators already let you do via tunables.

I would agree. The tagging system can in theory prevent a higher percentage
of OOB accesses at the page level, but considering the size of the address
space vs. the number of allocations (assuming a 64-bit/48-bit address
space), I think the likelihood of an OOB access actually hitting another
separate allocation (assuming mostly random placement) is already pretty
low. The tagging system would make it lower (you now have to hit a random
allocation with the correct tag), but not significantly lower than before,
and at the cost of performance.

