There's a document in there about pointer masking: https://github.com/riscv/risc...

pjmlp · on March 12, 2022

Fixing C, hardware memory tagging is the ultimate mitigation strategy for pointer tricks.

It has already been used successfully for decades on Solaris SPARC; iOS/macOS and Android are increasingly pushing for it on ARM CPUs; Pluton on Azure Sphere OS,...

snek_case · on March 13, 2022

I found this post on ARM MTE which was helpful in understanding the concept: https://www.anandtech.com/show/16759/sponsored-post-keep-you...

Seems to me this will have an execution overhead though, and that the best way to improve security would be to finally move beyond C. Most modern languages make buffer overflows impossible.

pjmlp · on March 13, 2022

Except all those fine people writing UNIX clones and embedded stuff will never do it, so here we are.

It was already known since the early days how bad C was versus the competition.

UNIX made it famous, UNIX won the server room wars, UNIX will keep it going.

olliej · on March 12, 2022

Memory tagging isn't a privilege-level thing; it's an anti-compromise mechanism similar to PAC (similar in the sense that the goal is to make it harder for an attacker to compromise code; functionally they are completely different).

The basic idea is that you often want finer than page-level granularity on memory access rights. An example ARM gives in the MTE documentation is an allocator: with memory tagging you can make unallocated memory in the allocator inaccessible.

Essentially every piece of memory gets a tag, and you can only access a piece of memory through a pointer that has the matching tag. To illustrate, imagine an allocator (the example ARM uses in the MTE documentation).

The allocator has a bunch of memory, all of it initially set to be tagless (uncolored in ARM terminology, IIRC):

    |bbbbbbbbbb|

When your allocator allocates a byte it does the following:

1. Find a free block
2. Choose a tag (randomly if it wants)
3. Set the tag on that memory to the selected tag from (2)
4. Return a pointer to that memory tagged with (2)

So we get something like:

    |1bbbbbbbbb|

    p = (1,0) // pointer with a tag of 1 and the address 0

Now any access to the memory at address 0 must be via a pointer with the tag 1, and any memory accessed via that pointer must be tagged with 1.

So imagine you have a bunch of allocations

    |13251bbbbb|

You can see we've re-used a tag, because there is a finite amount of space for tags in a pointer, so while our original allocation was a 1 byte allocation at 0, we can do p[4] and the access will work. However, if we're choosing the tags randomly, an attacker is in theory unlikely to luck out and get the correct tag, so a bad access crashes the process (it's super important for these mechanisms that any failure results in an unstoppable crash, e.g. no signal handlers or anything). Another thing your allocator does is revert memory to being untagged (or I guess tagged distinctly) on free, so a use after free also cannot work.
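The walkthrough above can be sketched as a small software simulation. This is only a toy model, not how MTE hardware or any real allocator is implemented: the names `TaggedHeap`, `TagFault` and `UNTAGGED` are made up for illustration, tags are kept per byte to match the diagrams, and a Python exception stands in for the crash.

```python
import secrets

UNTAGGED = 0  # assumption of this sketch: tag 0 marks free memory and is never handed out


class TagFault(Exception):
    """Stands in for the unrecoverable crash that a tag mismatch should cause."""


class TaggedHeap:
    def __init__(self, size, tag_bits=4):
        self.data = bytearray(size)
        self.tags = [UNTAGGED] * size  # one tag per byte, as in the diagrams
        self.tag_count = 1 << tag_bits

    def _check(self, ptr, offset):
        # Every access compares the pointer's tag with the memory's tag.
        tag, addr = ptr
        target = addr + offset
        in_range = 0 <= target < len(self.tags)
        if tag == UNTAGGED or not in_range or self.tags[target] != tag:
            raise TagFault(f"tag mismatch at address {target}")
        return target

    def alloc(self, length=1):
        for start in range(len(self.tags) - length + 1):
            block = range(start, start + length)
            if all(self.tags[i] == UNTAGGED for i in block):  # 1. find a free block
                tag = 1 + secrets.randbelow(self.tag_count - 1)  # 2. choose a random tag
                for i in block:
                    self.tags[i] = tag  # 3. tag the memory
                return (tag, start)  # 4. return the tagged pointer
        raise MemoryError("no free block")

    def free(self, ptr, length=1):
        for offset in range(length):
            target = self._check(ptr, offset)
            self.tags[target] = UNTAGGED  # stale pointers now fault on use

    def load(self, ptr, offset=0):
        return self.data[self._check(ptr, offset)]

    def store(self, ptr, value, offset=0):
        self.data[self._check(ptr, offset)] = value
```

With this model, `p = heap.alloc()` followed by `heap.load(p, offset=1)` raises `TagFault` unless the neighbouring byte happens to carry the same tag, and any `load` through `p` after `heap.free(p)` raises as well.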

In reality the tagging is not per byte because that would be insane: MTE significantly increases the physical RAM requirements of a system. If you have an N-bit tag, that means you need N extra bits in the physical RAM for every granule. I don't know what sort of granule sizes people are looking at, but the physical RAM required relative to the usable RAM is literally (granule size in bits + bits for tag)/(granule size in bits), so you can see how significant this is.

Unlike PAC, my understanding is there is no cryptographic logic linking the tag to the pointer, so pointer arithmetic continues to work without overhead, whereas in a PAC model p += 1, say, would be: temp = AUTH(p), temp = temp + 1, p = SIGN(temp).
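To make the contrast concrete, here is a rough sketch with both schemes modelled on 64-bit integers. The PAC part is a toy: it uses a truncated HMAC where real hardware uses its own cipher, the 48-bit address width and the key are assumptions, and `pac_sign`, `pac_auth` and `pac_add` are hypothetical helper names rather than real instructions.

```python
import hashlib
import hmac

ADDR_BITS = 48  # assumed virtual address width for this sketch
ADDR_MASK = (1 << ADDR_BITS) - 1
TAG_SHIFT = 56  # MTE keeps its 4-bit tag in pointer bits 59:56
KEY = b"hypothetical-key"  # toy stand-in; a real PAC key is not readable by the process


def mte_pointer(tag, addr):
    return (tag << TAG_SHIFT) | addr


def mte_tag(ptr):
    return (ptr >> TAG_SHIFT) & 0xF


def pac_sign(addr):
    mac = hmac.new(KEY, addr.to_bytes(8, "little"), hashlib.sha256).digest()
    signature = int.from_bytes(mac[:2], "little")  # truncate to the 16 spare bits
    return (signature << ADDR_BITS) | addr


def pac_auth(signed):
    addr = signed & ADDR_MASK
    if pac_sign(addr) != signed:
        raise ValueError("pointer authentication failed")
    return addr


def pac_add(signed, n):
    temp = pac_auth(signed)  # temp = AUTH(p)
    temp = temp + n          # temp = temp + n
    return pac_sign(temp)    # p = SIGN(temp)


# Tagged pointer: plain arithmetic, the tag bits are untouched.
p = mte_pointer(0x3, 0x1000) + 1

# Signed pointer: the signature covers the address, so it has to be redone.
q = pac_add(pac_sign(0x1000), 1)
```

In this model a plain `pac_sign(0x1000) + 1` changes the address without updating the signature, so `pac_auth` would almost certainly reject it, which is why the authenticate/modify/re-sign dance is needed.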

The purpose of PAC is not to protect the memory, but rather the pointer itself. For example imagine you have a C++ object, the basic layout is essentially:

    struct Object {
        void* vtable;      // pointer to the table of virtual function pointers
        // ... data fields ...
    };

For those unfamiliar, a vtable is essentially just a list of function pointers to support polymorphism. In this case the vtable pointer is tagged with the appropriate tag for wherever the vtable is. Because the vtable itself is stored in tagged memory it can't be modified by the attacker (in reality vtables are all in read-only memory, but pretend they're not for this example). But if the attacker can get some random, correctly tagged pointer, what they can do is build their own vtable in that memory, and then simply overwrite the vtable pointer with their correctly tagged pointer for the malicious vtable. Of course you can just have the memory holding the object itself also be tagged, so they need the correct pointer tagging for that :D

In the PAC model the pointer is signed by a secret key (it's literally inaccessible to the process) and a nonce (on Mac + iOS this nonce includes the address of the vtable pointer itself). For an attacker to create a valid pointer they need to be able to generate the correct signature over the bits in the pointer and the nonce. Because different nonces are used for pointers in different uses, they can't just get (for example) one object to overwrite another. If the nonce includes the address of the pointer they can't even just copy a validly signed pointer from another location in memory.
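A rough sketch of why the nonce matters, again as a toy model rather than the real mechanism. It keeps a full HMAC next to the pointer instead of squeezing a truncated signature into the pointer's unused bits, and the key, the `VTABLE_USE` and `OTHER_USE` discriminators and the addresses are all made up for illustration.

```python
import hashlib
import hmac

KEY = b"hypothetical-per-process-key"  # the process itself would never see the real key
VTABLE_USE = 0x1234  # hypothetical discriminator for "this is a vtable pointer"
OTHER_USE = 0x4321   # hypothetical discriminator for some other kind of pointer


def sign(target, storage_addr, use):
    # The nonce mixes in what the pointer is used for and where it is stored.
    nonce = storage_addr.to_bytes(8, "little") + use.to_bytes(2, "little")
    message = target.to_bytes(8, "little") + nonce
    signature = hmac.new(KEY, message, hashlib.sha256).digest()
    return (target, signature)


def auth(signed, storage_addr, use):
    target, signature = signed
    expected = sign(target, storage_addr, use)[1]
    if not hmac.compare_digest(signature, expected):
        raise ValueError("pointer authentication failed")
    return target


def is_valid(signed, storage_addr, use):
    try:
        auth(signed, storage_addr, use)
        return True
    except ValueError:
        return False


# A vtable pointer to 0x5000, stored in an object at address 0x1000.
vptr = sign(0x5000, storage_addr=0x1000, use=VTABLE_USE)

ok_in_place = is_valid(vptr, 0x1000, VTABLE_USE)  # loaded from where it was signed
ok_if_copied = is_valid(vptr, 0x2000, VTABLE_USE)  # copied into another object
ok_other_use = is_valid(vptr, 0x1000, OTHER_USE)   # reinterpreted as another pointer kind
```

Only the first check passes: the same signed value copied to a different address, or presented for a different use, no longer authenticates.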

I really do like the PAC model a lot, but to me the MTE mechanism seems to be a much stronger protection mechanism, albeit a very expensive one (PAC doesn't require additional ram for the signed pointers).

my123 · on March 12, 2022

Arm MTE uses a 4-bit tag for each 16-byte region.

olliej · on March 12, 2022

Which would eat a little more than 3% of the physical memory in a device.
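The arithmetic behind that figure, as a quick check. The 4-bit tag and 16-byte granule come from the comment above; the 8 GiB device is just a hypothetical example.

```python
from fractions import Fraction


def tag_overhead(tag_bits, granule_bytes):
    """Extra physical RAM needed for tags, as a fraction of the data RAM."""
    return Fraction(tag_bits, granule_bytes * 8)


mte = tag_overhead(tag_bits=4, granule_bytes=16)
print(mte)                  # 1/32
print(f"{float(mte):.3%}")  # 3.125%

data_mib = 8 * 1024         # hypothetical device with 8 GiB of RAM
print(data_mib * mte)       # 256 (MiB of tag storage)
```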

Does ARM allow any freedom in tag size, or is it strictly 4 bits?

I realize I may not have been clear for people unfamiliar with MTE*: tagging is at the device level, so you can't (for example) put the tags in a separate mapping and just increase your usage of existing memory by 3% (obviously a software implementation could do that, but the perf would probably be suboptimal :D ). You literally need X% more DRAM cells.

* Not saying @my123 doesn't understand, just I can't edit my original comment and I figure contextually this is reasonable :D

my123 · on March 12, 2022

Strictly 4 bits. For the Morello prototype architecture with full CHERI, it's 1 bit for each 16-byte region (the capability valid bit).

saagarjha · on March 12, 2022

Of course, CHERI faces very different challenges than MTE does ;)