
Pointers Are Complicated, or: What's in a Byte? - kibwen
https://www.ralfj.de/blog/2018/07/24/pointers-and-bytes.html
======
kibwen
Note that this is the same author who worked on formally verifying Rust's type
system via RustBelt (https://www.ralfj.de/blog/2017/07/08/rustbelt.html),
which has managed to find a few bugs in the standard library
(https://www.ralfj.de/blog/2017/06/09/mutexguard-sync.html and
https://www.ralfj.de/blog/2018/07/13/arc-synchronization.html), and also
hopefully provides (by the nature of formal proofs) some small comfort
regarding the quantity of other latent bugs. :P

------
rwallace
> it is not clear what else to do – in our abstract machine, there is no
> single coherent “address space” that all allocations live in, that we could
> use to map every pointer to a distinct integer. Every allocation is just
> identified by an (unobservable) ID.

That's all very well if you assume malloc is a primitive. But in this model,
how can you _implement_ malloc? (And have it be definitely correct, rather
than just 'works for the moment because the compiler hasn't yet got around to
calling it undefined behavior and optimizing it out of existence'?)

~~~
ralfjung
This is an extremely good question, and I do not have a satisfying answer.
Note that `malloc` is special in the C standard as well, and AFAIK it is not
possible to implement `malloc` in standard C at all.

Heck, there are several models of C out there where you cannot implement
`memcpy`.
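
For concreteness, the textbook byte-by-byte `memcpy` looks like the sketch
below (`my_memcpy` is just an illustrative name). Whether copying the bytes of
a pointer one `unsigned char` at a time is guaranteed to preserve the pointer
value is exactly the kind of question those models answer differently:

    
    
      #include <stddef.h>
      
      /* Naive byte-wise copy. In some formal models of C, the individual
       * bytes of a pointer are "symbolic" and cannot be read out like
       * ordinary integers, which is why this loop may not be expressible
       * within the model. */
      void *my_memcpy(void *dst, const void *src, size_t n) {
          unsigned char *d = dst;
          const unsigned char *s = src;
          while (n--)
              *d++ = *s++;
          return dst;
      }
    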

~~~
mbel
> Note that `malloc` is special in the C standard as well, and AFAIK it is not
> possible to implement `malloc` in standard C at all.

malloc() is usually implemented in C [0], and to be honest I'm not sure why
you think it wouldn't be. Of course you need OS API calls to actually get some
memory (unless you want to operate on a statically defined memory buffer), but
it's still C.

[0] https://github.com/lattera/glibc/blob/master/malloc/malloc.c

~~~
TheCoelacanth
It's C, but it's not _standard_ C. It uses implementation-defined behavior.
Every compiler has some implementation-defined behavior because they would be
much less useful without it.

------
bumholio
The idea of redefining pointers from an {address} to an {offset, allocation_id}
tuple seems to me a very powerful solution to the problems discussed recently
in _"C's biggest mistake: conflating pointers with arrays"_ [1].

Instead of redefining arrays and introducing a new, incompatible syntax, we
redefine the compiler's representation of what a pointer is and get run-time
guarantees of correctness without modifying the source code of existing C and
C++ programs. For example, on a 64-bit machine a pointer can become a 128-bit
concatenation of two addresses:

    
    
      [ptr][alloc_id]
    

The [ptr] part is binary-equivalent to existing C pointers (the memory
address of the pointed-to object), while the alloc_id is another 64-bit
address: that of the allocation structure of which the object is a part. Here
is a "fat" pointer that indexes the 4th element of an array previously
allocated by malloc(8 * sizeof(object)):

    
    
      [ ][ ][ ][ ][ ][ ][ ][ ][malloc_data]
               ^______     ___^
                     |    |
                   [ptr][alloc_id]   = "fat pointer"
    
    

Defining the allocation ID as the address of the malloc data structure and
placing that at the end of the array gives us a very efficient way to make
sure a pointer increment (a very frequent operation and source of memory bugs)
still points inside the array, in a single assembly instruction: just CMP the
new [ptr] with [alloc_id]. (For other operations, like pointer arithmetic
with negative ints where [ptr] can go down, you would of course need to
dereference [alloc_id] and obtain the lower limit of the array stored
somewhere in the malloc_data structure.)

This solves a slew of problems:

- pointer arithmetic can now be performed only when it makes sense (identical
alloc_id)

- binary equality of pointers is a guarantee that they point to the same
object

- the compiler can implement additional memory-protection checks, based on the
desired performance/safety trade-off

- the cost in memory or stack space of doubling pointer size should not have a
significant impact on most real programs

- no syntax changes should be required

Since this cannot be a new idea, I would like someone more knowledgeable to
poke some holes in it. (A minimal sketch of the scheme in plain C follows the
reference below.)

[1] https://news.ycombinator.com/item?id=17585357
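
For illustration, a minimal sketch in plain C of the checked arithmetic
described above; all names here (`alloc_info`, `fat_ptr`, `fat_add`) are
hypothetical, and a real implementation would live in the compiler and runtime
rather than in user code:

    
    
      #include <stddef.h>
      
      /* Hypothetical allocation record; in the proposal above this would
       * live in the malloc bookkeeping data at the end of the block. */
      typedef struct {
          char *base;   /* lowest valid address of the allocation */
          char *end;    /* one past the highest valid address */
      } alloc_info;
      
      /* The 128-bit "fat pointer": [ptr][alloc_id]. */
      typedef struct {
          char *ptr;              /* ordinary C pointer value */
          alloc_info *alloc_id;   /* identifies the allocation */
      } fat_ptr;
      
      /* Checked pointer arithmetic: out-of-bounds results are NULLed so
       * that a later dereference traps, as suggested above. A one-past-
       * the-end pointer is still allowed, matching C's rules. */
      fat_ptr fat_add(fat_ptr p, ptrdiff_t delta) {
          char *q = p.ptr + delta;
          if (q < p.alloc_id->base || q > p.alloc_id->end)
              p.ptr = NULL;
          else
              p.ptr = q;
          return p;
      }
    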

~~~
pjmlp
In fact your idea is not new.

"Efficient Tagged Memory"

http://www.cl.cam.ac.uk/research/security/ctsrd/pdfs/201711-iccd2017-efficient-tags.pdf

"SPARC M7 Application Data Integrity"

https://swisdev.oracle.com/_files/What-Is-ADI.html

The problem with most C improvement solutions is mostly human, not technical.

~~~
bumholio
As I have said, I'm sure it's not a new concept. You provided some links to
hardware accelerators for somewhat related systems, and they imply software
solutions are unworkable performance-wise.

But it seems to me that is not the case for what I am proposing. For the most
part, working with pointers should generate almost the same assembly: the
[alloc_id] part is simply copied around verbatim and [ptr] is used as before.
A function that just dereferences pointer parameters will receive alloc_id in
the stack frame but will ignore it and not load it into registers. If the
pointer is duplicated, the alloc_id is copied along with it, somewhat
increasing stack pressure and reducing memory bandwidth, but certainly not a
100x slowdown. Modern processors are very good at parallelising these types of
loads and stores.

The performance impact should hit when doing pointer arithmetic and advanced
casting, depending on the degree of runtime assurance we want to offer. Every
address alteration is followed by a sanity check of the pointer against its
allocation segment. An out-of-bounds pointer can be NULLed to trigger a
subsequent run-time exception, in keeping with the standard, under which such
pointers mean undefined behavior.

In the case of some frequent operations, like incrementing, these checks can
be extremely fast. Not that hardware solutions would not help, but if such a
technique works there is certainly a class of programs that would accept even
the slowdown of a software solution. So I have to wonder if there is more to
this that I am not seeing, besides performance.

~~~
pjmlp
What you are missing is the human factor.

At CppCon 2014 (if I am not mistaken; too lazy to search for the exact year),
Herb Sutter asked the audience how many used some kind of analysis tooling.
About 1% of the audience said they did.

Joe Duffy has a remark almost at the end of his RustConf keynote where he
states that even with Midori running in front of the Windows team, they
weren't accepting it as possible.

At least on Solaris, regardless of its future, those protections are now on
for many executables.

https://blogs.oracle.com/solaris/default-memory-allocator-security-protections-using-silicon-secured-memory-ssm-adi

Likewise Google has been locking down what native code is allowed to do on
Android, including compiling everything with FORTIFY enabled.

It seems pressure must come from OS vendors for habits to change.

------
antirez
Here the author is constructing a theory based entirely on true statements,
but the mental model he builds, while philosophically valid, is IMHO not the
best choice for really addressing and simplifying these concepts, for the
purpose of further reasoning about those aspects of the languages. My POV is
that the interpretation of what a pointer is should be bound more closely to
the low-level load/store instructions and the address space of the process.
In this interpretation a pointer is just a number; the problem is that certain
programming statements in C/C++ that the programmer believes _will result_ in
a load/store operation will be disregarded by the compiler, in the name of a
set of rules that violate what should be the normal behavior. So I think that
a better mental model is to think of a program as a set of load/store
operations (from the POV of memory), and of how the compiler may decide to
disregard such operations depending on the context. I think that reasoning
like that also makes us point the finger in the right direction, which is the
ANSI C committee.

~~~
dbaupp
The model of "pointers are just a number" likely means that even "obvious"
optimizations like

    
    
      int i = 0;
      *p = 10;
      return i;
    
      // becomes
    
      *p = 10;
      return 0;
    

aren't valid, because p could "just be a number" pointing to i's stack slot.

~~~
dingo_bat
Can you explain this more? Surely the compiler can reason about p being able
to point to i's address.

~~~
dbaupp
p could be

    
    
      int *p = (int *) rand(); // an arbitrary integer forged into a pointer
    

or that code could be in a function

    
    
      int f(int *p)
    

which may be called from completely different compilation units (or even
across a dynamic library boundary) meaning there's no way for the compiler to
know what p is.
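
Putting the two together, a small sketch (hypothetical file names) of why
separate compilation forces the compiler to rely on the abstract machine's
rules rather than on knowing the number in p:

    
    
      // tu1.cpp -- compiled without seeing any callers
      int f(int *p) {
          int i = 0;
          *p = 10;    // C++'s rules say this cannot alias the fresh local i,
          return i;   // so the compiler may rewrite this to "return 0"
      }
      
      // tu2.cpp -- compiled separately, possibly in another library
      int f(int *);
      int g(int *some_ptr) {
          return f(some_ptr);   // the compiler of tu1.cpp never sees this
      }
    

If pointers were "just numbers", some_ptr could happen to equal the address
where f's local i ends up, and the "return 0" rewrite would be wrong; the
abstract-machine rules are what make it valid.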

------
mannykannot
This is an interesting discussion, but I get the impression that the author is
trying to reconcile two fundamentally irreconcilable ways of looking at
memory: on the one hand, the machine view of a single address space, and on
the other, the program view of distinct variables. I am not at all convinced
that there is a unifying viewpoint, at least if you want pointers in both
models to be castable to and from integers.

One can, of course, map {base, offset} pointers to the natural numbers (though
not necessarily Rust integer types? I don't know Rust) by considering the base
to be the most significant bits of the number and the offset the least, but
arithmetic on these numbers is not the same as pointer arithmetic in the
address-space model.

I do not follow the part, in the "what's in a byte" section, about a byte-by-
byte memcpy. Is the question "what is the first byte?" any more or less
problematical for a {base, offset} pointer than it is for a multi-byte
integer, where place value matters?

~~~
tom_mellior
> Is the question "what is the first byte?" any more or less problematical for
> a {base, offset} pointer than it is for a multi-byte integer, where place
> value matters?

Shifting and bit-masking integers are operations that make sense. Storing an
int in memory and loading back only one of its bytes is equivalent to loading
it all, then shifting and masking. It's meaningful, thus you do need to model
the bits faithfully.

In contrast, shifting and masking pointers is (with some exceptions) not
meaningful. Neither is it meaningful to store one of these in memory and to
load back only part of it. It makes sense to use a more abstract, symbolic
model for bytes of such values.
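
Concretely, for the integer case (a minimal sketch; it assumes a
little-endian machine for the concrete byte order):

    
    
      #include <assert.h>
      #include <stdint.h>
      #include <string.h>
      
      int main(void) {
          uint32_t x = 0x11223344;
          unsigned char b;
          memcpy(&b, &x, 1);          /* load back just one byte */
          assert(b == (x & 0xff));    /* same as shift-and-mask (LE) */
          /* There is no analogous arithmetic meaning for "the first
           * byte" of a pointer. */
          return 0;
      }
    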

For whatever it's worth, the author's solution for bytes is the same as
CompCert's:
https://github.com/AbsInt/CompCert/blob/3939a1ccfdb86795e9fdf5953489ddfee238152c/common/Memdata.v#L155

~~~
mannykannot
That's a good point. If I am following correctly, a memcpy of a byte from an
integer can be regarded as an arithmetical operation (in fact, a
multiplicative one), while, in general, there is no equivalent interpretation
for copying a byte of a pointer.

I guess one could make a similar argument for the bytes of a Unicode string.

While writing this, it occurred to me that the difficulty with casting
between pointers and integers is in going from integers to pointers, where
there is no unique mapping (for pointer to integer, base + offset is
meaningful, but it loses information).

------
kulu2002
I am extremely convinced by all the points in the article apart from the model
the author proposes, which I am not sure I quite get. I think the definition
of a pointer becomes more and more abstract as you move from the hardware
level to higher-level languages.

In languages like C/C++ (I am not aware of Rust), which lie in the middle
(higher than assembly, where a pointer is just a hardware register, and lower
than, say, Java, which has no pointers per se), the definition of a pointer is
not so clear.

The other source of confusion is the syntax of pointers in C/C++ and the fact
that they can be used interchangeably with arrays and manipulated
arithmetically.

Pointers are indeed complicated, and that's why standards like MISRA C /
autospice restrict the use of pointers to a great extent.

In the highest-level languages, like MATLAB for model-based development, the
inputs and outputs are treated in the form of arrays at the user level. When
autocode is generated from these models, compilers internally transform array
operations into pointer operations, essentially turning the array-based
source code into object code that's as efficient as the pointer version. The
advantage is that the pointer-oriented compiler-generated code is created in
a controlled fashion.

------
throwaway0255
I know meta comments are frowned upon here, but why is the “or:” in headlines
(borrowed from Dr. Strangelove, I presume) so popular among programmers?

I see it done at least once a week. Is it just because the movie is popular
among programmers, or is there something more to it?

~~~
chrismorgan
This form of subtitle goes back a _long_ way. I don’t know just how far, but I
do know that twelve of the fourteen major works by Gilbert & Sullivan took
such a form, beginning with “_Thespis_; or, _The Gods Grown Old_” in 1871.

Many of us have a fondness for comparatively old-fashioned approaches to
various things; double titles like that are one example, using em dashes
despite space-separated en dashes being more popular is another (though that
one varies by locale), and writing “_&c._” for “_et cetera_” is yet another
(there’s lots of fun history around the ampersand ligature).

~~~
irishsultan
Also, for example, "Frankenstein; or, The Modern Prometheus" by Mary Shelley,
which was published in 1818.

But actually it goes back even further, at least to the 17th century:
https://en.wikipedia.org/wiki/Twelfth_Night#/media/File:Twelfth_Night_F1.jpg

------
iforgotpassword
I know what the post is trying to say, and all the examples are fine, and when
learning C(++) you should be aware of all this, but I think the conclusion
that this shows pointers aren't integers is an odd choice. They still are, in
memory. It's all about the interpretation of those integers, and the shortcuts
compilers can take because the standard allows them to, as the author points
out. If you turn off all optimizations or go ahead and write a really simple
compiler yourself, you'll probably get exactly the results you'd "expect" from
those examples, at least on any platform that is not totally exotic. Also,
uninitialized bytes are still ordinary bytes; you just get a free rand() run.
:)

Yeah, maybe it's just me being weird.

~~~
GolDDranks
Uninitialized bytes can NOT be considered "ordinary bytes" on x86. This is
because of memory paging and copy-on-write. If memory is untouched, it is
allowed to map to different physical memory on different accesses, so the
contents are not only "unknown", they may change under your feet! You must
touch the memory page for this not to happen.

~~~
gmueckl
This cannot possibly be correct. A read access to a region without a mapped
page in the page table will cause the CPU to trigger a page fault exception.
The OS must then decide whether to map memory there (a one-time operation) or
abort the program. A second read from the same page must then lead to the same
mapping. There is no sane way for an operating system to switch out program
heap pages after they have been mapped.

~~~
ralfjung
This is looking at the wrong level of abstraction though. The _compiler_ will
already optimize your code in a way that multiple uses of the same
uninitialized value can produce different results.

Arguing about these assembly/CPU-level details is (unfortunately?) beside the
point when debating the semantics of languages like C, C++ or Rust.
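
A minimal sketch of the kind of program where this shows up; the behavior is
hypothetical in the sense that reading an uninitialized local is undefined
behavior, so the output depends entirely on the compiler and optimization
level:

    
    
      #include <stdio.h>
      
      int main(void) {
          int x;    /* never initialized */
          /* An optimizer may treat each use of x as an independent
           * "undef" value, so the branch and the print need not agree:
           * this can print a value >= 10 despite the check. */
          if (x < 10)
              printf("%d\n", x);
          return 0;
      }
    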

~~~
gmueckl
The grandparent already narrowed the discussion down to x86 behaviour, and I
responded to the details of that. This is definitely all outside the scope of
language specifications.

------
miket94
> auto x = new int[8]; auto y = new int[8];

> However, given how low-level a language C++ is, we can actually break this
> assumption by setting i to y-x. Since &x[i] is the same as x+i, this means
> we are actually writing 23 to &y[0].

This assumes that the address space of y starts at the end of the address
space of x? In other words, x+8 == y? Could these two allocated blocks of
memory ever not be contiguous?

~~~
vardump
> However, given how low-level a language C++ is, we can actually break this
> assumption by setting i to y-x. Since &x[i] is the same as x+i, this means
> we are actually writing 23 to &y[0].

> This assumes that the address space of y starts at the end of the address
> space of x?

No, it does not. Any values of x and y should "work", because y-x computes a
_difference_ between pointers.

> Could these two allocated blocks of memory ever not be contiguous?

The probability that they're _not_ contiguous is high, and it only gets
higher as the heap gets more fragmented over time.
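
For reference, the example under discussion, assembled into a complete program
(undefined behavior in standard C++, since y - x subtracts pointers into two
different allocations):

    
    
      #include <cstdio>
      
      int main() {
          auto x = new int[8];
          auto y = new int[8];
          auto i = y - x;   // UB: pointers into distinct allocations
          x[i] = 23;        // "is" &y[0] only if the compiler computes
                            // the subtraction as a plain address difference
          std::printf("%d\n", y[0]);
          delete[] x;
          delete[] y;
      }
    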

