
What’s the difference between an integer and a pointer? - luu
https://blog.regehr.org/archives/1621
======
cesarb
The Linux kernel uses a macro called RELOC_HIDE to stop gcc from seeing
through the conversion from pointer to integer in these cases:
[https://elixir.bootlin.com/linux/v4.18/source/include/linux/...](https://elixir.bootlin.com/linux/v4.18/source/include/linux/compiler-gcc.h#L50)

Edit: a good explanation of that macro is at
[http://lists.linuxcoding.com/kernel/2006-q3/msg17979.html](http://lists.linuxcoding.com/kernel/2006-q3/msg17979.html)
:

"The reason for it is that gcc assumes that if you add something on to the
address of a symbol, the resulting address is still inside the bounds of the
symbol, and do optimizations based on that. The RELOC_HIDE macro is designed
to prevent gcc knowing that the resulting pointer is obtained by adding an
offset to the address of a symbol. As far as gcc knows, the resulting pointer
could point to anything."
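
For illustration, a minimal sketch of how such a macro can hide the origin of a pointer from the optimizer. It is loosely modeled on the GCC variant linked above (the kernel's own definition may differ in detail) and relies on GNU C extensions (statement expressions, `__typeof__`, inline asm), so it assumes gcc or clang. The names `RELOC_HIDE_SKETCH`, `struct pair` and `second_via_hidden_offset` are made up for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* The empty asm statement makes the compiler treat hidden_addr as an
 * opaque value, so it can no longer assume that the result still points
 * inside the object that ptr was derived from. */
#define RELOC_HIDE_SKETCH(ptr, off)                              \
    ({                                                           \
        uintptr_t hidden_addr;                                   \
        __asm__("" : "=r"(hidden_addr) : "0"(ptr));              \
        (__typeof__(ptr))(hidden_addr + (off));                  \
    })

/* Hypothetical example type: two ints laid out back to back. */
struct pair {
    int first;
    int second;
};

/* Reaches `second` starting from the address of `first`, with the
 * offset applied behind the optimizer's back. */
int second_via_hidden_offset(struct pair *p)
{
    int *q = RELOC_HIDE_SKETCH(&p->first, offsetof(struct pair, second));
    return *q;
}
```

Calling `second_via_hidden_offset` on a `struct pair` initialized to `{1, 2}` reads the second member. The point is only that the compiler cannot tell where `q` came from, not that this is a pattern to use in ordinary C.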

~~~
chrisseaton
> gcc assumes that if you add something on to the address of a symbol, the
> resulting address is still inside the bounds of the symbol

How can one C object span two symbols in the first place for this assumption
to be invalid?

~~~
amluto
> How can one C object span two symbols in the first place for this assumption
> to be invalid?

Because Linux isn’t written in C. It’s written in C, assembler, and linker
script (.lds). In both assembly and linker script, one can lay out multiple
symbols with a defined relationship. Unfortunately, C doesn’t have a way to
say “give me a pointer to the object 10 bytes past the address p. Yes, I know
it’s valid and it is not the same object as p.”

~~~
User23
I believe for most, if not all, architectures the synchronization and memory
barrier primitives cannot be written in standard C.

~~~
jcranmer
That has not been true since C11.
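
A minimal sketch of what C11 added, assuming a toolchain that ships `<stdatomic.h>`: a release store paired with an acquire load, plus a standalone fence. The names `publish`, `try_consume` and `full_barrier` and the payload variable are invented for the example.

```c
#include <stdatomic.h>
#include <stdbool.h>

static int payload;          /* plain data guarded by the flag */
static atomic_bool ready;    /* publication flag, zero-initialized (false) */

/* Writer: store the data, then publish it with release ordering so the
 * payload write cannot be reordered after the flag store. */
void publish(int value)
{
    payload = value;
    atomic_store_explicit(&ready, true, memory_order_release);
}

/* Reader: an acquire load that observes `true` also makes the earlier
 * payload write visible. Returns false if nothing was published yet. */
bool try_consume(int *out)
{
    if (!atomic_load_explicit(&ready, memory_order_acquire))
        return false;
    *out = payload;
    return true;
}

/* A standalone sequentially consistent fence, for code that needs a
 * barrier without an accompanying atomic access. */
void full_barrier(void)
{
    atomic_thread_fence(memory_order_seq_cst);
}
```

In a real program `publish` and `try_consume` would run on different threads; the sketch only shows the primitives that standard C now provides.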

------
captainmuon
I miss the good old times where a pointer was just a number, and if you did
`*addr` you plainly read from that memory address. I'm not sure if the
standard was ever that way, but compilers sure used to behave like that.

Maybe we can one day get a `--std=simple-world` flag that forgoes a few
optimizations and makes UB code do the "obvious" thing.

~~~
api
It never was guaranteed by any standard, and in fact numerous old machines had
weird pointer formats: segmented addressing, strange widths (36 bits), and
tagged pointers come to mind.

It's more like we had this window of time when pointers were not weird and got
too used to it.

~~~
captainmuon
Sure, but back then the standard was not relevant. You had your x86 computer,
your compiler (MSVC, DJGPP, CodeWarrior, Borland...) and maybe a "how to write
C/C++" book. You did have weird things like segmented addressing, but for most
purposes, pointers were just memory addresses, and you could use casts to
reinterpret data as you wished. In fact, using bit-cast-magic and doing
pointer arithmetic were the preferred ways of doing things.

It's strange that we put so much effort into learning all this arcane
knowledge and now we're supposed to unlearn it...
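
As a point of comparison, a minimal sketch of the cast-free way to reinterpret bits that current compilers accept without complaint: copy the bytes with `memcpy` instead of dereferencing a cast pointer. It assumes `float` is 32 bits wide with the IEEE 754 binary32 layout, which holds on mainstream targets but is not required by the standard. The function names are made up.

```c
#include <stdint.h>
#include <string.h>

_Static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits");

/* Returns the object representation of a float as a 32-bit integer.
 * memcpy copies bytes, so no pointer of the wrong type is dereferenced. */
uint32_t float_bits(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

/* The inverse: build a float from a 32-bit pattern. */
float float_from_bits(uint32_t bits)
{
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

On an IEEE 754 target, `float_bits(1.0f)` is `0x3F800000`. Optimizing compilers typically turn the `memcpy` into a plain register move.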

~~~
wruza
I can’t tell for sure, since I was too young back then and didn’t write
anything complex enough, but I suspect that compilers were not evil genies
back then. It was probably later gcc that first used UB as an excuse for evil
optimizations.

I also don’t understand the reason for dragging all that legacy into current
standards. Does a real application of it make up even 0.1% of all usage? You
cannot even buy a chip that implements segments, tagged pointers, etc. At
least they could make a special mode where you can span across symbols,
whatever that means, or see a pointer as a CPU sees it.

All this can be solved with simple ptr_untag(p), ptr_span(expr) and similar
constructs, but instead we have to resort to outsmarting the specific compiler
logic or burdening ourselves with complicated type systems. It has gone insane.
Personally I just want my bytes in a linear address space and a way to tell
which bytes point to which and which do not. I liked asm and then C, they were
two of my first three languages, but what C has become today is just a horrible
mess.
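
For reference, a rough sketch of what tagging helpers look like in today's C, going through `uintptr_t`. No such constructs exist in the standard; `ptr_tag`, `ptr_tag_of` and `ptr_untag` are hypothetical names, the integer-to-pointer conversion is implementation-defined, and the sketch assumes the pointed-to objects are at least 4-byte aligned so the two low address bits are free.

```c
#include <stdint.h>

/* Two low bits, assumed to be zero in any suitably aligned address. */
#define PTR_TAG_MASK ((uintptr_t)0x3)

/* Packs a small tag into the low bits of an address. The result is kept
 * as an integer so nobody dereferences it by accident. */
uintptr_t ptr_tag(void *p, unsigned tag)
{
    return ((uintptr_t)p & ~PTR_TAG_MASK) | ((uintptr_t)tag & PTR_TAG_MASK);
}

/* Extracts the tag bits. */
unsigned ptr_tag_of(uintptr_t tagged)
{
    return (unsigned)(tagged & PTR_TAG_MASK);
}

/* Clears the tag bits and converts back to a usable pointer. */
void *ptr_untag(uintptr_t tagged)
{
    return (void *)(tagged & ~PTR_TAG_MASK);
}
```

This works on mainstream compilers in practice, but it is exactly the kind of integer round trip whose formal status the article is discussing.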

~~~
jcranmer
> You cannot even buy a chip that implements segments

Not only can you buy a chip that implements segments, the computer you wrote
that statement on probably has such a chip.

~~~
wruza
It is arm-based, so I believe it doesn’t. If you mean x86, then segment
registers are there, but they have not actually been used (except for fs-gs
utility cases) for almost a couple of decades, afaik. The segmented memory
model is simply slow, cumbersome and unnecessary in the presence of a decent
PMMU.

Edit: though strictly speaking I was obviously wrong on that, clarifications
are welcome.

~~~
slededit
While it may be “cumbersome” from the programmer’s perspective - it’s certainly
a lot faster than an MMU.

There’s no universe where traversing a page table (actually a tree) in memory
is faster than an offset and a bounds check.

~~~
wruza
What are you talking about? On x86, page tables have been cached in the TLB
since their introduction. No MMU at all (80286) means that you're subject to
fragmentation, and swapping segments is as expensive as a syscall since you
have to look up the descriptor through GDTR. x86 segments are scarce, expensive
and cumbersome resources invented to cover a bus width mismatch. Today it is
not even an issue. Even the TSS is shared, one per CPU; that is how useless and
slow its hardware part is.

We could imagine more segment registers and bigger descriptor tables, but that
would be just a poor man's manual TLB (manual always failed in CPUs). If you're
concerned with constant checks, you may order yourself a CPU with BOUND
instruction support and put it everywhere with the same result. Oh, it is
already in x86-64, never mind.

>actually a tree

It is a 2-level "tree", afair. Directory and table. Please do your homework
and research why the segmented model was kicked out of the software arena by
virtually everyone involved.
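
To make the two mechanisms being compared concrete, here is a toy model in C. It assumes classic 32-bit x86 paging without PAE (10-bit directory index, 10-bit table index, 12-bit offset into a 4 KiB page); x86-64 uses more levels. The struct and function names are invented, and nothing here models the TLB, which lets the hardware skip the table walk on a hit.

```c
#include <stdbool.h>
#include <stdint.h>

/* Segmentation: one compare against the limit and one add. */
struct segment {
    uint32_t base;
    uint32_t limit;   /* highest valid offset within the segment */
};

/* Translates a segment offset to a linear address, or fails the
 * bounds check. */
bool segment_translate(const struct segment *s, uint32_t offset,
                       uint32_t *linear)
{
    if (offset > s->limit)
        return false;
    *linear = s->base + offset;
    return true;
}

/* Paging: the linear address is split into the indices used to walk
 * the page directory and then the page table. */
uint32_t page_dir_index(uint32_t linear)   { return linear >> 22; }
uint32_t page_table_index(uint32_t linear) { return (linear >> 12) & 0x3FFu; }
uint32_t page_offset(uint32_t linear)      { return linear & 0xFFFu; }
```

For the linear address `0x00403025`, the directory index is 1, the table index is 3 and the page offset is `0x025`. On a TLB miss, each index costs one memory read.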

~~~
slededit
Spoken like a programmer that has never had to fix TLB misses as a bottleneck.
It’s actually quite common as the TLB is rather small. Not even enough for 2MB
worth of 4K pages.

You can use huge pages but it has all of the drawbacks of segments and none of
the benefits.

VMWare used to have a super fast 32-bit hypervisor based on segments long
before special instructions were added. This of course had to be reworked
completely for X64.

Also Intel’s bound check instructions are still extremely rare and don’t work
that well in practice. I’ve used them.

~~~
wruza
A recipe for fixing TLB misses (as with any cache misses) is simple: don’t
thrash your cache. Ofc I didn’t, I cannot even imagine what one does to
bottleneck on the TLB – LSD? It is one of these problems like “doc, if I turn
my finger 180, it hurts”.

>VMWare 32-bit before special instructions

DOS also was pretty fast, but that didn’t make it a good multi-user protected-
mode OS. All these early emulations and monkey patching of guests cannot
substitute hardware vt in the wild.

------
Upvoter33
Regehr and colleagues have been doing excellent work on undefined behavior for
some time, which has been fantastic.

I do personally wish the C standard was changed a bit to have less undefined
behavior, leaning towards getting programmer intent correct and not worrying
(as much) about getting every possible optimization into the generated code.

~~~
saagarjha
> I do personally wish the C standard was changed a bit to have less undefined
> behavior

Agreed. There are some particularly egregious ones where the behavior should
at the very least be implementation defined.

------
simias
I sort of see the point of the article but I think talking about differences
between pointers and integers obscures the real issue here: the problem as far
as I can tell is just that offsetting a pointer more than one element past the
end of the object it points to is UB.

TFA argues that converting to integers first sidesteps the issue but I'm not
convinced (although I'm not a C standard lawyer). Surely if you convert a
pointer to an integer, add an offset that's greater than the size of the
original object and then convert that to a pointer you trigger UB? How else
could this be defined? You can't really expect the compiler to figure out what
you're doing when you're casting "random" integers as pointers.

Also the article mentions Rust in its introduction but in Rust anything
surrounding pointers (including pointer arithmetic) is unsafe so I don't
think there's any conflict whatsoever here.

~~~
cjcole
Related discussion at internals.rust-lang.org:

[https://internals.rust-lang.org/t/pointers-are-complicated-o...](https://internals.rust-lang.org/t/pointers-are-complicated-or-whats-in-a-byte/8045)

"This summer, I will again work (amongst other things) on a “memory model” for
Rust/MIR. However, before I can talk about the ideas I have for this year, I
have to finally take the time and dispel the myth that “pointers are simple:
they are just integers”. Both parts of this statement are false, at least in
languages with unsafe features like Rust or C: Pointers are neither simple nor
(just) integers.

I also want to define a piece of the memory model that has to be fixed before
we can even talk about some of the more complex parts: Just what is the data
that is stored in memory? It is organized in bytes, the minimal addressable
unit and the smallest piece that can be accessed (at least on most platforms),
but what are the possible values of a byte? Again, it turns out “it’s just an
8-bit integer” does not actually work as the answer."

Ralf Jung is a coauthor of the "What’s the difference between an integer and a
pointer?" paper.

~~~
simias
Okay I can agree with that but then there's no need to look for contrived
examples, the C standard itself spells out a bunch of UB that arises from using
pointers as "just integers". It's probably worth repeating, especially for
people new to these languages, but it's not exactly breaking news. I guess I
just expected something else when I started reading this article.

------
dahart
> First, it is a big mistake to try to understand pointers in a programming
> language as if they follow the same rules as pointer-sized integer values.

I’m a bit confused by this statement. I don’t know the C & C++ standards, but
isn’t pointer arithmetic defined and allowed within the bounds of the heap
allocation of the base pointer? Like, on an array for example, aren’t pointers
required to “follow the same rules as pointer-sized integer values”, as long
as you’re not indexing outside your array allocation?

Wouldn’t the issue here be more clearly stated as “don’t use pointer
arithmetic to index outside the bounds of the base pointer’s allocation”
rather than “never use pointer arithmetic”?
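
A small sketch of the arithmetic that is defined: walking within one array, forming (but not dereferencing) the one-past-the-end pointer, and subtracting two pointers into the same array. The function names are made up for the example.

```c
#include <stddef.h>

/* Sums n ints by walking a pointer from the first element up to the
 * one-past-the-end position, which may be computed and compared but
 * never dereferenced. */
int sum_ints(const int *a, size_t n)
{
    const int *end = a + n;
    int total = 0;
    for (const int *p = a; p != end; ++p)
        total += *p;
    return total;
}

/* Index of an element within its own array. Subtraction is only
 * defined when both pointers refer to the same array object. */
ptrdiff_t index_of(const int *base, const int *elem)
{
    return elem - base;
}
```

Calling `index_of` with pointers into two different allocations is where the rules stop matching plain integer subtraction.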

~~~
camgunz
Yeah it's fine inside an array, but as soon as you move a pointer from one
allocation to another, boom.

Dunno what it means for stack pointers though, probably the same except now
you've made it so they can't be registers.

------
csours
A pointer is an integer with a shiv

------
aknoob
Conceptually, a pointer is a higher and more well-defined level of
abstraction than a raw integer, in much the same way that an iterator is a
higher and more well-defined level of abstraction than a raw pointer.

The fact that an iterator, a pointer and an integer all tend to be different
ways of looking at the same integer value is more or less an implementation
detail.

------
utopcell
This pointer manipulation is fine. GCC simply optimizes the parameters for
printf(): [https://godbolt.org/z/AHY7Nd](https://godbolt.org/z/AHY7Nd)

*x will be 7 in the end.

------
algesten
I can't replicate this example on my mac. Regardless of `diff` or 16 (which is
the diff for me), I get the same output for clang and gcc: 7 5 7

~~~
micro-ram
Agreed. I get the intended result of 7 5 7 on macOS 10.13.6 LLVM.

    
    
      Mac:tmp $ clang -O3 mem1d.c ; ./a.out
      diff = -96
      x = 0x7f9f4fc00300, *x = 7
      y = 0x7f9f4fc00360, *y = 5
      p = 0x7f9f4fc00300, *p = 7
    
      Mac:tmp $ diff mem1d.c mem2d.c 
      13c13
      <   int *p = (int *)(yi + diff);
      ---
      >   int *p = (int *)(yi - 96);
    
      Mac:tmp $ clang -O3 mem2d.c ; ./a.out
      diff = -96
      x = 0x7fc2b5c00300, *x = 7
      y = 0x7fc2b5c00360, *y = 5
      p = 0x7fc2b5c00300, *p = 7
    
      Mac:tmp$ clang --version
      Apple LLVM version 10.0.0 (clang-1000.11.45.2)
      Target: x86_64-apple-darwin17.7.0
      Thread model: posix

~~~
daveFNbuck
You only pasted a clang run here. It's gcc that's supposed to give 3 5 7.

------
tmalsburg2
One question that is not answered by the post: if the gcc-compiled program
does not overwrite *x, what does it overwrite instead?

------
paulddraper
> For example, they assume that a pointer derived from one heap allocation
> cannot alias a pointer derived from a different heap allocation.

Doesn't this depend on which allocations have been freed?
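
A quick way to probe this, as a sketch: capture each address as an integer while the pointer is still valid, free the first block, allocate again and compare the integers. Whether the address is reused depends entirely on the allocator, so no particular result is guaranteed. The function name is made up.

```c
#include <stdint.h>
#include <stdlib.h>

/* Returns 1 if a fresh allocation landed at the address of a block that
 * was just freed, 0 if it did not, and -1 if malloc failed. Addresses
 * are compared as integers taken while each pointer was still valid,
 * so no freed pointer value is ever inspected. */
int address_reused_after_free(size_t size)
{
    void *first = malloc(size);
    if (!first)
        return -1;
    uintptr_t first_addr = (uintptr_t)first;
    free(first);

    void *second = malloc(size);
    if (!second)
        return -1;
    uintptr_t second_addr = (uintptr_t)second;
    free(second);

    return first_addr == second_addr;
}
```

Many allocators do hand the same block straight back, in which case two pointers with different origins hold the same address even though only one of them was ever valid at a time.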

------
bluto
One is a number and the other is a dog often used when hunting

