
You Can't Always Hash Pointers in C - ingve
http://nullprogram.com/blog/2016/05/30/
======
kazinator
> _In other words, even in a conforming implementation, the same pointer might
> cast to two different integer values._

If you're porting to that type of system, you will be keenly aware of this.
The problem can be solved.

On the Intel 8086, every address has 4096 ways of referring to it by some
combination of segment:offset. A segment is 64 kilobytes wide, and the
segment part of the address is "paragraph" (16-byte block) aligned: 65536/16
= 4096. For instance, address 0x1234 can be referenced as 0123:0004, or
0122:0014, and so on.

However, there _is_ a unique underlying address.

A pointer value can be normalized (using nonportable code, of course) to use,
say, the largest possible segment value and the smallest possible offset. The
resulting value can then be hashed.
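
A minimal sketch of that normalization, simulated with plain integers (real 8086 code would extract the segment and offset from a far pointer by compiler-specific means; `physical` and `normalize` are invented names, and the real chip's 20-bit wraparound is ignored):

```c
#include <stdint.h>

/* Compute the unique linear address behind a segment:offset pair. */
uint32_t physical(uint16_t seg, uint16_t off)
{
    return ((uint32_t)seg << 4) + off;   /* 20-bit linear address */
}

/* Normalize to the largest possible segment and an offset in 0..15,
 * so that all aliases of one address compare (and hash) equal. */
void normalize(uint16_t *seg, uint16_t *off)
{
    uint32_t phys = physical(*seg, *off);
    *seg = (uint16_t)(phys >> 4);   /* paragraph-aligned segment */
    *off = (uint16_t)(phys & 0xF);  /* offset reduced to 0..15 */
}
```

After normalization, 0123:0004 and 0122:0014 both become 0123:0004, so the normalized pair (or the physical address itself) can be fed to the hash.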

Another way to deal with it on 8086 might be to use a "far" pointer, and
calculate its offset from a "far" null pointer:

    
    
       (char far *) ptr - (char far *) 0;
    

This should produce the physical address as an integer.

So that is to say, the concept of hashing an address is sound; only using
"pointer" to mean "address" is not maximally portable. An implementation of
pointer hashing can be ported to such architectures by some nonportable code,
if necessary.

If you care about such portability, wrap the "pointer to integer address"
logic in a function and implement that function as necessary.

------
gpderetta
Some Motorola 68000 machines automatically masked out some of the most
significant bits of any pointer. This means that two pointers to the same
object could have different representations. It did happen in practice, as
programmers used the highest bits to stuff metadata and relied on the
automatic masking. That made things interesting when the platform wanted to
increase the address space. Preventing this sort of backward compatibility
issue is the reason why AMD64 will trap if the unused bits of the address
are not all 1 or all 0.

Bottom line, the pointer to integer conversion should be specified by your
platform ABI. Read it, then either refuse to support your program on platforms
with insane ABIs or have a fallback path for them.

~~~
vidarh
Wasn't this the case with 68000-based designs in general? The CPU only had 24
address lines; I think the 68020 was the first with a full 32-bit address
bus.

And you're right, people absolutely did use it.

On the Amiga, I believe Amiga BASIC (developed by Microsoft) was perhaps the
most prominent application that allegedly did this (the only source I've
found is a comment by a former Commodore employee, Dr Peter Kittel), amongst
a massive number of other problems. People did get it running on 68020+
systems with a combination of patches and disabling fast RAM, but it was
pretty much forgotten from AmigaOS 2.x onwards anyway.

~~~
kazinator
If your program targets a 68000, and uses tags in the upper bits of pointers,
_and_ that program also hashes pointers, then of course you're going to have
to reconcile these two facts: your hashing function will have to strip out
the pointer tags.

Or perhaps not; it depends on whether the tags are immutable attributes. For
instance if they indicate type, and type is immutable, then they can perhaps
just be mixed into the hash. Under this design, the only way two pointers to
the same thing can have a different hash is if you have a use-after-free bug:
you're hashing a pointer to something that was de-allocated and re-allocated
as a new kind of object. Two objects which occupy the same address at
different times don't have to have the same hash, even if it is a pointer
hash.

Looked at in another way, those tag bits, if they are immutable, effectively
extend the address space. An object with tag 0101 and another one with 0110,
all other bits equal, can be considered to be in different spaces (with the
constraint that since they collide to the same physical address, they cannot
exist simultaneously).
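
The two options could be sketched like this, simulating 24-bit 68000 addressing with a 32-bit integer standing in for the tagged pointer (all names invented):

```c
#include <stdint.h>

#define ADDR_MASK 0x00FFFFFFu   /* low 24 bits: the real address lines */

/* Option 1: strip the tag, so every tagged view of one address hashes
 * alike regardless of what lives in the top 8 bits. */
uint32_t hash_stripped(uint32_t tagged)
{
    return (tagged & ADDR_MASK) * 2654435761u;  /* Knuth multiplicative */
}

/* Option 2: if the tag is an immutable type code, just mix it in; the
 * tag then effectively extends the address space, as described above. */
uint32_t hash_mixed(uint32_t tagged)
{
    return tagged * 2654435761u;
}
```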

~~~
vidarh
I'm not quite sure what you're replying to. I was not talking about hashing
pointers at all. Bare, raw, pointers on the M68000 are 32 bits, but the CPU
only has 24 address lines, and the top 8 bits are disregarded. But starting
with the 68020, the CPU has 32 address lines and the top 8 bits are not
disregarded.

The point is/was that relying on bits that are unused today is risky because
it has historically tended to ensure your software breaks when the next CPU
generation needs more of the possible address space.

Note specially that a lot of the apps that used the top 8 bits did _not_ use
it for type tags that might imply safe usage, but often used it to store
unrelated data.

E.g. let's say you had a data structure with a number of pointers and a number
of flags. You might very well decide to pack the structure so that the flags
overlapped the top 8 bits of the pointers.

------
gpderetta
This reminds me of a story.

When Stepanov was implementing the STL, he needed a strict weak ordering
among all pointers to implement things like std::map.

Instead of wasting time arguing within the committee over whether operator<
should have the required semantics, he simply added std::less as a primitive
and the default comparison for std::map. It is specified to call operator<
for every T, except that for pointers it relies on unspecified compiler
support for doing the right thing.

On pretty much all implementations it simply calls operator< even for
pointers.

~~~
tehrei
Oh, so then in C++ you can check whether pointer-to-T p is within T a[sz]; by
doing (a == p || std::less<T*>()(a, p)) && std::less<T*>()(p, a + sz)

~~~
gpderetta
Possibly not. The standard guarantees that less yields a total order for
pointers. But the ordering itself is unspecified, and not even guaranteed to
be consistent with operator< for pointers to elements of the same array.

------
pjc50
Another platform that doesn't have straightforward pointers: x86. 16-bit mode.
Because of the segmentation system, if you wanted a fully general pointer to
anywhere in the address space ("far") it had to be split into two registers.
Greater speed could be achieved with "near" pointers, but only within a 64k
segment. You could also very easily have non-identical pointers that pointed
to the same memory location.

Fragments of this mess survive in the Windows API, like Shelley's "vast and
trunkless legs of stone":
[https://blogs.msdn.microsoft.com/oldnewthing/20031125-00/?p=...](https://blogs.msdn.microsoft.com/oldnewthing/20031125-00/?p=41713)

~~~
userbinator
I think the 8051 is more interesting - it's a Harvard architecture with at
least 3 separate address spaces:

[http://www.keil.com/support/man/docs/c51/c51_le_ptrs.htm](http://www.keil.com/support/man/docs/c51/c51_le_ptrs.htm)

[http://www.keil.com/support/man/docs/c51/c51_le_ptrconversio...](http://www.keil.com/support/man/docs/c51/c51_le_ptrconversions.htm)

Just like x86, it's likely that everyone interacts daily with a system
containing at least 1 8051 core.

~~~
svens_
As someone who did a lot of embedded development, I wouldn't exactly call it
"more interesting". More like "that shit needs to die already".

The biggest problem with those is that modern compilers don't support them.
Well, open-source compilers don't, that is. IAR will happily sell you a
license for their IDE at $2.4k/seat. There is SDCC, but it's no comparison to
a modern GCC or the commercial offerings.

Unfortunately you're right though: 8051 processors are cheap and abundant.
Chip makers use it as the go-to architecture for any simple embedded
processing requirement. It's slowly changing, but ARM licensing costs are
still much higher. Let's hope that in a decade or so they'll use RISC-V
instead.

~~~
mjevans
RISC-V is indeed what I thought of, but hasn't MIPS also been free (enough)
and open for at least a decade? You'd think that on licensing alone it'd be
enough for a replacement in new products.

Edit: I am mistaken. It seems that though MIPS is used in all sorts of things
that I'd expect wanted the cheapest possible CPU core, it has only become free
for academic use. I wonder how common that misconception is among those not
active in the space?

~~~
svens_
Yes, MIPS is a cheap candidate for when you need a bit more computing power
than just a microcontroller.

Home routers usually have MIPS CPUs and run Linux on top of it. This wouldn't
be possible with the 8051.

------
jks
If you know that the strings are stored in the same table, couldn't you
subtract from each pointer a pointer to the beginning of the table and hash
the resulting offsets?

~~~
svens_
That's probably true; however, there's a small catch.

Looking at the standard (no difference between C99/C11), chapter 6.5.6
("additive operators"), point 9 defines the behavior of pointer subtraction:

    
    
      When two pointers are subtracted, both shall point to elements of the same array object,
      or one past the last element of the array object; the result is the difference of the
      subscripts of the two array elements.
    

However reading further:

    
    
      The size of the result is implementation-defined,
      and its type (a signed integer type) is ptrdiff_t defined in the <stddef.h> header.
      If the result is not representable in an object of that type, the behavior is undefined.
    

Unlike the (u)intptr_t types, ptrdiff_t is not optional, AFAIK. The way I
understand it, you could still invoke UB if the implementation uses a very
silly type for ptrdiff_t, though I haven't researched it further.

Still, using pointer differences makes this a lot safer and more portable,
especially on platforms with unusual memory models where a pointer might not
be just a single integer (e.g. segmented memory on x86). The operation is
also Θ(1), so it doesn't impact performance.
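
A sketch of the offset-based hash under discussion, assuming every string really does live inside one array object (`table` and `hash_string` are invented names):

```c
#include <stddef.h>
#include <stdint.h>

/* All strings are stored in this one table, so subtracting the base is
 * defined behavior (C11 6.5.6p9) and yields a ptrdiff_t to hash. */
static char table[1 << 16];

size_t hash_string(const char *s, size_t nbuckets)
{
    ptrdiff_t off = s - table;   /* same array object: well-defined */
    return ((size_t)off * 2654435761u) % nbuckets;
}
```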

------
cesarb
The post mentions only the C standard; I wonder what POSIX says about it.

And there are also the platform ABIs, which specify how a pointer is passed
between functions. If you're on a platform whose ABI doesn't allow
arbitrarily setting unused bits, even a "security-conscious implementation"
will have to leave them alone, since a pointer can be passed to a function
written in a different language. This also means that the "map the same object
at multiple virtual addresses" trick, in which the implementation flips bits
within the pointer knowing there is always a valid alias mapping for it, is
broken: suppose two pointers into the same object are passed (perhaps at
separate times) to code written in another language. From the point of view of
the other language, they might not appear to point into the same object, even
when they should (for instance, given an array and an element within it, the
function might want to compute the offset).

I also wonder how well the "map the same object at multiple virtual
addresses" trick works on architectures with VIVT caches, where both the
index and the tag are virtual addresses. On these architectures, if you write
through one mapping and want to access the same memory through another
mapping, you _have_ to flush the cache; otherwise you will get hard-to-debug
problems if both mappings are in the cache at the same time.

~~~
gpderetta
"I also wonder how well does the "map the same object at multiple virtual
addresses" trick work in architectures with VIVT caches"

Not well at all, unless the OS goes out of the way to do cache coloring. There
is a reason that VIVT are today considered suboptimal.

------
js8
Another platform: IBM zSeries has 24-bit and 31-bit addressing modes, so that
means an address stored in a 32-bit register has some of the highest bits
ignored when dereferenced. Especially in the past, people stored useful
information into those bits, so again, the two pointers are not necessarily
equal even though they can point to the same address.

~~~
apaprocki
Just like the 48-bit addressing mode of the x86-64 chips most people reading
this are using, along with the storage of useful information in bits 48-63 (JS
engines!).

~~~
qb45
If they store something there, they have to mask it out before dereferencing
the pointer. x86-64 doesn't ignore the topmost bits; if you set them wrong,
the CPU will throw an exception.
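
A sketch of that tag-and-mask discipline, assuming the meaningful pointer bits fit in the low 48 (typical for x86-64 user space; the helper names are invented, and round-tripping through uintptr_t like this is implementation-defined, not portable C):

```c
#include <stdint.h>

/* Stash a 16-bit tag in the top bits of a 64-bit pointer. */
static inline void *tag_ptr(void *p, uint16_t tag)
{
    return (void *)((uintptr_t)p | ((uintptr_t)tag << 48));
}

/* Must run before every dereference, or the CPU faults on the
 * non-canonical address. */
static inline void *untag_ptr(void *p)
{
    return (void *)((uintptr_t)p & ((1ull << 48) - 1));
}

static inline uint16_t get_tag(void *p)
{
    return (uint16_t)((uintptr_t)p >> 48);
}
```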

------
TazeTSchnitzel
A big problem for the Mill people has been getting LLVM to understand that
pointers on the Mill do not behave like normal integers.

~~~
Kristine1975
Video about using clang/LLVM for the Mill here:
[https://www.youtube.com/watch?v=QyzlEYspqTI](https://www.youtube.com/watch?v=QyzlEYspqTI)

------
tehrei
Interesting. What if I use memcpy and memcmp? As in,

    
    
        void foo(void *a, void *b) {
          char ca[sizeof(void*)];
          char cb[sizeof(void*)];
          memmove(ca, &a, sizeof(void*));
          memmove(cb, &b, sizeof(void*));
          assert((a == b) == (memcmp(ca, cb, sizeof(void*)) == 0));
        }

~~~
mikeash
That doesn't help. Your assert can be false. Consider, for example, a 32-bit
architecture which ignores the top 8 bits of the pointer (as was the case in
the old 68000 Macs, as discussed elsewhere in these comments). Consider two
pointers which differ only in a top bit, like 0x1000f000 and 0x0000f000. They
will compare equal, but the underlying bytes extracted with your memmove call
will differ.

And this isn't just a historical curiosity. ARM64 can optionally ignore the
top eight bits of its 64-bit pointers, for example.

------
ausjke
Very insightful. I never fully understood why pointers can't always be cast
to integers (not even to uintptr_t/intptr_t). What's the guidance here for
avoiding the cast?

~~~
ArkyBeagle
You'll need custom logic with in-depth knowledge of the mapping. _In many
cases_ the idiom mentions by "kazinator" above will often work - but not
always:

#define ptrToInt(x) ( ((char _)x) - ( (char_ )0 ) );

... and even that may require local tweaking.
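
A sketch that avoids the arithmetic on a null pointer entirely: on flat-memory platforms, `uintptr_t` (when available) already round-trips a pointer, and wrapping the conversion in a function gives one place to add segment normalization or tag masking on platforms that need it (`ptr_to_int` is an invented name):

```c
#include <stdint.h>

/* The one place to put platform-specific pointer-to-address logic. */
uintptr_t ptr_to_int(const void *p)
{
    return (uintptr_t)p;  /* swap in normalization/masking as needed */
}
```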

------
forrestthewoods
You also can't test if a pointer points to an element in an array without
undefined behavior.

C++14 5.6/6, on subtracting pointers: "Unless both pointers point to elements
of the same array object or one past the last element of the array object, the
behavior is undefined."

[http://stackoverflow.com/questions/31774683/is-pointer-
compa...](http://stackoverflow.com/questions/31774683/is-pointer-comparison-
undefined-or-unspecified-behavior-in-c)

I wish "undefined behavior" wasn't a thing in C/C++. It's not worth it. :(

~~~
_yosefk
I _think_ you can do that test by comparing pointers, you just can't subtract
them. (The SO answer you link to also points out that comparing and
subtracting are different, but I'm not sure that what I said follows.)

In practice, almost all C and C++ programs depend on both implementation-
defined and undefined behavior, and, modulo security implications, that's fine
as long as you test the program properly after compiling with a new compiler.
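
Strictly speaking, relational comparison of unrelated pointers is undefined in C too; the only fully defined membership test uses equality, which is valid for any pair of valid pointers. A sketch (`points_into` is an invented name; O(n), so rarely practical):

```c
#include <stdbool.h>
#include <stddef.h>

/* Test whether p equals the address of some element of arr[0..n],
 * using only ==, which is defined for any valid pointers. */
bool points_into(const void *p, const char *arr, size_t n)
{
    for (size_t i = 0; i <= n; i++)  /* <= n: one-past-the-end counts */
        if (p == (const void *)(arr + i))
            return true;
    return false;
}
```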

~~~
peteri
I'd guess that PAE on x86 systems might really hurt here. Basically any
system where the memory model is segment + offset will run into trouble with
pointer comparisons. Ever since the 8086 I've assumed that pointer math is
tricky and non-portable (see
[http://catb.org/jargon/html/V/vaxocentrism.html](http://catb.org/jargon/html/V/vaxocentrism.html)
for the '80s version of the same problem)

~~~
cesarb
PAE means you have longer physical addresses, but pointers are always virtual
addresses. That is, the difference is only visible in the layout of the page
tables.

Segment+offset is something else, but even on 32-bit x86 the segments were
almost always set to base 0, and AMD64 dropped most of the segment mechanism.
The exception on both is using segments to access thread-local data, but even
then it's used just as an offset into the same flat address space.

