
You Can't Always Hash Pointers in C (2016) - ScottWRobinson
https://nullprogram.com/blog/2016/05/30/
======
scraft
Slightly related (but only slightly), on the PS2 (games console, from Sony)
you could access main memory with three different pointers, cacheable,
uncached accelerated and uncached (which was the same base address with
different high bits set accordingly). Each mode had reasons to be used (for
example, when you read/write to memory which is cacheable and then fire off a
DMA transfer, the data being transferred may not, and often won't, match what
has been set in the program until a cache flush has been requested). As you can
imagine, this regularly resulted in hard-to-track-down bugs.

Anyway, the bottom line of this is: I have seen various pieces of code on
platforms where, when comparing memory pointers, you need to mask out several
bits (unless of course you actually want to differentiate between the
different access methods of the same piece of memory).

It isn't quite what this article is talking about, but I thought it was a
related piece of interesting trivia to those that don't already know ;)
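
A minimal sketch of the masking described above. The bit layout, the mask
value and the names (`SEGMENT_MASK`, `canonical_addr`, `same_memory`) are
hypothetical and not taken from any console SDK; a real platform defines its
own mode bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout: the top 4 bits of a 32-bit address select the
 * access mode (cached, uncached, ...); the low 28 bits name the location. */
#define SEGMENT_MASK UINT32_C(0xF0000000)

/* Strip the access-mode bits so aliases of one location compare equal. */
uint32_t canonical_addr(uint32_t addr)
{
    return addr & (uint32_t)~SEGMENT_MASK;
}

bool same_memory(uint32_t a, uint32_t b)
{
    return canonical_addr(a) == canonical_addr(b);
}

void canonical_addr_example(void)
{
    /* Same low 28 bits, different mode bits: the same memory. */
    assert(same_memory(UINT32_C(0x00100000), UINT32_C(0x20100000)));
    /* Different low bits: different memory. */
    assert(!same_memory(UINT32_C(0x00100000), UINT32_C(0x00100004)));
}
```

The same canonical value would be what you feed to a hash function if aliases
are meant to land in the same bucket.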

~~~
karmakaze
Great example to add to the discussion. The post does mention that pointers
may have some meta bits, which I fully accept, but without concrete examples I
wouldn't have given it much consideration.

Reminds me of the recent underspecification that was closed for some
signed/unsigned operation or other. Better for the spec, but I can't imagine
it mattered in practice.

If some unbelievable situation occurs on some unique platform/compiler
remember this post. In the meantime do as we have been with using pointer
values. Back before the System7 32bit clean code people put all sorts of stuff
in pointers because we could. We know the costs but also remember that it had
value too.

------
RcouF1uZ4gsC
One thing interesting to note is that C++ has a specialization for

template<class T> struct hash<T*>;

which is required to have the correct semantics.

Given that the underlying machine models of C and C++ are very similar, and
that the most widely used C compilers (gcc, clang, msvc) are also C++
compilers, in practice hashing C pointers is likely not an issue.

However, if you want to hash pointers in a way that is blessed by the
standard, you can probably do this

my_hash.cc

    
    
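        #include <cstddef>     // std::size_t
        #include <functional>  // std::hash
    
        // C-callable wrapper around the standard pointer hash.
        extern "C" {
          std::size_t hash(char* c) {
            std::hash<char*> hasher;
            return hasher(c);
          }
        }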
    
    

my_hash.h

    
    
        #include <stddef.h>  /* size_t */
        size_t hash(char* c);
    

EDIT: Added code if you wanted to be pedantic about it.

Then you just link your C program to my_hash.o and include my_hash.h

~~~
tlb
C++11 only requires that "A Hash is a function object for which the output
depends only on the input and has a very low probability of yielding the same
output given different input values" and that "the probability of h(a)==h(b)
for a!=b should approach 1.0/std::numeric_limits<std::size_t>::max()."

This allows a pointer to be larger than size_t, which might be the case on
some system with tagged pointers.

~~~
saagarjha
Wait, so std::hash::operator() doesn't have to return a size_t, or are you
just saying that pointers can have values "greater" than size_t and need to be
"trimmed" down to size?

~~~
asveikau
There are lots of places where size_t can't fit a full address. That is kind
of the point of adding intptr_t (if it were covered by another typedef, why
add it?) By the way, intptr_t is relatively recent (C99).

~~~
ajross
Sorry, what system are you thinking about where size_t is something other than
pointer sized? This is silly pedantry

~~~
asveikau
It's not as unlikely as you think. Use your imagination.

I seem to recall some late 90s Unix where size_t was 32 bits and pointers were
64 bits. I think it was Tru64? Seems to be backed up here:
[http://www.cecalc.ula.ve/documentacion/tutoriales/COMPAQC/DO...](http://www.cecalc.ula.ve/documentacion/tutoriales/COMPAQC/DOCU0016.HTM)
_In Compaq C, size_t is unsigned int ._

On 16-bit x86 it certainly is most natural to have size_t be 16 bits, but the
full addressable memory is 1 MB.

Again, if there were no ambiguity here there would be no need to introduce a
new typedef, but they did, in 1999.
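
To make the width concern concrete, here is a minimal sketch of hashing via
`uintptr_t` and folding the result into a `size_t` without silently dropping
high bits. `fold_pointer` is a made-up name. The sketch assumes `uintptr_t`
exists (it is optional in C99) and that equal pointers convert to equal
integers, which the standard does not promise but flat-memory platforms
provide in practice.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Fold a pointer's integer representation into a size_t by XORing
 * size_t-sized chunks together. */
size_t fold_pointer(const void *p)
{
    /* uintmax_t is at least as wide as both uintptr_t and size_t. */
    uintmax_t v = (uintptr_t)p;
    size_t h = 0;

    while (v != 0) {
        h ^= (size_t)v;  /* take the low size_t-width bits */
        /* Shift by the width of size_t in two halves, so the shift count
         * never reaches the width of v. */
        v >>= (sizeof(size_t) * CHAR_BIT) / 2;
        v >>= (sizeof(size_t) * CHAR_BIT) / 2;
    }
    return h;
}

void fold_pointer_example(void)
{
    int x;
    /* The one property a hash table needs: same pointer, same hash. */
    assert(fold_pointer(&x) == fold_pointer(&x));
}
```

Where `size_t` and `uintptr_t` have the same width, the loop runs at most once
and the function is just the cast.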

~~~
ajross
I'll grant the Alpha point, which seems like a clear and unfortunate bug.

But the DOS one is wrong. "Addressable" memory with segmentation may be 20
bits, but the maximum size of an object that can be referenced through any
pointer is 65536 bytes. It was possible to compile in a "huge mode" where all
pointers were expressed as base/offset, but in that case size_t was in fact a
compiler-generated 32 bit type.

------
dang
Discussed at the time:
[https://news.ycombinator.com/item?id=11805030](https://news.ycombinator.com/item?id=11805030)

------
delhanty
Generally, hashing pointers in C or C++ is a bad idea anyway.

Unless you have a particular love for debugging non-reproducible behavior,
that is.

~~~
delhanty
Clarification: I thought this was well known, but apparently it isn't, so I'll
explain it.

`malloc` doesn't have to give you the same pointer the next time the program
is run - and sometimes it doesn't.

In that case, the ordering of the hash table on iteration can change.

Particularly, release build `malloc` and debug `malloc` are likely to differ,
leading to bugs in production that don't reproduce when debugging.
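
A toy sketch of why iteration order follows the addresses. The bucket
function, the table size and the addresses are all made up for illustration;
plain integers stand in for whatever `malloc` would return on two different
runs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NBUCKETS 8u

/* Toy bucket selection: drop alignment bits, then reduce modulo table size. */
size_t bucket_of(uintptr_t addr)
{
    return (size_t)((addr >> 4) % NBUCKETS);
}

/* True if a's bucket is reached before b's when walking buckets 0..7. */
bool visited_first(uintptr_t a, uintptr_t b)
{
    return bucket_of(a) < bucket_of(b);
}

void two_runs_example(void)
{
    /* "Run 1": objects A and B at 0x1000 and 0x1010 -> buckets 0 and 1. */
    assert(visited_first(0x1000, 0x1010));   /* A is iterated first */
    /* "Run 2": same program, different addresses -> buckets 3 and 2. */
    assert(!visited_first(0x2030, 0x2020));  /* now B is iterated first */
}
```

Nothing in the program changed between the two "runs" except the addresses,
yet the order in which a bucket walk reaches A and B flipped.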

~~~
wutbrodo
> In that case, the ordering of the hash table on iteration can change.

Order-dependent iteration of a hash table's items isn't exactly a central use
case of the data structure and often isn't even guaranteed. I don't think that
this example is really sufficient to suggest that hashing pointers is a bad
idea (though it may be for other reasons).

~~~
TrinaryWorksToo
In golang it is purposefully randomized on every execution to prevent you from
relying on iteration order

~~~
delhanty
That's actually better than the situation in C.

In C, debug `malloc` may very well behave the same run after run.

Then you move into production, and release `malloc` behaves differently to
debug `malloc`.

Suddenly a latent bug that was there all along shows itself in release, but
fails to reproduce in debug.

With the randomized behavior you describe in golang, the bug probably will
show in debug and be caught.

------
Animats
With Microsoft going to "fat pointers" in their systems, some pointer
conversions are not going to work.

~~~
fulafel
I assume this is something else than the "far" pointers from the 16-bit days -
what is the new application of fat C pointers at MS?

~~~
cbHXBY1D
I think Animats is referring to checkedc [1] which I can assure you
Microsoft/MSVC is not moving towards.

[1] [https://www.microsoft.com/en-us/research/project/checked-c/](https://www.microsoft.com/en-us/research/project/checked-c/)

------
quotemstr
I don't like this genre of programming blog post. These things always amount
to premature optimization for portability. They point out that some technique
that's been used by millions of programs for decades might not work on every
single architecture that's technically compatible with the C standard. So
what? The techniques work, and because so many programs use these techniques,
they'll _keep_ working.

In practice, almost nobody targets esoteric environments. Avoiding useful and
well-known tools just for the sake of an environment that will never see your
program is just a needless self-imposed tax that's going to suck time from
other aspects of development. In my programs, I'll keep hashing uintptr_t
values.

~~~
spc476
While an architecture where two distinct pointers could point to the same
location is esoteric _today_ (minus `mmap()` tricks), go back 30 years and it
was not esoteric at all, but downright _the most popular system in existence!_
[1]. Who's to say something just as esoteric won't become mainstream? (although
I hope not---it was horrible).

[1] MS-DOS with its FAR pointers.

~~~
cnvogel
It's still non-esoteric today:

[http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc....](http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0439b/Behcjiic.html)

ARM Cortex-M4 supports addressing individual bits of memory locations (often
registers of hardware peripherals) through an additional memory location, so
that writing to or reading from that location accesses a single bit.

e.g., from the examples of said infocenter url:

    
    
       *(uint32_t*)0x20000000 |= (1<<7);
       *(uint32_t*)0x20000000 &= ~(1<<7);
    

should be equivalent to

    
    
       *(uint32_t*)0x2200001C = 1;
       *(uint32_t*)0x2200001C = 0;
    

Also Analog Sharc DSPs (which, I think, are still being sold with this
architecture and still used, although I last used them 10 years ago) alias
their memory four times, depending on whether you want to access it as 16bit,
32bit, 48 or 64bit data.
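
For the Cortex-M bit-band case, the alias address can be computed rather than
looked up. A minimal sketch for the SRAM region only; the function and macro
names are made up, while the base addresses and the mapping (32 alias bytes
per byte, 4 per bit) follow the ARM documentation linked above.

```c
#include <assert.h>
#include <stdint.h>

#define SRAM_BASE     UINT32_C(0x20000000)  /* start of the bit-band region */
#define SRAM_BB_ALIAS UINT32_C(0x22000000)  /* start of its alias region */

/* alias = alias_base + byte_offset * 32 + bit_number * 4 */
uint32_t bitband_alias(uint32_t byte_addr, unsigned bit)
{
    assert(bit < 8);  /* bit index within the addressed byte */
    assert(byte_addr >= SRAM_BASE &&
           byte_addr - SRAM_BASE < UINT32_C(0x100000));  /* 1 MB region */
    return SRAM_BB_ALIAS + (byte_addr - SRAM_BASE) * 32u + bit * 4u;
}
```

With this, `bitband_alias(0x20000000, 7)` gives `0x2200001C`, matching the
example: two different addresses, one underlying bit.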

~~~
Taniwha
I think you're missing 'volatile's in your casts

~~~
foxhill
why would you need that in this case?

~~~
saagarjha
Probably so the compiler doesn't just optimize out the statements, seeing as
they don't seem to have a visible effect on the program execution because
they're never read again.
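
A small sketch of what the `volatile` version might look like. `REG32`,
`set_bit` and `clear_bit` are hypothetical names, not from any vendor header.
The address is passed in so the helpers can also be tried on an ordinary
variable instead of real hardware.

```c
#include <assert.h>
#include <stdint.h>

/* Access a 32-bit location through a volatile-qualified lvalue, so the
 * compiler must perform every read and write that appears in the source. */
#define REG32(addr) (*(volatile uint32_t *)(addr))

void set_bit(uintptr_t addr, unsigned bit)
{
    assert(bit < 32);
    REG32(addr) |= (UINT32_C(1) << bit);
}

void clear_bit(uintptr_t addr, unsigned bit)
{
    assert(bit < 32);
    REG32(addr) &= ~(UINT32_C(1) << bit);
}
```

On a real part you would pass the register address, e.g.
`set_bit(0x20000000, 7)`.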

------
anfilt
Why would one need to hash pointers? Seems like a bad idea, and data structures
made of pointers, like trees, are pretty well behaved.

~~~
xamuel
Say you have an array of pointers and you want to take its distinct members.

For each pointer, you could look up its hash in a table. If it's not there,
you can be sure the pointer hasn't been seen before: append it to your result
and put its hash in the table. If the hash _is_ there, you learn nothing since
it could be a collision, and you'll need to do something else (like iterate
through your whole result so far). But if the original input has few distinct
members, this shouldn't happen often.
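
A minimal sketch of the scheme just described. The hash function, the table
size and all names are made up for illustration. It assumes equal pointers
convert to equal `uintptr_t` values (the very thing the article questions);
the final duplicate check uses plain `==` on pointers, which is always well
defined.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TABLE_BITS 10u
#define TABLE_SIZE (1u << TABLE_BITS)

/* Hypothetical hash: multiply by a large odd constant, keep the top bits. */
static size_t hash_ptr(const void *p)
{
    uint64_t v = (uint64_t)(uintptr_t)p;
    v *= UINT64_C(0x9E3779B97F4A7C15);
    return (size_t)(v >> (64 - TABLE_BITS));
}

/* Copies the distinct pointers of in[0..n) to out (capacity n), keeping
 * first-seen order. Returns how many were written. */
size_t distinct_pointers(void *const *in, size_t n, void **out)
{
    bool seen[TABLE_SIZE] = { false };  /* which hash values have appeared */
    size_t count = 0;

    assert(n == 0 || (in != NULL && out != NULL));
    for (size_t i = 0; i < n; i++) {
        size_t h = hash_ptr(in[i]);
        bool duplicate = false;

        if (seen[h]) {
            /* Hash already present: could be a collision, so fall back to
             * comparing against every pointer kept so far. */
            for (size_t j = 0; j < count && !duplicate; j++)
                duplicate = (out[j] == in[i]);
        }
        if (!duplicate) {
            out[count++] = in[i];
            seen[h] = true;
        }
    }
    return count;
}
```

The table stores only "this hash has been seen", so a hit never proves a
duplicate; only a miss proves the pointer is new.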

~~~
anfilt
Yeah, a list of pointers can be handy. However, if the code managing this list
is receiving duplicate pointers, this sounds like another problem. If it's
newly allocated memory, it should probably be submitted to this list on
creation. If we have some sort of sub-allocator, that sounds like a problem
with sub-allotments. I think there are probably more common ways to avoid the
need for hashing, and to avoid needing a linear search of your list.

