
String.hashCode() is plenty unique - jxub
http://sigpwned.com/2018/08/10/string-hashcode-is-plenty-unique/
======
theclaw
There is no error in the original article other than how it’s phrased. I think
it intends to warn people not to trust String.hashCode() to be unique, with a
lot of examples.

That is good! Why criticise it? The proper use of hashCode() is to quickly
check whether strings might be equal before doing a full string comparison;
it’s meant for building hash tables.
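
A minimal sketch of that "cheap pre-check, then full compare" idea, using the well-known colliding pair "Aa" and "BB". The class and method names here are made up for illustration, and this is not how the JDK itself implements equals():

```java
import java.util.HashMap;
import java.util.Map;

class HashPrecheckDemo {
    // Hypothetical helper: compare hashes first, fall back to a full compare.
    static boolean sameString(String a, String b) {
        if (a.hashCode() != b.hashCode()) {
            return false; // different hashes: the strings cannot be equal
        }
        return a.equals(b); // same hash: they might be equal, so check fully
    }

    // Builds a map whose two keys share a hash code but are different strings.
    static Map<String, Integer> collidingMap() {
        Map<String, Integer> m = new HashMap<>();
        m.put("Aa", 1);
        m.put("BB", 2);
        return m;
    }

    public static void main(String[] args) {
        System.out.println("Aa".hashCode() == "BB".hashCode()); // true
        System.out.println(sameString("Aa", "BB"));             // false
        System.out.println(collidingMap().size());              // 2
    }
}
```

The map still keeps both entries because HashMap uses the hash only to pick a bucket and relies on equals() to tell keys apart.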

~~~
cm2187
Actually, does any hashing algorithm guarantee uniqueness? Of course it is
theoretically impossible to guarantee that (pigeonhole principle) but even
practically, can a collision occur on sha256 in real use? I understand the
hash is designed so that it is theoretically difficult to manufacture a
collision but surely a collision could occur in real life?

~~~
zokier
Perfect hash functions guarantee no collisions. Of course they have other
limitations, so you have to pick some tradeoffs.

~~~
speakeron
However perfect it is, it can only guarantee no collisions if the size of the
strings you're hashing is equal to or smaller than the hash size.

 _In this house, we obey the pigeonhole principle._

~~~
jcranmer
It can do better, if the input is not "every possible string." A perfect hash
of the strings representing 32-bit unsigned integers can compress strings up
to 10 bytes long into 4 bytes of data.
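
A rough sketch of what that could look like in Java, assuming the inputs are canonical decimal strings (no sign, no leading zeros) for values in the unsigned 32-bit range. The class and method names are invented for the example:

```java
class UnsignedIntPerfectHash {
    // Maps a canonical decimal string of an unsigned 32-bit integer to 4 bytes.
    // Assumption: no leading zeros, otherwise "7" and "007" would collide.
    static int hash(String s) {
        return Integer.parseUnsignedInt(s);
    }

    // The mapping is reversible on that input set, which is why it cannot collide.
    static String unhash(int h) {
        return Integer.toUnsignedString(h);
    }

    public static void main(String[] args) {
        String largest = "4294967295"; // 10 characters, the largest valid input
        int h = hash(largest);
        System.out.println(unhash(h).equals(largest)); // true
    }
}
```

It only works because the input set is restricted; feed it arbitrary strings and parseUnsignedInt throws a NumberFormatException.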

------
Waterluvian
Just my $0.02: there is nothing to be gained by nagging about someone's
punctuation and grammar. That doesn't win you an argument about programming.

~~~
sigpwned
This is the author. I didn't submit the article here, but I did happen to run
across this posting from my GA.

Thank you for the feedback. You're right.

It's one thing to point out errors constructively, but another thing entirely
to make fun. After all, English isn't everyone's first language, and I'd have
a tough time writing an article like this in Spanish, for example!

I've updated the article to remove that comment, although I still (gently)
point out the worst of the errors. I've also added a theoretical framework to
tighten up the argument a bit.

Thanks again for the feedback. There's no purpose in being nasty when the
point can be made another way.

------
Someone
_”resulted in 1 collision. A “fair” hash function would generate an expected
1.44 collisions over this data. String.hashCode() outperforms a fair hash
function significantly”_

I doubt that is significant, and you needn’t even look up the confidence
intervals. Think of it this way: if you ran one more experiment with similar
data, and that perfectly good hash were to average about 1.44 collisions
across the two experiments, it would have to have at least one experiment
with zero or one collision.

Also, string hashing has a few requirements that I think are more important
than having an optimal probability of collisions:

\- it has to work well on typical 16-bit text strings, where most of the time,
half the bytes are zero and most of the other bytes have only five or six
bits that vary (that’s why there are so many collisions in two character
strings: they are four bytes long, but, at best, differ in only about 10 bits)

\- it has to be fast.
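
To make the two-character case concrete, here is a small sketch that counts how many distinct hashCode() values the two-letter ASCII strings produce. The alphabet (A-Z, a-z) and the class name are my own choices for illustration, not something from the article:

```java
import java.util.HashSet;
import java.util.Set;

class TwoCharCollisions {
    // Assumed alphabet: upper- and lower-case ASCII letters only.
    static final String ALPHABET =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

    // Number of two-character strings over the alphabet.
    static int totalStrings() {
        return ALPHABET.length() * ALPHABET.length();
    }

    // Number of distinct String.hashCode() values among those strings.
    static int distinctHashes() {
        Set<Integer> seen = new HashSet<>();
        for (char a : ALPHABET.toCharArray()) {
            for (char b : ALPHABET.toCharArray()) {
                seen.add(("" + a + b).hashCode());
            }
        }
        return seen.size();
    }

    public static void main(String[] args) {
        System.out.println(totalStrings() + " strings, "
            + distinctHashes() + " distinct hash codes");
    }
}
```

For a two-character string the hash is just 31 * first + second, so any pair whose first letters differ by one and whose second letters differ by 31 in the opposite direction ("Aa" vs "BB") lands on the same value.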

------
iainmerrick
The original criticism seems perfectly valid (although maybe people are
reading too much into it).

If you have a 32-bit hash and less than 32 bits of input, it’s reasonable to
hope all hashes might be unique. 10x extra collisions on short strings does
seem pretty bad. And it’s unfortunate that this hash function can’t be
changed without breaking the language spec.

~~~
FartyMcFarter
> If you have a 32-bit hash and less than 32 bits of input, it’s reasonable to
> hope all hashes might be unique.

No it isn't. In fact, with 16 bits of input, the probability of a collision is
not far from 50% according to the birthday problem:

[https://en.m.wikipedia.org/wiki/Birthday_attack](https://en.m.wikipedia.org/wiki/Birthday_attack)
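
For anyone who wants to check the number, a quick sketch of the birthday calculation, assuming an idealised hash that spreads 2^16 distinct inputs uniformly at random over 2^32 values (the class and method names are hypothetical):

```java
class BirthdayEstimate {
    // Probability of at least one collision when n items are placed
    // uniformly at random into m buckets: 1 - prod_{i=1}^{n-1} (1 - i/m).
    // The product is accumulated in log space to avoid rounding drift.
    static double collisionProbability(long n, double m) {
        double logNoCollision = 0.0;
        for (long i = 1; i < n; i++) {
            logNoCollision += Math.log1p(-i / m);
        }
        return 1.0 - Math.exp(logNoCollision);
    }

    public static void main(String[] args) {
        long n = 1L << 16;          // every possible 16-bit input
        double m = Math.pow(2, 32); // size of a 32-bit hash space
        System.out.println(collisionProbability(n, m));
    }
}
```

Under that uniform-random assumption the result comes out at roughly 39%, so a collision is far from unlikely even with only 16 bits of input.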

~~~
TheCoelacanth
That's assuming random distribution of hashes.

I can trivially write a hash function that produces a unique 32-bit hash for
32-bit inputs. Just truncate the input to 32 bits. It's not a very good
hash function for longer inputs, of course.
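
Something like this sketch, assuming the 32-bit input is a Java string of exactly two UTF-16 chars (names are made up for the example):

```java
class PackTwoChars {
    // Packs two 16-bit chars into one 32-bit int. Distinct two-char strings
    // always get distinct values, because the mapping is reversible.
    static int pack(String s) {
        if (s.length() != 2) {
            throw new IllegalArgumentException("expected exactly two chars");
        }
        return (s.charAt(0) << 16) | s.charAt(1);
    }

    public static void main(String[] args) {
        // String.hashCode() collides on this pair; the packed value does not.
        System.out.println("Aa".hashCode() == "BB".hashCode()); // true
        System.out.println(pack("Aa") == pack("BB"));           // false
    }
}
```

The trade-off is that the values are not spread out at all: similar strings get nearby numbers, which is fine for uniqueness but not what you usually want from a general-purpose hash.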

~~~
FartyMcFarter
True - but I still don't think it's reasonable to assume that a hash function
behaves that way.

