That is good! Why criticise it? The proper use of hashCode() is for quickly checking whether two strings might be equal before doing a full string compare; it's meant for building hashtables.
The problem, and the error, is the claim that the collision demonstration with short strings somehow proves that String's hash is much worse than other hash functions, without testing any other hash function, and without understanding that the theoretical ideal predicts a lot of collisions on short strings.
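For reference, String's hash is the documented polynomial s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]. A minimal re-implementation (the helper name `hash` is mine) matches the built-in:

```java
public class StringHash {
    // Same formula the String.hashCode() javadoc specifies:
    // h = s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1], via Horner's rule.
    static int hash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(hash("abc"));      // 96354
        System.out.println("abc".hashCode()); // 96354 -- identical by construction
    }
}
```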
Suppose further that SHA-256 is pretty good, so the probability of a collision is close to that of random chance.
Then say there have been about 1e33 SHA-256 hashes generated ever. Looking at the probability table on Wikipedia for the birthday problem, in the row for 256 bits, interpolating gives a probability of 1e-10 that there has been a collision.
So there would be a chance of one in ten billion that some two hashes have collided over the last 30 years. Your piece of toast, perhaps, and my bitcoin digest.
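A quick sanity check on that interpolation, using the standard birthday approximation p ≈ n²/2^(b+1) (the 1e33 count is the assumption from above), lands at or below the one-in-ten-billion figure:

```java
public class BirthdayBound {
    public static void main(String[] args) {
        double n = 1e33;                         // assumed count of SHA-256 hashes ever computed
        double p = (n * n) / Math.pow(2, 257);   // p ≈ n^2 / 2^(b+1) with b = 256
        System.out.println(p);                   // on the order of 1e-12, i.e. at or below 1e-10
    }
}
```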
I think for practical purposes we're pretty safe.
In this house, we obey the pigeonhole principle.
What he should have praised instead is Java's hash table, which is the only one among the major programming languages that is both fast and secure. The others are mostly slow and insecure. Java chose a fast hash plus a secure collision strategy: counting collisions and switching to a tree under attack. Most others chose a random seed and a slow hash function, which are still DoS-unsafe while being 10x slower. Pure security theatre.
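To see that collision strategy in action (a sketch, not HashMap's internals): since Java 8, HashMap converts a bucket to a red-black tree once it grows past 8 entries, so even deliberately colliding keys stay correct and reasonably fast. "Aa" and "BB" famously share a hashCode, and so do all concatenations of them:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CollidingKeys {
    public static void main(String[] args) {
        // "Aa" and "BB" both hash to 2112; concatenations of equal-length
        // colliding strings collide too, giving 2^rounds colliding keys.
        List<String> keys = new ArrayList<>(List.of(""));
        for (int round = 0; round < 4; round++) {
            List<String> next = new ArrayList<>();
            for (String k : keys) {
                next.add(k + "Aa");
                next.add(k + "BB");
            }
            keys = next;                          // 16 keys, one shared hashCode
        }
        Map<String, Integer> map = new HashMap<>();
        for (int i = 0; i < keys.size(); i++) {
            map.put(keys.get(i), i);
        }
        // All 16 keys land in one bucket; past the treeify threshold the bucket
        // becomes a tree, and lookups remain correct:
        System.out.println(map.get(keys.get(5)));                              // prints 5
        System.out.println(keys.get(0).hashCode() == keys.get(15).hashCode()); // prints true
    }
}
```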
Thank you for the feedback. You're right.
It's one thing to point out errors constructively, but another thing entirely to make fun. After all, English isn't everyone's first language, and I'd have a tough time writing an article like this in Spanish, for example!
I've updated the article to remove that comment, although I still (gently) point out the worst of the errors. I've also added a theoretical framework to tighten up the argument a bit.
Thanks again for the feedback. There's no purpose in being nasty when the point can be made another way.
I doubt that is significant, and you needn't even look up the confidence intervals. Think of it this way: if you ran one more experiment with similar data, then for that perfectly good hash to average about 1.44 collisions across the two experiments, it has to have at least one experiment with zero or one collision.
Also, string hashing has a few requirements that I think are more important than having an optimal probability of collisions:
- it has to work well on typical 16-bit text strings, where most of the time half the bytes are zero and most of the other bytes have only five or six bits that vary (that's why there are so many collisions in two-character strings: they are four bytes long but, at best, differ in only about 10 bits)
- it has to be fast.
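The point about short strings is easy to verify: over printable ASCII there are 95 × 95 = 9025 two-character strings, but their hashCodes (c1*31 + c2) can only span the range 32*31+32 = 1024 through 126*31+126 = 4032, i.e. 3009 distinct values, so collisions are forced by counting alone:

```java
import java.util.HashSet;
import java.util.Set;

public class TwoCharCollisions {
    public static void main(String[] args) {
        // All two-character strings over printable ASCII (codes 32..126).
        Set<Integer> hashes = new HashSet<>();
        int count = 0;
        for (char a = 32; a <= 126; a++) {
            for (char b = 32; b <= 126; b++) {
                hashes.add(("" + a + b).hashCode());
                count++;
            }
        }
        System.out.println(count);          // prints 9025 strings...
        System.out.println(hashes.size());  // ...but only 3009 distinct hash values
    }
}
```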
If you have a 32-bit hash and less than 32 bits of input, it's reasonable to hope all hashes might be unique. 10x extra collisions on short strings does seem pretty bad. And it's unfortunate that this hash function can't be changed without breaking the language spec.
No, it isn't. In fact, even with only 16 bits of input (2^16 distinct values), a random 32-bit hash would produce a collision with probability not far from 50%, according to the birthday problem.
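Concretely, the birthday approximation p = 1 − exp(−n(n−1)/2^33) for n = 2^16 inputs into a truly random 32-bit hash gives roughly 39%, which is indeed "not far from 50%":

```java
public class RandomHashCollision {
    public static void main(String[] args) {
        // Birthday approximation for n inputs into a random 32-bit hash:
        // p = 1 - exp(-n(n-1) / 2^(32+1))
        double n = 65536.0;  // 2^16 distinct inputs
        double p = 1 - Math.exp(-n * (n - 1) / Math.pow(2, 33));
        System.out.println(p);  // ~0.39
    }
}
```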
I can trivially write a hash function that produces a unique 32-bit hash for 32-bit (or shorter) inputs: just truncate the input to 32 bits. It's not a very good hash function for longer inputs, of course.
Literally truncating the input wouldn’t be great in the likely case that your hash table is using only 10 bits (say) of the hash value, but I bet there are similar approaches that would work.
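One such "similar approach" (my suggestion, not necessarily what the commenter had in mind) is multiplicative a.k.a. Fibonacci hashing: multiplying by an odd 32-bit constant is a bijection on ints, so it stays collision-free for ≤32-bit inputs, yet it spreads entropy into the high bits, so taking the top 10 bits as the bucket index behaves far better than raw truncation:

```java
public class MixedTruncation {
    // Multiplying by an odd constant (here 2^32/phi, rounded to odd) is
    // invertible mod 2^32: no new collisions, but low input bits now
    // influence the high output bits.
    static int mix(int x) {
        return x * 0x9E3779B1;  // hypothetical helper for illustration
    }

    public static void main(String[] args) {
        // Consecutive keys, which raw truncation would put in adjacent
        // buckets (or the same one), scatter across 10-bit buckets:
        for (int i = 0; i < 4; i++) {
            System.out.println(mix(i) >>> 22);  // top 10 bits as bucket index
        }
    }
}
```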