String.hashCode() is plenty unique (sigpwned.com)
52 points by jxub 3 months ago | 22 comments

There is no error in the original article other than how it’s phrased. I think it intends to warn people not to trust String.hashCode() to be unique, with a lot of examples.

That is good! Why criticise it? The proper use of hashCode() is for quickly checking whether strings might be equal before doing a full string compare; it's meant for building hashtables.
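To make the "quick check before a full compare" idea concrete, here is a minimal sketch (the helper name `possiblyEqual` is made up for illustration): differing hash codes prove inequality cheaply, while equal hash codes still require the full equals() check.

```java
// Sketch: hashCode() as a cheap pre-filter before the expensive equals().
public class HashPrecheck {
    // Hypothetical helper: returns false fast when the hashes differ.
    static boolean possiblyEqual(String a, String b) {
        // Different hash codes guarantee the strings differ ...
        if (a.hashCode() != b.hashCode()) return false;
        // ... but equal hash codes only mean they *might* be equal.
        return a.equals(b);
    }

    public static void main(String[] args) {
        System.out.println(possiblyEqual("hello", "hello")); // true
        System.out.println(possiblyEqual("hello", "world")); // false
        // "Aa" and "BB" both hash to 2112, so here the full equals()
        // check is what finally rejects the pair.
        System.out.println(possiblyEqual("Aa", "BB"));       // false
    }
}
```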

Your interpretation is generous since the original article says explicitly: “some hashes are worse than others. You can expect String to be fairly poor.”

The problem / error is the claim that the collision demonstration with short strings somehow proves that String is much worse than other hash functions, without testing any other hash functions, and without understanding that the theoretical ideal predicts a lot of collisions with short strings.

Actually, does any hashing algorithm guarantee uniqueness? Of course it is theoretically impossible to guarantee that (pigeonhole principle), but even practically, can a collision occur with sha256 in real use? I understand the hash is designed so that it is computationally difficult to manufacture a collision, but surely a collision could occur in real life?

Suppose there are 100 billion computers in the world, including phones, smart watches, toasters, etc. Suppose each of those can perform at the level of a state of the art specialized mining ASIC, 10 million million hashes per second, and they've all been generating 256-bit hashes constantly for 30 years, or a billion seconds.

Suppose further that sha256 is pretty good and the probability of a collision is close to that of random chance.

Then there have been 1e33 sha256 hashes generated ever. Looking at the probability table on Wikipedia for the birthday problem [0] and the row for 256 bits, interpolating it gives a probability of 1e-10 that there has been a collision.

So there would be a chance of one in ten billion that some two hashes have collided over the last 30 years. Your piece of toast, perhaps, and my bitcoin digest.
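The arithmetic above can be sanity-checked with the standard birthday approximation p ≈ n²/2^(b+1), assuming sha256 behaves like a uniform random 256-bit function. The crude quadratic estimate lands a bit below the interpolated table value, but in the same "never going to happen" territory:

```java
// Back-of-the-envelope check of the numbers above, assuming sha256 is
// close to a uniform random 256-bit hash.
public class BirthdayEstimate {
    public static void main(String[] args) {
        double n = 1e11 * 1e13 * 1e9;        // devices * hashes/sec * seconds = 1e33
        double p = n * n / Math.pow(2, 257); // birthday approximation n^2 / 2^(b+1)
        System.out.println(p);               // on the order of 1e-12 to 1e-10
    }
}
```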

I think for practical purposes we're pretty safe.


Impressively, my Facebook interviewer did not recognize the validity of a nearly identical explanation for using sha256 hashes instead of full strings as keys for the purpose of counting stack traces.

I think you are assuming it has a perfectly uniform distribution.

Yes, that's what I meant by saying sha256 is at least "pretty good" as a hash function. Which is not at all proven, but one of the reasons sha256 is popular is that it looks like it should be close to uniform.

Perfect hash functions guarantee no collisions. Of course they have other limitations, so you have to pick some tradeoffs.

I'm pretty sure, as the parent mentioned, that avoiding collisions is theoretically impossible with |input-domain| > |output-domain| because of the pigeonhole principle. Especially so when the input has no maximum size.

However perfect it is, it can only guarantee no collisions if the size of the strings you're hashing is equal to or smaller than the hash size.

In this house, we obey the pigeonhole principle.

It can do better, if the input is not "every possible string." A perfect hash of the strings representing 32-bit unsigned integers can compress strings up to 10 bytes long into 4 bytes of data.
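For that restricted input set, parsing is itself such a perfect hash; this is a minimal sketch of the parent's example, where distinct valid inputs can never collide because the mapping is invertible:

```java
// Sketch: for decimal strings representing 32-bit unsigned integers,
// parsing is a perfect (injective, even invertible) hash into 4 bytes.
public class DecimalStringHash {
    static int perfectHash(String s) {
        // Injective on its domain: distinct valid inputs never collide.
        return Integer.parseUnsignedInt(s);
    }

    public static void main(String[] args) {
        System.out.println(perfectHash("0"));          // 0
        System.out.println(perfectHash("4294967295")); // -1 (all 32 bits set)
    }
}
```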

Sure, there is a pretty severe limitation for perfect hash functions: you have to know all the possible keys in advance. That's rarely the case.

There's a huge math error in the article, arriving at a completely wrong conclusion in the second case, the long strings. The expected number of collisions in the 2nd case is 1.4, so an outcome of 1 collision is no surprise at all, and thus you cannot praise the quality of the Java hash function this way.

What he should have praised instead is the Java hash table, which is the only one among all major programming languages that is both fast and secure. All the others are mostly slow, and insecure. Java chose a fast hash and a secure collision strategy: counting collisions and switching to a tree under attack. Most others chose a random seed and a slow hash function, which are still DoS-unsafe while being 10x slower. Pure security theatre.
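The collision-flooding attack this defends against can be sketched easily: since "Aa" and "BB" share a hashCode(), every string built from those two blocks collides with every other such string, yet a Java 8+ HashMap survives because heavily colliding buckets are converted to balanced trees (the tree-switching strategy mentioned above).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;

// Sketch: generate 2^blocks strings that all share one hashCode(), the
// classic worst case a hash table must survive.
public class CollisionFlood {
    static List<String> collidingKeys(int blocks) {
        List<String> keys = new ArrayList<>();
        keys.add("");
        for (int i = 0; i < blocks; i++) {
            List<String> next = new ArrayList<>();
            // "Aa" and "BB" hash identically, so every concatenation does too.
            for (String k : keys) { next.add(k + "Aa"); next.add(k + "BB"); }
            keys = next;
        }
        return keys; // 2^blocks strings, one shared hashCode()
    }

    public static void main(String[] args) {
        List<String> keys = collidingKeys(12); // 4096 colliding keys
        HashMap<String, Integer> map = new HashMap<>();
        for (String k : keys) map.put(k, k.length());
        System.out.println(map.size()); // 4096: stored despite total collision
    }
}
```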

Just my $0.02: there is nothing to be gained by nagging about someone's punctuation and grammar. That doesn't win you an argument about programming.

This is the author. I didn't submit the article here, but I did happen to run across this posting from my GA.

Thank you for the feedback. You're right.

It's one thing to point out errors constructively, but another thing entirely to make fun. After all, English isn't everyone's first language, and I'd have a tough time writing an article like this in Spanish, for example!

I've updated the article to remove that comment, although I still (gently) point out the worst of the errors. I've also added a theoretical framework to tighten up the argument a bit.

Thanks again for the feedback. There's no purpose in being nasty when the point can be made another way.

Apart from that one comment, the remainder of the article - the vast majority of it - was a well-presented argument about why the original post was flawed. That comment was obviously just an aside.

“resulted in 1 collision. A “fair” hash function would generate an expected 1.44 collisions over this data. String.hashCode() outperforms a fair hash function significantly”

I doubt that is significant, and you needn't even look up the confidence intervals. Think of it this way: if you ran one more experiment with similar data, then for a perfectly good hash to approximate that 1.44-collision average across the two experiments, it has to have at least one experiment with zero or one collision.
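A rough sketch of the same point, assuming the collision count of a fair hash is approximately Poisson-distributed with mean 1.44 (the expectation quoted above): observing at most 1 collision is then better-than-even odds, so the article's single collision is no evidence of outperformance.

```java
// Sketch: under a Poisson(1.44) model for a fair hash's collision count,
// how likely is an outcome of 0 or 1 collisions?
public class PoissonCheck {
    public static void main(String[] args) {
        double lambda = 1.44;
        double p0 = Math.exp(-lambda);          // P(0 collisions)
        double p1 = lambda * Math.exp(-lambda); // P(1 collision)
        System.out.println(p0 + p1);            // roughly 0.58
    }
}
```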

Also, string hashing has a few requirements that I think are more important than having an optimal probability of collisions:

- it has to work well on typical 16-bit text strings, where most of the time half the bytes are zero and most of the other bytes have only five or six bits that vary (that's why there are so many collisions in two-character strings: they are four bytes long but, at best, differ in only about 10 bits)

- it has to be fast.
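The two-character case is easy to measure directly. As a small experiment (the class name and character range are my own choices), counting distinct hash values of h = 31*h + c over all two-character strings drawn from the printable ASCII range 'A'..'z' shows how little of the 32-bit output space short inputs can reach:

```java
import java.util.HashSet;

// Experiment: how many of the 58*58 two-character strings over 'A'..'z'
// get a distinct value from String.hashCode() (which is 31*c1 + c2 here)?
public class TwoCharCollisions {
    public static void main(String[] args) {
        HashSet<Integer> hashes = new HashSet<>();
        int total = 0;
        for (char a = 'A'; a <= 'z'; a++)
            for (char b = 'A'; b <= 'z'; b++) {
                hashes.add(("" + a + b).hashCode());
                total++;
            }
        // Well over a thousand of the 3364 strings collide.
        System.out.println(total - hashes.size());
    }
}
```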

The original criticism seems perfectly valid (although maybe people are reading too much into it).

If you have a 32-bit hash and less than 32 bits of input, it's reasonable to hope all hashes might be unique. 10x extra collisions on short strings does seem pretty bad. And it's unfortunate that this hash function can't be changed without breaking the language spec.

> If you have a 32-bit hash and less than 32 bits of input, it’s reasonable to hope all hashes might be unique.

No it isn't. In fact, with 16 bits of input, the probability of a collision is not far from 50% according to the birthday problem.
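The arithmetic behind that claim, assuming a uniform random 32-bit hash over all 2^16 possible 16-bit inputs, is a one-liner via the standard birthday approximation:

```java
// Sketch: birthday probability of a collision when hashing all 2^16
// possible 16-bit inputs into a uniform 32-bit hash.
public class BirthdayShortInputs {
    public static void main(String[] args) {
        double n = 65536;              // 2^16 distinct inputs
        double buckets = 4294967296.0; // 2^32 hash values
        // Approximation: P(collision) ~ 1 - exp(-n(n-1) / (2 * buckets))
        double p = 1 - Math.exp(-n * (n - 1) / (2 * buckets));
        System.out.println(p); // about 0.39, i.e. "not far from 50%"
    }
}
```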


That's assuming random distribution of hashes.

I can trivially write a hash function that produces a unique 32-bit hash for inputs of 32 bits or less. Just truncate the input to 32 bits. It's not a very good hash function for longer inputs, of course.
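A minimal sketch of that truncating hash (the name `truncHash` is made up): for inputs of at most 4 bytes it is injective, so no collisions are possible, while for longer inputs everything past the first 4 bytes is simply ignored.

```java
import java.nio.charset.StandardCharsets;

// Sketch: "truncate to 32 bits" as a hash function. Collision-free on
// inputs of at most 4 bytes; useless at distinguishing longer inputs.
public class TruncatingHash {
    static int truncHash(byte[] input) {
        int h = 0;
        for (int i = 0; i < Math.min(4, input.length); i++)
            h = (h << 8) | (input[i] & 0xFF); // pack first 4 bytes into an int
        return h;
    }

    public static void main(String[] args) {
        byte[] a = "ab".getBytes(StandardCharsets.US_ASCII);
        byte[] b = "ac".getBytes(StandardCharsets.US_ASCII);
        System.out.println(truncHash(a) != truncHash(b)); // true: short inputs never collide
        // But longer inputs sharing a 4-byte prefix collide: the tail is discarded.
        System.out.println(truncHash("abcdX".getBytes(StandardCharsets.US_ASCII))
                == truncHash("abcdY".getBytes(StandardCharsets.US_ASCII))); // true
    }
}
```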

True - but I still don't think it's reasonable to assume that a hash function behaves that way.

You could do that and also use a real hash for longer inputs.

Literally truncating the input wouldn’t be great in the likely case that your hash table is using only 10 bits (say) of the hash value, but I bet there are similar approaches that would work.
