Hacker News new | past | comments | ask | show | jobs | submit login
Abnormal String Hashing (pzemtsov.github.io)
52 points by r4um 13 days ago | hide | past | web | favorite | 15 comments

If you're curious on how to avoid this in your applications, take a look at universal hashing: https://en.wikipedia.org/wiki/Universal_hashing

How does that prevent the result being 0? Seems like an implementation detail easily fixed as the author explains.

Well, there's two issues here. One is that zero is both a valid hash and a sentinel, which is a Java implementation problem. The other issue is that an attacker can pick a bunch of things that hash to zero and gum up your application's performance, which is what universal hashing solves.

Do all of those lookup tables for powers of 31 work once integer overflow happens?

What lookup tables for powers of 31 do you mean?

The HashMap implementation internally uses lookup tables (arrays) with sizes that are powers of 2, which might be what you mean? The incoming hash code can happily use all 32 bits of the int type, and its calculation is allowed to overflow and wrap. When indexing into the hash bucket array, HashMap masks off the extra bits that would lead to an index out of bounds exception otherwise.

Look at the github repository.

He's practically turned the hash code calculations inside out to do memoization, and one of the things he's doing is precalculating the hashes and merging them.

He's making the observation that ((n * 31 + x) * 31 + y) * 31 + z = 31^3 * n + 31^2 * x + 31 * y + z

His code calculates (31^3 * n) + (31^2 * x + 31 * y + z ), memoizing the first and second terms and combining them for each pairing of n, x, y, and z.

That's algebra, and of course it holds for all real numbers... if you have infinite precision arithmetic. We do not. We have 32 bit integers, and overflow behaves oddly in some cases. My question is does the overflow behave uniformly for all combinations of those two terms. I thought the answer was 'no', so I'm wondering if he's missing some matches.

Thanks, I missed the point that this was related to the GitHub repository.

Integer addition and multiplication are associative and commutative, even in the presence of two's complement wraparound. So this transformation is valid. I don't have a source to point you to, though. I'd be interested in one, if anyone has one. (Or just a proof, if it fits here.)

This is all actually pretty safe.

Firstly, Java, unlike C, makes full commitments to the bit-to-bit results of all arithmetic calculations. C only does that for the unsigned calculations, giving a compiler some freedom for the signed case. Java postulates that everything is calculated in binary 2's complememt code. No other arithmetic engines are allowed.

Secondly, within this engine, everything is done modulo 2^32, even despite the fact that Java ints are signed. At least, this is so as long as you avoid division. Multiplication of the unsigned numbers produces bit-to-bit the same result as that of unsigned ones. For instance, let's multiply -1 by -1 (FFFFFFFF by FFFFFFFF). It, obviously, produces 1. Now, let's go unsigned: FFFFFFFF times FFFFFFFF is FFFFFFFE00000001, or, in the last 32 bits, 1.

So, if in Java you want to perform any computations modulo 2^32, go for it. Just avoid division.

He really didn't understand Java's take on this. Being zero-insensitive obviously is totally insecure. Java knew that. But Java decided to fight those kind of attacks better than most others. Java has a still trivial insecure hash function, which it decided to keep, because of an API blunder. But they convert the collisions from a linked list to a tree on too many collisions which indicate an active attack. Those attacks are rare, the common case is still fast.

Zero-insensitivity would have been fixable trivially, perl fixed that with 5.18, but they couldn't, so they came up with a proper and much better fix. Unlike perl and everyone else.

This isn’t about hash collisions, it’s about unnecessarily recalculating the hash value

I'm looking at all of the lookup tables in that code and wondering how much slower it would be to do a depth-first search and calculate as you went.

And then with a trie thrown in.

This is fun, but with the test code running 10 million iterations to generate "slower" numbers, is this of practical interest?

It's a deterministic attack vector, so significant offline computation prep work is not unusual.



A DAV and a SOV huh? Can you share a practical implication?

Arbitrary length string of null characters also produces zero hash, e.g.: System.out.println("\0\0\0\0\0".hashCode());

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact