Glad to see it getting more attention. It seems to pop up every couple years. (This might be the first generalization of the technique to arbitrary web requests, which, yeah, will probably get a lot more attention).
An interesting technique - construct requests whose parameter names all hash to the same value, so the hashtable degrades to a linked list and inserting n entries costs O(n^2). Make the request big enough, and you tank the server. It may be useful to note their link to a previous article, which implies this has been known since 2003 in Perl.[1] Any news of actual attacks using this technique?
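For a sense of the blowup, here's a toy Python sketch - not the real attack (that needs strings which actually collide under the language's string hash), just objects forced to share one hash value so every insert walks the same chain:

    import time

    class Collider:
        """Stand-in for an attacker-chosen key: every instance hashes identically."""
        def __init__(self, n):
            self.n = n
        def __hash__(self):
            return 42              # constant hash -> everything probes the same slots
        def __eq__(self, other):
            return self.n == other.n

    for count in (1000, 2000, 4000):
        start = time.time()
        d = {Collider(i): True for i in range(count)}
        print(count, "keys:", round(time.time() - start, 2), "s")
    # Doubling the key count roughly quadruples the time - O(n^2) overall.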
The great part is that, as long as you convert request params into hashes (a reasonable step, I'd imagine the vast majority of web frameworks do this), you're screwed if your language is vulnerable to this. Of course, you can tweak your framework to not use native hashes (or detect such attacks and handle them separately), but that's not generally an option as it's a pretty involved operation.
I wonder: would a hashtable that uses cuckoo hashes be vulnerable? Technically, sure, but would it make the attack infeasible due to the difficulty in creating double-collisions? Or is that not as hard as I'm thinking it could be?
To fix it at the framework level, assuming the language gives me access to the hash values, I would probably keep a second table mapping hashes to keys. Then I can tell the difference between collisions and dupes. Just kill the request once there are more than a few collisions.
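Something like this, as a rough Python sketch (the threshold and names are made up): shadow the real params dict with a second table from hash value to the distinct keys seen for it, and kill the request once any one hash value accumulates too many different keys.

    from collections import defaultdict

    MAX_COLLISIONS = 10            # arbitrary threshold, tune per framework

    class SuspiciousRequest(Exception):
        pass

    def parse_params(pairs):
        """pairs: decoded (key, value) tuples from the query string or POST body."""
        params = {}
        keys_by_hash = defaultdict(set)    # second table: hash value -> distinct keys
        for key, value in pairs:
            bucket = keys_by_hash[hash(key)]
            bucket.add(key)
            if len(bucket) > MAX_COLLISIONS:
                # Many *different* keys sharing one hash value is almost certainly
                # an attack, so bail out instead of finishing the inserts.
                raise SuspiciousRequest("too many hash collisions in parameters")
            params[key] = value            # plain dupes just overwrite, as usual
        return params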
Dupes aren't the problem; duplicate hashes of different data are. Dupes should just overwrite the previous value, which matches normal hashtable behavior and works in O(1). Some frameworks may collect repeated params in an array, but that's likely to be a real array, or a double-ended queue - something easier to append to than the chains a hashtable uses. The hashtable is stuck with a plain list: to append, it has to check equality against every entry already in the chain, so it traverses the whole thing on every insert, and that's where the O(n^2) time comes from.
That's exactly what hashtables do. When they find a collision, they add the new entry to a list of keys that hash to the same value. They do this so they can return valid data when you request a key - otherwise you'd find them rarely and seemingly randomly forgetting data or overwriting it with unrelated values. It's a useful behavior, and it's the very source of the problem.
One way to prevent this attack is to use a hashtable that has a limit on the number of collisions per table, or per key, and throws an exception when it's exceeded. The problem is that doing so by default would break valid uses in very surprising ways, all to guard against an attack that was unlikely and probably unknown when the datatype was designed - so nobody does, because that sort of thing shouldn't be handled by basic datatypes.
Isn't it trivially easy to defend against by adding a secret character/number to all of your keys? E.g. "dog" becomes "dogu" and "julia" becomes "juliau".
In fact, why not do this at the language level? python could pick a random character at startup, and use it for the duration of the process.
From the relevant Python-dev thread, it's probably a bad idea for more than one reason to make this the default behavior for all dicts; a special dict subclass to be used when necessary makes more sense: http://mail.python.org/pipermail/python-dev/2011-December/11...
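For illustration, an opt-in subclass could look roughly like this (my own sketch, not what that thread proposes; the wrapper type and names are made up, and it only covers the basic operations):

    import os

    _SALT = os.urandom(8).hex()            # picked once at process start

    class _SaltedKey(str):
        """Compares equal to the plain string, but hashes with the salt mixed in."""
        def __hash__(self):
            return hash(_SALT + self)

    class SaltedDict(dict):
        """Opt-in dict that mixes a per-process secret into string keys before hashing."""
        @staticmethod
        def _wrap(key):
            return _SaltedKey(key) if isinstance(key, str) else key
        def __setitem__(self, key, value):
            super().__setitem__(self._wrap(key), value)
        def __getitem__(self, key):
            return super().__getitem__(self._wrap(key))
        def __contains__(self, key):
            return super().__contains__(self._wrap(key))

    d = SaltedDict()
    d["dog"] = 1
    print(d["dog"], "dog" in d)            # 1 True

Collisions precomputed against the plain string hash won't, in general, survive the unknown prefix, which is the whole point.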
A single pseudorandom byte chosen at container initialisation would work here, because url hashes are ephemeral. Persistent structures on the other hand are already DOS-prone, but need attacks more tailored to the application.
Even with a random per-initialization byte, you could still brute force it. Pick a random byte on your side, generate colliding keys with it, and send the request. If it doesn't work, just try again until it does. You'd expect to only have to try this ~256 times before succeeding, which still seems pretty practical.
Of course, bump this up to 2-4 bytes and it no longer is.
I believe this is similar to what Ruby 1.9.x does, and why it's not vulnerable - it uses a varying initialization value for the hashes, so you can't collide every time. When they initialize it (per process, per initialization?) and what exactly it does, I don't know, but yeah. Greater overhead, less predictability, greater complexity, though all of them are essentially minor. It's not the sort of thing that tends to make it into a library until there's demonstrated need.
This is precisely why I agree with C++'s choice of a balanced binary tree as the default map implementation for its standard library. There's no denying that in the vast majority of cases hashing is faster, but that's an optimization, unnecessary except in a few specific instances, and balanced binary trees provide many other advantages in compensation for their slight performance penalty: predictable performance (via non-amortized worst-case bounds), minimal reallocation, insertion/deletion without invalidating iterators, ordered iteration, range queries, and so on.
I'm late to the party, but I found the details interesting so thought I'd do the best technical summary I could:
An attacker precomputes a lot of different values that hash to the same thing, then posts them in a web request, and the server eats CPU processing the form variables: with every hash identical, inserting N keys into the hashtable is an O(N^2) operation. There are some optimizations that make it easier than brute force to calculate collisions against the hash algorithms in use by almost everybody today, see [1].
The interesting part is this is a precomputed attack which relies on hash algorithms being well known and invariant.
Ruby has fixed this by initializing their hash function with a random seed each time an app runs, so an attacker can't do the pre-compute step - a collision calculated on his system won't necessarily be a collision on the server. [2]
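The mechanics, as a Python-flavoured sketch (this is not Ruby's actual hash function - just a toy FNV-1a-style hash showing why a secret seed kills the precompute step):

    import os

    SEED = int.from_bytes(os.urandom(4), "big")    # chosen once per process

    def seeded_hash(s, seed=None):
        """Toy FNV-1a variant; the seed perturbs the initial state."""
        h = (2166136261 ^ (SEED if seed is None else seed)) & 0xFFFFFFFF
        for byte in s.encode():
            h = ((h ^ byte) * 16777619) & 0xFFFFFFFF
        return h

    # The same string hashes differently under different seeds, so a batch of keys
    # engineered to collide against one seed value will, with overwhelming
    # probability, scatter under the server's secret SEED.
    print(seeded_hash("julia", seed=1), seeded_hash("julia", seed=2))   # almost surely differ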
Other workarounds include limiting the size of post requests so the dictionary can't get too big, limiting the CPU time available to a single web request, or using a balanced tree instead of a hash table.
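For the request-size workaround specifically, something like this WSGI shim could buy time while an interpreter fix isn't deployable (the 1 MB cap is arbitrary, and it obviously doesn't help with query strings or other inputs that end up in hashes):

    MAX_BODY = 1024 * 1024         # arbitrary cap; the attack needs lots of params

    class LimitBodySize:
        """WSGI middleware that rejects oversized request bodies up front."""
        def __init__(self, app, max_body=MAX_BODY):
            self.app, self.max_body = app, max_body

        def __call__(self, environ, start_response):
            try:
                length = int(environ.get("CONTENT_LENGTH") or 0)
            except ValueError:
                length = 0
            if length > self.max_body:
                start_response("413 Request Entity Too Large",
                               [("Content-Type", "text/plain")])
                return [b"request body too large\n"]
            return self.app(environ, start_response)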
Lua isn't mentioned there, so I'd like to ask: since Lua interns every string, isn't it vulnerable even if a framework doesn't put request parameters into the table?
Lua's string hash function looks roughly like this:
hash = length of string
for each char c in string
hash = hash ^ ((hash << 5) + (hash >> 2) + (unsigned char) c)
It folds up to 32 chars into the hash. If the string is longer, it skips some characters.
This is a well known hash function and it's very different from the one being used by all the other vulnerable languages, but I don't know how hard it is to find collisions for this one.
It looks like it would be trivial to find collisions; all the operations used (xor, bit shifts, and addition) are trivially reversible. Work backwards from a chosen final hash, and wherever reversing a step allows more than one preimage, pick one with a random number generator - do that enough and you've got yourself a collision generator.
If it starts skipping characters in longer strings then I'd say it's trivial to find collisions by simply constructing strings that are identical in the bytes that are examined and differ in the bytes that aren't.
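To make that concrete, here's a Python sketch against a reimplementation of the hash as described above (modelled on Lua 5.1's step/skip loop - the real interpreter may differ in detail): two strings of the same length that agree on every sampled position hash identically, no matter what the skipped bytes contain.

    def lua_like_hash(s):
        """Roughly the function described above: seeded with the length, and for
        long strings only every `step`-th character (from the end) is mixed in."""
        h = len(s)
        step = (len(s) >> 5) + 1           # strings over 32 chars skip positions
        i = len(s)
        while i >= step:
            c = ord(s[i - 1])
            h = (h ^ (((h << 5) + (h >> 2) + c) & 0xFFFFFFFF)) & 0xFFFFFFFF
            i -= step
        return h

    # Build two 128-char strings that differ only at positions the loop never reads.
    n = 128
    step = (n >> 5) + 1                    # step == 5, so most positions are skipped
    sampled = {i - 1 for i in range(n, step - 1, -step)}
    a = ["x"] * n
    b = ["x"] * n
    for i in range(n):
        if i not in sampled:
            b[i] = "y"                     # touch only the skipped positions
    a, b = "".join(a), "".join(b)
    print(a != b, lua_like_hash(a) == lua_like_hash(b))    # True True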
C#/.NET is also not mentioned. Clearly I'm out of touch, I hadn't realized that Rubinius and JRuby were more popular than C# and VB.NET. Anyone have any idea if/how .NET might be impacted by this?
Sorry, I was skipping through it at that point (2/3 of the way through?), so I don't know what their conclusion was. I think they said it was partially different but still likely attackable, but I didn't catch it all.
So I'm digging around in the Rack source right now trying to see what's up, and as far as I can tell it just relies on the native Ruby Hash class.
Does anybody closer to the matter know if Rack's actually vulnerable in 1.9?
This may help in some cases where an interpreter fix is not available or installable, but it doesn't fix the hashing problem in general, especially outside of POST params.
http://www.cs.rice.edu/~scrosby/hash/CrosbyWallach_UsenixSec...
... but this was also one of Tim Newsham's (many) contributions to our joint research project on IDS in '97:
http://insecure.org/stf/secnet_ids/secnet_ids.html