
Ask YC: Suggestions for a hash algorithm? - brianr
For a project I'm working on right now (an image server that dynamically generates images based on a URL and caches them on disk), I'm planning to use 20 bits' worth of hash as both the directory path and a simple checksum. The string being hashed will be the filename concatenated with a secret key, which will hopefully prevent DOS attacks.

I don't really know anything about hashing... my ideas so far are to use the last 20 bits of MD5, or maybe the first 20 bits of Adler32. Even distribution is more important than speed. Can anyone offer any advice here?
======
cperciva
The ideal solution is to take 20 bits (low bits, high bits, middle bits -- it
doesn't matter) from HMAC-SHA256(key, URL). An HMAC is considered "broken" if
an attacker who doesn't know the key can do significantly better than random
chance at guessing what the HMAC of a different URL would be; so until someone
breaks HMAC-SHA256 and as long as your secret key remains secret, no attacker
will be able to create a denial of service via hash collisions.

Weaker options may or may not be secure; speaking as someone who is
professionally paranoid, I suggest not trying to invent your own mechanisms.
If you don't have an implementation of HMAC-SHA256 readily available,
HMAC-SHA1 or HMAC-MD5 are _probably_ adequate, depending on how paranoid you
are. (Or if C code works for you, let me know and I'll point you towards my
BSD-licensed HMAC-SHA256 code.)
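
A minimal sketch of this in Python rather than C (the key and URL below are placeholders, not anything from the actual application):

```python
import hashlib
import hmac

def url_tag(key: bytes, url: str) -> str:
    """Take 20 bits (here: the first 5 hex digits) of HMAC-SHA256(key, url)."""
    digest = hmac.new(key, url.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:5]  # 5 hex digits = 20 bits; which 20 bits you take doesn't matter

# hypothetical key; in practice generate 256 random bits and keep them secret
key = b"\x00" * 32
tag = url_tag(key, "/images/chart.png")  # hypothetical URL
```

Without the key, an attacker can do no better than guessing among 2^20 possible tags per URL.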

~~~
brianr
OK, this makes sense. How long should the secret key be?

~~~
cperciva
Depends how paranoid you are, but 256 bits is ideal.
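
(For example, in Python a fresh 256-bit key is one line; `secrets` draws from the OS CSPRNG:)

```python
import secrets

# 32 bytes = 256 bits of randomness from the operating system's CSPRNG
key = secrets.token_bytes(32)
```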

~~~
brianr
OK. Thanks for your help!

------
tlrobinson
I'm a bit unclear on how you're preventing DOS attacks. Does the client have
to request a "signed" URL from another part of the application first?

If the user has to login to your application, couldn't you just limit it based
on that? Or by IP address? (of course that won't help against DDOS, but that's
another problem altogether...)

As far as hashing algorithms go, if you're just trying to keep an attacker
from pounding your server with a DOS attack, any of SHA-1, SHA-2, or MD5
should be fine: even though there may be known weaknesses, in practice it
still takes a long time to find collisions.

~~~
brianr
For performance reasons, we wanted to serve everything directly from the
filesystem instead of dynamically via a script. If a requested image is not
found, the 404 handler generates it, saves it to disk, and sends it back.

The legitimate URLs are for 'static' images (static in the sense that once
they have been generated, they never need to be generated again) stored on
disk. For example, an image URL might look like:

    http://example.com/1e/2f/c/(image generation parameters).png

The idea is that 1e2fc (the hash) serves as both the path on disk (so that
files are roughly evenly distributed among the 2^20 directories) and as a
checksum to prevent DOS.
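
A rough Python sketch of that layout, combined with the HMAC tag suggested upthread (the key and parameter strings here are hypothetical):

```python
import hashlib
import hmac

KEY = b"\x00" * 32  # placeholder; use a real 256-bit secret

def tag_for(params: str) -> str:
    """5 hex digits (20 bits) of HMAC-SHA256(KEY, params)."""
    return hmac.new(KEY, params.encode("utf-8"), hashlib.sha256).hexdigest()[:5]

def cache_path(params: str) -> str:
    """'1e2fc' -> '1e/2f/c/<params>.png': spreads files over 2^20 directories,
    and the tag doubles as a checksum on the URL."""
    t = tag_for(params)
    return f"{t[:2]}/{t[2:4]}/{t[4]}/{params}.png"

def is_valid_request(path: str) -> bool:
    """The 404 handler recomputes the tag before doing any expensive generation."""
    d1, d2, d3, filename = path.split("/")
    params = filename[:-len(".png")]
    return hmac.compare_digest(d1 + d2 + d3, tag_for(params))
```

A request whose directory prefix doesn't match the recomputed tag is rejected before any image is generated.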

~~~
calambrac
How does the user generate a legitimate new image? You aren't handing out the
secret key to everyone, so I assume that's being handled in an application
component?

~~~
brianr
Correct, the application code generates correct image URLs which are then
embedded in the HTML source.

~~~
calambrac
Forgive me for asking the obvious, but what is the DOS defense on the
application? Just a captcha?

------
marketer
What kind of DOS patterns are you trying to prevent? People requesting the
same image repeatedly, or people requesting random images? Generally, DOS
attacks are prevented by making the client do some non-trivial amount of work,
like answering a computational challenge or filling out a captcha.

~~~
brianr
People requesting random images. A captcha won't work because the images are
in <img> tags on every page of the application... I don't want to make users
type in a captcha before each pageview.

~~~
mrtron
Sorry to nitpick, but that really isn't a DOS.

That is more of a crawl that you are trying to prevent; it could be quite
taxing on your system, but it is not intended to be an attack.

So, that being said, your approach should be roughly good enough to prevent
that.

~~~
tlrobinson
If an attacker requesting hundreds of these images can bring the system to its
knees, it's a denial-of-service attack.

~~~
mrtron
A 'hacker' could request the same image hundreds of times right now, even with
this layer of obscurity!

They are two separate issues.

~~~
tlrobinson
No. The expensive part is dynamically generating the image. Once the image is
generated it is cached and served up statically like any other plain html or
image file.

------
cduan
If possible, why not just use a totally random string rather than a hash? That
way it's truly unbreakable.
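
One way to read this suggestion (my sketch, not necessarily what cduan has in mind): mint a random token per legitimate image and keep a server-side table from tokens to generation parameters; the tradeoff is that the application now has to store that table, whereas the keyed hash is stateless.

```python
import secrets

# server-side table: the price of random tokens is state the app must keep
token_to_params = {}

def register_image(params):
    """Mint an unguessable token for a legitimate image and remember its parameters."""
    token = secrets.token_hex(16)  # 128 random bits; nothing to brute-force or forge
    token_to_params[token] = params
    return token

def lookup(token):
    return token_to_params.get(token)  # unknown token -> None, skip generation
```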

------
nreece
Why not use CRC32, or even better MD5 of the file, as the base hash, and then
XOR it with your secret key?

~~~
brianr
XOR with the secret key sounds like a good idea... hashing the whole file
won't work, though, since I want to check the hash before generating the file;
I'm hashing the URL to prevent an attacker from asking the server to generate
billions of images. Think it will still work well if the input is only 20-30
characters?

------
kirubakaran
So you are using Google Charts, huh?

