
Speed Hashing - csomar
http://www.codinghorror.com/blog/2012/04/speed-hashing.html
======
cperciva
Using "hash" to mean "strongly collision-free function" is a valid
terminological choice; in a wide-readership piece like this it should ideally
be pointed out in order to avoid confusion, but there are many fields where
that definition would be assumed without statement.

Getting confused between _hash functions_ and _password-based key derivation
functions_ , on the other hand, is absolutely inexcusable; that very confusion
is why there are so many people making the mistake of using SHA256(salt .
password) as a key derivation function. People hear that MD5 is broken but
SHA256 isn't, not realizing that MD5's breakage as a hash function does not
impact its security as a PBKDF, and not realizing that MD5 and SHA256, being
_not designed as PBKDFs_ , are both entirely inappropriate for use in that
context.
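
A rough sketch of the distinction in Python (hashlib's PBKDF2 standing in for
a proper password hash; the iteration count is illustrative, and bcrypt or
scrypt would be better still):

    import hashlib, os

    password = b"correct horse battery staple"
    salt = os.urandom(16)

    # The common mistake: a single fast hash over salt + password.
    # Fast for the defender, but just as fast for the attacker's GPUs.
    fast = hashlib.sha256(salt + password).hexdigest()

    # A function actually designed for password-based key derivation:
    # PBKDF2-HMAC-SHA256 with a deliberately large iteration count.
    slow = hashlib.pbkdf2_hmac("sha256", password, salt, 100_000).hex()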

Also, for reference: MD5(12 printable ASCII characters) ~ bcrypt(8 printable
ASCII characters) ~ scrypt(6 printable ASCII characters) in terms of cost-to-
crack on sophisticated (ASIC/FPGA/GPU) hardware.

~~~
ajross
Not to quip, but surely some of the blame for that confusion has to lie on the
cryptography experts who insist on using lung-spasm-inducing hairballs like
"PBDKF" ( _edit: typo there was, I swear, unintentional!_ ) to represent
concepts better explained in English as "expensive to compute".

~~~
tptacek
"Password based key derivation function".

The problem here isn't that cryptographers are inflicting a terrible name on
you, it's that security people have repurposed a function intended for one
purpose (generating crypto keys) for another (storing passwords).

Also: don't blame us, blame the PKCS standards group.

~~~
ajross
Sure sure. What I'm really saying is that the core concepts here are
"collision resistance" and "susceptibility to brute force", and that those are
comparatively easy to grok for a typical developer. But they get scared to
death when they read stuff like (verbatim from cperciva above) "MD5's breakage
as a hash function does not impact its security as a PBKDF, and not realizing
that MD5 and SHA256, being not designed as PBKDFs" which makes it sound more
subtle than it is.

~~~
ajross
To clarify: the clear English way to write that would be something like
"Colliding passwords aren't a problem, because it's cheaper to search the
password space anyway, so you can use MD5 there securely. But MD5 and SHA256
were designed to be cheap to compute, so they're easy to brute force. Use a
password function designed to be expensive instead." Charging off down the
jargon road can only complicate things. And doing so in a post about people
being confused (!) about the subject is frankly hurting, not helping.

------
judofyr
Nitpicks. It seems that he's talking about password hashing:

> "Hashes are designed to be tamper-proof". Wrong. _Cryptographically secure_
> hash functions are.

> "Hashes, when used for security, need to be slow." Wrong. _Password hashes_
> needs this; not SHA1/MD5 etc.

~~~
codinghorror
Well, I'd say a hash that has no need whatsoever to be tamper-proof (no
attackers, ever) and cares only about speed is a _checksum_ -- so maybe a
terminology issue.

I agree there are certainly other uses for hashes, just trying to distinguish
between hashes and checksums.

Re-reading what I wrote, I open with "Hashes are a bit like fingerprints for
data. A given hash uniquely represents a file, or any arbitrary collection of
data" so the context of this article is hash functions that are able to
uniquely identify something in a reliable, trustworthy way -- either checksums
(fast, no need for security) or hashing (slower, more reliable, possibly
"secure" for some definition of secure), but always in the context of "can I
_trust_ this value to tell me that the data is really what I think it is?"

Of course there is a tradeoff with speed; a person's name can be a good enough
identifier (checksum) in some circumstances, but other circumstances may
require something more reliable like DNA or fingerprints (a secure-ish hash),
at the cost of being far slower to collect and validate and way more onerous.

~~~
nadam
I am afraid the terminology you use is not canonical.

A hash (function) is any function that maps a big variable-length data
structure to a smaller (fixed-length) data structure.

A checksum is a special hash (function) whose purpose is detecting accidental
errors during transmission or storage. (This makes them different from hash
functions used, for example, in hash tables. A relevant question, for
instance: what is the minimum number of incorrectly transmitted bits needed to
defeat error detection for a specific checksum?)
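
(As a toy illustration of that last question, using a simple additive checksum
rather than any real protocol's: two byte errors that cancel each other out
already defeat it.)

    # Toy 8-bit additive checksum: catches most accidental single-bit errors,
    # but two changes that cancel out slip through undetected.
    def checksum(data: bytes) -> int:
        return sum(data) % 256

    original  = bytes([0x10, 0x20, 0x30])
    corrupted = bytes([0x11, 0x1F, 0x30])   # +1 in one byte, -1 in another

    assert checksum(original) == checksum(corrupted)   # corruption missed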

A cryptographic hash function is another kind of special hash function which
also has a very good definition on wikipedia.

(Basically these are the definitions on Wikipedia, and I think they are fine.
If the terminology you use were canonical, we would say 'checksum table'
instead of 'hash table', which would also be quite illogical, because the term
'checksum' includes its purpose right in its name: 'checking' something.)

~~~
ralph
I have hash functions that work on fewer bytes than the length of the hash
they produce.

~~~
arethuza
Won't that be fairly common? Java's hashCode and .NET's GetHashCode both
return 4-byte ints, and there are presumably a lot of HashMap and Dictionary
objects out there using short strings as keys.

~~~
ralph
It's very common; the norm even. My point was that "A hash (function) is any
function that maps a big variable-length data structure to a smaller (fixed-
length) data structure." is incorrect; the source size isn't relevant.

~~~
dekz
The source size is extremely relevant: the defining property is that the
function can work with variable-sized input, including large input. The
definition obviously embeds the function's purpose. The reason we have such
cryptographically secure hash functions is precisely the need to digest large
data.

------
JoachimSchipper
Hashes are designed to be fast. _Password_ hashes (MD5crypt / SHA1crypt /
bcrypt / scrypt / PBKDF2) are designed to be slow to make dictionary attacks
harder, but the hashes used in SSL and the like are designed to be as fast as
possible without sacrificing too much security.

~~~
masklinn
As often, he plays fast and loose with terminology and uses "checksums" for
"hash functions" and "hashes" for " _cryptographic_ hash functions".

> the hashes used in SSL and the like are designed to be as fast as possible
> without sacrificing too much security.

See also: map and set hashes. There, you're looking for speed and good
distribution across the buckets, a slower hash function is not something to
look forward to.

~~~
codinghorror
I guess, but what's the value of a "checksum" that fails to detect
deliberately engineered changes to a file? MD5 is clearly just a checksum now,
since it's no longer secure, but it was originally designed to be. And would
you ever want to use a checksum that others could manipulate at will? Curious.

I feel like the line between "this is a checksum" and "this is a secure hash"
is kind of illusory, other than for pure performance reasons.

~~~
JoachimSchipper
E.g. ZFS uses non-cryptographic hashes to detect bitrot. TCP and IPv4 use a
non-cryptographic hash to detect packet corruption. Etc.

~~~
codinghorror
Terminology issue, I guess -- I'd call those checksums, where speed is the
overriding concern and you're not worried about attackers changing the data
underneath you. (And you do actually need to uniquely identify the data in a
reliable way.)

~~~
scott_s
Sure, it's a terminology issue, but as several others have pointed out, you're
the only one using your terminology. So expect people to be confused when you
try to explain these things using your definitions.

~~~
parenthephobia
In the case of TCP/IP, everyone calls the checksum a checksum, and nobody
calls it a hash.

------
mistercow
>A given hash uniquely represents a file, or any arbitrary collection of data.
At least in theory.

It really bothers me when people misuse "in theory" like that. "In theory"
means "we have a model that makes good predictions in some circumstances, but
there are cases where it may fall short". But a model in which hashes uniquely
represent arbitrary collections of data is a model which allows for infinite
compressibility of strings. That is not a theory that merely falls short in
some edge cases; it is a theory which is hilariously and catastrophically
wrong.

~~~
nooop
It does not allow infinite string compression, because you obviously have to
store the original data. For hashes of good length and quality, the theory
stands. I would even use MD5 and say the theory stands for non-critical
everyday uses, given there is no attacker.

~~~
mistercow
> It does not allow infinite string compression, because you obviously have to
> store the original data.

Nope. Say you have a hash function `h` which guarantees a unique, 128-bit
output for any input. Then `h` is a function which compresses any string into
128 bits:

If `y = h(x)` then it is trivial for me to write a program that will
reconstruct `x` from `y`. I will simply iterate `x` through the possible input
strings (which I can do because the set of strings is countable) until I find
one that satisfies `h(x) == y`. Impractical, yes, but allowed by the theory,
and that means the theory is invalid.
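
A sketch of that (hypothetical) decompressor in Python, assuming digests
really were unique:

    import hashlib
    from itertools import count, product

    def invert(target_hex, alphabet=bytes(range(32, 127))):
        # Enumerate byte strings in order of length until one hashes to the target.
        for length in count(0):
            for candidate in product(alphabet, repeat=length):
                if hashlib.md5(bytes(candidate)).hexdigest() == target_hex:
                    return bytes(candidate)

    # invert(hashlib.md5(b"hi").hexdigest()) eventually returns b"hi".
    # Hopelessly slow for long inputs, but if digests were truly unique this
    # would be a universal compressor: 128 bits would reconstruct any input.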

~~~
IsTom
That's false. Nobody ever says that h(x) is unique, only that the probability
of h(x) = y for any given y should be about 1 / (size of the output space).

~~~
mistercow
"A given hash uniquely represents a file, or any arbitrary collection of data.
At least in theory."

------
exDM69
Nitpick: this statement is not exactly true: "Hashes are designed to be
tamper-proof".

This only applies to cryptographic hash functions, like SHA-1 and Skein. There
are non-cryptographic hash functions which are not designed to be secure, like
Dan Bernstein's DJB-family of hashes (DJBX33A and DJBX33X), that are used e.g.
for hash tables with string keys.

~~~
codinghorror
I suppose I was thinking of Java and .NET where you don't generally get the
chance to "pick" a hash function unless you're in the Crypto namespaces. There
are certainly hash-based data structures like HashTables and the like, but the
underlying hash algorithms are not exposed in any way, they're just magic.

~~~
marcusf
Not to pile on with the nitpicking, but for just the cases everybody here is
talking about (e.g. non-crypto hashing), you get to "pick" exactly the hash
function. You're forced, actually, to implement hashCode() in Java if you want
to be able to put the object in a HashMap, HashSet, etc.

~~~
canop_fr
You're right, but that's still a very small subset of possible hashing
functions, because you must produce a 4-byte hash, which is then hashed again
(down to the length of the storage array). Of course, avoiding collisions
isn't a requirement there, just distributing keys as evenly as possible.
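
Roughly what the table does with that hash code, sketched in Python (the
built-in hash() standing in for Java's hashCode()):

    key = "some key"
    num_buckets = 16

    h = hash(key)             # fixed-width hash code for the key
    bucket = h % num_buckets  # reduced again to an index into the storage array
    print(bucket)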

------
lysium
While long and random passwords are a good thing, there are still applications
that make them very difficult to use.

I'm looking at you, Apple AppStore: making me choose a new password after only
the second wrongly entered attempt, disallowing copy & paste on that website
for entering the new password, requiring JavaScript on that website to enforce
the no-copy-&-paste rule, and then not showing me the characters I enter.

That's ridiculous. After the fifth attempt to enter my 20-character password
correctly (which was chosen randomly by my password manager), I gave up and
entered an 8-character password.

So the problem is not only the user, but also ridiculous application
requirements.

~~~
tzs
That is indeed annoying. As a workaround, if you use a browser or browser
extension that lets you modify the DOM on the fly, you can disable the paste
blocking. For instance, in Safari, use "Inspect Element" on the password
field, find the onpaste="..." attribute, double-click the onpaste and change
it to something that isn't a valid event name.

You can then paste in the password field.

It should be reasonably possible to make a bookmarklet that goes through all
form elements on a page and does this automatically to all that have onpaste
handlers, so you don't have to bust open the inspector and dick around every
time you want to log in.

~~~
lysium
That's good to know, I'll try it next time!

------
badcrcerror
> If you could mimic another person's fingerprint or DNA at will, you could do
> some seriously evil stuff. MD5 is clearly compromised, and SHA-1 is not
> looking too great these days.

You cannot "mimic" another person's fingerprint with MD5. Currently you can
only "create" two people with the same fingerprints.

This might sound the same but consider the case where you want to replace
someone's notepad.exe with an evil notepad.exe (and make sure that it has the
same MD5 so they don't notice). Currently, no attack exists to do that.

------
charlieok
I think it's also worth looking at the protocol used to perform the
authentication. This is closely tied to what kind of verifier you can/can't
store.

If you store a salted hashed password, the authenticating party needs to send
you the actual password.

Sure, they could send you the password over an encrypted connection. But that
implies that a security context (and shared secret) has already been
established. There's a chicken/egg problem here. The most desirable approach
is to use a protocol which performs mutual authentication as part of the
initial setup of the security context.

There are protocols that can perform password-based authentication without
actually sending the plaintext password; this is called a zero-knowledge
password proof. In addition, some of these allow a "verifier" to be stored
rather than the password itself.

In this scenario, not only do you not store a secret which would allow
impersonation of a user, but that secret is also never even sent to you.

We need support for this in browsers :)

------
eruquen
Nobody has mentioned pepper yet. Not the "static salt" variant you might find
while googling; that is just more security by obscurity. I'm talking about
adding a random string of fixed length to the (salted) password that is not
saved anywhere.

At login, this requires a bit of brute-forcing on the server to check the
hash, since we have to go through all possible pepper strings. That adds a few
ms (e.g. with a random string of length 4).

Now, on the attacker's side the picture looks drastically different. The
amount of time required to brute-force through the already salted hashes grows
exponentially with the length of the pepper string. If it takes a few days to
crack the whole database without pepper, it might take a few years to do it
with pepper. There is absolutely nothing the attacker can do about it. No
access to any part of your system will help him or her.
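
A minimal sketch of the scheme described above (SHA-256 over
salt + password + pepper is just a stand-in for whatever base hash you use; a
tiny pepper keeps the demo quick):

    import hashlib, os, secrets, string
    from itertools import product

    ALPHABET = string.ascii_lowercase
    PEPPER_LEN = 2   # tiny, just to keep the demo fast

    def hash_password(password: str, salt: bytes) -> str:
        pepper = "".join(secrets.choice(ALPHABET) for _ in range(PEPPER_LEN))
        # The pepper is discarded after hashing; it is never stored anywhere.
        return hashlib.sha256(salt + password.encode() + pepper.encode()).hexdigest()

    def verify(password: str, salt: bytes, stored: str) -> bool:
        # The server brute-forces the pepper space on every login attempt.
        for combo in product(ALPHABET, repeat=PEPPER_LEN):
            pepper = "".join(combo)
            if hashlib.sha256(salt + password.encode() + pepper.encode()).hexdigest() == stored:
                return True
        return False

    salt = os.urandom(16)
    stored = hash_password("hunter2", salt)
    assert verify("hunter2", salt, stored)   # true after at most 26**2 hashes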

~~~
tptacek
That's because it's a silly idea which is inferior to cryptographic adaptive
hashing, as is done by bcrypt, scrypt, and PBKDF2. If you want to provide a
work factor, use a real one.

~~~
eruquen
First of all, you come off as condescending when you say "it's a silly idea"
and say that the work factors you mentioned are "real" ones, thereby implying
that adding more combinations is not a real work factor.

Second, the actual usefulness of any work factor has to be proven first. Do we
know that the iterative approaches offer "real" work factors? What if
subsequent iterations are easier to compute by exploiting the structure of the
input (i.e. password + result of previous iteration). We have to prove that
this is not the case. The same argument holds for peppering, which is also
based on computing hashes with a certain structure.

I'd like to hear your arguments as to why certain work factors are more "real"
than others, especially peppering.

~~~
tptacek
I am absolutely intending condescension towards the idea. Since you're an
anonymous user with almost no comments in your history, I do not concede the
idea that it's even possible to condescend to you: I have no idea who you are.

The usefulness of work factors has been proven. You can start here:
[http://www.bsdcan.org/2009/schedule/attachments/87_scrypt.pd...](http://www.bsdcan.org/2009/schedule/attachments/87_scrypt.pdf)

------
Sami_Lehtinen
There is a reason why there are hashes and cryptographic hashes and CRCs. Some
hashes are fast and some slow. The same hash algorithm isn't the right choice
for every use.

Excellent article about hashes: <http://home.comcast.net/~bretm/hash/>

------
minhajuddin
People should be using stuff like PasswordMakerPro
([https://chrome.google.com/webstore/detail/ocjkdaaapapjpmipmh...](https://chrome.google.com/webstore/detail/ocjkdaaapapjpmipmhiadedofjiokogj))
or something similar to generate big, ugly passwords (using the whole
character set).

However, secure passwords are still a crutch. We should be fixing the problem
of asking for passwords. We already have solutions for this, but they are not
widely used. StartSSL(<http://www.startssl.com/>) gives you a secure
certificate which can be used to login to their site. I would love to see the
browser makers make it easy for users to use client side certificates.

~~~
ricardobeat
Until that lovely site goes offline and neither you nor anyone else knows any
of your passwords. Not for me.

~~~
minhajuddin
With PasswordMaker Pro, the passwords are not really 'stored' anywhere. It
just computes a hash from the domain name, a master password, and some other
predefined inputs, so it's not dependent on any site. It also uses a different
password for each domain (because the domain is an input to the hash). This
way, even if a site you are using is compromised and the passwords leaked, the
effect is local to that site.
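
The idea, sketched in Python (this is not PasswordMaker's actual algorithm;
PBKDF2 and the output encoding here are purely illustrative):

    import base64, hashlib

    def site_password(master: str, domain: str, length: int = 16) -> str:
        # Derive a per-site password from the master password and the domain.
        digest = hashlib.pbkdf2_hmac("sha256", master.encode(), domain.encode(), 100_000)
        return base64.b64encode(digest).decode()[:length]

    print(site_password("my master passphrase", "example.com"))
    print(site_password("my master passphrase", "news.ycombinator.com"))
    # Different domains give unrelated passwords, and nothing is stored anywhere.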

------
mckoss
_In reality the usable space is substantially less; you can start seeing
significant collisions once you've filled half the space._

More problematic, the chance of any collisions is 50% when you have filled
just sqrt(space) - known as the birthday paradox.
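
Using the standard birthday-bound approximation 1 - exp(-n(n-1)/(2N)) for an
N-sized output space:

    import math

    bits = 128                # e.g. an MD5-sized output space
    space = 2 ** bits

    def collision_probability(n):
        # Probability of at least one collision among n random hashes.
        return 1 - math.exp(-n * (n - 1) / (2 * space))

    print(collision_probability(1.1774 * math.sqrt(space)))  # ~0.5 at ~sqrt(space)
    print(collision_probability(space / 2))                  # half-full: ~1.0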

~~~
pjscott
It looks like he just misread this article, which is kind of a disaster itself
-- it recommends storing MD5 hashes of passwords, maybe with a salt if you
want to be _extra secure_.

<http://www.skrenta.com/2007/08/md5_tutorial.html>

Edit: Oh dear, there's a followup post:

[http://www.skrenta.com/2007/08/crypto_vs_the_working_coder.h...](http://www.skrenta.com/2007/08/crypto_vs_the_working_coder.html)

------
Misiek
"salts alone can no longer save you"

I use two salts to hash a password: sha1(SALT . SALT2 . $password); the second
salt is unique for every user and stored in a database. Why is it not secure?

~~~
rmc
In reality you aren't using 2 salts, you are using one unique salt per user;
each user's salt just starts with the same few bytes.

~~~
Misiek
Yes, I put the first salt under the www-root and the second salt in the
database. A hacker who hacks only the database will not know the first salt.

~~~
leftnode
I think you have to assume worst case: if they have access to your database,
they have access to your web root. It might not be the case, but you should
assume that.

------
revelation
Here is my problem with all of this. PBKDF really just means "take this hash
function and apply it 1000 times in succession". That seems like a very crude
way to approach cryptography to me.

bcrypt seems to lack any research verification. It's a modified Blowfish, but
which of us can tell what modifications are safe without making the scheme
ineffective, or gauge its validity?
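
For reference, the naive "apply it many times" construction looks something
like this sketch (PBKDF2 itself is an HMAC-based construction with more
careful mixing, but the goal of a tunable work factor is the same):

    import hashlib

    def stretched(password: bytes, salt: bytes, rounds: int = 100_000) -> str:
        # Naive key stretching: feed the digest back into the hash many times.
        digest = salt + password
        for _ in range(rounds):
            digest = hashlib.sha256(digest + salt + password).digest()
        return digest.hex()

    print(stretched(b"hunter2", b"\x01" * 16))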

~~~
tptacek
PBKDF2 is more complicated than just iterating a hash 1000 times, but for most
applications, iterating SHA1 many thousands of times is just fine.

I'm not sure what kind of research verification you're hoping to get for a
password hash. These aren't the kinds of constructions where the possibility
of spending 48 hours of cluster compute time to generate a collision will
actually hurt you.

Sometimes, "research verification" is just a fig leaf for "too many Rails
developers got uppity about using bcrypt and now I need to signal that I'm not
one of them". Password hashes don't get a lot of research attention because
they're pretty basic, cryptographically speaking.

------
afhof
One very curious thing to me is that for the upcoming SHA-3 standard,
Wikipedia lists the cycle timings for each hash method. I would have thought
that slower hashing speed would be a good thing, but it appears that the
faster the candidate algorithm, the better.

Perhaps a faster hash is easier to implement in hardware / uses less power on
embedded devices?

~~~
DougBTX
Imagine an implementation of _git fsck_, which checks that all the files in
the repository match their SHA-1-based file names. The less time it takes to
hash each file, the faster the overall check, so it is better to use a hash
which is designed to be fast.
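
A sketch of such a check in Python (git actually hashes "blob <size>\0" plus
the contents rather than the bare file, but the point about hash speed
dominating still stands):

    import hashlib

    def sha1_of_file(path):
        h = hashlib.sha1()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    # Verifying a whole repository means rehashing every object, so the raw
    # speed of the hash function dominates the total running time.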

------
tarr11
I'd be interested in hearing about what stack-specific measures website owners
should take (outside of just pass phrases or using bcrypt) to secure their
systems.

I run several sites, and I'm not a security expert. It'd be great to have some
sort of checklist of things I should and shouldn't do.

------
romaniv
What about appending a significantly long random string to all passwords?
Would that slow down re-computation of hashes significantly enough to thwart
brute-force attacks? I don't know enough about GPU architecture to tell for
sure.

~~~
glogla
Not really.

The reason longer passwords take longer to crack is not a property of the GPU
or the computation; it is because there are many more possible passwords as
the length increases, and you have to try a sizable portion of them.

For example, there are (to make the math simple) 10 000 four-digit PINs (0000,
0001, ... 9999). If you append a 4 to every PIN, you have made them longer,
but there are still only 10 000 of them, because the 4 is fixed. However, if
you use five-digit PINs, there are 100 000 possible ones.

A random string would only work if you can store it in a way the attacker
can't get to. However, since the application that uses it needs it to verify
the password, this is not a very realistic assumption.

~~~
romaniv
I was thinking more along the lines of somehow forcing certain _space_
complexity on every hash computation so it can't be as easily parallelized.

However, I see that my exact reasoning in the post above is not solid. Having
longer strings as the final result of password comparison is irrelevant,
because they are not part of the brute-force computation in the first place.

------
gws
Password security quick basics:

Developers -> use bcrypt

Users -> use a service like Lastpass with a long passphrase as master
password; use unique 12+ character (including specials!) passwords through
Lastpass

------
16s
For long, random passwords that can be easily remembered, try SHA1_Pass. It's
free, with source code and runs on Windows, Linux and Macs.

------
DannoHung
Why would you use MD5 as a checksum if different documents with identical
hashes can be produced?

~~~
tptacek
Presumably because it is faster than many of the secure hashes (particularly
SHA256), and still functions adequately as a checksum (bit errors are not
going to produce meaningful collisions).

------
mojuba
And to slightly improve salting:

    $pw_hash = sprintf('%s|%s|%s|%s|%s',
        SALT1, $user_name, SALT2, $pw, SALT3);

In fact, my salts usually include non-printables, including the NULL char,
plus they are very long.

~~~
judofyr
How does this improve salting? The point of salting is to avoid rainbow
tables; as long as the salt is random you should be safe.

------
donpark
Where to store the salt -- that's where the real problem is.

It doesn't really matter how much encryption and hashing we throw at anything
if everything is stored on the same server.

What's the point of salt if salt is in plain sight?

What's the point of asking users to verify the hash of downloadable files when
the hash is stored alongside the file itself?

What's the point of cryptography when the code points straight to everything
necessary to dispel the protection?

The ultimate shame is that we still lack the necessary infrastructure for a
minimum level of security despite all the cloud-related hype, leaving each
server to stand alone, which is no security at all.

~~~
dpark
> _What's the point of salt if salt is in plain sight?_

To prevent rainbow tables and to increase attack complexity. And to prevent
someone who sees a hash from being able to drop it into a search engine and
get the password (works for unsalted common passwords).

The point of salt has never been secrecy.
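
A quick illustration of what a public, per-user salt buys you (plain SHA-256
here is only for demonstration; you would still want a slow password hash on
top of it):

    import hashlib, os

    password = b"123456"

    # Unsalted: every user with this password gets the same hash, so one
    # precomputed table (or a search-engine lookup) breaks them all at once.
    print(hashlib.sha256(password).hexdigest())

    # Salted: the salt can sit right next to the hash in the database, and
    # identical passwords still produce unrelated hashes.
    for _ in range(2):
        salt = os.urandom(16)
        print(salt.hex(), hashlib.sha256(salt + password).hexdigest())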

> _What's the point of asking users to verify hash of downloadable files when
> hash is stored along side the file itself?_

To verify data integrity. It isn't a question of security, or they would be
signed properly.

> _What's the point of cryptography when code points straight to all that's
> necessary to dispell the protection?_

I have no idea what you're talking about.

> _Ultimate shame is that we still lack the necessary infrastructure for
> minimum level of security despite all the cloud-related hype, leaving each
> server to stand-alone which is no security at all._

Still no idea what you're talking about.

~~~
donpark
Your answers make sense in the context of servers for websites with low
security needs, where privacy is of higher concern than security.

They do not make sense for servers with high security needs.

~~~
dpark
I'm sorry, but you are simply incorrect. Salt is not a piece of secret data.
Period. That was never the intention of salt, and it still is not. Secrecy of
the salt does not improve its usefulness.

As for the download hash, this has nothing to do with security at all, as I
already said.

~~~
donpark
Re salt, please see project-rainbowcrack.com as an example of CUDA-driven hash
cracking.

Re download hash, I think we disagree on what data integrity means. By data
integrity, I mean trustworthiness, which implies security: the file content
hasn't been modified in storage or in transit. I'm guessing you are using
those words to mean protection against transport errors or typos (credit card
entry), for which checksums are commonly used.

Given that HTTP is still the most common transport for file downloads and is a
reliable transport, file hashing to detect transport errors is, I think,
meaningless. I know some browsers have failed to report errors, but comparing
lengths is typically sufficient to catch that.

That leaves file hashing only as protection against tampering in storage or in
transit, which is why I am saying it's security-related. Keyed hashing can
mitigate this but, unfortunately, is not commonly used as it requires PKI.

~~~
dpark
How does pointing to rainbow tables indicate that salt should be secret?
That's one of the things that salt prevents, and it does it _without being
secret_. A 32-char random salt could be publicly published for each of your
users and it would still do its job. Secrecy doesn't matter.

Data integrity is not trustworthiness. Data integrity means I got what you
said I should get, not that you are trustworthy (and therefore I can trust
what I got from you). If you want secure assurance that you got what I really
sent, then you need secure assurance that I am who I say I am, and no hash
will get that. You need a signed file (or hash). A hash by itself is indeed
nothing but a robust a checksum in this case.

There's no point in arguing about the md5 for downloads. You're assuming that
it's for security, and it's really not. It's primarily for people who are
paranoid about corruption. It is also useful for validating when downloading
from a mirror, though.

~~~
donpark
Let's just stop here. Given that we can't even agree on the meaning of words,
I think continuing this discussion will only produce further disagreements or
misunderstandings.

I do, however, appreciate your cordial and elaborate responses for which I
thank you.

