Yahoo discloses hack of 1B accounts (yahoo.tumblr.com)
1046 points by QUFB on Dec 14, 2016 | 569 comments

Fittingly, attempting to change my password to a 32-character random string generated by 1Password returns an error that the password "cannot contain my email or username", regardless of the contents of that random string (I tried several).

It does, however, _happily_ accept `passwordpassword` and cheerily move along to confirming that my recovery email account from 2003 is still valid.

Gonna guess that's a bad message for a password length violation or something else.

Not that it's much better. Is it so hard to allow 50 character passwords?

I'm guessing it detected an @ symbol?

has anyone tried password@password ?

if the password is stored properly, (i.e. bcrypt), the number of characters shouldn't matter at all, be it 50 or 5000.

It sort of does matter for bcrypt, surprisingly: http://security.stackexchange.com/questions/39849/does-bcryp...

In the interests of hewing closest to cryptographic reality, I've decided not to allow a password longer than the algorithm can usefully use.

I think it's best to allow longer passwords for those who use long phrases. It's easier to remember the full phrase than a truncated version. You could show a warning that the extra chars beyond 50-55 will be ignored.

Or you could SHA256 the original password and feed the hash to bcrypt. Remember to use the 64-byte hexadecimal hash, not the 32-byte binary because bcrypt chokes on null bytes.
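A minimal sketch of that pre-hash step. The actual bcrypt call is shown only as a comment, since `bcrypt` is a third-party package; the point is that a hex digest is always 64 printable ASCII bytes with no NULs:

```python
import hashlib

def prehash(password: str) -> bytes:
    # SHA-256 hex digest: always 64 ASCII bytes, never contains a NUL byte,
    # so it fits safely inside bcrypt's input limit regardless of input length
    return hashlib.sha256(password.encode("utf-8")).hexdigest().encode("ascii")

digest = prehash("x" * 1000)   # an arbitrarily long passphrase
assert len(digest) == 64
assert b"\x00" not in digest
# then: bcrypt.hashpw(digest, bcrypt.gensalt())   # third-party 'bcrypt' package
```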

Everyone's been saying "just use bcrypt", but bcrypt has too many gotchas to be the default choice. We really need to work on getting scrypt and argon2 into the most popular programming languages and frameworks a.s.a.p.

> Everyone's been saying "just use bcrypt", but bcrypt has too many gotchas to be the default choice

This has got to be the underlying problem of modern security. By the time a best practice is well known, it's no longer best practice.

I think that's a good observation. The implication seems to be that we're not iterating fast enough, or not sufficiently fast in implementing changes/improvements.

On the flipside, isn't there a risk of moving too quickly? There's a certain culture of caution because there's something to be said for "if it ain't broke, don't fix it." And even if something is broke, how certain are we that the cool new encryption algorithm is better or safer?

Like nutrition!

You would probably want to use PBKDF2 as a key-stretching function rather than just naive SHA256. Otherwise you're clipping your bcrypt input from "56 arbitrary bytes" down to "56 hexadecimal characters".

I haven't looked deeply at this, but using "key stretching" that clips your output characters to such a small space smells very suspect to me.

Remember: there are only 32 bytes of actual output there, regardless of whether you represent it as hex or binary. And since bcrypt can't take more than 56 bytes of input, you are clipping that down to the equivalent of 28 bytes.
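Each hex character only carries 4 bits, so the clipping arithmetic can be checked directly:

```python
import math

bcrypt_input_limit = 56              # bytes of input bcrypt actually uses
bits_per_hex_char = math.log2(16)    # 4.0: each hex char encodes one of 16 values
clipped_bits = bcrypt_input_limit * bits_per_hex_char
assert clipped_bits == 224           # vs. 56 * 8 = 448 bits for raw arbitrary bytes
assert clipped_bits / 8 == 28        # the equivalent of 28 arbitrary bytes
```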

Is "just use scrypt" an acceptable answer then? I'm not a security expert and I don't know the advantages of one over the other.

Yes, scrypt is a perfectly fine password hash.

If you are currently using something else (say salted md5 or even just plain md5), you can migrate your passwords to scrypt(current_hash()) without having to change everyone's password and/or wait for everyone to log in.

See also this comment thread: https://news.ycombinator.com/item?id=12549110

Don't do that. You've essentially just turned the old hashes into plain-text passwords, and how sure are you that those hashes don't exist in backups anywhere?

No, not exactly. An adversary who has the old hash, but not the plaintext that it represents cannot login because scrypt(H(H(value))) != scrypt(H(value)). This is not considering the offline crackability of a compromised hash. But there are legitimate situations where upgrading the password backing to a modern slow hash is preferable to continuing to use the old hash or worse storing the old hash as a field for a long time so that when a breach happens both the new and old hashes are available.

There are user experience battles when talking about forcing a million users to change their passwords in a real system. Hashing the hash may be vastly preferable to management nixing the security upgrade. A password-updating scheme that rolls the hash as users log in, and eventually locks the accounts of users who haven't logged in for an extended period, can accomplish rolling the hashes without having to tell users to change their passwords.

Not if you mark the converted versions and try scrypt(oldhash()) on users authenticating with them.
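The "mark the converted versions" scheme can be sketched like this. The record format and scrypt parameters are hypothetical; the stdlib `hashlib.scrypt` stands in for whatever slow hash you use:

```python
import hashlib, hmac, os

def slow_hash(data: bytes, salt: bytes) -> bytes:
    # hypothetical scrypt parameters; tune n/r/p for your hardware
    return hashlib.scrypt(data, salt=salt, n=2**14, r=8, p=1, maxmem=2**25)

def legacy_md5(password: str) -> bytes:
    return hashlib.md5(password.encode()).hexdigest().encode()

# Upgrade a stored MD5 hash in place, without waiting for the user to log in:
salt = os.urandom(16)
record = ("scrypt-over-md5", salt, slow_hash(legacy_md5("hunter2"), salt))

def verify(password: str, record) -> bool:
    scheme, salt, digest = record
    # marked records get the legacy hash applied first, then the slow hash
    inner = legacy_md5(password) if scheme == "scrypt-over-md5" else password.encode()
    return hmac.compare_digest(slow_hash(inner, salt), digest)

assert verify("hunter2", record)
assert not verify("wrong", record)
```

On a successful login the server sees the plaintext, so it can re-store the record as plain scrypt and drop the legacy layer entirely.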

Woah! Very good point!

scrypt is okay if you use it correctly. It's too easy to use it incorrectly, though, because scrypt is a low-level algorithm that wasn't specifically designed for password storage. [1]


In order to be able to tell people to "just use scrypt", we would need to have a sort of standard wrapper that uses the correct parameters by default and produces identical results in every common programming language.

> You could show a warning that the extra chars beyond 50-55 will be ignored.

Or you could use a better KDF, e.g. scrypt (even PBKDF2 is better on this metric). Artificial password-length restrictions are symptomatic of poor design.

This is surprising, do you know how Argon2 behaves compared to this?

Argon2 does the right thing. No silly upper bounds.

That depends on how silly you consider 2^32 - 1 bytes, if I recall correctly.

That's just a bug. Truncation invalidates the 'stored properly' part of the statement.

Could you expand on that? I did not think bcrypt was responsible for storing the resultant hash. The limit appears to be in calculating the hash.

The original phrasing was "stored properly, (i.e. bcrypt)". That's including the hashing as part of the 'storing'. Bcrypt has a size limit, but a size limit is not the same thing as truncating. It's broken code on the front end that truncates instead of doing something like sha512.

Bcrypt spits out a string that the caller must store somewhere. I presume the parent post means that Bcrypt "stores" in its output string a value that, for all practical purposes, varies reliably with the same salt but different plaintext.

If the password is stored properly, (i.e. bcrypt) then there does need to be some length limit or it becomes too easy to DoS a service by sending it hundreds of megabytes of password to bcrypt. There's no reason for that length limit to be less than 100 characters though.

You are going to be limited by the max http request size way before that.

To upload hundreds of megabytes (or even more than a few) you need a multipart message, and a password form won't accept multipart HTTP requests.

On the back-end there usually is a naive POST handler which happily accepts anything it can parse, unless a mature framework with sane defaults is used.

Yep, people who've run marginally popular sites have dealt with this before. Give someone a text box and watch them try to stuff 4GB of content in it. There has to be a cutoff somewhere, but as you note, it should be well outside of the realm of reasonable password lengths (hundreds of characters).

GitHub is the only website I can think of off the top of my head that doesn't limit to an arbitrarily small number (aka <100). Can you name any other "major" websites that allow 100-character passwords?

Anything built with the popular rails gem devise allows 128 by default [1]


Hash the password locally (you are serving JavaScript over SSL right?) and only send the SHA256.

That would lock out anyone who chooses not to execute your JavaScript.

Requiring me to trust your code in order for you to decide whether or not to trust me is asking too much.

How would you know that the hash is of a password of sufficient entropy?

Never trust the client.

This isn't about trusting the client: it's about your endpoint being able to only accept a SHA256 hash sum from the client (thus: length limited) while allowing the user to input arbitrarily long passwords.

They hash in the browser: the only way they can mess with it is by producing silly outputs, but that only hurts them.

I can't think of any security implications of hashing on the client-side. What's your thinking?

Does salting work if you hash in the browser?

Well in this case the hash would be passed to Bcrypt or Scrypt, which have built in salt support, so client side salting wouldn't matter.

If the hashes are leaked, you could log in with them.

Well serverside you store them as plaintext equivalents - i.e. salt+hash the hash. So a leak doesn't leak the user-side.
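One way to sketch that split, with SHA-256 standing in for the browser-side hash and stdlib scrypt as the server-side slow hash (parameters are illustrative):

```python
import hashlib, hmac, os

# Client side (in practice, JavaScript in the browser): only the 64-character
# hex digest ever leaves the machine, so the endpoint can enforce a fixed length
# while the user types an arbitrarily long passphrase.
wire_token = hashlib.sha256(b"an arbitrarily long user passphrase").hexdigest()
assert len(wire_token) == 64

# Server side: the wire token is password-equivalent, so never store it as-is.
# Salt it and run it through a slow hash, exactly as you would a raw password.
salt = os.urandom(16)
stored = hashlib.scrypt(wire_token.encode(), salt=salt, n=2**14, r=8, p=1, maxmem=2**25)

def login(token: str) -> bool:
    if len(token) != 64:        # reject anything that isn't a SHA-256 hex digest
        return False
    candidate = hashlib.scrypt(token.encode(), salt=salt, n=2**14, r=8, p=1, maxmem=2**25)
    return hmac.compare_digest(candidate, stored)

assert login(wire_token)
assert not login("deadbeef" * 8)   # right length, wrong token
```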

It's probably a naive substring detection check.

I hit this on the last Yahoo hack go-round, and it seemed that having a name in the form 'F Lastname' (for example) disallowed use of the letter F in the password.

I say "seemed" as I did not go through the exercise of testing with multiple combinations of name, initial, and password.

* Defending Against Hackers Took a Back Seat at Yahoo, Insiders Say - The New York Times || http://www.nytimes.com/2016/09/29/technology/yahoo-data-brea...

Time to update that article from September. Hooray for Yahoo, they made it 76 days without a 500M+ user security breach.

(No, I don't know the actual dates... just making a joke.)

Last time I tried changing my Yahoo password it took me days before it accepted something (and I had password generator scripts and my brain). Now it's back to something along the lines of `letmein`.

Just leave it at passwordpassword, it will be leaked eventually anyway

Strong passwords that need to be memorized shouldn't be wasted on security bozos

He doesn't need to memorize it. He mentioned he used 1Password to generate it. I'd assume he's storing it there too.

I've always wondered when 1Password is going to get hacked..

Won't do much; AFAIK everything is encrypted client-side with your master password. So a hacker could, in theory, get my encrypted database, but by the time they crack my strong password, I will, at the very least, have changed all those passwords.

That's not what a hack against 1Password, LastPass, or similar product will look like. When it happens, it will be because someone manages to commit to the VCS repository of one or more of their client applications (iOS, Android, desktop, etc.). All it takes is a few lines of code to dump the unencrypted contents on the device itself, and post them to some API endpoint or email address.

One commit to a VCS by a disgruntled employee, or an attacker who social engineers credentials to the VCS, and the client applications themselves - which must be trusted to decrypt the contents locally - will be compromised.

This is the problem with proprietary password managers, where the client applications are provided by the company. You cannot vet that software which is running on your device today, let alone all the app updates coming down the pipeline.

Thank you for writing this. I use a password manager, and whenever I see someone say "it's unhackable because of the encryption" I want to tell them this, exactly. All someone needs to do is to surreptitiously send your password to their own server and all your passwords are owned. It's not difficult.

I've often wondered about this. Is there a preferable alternative?

At the very least you need to memorize the 1pw password, but I do memorize some others as well

I can kind of understand that reasoning, but one of the nicer things about strong passwords is that there are a lot of them. In some sense that's what makes them strong passwords.

Why do you assume the password would be leaked eventually? Usually hashes are leaked (as in this case), not passwords.

leaking unsalted md5 passwords == leaking passwords

Not if it's my password. I use ~100 bits of entropy.

Hence "Strong passwords that need to be memorized" in OP's comment. Or else your memory is way better than mine (or I care way less, or probably both).

Whether they need to be memorized or not does not make the statement "it will be leaked eventually anyway" more true.

I use a password database so I don't memorize most of my passwords.

Well, it does, because memory puts some limitations on length and complexity...

It is possible to memorize a 100 bit password. I once had a 1000 word poem memorized, and could write it down flawlessly from memory.

I agree that it's not worth memorizing, you should instead use a password database. But I still maintain my original point that there's no reason to assume that your password will be leaked eventually if you use a strong password.

Could the attacker find an easier to find string that matches the same md5 hash?

The current best attack wrt matching an existing hash brings MD5's 128 bits of security down to 123. So no, that's not going to happen.

MD5 is terrible for human passwords because it's fast. But md5 is not actually broken for password storage purposes. If you use a long random password, md5 is enough.

Yes, if you add a (long - at least 32 bit) salt and something like at least 10^9 rounds of MD5 then, yeah, it's probably ok

No. I mean single unsalted MD5. You will not crack a 20-random-char password. You cannot process 2^120 guesses, and MD5 is not broken for this use.

Just follow NIST guidelines and never change it. That way when the servers in Utah crack your password, they don't have to recrack it later.

I've run into off-by-one issues in password length requirements in the past, so if 32 characters is the stated maximum it might only be capable of 31 on the validation side.

That asymmetry in length support strongly suggests they're storing passwords in plain text.

I tried to change it to a 64-character random 1Password string with numbers, letters and symbols. It complained it was too easy to guess. I submitted the exact same password again and it accepted it.

> August 2013

> hashed passwords (using MD5)

I don't even know what to say.

> investigating the creation of forged cookies that could allow an intruder to access users' accounts without a password. Based on the ongoing investigation, we believe an unauthorized third party accessed our proprietary code to learn how to forge cookies

How is this possible? Aren't most auth cookies just a session ID that can be used to look up a server-side session? Did they not use random, unpredictable, non-sequential session IDs?

1) As Yahoo "upgraded" all password storage in UDB (where all login / registration details are stored) to be bcrypt before 2013, I'm curious how this was possible.

2) Yahoo doesn't use a centralized session storage. If you know a few values (not disclosing the exact ones) from the UDB, it's theoretically (guess not so theoretical now) possible to create forged cookies if you steal the signing keys. To my knowledge, the keys were supposed to only be on edit/login boxes (but it's been a while so I may be forgetting something), so this is a pretty big breach.

On a number of engagements I've come across password databases that have been migrated to bcrypt. In one case I checked CVS to see who made the code change, and found the MD5 passwords on his dev box. In another I tracked down a MySQL slave that had broken replication for over a year.

In both cases I tried to track down backups, but discovered neither company was keeping them. That is another possible vector.

1) I'd be flabbergasted beyond belief if there was ever a Yahoo! engineer who had user passwords on their laptop / Dev box. The technical hurdle for that would be a stretch, let alone the fact of the other ramifications of doing this.

2) there's no SQL database involved with Yahoo!'s storage of passwords. It's a custom built db system with proprietary access and replication protocols.

I wasn't saying either possibility was the cause of the Yahoo breach. Simply pointing out that there is always another way.

The NSA's MUSCULAR program, for example, decoded proprietary secret squirrel cross-datacenter replication protocols designed by both Google and Yahoo, so that isn't much of a safeguard against state-level actors.

Yet, somehow they did get out.

Apologies, I've heard the details at this point and I can't disclose them. The limit of what I can do is poke holes in the theories that are wrong.

Aren't the details "three years after we were hacked, law enforcement told us that we had been hacked, and we believe them?"

The press release explicitly says "We have not been able to identify the intrusion associated with this theft." I especially noticed that the "What are we doing to protect our users?" section doesn't mention anything about Yahoo fixing any security issues.

Presumably, then, as a Yahoo engineer, you know what your security practices are but you don't know what you did wrong or whether you've fixed it.

Do you honestly believe a press release covers every detail, especially ones with strong legal implications, and might not have rather been worded very carefully?

The contrast between your statements and the press statement is great enough to imply Yahoo is being dishonest.

"Dishonest", not in the slightest. From what I'm told, they really don't know how they got in. But that's only the part of the story discussed in the press release, what's not discussed is how the data existed in that format.

From my experience if Paranoids did know they would have locked it down at the expense of engineers or others. I know since I have made breaking changes to infrastructure which did lock out some engineers and cause plenty of headaches.

Every Yahoo I have ever known has cursed the Paranoids for getting in the way. Every Yahoo that has actually been in a situation has also blessed the Paranoids for the same reasons.

Simple fact is that Yahoo has a mega butt ton of code from several decades. There are going to be holes, and when they are found they are fixed pretty damn quick. Last one I dealt with was solved in hours with all hands on deck. Sometimes it just sucks to be as old as Yahoo is.

If they do not know how the adversaries got in, how do they know the adversaries are not still in to some degree?

Good point. I don't know if they do know that for sure.

> the "What are we doing to protect our users?" section doesn't mention anything about Yahoo fixing any security issues.

"We continuously enhance our safeguards and systems that detect and prevent unauthorized access to user accounts."

At the end of the same paragraph. They're already continuously updating security, before they even knew they were hacked. Three years have passed, so for all they know something in those continuous updates covered this hack.

I am taking a WAG here but if they got code then they might be able to take educated guesses at the UDB values without actual access to UDB. Those guesses are more likely to be true with bot registered accounts where there is duplication of information.

This goes back to my theory that a good portion were junk accounts.

Not saying this is acceptable, just saying garbage in garbage out.

You can't guess the XX (anonymized for obvious reasons) key without access to the UDB.

I'm guessing by your handle I know who you are :). Ex-Yahoo super chat moderating guy here, which should let you know me.

Wouldn't the upgrade require the accounts to actually log in to migrate the password? Last I was at Yahoo there were at least 3B junk accounts in UDB. Without knowing details I am guessing that many of the "compromised" accounts fall into that bucket.

I get that membership can't just trash junk accounts but marketing was very aware of them. Paranoids also can't just say a compromised junk account is not a compromise, they are too paranoid for that.

This unfortunately sounds bad PR-wise, with little knowledge of actual impact. On the flip side I'm pretty sure I am not on the radar of the state actor since they would more than likely be looking at their own.

Just to confirm, purple Yahoo! car in YEF spot ;)

As to your question, no, they didn't need to login due to how the hash "upgrade" was done (unlike how Tumblr did it around the same time). I was one of the people in the billion accounts and I definitely have logged in and also changed my password multiple times (also have very high entropy passwords and use TFA).

It wasn't me despite your DR Ycan't photos. :)

Tumblr was indeed what I was thinking about.

What's funny is that there's someone currently working at Yahoo with a name scarily similar to yours and I was pretty sure for a moment that you were some random ycombinator person faking being him.

Although...he IS cool.

bcrypt(md5(password)) allows the existing password hash to be reused.

No. They've stolen the hash, so if they crack it, you've just let them waltz back in.

The correct response is force a password reset, and _delete_ weak hashes so that they cannot be stolen in a subsequent breach. At worst, store a bcrypted md5 password as you suggest, but only as a check for a password the user must not be allowed to use again; it _cannot_ be used to sign them in.

One of the attacks you're preventing is on _other_ sites, where the user has reused the passwords. Keeping around weak hashes even to let that user perform a reset is risking that hash being taken, cracked and used in a breach elsewhere.

When they did the bcrypt(md5(password)) there was no leaks of Yahoo!'s md5'd passwords. That's obviously changed now and thus why the billion passwords were invalidated (I'm one of those folks btw, but I also had TFA on my account and my password had sufficient entropy you won't brute force the md5).

Keeping around weak hashes even to let that user perform a reset is risking that hash being taken, cracked and used in a breach elsewhere.

We're currently working on PCI compliance. In pen testing, we got dinged for not preventing re-use of prior passwords, and that bothers me for exactly this reason (plus the new NIST standards say NOT to force periodic changing).

I believe that our hashes are strong (using scrypt, salt, etc.). But the belief that you're getting it right shouldn't let you be lax in other areas, hence security in depth.

So I really object to the requirement that we keep around those old hashes.

Good point. Thanks for pointing out my mistake.

Is the info about the Y and T cookies in this pdf [1][2] accurate?

[1] (EDIT: now with screenshots) http://imgur.com/a/g61VZ

[2] (Not affiliated with link, but the risk-averse may wish to open in a sandbox) ftp://hackbbs.org/milworm/270

Doing a google search for the link showed me the title of the document which I remember reading in the past. The overall coverage of Y&T cookies is more or less accurate at the time of writing back in like 2010/2011, but there's a bunch of mostly minor technical inaccuracies too. I don't want to comment on much without rereading it, but I remember the description of Sled ID made me laugh (which btw I'd guess less than 1% of current Yahoo employees knows what that is).

Also, the video that goes with the PDF is too funny! Just watched it on YouTube [0] again. Notice how he doesn't actually sign into Web Messenger, just goes to the login page? If he had, it would've failed. Same thing with him closing the browser before Yahoo Mail loaded. "Sensitive" reads and everything that did a write operation always (unless there was a bug) validated the cookie against the UDB. So even if you stole the signing key, without the values from the UDB, you would have very limited ability to do anything other than the trivial things shown in the video.

[0] https://m.youtube.com/watch?v=n2CNp_zmje8

It seems that Yahoo has a problem with moribund accounts: many people had a Yahoo ID 10-20 years ago, and then abandoned it.

If these accounts are not deleted (and there are a bunch of organisational reasons not to), then the MD5 hash has to be kept around somewhere, until the user re-enters a password and a better hash is generated.

> Yahoo doesn't use a centralized session storage. If you know a few values (not disclosing the exact ones) from the UDB, it's theoretically (guess not so theoretical now) possible to create forged cookies if you steal the signing keys. To my knowledge, the keys were supposed to only be on edit/login boxes (but it's been a while so I may be forgetting something), so this is a pretty big breach.

Isn't that highly confidential company information?

> 1) As Yahoo "upgraded" all password storage in UDB (where all login / registration details are stored) to be bcrypt before 2013, I'm curious how this was possible.

You check the plaintext password sent to the backend against the md5, on success you rehash it as bcrypt, insert it in the table.

Web tokens, for example, don't necessarily include just a session ID. Some include the full session details within its payload. This can be quite useful, actually, because it offloads session-lookup onto the client.

How do you invalidate a JWT server-side without the user interacting with the server ?

My preferred method:

Add an "expires" field to the token; this should contain a date after which the token is no longer valid. Now all tokens auto-invalidate after a certain period.

Allow some or all tokens to "refresh" by calling a particular endpoint (call with valid token and get a token with expiry from now).

Optionally add some form of identifier to the token (user_id works great) so that you can push a message out to your servers that looks like this: "All tokens for x expiring before y are invalid". Once time y has passed your server can forget about the message. This will be a very small set (often 0) as very few people use the "log out my devices" features.

Logouts should be done client side by deleting the token.

If you are worried about your token being sniffed you are either not using HTTPS, or sticking it somewhere stupid.
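The three-step scheme above (expiry claim, refresh, per-user revocation cutoff) can be sketched with a minimal HMAC-signed token. This is an illustrative toy, not a compliant JWT implementation; field names mirror the standard `sub`/`iat`/`exp` claims:

```python
import base64, hashlib, hmac, json, time

SECRET = b"demo-signing-key"   # hypothetical; a real key lives in a secret store

def issue(sub, lifetime=3600, now=None):
    now = time.time() if now is None else now
    body = base64.urlsafe_b64encode(
        json.dumps({"sub": sub, "iat": now, "exp": now + lifetime}).encode())
    sig = base64.urlsafe_b64encode(hmac.new(SECRET, body, hashlib.sha256).digest())
    return body.decode() + "." + sig.decode()

def verify(token, revoked_before, now=None):
    now = time.time() if now is None else now
    body_b64, sig_b64 = token.split(".")
    expected = hmac.new(SECRET, body_b64.encode(), hashlib.sha256).digest()
    if not hmac.compare_digest(base64.urlsafe_b64decode(sig_b64), expected):
        return None                                 # forged or corrupted
    claims = json.loads(base64.urlsafe_b64decode(body_b64))
    if claims["exp"] < now:
        return None                                 # auto-invalidated by expiry
    if claims["iat"] < revoked_before.get(claims["sub"], 0):
        return None   # "all tokens for x issued before y are invalid"
    return claims

tok = issue("alice", now=1000.0)
assert verify(tok, {}, now=2000.0) is not None
assert verify(tok, {}, now=5000.0) is None                  # expired
assert verify(tok, {"alice": 1500.0}, now=2000.0) is None   # revoked
```

Refresh is just `issue()` called again against a still-valid token, and the `revoked_before` map is the small, mostly empty set the comment above describes.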

> Add an "expires" field to the token, this should contain a date after which the token is no longer valid. Now all tokens auto-invalidate after a certain period.

Doesn't JWT already have this - "exp" is a reserved claim for expiration time?


4.1.4. "exp" (Expiration Time) Claim

The "exp" (expiration time) claim identifies the expiration time on or after which the JWT MUST NOT be accepted for processing. The processing of the "exp" claim requires that the current date/time MUST be before the expiration date/time listed in the "exp" claim.

Yes, but that is more for standard idle-time expiration. The problem being addressed above is actively invalidating an existing JWT for a user once they already have it (and before the default/original expiry is met).

> Now all token s auto-invalidate after a certain period.

You need to make sure that there is some process that will refuse to keep on re-upping the cookie lifetime. Otherwise an attacker could indefinitely keep the stolen cookie alive.

If you see a suspicious usage pattern then force a login by invalidating the tokens. Allowing indefinite refreshing is a feature and a drawback of this method.

You can combine a session cookie with a JWT token that gets sent over a header

Which gives you the worst of both worlds

Tokens have in-built expiry dates (cryptographically signed by the server upon issuance). Once that date has passed the token becomes useless.

If you meant "how can you prematurely invalidate a specific user's JWT without needing a server side lookup", you can't.

I think the best you can do is issue different classes of JWT to a user based on what actions you wish to grant them. This lets you reduce load going to backend lookups to only a subset of JWTs where the ability to invalidate them earlier than planned on a per user basis is necessary/desired.

For JWTs that aren't tied to backend lookups the only solution if one or more users are accessing resources they no longer should be via one of these tokens is to invalidate all of them.

The client can hold onto the token indefinitely, the server doesn't care. But next time a request comes in with that token it will be expired. The server validates the timestamp which is part of the encrypted payload that only the server can decrypt; instant validation and no DB lookup.

This is possible if you support the 'jti' claim[1]. There's a discussion of an implementation of it here[2].

[1]: http://self-issued.info/docs/draft-ietf-oauth-json-web-token... [2]: https://auth0.com/blog/blacklist-json-web-token-api-keys/

Each JWT has an issued at date, so you just need to reject all tokens issued before that time. In addition to invalidating all tokens if there is a breach, each user account can have its own datefield to invalidate all the tokens for that account if a user changes their password or whatever.

I'm not too familiar with JWT, but i have some hands-on experience with Macaroons; the simplest way would be to have a custom caveat of validity set in the token, let's say, a validity GUID, which is an id of server-side record of validity (true/false), e.g. in some database table. Once you set that record of validity to false, the token bearing that GUID automatically becomes invalid.

Otherwise, without server-side changes (such as change of secret key used for signature generation), it is impossible.

With JSON web tokens (JWT), the client or server must know the secret key used to sign the token in order to validate it, but anyone can view its payload.

Could do it if you knew the JWT token text in theory?

MD5 is still not too bad, if properly salted. And if you use multiple rounds of hashing, it can be as slow as Bcrypt. As far as I know, MD5 is still not generally broken, we only found some weaknesses.

To prove me wrong you can try and reverse this one (unsalted, just one round):


Even so, the fact that we have the knowledge to generate collisions in MD5 means you really shouldn't be relying on it when there are better alternatives.

Try and generate a collision with the hash I gave. You can't, as far as I'm aware.

We can only generate collisions of carefully crafted sources, not arbitrary ones.

So MD5 is fine, as long as you follow the standard procedure for storing password hashes:

1) Unique salts + long master salt (to prevent rainbow table lookups).

2) Enough rounds of hashing.

3) Don't allow the most common passwords.

4) Don't allow very short passwords.
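Points 1 and 2 of that list can be sketched as a toy key-stretching loop. This is purely illustrative of the commenter's argument; in practice use bcrypt/scrypt as they say, and the salt sizes and round count here are assumptions:

```python
import hashlib, hmac, os

MASTER_SALT = b"site-wide-master-salt"   # hypothetical long master salt ("pepper")

def stretched_md5(password: str, salt: bytes, rounds: int = 100_000) -> bytes:
    # Toy key stretching: chain MD5 over salt + previous digest so each
    # guess costs `rounds` hash invocations instead of one.
    d = hashlib.md5(MASTER_SALT + salt + password.encode()).digest()
    for _ in range(rounds - 1):
        d = hashlib.md5(salt + d).digest()
    return d

salt = os.urandom(16)                    # unique per-user salt
stored = stretched_md5("correct horse battery staple", salt)
assert hmac.compare_digest(stretched_md5("correct horse battery staple", salt), stored)
assert stretched_md5("password", salt) != stored
```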

I'm not saying MD5 is ideal, I use Bcrypt / Scrypt myself. But it's not MD5's fault Yahoo's engineers are lame.

I'm wondering if this is one of the reason Alex Stamos left...

DO NOT delete your Yahoo account! In their disclaimer when you delete it, they state:

> "[...] we may allow other users to sign up for and use your current Yahoo! ID and profile names after your account has been deleted"

Bummer if you forget that it was the password reset email for your Facebook account, huh? Instead of deleting your account, purge it of all data: https://honeypot.net/purge-your-yahoo-account/

I just deleted all data from my account and set an automatic responder stating that, due to security concerns, I no longer use that account. I created my Y! account in 1998, it's a shame it has come to this. There were a lot of memories I had to purge along with my account (even though I had a different main account in the last decade). Shame!

This is a terrible policy. Do other email providers have a similar policy?

Probably not that terrible if they only do it for accounts that were created and never used. Like all the good GitHub usernames that seem to be abandoned.

GitHub usernames and emails are very different things. You don't get password reminders sent to your github profile, but you can get those via email.

BTW, no, most email providers never allow the reuse of closed account names.

Microsoft seems to, although I can't find a specific statement from them confirming it: http://windowsitpro.com/blog/recycled-email-addresses-and-ou...

If someone knows how to delete more than 100 emails at a time, let me know. I have more than 10k emails, 80% of which are probably spam!

... And the answer is, scroll to the very bottom, then delete. I was able to delete over 1000 that way.

The other way is to search before:"2016/12/15", and delete all the search results.

They used to automatically put an email address back into circulation if you failed to log in for 6 months.

"Separately, we previously disclosed that our outside forensic experts were investigating the creation of forged cookies that could allow an intruder to access users’ accounts without a password. Based on the ongoing investigation, we believe an unauthorized third party accessed our proprietary code to learn how to forge cookies."

So that exactly explains how my Yahoo account was used to send spam despite having a password that can't be reasonably brute forced (despite them using MD5). :-/

The forged cookie attack was used on a limited number of accounts, by a state sponsored actor. Going to this amount of effort and then sending spam would be on par with breaking into a bank just to steal the printer paper from the office.

Most likely either: 1) you were phished and didn't realize it 2) logged in to your Yahoo account from a device that had malware on it

> just to steal the printer paper from the office

Or stealing $6,000 with a $100,000 gun :)


I'm willing to accept that perhaps that was not how my account was compromised but the time frame when this happened was well in line for when this breach supposedly occurred.

Regardless, it was some sort of automated spam/phishing emails that were sent from Yahoo's network using my account to contacts on my list. I analyzed the headers of multiple bounced messages that were sent to email addresses no longer in use and confirmed the origin of the traffic.

I'm not going to fall for a phishing attack, and I only access email from devices I personally control. Could one of them have had some sort of malware infection? I guess it's possible, but I am security conscious and it's highly unlikely. I'd also expect a hacker who has compromised one of my devices to be far more interested in my banking credentials than in using my Yahoo account to send spam.

You reused the password on other websites, I'm guessing. Especially likely if it was a strong (i.e. hard to memorise) password.

The bulk hacking attacks that began around Spring 2010 hit all the big webmail providers. The source of the passwords was always, without fail, reversed hashes from breakins at other big websites:


Source: was a tech lead on the Google anti-hijacking team during this period.

Nope, not password re-use either. I learned that lesson the hard way over a decade ago.

Regardless, it's something that has always continued to eat at me since I can't say for certain how it happened.

Are you sure they actually logged in to your account to send spam (are the spam emails visible in your sent folder), or could it be that someone is just spoofing the SMTP MAIL FROM / email From: header?

As far as I can tell it wasn't someone spoofing my email address. Emails were sent to people on my contact list and the numerous bounce messages to contacts that no longer had valid email addresses confirmed the origin of the traffic.

It's possible that a contact of yours was compromised, and that contact had many contacts in common with you. And then they spoofed your address.

That's a good theory but in my case the sets of common contacts would be almost nil for that account.

I had the same issue, I could see the email sent from sent folder. This happened about year ago and I was very surprised.

Given Yahoo's security policies, who's to say someone wasn't just sending it from Yahoo's SMTP servers without any access to users' email accounts?

What do you mean by a password that can't be reasonably brute forced?

EDIT: To clarify, I mean specifically with md5. I'm by no means an expert, just curious because I had considered md5 so broken that this comment caught my attention.

Rumours of MD5's death have been greatly exaggerated.

MD5's weakness is that it's (relatively) easy to produce two strings which have the same hash. However, given an MD5 hash, it's not easy to produce a string which also has that hash.

In principle, one could intentionally construct two passwords which have the same hash. It's hard to see how that could be exploited maliciously - any attacker knows both passwords to begin with. Even then, making colliding strings that would make acceptable passwords hasn't been done yet, AFAIK: the shortest colliding strings found so far are 64 bytes long and contain several unprintable characters.

OTOH, computers are fast enough now that brute-forcing MD5 is practical for short strings with a limited set of characters, which is what passwords tend to be. One should use algorithms like PBKDF2, scrypt, and bcrypt which can increase their complexity as the computation capacity of potential attackers increases. This isn't because of a particular weakness in MD5, though, and one should equally avoid storing passwords as SHA-512 hashes, say.

The thing you definitely shouldn't use MD5 for is digitally signing a file you didn't make, because it's possible that whoever did make it also made another file with the same MD5 hash, for which your signature would also be valid.
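For what it's worth, Python's standard library already exposes scrypt (3.6+, OpenSSL-backed), so "use a tunable KDF" doesn't even require a third-party package. The cost parameters below are illustrative, not a recommendation:

```python
import hashlib
import os

# Stdlib scrypt (Python 3.6+, needs OpenSSL 1.1+). The cost parameters
# (n, r, p) can be raised over time as attacker hardware improves.
salt = os.urandom(16)
digest = hashlib.scrypt(b"correct horse battery staple",
                        salt=salt, n=2**14, r=8, p=1)
print(len(digest))  # 64-byte derived key by default
```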

On a side note: you can use such crafted strings as a black-box testing tool to verify whether a site does in fact use MD5 or other weak algorithms to store passwords. This can perhaps be used in conjunction with other factors to craft an attack.

As a corollary, this can also be used as a testing tool by anyone, on any third-party site, to determine known vulnerabilities in its password storage.

Definitely check this episode of 'Hacked' out for a simple overview. I just started listening to this show. It's a shame there are so few episodes.


A preimage attack for MD5 has complexity of about 2^123. So, even if you get the MD5 hash for a password, it will be exceedingly hard to find a password that has the same hash (assuming the original password is long and random).

I don't think that's true.

This site from 2006 claims they could find collisions in an average of 45 minutes on a 1.6 Ghz Pentium 4: http://www.bishopfox.com/resources/tools/other-free-tools/md...

If you account for speed increases over the last 10 years and assume the password thief has access to a botnet, then it wouldn't surprise me if they've found collisions for the entire list.

Edit: Nevermind, the link finds two strings that hash to the same thing; it does not find a string that hashes to an existing hash.

The collision generator behind that link does not implement a preimage attack (given a string X, come up with another string Y with the same MD5 hash).

Instead, it implements the much easier collision attack (come up with two strings that have the same MD5 hash).

I thought the whole point of the MD5 vulnerability was that the limit was 2^128, and as such there are more inputs than possible output hashes, meaning more possible input collisions.

All hash functions have collisions. The point is that a good cryptographic hash function makes it very hard to find collisions.

The “preimage attack” on a cryptographic hash function tries to find a message that has a specific hash value. That is, you lock down a hash value (the MD5 hash for a password) and try to find a message that hashes to that value (the original password, or any other input that happens to have the same hash).

The best known preimage attack against MD5 has complexity 2^123. It's better than brute forcing, but still impractical. Thus, if I come up with a good password that is long and random, you will have a very hard time coming up with a string that has the same MD5 hash value.

The practical attacks against MD5 are collision attacks. A collision attack tries to find two messages with the same hash value. With MD5 in particular, there's a chosen prefix collision attack, where you choose two messages and append to them so that the hashes will match. This was particularly devastating with X.509 signatures and certificates, where the attacker could have the MD5 hash signed by a certificate authority, and then use the same signature with their other message that has the same MD5 hash.

What about Rainbow Tables? (https://en.wikipedia.org/wiki/Rainbow_table#Precomputed_hash...)

Instead of computing the MD5 of a huge number of passwords looking for a match, you simply store the precomputed password and hash pairs in a database table.

A rainbow table is just a precomputed table of hashes for a lot of passwords. Some tricks are used to make the table smaller, but you can think of it as just a lookup table. Only the passwords that were precomputed and put into the table will be found.

Rainbow tables are usually computed for short passwords (1-10 characters) and limited character set (say, alphanumerics). They are good for finding the bad passwords if you get your hands on a set of MD5 hashed passwords. But they are of no help if you need to reverse a good, long, random password.

Every hash has a finite output length, and therefore a finite number of possible outputs. 2^128 is a very large finite number. It's not that large in the grand scheme of things (there are over 2^260 or so atoms in the universe), and it's definitely better to use a hash with 2^256 outputs now that there exist good 256-bit hashes that are faster than MD5, but 2^128 is still quite a large number. The internets are quoting me about 10 billion hashes per second on a good GPU from a few years ago, which comes out to about one sextillion years to find an input for every possible output. (It divides linearly if you have more GPUs, but that clearly won't help very much.)
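A quick sanity check of that figure, taking the 10 billion hashes/second quoted above at face value:

```python
# Exhausting the 2^128 output space at an assumed 10^10 MD5 hashes/sec (one GPU)
hashes = 2 ** 128
rate = 10 ** 10                        # hashes per second
seconds_per_year = 60 * 60 * 24 * 365
years = hashes / (rate * seconds_per_year)
print(f"{years:.2e}")  # roughly 1.1e21 years, i.e. about a sextillion
```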

What's broken about MD5 is that, due to an algorithmic flaw, it's very easy to generate two inputs of your choice that have a matching output. That's great if you want to do things like spoof an SSL certificate (you generate two certificate signing requests, get one of them signed, apply the signature to the other), but not directly helpful for attacking a password hash where someone else chose the password.

What is conceptually broken is that such an algorithmic flaw exists, and also due to algorithmic flaws it takes a bit under 2^128 tries to find an input for a specific possible output. That worries mathematicians, because it's a sign the hash isn't behaving as randomly (speaking informally) as one would hope, and that people are starting to understand its structure. If that understanding continues, it might be broken more in the future, so you absolutely shouldn't build new systems on MD5 because we expect the research to happen at some point.

But, at least today, it's still true that you can have a password that can't be brute-forced despite the use of MD5. Maybe someone will present a paper tomorrow that disproves that.

This is a very clear explanation, thanks!

All hashing algorithms that I am aware of have more inputs than outputs. By the pigeon hole principle, there will always be collisions. MD5 is weak, but it still isn't trivial to find an input that hashes to the same thing as a high entropy password.

> that hashes to the same thing as a high entropy password.

To be clear, it's not the entropy of the original password that matters, except for the fact that all common low-entropy passwords already have their MD5s stored in public databases. (What hashes to 5f4dcc3b5aa765d61d8327deb882cf99? You can look it up with Google.)

You can come up with two plaintexts that hash to the same thing in MD5. You can't come up with something that hashes to a new MD5 value given to you, aside from finding it in one of those databases.
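That particular hash is a one-liner to check in Python:

```python
import hashlib

# The hash quoted above is simply the MD5 of the word "password",
# which is why every public lookup table has it.
print(hashlib.md5(b"password").hexdigest())
# 5f4dcc3b5aa765d61d8327deb882cf99
```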

If the password is long and complex enough, it won't be in any rainbow table computable in reasonable time. While MD5 can be computed quickly, there is still a limit to how many hashes you can compute -- and there are an infinite number of possible passwords if they aren't length limited.

Interestingly, even if passwords can be arbitrarily long, an MD5 hash has a fixed finite length. You can think of it as a glorified modulus operator: beyond some point, longer passwords will have hashes that match shorter ones.

True -- but assuming these passwords aren't stored the same (very, very wrong) way on another site, and they're no longer useful on Yahoo, what's important is finding the real password, not just a password that happens to match the given hash.

Rainbow tables are attacks against secure algorithms.

MD5 is recognised as an insecure algorithm: given a known hash, there are multiple possible passwords that would resolve to the same hash, therefore appearing to be the correct password.

With MD5, it's not necessary to compute an infinite number of possible passwords, and it is possible that, given a particular hash, a collision can be found within a reasonable time.

Either a) you don't have a clue about the complexity involved in finding a collision for a specific hash or b) your definition of "reasonable time" is longer than the age of the universe and/or using 100 trillion state of the art GPUs is realistic.

I'm leaning towards option a, you read a blog post once and think you're an expert on cryptography now.

  > the complexity involved in finding a collision for a specific hash
If it can be shown that a preimage can be computed in less time than an exhaustive search, the algorithm is generally regarded as having a weakness, even if the given "less time" is still a very, very long time.

The theoretical complexity of MD5 is 2^128, but a preimage attack was discovered in 2009 which showed that a collision can be found in 2^123.4. [1]

Collision attacks against MD5 have become more practical; there are even frameworks for it [2]. The complexity of 2^123.4 still makes a preimage attack against MD5 computationally infeasible, but given that it's been shown to be weaker than its theoretical 2^128, it's possible that MD5 has other weaknesses which would allow the complexity to be reduced to a level that is computationally feasible.

[1] https://www.iacr.org/archive/eurocrypt2009/54790136/54790136...

[2] https://marc-stevens.nl/p/hashclash/

To be fair, pretty much every MD5 discussion I've ever seen or been involved in (including with "security expert" former coworkers) has had someone making the same claim.

What you're describing is the same for every hashing algorithm in existence. All hashes can represent multiple (indeed, infinite) passwords. So they all have collisions. This is because all hashes are fixed-length, and so finite, while the possible inputs are infinite.

This isn't the reason that MD5 is weaker than other algorithms.

You are describing a first preimage attack. There have not been any computable first (or second) preimage attacks on md5.


There are collision attacks, but that is not relevant for password cracking.

From 2009: a preimage attack reduced the complexity from 2^128 to 2^123.4 [1].

It's still a big number, but it's less than the theoretical complexity.

[1] https://www.iacr.org/archive/eurocrypt2009/54790136/54790136...

What I meant by "computable" is something that can be computed with today's hardware.

Pretty much even if you choose a high entropy password like say:

the MD5 algorithm can be broken using various techniques, like collisions. "Unsalted", I believe, means that their database would accept the hashes the third party already has. The end result: they should have migrated away from MD5 after it was declared unsafe.

No it can't.

Two principles here:

1. If your password is very very good (a Diceware password would suffice), then any method of storing passwords that is better than storing them in plaintext will stop someone from brute forcing it.

2. If your password is very bad, then even an excellent password hashing algorithm will not save you.

"Just use bcrypt" is meant to save people who are in the middle.

No, a collision attack would not give you the plaintext from a hash. A first preimage attack would do that, but no computable first (or second) preimage attacks against md5 have been found.


Nope, that doesn't explain it. Without Yahoo! UDB access to get a couple values unique to your login, you can't forge a cookie that allows you access to Yahoo! Mail.

Related: former Yahoo security engineer talks about a backdoor Yahoo installed for the NSA to read private emails...behind their security teams' backs...


In case you are looking for the important information: it seems the passwords were MD5 hashes without salt.

Bloody hell. Sloppy and incompetent.

I'm genuinely curious how the decision to use MD5 gets made. Who says, "hey, maybe we should use MD5." And then who responds, "that sounds like a great idea Bob." Seriously. I've known for years that MD5 is insufficient for hashing passwords and I'm just some random guy. This kind of thing really baffles me.

Yahoo has been a company for a long time. I imagine your conversation happened round about 1999 when using MD5 wasn't insane. And then they were just slow to upgrade.

It's still bad, I'm just saying the conversation about what hash algo to use didn't happen yesterday.

I'd like to believe that. However, I was recently asked to test a new website for an organization I volunteer for, and discovered their "forgot password" flow emailed me my plaintext password. I wrote an explanation of why this was bad, and how it could be fixed, to a non-technical friend of mine who works there; he passed my email to the (Bay Area based!) consulting shop that did their website. The shop sent this response:

"We do not store passwords as a plain text in database. We have functionality which encrypts and decrypts passwords. We have only ecnrypted passwords in the database.

Almost all other servers use one-way encryption. In this case, passwords cannot be decrypted from hashing."

Again, this is a Bay Area based shop. For code written in 2016.

I was shocked to receive this, but it (among other things) leads me to suspect that there are lot of people out there, in positions of power, who aren't just ignorant, but who actively cling to password-storage anti-patterns.

I'm at a loss for how to fix this.

Just for clarity, the "forgot password" flow emailed you the current password of the account (not a temporarily one)?

That's insane...

Yes, the current password.

submit the website to http://plaintextoffenders.com/

Ironically, hosted on a Y! site.

But it's not as if we haven't had a pretty much continuous stream of major data leaks over the past 5 years. Surely Yahoo engineers occasionally open a newspaper...

From everything I've read, the engineers did. The problem was that the security team had to go head-to-head with the budget team. And unfortunately, the budget team won - since the upper levels didn't feel that the IT security salaries were a necessary expenditure. And beyond that, there was concern that making people actually change their passwords regularly and requiring anything like security in said passwords was going to discourage users from using Yahoo and send them over to GMail.

Unfortunately... that argument wasn't wrong.

> The problem was that the security team had to go head-to-head with the budget team. //

Wouldn't engineers at such a big corp whistle-blow such incompetent decision making?

Apparently [1] they had a $1.37B net income in 2013. Given that using bcrypt (Blowfish-based, salted) was pretty much a de facto standard by that point (I think that's what Wordpress was doing, hardly revolutionary security work), the relative cost for Yahoo was approximately zero.

All I can imagine is that those in control were asked to leave the system open for government snooping? Why else would engineers working there not [anonymously] bring this to press attention - "hey, Yahoo security amounts to a piece of sticky tape holding a bank-vault shut".

- - -

[1] http://www.marketwatch.com/investing/stock/yhoo/financials#

It's not that hard to implement something at the start. It's more work to retrofit it on top of an existing system in a way that doesn't reduce the total security.

But would it require users to change their password?

The way I would have implemented it, but would be keen to know how secure it is, is that you start with the md5 of the password (md5(password)). You then bcrypt or scrypt that md5 (bcrypt(md5(password))) and replace the md5 in your database with the bcrypt hash.

When a user logs in, all you need to do is to calculate the md5 first then check that md5 against the bcrypt hash you have stored.

I am not a crypto expert but intuitively it doesn't look like I would have weakened the security that way. You can't really attack bcrypt(md5(password)) much more than bcrypt(password). Can you?
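That migration can be sketched with the stdlib alone -- here using PBKDF2 as a stand-in for bcrypt (a real deployment would use a bcrypt/scrypt library; the function names and work factor are just for illustration):

```python
import hashlib
import hmac
import os

ITERATIONS = 200_000  # illustrative work factor, not a recommendation

def wrap_old_hash(md5_hex, salt=None):
    """Strengthen a stored MD5 hex digest in place -- no plaintext needed."""
    salt = salt or os.urandom(16)
    # Hex digest input: printable ASCII, so no NUL-byte gotchas.
    strong = hashlib.pbkdf2_hmac("sha256", md5_hex.encode(), salt, ITERATIONS)
    return salt, strong

def verify(password, salt, strong):
    """On login: recompute md5(password), check it against the wrapper."""
    md5_hex = hashlib.md5(password.encode()).hexdigest()
    candidate = hashlib.pbkdf2_hmac("sha256", md5_hex.encode(), salt, ITERATIONS)
    return hmac.compare_digest(candidate, strong)
```

The key property is that every row can be upgraded in one offline pass, since only the old hash is needed as input.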

The method I've used:

1) Add a column for the new stronghash, then update the old column to stronghash(<oldhash>), where <oldhash> is dumbhash(password).

2) On login, check against stronghash(dumbhash(password)). While you have the plaintext password in memory, generate plain stronghash(<password>) (simple and interoperable, not dependent on dumbhash), update the row to add the new hash, and drop that user's stronghash(<oldhash>).

3) After a <longtime> limit (to optimize both the maintenance overhead of the additional column/behavior and limit exposure to only the minority of users who haven't logged in for <longtime>), drop the stronghash(<oldhash>) from everyone and do a "we sent you a reset email" for anyone trying to log in who has no <stronghash> password hash.

This is fine workflow, but keep in mind

> and do a "we sent you a reset email" for anyone that's trying to log in but has no <stronghash> password hash.

Yahoo is an email provider so many of these users won't have an external provider to refer to.

This workflow is much better than the other proposals I've read up-thread.

It's one way to do it, which is okay sometimes.

The other way is to add a new empty column for bcrypt. The next time the user logs in, you save the bcrypt hash and you remove the MD5 hash.

Over time, the active users will be migrated to the new scheme. The only issue is the abandoned accounts, they'll keep the old weak scheme.

There are other migration techniques. If you know md5(password), you can create bcrypt(md5(password)).

That's what I do, though care should be taken that you can't then login against the old passwords by putting md5(password) in the password field.

Usually you do this by decorating the bcrypt(md5(p)) entries in some way so you can recognize which ones are tested with bcrypt() vs bcrypt(md5()).

I am not sure I agree. Your way will leave all the non active users exposed in the case of a leak. They may not be active on your website but are likely active on another website using the same password.

As I said, that's an option among others, it has drawbacks.

For a website like Yahoo with billions of abandoned accounts, that's a serious drawback ^^

The problem is collisions. md5(password) can yield the same result for many different values of password, so simply bcrypting that result means you start with a restricted possibility space. So: less secure. That punts the question to how much less secure. Seems to me it would still be worth doing, and then all new passwords going forward are done correctly.

Agree, but a collision even for md5 is a relatively rare event. When brute-forcing the bcrypt hash, this would reduce the attempts you would need to try against a given hash, but only by a very small factor. With a reasonable work factor, I would assume it would still make a brute force attack impractical at scale.

I didn't do the test, but I'd expect that there wouldn't be more than a handful of collisions for the md5 of the 100m most common passwords.

[edit] I actually just did the test on this 10m password list and found no collision
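For anyone who wants to repeat the experiment, the whole test is a few lines (Python sketch):

```python
import hashlib

def find_md5_collisions(words):
    """Group inputs by MD5 digest; return any digest shared by 2+ distinct inputs."""
    by_digest = {}
    for w in words:
        by_digest.setdefault(hashlib.md5(w.encode()).hexdigest(), set()).add(w)
    return {d: ws for d, ws in by_digest.items() if len(ws) > 1}

# On a 10M-password list this should come back empty: random MD5 collisions
# only become likely after about 2^64 distinct inputs.
```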


I've done it before on a 1 billion word / password list and didn't get any collisions.

That being said, md5 does generate collisions. I was playing with the IMDB movie database that you can download. They use a combination of the title and the year as a primary key. I tried using an md5 instead to save space (giving a reproducible ID instead of an identity column), and got many collisions. No collision with SHA256.

Wait, what? No MD5 collisions at all were publicly known until Xiaoyun Wang disclosed one in 2004 using a new cryptographic technique she invented (explained in Wang and Yu's "How to Break MD5 and Other Hash Functions").

MD5 has a 128-bit output so collisions that occur by chance should require about 2⁶⁴ inputs (18 exa-inputs). Surely your database didn't contain over 2⁶⁴ different movie records.

Could you take a look at what you were doing again? Your description doesn't really make sense mathematically.

You must be right. I can't reproduce it. I must have fucked something up then.

You likely goofed something up. No one has demonstrated two strings that are conceivably used as passwords that users type in -- and that includes the tuple {movie title:year} -- that have MD5 collisions.

The security problem with MD5 isn't collisions.

I think you are right, I can't reproduce it.

What you're describing is not possible given the database you tested. Are there more details that would clarify your post?

Oh, of course md5 has collisions. It's relatively easy (not computationally easy, but there are known methods) to find two random strings that hash to the same value, it's just very difficult to find a string that hashes to the value of a specific other string.

Not "relatively easy" by chance: it should require 2⁶⁴ entries in your database to see a single collision happen at random! It's only "relatively easy" following cryptographic research in the early 2000s that exploits structure in MD5 to produce collisions deliberately.

Yes, collisions are easier than preimages, but they still shouldn't occur by chance in real applications!

Realized my wording was way too ambiguous, clarified. Thanks!

Very nice. Thanks for that. So yes, this is likely the thing to do in this situation.

Unfortunately, this isn't an accurate description of the nature of the collision problem with MD5, which involves carefully crafted inputs using a sophisticated cryptographic attack -- not arbitrary user inputs that don't intend to collide with each other. See my and danielweber's comments about this down-thread.

(Yes, susceptibility to collisions was recognized as a problem with MD5 leading to a reason not to use it, but the collisions in question were constructed, not encountered accidentally. There isn't any evidence to date that the probability of a collision given two randomly chosen inputs is higher than the expected 1/2¹²⁸. You could test this yourself by hashing 2⁴⁰ random strings under MD5: you won't see a collision among the outputs!)

>Md5(password) can yield the same result for many different values of password //

Not "many different" using the normal constraints of text/numbers/typographical-marks and with maximum password lengths of 32 or so (I'll bet Yahoo's was shorter than that in 2013).

Are there any MD5 collisions in [:graph:]{,32} ?

I really doubt it. When people demonstrate MD5 collisions, they use a hex strings like

0e306561559aa787d00bc6f70bbdfe3404cf03659e70 4f8534c00ffb659c4c8740cc942feb2da115a3f4155c bb8607497386656d7d1f34a42059d78f5a8dd1ef

Yes, because MD5 digests are much shorter than 32 characters, even if it's just ASCII, so by the pigeonhole principle there must be. If you're asking if there are _known_ collisions between two messages with fewer than 32 printable ASCII characters -- the answer is likely yes, but none are known to me and likely none are publicly known at all yet.

I thought md5 hashes were 32 characters. But you're right, every md5 hash would be in that space, so there must be collisions.

bcrypt(md5(password)) is what Yahoo! did when they switched.

Especially about it being a bad idea to make people regularly change their passwords!

And nobody ever seemed to say "hey, maybe we should be using something more secure". Yahoo's been around for how many decades, and the fact they were still using MD5 in 2013 is just shameful. Yeah if it was some legacy code from 1993 you can probably excuse it, but I just can't believe after 20 years nobody thought it was a problem.

I'm not really a software developer but I really can't imagine it being a huge change. Instead of md5(pass) you could probably just change that to secure_hash(md5(pass), salt), add another column in the database for the salt, and rehash all the passwords. Customers wouldn't notice. Rehashing the databases would take a while, but otherwise that's really not a huge amount of work.

Well, you can only rehash if you have the plaintext password. So you have to wait until they login again, or force a password reset for everyone. In the former case you're stuck with a bunch of md5 passwords hanging around for any account that's not very active, and for the latter you'll lose some percentage of active accounts whose reset process is for some reason no longer functional. You could mix-and-match the two methods (start with the former, force the latter on any stragglers after, say, a few weeks) to minimize the damage, but that's more work and a number that someone somewhere in the organization finds very important is still probably gonna go down.

(I've never had to do this myself, so these are just the most obvious options I came up with. Possibly there are others.)

  You can only rehash if you have the plaintext password
There are techniques to rehash, even without the plain-text password, and without the user having to login to trigger a rehash.

Drupal 7 used such a technique for upgrades from Drupal 6, migrating from MD5 to a salted sha512 hash, but it's not an uncommon technique.

The old passwords are stored as MD5 hashes in the databases. The MD5 hash is processed through the same techniques as new passwords: a salt and the new sha512 hash. Provide a way to identify whether the origin was a password, or an MD5 hash.

Either way, you end up with a hash. You can identify whether the origin was a password, or an MD5 hash, but you can neither determine the origin MD5 hash, nor the origin password, as the new hash is secure. So even if the original MD5 hash was insecure, the new hash is secure.

When someone attempts to login, you still need to determine which password-validation to use: hash = sha512(salt + password), or hash = sha512(salt + MD5(password)), but the security level is the same.
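A rough sketch of that dispatch (un-iterated sha512 for brevity -- Drupal's actual scheme also stretches the hash; the record layout and names are illustrative):

```python
import hashlib
import hmac
import os

def upgrade_record(old_md5_hex):
    """Re-wrap a legacy MD5 hash without knowing the password."""
    salt = os.urandom(16)
    h = hashlib.sha512(salt + old_md5_hex.encode()).hexdigest()
    return {"scheme": "sha512-md5", "salt": salt, "hash": h}

def check_password(password, record):
    if record["scheme"] == "sha512-md5":   # account predates the migration
        inner = hashlib.md5(password.encode()).hexdigest().encode()
    else:                                   # "sha512": created after migration
        inner = password.encode()
    candidate = hashlib.sha512(record["salt"] + inner).hexdigest()
    return hmac.compare_digest(candidate, record["hash"])
```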

> hash = sha512(salt + MD5(password))

Passing the password through MD5 reduces the complexity to 128 bits, you can't get that back.

So the security level is not the same, though it may be resistant to some attacks on MD5.

And it's probably not important for most people, since there are less than 2^56 eight character ASCII passwords.
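The arithmetic checks out:

```python
import math

# 95 printable ASCII characters over 8 positions:
print(95 ** 8)             # 6634204312890625, about 6.6e15
print(math.log2(95 ** 8))  # about 52.6 bits -- comfortably under 2^56
```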

  > "Passing the password through MD5 reduces the complexity to 128 bits, you can't get that back."
Assuming the new hash is secure (and sha512 is generally agreed to be secure), then, given a specific sha512 hash, the original MD5 hash can only be determined by exhaustive search. Even though entropy is reduced, it's still significant work to determine the original MD5 hash (significant in this instance meaning longer than the lifetime of the Sun, given current extrapolations of computing performance).

Attacks against MD5 are based around knowing the original MD5 hash. In this instance, the original MD5 hash is unknown, so there is no mathematical shortcut to finding a collision.

In this case an attacker isn't looking for a collision (which would mean creating two passwords with the same hash, and what hash that is doesn't matter).

The attacker needs a password with a specific hash, and the best reported attack for that is around 2^128.

Agreed that the best reported brute-force preimage attack on MD5 is around 2^128 (i.e. essentially the complete range of possible MD5 hashes).

Personally, I'm willing to chance that my password will be discovered via a brute-force attack within the next 0.65 billion billion years [1]

[1] http://bitcoin.stackexchange.com/questions/2847/how-long-wou...

I think it does make sense to be cautious.

A new preimage attack could be discovered - or might already have been, secretly.

> Passing the password through MD5 reduces the complexity to 128 bits

No, this is not the problem with MD5. You are not going to find two user-memorizable-and-typeable passwords with an MD5 collision.

If you are bringing a password with more than 128 bits of complexity to the party, any password storage scheme better than plaintext will have your password safe.

For passwords, there is no known problem with MD5, unless you know about a preimage attack.

Collisions are a problem for digital signatures, not for passwords.

But some people do want and use passwords with more than 128 bits of entropy, for whatever reason, and an MD5 intermediate stage caps that.

I was doing all kinds of mental gymnastics trying to figure out how this would work; thanks for explaining it so clearly.

I have been in this situation, and you're correct.

Somewhere in the organization, a product team is going to throw a fit about usability and churn over the decision to reset user passwords en masse, or to force users to change them when they first log in. This isn't a slight against product managers, but one of the clearest indications of a company's overall security culture "health" is how the security, engineering and product teams choose to compromise and "pick their battles." Risk accepting vulnerabilities has a legitimate place when you have to balance product development and usability, but so does pushing back on egregious issues.

I don't have privileged insight into Yahoo's organization, but in this case it's pretty clear the security team should have either been more diligent in conveying the ramifications or less kneecapped by the surrounding org units, depending on the circumstance. More importantly, Yahoo should have "migrated" their passwords in the manner a parallel comment explains in this thread. This is what Facebook and other companies did after maturing their security programs (see "Facebook Onion" on how Facebook transitioned away from MD5).

Also good to note - there is evidence Yahoo's security culture improved over the years. The decision to go with MD5 almost certainly happened in the 90s, and when Tumblr suffered a breach all users were forced to reset their passwords. The capability and awareness was clearly there.

x0's algorithm was secure_hash(md5(pass), salt), you already have md5(pass) so this can be done in one bulk update.

Does an insecure algorithm mean that you effectively have the plain text passwords?

Not necessarily, because of collisions.

The password "foo" may hash to "12345". If an attacker were to discover that the hash is "12345", they would look for a password that hashes to "12345", which could, hypothetically, be the password "bar". They don't know the original password "foo"; they've simply discovered an alternative that matches the algorithm enough to unlock access.

In general, rainbow tables are used for identifying and attacking common passwords, but that doesn't mean that the algorithm is insecure.

Insecure algorithms can be attacked through collisions, which don't necessarily give you the original password, they just provide an alternative password which is accepted by the algorithm. The distinction matters when it comes to password reuse, because if Site A uses MD5, but Site B uses sha512, finding a collision that grants access on Site A doesn't necessarily give you a password that will grant access on Site B.

Having worked with monolithic legacy codebases like the one they likely have: code that has passed through hundreds of developers who no longer work for the company and left behind a bunch of spaghetti, it's a huge effort to make sure that none of their other services break when they implement such changes. Also, management HATES when dev teams do this, because it isn't "new stuff" that's immediately visible to their bosses or the end user.

If anything goes wrong with the password update, users get angry, lose faith in the service, people stress, a few get fired maybe, etc etc. On the other hand, if it stays old and crappy, everything stays just peachy, and nobody is the wiser that the entire system is a house of cards. Until the day someone hacks the database, of course... which happened, so it's "now" a problem.

They're not going to begin to take security seriously even after this incident. They'll do what they need to right now but there's no auditing and their users don't normally care about this sort of thing, therefore the management won't care either.

There are likely to be a lot of identity systems using the password database, all of which have been coded to look for an MD5 hash, not a salted hash. This means code in a number of applications has to be updated at the same time.

The typical way around this is to create your new destination column (e.g. sha256 with salt), and progressively have applications reference this column rather than the MD5 unsalted column.

It's a huge amount of work, and if the applications were made in 1990's, the code is likely legacy. If Yahoo are doing regular code security reviews, this will likely have been put in the pile of "we need to fix, but it's too costly to do".

> It's a huge amount of work, and if the applications were made in 1990's, the code is likely legacy.

Which begs the question, can legacy code survive in an international network?

That's the right question to ask. The answer is no, because new security vulnerabilities are disclosed every hour.

A large organisation will implement layered security (otherwise known as layers of the onion) to prevent this type of attack. This means; more secure passwords to access the password database, fewer people with access, rotation of access passwords, auditing of backup storage and encryption, etc etc. Clearly Yahoo's layers of security were all broken to allow this type of theft.

>It's a huge amount of work //

Really? Moving from doing md5(password) to bcrypt(password,salt)? I see organisations make things hard and legacy code-base, yadda, yadda but surely if Yahoo couldn't do this then they couldn't manage scratching their own butt; it really seems like quite a small change in the scheme of things. Like one senior engineer, one afternoon of work (then testing, etc., OK, sure) ... ?

"It Takes 6 Days to Change 1 Line of Code" https://news.ycombinator.com/item?id=13119138

I'm going to go out on a limb and guess you've never worked as a software engineer in a large organisation.

Given MD5 hashes are currently stored, how do you propose users' passwords get converted to SHA-256/512? Should Yahoo brute force the passwords, and then store them in the new algorithm? Or should they wait for the user to log on, verify their password, and store it with the new hash algorithm (given some users rarely log on, this could take over 12 months to cover 80% of users)?

Yes it could take months or years to complete the process, but they've had at least a decade.

Even if it never completes (abandoned accounts), it would still have saved most active accounts from being breached.

100% agree. Yahoo should have started the process a long time ago.

I was just replying to the comment it could be completed in an afternoon.

You're right on the first count. It wasn't sarcasm, it was a question.

On the storing of hashes though the standard protocol has been to pass the hash in as if it were a password.

Hashing the hash isn't a good idea, you're reducing the domain of your secure_hash function to the range of md5. The way to do it is to have a "password hash algo version" column and when the user puts in their password, you verify against the hash[algo](password) and rehash with the later version, changing the algo column for that user.

You could do both though. Give much more security in the short term and upgrade anyone else who logged in later.

I did ask about the hash of hash thing some time ago and ptacek claimed that's a reasonable thing to do.

> you're reducing the domain of your secure_hash function to the range of md5.

Oh no, only 128 bits. The NSA will be able to brute force one of those passwords in 80 years.

You need to do both. If you only do the latter, then stale accounts which never log in again will never have their passwords upgraded to the more secure hash. Hashing the hash allows you to replace the md5 hashes immediately, and then you can perform the upgrade if/when the user logs in again.
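A sketch of the "do both" approach in Python (hypothetical schema; the `algo` field records which scheme a row currently uses):

```python
import hashlib
import os

def _salted_sha512(data: bytes, salt: bytes) -> str:
    return hashlib.sha512(salt + data).hexdigest()

def bulk_wrap(md5_hex: str) -> dict:
    """Step 1 (immediate, offline): wrap every legacy MD5 hash right away."""
    salt = os.urandom(16)
    return {"algo": "sha512(md5)", "salt": salt,
            "hash": _salted_sha512(md5_hex.encode(), salt)}

def login(password: str, record: dict) -> bool:
    """Step 2 (on login): verify, then upgrade legacy rows in place."""
    if record["algo"] == "sha512(md5)":
        inner = hashlib.md5(password.encode()).hexdigest().encode()
        ok = _salted_sha512(inner, record["salt"]) == record["hash"]
        if ok:
            # we have the clear text now: rehash it directly, dropping MD5
            salt = os.urandom(16)
            record.update(algo="sha512", salt=salt,
                          hash=_salted_sha512(password.encode(), salt))
        return ok
    return _salted_sha512(password.encode(), record["salt"]) == record["hash"]
```

Stale accounts stay protected by the wrapped hash forever; active accounts silently graduate to the direct hash on their next login.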

>I'm not really a software developer but...

If I had a nickel for every time I've heard this statement then I'd have enough to comfortably retire.

Yes, in theory, changing a column in a database (which in this case, happens to be a password) seems simple, but in practice, it's not.

You're assuming engineering is just sitting on their thumbs, reviewing their code once a week, thinking of ways to optimize it.

In reality, they're constantly under pressure to develop new features, fix reported bugs, move on to the next project, keep the site from falling over, etc etc.

And the ones who choose NOT to work hard aren't sitting around reviewing old code either.

For an IdP at the scale of Yahoo, they can adopt something as complicated as versioned passwords, migrating credentials to the latest secure algorithm upon successful login. You have the clear-text password at that point, and you can store metadata such as the version (or algorithm) used to hash the credential.


It's easy as hell. Even PHP, so often flamed for "bad security", these days supports EASY functions for this (and polyfills are available if you're running PHP < 5.5, which you shouldn't do anyway):

- password_hash, which creates a salted hash (the returned value consists of a type/strength spec, the hash, and the salt)

- password_verify, which verifies a password with a hash in a timing-safe manner

- password_needs_rehash, which tells you if you should update the hash in the database

password_hash and password_needs_rehash take a parameter for the hash function (currently only bcrypt is supported, quite likely to keep people from using md5/sha1), and for the cost (the work factor that controls how expensive each hash is to compute).

I believe any reasonable programming language these days has such functions.

What I am NOT so sure about is how the various LDAP server implementations, which many people use for SSO and "normal" account management (because it's easier to connect new software to LDAP than to migrate existing user DBs into LDAP), handle password storage. Having a separate LDAP server for the credentials prevents most forms of password leakage, but what happens if someone breaches both servers, or the LDAP daemon is running on the same host as the webserver?

Nothing is "easy as hell" at scale.

Normally you'd =not= store the salt separately; the usual way is keeping the salt and the password together in the same 'blob'

Rehashing can be safely implemented as long as the auth. process can handle both md5 and some composite hash [i.e. shash(md5(pwd))]

It's really a trivial operation.
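Both points can be sketched together: the salt lives in one self-describing blob (a made-up `$`-separated format here, loosely mimicking how crypt/bcrypt strings work), and verification branches on the scheme, so plain and composite hashes coexist:

```python
import hashlib
import os

def store(password: str) -> str:
    """Scheme, salt and digest kept together in one blob."""
    salt = os.urandom(16).hex()
    digest = hashlib.sha512((salt + password).encode()).hexdigest()
    return f"sha512${salt}${digest}"

def wrap_legacy(md5_hex: str) -> str:
    """Composite scheme for bulk-migrated rows: sha512(salt + md5(pwd))."""
    salt = os.urandom(16).hex()
    digest = hashlib.sha512((salt + md5_hex).encode()).hexdigest()
    return f"sha512-md5${salt}${digest}"

def check(password: str, blob: str) -> bool:
    scheme, salt, digest = blob.split("$")
    if scheme == "sha512-md5":
        # legacy row: feed the MD5 of the password into the outer hash
        password = hashlib.md5(password.encode()).hexdigest()
    return hashlib.sha512((salt + password).encode()).hexdigest() == digest
```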

I doubt that decision was made in the last decade. It's surely just something that's been around for a long time and was never upgraded.

Still neglectful, but I sincerely doubt it was just a recent engineer's bad decision-making.

It gets/got made ~10-15 years ago. (I don't understand the "no salt" thing, though. Salting was common practice even ~20 years ago on Linux machines, so I'm mildly surprised it wasn't implemented in this case.)

> I'm genuinely curious how the decision to use MD5 gets made.

You assume a formal decision was made? I think a manager just went "make them secure" and history was made. That's how it usually seems to happen if it's not a user-facing thing.

I think the organization as a whole is just indifferent. Does this breach really matter to Yahoo's bottom line? They were already sold to Verizon. Most of the active users probably won't read this news. It's sad to say, but I think Yahoo as a whole just doesn't care about their users.


No, sorry. They're borderline criminally negligent. When you have 1bn passwords stored in raw md5, a decade after the first rainbow tables were published, then you don't deserve anyone's business or your freedom.

Sure, it's borderline negligent.

But it's already a godsend compared to what many banks do, storing passwords in plaintext, sending reset passwords via plaintext email, requiring 4-8 character passwords that can only contain digits and a limited set of characters, etc.

I'd be more than happy if any bank would follow Yahoo!'s password standards.

Most banks don't have a billion customers. (There are probably a few that do, but not many.)

It's really not. Unsalted MD5 has been shameful for a long, long time.

As a data point: when I was a teenage code monkey in 2004 writing PHP I already understood that unsalted MD5 is unsafe.

According to Wikipedia:

* 2004 it became possible to find MD5 collisions at a rate of one per hour on a cluster

* 2005 it became possible to do this within "a few hours" on a consumer laptop

* 2006 it became possible to do this within one minute

* nowadays it's possible to do this "within seconds"

Plus, as others have mentioned, it's now possible to find collisions instantly by using widely available rainbow tables, e.g. https://md5db.net/decrypt

MD5 collisions are probably not important for passwords.

To put it in layman terms.

The MD5 collision attack usually done by researchers: they want to generate 2 files with the same MD5 hash (they can put anything they want in these files).

This kind of attack doesn't affect passwords. The user picked one file (i.e. the password), you don't know it, you can't change it, you can't choose it.

Care to explain? The hashes are what is compared so it seems it's important.

The existence of crafted collisions -- being able to create a pair of M1 and M2 such that MD5(M1) = MD5(M2) -- is primarily relevant to situations where MD5 is being used as a signature algorithm, such as in certificate issuance. In these applications, being able to generate a pair of documents with the same hash is catastrophic.

Being able to generate a pair of passwords that are treated as equal, on the other hand, is useless from a security perspective. It's a neat party trick, but it's not dangerous.

Now, if there were a preimage attack -- being able to take MD5(M1) and come up with a M2 such that MD5(M2) = MD5(M1) -- that'd be a much bigger deal, and it'd break MD5 password hashing wide open. But nobody's done that yet.

I'm a total greenhorn when it comes to cryptography, but the difference between these two situations was totally lost on me until I read this comment. When I see, "It's easy to create MD5 collisions," my first thought is, "If you give me a hash, it's easy to find a string that results in an identical hash." If I'm understanding this right, that would be a "preimage attack," and would be bad for all the reasons being discussed in this thread.

However, it seems like "It's easy to create MD5 collisions," at least as it is true today, actually means something different: That, given a string, it's easy to find a second string that shares the same hash. If that's the case, I have two questions:

* I am totally lost as to how these are different scenarios. There's no difference I can see between "Here's string A" and "here's the hash of string A," if the goal is to find a "string B" that shares the hash. Are these "crafted collisions" generated by modifying string A and string B, until a collision pops out?

* If that's the case... what's everyone freaking out about? Why were people saying MD5 is unsafe 20 years ago, if even now, we can't achieve a preimage attack that can get you into an account based on the valid password's hash? Yahoo could have printed these hashes out and hung them up on posters in the mall and no one would have been able to get into accounts from it. There are dozens of comments lamenting how stupid this was, but... it seems like there's no actual problem?

> However, it seems like "It's easy to create MD5 collisions," at least as it is true today, actually means something different: That, given a string, it's easy to find a second string that shares the same hash.

Very early MD5 collision attacks were even weaker, actually: given nothing, it was possible to find a pair of arbitrary garbage strings which had the same hash as each other. It wasn't until later that it became possible to pick what the strings would "look like".

> Are these "crafted collisions" generated by modifying string A and string B, until a collision pops out?

Generally speaking, yes.

> If that's the case... what's everyone freaking out about?

The issue with using MD5 as a password hash function actually has nothing to do with collisions. That's a red herring. :) The real problem is that using any fast and/or unsalted hash function for passwords is unsafe!

A fast hash function is unsafe because it makes it easy to generate a bunch of potential passwords, calculate their hashes, and look for a match.

An unsalted hash function is unsafe because it makes it possible to build a "rainbow table" of all possible passwords and their hashes, and look up password hashes in that table.

As used in this situation, MD5 is both fast and unsalted.
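A toy demonstration of both problems (a four-word list standing in for a real dictionary or rainbow table holding billions of entries):

```python
import hashlib

# Precomputed lookup table: hash -> password. Building it is cheap
# precisely because MD5 is fast.
wordlist = ["password", "123456", "letmein", "passwordpassword"]
table = {hashlib.md5(w.encode()).hexdigest(): w for w in wordlist}

# A "stolen" unsalted MD5 hash cracks with a single dict lookup
stolen = hashlib.md5(b"letmein").hexdigest()
print(table.get(stolen))  # letmein

# A per-user salt makes the precomputed table useless
salted = hashlib.md5(b"somesalt" + b"letmein").hexdigest()
print(table.get(salted))  # None
```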

Most people here don't seem to understand the difference between collision and preimage attacks. So they're overreacting to the fact Yahoo used MD5.

Storing unsalted passwords, however, would be a huge mistake, if Yahoo did so as someone here claimed.

There are precomputed lookup tables for the unsalted hashes of many, many passwords (both MD5 and more secure hashes) and cracking unsalted passwords is simply a database lookup.

Ah ha! There's the weakness I was missing, thank you so much for responding. I hadn't even thought of it that way---I knew salts shook up the resulting hashes, but an actual benefit of it is that it makes it pretty much impossible to do any "homework" (rainbow tables) ahead of time.

Google(MD5(M1)) = MD5(M2) is more than enough for most users.

That website does not find collisions. It uses rainbow tables (or some other type of table) to crack passwords that it already knows.

Collisions are irrelevant for password cracking.

> Sure, SHA1, scrypt or bcrypt with salt were already common back then, but it's an entirely different story than if they had used it today.

Not an excuse, this is Yahoo, not a PHP shop in India doing some low budget contracting. They should have a top of the line security team enforcing the most recent secure practices. Furthermore, I got no email from Yahoo telling me that my account may have been hacked. Both incompetent and irresponsible at the same time.

By the way I did some PHP dev back in 2011. bcrypt hashing was already common practice. How can you come up with that argument in good faith ?

> Furthermore I got no email from Yahoo telling me that my account may have been hacked

Then your account was most likely not on the list of accounts compromised.

> By the way I did some PHP dev back in 2011

Well Yahoo is a tad bit older than that, by about 17 years. This is not an excuse, but really comparing your 2011 coding to 1994.... Go ahead and boot up your old 486. I'll get back to you when this page loads up in an hour. :)

Yahoo's code base is old and huge, like billions of lines huge. Yahoo's engineers have modernized it at a massively rapid pace. I'm not sure of the current state, but when I left, Yahoo Finance was written in something like 10 languages, including serving pages in C, 'cause that's all they had back then.

Current tech is NodeJSish and others. They have their own hardened versions. But still, migrating millions of lines of C to something other than C isn't a walk in the park.

> How can you come up with that argument in good faith ?

Let's say I've seen far worse in 2016, from companies storing far more sensitive data.

Like a bank, with no 2FA support, emailing me my plaintext password after clicking "Password forgotten", in 2016.

This story is problematic, but I'd be grateful if that bank would implement even the same stuff as Yahoo.

Also malicious (allowing NSA to search through everyone's emails).

Current law seems to dictate that if the NSA wants that, it's what they're getting. Blame the government.

They actually fought in court about it, so I commend them for it
