Security Update (stackoverflow.blog)
292 points by taspeotis 32 days ago | 200 comments

Serious question: what sensitive user data is there on Stack Overflow anyway? Questions, answers and comments are all public, the content is Creative Commons licensed and even available in handy downloadable & queryable form: https://data.stackexchange.com/

As far as I can see, the primary sensitive user data they have is e-mail addresses, but (unlike, say, Reddit) most StackExchange forums don't deal with personally embarrassing material, and many, if not most, StackExchange users post with handles easily associated with their real names.

If Stack Overflow's enterprise knowledge management systems were breached then that could potentially be a big deal. A release of knowledge on development and production systems from companies could potentially lead to larger hacks in the future.

Of course, that is all speculative until the extent of the breach is released. From the press release, they seem to imply that the only affected areas were public facing.

Stack Overflow Enterprise is either self hosted or hosted by SO, but on an individual Azure server instance, so each instance would probably need to be compromised individually. SO will probably release an update for Enterprise customers that scans server logs to check if that Enterprise customer was affected.

They have job listings, information about listings I have applied to, and a copy of my resume with contact information.

Job listings are public information, your resume may also be considered public info if you have sent it out to recruiters or public job posting sites.

Listings you've applied to could be potentially private. But if you use an alias (and only reveal your real name after the company agrees to interview you), then it's not a problem either!

> But if you use an alias (and only reveal your real name after the company agrees to interview you), then it's not a problem either!

Do you do that? Do you know anyone that does? If so, how do people react to it?

i use an alias on stackoverflow, and i use an alias here on HN!

My google profile is also an alias. My facebook profile is also an alias. My twitter, same. And if i had a linked-in profile, it would be an alias too. Some of those aliases may be the same, but some i deliberately make different, so that i can choose whether others can associate different aliases together as the same person.

Why anyone puts their real name online is beyond me. When i tell my real life friends, they are shocked, and also ask whether it's inconvenient. Yes, i would answer - it is inconvenient, but it is more control. It gives me the option to prevent others from being able to track all my activities online across different services (at least, not easily).

You replied about online usernames, but not about the question asked - do you apply to jobs using an alias / how do companies react when you change your name in the process?

It took me roughly 20 seconds to find your real name out by clever googling and your HN comment history. You can probably do the same to me.

An alias is not privacy :]

What do you mean by alias? Do you apply to jobs as “chii” or as an assumed name, like “John Smith”, when really your name is “Bob Jones”?

Do you use different email addresses too?

> your resume may also be considered public info if you have sent it out to recruiters or public job posting sites.

No, absolutely not.

And passwords. In particular, probably some people reuse passwords between Stack Overflow and GitHub, and keep other credentials in files in private repos.

I would like to hope Stack Overflow of all companies doesn't store passwords in plaintext, but you never know.

They don't have to store them as plain text for it to be a problem.

If they're not salted it's trivial to crack the hashes, and if they are all uniquely salted, while it's time consuming, they can still gradually crack them.

Given that you could probably sift through the users to find particularly juicy targets (usernames of maintainers of top open source projects with github repos for example?) that could justify the work of a time consuming attack on the hashes.

Of all the sites and people, I expect Jeff Atwood and Joel Spolsky to have followed best practices in storing user passwords.

Jeff wrote this in 2007:

> Do not invent your own "clever" password storage scheme

> Never store passwords as plaintext.

> Add a long, unique random salt to each password you store.

> Use a cryptographically secure hash. I think Thomas hates MD5 so very much it makes him seem a little crazier than he actually is. But he's right. MD5 is vulnerable. Why pick anything remotely vulnerable, when you don't have to? SHA-2 or Bcrypt would be a better choice.


Another topic on passwords: https://blog.codinghorror.com/the-dirty-truth-about-web-pass...

> SHA-2 or Bcrypt would be a better choice.

As the sibling comment points out, SHA-2 is worthless for storing passwords. A GPU can crank through an obscene number of SHA-2 hashes per second. Bcrypt is intentionally much slower and harder to use a GPU to brute force. The two algorithms are almost totally unrelated and it's concerning they were mentioned together.

To be fair, Jeff's post was written in 2007 ... this predates Khanna's PS3 clusters. CUDA was released the same year. So I'm not sure it would be obvious in 2007 that SHA-2 would be "worthless for storing passwords".

PBKDF2 seems to have been specified in 2000 as part of RFC 2898 [1] and, IIRC, existed before then. Bcrypt is from 1999 [2]. Both of those algorithms were explicitly designed to be slow. My understanding is that PBKDF2 wasn't explicitly designed for password storage, but that it was recognized as being useful for that early on. Bcrypt was explicitly designed for password storage. The article referenced came out 7 years later. IMO: there was evidence that salted SHA-2 was not secure.

[1] - https://tools.ietf.org/html/rfc2898#section-5.2 [2] - https://en.wikipedia.org/wiki/Bcrypt

> SHA-2 is worthless for storing passwords

It really depends on what the person means. Yes, single sha-2 hash is pretty useless. But that's often not what's really happening. For example libcrypt is used in many cases with the default $6$ format which uses thousands of rounds of sha512. That's still "password hashed with sha-2".

If that were the case, his advice ought to have been something like "use libcrypt with SHA-2" or something like that. Just "use SHA-2" was kinda useless in 2007 and hasn't gotten more useful since then.

Sure. But that advice was written 12 years ago. I'd be surprised if they didn't update their password storage since then.

How do you update password storage if you don't store the passwords, only hashes?

You could upgrade it for new users, but for old ones? (e.g. I don't change passwords often)

When a user whose password is hashed the old way logs in, after checking the password they just supplied against your stored old hash but before forgetting the plaintext, you can compute the new hash & update your records.

Of course, that only works for active users - it won't upgrade anyone that never logs in. Depending on just how weak the old hash is, you may want to eventually cut off any lingering un-upgraded accounts: just forget their old hash, requiring them to go through your password reset process should they ever come back. If you've left it long enough, those accounts will probably never be used again anyway, so that should be NBD.
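A minimal sketch of that upgrade-on-login flow, in case it helps. Everything here is an assumption for illustration: the in-memory `users` dict, the record layout, unsalted MD5 as the "old way", and PBKDF2 from Python's standard library (with an arbitrary iteration count) as the "new way". It is not a description of what any particular site does.

```python
import hashlib
import hmac
import os

PBKDF2_ITERATIONS = 200_000  # illustrative value; tune for your own hardware

# Hypothetical in-memory user store: username -> record dict.
users = {}

def legacy_hash(password: str) -> str:
    """Old scheme (assumed for this sketch): unsalted MD5 hex digest."""
    return hashlib.md5(password.encode()).hexdigest()

def strong_hash(password: str, salt: bytes) -> bytes:
    """New scheme: salted, deliberately slow PBKDF2-HMAC-SHA256."""
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, PBKDF2_ITERATIONS)

def login(username: str, password: str) -> bool:
    record = users.get(username)
    if record is None:
        return False
    if record["scheme"] == "md5":
        # Check the supplied password against the stored old hash first.
        if not hmac.compare_digest(record["hash"], legacy_hash(password)):
            return False
        # Verified, and we still hold the plaintext: store the new hash
        # over the old one before forgetting the password.
        salt = os.urandom(16)
        users[username] = {
            "scheme": "pbkdf2",
            "salt": salt,
            "hash": strong_hash(password, salt),
        }
        return True
    return hmac.compare_digest(record["hash"], strong_hash(password, record["salt"]))
```

After the first successful login the MD5 hash is gone from the record; accounts that never log in keep the old hash, which is why the cut-off described above still matters.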

I updated a DB from MD5 to BCrypt at my last job. Essentially on the next login I updated the password (since the password isn't hashed on the client end). So all active users got switched over. After a relatively short period I reset the remaining passwords and forced users to send password reset emails to login.

Onion hash: hash the existing hash with the new algorithm.

That throws away entropy, but I suppose so long as very few hash collisions of a password are likely to be real passwords themselves it would be harmless.
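For what it's worth, the onion approach might look roughly like this. It is only a sketch: MD5 as the legacy hash, PBKDF2 as the outer layer and the iteration count are all assumptions chosen for the example.

```python
import hashlib
import hmac
import os

PBKDF2_ITERATIONS = 200_000  # illustrative value

def legacy_md5(password: str) -> bytes:
    """The hash the database already holds (assumed: unsalted MD5)."""
    return hashlib.md5(password.encode()).digest()

def wrap_legacy_hash(md5_digest: bytes, salt: bytes) -> bytes:
    """Apply the slow hash to the stored digest; no plaintext is needed."""
    return hashlib.pbkdf2_hmac("sha256", md5_digest, salt, PBKDF2_ITERATIONS)

def verify(password: str, salt: bytes, stored: bytes) -> bool:
    """At login, redo both layers: the slow hash over MD5(password)."""
    candidate = wrap_legacy_hash(legacy_md5(password), salt)
    return hmac.compare_digest(candidate, stored)

# Offline migration of one existing row, without the user logging in.
old_row = legacy_md5("correct horse")
salt = os.urandom(16)
new_row = wrap_legacy_hash(old_row, salt)
```

The advantage over rehash-on-login is that every row can be upgraded in one batch job; the cost is that the inner MD5 step stays in the login path for good.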

You could rehash the password the next time the user logs in.

You ask people to change their passwords, and when they do so, you store them with the new scheme. You can do that by e.g. requiring everyone to change their password on their next login/visit.

You don't even need to ask people to change their passwords; when they login, you have a copy of the original password, so you can just store the new hash over the old.

If they aren't salted, just generate your own rainbow table!

He wrote it in 2007.

He said to use Bcrypt or SHA-2. One of those algorithms was explicitly designed for password storage, the other was not. They aren't really the same class of algorithm. As such, it doesn't really make sense to suggest one as an alternative to the other. Nobody ever suggests using Bcrypt as the hash in a hashtable - even if it technically could be made to work. GPUs that could crank through trillions of hashes per second might not have existed, but, that doesn't make it good advice. Good advice would have been "use bcrypt or, if you can't, PBKDF2 (or maybe some other library that implements a variant of one of those)"

> MD5 is vulnerable. Why pick anything remotely vulnerable, when you don't have to? SHA-2

For the purpose of hashing passwords, there's no significant difference between MD5 and SHA-2. Both are awful choices.

Edit: Actually, SHA-2 may be worse because CPUs have hardware acceleration and thus the attacker may be able to crack it somewhat faster than MD5.

It’s definitely not worse—MD5 is faster with GPUs—but plain SHA-2 isn’t much better, either.

Also "Some level of production access" could easily include the ability for the attacker to intercept and exfiltrate passwords as users log in.

I haven't checked in a while, what's the cost of cracking a single salted password that is stored properly, e.g. with pbkdf2 or bcrypt or whatever is currently state of the art?

"Cracking" isn't a meaningful operation for these algorithms. You would need to try guessing, if you don't know anything about the password then you need to guess all possible passwords until you get the right one or a collision (the pigeon hole principle applies)

If my password is "password" you can probably guess that almost instantly no matter what scheme is used. If my password is 32 characters of random base64 output then it wouldn't matter if the scheme is MD5($password) you won't guess it.

Schemes like PBKDF2 and bcrypt trade something useful in the middle, if your password is mediocre they make it too hard to bother guessing. But you'd need to define how mediocre it is to get a meaningful estimate.

The cost? That depends on the password. How long does it take to crack 123456? That depends on the parameters.

In a good case, it takes 50ms per hash on a modern system, so that's 20 attempts per second (instead of 20 billion per second for a plain md5 or so). I don't really expect a popular site to do much more than that, so that's the best case. From experience, most sites will be more around the 20 billion mark than the 20 mark, but I expect stackoverflow to be on the good end.
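To put rough numbers on that, here is a back-of-the-envelope sketch. The 8-lowercase-letter password space is my own assumption for illustration; the two guess rates are the ones quoted above.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600

def seconds_to_exhaust(candidates: int, guesses_per_second: float) -> float:
    """Worst-case time to try every candidate password once."""
    return candidates / guesses_per_second

# Assumed password space: exactly 8 lowercase ASCII letters.
space = 26 ** 8

slow = seconds_to_exhaust(space, 20)     # 50 ms per hash
fast = seconds_to_exhaust(space, 20e9)   # plain MD5 at 20 billion per second

print(f"slow hash: about {slow / SECONDS_PER_YEAR:.0f} years")
print(f"plain md5: about {fast:.1f} seconds")
```

With these inputs the slow hash comes out at roughly 330 years and plain MD5 at roughly 10 seconds for the same password space, which is the whole argument for slowness in one line.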

The current state of the art is Argon2, with scrypt second and bcrypt/pbkdf2 tied for third. The first two have memory hardness, and the first one is the standard chosen in the password hashing algorithm competition. The third is still acceptable because most developers still go for a salted single hash like sha256. Somehow they got the salting method, but I'd rather they break all identical passwords at the same time (they weren't that strong anyway if they're shared between more than one person) than that they crack only a tiny percentage because of the cost. Slowness helps more than salting, yet as a pentester I see more of the latter than the former. So I'd rather recommend something available for their platform as a good third choice than them going "meh, effort" and not implementing the recommendation.

> The current state of the art is Argon2

Details for the interested implementer: there's a lot of bad software floating around out there, be careful and do your due diligence.

Use the argon2id function. If the language binding does not expose the argon2id function, but only argon2i and argon2d, then it's outdated, avoid. If the library has not been updated past 2016 (argon2 v1.3), it's vulnerable, avoid. (Some language bindings ship with an embedded library.)

Language bindings to the argon2 library do not document how to pick good parameters because language binding authors do not understand nor care about security, the suggestions in the synopses are laughably undervalued. Compare with the expert recommendations in https://password-hashing.net/argon2-specs.pdf chap. 6.4, 8, 9 and https://tools.ietf.org/html/draft-irtf-cfrg-argon2#section-4 .

Algorithm for picking the correct values on the target server hardware:

    const PASSPHRASE := 6 random words from dictionary
    const SALT := 16 bytes from urandom
    const DURATION := 0.5   ### or greater; this is the maximum amount of
                            ### time you are willing to make your user wait
    mut T_COST := 1
    mut M_FACTOR := 4096    ### the 'M' suffix is appended at use
    const PARALLELISM := `nproc`
    const TAG_SIZE := 16    ### bytes, or 128 bits

    while true {
        const TIMER := benchtime argon2id(
            PASSPHRASE, SALT, T_COST, concat(M_FACTOR, 'M'),
            PARALLELISM, TAG_SIZE
        )
        if TIMER > DURATION {
            if 1 === T_COST {
                reduce M_FACTOR     ### e.g. divide by a constant
                jump to top of while
            } else {
                jump out of while
            }
        }
        print T_COST, concat(M_FACTOR, 'M'), TIMER
        T_COST := T_COST + 1
    }
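The same loop as a Python sketch, for anyone who wants something runnable. To keep it dependency-free the timing step is passed in as a callable: `measure(t_cost, memory_kib)` is a hypothetical hook that should run one Argon2id hash with those parameters and return the elapsed seconds. With the argon2-cffi package that would presumably wrap `argon2.low_level.hash_secret_raw`, but check that against the version you have installed. The halving step, the lower memory bound and the `max_t_cost` cap are my assumptions, not part of the pseudocode.

```python
from typing import Callable, Optional, Tuple

def calibrate(
    measure: Callable[[int, int], float],
    budget_seconds: float = 0.5,
    start_memory_kib: int = 4096 * 1024,  # 4096M, as in the pseudocode
    min_memory_kib: int = 8 * 1024,       # assumed floor so the loop terminates
    max_t_cost: int = 64,                 # assumed cap so the loop terminates
) -> Optional[Tuple[int, int]]:
    """Return the last (t_cost, memory_kib) that fit the budget, or None."""
    t_cost = 1
    memory_kib = start_memory_kib
    best = None
    while memory_kib >= min_memory_kib and t_cost <= max_t_cost:
        elapsed = measure(t_cost, memory_kib)
        if elapsed > budget_seconds:
            if t_cost == 1:
                # Even a single pass is too slow: reduce memory and retry.
                memory_kib //= 2
                continue
            # The previous t_cost was the last one inside the budget.
            break
        best = (t_cost, memory_kib)
        t_cost += 1
    return best

def fake_measure(seconds_per_gib_pass: float) -> Callable[[int, int], float]:
    """Stand-in for a real benchmark: time grows with passes times memory."""
    return lambda t_cost, memory_kib: seconds_per_gib_pass * t_cost * memory_kib / 2**20
```

Run it on the production hardware, not a laptop, since the result only means something for the machine that will actually verify logins.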

I work on an enterprise infosec tool that just demonstrated 48 trillion MD5s per second using AWS GPUs.

Hashed passwords are cracked so easily it is a minor obstacle at this point. It is a question of when not if a hash table is fully cracked.

Well, nobody should be using MD5 (nor should they have been using it 20 years ago with the introduction of bcrypt). In fact, nobody should be using any hash function that was designed for speed (such as the SHA family) because you don't want fast hashing of passwords.

Modern cryptographic hash functions that are tailored for password hashing (such as scrypt or Argon2) are much harder to brute-force and have tunable knobs to allow you to increase the memory or CPU hardness. Obviously you cannot be safe forever but if you have a database dump of Argon2id-hashed passphrases with very strong parameters you aren't going to break it any time soon.

There are plenty of publicly leaked hash tables running MD5 and the like. Just because modern hash functions exist does not mean they are in use. [1]

Also you do not need the hash table of a hardened system to get useful passwords. You need a reused password from a weak one.

[1] https://hashes.org/leaks.php

>There are plenty of publicly leaked hash tables running MD5 and the like.

Not related to stackoverflow though.

>You need a reused password from a weak one.

If the weak password is already public, what is gained by finding out that it's a weak password in a strong DB? You've just described a dictionary attack.

My mention of MD5 was just a benchmark reference.

I can see from the downvotes that the very idea that it is in regular use is triggering for some folks, but md5 and other weak hashing algos are not just in obscure anime forums but in systems everywhere.

“They don’t like to think it be like it is, but it do.”

And it isn’t about md5 hash rate; it’s about the ease of cracking in general due to the low cost of compute.

If the SO password hash has leaked, even in bcrypt, it's going to be attacked and many strong passwords will be broken. If they are reused elsewhere, important email addresses will be attempted elsewhere.

Don’t reuse passwords.

> Don’t reuse passwords.

Nobody here disagrees with this premise. I just disagree that "low cost of compute" changes the fact that functions like Argon2 can be tuned to become more expensive to crack based on changes in computation cost. If you're worried about someone spinning up something on AWS to crack hashes, bump up the memory and CPU hardness and now they'll have to spend much more money to crack your passwords. In addition, the design of most modern password hashing functions is such that you get poor parallelism on GPUs.

Isn't the whole point of modern password hashes that they don't scale with GPU compute in the same way as MD5?

They’re not saying nobody is using MD5; they’re saying “nobody should be using MD5” (emphasis mine).

And how does it do on hashes that are not known to be useless, eg bcrypt or argon2?

It is running on top of hashcat, and at the above-mentioned compute it was benchmarking 45 million bcrypts per second. At that point it is more about the attack plan than the compute.


edit: here is the demo video: https://www.youtube.com/watch?v=KnD4f8N1_OE

"X bcrypts per second" is completely meaningless. What was the bcrypt setting? With the right setting, it would not be more than 1 bcrypt per century, or with the wrong setting, an almost equivalent rate to md5. It depends.

More meaningful would be the speedup compared to a single CPU core, which is what the developers (should) benchmark against. They should make it as slow as possible, so if their system can do bcrypt with a cost of 15 in 0.1 seconds, they should set either that or cost 16. (Much more than 0.2s might be annoying to users or be a DOS vector.)
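Since bcrypt's work doubles with every increment of the cost, you can extrapolate from one measurement. A small sketch, assuming you have already timed one hash at a known cost on your own server; the function name and the 4 to 31 bounds (the cost range bcrypt accepts) are the only things added here.

```python
from typing import Tuple

def pick_bcrypt_cost(
    measured_cost: int,
    measured_seconds: float,
    budget_seconds: float,
) -> Tuple[int, float]:
    """Highest cost whose estimated time fits the budget, with that estimate.

    Relies on bcrypt doing 2**cost key-setup rounds, so each +1 in cost
    doubles the time and each -1 halves it.
    """
    cost, seconds = measured_cost, measured_seconds
    # Step up while the next cost would still fit the budget.
    while cost < 31 and seconds * 2 <= budget_seconds:
        cost += 1
        seconds *= 2
    # Step down if the measured cost itself was already over budget.
    while cost > 4 and seconds > budget_seconds:
        cost -= 1
        seconds /= 2
    return cost, seconds
```

With the numbers from above (cost 15 measured at 0.1 seconds), a 0.2 second budget gives cost 16 and a 0.1 second budget keeps cost 15.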

> With the right setting, it would not be more than 1 bcrypt per century

You can't really call that a "right" setting when it takes at least as long to log in...

Right was meant as necessary to achieve that effect (sorry, English is not my native language). Obviously this is not a recommendation but just to point out that its configuration ranges from negligible to (on today's computers) forever.

English is my native language, and I think the way you used "right" was fine. At any rate, I understood what you meant.

> It is a question of when not if a hash table is fully cracked.

...? At 48 trillion hashes / second, you could get the entire hash space in as little as 224 quadrillion years.

Of course, if you had any collisions, it would take longer. A lot longer.

This suggests that strong passwords are still just as strong under md5 as under a more modern hash. No? Use of md5 is a problem because people use passwords that are easy to guess, not because you can enumerate the hash space. The one-way-ness is as secure as ever.
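For anyone checking the arithmetic, the figure falls out of a few lines. The only inputs are the 48 trillion hashes per second quoted upthread and MD5's 128-bit output; the 365.25-day year is my choice.

```python
HASHES_PER_SECOND = 48e12               # figure quoted upthread
SECONDS_PER_YEAR = 365.25 * 24 * 3600

def years_to_enumerate(bits: int, rate: float = HASHES_PER_SECOND) -> float:
    """Years needed to compute 2**bits hashes at the given rate."""
    return 2 ** bits / rate / SECONDS_PER_YEAR

md5_years = years_to_enumerate(128)
print(f"{md5_years:.3e} years")   # on the order of 2.2e17
```

That is about 2.2e17 years, i.e. roughly 224 quadrillion, and it assumes every hash computed is a new one.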

No, MD5 has about 18 bit collision strength. Combined with predictable salting practice, this is crackable with a calculator. Welcome to Merkle-Damgard construction allowing any prefix or suffix.

Given random salt (random placed or mixed) or HMAC, you have to use the more complex preimage attack at 123 bits.

This is crackable with a medium sized botnet or a supercomputer.

48 THash is an underestimate. Specialized hardware easily surpasses this, even FPGA does.

SHA1 salted is a tougher customer with 60-64 bit collision resistance meaning you probably cannot crack it with your calculator. However, it is still prone to length extension meaning predictable salting has this much strength.

At 123 bits, you're five bits short of the 128 bits I calculated with. I don't think a medium sized botnet can rise to the level of doing 7 quadrillion years of work within your lifetime.

What problem are you trying to solve? As I understand it, we're discussing enumerating the hash space, such that:

1. You are given a hashed value, such as 2b0f4e60b80da7ef1e84573d764f1bf4 .

2. The value is someone's hashed password. You need to find any string which hashes to this particular fixed value, but you don't know of any such string to start with.

You can do this by brute force, but it will take you a long, long time.

The problem is NOT:

1. You have a string which hashes to a particular value.

2. You want other strings which hash to the same value.

And it also isn't:

1. You have a string which, with an unknown prefix, hashes to a particular known value.

2. You want to identify hashes which represent the same string with other prefixes applied.

I don't see where salting is relevant to the question. It's a defense against the phenomenon that cracking one user's password automatically also cracks everyone else who uses the same password (since, without salting, they all have the same hash), but it isn't a defense against having your password cracked by a targeted attack (since, in a targeted attack, there are no other hashes to be collateral damage). Why did you bring it up? What attack are you thinking of?

I agree with your thrust that your parent poster (AstralStorm) is barking in a different forest from the tree we care about, but for salting it is also a defence against pre-computation to trade space for time.

With an unsalted hash an adversary can do as much work as they want in advance, store output and then trade that in once they have your hashes to get all or most of the same rewards as if they'd done the work after getting your hashes. Rainbow tables are the most famous example, but they're part of a family of similar attacks.

Salt lets you arbitrarily discount this advance work because the attacker must do it for all possible salt values and you get to choose how many there are - the early Unix crypt() salted pessimised password hash discounts it by a factor of 4096, modern schemes often use many orders of magnitude more salt. An attacker who has $4M to attack my password scheme probably doesn't want to spend $4M now to have a $1000 advantage once they get the hashes, and they certainly won't for a 1¢ advantage.
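A toy illustration of the space-for-time trade and how salt spoils it. The four-word list and unsalted MD5 are assumptions picked to keep the example tiny; a real table would be vastly larger, but the shape is the same.

```python
import hashlib
import os
from typing import Optional

WORDLIST = ["password", "123456", "qwerty", "letmein"]  # toy dictionary

def md5_hex(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()

# Work done once, in advance, before any hashes have been stolen.
precomputed = {md5_hex(word.encode()): word for word in WORDLIST}

def lookup(stolen_hash: str) -> Optional[str]:
    """Instant reversal, but only for hashes computed without a salt."""
    return precomputed.get(stolen_hash)

def guess_with_salt(stolen_hash: str, salt: bytes) -> Optional[str]:
    """With a salt, the guessing has to be redone per salt, after the theft."""
    for word in WORDLIST:
        if md5_hex(salt + word.encode()) == stolen_hash:
            return word
    return None

unsalted = md5_hex(b"letmein")
salt = os.urandom(16)
salted = md5_hex(salt + b"letmein")
```

Here `lookup(unsalted)` finds the password immediately while `lookup(salted)` finds nothing; `guess_with_salt` still recovers it, so the salt removes the advance work rather than the guessing itself.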

A rainbow table is the exact attack I'm saying is still infeasible. ("Strong passwords are still just as strong.") It only works by assuming the victim uses one of a known set of weak, easy-to-guess passwords. If they don't, their hashed password is very unlikely to be in the rainbow table at all, because there's just too much hash space. The calculation in my original comment gives the approximate number of hashes necessary to fill a complete md5 rainbow table, on the assumption that you get zero collisions in the process of filling it in. (That is, every time you hash something, you get a hash you've never seen before, allowing you to add new information to the table.) That assumption is not at all realistic. By the time you've filled in half of the space, unless you can choose them cleverly to avoid collisions, half of the strings you hash should be wasted effort.

In a single-target attack, I don't really see the concept of "pre-computation to trade space for time". That hurts you by taking a lot of space, but it doesn't gain you any time, because you spent at least the required amount of time, but almost certainly more, doing the pre-computing. If you can buy someone else's pre-computed rainbow table, then sure, that's an advantage for you. But the adversary actually doing the pre-computing is doing it in order to crack many people's passwords all at once ("this table will let you identify _everyone_ whose password is qwe123"), which is the scenario I described earlier.

(At this point I feel I should clarify that "some people use the same passwords" is a real threat and a real reason to avoid md5. I just don't think the comment I responded to, "It is a question of when not if a hash table is fully cracked", was made in good faith or informed by... anything. To fully crack md5 in that way, you'd need an easily-computed function that inverts it. No amount of hashing speed is ever going to get you there.)

> In a single-target attack, I don't really see the concept of "pre-computation to trade space for time"

In a single-target attack the reason you'd do this is because you expect your target to react in a timely fashion to discovery of some other part of your attack by changing passwords.

e.g. maybe you're sure you can break in to get hashes, but you will trigger a reactive IDS. You figure you have some period of time after that trigger before your target is alerted and changes their password.

Time-space tradeoff lets you avoid doing all the work against the clock _after_ the IDS triggers, instead you can do it all _before_ you have the hashes, and only pull the trigger and set off the alarms when you're ready to quickly break the hash, get in and do whatever your actual attack requires.

It's not a _common_ scenario, but it's important to remember it exists in designing general purpose components like password hashes.

You do not do this by brute force.

You can assume the system uses a certain salt pattern, e.g. 4 byte prefix or 8 byte prefix or suffix. This can reduce work from full crack to some 40 bit crack. (Guess salt then presume stupid concat scheme, use collision attack to get matches.) That one is doable on a modern PC on a GPU. It is a targeted attack. The mass variant is salted rainbow tables.

You usually do not even have to recover actual password to use credentials associated with the hash.

Please try to describe the actual attack you're talking about. What do you have, what do you want, how do you get it.

MD5 is one thing as a password can be retrieved from a hash table. But pulling out passwords from a hashed + salted value (e.g. via bcrypt) is many orders of magnitude more infeasible, no?

Salting was originally important as a defense against rainbow tables - which are more or less obsolete with GPUs that can crank through trillions of hashes per second. The real reason that bcrypt is a better way to store a password isn't just because it uses a salt - it's because it's designed to be slow and to use a bunch of memory which makes it much harder to brute force.

I would be impressed if SO's user password table is in bcrypt.

What makes you say that? bcrypt's been the de facto best practice for user password "storage" for probably 10 years now. MD5's been known to be inadequate for much longer.

Even if they had a legacy implementation in MD5, gradually migrating from storing MD5 hashes to storing bcrypt hashes is trivial to do.

From what I understand, many systems do not choose to implement strong hashing algos.

Even PHP's password_hash() uses bcrypt by default. Yes, some people explicitly decide to hash everything with sha1 but nothing you or I do will ever be able to stop them.

I would be disappointed if such a high-profile and technically savvy site would be using anything less.

10 years ago, almost to the day, they had a password vulnerability involving unsalted hashes. So not plain text, but who knows if they've learned the right lessons?


Unless I'm badly misreading something, the unsalted hashes in that post were on some other (unspecified) site unrelated to Stack Overflow.

In the first sentence, he links to a previous post where it's made more clear that he's talking about Stack Overflow.

> I found what one could call a security hole in Stackoverflow. I'm curious enough to go digging around for holes, but too ethical to actually do anything with them.

I hate to say this, but did you even read the post you actually linked?

From the second post (the one you linked), where Jeff quotes the hacker:

>>I guess I can tell you, so you don't fall into this trap again. There's a site I help out with that doesn't salt their passwords. They're MD5 encrypted, but if you've got a dictionary password, it's very easy to use a reverse-MD5 site to get the original. I was able to figure out you were a user on the site some time back, and realized I could do this, if only I knew your openid provider...

The "password vulnerability involving unsalted hashes" was on another site. The hacker was only able to gain access to Jeff's Stack Overflow account through a combination of their privileged access to the database for that other site and Jeff's own bad op-sec. The only real security error attributable to Stack Overflow in any way was Jeff's own and had nothing to do with Stack Overflow's infrastructure.

...unless you use Google login or so. It's almost as if there is nothing to get by hacking SO. That's actually really good engineering.

Not all of Stack Overflow is public. E.g. https://stackoverflow.com/teams.

Email addresses do a better job of identifying particular individuals in bulk data than a name field in a variable format.

It's a very short hop from email to link technologies from this dataset to place of work and then potential attack routes for a sufficiently capable actor.

Downvotes are also private. They're probably not that valuable of an asset for either side, but I can imagine unpleasant situations here and there if they happened to get leaked.

I've seen plenty of questions that include API keys or credentials and they're later edited to "remove" them although the revision history is still there. It would make up a minority of content on Stack Overflow but it's still there.

Old revisions are already included in the public data dump:


Emails can be sensitive if they are corporate and patent trolls are the buyers.

They can also be sensitive for people who ask rather personal questions on the site...

Or even worse - someone could steal my identity and start answering JavaScript questions just to frame me.

I assume this is a joke? I was being serious in my comment -- and I was referring to other StackExchange sites, not StackOverflow.

Passwords that might be similar to those used on other services, which an attacker could then try on those services as well.

And also all the metadata associated with users' activity.

Maybe .. the careers/jobs part in particular?

Possibly StackOverflow Careers data?

All the ads you have clicked on?

> We have not identified any breach of customer or user data.

As usual, this is a meaningless statement. It could mean they have full packet captures they've completely audited, or it could just as easily mean "we don't keep logs of any kind so we have no fucking clue".

I take it to mean that they've looked into their logfiles and accesses to their resources as closely as they can and so far haven't spotted anything. So, not exactly meaningless, but also not entirely reassuring.

At any rate, I still have some degree of trust in the people running things over there to tell us if the reality is different.

Well, I think it is intended, and usually interpreted, as a comment on their own state of knowledge at that point. In other words, "we're not giving you any details because we don't know yet, but we're looking into it". At least, that's how I interpreted it.

What tends to be the first indication of breaches? It's one thing to do a forensic analysis after learning of a breach, and it's another to detect it in the first place.

I worked at a company that logged every single SQL query and made a rule set based on that. It may not have been the most efficient, but it worked great. There was basically a whitelist of sorts, and if the query structure wasn’t in there then action was taken. It also worked by knowing what queries came in what order when doing certain things.
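I don't know how that system was built, but a structure whitelist could be sketched like this: strip the literals out of each query to get a fingerprint, then compare against the fingerprints of queries the application is known to issue. The regexes, the function names and the sample queries are all assumptions; a production tool would use a real SQL parser.

```python
import re

_STRING = re.compile(r"'(?:[^']|'')*'")        # single-quoted literals
_NUMBER = re.compile(r"\b\d+(?:\.\d+)?\b")     # standalone numeric literals
_SPACE = re.compile(r"\s+")

def fingerprint(sql: str) -> str:
    """Reduce a query to its structure by replacing literals with '?'."""
    sql = _STRING.sub("?", sql)
    sql = _NUMBER.sub("?", sql)
    return _SPACE.sub(" ", sql).strip().lower()

# Hypothetical queries the application is known to issue.
KNOWN_QUERIES = [
    "SELECT name FROM users WHERE id = 1",
    "UPDATE users SET email = 'x@example.com' WHERE id = 1",
]
ALLOWED = {fingerprint(q) for q in KNOWN_QUERIES}

def is_allowed(sql: str) -> bool:
    return fingerprint(sql) in ALLOWED
```

A query like `SELECT name FROM users WHERE id = 42` matches the first known structure, while the same query with `OR 1=1` appended produces a different fingerprint and would be flagged.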

This sounds a lot like an IDS for SQL. I've worked with government agencies that focus very heavily on IDS in firewall systems.

So not only do they catch attacks early, in the perimeter network, but they also often block legitimate traffic and handle such cases regularly.

But it's a default deny policy, so that comes with the territory. It also costs a ton of money for the best IDS solutions. I believe they come from companies like Checkpoint, Cisco and Symantec.

What tooling did you use to audit queries?

Not parent but it reads like they wrote their own (presumably driven by DB server log data with query logging enabled).

Mostly, Troy Hunt e-mailing clueless companies saying "Hey, this data breach I got sent seems to check out as real, at least a few of the users in it have validated that's currently or recently been their $company website credential to me."

(Only mostly joking...)

I’ve done a few small IR jobs in my time, and also have a hobby of reading every breach report that comes out.

It seems the vast majority of breach discovery amongst typical companies is an engineer going “hrmm that’s odd”: a router at 100% CPU because it’s currently part of a DDoS attack. A DBA noticing a huge query they don’t recall running. Unusual login times for administrative accounts. Having email systems sinkholed for sending spam. And of course “all my files are encrypted?”

It depends on attack surface and what tooling you already have in place. But for example:

> Finding suspicious outbound network activity


Interesting that this message is being delivered by the VP of Engineering rather than a VP of Security or another more security focused counterpart with a sufficiently senior title. Wonder if SO has an in house security team with management and executive representation?

Reflecting on this, I wonder if a PaaS solution that is a "vault" of confidential information would be a good thing.

Similar to how Stripe handles payments with a token, we could all store tokens for User information (eg the Id) and query the vault (or operate on the vault, eg, validate login, or return email, etc) using keys.

The service could be hardened (like Stripe) to ensure the data is stored securely, and detect ex-filtration attempts (eg, queries for multiple customers at once being abnormal) and automatically block that.
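A rough sketch of the token idea, assuming an in-memory store and a hypothetical `Vault` class; a real service would sit behind authenticated API keys and persistent storage. The calling application keeps only the opaque token and asks the vault to perform operations, and there is deliberately no method that returns many records at once.

```python
import hashlib
import hmac
import secrets

class Vault:
    """Toy PII vault: callers hold opaque tokens, never the underlying data."""

    def __init__(self):
        self._records = {}  # token -> stored record

    def store(self, email: str, password: str) -> str:
        """Store a user record and return the token the caller should keep."""
        token = secrets.token_urlsafe(16)
        salt = secrets.token_bytes(16)
        digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
        self._records[token] = {"email": email, "salt": salt, "digest": digest}
        return token

    def validate_login(self, token: str, password: str) -> bool:
        """Check a password inside the vault; the hash never leaves it."""
        record = self._records.get(token)
        if record is None:
            return False
        candidate = hashlib.pbkdf2_hmac(
            "sha256", password.encode(), record["salt"], 100_000
        )
        return hmac.compare_digest(candidate, record["digest"])

    def get_email(self, token: str) -> str:
        """Return the email for exactly one token (no bulk variant exists)."""
        return self._records[token]["email"]
```

The application database would then hold something like `user_id -> token`, so a dump of that database alone reveals no emails or password hashes.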

You've just invented from first principles Single Sign On, OAuth, SAML, and Identity Providers.

You can rent it from AWS, of course. It's called Cognito.


You can also offload that responsibility for user data/credentials to Google/Facebook et al as you see many places with "Login with Facebook", making your users pay in privacy-invasion instead of bearing the burden of properly securing your user's PII yourself...

Mozilla Persona was designed so "the identity provider does not know which website the user is identifying on." But it did not catch on.

As a user, it seemed like "so i need to login to Google on Mozilla.org... it's just a wrapper for my Gmail and/or Mozilla account?"

Not just a privacy invasion but also a single point of failure/an account shutdown.

This is a great idea and someone should do it.

Specialize in storing personal data (name, address etc.)

Provide APIs that only allow gentle exfiltration of data. e.g. < 10K queries per minute/hour/day whatever.

Have alternate paths (e.g. manual procedures) when greater volumes are required (e.g. for disaster recovery testing).

Then get it audited to death by some serious security firms.
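The "gentle exfiltration" limit suggested above could be sketched as a per-key sliding-window limiter. This is only an illustration: the `SlidingWindowLimiter` name, the limits, and passing the clock value in explicitly (for example from `time.monotonic()`) are assumptions, not part of any existing service.

```python
from collections import deque

class SlidingWindowLimiter:
    """Allow at most max_queries per API key within any window_seconds span."""

    def __init__(self, max_queries: int, window_seconds: float):
        self.max_queries = max_queries
        self.window = window_seconds
        self._hits = {}  # api_key -> deque of timestamps of allowed queries

    def allow(self, api_key: str, now: float) -> bool:
        """Record and allow the query, or return False if the key is over budget."""
        hits = self._hits.setdefault(api_key, deque())
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.max_queries:
            return False
        hits.append(now)
        return True
```

A caller that needs more than the budget (say, for disaster recovery testing) would have to go through the manual path rather than the API.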

Would this be something companies would pay for?

A FAANG wouldn't, but a corporate building e.g. a second-tier system holding customer data might.

Off-topic, but why is there no M in FAANG?

Rejected due to poor culture fit.

The term was originally coined in order to talk about stock price behavior.

There's a group out of the University of Warwick that's trying to commercialize a similar idea.


Like Azure Key Vault? I’m sure AWS and Google have something similar.

Or host your own by using Hashicorp Vault.

A single target for all gangsters on the planet? A convenient one-stop-shop, as opposed to having to penetrate multiple different services with vastly different technology stacks and protections and to then normalize data stored in a multitude of different formats? Sounds great, for attackers.

Wouldn't this really just turn into a database as a service?

Since it's users, maybe call it a directory... And since it's changing, maybe even an Active Directory? Like this https://aws.amazon.com/directoryservice/ ?

But then we need some sort of Protocol to Access the Directory. It should be something Lightweight, ideally. ;)


No. It should prevent something like "SELECT * FROM users WHERE sex="F" and age=18".

So, a database?

I wonder why they are disclosing this so early, with so little information.

SO is a good company and this is what good companies do

Would you prefer that they didn't?

I think I would prefer to wait until they have something more specific to disclose. The current update gives me absolutely nothing to go with.

It's as if a prison disclosed that the front gate was left unlocked for several minutes and they're still counting the prisoners. I would much prefer to hear about it after they have learned whether anyone escaped.

Because if they didn't, they'd get criticism about not doing so right away. Also...GDPR?

This is not related to GDPR afaics, but at least in Germany, there is an IT security law that governs how companies must disclose security breaches. (Don't get your hopes up, that law is entirely toothless in practice.)

Probably a really stupid question... but how do people detect an intrusion like this?

It's not, but it might have been smart to read a few of the other comments and notice an identical question with two or three answers (depending on your exact time of writing): https://news.ycombinator.com/item?id=19935443

Thanks. I was reading on my phone on the train, comments on the app aren't always easy to follow.

Oh oh... More than ever now, don't copy paste blindly from SO answers!

You'd have to copy/paste a serious chunk of code you don't understand to really cause any damage. I think this comment is either taking the pun or misguided.

Q: "How do I recursively set ownership of folders in Linux?"

A: http://thejh.net/misc/website-terminal-copy-paste

Reader mode exposes the full text of the command, if anyone is wondering and doesn’t feel like doing exactly what the post is telling you not to do.

He/she is making a joke. No user data was accessed, so the assumption is that questions and answers may have been. A couple of pluses changed to minuses could cause a lot of damage (headaches?) when copy pasting.

It didn't say no user data was accessed. It said "We have not identified any breach of customer or user data", which likely means the attacker had access to user data but there was no way to know if they did or did not access it.

I think we've reached a point where it's safe to say that if you're using a service - _any_ service - assume your data is breached (or willingly given) and accessible to some unknown third party. That third party can be the government, it can be some random marketer or it can be a malicious hacker.

Just hope that you have nothing anywhere that may be of interest or value to anyone, anywhere.

Good luck.

I've made it a point to start self hosting anything that's particularly sensitive that I don't want third parties to have access to. KeePass and SyncThing probably have my most important information, and it's all owned by me.

StackOverflow is a forum, not a password manager or a file storage service. If people only participate in forums they self-host, each will have a community of one.

The IndieWeb people would like to have a word with you...

I think this belief that personally run software is more secure than professionally run software is a bit optimistic.

It doesn't have to be more secure, it just has to be less likely to get hacked.

Less likely to have targeted attacks but you are still at risk of someone finding an exploit in the software and sending a bot to scan the internet for the software

I thought this kind of attack was usually done with relatively old bugs, for which patches are often available.

If you sat on a fresh exploit, would you really waste it with automated, untargeted mass scans, which may draw a lot of attention, causing your bug to burn out quickly?

Um, yes? You'd use it as widely and as quickly as possible, ideally compromising every single vulnerable host on the entire Internet before any sort of coordinated response can be mounted.

You see these kinds of attacks frequently with cryptolocking/cryptojacking software. The more quickly you deploy an attack targeting a new vulnerability, the more victims you'll have.

Probably more at risk too because how rigorous are you really about staying up to date on the most recent security patches? How much time and money did you actually spend setting up security infrastructure like automated security testing or vulnerability bounties? Enterprises, even many of the ones that have had data breaches, dump a ton of time and money into those areas.

Sure. Nobody claimed you could get the risk down to zero.

It's even more of a risk potentially because big companies have people full time working on keeping systems up to date and monitored. How many self hosters have a full monitoring system powerful enough to detect attacks and keep their software up to date and secured as soon as updates come out?

How many people self hosting are even qualified to run a secure system? I bet most of them are just regular devs who know just enough about linux to get something online.

I don't think you understood my point. Yes, one particular risk might be higher. But you don't need to do security better or even on par with a big company. You just need your total risk of data exposure to be lower. You can bet big companies have lots of hackers trying to break into them with the newest 0-days, spearphish their employees, etc... there are so many threats you practically don't face if you're self-hosting.

> It's even more of a risk potentially because big companies have people full time working on keeping systems up to date and monitored.

Beyond the thing about different types and frequency of attacks - sure, I trust Google's security more than my own. But I do trust my own security more than that of Random-Startup.IO, who likely have no full-time security people, and little incentive to get the job right (paying attention to security slows down your incredible journey).

Also, even with big companies, this argument applies primarily to the few like Google, Facebook or Apple. Your Random Megacorp from outside tech community usually focuses its security efforts on satisfying regulators and neutering their own employees, who'd otherwise happily copy out all sensitive data to make their jobs easier.

I'm not sure that off-the-shelf software on your own server is necessarily less likely to get hacked. It's easy to fall behind on security updates when you don't think about deploys regularly.

Look at the logs for your existing infrastructure. I can pretty much guarantee that there are drive-by Wordpress attacks, regardless of what software is actually serving requests. There will be ssh login attempts.
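For the ssh part, a small sketch of counting failed logins per source IP. The log lines below are made-up samples in the usual OpenSSH "Failed password for ... from ... port ..." shape; the exact format on a given system may differ, and `failed_logins_by_ip` is just an illustrative helper.

```python
import re
from collections import Counter

# Matches OpenSSH-style failed password lines, with or without "invalid user".
FAILED_LOGIN = re.compile(
    r"Failed password for (?:invalid user )?(?P<user>\S+) "
    r"from (?P<ip>\S+) port \d+"
)

def failed_logins_by_ip(lines):
    """Count failed ssh password attempts per source IP."""
    counts = Counter()
    for line in lines:
        match = FAILED_LOGIN.search(line)
        if match:
            counts[match.group("ip")] += 1
    return counts

# Made-up sample lines; in practice, read them from the system's auth log.
SAMPLE_LOG = [
    "May 17 10:01:02 host sshd[1234]: Failed password for invalid user admin from 203.0.113.5 port 51234 ssh2",
    "May 17 10:01:05 host sshd[1234]: Failed password for root from 203.0.113.5 port 51236 ssh2",
    "May 17 10:02:11 host sshd[1240]: Accepted publickey for alice from 198.51.100.7 port 40022 ssh2",
    "May 17 10:03:40 host sshd[1251]: Failed password for root from 192.0.2.9 port 33810 ssh2",
]

if __name__ == "__main__":
    for ip, count in failed_logins_by_ip(SAMPLE_LOG).most_common():
        print(ip, count)
```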

You're not showing anything by telling me to look for attempted attacks. I realize they will be there, I don't even need to check. But if attack attempts are how you measure risk then I'd bet you whatever attack you can think of, Google et al. will have orders of magnitude more of them than I would.

You gotta realize, it's not like I'm arguing you should set up a server with 1234 as the root password. I'm assuming you're reasonably competent in security, just mostly lacking in the bandwidth needed to e.g. keep your server constantly checked and updated on a daily/weekly basis. With those assumptions I have no reason to think the slightly increased risk of getting hit by a brand-new attack through an IP scan or something is going to outweigh the entire classes of risks that you do away with as a result of not being part of a massive corporate attack target.

Although, heck, if you're absolutely paranoid about random IP scans, you could just move your stuff to some obscure port, which I'm sure you realize already. There you go, you're not going to be found through random mass scans anymore.

Plus, both are great software. KeePass2Android is the best Android password manager, bar none.

I like Chrome/Chromium's password manager. You just login the first time you open it and it autofills passwords. Don't have to install any additional software or configure anything, and it'll also autosuggest passwords you saved on websites in Android apps.

The only thing I miss sometimes is you can't manually add passwords.

The attack surface of a browser makes it a perfect target - I would not advise storing any critical passwords with the browser or in close reach to the browser.

You're going to be entering these passwords into a browser most of the time so if a compromised browser is your problem, no password manager is really going to help you.

That depends on time between compromise and detection. With password manager you'll lose only passwords for sites you actually logged in to. While with browser, you'll lose all passwords instantly.

I'm not sure I follow. If your browser is compromised that's it - it's compromised for everything. Your system is compromised. If I have control over your browser, I don't really need your passwords although I can likely get them out of whatever local password manager you have, to boot.

That's not how it works always. There are tons of compromises that do not imply system compromise, like XSS, or arbitrary browser process memory reads, or extension bugs, or java ghost scripts, etc...

XSS isn't a browser compromise.

What's a "java ghost script"?

There are many different ways to get compromised. Reducing attack surface is always a good idea.

And, yes I do close all browser windows/processes before login, and after logout of important websites for instance to make sure cookies and passwords are gone from browser memory.

I don't think this really addresses my point. You're saying the in-browser password manager is somehow more dangerous than some external password manager. I don't think this is true. And the browser presents the same attack surface if you're, you know, using the browser. If your browser is a vector for successful compromise, you're boned if you use the browser, whatever elaborate protective ritual you follow while using it.

>> You're saying the in-browser password manager is somehow more dangerous than some external password manager.


Last I checked, Chrome on desktop stores all your passwords in plaintext on disk. Unless something's changed... I wouldn't use that.

Firefox at least offers you the ability to set a master password to encrypt all the rest.

Well, on Windows Chrome does use the system crypto API and encrypts, I believe, your whole profile, but only if you have a password set on your system account.

It doesn't anymore, unless last time you checked was quite a while ago. But it probably wasn't such a dreadful thing even when they were doing it.

Pretty sure every single modern browser has that. The downside with using chrome is handing all your browsing history and bookmarks to Google.

Unless they on-the-fly decrypt your chrome sync (which would require non-encrypted password storing), the stuff you sync to Google is encrypted with your Google password, and if you're paranoid, you can encrypt the sync with a separate password.

> the stuff you sync to Google is encrypted with your Google password

Your Google Password is also available to Google. (At least every time you log in, even if they properly hash and discard it after authenticating you and just use a token from there.)

can't tell if you are joking or not.

I highly suggest you read chrome's privacy policy on that password sync feature. Hint: when enabled on android the wifi password is unencrypted (or reversible, which is close to the same thing. they claim it must be so to work with wear)

After using KeePass2Android for the last 5 years I decided to donate to the author. It has truly been a useful piece of software for me.

I sync with Seafile over WebDAV.

I should check out Seafile...

I used to be a fan of keepass as well, but I moved to bitwarden maybe 18 months or so ago. For $10 a year for the paid version I get MFA and some other features. I find it a much more seamless experience than keepass/etc, as it works as a browser extension or a discrete app (the Android app uses accessibility features so it detects other Android apps asking for authentication as well as Android browsers such as Firefox). Anyway, just another thing to try if you are looking...

I moved to BitWarden too, but mainly because the browser extension for KeePass (Kee) didn't work well. BitWarden is good, but the Android app is nowhere near KeePass2Android, which I sorely miss.

* KeePass2Android Offline :)


I sync my phone using Syncthing, so that's one way.

Oh I see. (Why not just reply to my comment? :-) ) But then you lose the entry-level syncing capability?

I tried but on the thread page itself the reply button was missing under your comment. I guess I'll open the message direct link next time to reply when it's missing.

Why offline? (And how do you sync?)

I feel it's slightly better to rely on something else for the syncing (even better if you do it manually). I just feel like a password safe would draw immense interest from bad actors, so you marginally decrease your chances by using something else for syncing. That way if password storage code was compromised somehow, it can't do much.

Then again, a password storage solution is probably investing so much more into security that it may be actually better than using something else..

What kinds of threats are you imagining though? KeePass2Android doesn't e.g. open any listening ports does it (I haven't checked)? (Not that NAT would make it easy to connect to it if it did anyway?) Are you imagining it would "accidentally" open a port? And you don't browse the web on it or otherwise run untrusted code on it. How are you imagining it would possibly get hacked? If it's connecting to e.g. Google Drive, then Google Drive or your DNS would need to get hacked somehow, and I'd hope it's checking certificates to prevent that (shouldn't be hard to verify this if this is your concern). If it's via Syncthing, your Syncthing would need to get hacked. In both cases your database would be hacked in which case you'd have the same issue with the offline version too...

OTOH you're losing entry-level syncing which is quite the inconvenience...

For me, I sync by plugging my phone into the USB port and copying the .kdbx file over. I've never needed anything fancier, let alone had a reason to send my password database out over the internet.

Wow I see. Props to you... on my end it's so much of a hassle to find a cable and grab my phone and connect it to my computer every single time I update my password database.

I hear you. But I worry that that's not enough. I trust Syncthing and the (many) Keepass (X/C)++ developers, but how hard really would it be to slip something in unnoticed. All it requires is some minuscule bug somewhere. It doesn't need to be in the software itself! It can be in the compiler, or in the crypto or in the machine running it.

If the OpenSSL debacle taught us anything, it's that open source and the fact that many people can look at the code does not mean it's actually being looked at. Don't get me wrong, still loads better than non-open source, but you still face a huge risk. I'm slightly competent as a developer (not that much, just enough to barely get by) and still looking at the code base of the many apps, services and platforms I use, I'm astounded by the fact that I have no clue how they actually work, and if there is any obvious attack vector there. MOST people have even less of an understanding of all these than I do.

> All it requires is some minuscule bug somewhere

You can say the exact same thing for any proprietary software or web service. The point of self-hosting is reducing the attack surface and probability of an attack.

What kind of a KeePass bug would compromise your passwords by itself though? Are you imagining instead of saving your passwords it'd "accidentally" upload them to sketchyserver.com?

The biggest thing stopping me is the worry I might misconfigure something after making some change in 6 months when I’m busy.

That could always be an issue. Though I try to make sure I have the minimum number of ports, and services running. But there is a possibility that something I have is exploitable still.

Make sure to keep the amount of selfhosted services to a minimum and as simple as possible to use and maintain.

The simpler and smaller the surface of attack is, the better.

That's my basic assumption now. Which is why any website force-asking me for my date of birth or phone number will get a fake one, and I will use Paypal over giving my physical address.

But it doesn't matter, the damage has been done over and over. Pretty sure I'm in many leaked databases already (Hi Adobe!)

Just imagine a Gmail or Mint breach. Oy.

Before 2013, data sent between Google data-centers was in plaintext (!!!) because Google incorrectly assumed that their private fiber network was actually private[0].

So if you used Gmail or communicated with people who used Gmail before 2013, then a copy of your communications is backed up in Utah[1] right now.

[0] https://www.wired.com/2013/10/nsa-hacked-yahoo-google-cables

[1] https://en.wikipedia.org/wiki/Utah_Data_Center

Quickbooks Online would also be horrific

Which is why nobody should use Quickbooks Online.

I think a mature state of mind is that you should assume those systems are compromised.

Someone make a service that I can submit my email and you'll generate tons of fake passwords for me, then "accidentally" leak them all. When my email and password make their way into the big password lists I want to dilute the real one with thousands of fake ones. If it became the norm that password lists are suddenly full of junk the demand for them would probably evaporate.

Interesting, but the result will be that they're going to blast all of those passwords at the login of whatever service you're trying to protect by this trick. That could DOS the service. It also could result in them locking your account, if they're awake at all.

If someone gets enough passwords to DoS the service or force you to do a password reset, that's actually a pretty good safety fallback.

Security through obscurity is undervalued.


Yes. It's about minimizing attack surfaces. You can't hack what you can't understand.

I'm glad it was a 'minor' breach. But where is the blog post from the clever and witty founder, about not trying to hire the top 5% of security engineers because everyone is?
