A few days ago I published my blog to GitHub, with my MailGun API key in the config file (stupid mistake, I know). In less than 12 hours, spammers had harvested the key AND sent a few thousand emails from my account, using up my entire monthly limit.
Thankfully I was using the free MailGun account, which is limited to only 10,000 emails/month, so there was no material damage. Their tech support was awesome in immediately blocking the account and notifying me, and then quickly helping to unblock the account after keys and passwords were changed and the repo was made private.
I was wondering exactly how they were able to harvest GitHub content so quickly; it couldn't be web scraping or a random search. This article explains well how to drink from GitHub's events firehose and the GHTorrent project, so everything makes sense now. Thanks for posting it.
EDIT: This other post describes a similar situation. There are some folks monitoring ALL GitHub commits and grabbing passwords as they are committed, on the fly.
Also, this was a reputable email provider that many of you know of (I believe it went through one of the incubators).
What I do for most projects is keep the tree containing the working directory in a directory that has some other items that don't belong on GitHub (like the project brief, my emacs bookmarks file, random notes related to the project, etc.), and in that directory there is a .credentials file containing a set of export statements, somewhat like these (values are placeholders):
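```sh
# .credentials -- lives alongside the repo, never inside it; all values here
# are made-up placeholders
export MAILGUN_API_KEY="key-0123456789abcdef0123456789abcdef"
export DATABASE_URL="postgres://appuser:not-a-real-password@localhost:5432/myapp"
```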
Then at startup the app goes looking for its config in the environment. This does create issues for some environments (solving this for Docker is trivial), but you can usually pass environment variables to whatever is executing your code reasonably securely. Now it's not perfect, and environments can sometimes be revealed externally if an attacker is determined and clever and focused on your app for some reason.
But it does give you a hygienic procedure that keeps your credentials that are equivalent to an open draw on your bank account out of public repositories.
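The Docker case really is as trivial as claimed; a sketch, assuming the `.credentials` file above and a placeholder image name:

```sh
# source the credentials into the current shell, then forward the named
# variables into the container (values never get baked into the image)
. /path/to/project/.credentials
docker run -e MAILGUN_API_KEY -e DATABASE_URL myapp
```

Passing `-e VAR` without a value tells Docker to copy that variable from the host environment, so the secret never appears on the command line or in the image history.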
To be fair, I think they just copied the `foreman` tool from Heroku. However, it works great. Most projects don't need anything more than a flat hierarchy of secret keys and values.
Writing your own parser for a `.env` file is a piece of cake, even in shell language.
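A sketch of that parser in plain shell (values are taken literally; quoting rules are left as an exercise):

```sh
# tiny .env loader: skip blank lines and comments, export everything else
while IFS='=' read -r key value; do
  case "$key" in ''|\#*) continue ;; esac
  export "$key=$value"
done < .env
```

And if the file is trusted, `set -a; . ./.env; set +a` does the same job in one line.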
Adding `etcd` is better, but it's too much work for a small project.
How do the secrets get safely distributed to the machines where they are needed?
How to revoke/rotate a secret, especially once a compromise is suspected?
How to perform all this in DevOps-y, automated systems?
This is the problem space I work in.
I've used fabric and ansible to push configs out to small sets of hosts, and yes, assumed that the sensitive bits were OK sitting on the filesystem of the production host, since if an attacker had access to the filesystem there would be bigger issues and I'd have to invalidate those credentials anyhow.
At a larger scale you'll want something like etcd or consul or even just a centralized key server that new instances call and ask for their configuration.
The thing is that anything predicated on HMAC secrets is vulnerable to those secrets being exposed. The secret has to be in the clear at some point to perform authentication or signing and a sufficiently determined attacker will be able to get that string.
A system is only as secure as the humans running it can confirm it to be secure. This is why it's best to reduce your attack surface and ensure that you can log access and do process inventory and egress filtering and the whole checklist of prevention, detection and remediation. There is no magic pixie dust that will make your system fully secure; you will always be making tradeoffs and managing risk rather than eliminating it.
Maybe the answer to these questions and others is satisfactory, but getting RSA catastrophically wrong is easy enough that I'm extremely skeptical that a library will get it right. Honestly, I'd be infinitely more likely to use your library if it just used GPG under the hood. That's one less piece of crypto I feel compelled to audit.
I did read the code, and I saw it used the `rsa` library. I read the code for that library, and also saw it claims to use PKCS#1 padding. None of this obviates my point.
There are dozens of other ways to fuck up an RSA implementation. Some obvious, many not. I am not an expert in Python, nor am I an expert in auditing secure RSA implementations. Neither are most of this project's intended audience, I would warrant.
Using RSA like this directly, in my opinion, dramatically increases the likelihood of a significant implementation oversight when compared to something as widely-used, audited, and established as GPG. And it should cause security-conscious users to be much more distrustful of it.
As a security professional, adding to the list of libraries and crypto implementations for me to audit does not reduce my workload: it massively increases it. If it were a conceptually simple wrapper around GPG, I would consider deploying it without a second thought. GPG, while crusty and imperfect, is at least more difficult to misuse. As it stands, I would need to spend significant time relearning RSA implementation best practices and ensuring it adheres to them.
The fact that others aren't likely to do (or be capable of doing) this legwork only makes the problem worse; bad crypto is often little better than no crypto. And until proven otherwise, the default assumption should be that something uses bad crypto.
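For contrast, the GPG wrapper I'd deploy without a second thought is barely any code at all. A sketch (recipient and file names are made up, and it assumes the secrets file holds export lines):

```sh
# encrypt the secrets file to the ops key; only the key holders can decrypt
gpg --encrypt --recipient ops@example.com --output secrets.env.gpg secrets.env

# at deploy time, decrypt straight into the environment; the plaintext
# never needs to touch the disk
eval "$(gpg --quiet --decrypt secrets.env.gpg)"
```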
All it takes to be compromised is a mistake when deleting the sensitive information, or a push you didn't realize went out. Even if you're absolutely positive there wasn't a breach, it can be a good excuse to drill for a _real_ breach later.
It never hurts to walk through the practice of what to do if credentials leak when there's no pressure.
Ben Franklin figured this out many years ago:
> Three can keep a secret, if two of them are dead.
The only recourse is to immediately change or revoke access.
Sure, just enumerate any and all possible types of sensitive data, the format they may be in, regex / matching functions to account for them (supported across 20+ programming languages) and I'm sure Github will have that done asap.
Alternatively, don't commit passwords/API-keys/sensitive-info to your repo.
False dichotomy. It doesn't have to be "everything" or "nothing". An 80% solution here is better than nothing.
I still find it useful that gmail warns me before I send an email without an attachment if I've written "I've attached" in an email. Can gmail detect with 100% accuracy if I intended to send an attachment? Of course not. But the 80% solution here still does me and lots of other people a lot of good.
There will never be a 100%-foolproof "Did you mean to commit this sensitive unicode string?", but getting in front of it with an "OK, I've checked my code, run my tests, pruned the sensitive data; is there anything I'm missing?" will go a long way, both now and in the future.
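And nothing says the 80% check has to live on GitHub's side; a rough sketch of the same idea as a local pre-commit hook (the patterns are illustrative, not exhaustive):

```sh
#!/bin/sh
# .git/hooks/pre-commit -- refuse commits whose staged diff looks like it
# contains a credential; override with `git commit --no-verify` for a
# false positive
if git diff --cached | grep -E -i '(api[_-]?key|secret|password)[[:space:]]*[:=]'; then
  echo "Possible credential in staged changes; aborting commit." >&2
  exit 1
fi
```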
There's an issue with your example. Google looks for the substring 'attach', but does it know that a file with a string 20 chars from the newline, 50 lines deep, with two single quotes is actually your root password? There's a world of difference between keying off a word/phrase and understanding the context of a larger document+metadata. Trusting computers to do the latter will result in bad times for all, while learning it for yourself can help spread the knowledge to the technically inclined and to those who aren't (infosec is everyone's issue!).
No one's suggesting anything like that. It's just a basic protection that Github could offer its users, because its users are humans, and humans make mistakes.
No matter how clever you get with your pattern matching, you're going to have to always play catch-up with Web Framework N+1's format / weird-ass package manager. The dual approach is the only sensible approach because it expands your coverage. The important thing to this is to internalize the knowledge learned from 3rd parties and integrate that into your native process / tools, but not everyone will do that.
It'd be grand if we could say, "yeah, they should take care of my security for me because I'm paying them" but reality is a bitch. It doesn't matter what they were 'supposed' to do if there was an infosec leak or attack, you can't ctrl-z that (set of) event(s) and that info is out there, so it must be fixed (re-roll credentials, regen keys/certs, etc) and accounted for next time. There really is no 'getting ahead' but 'being less behind' will at least help mitigate getting eaten from the herd ;-)
It's kind of amusing you are arguing that it's impossible to play this game, even though that's exactly what the perpetrators are doing: they are automatically detecting API keys and harvesting the code... maybe their script is hosted on GitHub?
Any time you want to show me a 100% future-proof algorithm for sensitive-info detection that works across any/all code on github, I'd be happy to toss my hat in and say, "I was wrong", until then, people will never ever beat 0days they don't know exist (0day being more than just a SW exploit). Just do.not.commit.sensitive.info.to.github. Period. That is the only sure way to not mess it up. Software only executes what is in the code, regardless of how nonsensical it is (aka, your code will not save you from messing up, something something something, PEBKAC)
I'm not arguing it's "impossible to play this game". It's 100% possible to play it when and how you'd like. I'm discussing the rate of "did I win (read: not get pwnd)?" It's a cat-and-mouse game of automation for vulnerable/sensitive info... but all of that is rendered moot if you... wait for it... don't commit it to GitHub, which would mean it wouldn't get to GitHub's API, which means it wouldn't appear in 3rd party services sucking the firehose from GitHub's cloud-y silicon teat.
And expanding on this... committing your passwords and sensitive info to your code repo is so misguided it's actually funny. What happens if you have an employee and they go off the deep end? Whoops, gotta rotate all those passwords/credentials, un-fuck every branch, resync devs' machines, etc. Keeping that sensitive info in a private, self-hosted, well-maintained internal repo (with strong ACLs, especially wrt server/hosting environments) will go significantly further for your team's security than submitting a feature request to a company to stop you from making arbitrary mistakes every so often.
Yes, Github can/should help, but developers should not think they're owed it just because they constantly check in sensitive info to a website, that's all.
(I'm just applying your point to another kind of 'security' that is provided to you).
You are basically arguing the "We should all live in the woods, and hunt and kill our own food, because relying on other people is fraught with danger" line.
Or we could accept mistakes are made, and provide warnings/undos/etc. Kind of like how cars have airbags, even though they are rendered moot by just "not crashing your car"
This is, of course, the right answer.
However, it's frustrating that several frameworks make this very easy to get wrong. Anything that has an application.yml, database.yml, or similar configuration file that normally lives within the same directory as the source code, and which is intended to contain credentials, means that lots of people will make that mistake.
It's one of the fundamental errors that you see so often in web frameworks like Rails, this whole idea of mixing application code and configuration files into one big tree. I don't know how this practice caught on, but it has, and it so frequently causes mistakes, both big ones, like accidentally publishing credentials, and simply frustrating ones, like blurring the line between application code and configuration in ways that make it harder to maintain multiple local configurations and to update the application code independently.
The point where you define it is sensitive, but the places where you use it are not.
Not sure how GitHub could make it more obvious.
Perhaps if they mentioned there are unscrupulous users out there who have a script that hammers GitHub's events API to search for exposed passwords/keys, then it would reduce the 'oops I only pushed it for a second' thinking that users likely go through.
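There's no special access involved, either; the firehose is a public, unauthenticated endpoint that anyone can poll:

```sh
# GitHub's public events feed; push events surface fresh commits within
# seconds, so "I only pushed it for a second" is already too long
curl -s https://api.github.com/events
```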
Typically, I've seen services run in restricted user accounts with limited system access, reading passwords out of an encrypted file. This file is stored in some obscure location on the box, and that user account is the only one with read permission to it.
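A sketch of that setup (paths and the service account are made up, and how the passphrase reaches the service at startup is the hard part this approach punts on):

```sh
# lock the encrypted file down to the service account
chown svc-myapp:svc-myapp /var/lib/myapp/creds.enc
chmod 600 /var/lib/myapp/creds.enc

# at startup the service decrypts to stdout and reads credentials from there
# (-pbkdf2 needs a reasonably recent OpenSSL)
openssl enc -d -aes-256-cbc -pbkdf2 -in /var/lib/myapp/creds.enc
```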
Keep in mind, every system has weaknesses and I am still interested in listening to others' approaches.
In your public .bashrc, put the line `source .bashrc.secret`. Just keep an empty .bashrc.secret in your public repository, and keep your actual secret credentials on your machines.
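Concretely (the guard keeps a fresh clone from erroring before the secret file exists):

```sh
# in the public .bashrc, committed to the repo:
[ -f ~/.bashrc.secret ] && . ~/.bashrc.secret

# ~/.bashrc.secret in the repo stays empty; on each real machine it carries
# the actual export lines and is never committed
```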
Otherwise, I'd try to use keystore systems available on your respective OS or language platform toolchain (CSP on .NET, JCE for Java, I dunno wtf else you'd use for anything else because the only people I've heard of that want to go this far are all F500 enterprises basically with software in exactly those two languages only).
On Windows I create a batch file at a fixed location with all the credentials in it. A script simply runs this batch file and reads the env vars to get values. A compiled program parses the batch file with a regex to find the required values. This works remarkably well for keeping credentials out of the code base.
Hope that helps someone.
No they don't.
> No they don't.
A better approach is what the rest of the comments here are suggesting (rough sketch after the list):
(1) store your secrets (API keys, certs, credentials, whatever) in a highly-secure system, with both strong encryption and immutable audit logging around their access and modification;
(2) expose those secrets at run/compile-time via variables, such that the secret is never stored on-disk anywhere other than in the highly-secure system from (1) and in transient storage while in use;
(3) wrap an authz layer around variable access, so that only authorized services/users/hosts (those that have authenticated properly and who are allowed access via the authz policy established here) can read/write/mutate the secrets
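A rough sketch of (1)-(3) wired together; the secrets service, URL, and paths here are entirely hypothetical:

```sh
#!/bin/sh
# (1) the secret lives only in the secure store behind this endpoint
SECRETS_URL="https://secrets.internal.example.com/v1/myapp/prod"

# (3) authn via client cert; the server's authz policy decides whether this
# host may read this secret, and logs the access
# (2) the value goes straight into the environment, never onto local disk
export MAILGUN_API_KEY="$(curl -sf --cert /etc/myapp/client.pem "$SECRETS_URL/mailgun_api_key")"

# exec replaces the shell, so the secret lives only inside the app process
exec ./run-server.sh
```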
It's (basically) the "privileged identity management" space; the challenge is that the commercial software in that market hasn't kept up with the combination of automated ops infrastructure and cloud-hosted dev tools. There are some ideas around how to do 1-3 better, with a devops/cloud-native design built in. (Full disclosure: I'm part of the founding team at a company doing this.)
Even AWS CloudHSM is not revolutionary conceptually as much as from a compliance and paperwork standpoint. I think there really needs to be emphasis on a (4) - all secrets must be rotated and revokable on-demand and on semi-random schedule. The goal is to make any credential only valid for a period of time less than what an attacker that is already present on your systems would need to further increase presence or to compromise any of 1-4. Who cares if an instance is owned if it's up for maybe 10 minutes and can literally only communicate on a specific port to a specific server with a specific protocol?
Unfortunately, this is all only reasonable in a highly automated architecture, and it's basically impossible at almost every company I've ever seen that's ever uttered the mere word ITIL. Those companies tend to be people-driven cultures for everything rather than the process-driven cultures you need to be effective in cloud environments (most companies add policies so ineffectual and meaningless that everyone reverts back to tribalism, similar to how everyone defaults to e-mail when collaboration tooling is ineffective).
I do devops and security automation as well, and there's nothing self-serving about your points if you ask me.
It's not clear from the doc you linked whether you would support AWS STS, which is probably the right way to approach minimal privilege and to reduce the time window in which an attacker holds the privileges of the compromised entity. I wish I had a way to calculate that window from the tools I have; it helps drastically during an investigation when sifting through network logs.
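For reference, the mechanics on the CLI side are already a one-liner; the role ARN here is made up, and 900 seconds is the minimum STS allows:

```sh
# short-lived credentials: whatever an attacker steals expires in 15 minutes
aws sts assume-role \
  --role-arn arn:aws:iam::123456789012:role/deploy-read-only \
  --role-session-name "audit-$(date +%s)" \
  --duration-seconds 900
```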
What you seem to have built so far is what could be used to build a more modern shared-secret access stack, rather than being a full solution itself. Most companies that want to pay for something want it to rotate keys & passwords for them, or to enforce secrets policies like separation of keys across different nodes in a high-availability deployment (e.g. the DB, root, and LDAP cached passwords should not be stored on the same data node even in encrypted form). Otherwise, a lot of companies have already built equivalent solutions like Conjur (to varying degrees of success, depending upon how dysfunctional their IT already is). A lot of the custom solutions I'm familiar with in the Defense / IC space are starting to use Apache Accumulo to enforce rules about how secrets are shared and stored. Its architecture makes it possible to have tables split both column-wise and row-wise across multiple nodes based upon business rules like HIPAA, FISMA, PCI-DSS, etc. Tack on Zookeeper with some SASL and you'll spend the next year or two just arranging the meetings to figure out the security rules.
For an analogy, it seems like you've built a lot of the workings of Postgres missing something important like procedural queries and triggers, but organizations really want an ORM (they just don't even realize it because the whole industry is built around bikeshedding topics in security). Build something respecting the vernacular and culture of engineers, IT opsec / compliance, and (more importantly) the managers of both orgs and you should have a winner. Ok, after you find the right sales guys to get the attention of some F500s that are in terrible industries wracked by compliance BS 24/7.
All in all, good idea and it looks promising, I'll keep your product in mind if I can get a management tool like this even suggested. We're doing some extremely bad practices at present in order to avoid violating OTHER no-nos keeping stuff out of the public cloud, and our IAM across dozens and dozens of AWS accounts is completely bonkers and the bungling of the credentials as the after-effect is probably causing worse security problems than if we just gave them all the same keypairs. It'd be really interesting to see this work seamlessly across both AWS-like environments and a vSphere/vCAC/vCD type of environment using affinity / anti-affinity rules to make initial guesses about your security configuration. Pretty sure everything in an autoscaling group should be by default in the same group or "layer" (in your terminology), for example, and you could start with the same for vSphere compute clusters, unless host anti-affinity rules for a VM are present, which usually means that the VM is not allowed to cross a physical boundary and is a hint at a business level policy rather than a technical one (nobody does cross-geographic clusters besides Google last I saw, and you probably aren't going to be able to sell this to them....).
One thing that would tremendously help in your documentation would be to provide security scenarios for different user stories and potential users. Admins across multiple tenant business units have different use cases than developers that are working in maybe one or two organizations / groups, for example. I found myself expecting a "I am a... X, Y, Z" set of tabs and wanted to see each of their use case scenarios for one or two sample companies with different needs. Besides the "I don't want to be your guinea pig" mentality, this is what companies are really looking for half the time they ask for a reference customer.
This is a bit more nuanced than your summary, because GH makes it easy. Without the events API, you would have to poll the various repos to find out if changes happened.
Furthermore, the existence of GHTorrent demonstrates the ease with which this information can be harvested.
I fail to see how the warning is meaningful. One should assume anything you post to the public internet may be public forever. The existence of an API changes nothing as far as the amount of care you should take.