Why Deleting Sensitive Information from GitHub Doesn't Save You (jordan-wright.github.io)
313 points by jwcrux on Dec 31, 2014 | 82 comments

> In this post, I’m going to show exactly how hackers instantly harvest information committed to public Github repositories...

A few days ago I published my blog to GitHub, with my MailGun API key in the config file (stupid mistake, I know). In less than 12 hours, spammers had harvested the key AND sent a few thousand emails with my account, using my entire monthly limit.

Thankfully I was using the free MailGun account, which is limited to only 10,000 emails/month, so there was no material damage. Their tech support was awesome in immediately blocking the account and notifying me, and then quickly helping to unblock the account after keys and passwords were changed, and repo made private.

I was wondering exactly how they were able to harvest GitHub content so quickly; it couldn't be web scraping or a random search. This article explains well how to drink from GitHub's events firehose and the GHTorrent project, so everything makes sense now. Thanks for posting it.

EDIT: This other post[1] describes a similar situation. There are some folks monitoring ALL GitHub commits and grabbing passwords as they are committed, on the fly.

[1] http://www.devfactor.net/2014/12/30/2375-amazon-mistake/

I had a similar but less pleasant experience. I had decided to open-source an old side project of mine that gets a good number of users daily. Initially, that just meant making the repo public. But I had totally forgotten about the mail server keys. This was a paid mail server, so you can imagine my disbelief when I got an email with a $1000 bill and a complaint saying that I had sent upwards of 250k emails of what seemed to be an iOS mail app malware email. Luckily it was resolved within a week with support.

I'm curious. Did they excuse the bill or was this a $1000 lesson?

Yup, it was credited as they checked the IPs of the server that was sending those requests. It was clear that it was malicious, also I had been a long time customer.

To be fair to you, part of being a paid mail provider is dealing with this kind of stuff daily. I am surprised they didn't stop it WAY before it hit that send count.

Yeah, it's weird because I was subscribed to a much lower email plan anyway. Somehow, that gave them the okay to auto-upgrade my account and 'release the hounds.'

Also, this was a reputable email provider that many of you know of (I believe it went through one of the incubators).

There's a fairly straightforward pattern for keeping sensitive credentials out of github. It comes straight from http://12factor.net/config: store configuration data in the environment.

What I do for most projects is keep the tree containing the working directory in a directory that has some other items that don't belong on github (like the project brief, my emacs bookmarks file, random notes related to the project etc. ) and in that directory there is a .credentials file containing a set of export statements somewhat like:

    export AWS_USER_ID=############

If I'm feeling extra paranoid, I'll encrypt that into a blob that I only decrypt when I'm working on said project.

Then at startup the app goes looking for its config in the environment. This does create issues for some environments (solving this for docker is trivial), but you can usually pass environment variables to whatever is executing your code reasonably securely. Now it's not perfect, and environments can sometimes be revealed externally if an attacker is determined and clever and focused on your app for some reason.

But it does give you a hygienic procedure that keeps your credentials that are equivalent to an open draw on your bank account out of public repositories.
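For what it's worth, the startup lookup described above might look something like this in Python. It's only a minimal sketch: `load_config` and `MissingConfigError` are names made up for illustration, and `AWS_USER_ID` is just the variable from the export line above.

```python
import os


class MissingConfigError(RuntimeError):
    """Raised when a required environment variable is absent or empty."""


def load_config(required, environ=None):
    """Return a dict of the required settings, read from the environment."""
    env = os.environ if environ is None else environ
    missing = [name for name in required if not env.get(name)]
    if missing:
        # Report names only; never echo secret values into logs.
        raise MissingConfigError(
            "missing environment variables: " + ", ".join(missing)
        )
    return {name: env[name] for name in required}


if __name__ == "__main__":
    # The value here is a placeholder, not a real credential.
    demo_env = {"AWS_USER_ID": "placeholder-id"}
    print(load_config(["AWS_USER_ID"], environ=demo_env))
```

Failing fast at startup means a missing credential shows up as a clear error instead of a confusing authentication failure later on.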

I use the `dotenv`[1] package with Node.js and it does exactly the same thing: environment variable definitions that you can store elsewhere in a dead-simple format.

To be fair, I think they just copied the `foreman` tool from Heroku. However, it works great. Most projects don't need anything more than a flat hierarchy of secret keys and values.

Writing your own parser for a `.env` file is a piece of cake, even in shell language.

Adding `etcd` is better, but it's too much work for a small project.

[1] https://github.com/motdotla/dotenv
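To back up the "piece of cake" claim, here is a rough sketch of a `.env` parser in Python. It is not the grammar that `dotenv` or `foreman` actually implement; the function name `parse_dotenv` and the rules it follows (skip blanks and `#` comments, allow a leading `export`, strip one pair of quotes) are assumptions chosen for illustration.

```python
def parse_dotenv(text):
    """Parse KEY=value lines into a dict, ignoring blanks and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        # Accept shell-style "export KEY=value" lines as well.
        if line.startswith("export "):
            line = line[len("export "):].lstrip()
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not an assignment; skip rather than guess
        key = key.strip()
        value = value.strip()
        # Strip one pair of matching surrounding quotes, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
            value = value[1:-1]
        values[key] = value
    return values
```

The returned dict can then be merged into `os.environ` at startup, with the `.env` file itself kept out of version control.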

12 factor and .env make a lot of sense. However they still leave some questions unanswered, such as:

How do the secrets get safely distributed to the machines where they are needed?

How to revoke/rotate a secret, especially once a compromise is suspected?

How to perform all this in DevOps-y, automated systems?

This is the problem space I work in.

It really depends on the scale you're working at; and whether you can assume that there will be someone available to supply credentials for an instance that had to restart.

I've used fabric and ansible to push configs out to small sets of hosts, and yes, assumed that the sensitive bits were OK sitting on the filesystem of the production host, since if an attacker had access to the filesystem there would be more issues and I'd have to invalidate those credentials anyhow.

At a larger scale you'll want something like etcd or consul or even just a centralized key server that new instances call and ask for their configuration.

The thing is that anything predicated on HMAC secrets is vulnerable to those secrets being exposed. The secret has to be in the clear at some point to perform authentication or signing and a sufficiently determined attacker will be able to get that string.

A system is only as secure as the humans running it can confirm it to be secure. This is why it's best to reduce your attack surface and ensure that you can log access and do process inventory and egress filtering and the whole checklist of prevention, detection and remediation. There is no magic pixie dust that will make your system fully secure; you will always be making tradeoffs and managing risk rather than eliminating it.

Those who use ansible can use ansible-vault to encrypt credentials (http://docs.ansible.com/playbooks_vault.html) and chef has encrypted data bags (https://docs.chef.io/chef/essentials_data_bags.html). Really, any raw config shouldn't ever make it in the repo anyway, other than sample.conf.

If you're feeling fancy, you can use my library to asymmetrically encrypt credentials using RSA keys [1].

[1]: https://github.com/jacobgreenleaf/greybox

The fact that this uses RSA directly seriously worries me. Is the RSA library using OAEP? Does it properly blind its inputs before signing? What's the modulus? Does key generation avoid using weak keys?

Maybe the answer to these questions and others is satisfactory, but getting RSA catastrophically wrong is easy enough that I'm extremely skeptical that a library will get it right. Honestly, I'd be infinitely more likely to use your library if it just used GPG under the hood. That's one less piece of crypto I feel compelled to audit.

If you read the code it's obvious that it uses a Python library named 'rsa', which implements PKCS#1.

You have completely missed the point.

I did read the code, and I saw it used the `rsa` library. I read the code for that library, and also saw it claims to use PKCS#1 padding. None of these obviates my point.

There are dozens of other ways to fuck up an RSA implementation. Some obvious, many not. I am not an expert in Python, nor am I an expert in auditing secure RSA implementations. Neither are most of this project's intended audience, I would warrant.

Using RSA like this directly, in my opinion, dramatically increases the likelihood of a significant implementation oversight when compared to something as widely-used, audited, and established as GPG. And it should cause security-conscious users to be much more distrustful of it.

As a security professional, adding to the list of libraries and crypto implementations for me to audit does not reduce my workload: it massively increases it. If it were a conceptually simple wrapper around GPG, I would consider deploying it without a second thought. GPG, while crusty and imperfect, is at least more difficult to misuse. As it stands, I would need to spend significant time relearning RSA implementation best practices and ensuring it adheres to them.

The fact that others aren't likely to do (or be capable of doing) this legwork only makes the problem worse; bad crypto is often little better than no crypto. And until proven otherwise, the default assumption should be that something uses bad crypto.

It should be noted that GitHub's article on removing sensitive data is still applicable if you haven't pushed anything back to GitHub yet. Remember that a commit is just an entry into your repo, it doesn't synchronize with `origin/master` until you tell it to. So if the user has not pushed to GitHub yet, but has committed in their local Git repo, they should follow GitHub's guide and not worry about changing any keys.

While it's absolutely true that if the credentials haven't been pushed then you are not compromised, I would still encourage people to rotate their credentials regardless.

All it takes to be compromised is a mistake when deleting the sensitive information, or having pushed without realizing it. Even if you're absolutely positive there wasn't a breach, it can be a good excuse to drill for a _real_ breach later.

It never hurts to walk through the practice of what to do if credentials leak when there's no pressure.

A credential rotation drill does sound useful, but I'm kind of uncomfortable with assuming like that; it seems like sloppy thinking that can cause trouble down the line.


If you ever put anything out on the Internet, not just to GitHub, consider it to be public information. Forever. You might be able to convince archive.org to remove it, but there are hundreds of players out there who aren't as ethical.

Ben Franklin figured this out many years ago:

   Three can keep a secret,
   if two of them are dead.

So many words for one simple principle: if sensitive data has been publicly accessible or transferred in plaintext over the internet, consider it compromised, logged, stored, and abused.

The only recourse is to immediately change or revoke access.

I think this problem is widespread enough, and there are enough idiots out there (me included), that there should be a feature request for Github to provide a prompt in case Github detects sensitive information in the code hosted.

Amazon crawls Github looking for keys and disables them. Happened to my company once!

> there should be a feature request for Github to provide a prompt in case Github detects sensitive information in the code hosted.

Sure, just enumerate any and all possible types of sensitive data, the format they may be in, regex / matching functions to account for them (supported across 20+ programming languages) and I'm sure Github will have that done asap.

Alternatively, don't commit passwords/API-keys/sensitive-info to your repo.

> Sure, just enumerate any and all possible types of sensitive data

False dichotomy. It doesn't have to be "everything" or "nothing". An 80% solution here is better than nothing.

I still find it useful that gmail warns me before I send an email without an attachment if I've written "I've attached" in an email. Can gmail detect with 100% accuracy if I intended to send an attachment? Of course not. But the 80% solution here still does me and lots of other people a lot of good.

Also, they could certainly enumerate all the kinds of sensitive data that bots are automatically scraping and detect that, and that would be a major improvement because it'd mean people don't have their API credentials stolen and abused faster than they can remove them.
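To make the "80% solution" concrete, here's a rough Python sketch of a scanner for a couple of well-known formats. The two patterns are assumptions picked for illustration (the `AKIA` prefix commonly seen on AWS access key IDs, and PEM private key headers); a real list would be longer and would need ongoing maintenance. `PATTERNS` and `scan_text` are made-up names.

```python
import re

# Illustrative patterns only; not an exhaustive or authoritative list.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_block": re.compile(
        r"-----BEGIN (?:RSA |DSA |EC )?PRIVATE KEY-----"
    ),
}


def scan_text(text):
    """Return (line_number, pattern_name) pairs for lines that look sensitive."""
    findings = []
    for number, line in enumerate(text.splitlines(), start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((number, name))
    return findings
```

Something like this could run as a pre-commit hook or server-side check and simply warn; it will miss plenty, which is the point of calling it an 80% solution.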

80/20 rule goes a looooooong way, but relying on a 3rd party to handle your infosec is terribly misguided. Offloading your net/infosec issues to a 3rd party doesn't guarantee you understand the attack surface nor mitigation steps. The advantage of doing it yourself is you learn and integrate knowledge as you go. Yes, more error prone, but imo it is more lasting as you (generally) retain the knowledge as time goes on. The issue with relying on a 3rd party is it becomes a "don't-care" because 'someone' else is taking care of it for you (github, sr. sw engr, sysadmin, etc.).

There will never be a 100%-fool-proof "Did you mean to commit this sensitive unicode string?" but getting in front of that with a, "ok, I've checked my code, ran my tests, pruned the sensitive data, is there anything I'm missing?" will go a long way both in present and future times.

There's an issue with your example. Google looks for the substr 'attach', but does it know that a file with a string 20 chars from the newline 50 lines deep with two single quotes is actually your root password? There's a world of difference between keying off of a word/phrase and understanding the context of a larger document+metadata. Trusting computers to do the latter will result in bad times for all, while learning for yourself can help spread the knowledge to those that are both technically inclined or not (infosec is everyone's issue!).

> but relying on a 3rd party to handle your infosec is terribly misguided.

No one's suggesting anything like that. It's just a basic protection that Github could offer its users, because its users are humans, and humans make mistakes.

But that's my point. It's not basic because it's so context dependent (hence the comment about the regex/functions).

No matter how clever you get with your pattern matching, you're going to have to always play catch-up with Web Framework N+1's format / weird-ass package manager. The dual approach is the only sensible approach because it expands your coverage. The important thing to this is to internalize the knowledge learned from 3rd parties and integrate that into your native process / tools, but not everyone will do that.

It'd be grand if we could say, "yeah, they should take care of my security for me because I'm paying them" but reality is a bitch. It doesn't matter what they were 'supposed' to do if there was an infosec leak or attack, you can't ctrl-z that (set of) event(s) and that info is out there, so it must be fixed (re-roll credentials, regen keys/certs, etc) and accounted for next time. There really is no 'getting ahead' but 'being less behind' will at least help mitigate getting eaten from the herd ;-)

If only github had a community of developers that could play such a catch up game... oh wait.

It's kind of amusing you are arguing that it's impossible to play this game, even though that's exactly what the perpetrators are doing, they are automatically detecting API keys and harvesting the code... maybe their script is hosted on github?

My point in my OP was to not play the game of catch-up, don't even pitch in your vuln strs.

Any time you want to show me a 100% future-proof algorithm for sensitive-info detection that works across any/all code on github, I'd be happy to toss my hat in and say, "I was wrong", until then, people will never ever beat 0days they don't know exist (0day being more than just a SW exploit). Just do.not.commit.sensitive.info.to.github. Period. That is the only sure way to not mess it up. Software only executes what is in the code, regardless of how nonsensical it is (aka, your code will not save you from messing up, something something something, PEBKAC)

I'm not arguing it's "impossible to play this game". It's 100% possible to play it when and how you'd like. I'm discussing the rate of "did I win (read: not get pwnd)?" It's cat and mouse of automation for vulnerable/sensitive info ... but all of that is rendered moot if you ... wait for it ... don't commit it to Github, which would mean it wouldn't get to Github's API, which means it wouldn't appear in 3rd party services sucking the firehose from Github's cloud-y silicon teat.

And expanding on this ... committing your passwords and sensitive info to your code-repo is so misguided it's actually funny. What happens if you have an employee and they go off the deep end? Whoops, gotta rotate all those passwords/credentials/un-fuck every branch/resync dev's machines/etc. Keeping that sensitive info in a private, self-hosted, well-maintained internal repo (with strong ACLs, especially wrt server/hosting environments) will go significantly further for your team's security than submitting a feature request to a company to stop you from making arbitrary mistakes every so often.

You're straw-manning. The suggestion was that it would be useful to have a best-effort system to try to detect when people make mistakes. I don't think there's any suggestion that it should be something that people rely on, or that the system should or could be perfect - merely that it would be useful.

I wouldn't go so far as saying I was creating a straw-man argument. My point is just that relying on other people to take care of security for you ends up with an "eh" attitude in the long-run, and self-education is more important.

Yes, Github can/should help, but developers should not think they're owed it just because they constantly check in sensitive info to a website, that's all.

So Police are pointless and a result of an "eh" attitude?

(I'm just applying your point to another kind of 'security' that is provided to you).

You are basically arguing the "We should all live in the woods, and hunt and kill our own food, because relying on other people is fraught with danger" line.

Or we could accept mistakes are made, and provide warnings/undos/etc. Kind of like how cars have airbags, even though they are rendered moot by just "not crashing your car"

> Alternatively, don't commit passwords/API-keys/sensitive-info to your repo.

This is, of course, the right answer.

However, it's frustrating that several frameworks make this very easy to get wrong. Anything that has an application.yml, database.yml, or similar configuration file that normally lives within the same directory as the source code, and which is intended to contain credentials, means that lots of people will make that mistake.

It's one of the fundamental errors that you see so often in web frameworks like Rails, this whole idea of mixing application code and configuration files into one big tree. I don't know how this practice caught on, but it has, and it so frequently causes mistakes, both big ones like accidentally publishing credentials, and simply frustrating ones like confusing the difference between application code and configuration in ways that make it more difficult to have multiple local configurations and updating the application code independently.

80:20 rule. I am sure that checking for api_key, password (and variants), and key would keep 80% of people from making this mistake.

Yeah, but you write api_key all over your code.

The point where you define it is sensitive, but the places where you use it are not.
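The define-versus-use distinction can be approximated by only flagging lines where a secret-looking name is assigned a quoted literal. This is a hedged Python sketch; the name list and the four-character minimum are arbitrary assumptions, and `find_hardcoded_secrets` is a hypothetical helper.

```python
import re

# Flags e.g. api_key = "abc123" or password: 'hunter2',
# but not api_key=api_key or password = os.environ[...].
ASSIGNMENT = re.compile(
    r"""
    \b([a-z_]*(?:api_key|password|passwd|secret|token)[a-z_]*)  # suspicious name
    \s*[:=]\s*                                                  # '=' or YAML-style ':'
    (["'])([^"']{4,})\2                                         # quoted literal value
    """,
    re.IGNORECASE | re.VERBOSE,
)


def find_hardcoded_secrets(source):
    """Return (line_number, name) for literal assignments to secret-like names."""
    hits = []
    for number, line in enumerate(source.splitlines(), start=1):
        match = ASSIGNMENT.search(line)
        if match:
            hits.append((number, match.group(1)))
    return hits
```

It will still produce false positives (sample values, test fixtures), so it is better suited to a warning than to a hard block.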

Similarly, a white hat could watch /events and warn users and/or services when credentials are 'burned'. (A major exploitable service like AWS might even want to do this itself.)
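A rough sketch of the first step of such a watcher, in Python: given a batch of already-fetched public events, pick out the pushed commits worth inspecting. The event shape used here (`type`, `repo.name`, `payload.commits[].sha`) is an assumption about what the events feed returns and should be checked against the current API docs; fetching, scanning, and notifying are left out, and `pushed_commits` is a made-up name.

```python
def pushed_commits(events):
    """Yield (repo_name, commit_sha) for every commit in PushEvent entries."""
    for event in events:
        if event.get("type") != "PushEvent":
            continue
        repo = event.get("repo", {}).get("name", "")
        # Assumed payload shape: {"commits": [{"sha": "..."}, ...]}
        for commit in event.get("payload", {}).get("commits", []):
            sha = commit.get("sha")
            if sha:
                yield repo, sha
```

Each (repo, sha) pair could then be fed to whatever detection step is used, with the warning going to the committer or the affected service.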

AWS actually does that already — they'll ping you by email if they find one of your keys on GitHub!

Why do they take longer than the black hats, though?

Thinking of actually working on this. Seems like an interesting and useful little project. Can email users or create an issue on the repo.

I don't know about that. It seems to me that you are wanting Github to do what the committer should be doing.

To be clear, the guide from GitHub that's linked at the top of this article clearly states that you should consider the sensitive data compromised. Cleaning it out of the repo is a good move, but it's a companion move to rotating out those creds or whatever for new ones.

That guide highlights this in its own box. In red.

Not sure how GitHub could make it more obvious.

Perhaps if they mentioned there are unscrupulous users out there who have a script that hammers GitHub's events API to search for exposed passwords/keys, then it would reduce the 'oops I only pushed it for a second' thinking that users likely go through.

My advice: USE PRIVATE REPOS! At $7/month Github's micro plan with 5 repos is just $1.40/repo-month. This is the cheapest insurance you can get against the nearly inevitable mistake of committing something sensitive.

Could also have a look at BitBucket instead: unlimited private personal (or teams of max 5) repos at $0/month. Or for $7/month you can host your own at DigitalOcean/Azure/...

Sure, use private repos for private projects but this is about open-source authors accidentally leaving their credentials in config files and the like.

Anecdotally this seems to happen mostly from personal experiments, not from development of open source software meant to be consumed by others. Secrets generally are included in projects that are meant to be deployed (like a rails app, or a blog, or ...), not in the libraries/gems/modules that make up the bulk of open source found on Github.

Always use environment variables. They are probably the best way to safeguard your API keys.

I've always wondered the proper way to deal with this, and this makes total sense. How would you typically set such an environment variable? In bash init?

Environment variables can work well for development but I wouldn't put them in .bashrc or .bash_profile; if you are like me, you like to store your dot files somewhere public. I typically leave them in an encrypted file on dev systems, but this only solves the accidental over the shoulder problem. Production systems require another level of security altogether.

Typically, I've seen services run in restricted user accounts with limited system access, reading passwords out of an encrypted file. This file is stored in some obscure location on the box, and that user account is the only one with read permission on it.
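The permissions half of that setup can be enforced by the service itself. Below is a minimal Python sketch, assuming a POSIX system, that refuses to read a secret file if group or others have any access to it. The decryption step is omitted, and `read_secret_file` is a hypothetical helper name.

```python
import os
import stat


def read_secret_file(path):
    """Read a secret file, refusing if group or others have any access."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    if mode & (stat.S_IRWXG | stat.S_IRWXO):
        raise PermissionError(
            f"{path} is accessible by group/others (mode {mode:o})"
        )
    with open(path, "r", encoding="utf-8") as handle:
        return handle.read().strip()
```

This is the same kind of check OpenSSH applies to private key files: loose permissions are treated as an error rather than silently accepted.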

Keep in mind, every system has weaknesses and I am still interested in listening to others' approaches.

> if you are like me, you like to store your dot files somewhere public.

In your public .bashrc, put the line "source .bashrc.secret". Just keep an empty .bashrc.secret in your public repository, and keep your actual secret credentials on your machines.

And don't forget to add it to .gitignore; otherwise, if it is accidentally overwritten, it might land in the public repo.

That's a fantastic way to TELL attackers what filename to search for on a filesystem if they have access to your source code. Randomizing filenames and forcing an attacker to have to write a custom utility to find the path to files keeps you from getting hit by a number of drive-by hackings. And every single incorrect use of a credential must be recorded off-system and monitored. Avoid using defaults in general for any third party software and you can do things like generating random paths to S3 buckets that contain certificates and environment variables in your own software. S3 buckets are incredibly secure if you tack on CloudHSM plus use host certificates effectively with IAM policies.

Otherwise, I'd try to use keystore systems available on your respective OS or language platform toolchain (CSP on .NET, JCE for Java, I dunno wtf else you'd use for anything else because the only people I've heard of that want to go this far are all F500 enterprises basically with software in exactly those two languages only).

You can create a global .gitignore: https://help.github.com/articles/ignoring-files/

Envdir [1] or its python port [2] are one way to organize environment variables

[1] http://cr.yp.to/daemontools/envdir.html [2] http://envdir.readthedocs.org/en/latest/
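For anyone who hasn't seen the envdir layout: it is a directory with one file per variable, where the file name is the variable name and the first line is the value. Here's a simplified Python sketch of reading such a directory. It doesn't reproduce every rule of the real tools (for example, how empty files are handled), and `read_envdir` is a made-up name, not the API of the Python port.

```python
import os


def read_envdir(directory):
    """Return {filename: first line} for each regular, non-hidden file."""
    values = {}
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if name.startswith(".") or not os.path.isfile(path):
            continue
        with open(path, "r", encoding="utf-8") as handle:
            first_line = handle.readline()
        # Trailing whitespace and the newline are not part of the value.
        values[name] = first_line.rstrip(" \t\r\n")
    return values
```

One file per secret makes it easy to set tight permissions on each value and to rotate one without touching the others.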

It depends on language. In Node you can access environment variables in code through process.env, and in Python through os.environ; the values themselves can be specified on the command line. In fact, even services like Heroku will let you edit these from their web-based client.

That's really more of an implementation detail though. You'll want a file to specify these environment variables, so you can actually have services start on boot, so that means you need a config file (whether it's used for setting the environ, or read directly by the application is the implementation detail). What's really needed is to correctly separate and secure important configuration options outside of source, and designing for that from the beginning. Using environment variables in a way forces this, which is good, but it doesn't help if there's just a startup script that specifies those variables and it gets accidentally committed to the repo.

I’ve had good luck with foreman [0] (if you’re happy with Ruby). Create a .env file in your project root with your variable pairs and foreman makes them available inside your app. Then you just need to make sure .env is in your .gitignore and you’re happy.

[0] https://github.com/ddollar/foreman

It always amazes me to see the sheer amount of API keys left around in GitHub repositories. You can search anything like Twilio API Key and come out with hundreds of thousands of results. I wonder to what extent these keys have been exploited.

A script to post random key-containing config-like files to public repos and waste these guys' bandwidth/light them up on amazon's blacklist radar would be a cool idea.

On MacOS there's keychain - it's a designated place for storing secrets.

On Windows I create a batch file at a fixed location with all the credentials in it. A script simply runs this batch file and reads the env vars to get values. A compiled program parses the batch file with regex to find required values. This works remarkably well for keeping credentials out of the code base.

Hope that helps someone.

Github has a very good cache. In the past, when I deleted a repository I was still able to access some diff and commit information from my own activity pages. I had to ask the Github team to clear that page manually.

I'm very certain this is a hacker's account configured to follow a great deal of projects and people (2k projects, 1.3k users) for this very purpose -- a suspicious [redacted] [unknown] profile, https://github.com/trnsz

First time I tried to use github I uploaded my gmail password which I was using to send myself an email when something failed. I figured that there would be bots that would scoop up that information right away. Luckily I realized what I had done before people could get into my gmail.

Thinking of actually working on a tool for this. Will have a blacklist of "searches" that might contain sensitive data and perhaps notifying via the email of the committer or creating an issue on the repo. Anyone else want to get involved?


They really don't. You can read them from environment variables (which would be set by your platform eg. heroku) or you could set up a vagrant instance with the necessary services to develop locally.

"API keys [...] need to be in your committed code"


> "API keys [...] need to be in your committed code"

No they don't.

>> "API keys [...] need to be in your committed code"

> No they don't.

A better approach is what the rest of the comments here are suggesting:

(1) store your secrets (API keys, certs, credentials, whatever) in a highly-secure system, with both strong encryption and immutable audit logging around their access and modification;

(2) expose those secrets at run/compile-time via variables, such that the secret is never stored on-disk anywhere other than in the highly-secure system from (1) and in transient storage while in use;

(3) wrap an authz layer around variable access, so that only authorized services/users/hosts (those that have authenticated properly and who are allowed access via the authz policy established here) can read/write/mutate the secrets

It's (basically) the "privileged identity management" space; the challenge is that the commercial software in that market hasn't kept up with the combination of automated ops infrastructure and cloud-hosted dev tools. There are some ideas around how to do 1-3 better, with a devops/cloud-native design built in. (Full disclosure: I'm part of the founding team at a company doing this.)
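As a toy illustration of how points (1) through (3) fit together, here's a Python sketch of a store with an allow-list policy and an audit log. It is deliberately naive: everything lives in memory, nothing is encrypted, and the log is only append-only by convention. `SecretStore` and the identities used with it are hypothetical.

```python
import time


class SecretStore:
    """Toy store: secrets, a read allow-list per secret, and an audit log."""

    def __init__(self):
        self._secrets = {}
        self._policy = {}     # secret name -> identities allowed to read it
        self._audit_log = []  # (timestamp, identity, secret name, allowed)

    def put(self, name, value, readers):
        self._secrets[name] = value
        self._policy[name] = set(readers)

    def get(self, identity, name):
        allowed = identity in self._policy.get(name, set())
        # Record every attempt, including denials.
        self._audit_log.append((time.time(), identity, name, allowed))
        if not allowed:
            raise PermissionError(f"{identity} may not read {name}")
        return self._secrets[name]

    def audit_trail(self):
        # Hand out a copy so callers cannot edit the log in place.
        return tuple(self._audit_log)
```

A real system would authenticate the caller before trusting `identity` and would hand the value to the process at run time rather than writing it to disk.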

The PIM software I've seen in enterprise (stuff even older than what Cyber-Ark has) has barely even kept up with software from the early 2000s let alone modern automated operations infrastructure. APIs that are written for XML-RPC and even XDR for crying out loud (that implies that even TCP was a tough sell for them). Automating them has been an exercise in incredible pain for few rewards.

Even AWS CloudHSM is not revolutionary conceptually as much as from a compliance and paperwork standpoint. I think there really needs to be emphasis on a (4) - all secrets must be rotated and revokable on-demand and on semi-random schedule. The goal is to make any credential only valid for a period of time less than what an attacker that is already present on your systems would need to further increase presence or to compromise any of 1-4. Who cares if an instance is owned if it's up for maybe 10 minutes and can literally only communicate on a specific port to a specific server with a specific protocol?
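The idea in (4) of credentials that expire quickly can be sketched in a few lines of Python. This is only an illustration: the 600-second default mirrors the "10 minutes" figure above, and `ShortLivedCredential` and `issue_credential` are made-up names rather than any product's API.

```python
import secrets
import time


class ShortLivedCredential:
    """A random token paired with an absolute expiry time."""

    def __init__(self, token, expires_at):
        self.token = token
        self.expires_at = expires_at  # seconds since the epoch

    def is_valid(self, now=None):
        current = time.time() if now is None else now
        return current < self.expires_at


def issue_credential(ttl_seconds=600, now=None):
    """Mint a token that expires ttl_seconds after it is issued."""
    issued_at = time.time() if now is None else now
    return ShortLivedCredential(
        secrets.token_urlsafe(32), issued_at + ttl_seconds
    )
```

The verifying side has to check `is_valid` on every use; expiry only helps if stale tokens are actually rejected.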

Unfortunately, this is all only reasonable in a highly automated architecture and is basically impossible with almost every single company I've ever seen that's ever uttered the mere word ITIL because those companies tend to be people-driven cultures for everything, not process-driven (most companies try to add policies that are so ineffectual and meaningless that everyone reverts back to tribalism similar to how everyone defaults to e-mail when collaboration tooling is ineffective) that you have to figure out to be effective in cloud environments.

I do devops and security automation as well, and there's nothing self-serving about your points if you ask me.

I would definitely love to hear what you think of our stuff. Here's a link; I have chosen a description of how secrets can be stored, distributed over HTTPS, and wrapped with a script that exposes them as environment variables.


Funny, we just had this hit the front page, arguing against environment variables for secrets: https://news.ycombinator.com/item?id=8826024

It's not clear from the doc you linked that you would support AWS STS, which is probably the right way to approach minimal privilege and to reduce the time window that an attacker would have the privileges of the entity compromised. Wish I had a way to calculate that out from the tools I had which helps drastically during an investigation to sift through network logs.

What you seem to have built so far is what could be used to build a more modern shared secret access stack rather than being a full solution itself. Most companies that want to pay for something want to have something that will rotate out keys & passwords or enforce secrets policies like separation of keys across different nodes in your high availability solution for them (eg. the DB, root, and LDAP cached passwords should not be stored on the same data node even in encrypted form). Otherwise, a lot of companies have built equivalent solutions like Conjur already (to varying degrees of success depending upon how dysfunctional their IT already is). A lot of the custom solutions I'm familiar with in Defense / IC space are starting to use Apache Accumulo to enforce a great deal of sharing and storing of secrets. The architecture of that makes it possible to have tables split both column-wise and row-wise across multiple nodes based upon business rules like HIPAA, FISMA, PCI-DSS, etc. Tack on Zookeeper with some SASL and you'll spend the next year or two just arranging the meetings to figure out the security rules.

For an analogy, it seems like you've built a lot of the workings of Postgres missing something important like procedural queries and triggers, but organizations really want an ORM (they just don't even realize it because the whole industry is built around bikeshedding topics in security). Build something respecting the vernacular and culture of engineers, IT opsec / compliance, and (more importantly) the managers of both orgs and you should have a winner. Ok, after you find the right sales guys to get the attention of some F500s that are in terrible industries wracked by compliance BS 24/7.

All in all, good idea and it looks promising. I'll keep your product in mind if I can get a management tool like this even suggested. We're following some extremely bad practices at present in order to avoid violating OTHER no-nos around keeping stuff out of the public cloud. Our IAM across dozens and dozens of AWS accounts is completely bonkers, and the resulting bungling of credentials is probably causing worse security problems than if we just gave them all the same keypairs. It'd be really interesting to see this work seamlessly across both AWS-like environments and a vSphere/vCAC/vCD type of environment, using affinity / anti-affinity rules to make initial guesses about your security configuration. For example, I'm pretty sure everything in an autoscaling group should by default be in the same group or "layer" (in your terminology), and you could start with the same for vSphere compute clusters unless host anti-affinity rules for a VM are present. Those usually mean that the VM is not allowed to cross a physical boundary, which hints at a business-level policy rather than a technical one (nobody does cross-geographic clusters besides Google last I saw, and you probably aren't going to be able to sell this to them....).

One thing that would tremendously help your documentation would be to provide security scenarios for different user stories and potential users. Admins across multiple tenant business units have different use cases than developers who work in maybe one or two organizations / groups, for example. I found myself expecting an "I am a... X, Y, Z" set of tabs and wanted to see each of their use case scenarios for one or two sample companies with different needs. Besides the "I don't want to be your guinea pig" mentality, this is what companies are really looking for half the time they ask for a reference customer.

Millions of emails for developers and no one harvesting this info thought it wise to obfuscate it in some way?

Obfuscate the emails that are clearly visible on github.com? I think by this point it should be pretty clear that email addresses, like domain names, are not secret information. Obfuscating the data would have mangled useable information for a goal of blocking people who would have had no trouble gathering that information anyways.

Emails are part of commits, which are hashed as part of normal git operations. There's no way to make repositories public and not make emails public.

You could set your author email address to an empty string before committing.

Are there any open source git hooks that will scan your code for known credential formats?
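
For a sense of what such a hook might do, here is a minimal sketch in Python, not an existing tool. It assumes the hook only inspects lines added in the staged diff, and the two patterns (the `AKIA` prefix used by AWS access key IDs and PEM private key headers) are illustrative; the names `PATTERNS`, `find_secrets`, `staged_diff` and `main` are made up for this example.

```python
import re
import subprocess
import sys

# Illustrative patterns only (an assumption); a real scanner would carry many more.
PATTERNS = {
    "AWS access key ID": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private key header": re.compile(r"-----BEGIN (?:RSA |DSA |EC )?PRIVATE KEY-----"),
}


def find_secrets(text):
    """Return (line_number, label) pairs for every line that matches a pattern."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((number, label))
    return hits


def staged_diff():
    """Return the staged changes as text, with no context lines."""
    result = subprocess.run(
        ["git", "diff", "--cached", "--unified=0"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def main():
    """Scan added lines of the staged diff; return 1 if anything looks like a secret."""
    added = "\n".join(
        line[1:]
        for line in staged_diff().splitlines()
        if line.startswith("+") and not line.startswith("+++")
    )
    hits = find_secrets(added)
    for number, label in hits:
        print(f"possible {label} in added line {number}", file=sys.stderr)
    return 1 if hits else 0
```

Saved as an executable `.git/hooks/pre-commit` that ends with `sys.exit(main())`, a non-zero return would make git abort the commit. Pattern matching like this only catches formats it knows about, so it complements rather than replaces rotating any key that was ever pushed.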

What would happen if you published a few billion fake credentials?

TLDR: It won't save you because people could have copied the information before you deleted it.


TL;DR: GitHub makes it easy to notice when events have occurred, so easy that you can write tools to copy information as soon as it hits.

This is a bit more nuanced than your summary because GH makes it easy. Without the events API, you would have to poll the various repos to find out if changes happened.

Furthermore, the existence of GHTorrent demonstrates the ease with which this information can be harvested.
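
To illustrate how little code the events API requires, here is a rough Python sketch. It assumes the public endpoint `https://api.github.com/events` and relies only on the `type` and `repo.name` fields of each event; the function names and the `User-Agent` string are invented for the example, and unauthenticated requests are rate limited.

```python
import json
import urllib.request

EVENTS_URL = "https://api.github.com/events"  # public events feed


def fetch_events(url=EVENTS_URL):
    """Download one page of recent public events as a list of dicts."""
    request = urllib.request.Request(url, headers={"User-Agent": "events-sketch"})
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def pushed_repos(events):
    """Return the names of repositories that appear in push events."""
    return [
        event["repo"]["name"]
        for event in events
        if event.get("type") == "PushEvent"
    ]
```

Calling `pushed_repos(fetch_events())` in a loop yields a steady list of repositories with fresh commits, with no per-repo polling needed.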

So what you are saying is that because Github provides a clean API we should be more careful about posting sensitive data to it versus other things where you can simply scrape index pages?

I fail to see how the warning is meaningful. You should assume anything you post to the public internet may be public forever. The existence of an API changes nothing as far as the amount of care you should take.
