
Why Deleting Sensitive Information from GitHub Doesn't Save You - jwcrux
https://jordan-wright.github.io/blog/2014/12/30/why-deleting-sensitive-information-from-github-doesnt-save-you/
======
guiambros
_> In this post, I’m going to show exactly how hackers instantly harvest
information committed to public Github repositories..._

A few days ago I published my blog to GitHub, with my MailGun API key in the
config file (stupid mistake, I know). In less than 12 hours, spammers had
harvested the key AND sent a few thousand emails from my account, using up my
entire monthly limit.

Thankfully I was using the free MailGun account, which is limited to only
10,000 emails/month, so there was no material damage. Their tech support was
awesome: they immediately blocked the account and notified me, then quickly
helped unblock it after the keys and passwords were changed and the repo was
made private.

I had been wondering exactly how they were able to harvest GitHub content so
quickly; it couldn't have been web scraping or a random search. This article
explains well how to drink from GitHub's events firehose and the GHTorrent
project, so it all makes sense now. Thanks for posting it.

EDIT: This other post[1] describes a similar situation. There are folks
monitoring ALL GitHub commits and harvesting passwords as they are committed,
on the fly.

[1] [http://www.devfactor.net/2014/12/30/2375-amazon-mistake/](http://www.devfactor.net/2014/12/30/2375-amazon-mistake/)

~~~
infinitone
I had a similar but less pleasant experience. I decided to open-source an old
side project of mine that still gets a good number of users daily; initially
that just meant making the repo public. But I had totally forgotten about the
mail server keys. This was a paid mail server, so you can imagine my disbelief
when I got an email with a $1000 bill and a complaint saying that I had sent
upwards of 250k emails of what seemed to be iOS mail app malware. Luckily it
was resolved within a week with support.

~~~
andyjdavis
I'm curious. Did they excuse the bill or was this a $1000 lesson?

~~~
infinitone
Yup, it was credited once they checked the IPs of the server that was sending
those requests. It was clear the activity was malicious, and I had been a
long-time customer.

------
olefoo
There's a fairly straightforward pattern for keeping sensitive credentials
out of GitHub. It comes straight from
[http://12factor.net/config](http://12factor.net/config): store configuration
data in the environment.

What I do for most projects is keep the working tree inside a parent
directory that holds other items that don't belong on GitHub (the project
brief, my emacs bookmarks file, random notes related to the project, etc.),
and in that parent directory there is a .credentials file containing a set of
export statements somewhat like:

    
    
        export AWS_ACCESS_KEY=XXXXXXXXXXXXXXXXXXXX
        export AWS_SECRET_KEY=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
        export AWS_USER_ID=############
    

If I'm feeling extra paranoid, I'll encrypt that into a blob that I only
decrypt when I'm working on said project.

Then at startup the app looks for its config in the environment. This does
create issues in some environments (solving this for Docker is trivial), but
you can usually pass environment variables to whatever is executing your code
reasonably securely. It's not perfect, and environment variables can
sometimes be revealed externally if an attacker is determined, clever, and
focused on your app for some reason.
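
As a sketch, loading that `.credentials` file before starting the app can be wrapped in a tiny helper (the function name and paths are mine, not a standard):

```shell
# load_credentials FILE - export every assignment in FILE into the
# current environment. Works for plain KEY=VALUE lines as well as
# explicit `export KEY=VALUE` statements.
load_credentials() {
  set -a          # auto-export any variable assigned while this is on
  . "$1"          # source the credentials file
  set +a
}

# Typical use before starting the app (paths are illustrative):
#   load_credentials ../.credentials && exec ./my-app
```

The app itself then reads `AWS_ACCESS_KEY` and friends from its environment only, never from a file inside the repository.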

But it does give you a hygienic procedure that keeps credentials that are
equivalent to an open drawer on your bank account out of public repositories.

~~~
califield
I use the `dotenv`[1] package with Node.js and it does exactly the same thing:
environment variable definitions that you can store elsewhere in a dead-simple
format.

To be fair, I think they just copied the `foreman` tool from Heroku. However,
it works great. Most projects don't need anything more than a flat hierarchy
of secret keys and values.

Writing your own parser for a `.env` file is a piece of cake, even in shell
language.

Adding `etcd` is better, but it's too much work for a small project.

[1] [https://github.com/motdotla/dotenv](https://github.com/motdotla/dotenv)
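
As an aside, the "write your own parser" route really is small. Here is a deliberately minimal POSIX-shell sketch that handles only flat KEY=VALUE lines (no quoting rules, no multi-line values):

```shell
# parse_env FILE - export KEY=VALUE pairs from a flat .env file,
# skipping blank lines and '#' comments. Minimal by design.
parse_env() {
  while IFS='=' read -r key value; do
    case "$key" in ''|\#*) continue ;; esac  # skip blanks and comments
    export "$key=$value"                     # value keeps any embedded '='
  done < "$1"
}
```

Real `.env` files with quoting or interpolation need the real dotenv package; this only covers the flat key/value case described above.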

~~~
kgilpin
12 factor and .env make a lot of sense. However they still leave some
questions unanswered, such as:

How do the secrets get safely distributed to the machines where they are
needed?

How to revoke/rotate a secret, especially once a compromise is suspected?

How to perform all this in DevOps-y, automated systems?

This is the problem space I work in.

~~~
olefoo
It really depends on the scale you're working at, and whether you can assume
that there will be someone available to supply credentials for an instance
that had to restart.

I've used Fabric and Ansible to push configs out to small sets of hosts, and
yes, assumed that the sensitive bits were OK sitting on the filesystem of the
production host, since if an attacker had access to the filesystem there
would be bigger problems and I'd have to invalidate those credentials anyhow.

At a larger scale you'll want something like etcd or consul or even just a
centralized key server that new instances call and ask for their
configuration.

The thing is that anything predicated on HMAC secrets is vulnerable to those
secrets being exposed. The secret has to be in the clear at some point to
perform authentication or signing and a sufficiently determined attacker will
be able to get that string.

A system is only as secure as the humans running it can confirm it to be
secure. This is why it's best to reduce your attack surface and ensure that
you can log access and do process inventory and egress filtering and the whole
checklist of prevention, detection and remediation. There is no magic pixie
dust that will make your system fully secure; you will always be making
tradeoffs and managing risk rather than eliminating it.

------
tomphoolery
It should be noted that GitHub's article on removing sensitive data is still
applicable if you haven't pushed anything to GitHub yet. Remember that a
commit is just an entry in your local repo; it doesn't synchronize with
`origin/master` until you tell it to. So if the user has committed in their
local Git repo but not yet pushed to GitHub, they can follow GitHub's guide
and not worry about changing any keys.

~~~
ncallaway
While it's absolutely true that if the credentials haven't been pushed then
you are not compromised, I would still encourage people to rotate their
credentials regardless.

All it takes to be compromised is a mistake when deleting the sensitive
information, or a push you didn't realize you'd made. Even if you're
absolutely positive there wasn't a breach, it can be a good excuse to drill
for a _real_ breach later.

It never hurts to walk through the practice of what to do if credentials leak
when there's no pressure.

~~~
Dylan16807
A credential rotation drill does sound useful, but I'm kind of uncomfortable
with assuming like that; it seems like sloppy thinking that can cause trouble
down the line.

------
PhantomGremlin
If you ever put _anything_ out on the Internet, not just to GitHub, consider
it to be public information. Forever. You might be able to convince
archive.org to remove it, but there are hundreds of players out there who
aren't as ethical.

Ben Franklin figured this out many years ago:

    
    
       Three can keep a secret,
       if two of them are dead.

------
revelation
So many words for one simple principle: if sensitive data has been publicly
accessible or transferred in plaintext over the internet, consider it
compromised: logged, stored, and abused.

The only recourse is to immediately change or revoke access.

------
nutanc
I think this problem is widespread enough, and there are enough idiots out
there (me included), that there should be a feature request for Github to
provide a prompt in case Github detects sensitive information in the code
hosted.

~~~
mkal_tsr
> there should be a feature request for Github to provide a prompt in case
> Github detects sensitive information in the code hosted.

Sure, just enumerate any and all possible types of sensitive data, the format
they may be in, regex / matching functions to account for them (supported
across 20+ programming languages) and I'm sure Github will have that done
asap.

Alternatively, don't commit passwords/API-keys/sensitive-info to your repo.

~~~
jemfinch
> Sure, just enumerate any and all possible types of sensitive data

False dichotomy. It doesn't have to be "everything" or "nothing". An 80%
solution here is better than nothing.

I still find it useful that Gmail warns me before I send an email without an
attachment when I've written "I've attached" in the body. Can Gmail detect
with 100% accuracy whether I intended to send an attachment? Of course not.
But the 80% solution still does me and lots of other people a lot of good.
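
For a sense of what an 80% detector might look like, here's a tiny grep-based sketch; the two patterns (AWS-style access key IDs and `SECRET`/`PRIVATE` key assignments) are illustrative, nowhere near exhaustive:

```shell
# scan_secrets FILE - print any line that looks like a credential and
# succeed only if something matched. A heuristic, not a guarantee.
scan_secrets() {
  grep -nE 'AKIA[0-9A-Z]{16}|(SECRET|PRIVATE)[_A-Z]*KEY[[:space:]]*[=:]' "$1"
}
```

Run over files about to be committed, it would catch the classic `AKIA...` paste; it would equally miss anything it has no pattern for.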

~~~
mkal_tsr
The 80/20 rule goes a looooooong way, but relying on a 3rd party to handle
your infosec is terribly misguided. Offloading your net/infosec issues to a
3rd party doesn't guarantee you understand the attack surface or the
mitigation steps. The advantage of doing it yourself is that you learn and
integrate knowledge as you go. Yes, it's more error prone, but imo it's more
lasting, as you (generally) retain the knowledge as time goes on. The issue
with relying on a 3rd party is that it becomes a "don't-care", because
'someone' else is taking care of it for you (github, sr. sw engr, sysadmin,
etc.).

There will never be a 100%-fool-proof "Did you mean to commit this sensitive
unicode string?" but getting in front of that with a, "ok, I've checked my
code, ran my tests, pruned the sensitive data, is there anything I'm missing?"
will go a long way both in present and future times.

There's an issue with your example. Google keys off the substring 'attach',
but does it know that a string 20 chars from the newline, 50 lines deep, with
two single quotes, is actually your root password? There's a world of
difference between keying off a word/phrase and understanding the context of
a larger document plus its metadata. Trusting computers to do the latter will
result in bad times for all, while learning it yourself can help spread the
knowledge to those who are technically inclined and those who aren't (infosec
is everyone's issue!).

~~~
jemfinch
> but relying on a 3rd party to handle your infosec is terribly misguided.

No one's suggesting anything like that. It's just a basic protection that
Github could offer its users, because its users are humans, and humans make
mistakes.

~~~
mkal_tsr
But that's my point. It's _not_ basic because it's so context dependent (hence
the comment about the regex/functions).

No matter how clever you get with your pattern matching, you're going to have
to always play catch-up with Web Framework N+1's format / weird-ass package
manager. The dual approach is the only sensible approach because it expands
your coverage. The important thing to this is to internalize the knowledge
learned from 3rd parties and integrate that into your native process / tools,
but not everyone will do that.

It'd be grand if we could say, "yeah, they should take care of my security for
me because I'm paying them" but reality is a bitch. It doesn't matter what
they were 'supposed' to do if there was an infosec leak or attack, you can't
ctrl-z that (set of) event(s) and that info is out there, so it must be fixed
(re-roll credentials, regen keys/certs, etc) and accounted for next time.
There really is no 'getting ahead' but 'being less behind' will at least help
mitigate getting eaten from the herd ;-)

~~~
1stop
If only github had a community of developers that could play such a catch up
game... oh wait.

It's kind of amusing that you're arguing it's impossible to play this game,
even though that's exactly what the perpetrators are doing: they are
automatically detecting API keys and harvesting the code... maybe their
script is hosted on github?

~~~
mkal_tsr
My point in my OP was to not play the game of catch-up, don't even pitch in
your vuln strs.

Any time you want to show me a 100% future-proof algorithm for sensitive-info
detection that works across any/all code on github, I'd be happy to toss my
hat in and say, "I was wrong", until then, people will never ever beat 0days
they don't know exist (0day being more than just a SW exploit). Just
do.not.commit.sensitive.info.to.github. Period. That is the only sure way to
not mess it up. Software only executes what is in the code, regardless of how
nonsensical it is (aka, your code will not save you from messing up, something
something something, PEBKAC)

I'm not arguing it's "impossible to play this game"; it's 100% possible to
play it when and how you'd like. I'm discussing the rate of "did I win (read:
not get pwned)?" It's a cat-and-mouse game of automation for
vulnerable/sensitive info... but all of that is rendered moot if you... wait
for it... don't commit it to Github, which means it never reaches Github's
API, which means it never appears in the 3rd-party services sucking the
firehose from Github's cloud-y silicon teet.

And expanding on this ... committing your passwords and sensitive info to your
code-repo is so misguided it's actually funny. What happens if you have an
employee and they go off the deep end? Whoops, gotta rotate all those
passwords/credentials/un-fuck every branch/resync dev's machines/etc. Keeping
that sensitive info in a private, self-hosted, well-maintained internal repo
(with strong ACLs, especially wrt server/hosting environments) will go
significantly further for your team's security than submitting a feature
request to a company to stop you from making arbitrary mistakes every so
often.

~~~
AlisdairO
You're straw-manning. The suggestion was that it would be useful to have a
best-effort system to _try_ to detect when people make mistakes. I don't think
there's any suggestion that it should be something that people rely on, or
that the system should or could be perfect - merely that it would be useful.

~~~
mkal_tsr
I wouldn't go so far as saying I was creating a straw-man argument. My point
is just that relying on other people to take care of security for you ends up
with an "eh" attitude in the long-run, and self-education is more important.

Yes, Github can/should help, but developers should not think they're owed it
just because they constantly check in sensitive info to a website, that's all.

~~~
1stop
So Police are pointless and a result of an "eh" attitude?

(I'm just applying your point to another kind of 'security' that is provided
to you).

You are basically arguing the "We should all live in the woods, and hunt and
kill our own food, because relying on other people is fraught with danger"
line.

Or we could accept mistakes are made, and provide warnings/undos/etc. Kind of
like how cars have airbags, even though they are rendered moot by just "not
crashing your car"

------
akerl_
To be clear, the guide from GitHub that's linked at the top of this article
clearly states that you should consider the sensitive data compromised.
Cleaning it out of the repo is a good move, but it's a companion move to
rotating out those creds or whatever for new ones.

~~~
fragmede
That guide highlights this in its own box. In red.

Not sure how GitHub could make it more obvious.

Perhaps if they mentioned there are unscrupulous users out there who have a
script that hammers GitHub's events API to search for exposed passwords/keys,
then it would reduce the 'oops I only pushed it for a second' thinking that
users likely go through.

------
femto113
My advice: USE PRIVATE REPOS! At $7/month Github's micro plan with 5 repos is
just $1.40/repo-month. This is the cheapest insurance you can get against the
nearly inevitable mistake of committing something sensitive.

~~~
mmahemoff
Sure, use private repos for private projects but this is about open-source
authors accidentally leaving their credentials in config files and the like.

~~~
femto113
Anecdotally this seems to happen mostly with personal experiments, not with
development of open source software meant to be consumed by others. Secrets
are generally included in projects that are meant to be deployed (like a
Rails app, or a blog, or ...), not in the libraries/gems/modules that make up
the bulk of the open source found on Github.

------
xasos
Always use environment variables. They are probably the best way to safeguard
your API keys.

~~~
TTPrograms
I've always wondered about the proper way to deal with this, and this makes
total sense. How would you typically set such an environment variable? In
your bash init?

~~~
sesteel
Environment variables can work well for development, but I wouldn't put them
in .bashrc or .bash_profile; if you are like me, you like to store your
dotfiles somewhere public. I typically leave them in an encrypted file on dev
systems, but that only solves the accidental over-the-shoulder problem.
Production systems require another level of security altogether.

Typically, I've seen services run in restricted user accounts with limited
system access, reading passwords out of an encrypted file. This file is
stored in some obscure location on the box, and that user account is the only
one with read permission to it.

Keep in mind, every system has weaknesses and I am still interested in
listening to others' approaches.

~~~
jemfinch
> if you are like me, you like to store your dot files somewhere public.

In your public .bashrc, put the line `source .bashrc.secret`. Just keep an
empty .bashrc.secret in your public repository, and keep your actual secret
credentials in the real file on your machines.
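
As a config-fragment sketch, the line in the public `.bashrc` can also be guarded, so the dotfiles repo still works on a machine where the secret file is absent:

```shell
# In the public .bashrc: pull in secrets only if the private file
# exists, so a fresh machine without it still gets a working shell.
if [ -f "$HOME/.bashrc.secret" ]; then
  . "$HOME/.bashrc.secret"
fi
```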

~~~
mateuszf
And don't forget to add it to .gitignore; otherwise, if it's accidentally
overwritten it might land in the public repo.

~~~
devonkim
That's a fantastic way to TELL attackers what filename to search for on a
filesystem if they have access to your source code. Randomizing filenames and
forcing an attacker to have to write a custom utility to find the path to
files keeps you from getting hit by a number of drive-by hackings. And every
single incorrect use of a credential must be recorded off-system and
monitored. Avoid using defaults in general for any third party software and
you can do things like generating random paths to S3 buckets that contain
certificates and environment variables in your own software. S3 buckets are
incredibly secure if you tack on CloudHSM plus use host certificates
effectively with IAM policies.

Otherwise, I'd try to use keystore systems available on your respective OS or
language platform toolchain (CSP on .NET, JCE for Java, I dunno wtf else you'd
use for anything else because the only people I've heard of that want to go
this far are all F500 enterprises basically with software in exactly those two
languages only).

~~~
tlrobinson
You can create a global .gitignore:
[https://help.github.com/articles/ignoring-files/](https://help.github.com/articles/ignoring-files/)
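
For reference, setting one up takes two commands; the ignore patterns below are examples, so adjust them to whatever you name your credential files:

```shell
# Point git at a machine-wide ignore file, then seed it with a few
# credential-file patterns (examples only; use your own names).
git config --global core.excludesfile "$HOME/.gitignore_global"
printf '%s\n' '.env' '.credentials' '*.secret' >> "$HOME/.gitignore_global"
```

Unlike a per-repo .gitignore, this applies to every repository on the machine, so a credentials file can never be staged by accident in a new project you forgot to configure.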

------
xasos
It always amazes me to see the sheer number of API keys left around in GitHub
repositories. You can search for something like "Twilio API Key" and come
back with hundreds of thousands of results. I wonder to what extent these
keys have been exploited.

------
baxter001
A script to post random key-containing, config-like files to public repos,
wasting these guys' bandwidth and lighting them up on Amazon's blacklist
radar, would be a cool idea.

------
DenisM
On MacOS there's Keychain - it's a designated place for storing secrets.

On Windows I create a batch file at a fixed location with all the credentials
in it. A script simply runs this batch file and reads the env vars to get
values. A compiled program parses the batch file with a regex to find the
required values. This works remarkably well for keeping credentials out of
the code base.

Hope that helps someone.

------
icymatter
Github has a very good cache. In the past, when I deleted a repository, I was
still able to access some diff and commit information from my own activity
pages. I had to ask the Github team to clear those pages manually.

------
jquast
I'm very certain this is a hacker's account configured to follow a great deal
of projects and people (2k projects, 1.3k users) for this very purpose -- a
suspicious [redacted] [unknown] profile,
[https://github.com/trnsz](https://github.com/trnsz)

------
jpetersonmn
The first time I tried to use GitHub, I uploaded my Gmail password, which I
was using to send myself an email when something failed. I figured there
would be bots that would scoop up that information right away. Luckily I
realized what I had done before anyone could get into my Gmail.

------
jpdlla
Thinking of actually building a tool for this. It would have a blacklist of
"searches" that might surface sensitive data, and would perhaps notify the
committer by email or open an issue on the repo. Anyone else want to get
involved?

------
godzillabrennus
Millions of emails for developers and no one harvesting this info thought it
wise to obfuscate it in some way?

~~~
bhuga
Emails are part of commits, which are hashed as part of normal git operations.
There's no way to make repositories public and not make emails public.

~~~
xai3luGi
You could set your author email address to an empty string before committing.

------
tlrobinson
Are there any open source git hooks that will scan your code for known
credential formats?
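
I don't know of a canonical one, but a best-effort hook is small enough to sketch yourself. Here the scanning is factored into a function that reads a unified diff on stdin; the two patterns (AWS-style key IDs and PEM private-key headers) are illustrative only:

```shell
#!/bin/sh
# Sketch of a .git/hooks/pre-commit hook (make it executable).
# scan_added reads a unified diff on stdin and succeeds only when no
# added line matches one of the (illustrative) credential patterns.
scan_added() {
  ! grep -E '^\+' | grep -qE 'AKIA[0-9A-Z]{16}|-----BEGIN [A-Z ]*PRIVATE KEY-----'
}

# The hook body would feed the staged diff through the scanner:
#   git diff --cached -U0 | scan_added ||
#     { echo 'pre-commit: possible credential staged; aborting.' >&2; exit 1; }
```

Scanning only the added lines of the staged diff keeps the hook fast and avoids flagging secrets-shaped strings that were already in history.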

------
morkfromork
What would happen if you published a few billion fake credentials?

------
rilita
TLDR: It won't save you because people could have copied the information
before you deleted it.

Duh?

~~~
sheetjs
TL;DR: Github makes it easy to notice when events have occurred, so easy that
you can write tools to copy information as soon as it hits.

This is a bit more nuanced than your summary precisely because GH makes it
easy. Without the events API, you would have to poll the various repos to
find out whether changes had happened.

Furthermore, the existence of GHTorrent demonstrates the ease with which this
information can be harvested.

~~~
rilita
So what you are saying is that because Github provides a clean API we should
be more careful about posting sensitive data to it versus other things where
you can simply scrape index pages?

I fail to see how the warning is meaningful. One should assume anything you
post to the public internet may be public forever. The existence of an API
changes nothing as far as the amount of care you should take.

