
Incident Report: Inadvertent Private Repository Disclosure - jamesfryman
https://github.com/blog/2273-incident-report-inadvertent-private-repository-disclosure
======
Hovertruck
We received an email from GitHub yesterday informing us that one of our
repositories had been accessed by a third party due to this issue. While it's
not a fun notification to receive, it definitely made our general security
paranoia feel justified – we're lucky that from the get-go we've held to best
practices around keeping secrets out of the codebase. Obviously we still
dedicated time as a team to go through our repository history with a
fine-toothed comb for anything that could potentially be a vulnerability, as
we take this very seriously.

One of our engineers came up with a useful script to grab all unique lines
from the history of the repository and sort them according to entropy. This
helps to lift any access keys or passwords which may have been committed at
any point to the top.
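
Such a script is easy to sketch. A minimal, hypothetical version (not the
actual tool) that walks the full history with `git log -p --all` and ranks
unique added/removed lines by Shannon entropy might look like:

```python
import math
import subprocess
from collections import Counter

def entropy(s: str) -> float:
    """Shannon entropy of a string in bits per character: sum(p * log2(1/p))."""
    if not s:
        return 0.0
    n = len(s)
    return sum((c / n) * math.log2(n / c) for c in Counter(s).values())

def rank_history(top: int = 40) -> list:
    """Rank every unique line ever added or removed in this repo by entropy."""
    # `git log -p --all` emits the diff of every commit on every branch.
    log = subprocess.run(["git", "log", "-p", "--all"],
                         capture_output=True, text=True).stdout
    # Diff body lines start with '+' or '-'; strip the marker and dedupe.
    lines = {l[1:].strip() for l in log.splitlines() if l[:1] in "+-"}
    # Highest entropy first: random-looking keys and tokens float to the top.
    return sorted(lines, key=entropy, reverse=True)[:top]

# Run inside a git checkout:
# for line in rank_history():
#     print(f"{entropy(line):6.3f}  {line}")
```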

I think this is a great example to illustrate the tough edges of security to
less experienced engineers. GitHub will most likely never let something like
this happen to you, but on the off chance that they do, it's great to be
prepared. Additionally, the response from GitHub was very well received: no
excuses, just a thorough explanation of what happened.

I also can't help but mention that we're hiring, if you'd like to work at an
organization that values security and data privacy very highly. :)
usebutton.com/join-us

~~~
foota
I'm curious, how did they calculate entropy? My first thought was to do
something with Huffman encoding.

~~~
jasonmoo
I wrote the script in question and actually used a simple shannon entropy
value.
([http://codereview.stackexchange.com/questions/868/calculatin...](http://codereview.stackexchange.com/questions/868/calculating-entropy-of-a-string/909#909)).
It worked well enough to help rule out several problem spaces.

~~~
pavel_lishin
Would you mind posting the script? I'd love to run it against our codebase and
see what it comes up with.

It might be a fun thing to open source as part of a "I've inherited a project,
what now?" toolkit that helps you decide what to fix.

~~~
jasonmoo
Sure. It's a simple tool but the concept could be augmented toward something
like the scenario you described.

[https://gist.github.com/jasonmoo/06691c8fea09b62aa35235fc93e...](https://gist.github.com/jasonmoo/06691c8fea09b62aa35235fc93ee31b6)

------
jorge_leria
GitHub takes security seriously; this disclosure post is proof of that.

------
kozak
This honest report is a good example of transparency.

------
Silhouette
One of the most striking things about this report is the scale that GitHub has
now reached: the whole incident apparently lasted only 10 minutes, but during
that time 17 _million_ requests were sent to their git proxy.

It's obviously unfortunate in this case, since even a relatively small and
quickly fixed bug affecting a tiny proportion of requests still had serious
consequences.

However, it's a remarkable achievement (if also a little terrifying for the
software development industry from a single-point-of-failure perspective).

------
bsder
I approve of the handling, but this just underscores why you want self-hosted
instances.

~~~
asolove
Does it? Except for very sophisticated organizations, I doubt it.

You don't hear about intrusions into self-hosted source repositories. Not
because there are fewer, but because they likely don't have the security
infrastructure in place to know that they ever happened.

~~~
kosievdmerwe
Also, there is very little incentive for them to advertise that they've been
compromised, whereas GitHub has a duty to disclose a compromise to their
clients.

------
0x0
Leaking private repositories is one thing, but if you have a private build
server that pulls and runs scripts, you could be in for a bad time even if you
only ended up pulling a random public repository, if that repository's build
script is malicious... hmmm...

------
vemv
In retrospect of course it's always easy to criticise, but still, the diff is
really cringeworthy.

The deleted code is very specific-looking. Nobody writes that casually or out
of ignorance. It was also what was in use in production.

It's very naive to just go and replace that with nice-looking, shorter code.

Key lessons:

\- Understand what you are deleting

\- Treat production code as sacred

\- Add reasonably extensive comments for delicate code (as the original code
had). Git commit messages aren't enough.

\- Try out infrastructure changes in production-like staging servers. I really
doubt they properly did, as they say the "majority" of 17M requests failed.

------
faitswulff
How did they become aware of the bug so quickly (<10 minutes)? Unless I'm
missing something from the report, it doesn't say.

~~~
luhn
The bug triggered a flood of errors.

> The impact of this bug for most queries was a malformed response, which
> errored and caused a near immediate rollback.

------
webmaven
Interesting that they don't mention expanding the information being logged to
make the multiple joins they had to do unnecessary, or at least more
deterministic.

------
revelation
Next step: set up a development system?!

Surely they do some end-to-end testing?

~~~
uxp
They state in the post that of the 17 million requests to their git proxy
server, only 230 could be identified as successful responses with incorrect
data/repos, a rate of about 0.0013%.
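
As a quick sanity check, that ratio works out as stated:

```python
# 230 mis-served responses out of ~17 million proxied requests.
print(f"{230 / 17_000_000:.6%}")  # → 0.001353%
```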

I don't know of anyone who would recommend writing tests, even integration
tests, that hammer a service to check whether something like a thousandth of
one percent of requests returns invalid data. If anything, a script hammering
a service that (in a dev or QA environment) probably has much less data in its
database and file stores, and much less protection (like load balancing and
caching) than it would in production, would generate more false positives than
substantial data-disclosure regression defects.

~~~
smashed
But the overwhelming majority of requests failed with errors. The happy path
was not tested either.

------
wojcech
Props to GitHub for the disclosure. And congratulations to GitLab for probably
getting a nice boost in on-premise support contracts :)

~~~
OJFord
GitHub Enterprise is on-premise too.

I'm not sure this would necessarily make you want to make both the change to
self-hosting and the change of platform.

~~~
stonogo
Because privacy is the one critical characteristic of a private repository,
and they failed to deliver it. Moving on-prem doesn't fix that failure, it
just mitigates the fallout.
fallout.

~~~
iancarroll
It seems highly unlikely this commit made it into a GitHub Enterprise release.

~~~
stonogo
We'll never know, which is a problem unto itself.

------
cheiVia0
The only truly private repository is one you created on your own computer and
never pushed to GitHub.

