
Status.github.com: “We're failing over a data storage system” - donpdonp
Updates to gist are getting lost. It accepts them as normal, but the next page load for the gist shows the previous version. Twitter accounts describe the same sort of problem for git repos.

status.github.com reports "We're failing over a data storage system in order to restore access to GitHub.com."
======
jypepin
It's been over 6 hrs without an update, except the hourly message which states the
same. I'm really looking forward to this post mortem, but having been in the
situation where I've had to deal with large scale outages like this one, I
guarantee some engineers are having a bad time right now, and I feel for them.

Github engineers, if you are reading me (probably not), KEEP IT UP, it happens
to the best of us! <3

~~~
olingern
Posted at 15:51 Japan Standard Time

"...with the aim of serving fully consistent data within the next 2 hours."

That's somewhat significant

~~~
Cthulhu_
Yeah, sounds like a backup restore or a RAID synchronization is in progress.

------
woogle
They've just posted an "Incident Report":
[https://blog.github.com/2018-10-21-october21-incident-report...](https://blog.github.com/2018-10-21-october21-incident-report/)

> Multiple services on GitHub.com were affected by a network partition and
> subsequent database failure resulting in inconsistent information being
> presented on our website. Out of an abundance of caution we have taken steps
> to ensure the integrity of your data, including pausing webhook events and
> other internal processing systems.

[...]

> Information displayed on GitHub.com is likely to appear out of date; however
> no data was lost. Once service is fully restored, everything should appear
> as expected. Further, this incident only impacted website metadata stored in
> our MySQL databases, such as issues and pull requests. Git repository data
> remains unaffected and has been available throughout the incident.

------
keyle
I don't know if they keep changing the text updates with a slightly different
version to:

- prove it's a human that typed it, or

- work around code that prevents repeating the same message twice.

Either way it's entertaining... But it's Monday morning in Australia and we
need to release! (yep, we do this via PR/tagging etc.)

~~~
Fuzzwah
It is to ensure that the updates get mirrored onto twitter, where exact
duplicates can't be posted.

~~~
hiccuphippo
Isn't it enough to delete the older tweet? Or maybe just add a timestamp to
the message.

Off-topic: if it's not possible to write a tweet with the exact text of a
deleted tweet, then a way to prove someone wrote a tweet and then deleted it
would be to have them try to write it again.

~~~
askmike
Deleting tweets is a terrible workaround. Once GitHub starts tweeting and
people start linking to those tweets, they can't go and delete them 50
minutes later.

------
r_singh
Just a few weeks ago my organization was in the position of choosing a version
control platform for our repos. I'm so glad we went ahead with self-hosted
GitLab. We installed it on a CentOS server on our premises, set up SSL via
Let's Encrypt, and I've even set up a dedicated GitLab runner to use GitLab CI
for continuous delivery; so far the testing is progressing pretty smoothly.

All this for $0.

Update:

I agree, gitlab and github will cost the same in the long run. Time costs
money and self hosted has a lot of downsides as well.

However, GitLab being self-hosted (and not costing anything in terms of a
separate invoice) makes us feel like we're more in control and somehow saving
some money (as we add more developers as well). This comment was just sparked
by a sigh of relief, because this week is an important release for us, and had
we been on GitHub, this Monday wouldn't have started too well.

Overall, all the comments listing the trade-offs are true. Time costs $$$, and
a paid hosted solution is worth its minimal cost compared to the time and
expertise self-hosting requires.

~~~
Cthulhu_
What did the server cost? How many hours did you spend on it? What's your SLA?
Who will be woken up at night when there's an outage? What's your backup and
recovery procedure?

I mean kudos to you for setting it up but it's a bit naive to believe it's
better than a hosted solution right away.

~~~
vbezhenar
Git is a tool for developers. If a developer can't fix a broken server, they're
doing something wrong. No need to outsource trivial tasks.

~~~
softawre
Might as well have your developers do the support and the testing and
answering the phones and cleaning the bathrooms too, eh?

~~~
vbezhenar
Support and testing may be useful. Cleaning bathrooms no.

------
ComputerGuru
I can't add comments to pull requests ("you can't do that right now"), commits
pushed to branches are not updating the visible status in the web interface,
nor are newly created branches showing up. However, if you navigate to a new
commit directly with its SHA (so you can share it with someone if you really
want), it'll show up; they're just not being indexed.
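For reference, the direct-link trick works because GitHub serves any commit at a fixed URL pattern, which keeps responding even while branch indexing lags behind. The owner, repo, and SHA below are placeholders:

```python
def commit_url(owner: str, repo: str, sha: str) -> str:
    # GitHub shows a single commit at /<owner>/<repo>/commit/<sha>,
    # independently of whether the branch listing has been reindexed.
    return f"https://github.com/{owner}/{repo}/commit/{sha}"

print(commit_url("octocat", "hello-world", "7fd1a60"))
# → https://github.com/octocat/hello-world/commit/7fd1a60
```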

EDIT:

Obligatory "that's what happens when the whole world relies on a centralized
git repo" and a reference to gitea, which has a very slick github-esque UI and
is incredibly easy and light to deploy/run (on an existing server, your own
PC, a Raspberry Pi, a Docker VM, or whatever): [https://gitea.io/en-us/](https://gitea.io/en-us/)

~~~
rodorgas
I’m receiving emails (a lot of duplicates) from comments on a PR, but they
won’t show up in the browser. I guess people are trying to submit multiple
times; the email is sent but the comment isn’t posted.

~~~
majidazimi
Welcome to eventual consistency.

~~~
hnarn
Mañana Consistency(tm)

------
biddlesby
The status updates are bearing a remarkable resemblance to a Windows loading
bar.

• We aim to serve fully consistent data within the next 2 hours.

• _45 minutes later_. On track for serving fully consistent data within the
next 1.5 hours

• _45 minutes later_. On track to serve consistent data within the hour.

• _One and a half hours later_. We estimate they will be caught up in an hour
and a half.

~~~
tjoff
I appreciate that they try. As a user I would never rely on those estimates,
but it at least gives you a sense of what's going on and what they are doing.
I take that over the more common silence or "we are working on it" any day.

~~~
Ma8ee
Exactly. We all know not to trust that those estimates are exact, but they are
much better than nothing. When they say nothing we don't know if it's half an
hour, half a day or half a week.

~~~
radiospiel
Well, so they say it is within the hour, but we still don't know if it's half
an hour, half a day or half a week.

------
tomas789
My Jenkins is creating CI builds like crazy. If your CI is on a pay-as-you-go
service, make sure you are not burning money (or credits).

------
benmmurphy
My guess at what happened:

They had a split brain when multiple masters were running. Then they were not
able to choose a master to keep because the data in both masters was
'corrupted' so they are now restoring from a backup.

So how do they get data corruption from multiple masters running:

1) Performing reads from slaves during an update operation. If you perform a
read from a slave then you might get data from the other master. If you update
data on one master based on data from another master then you get data
corruption. Probably they don't do this much because if you had any slave lag
then you would notice this problem during normal operation. However, they
might do it for checking permissions. You can imagine because this data almost
never changes it would never show up as a problem normally.

2) Data stored outside of the database. This would be the repositories
themselves and cookie-session storage. Imagine repos have an incrementing id.
Then if you have two masters, a repo gets created with the same id on both
masters. This is very bad because now two people can see each other's data. You
have the same problem with cookies. Imagine you have the user id as an
incrementing id. The two masters create a user with the same id, and an
encrypted cookie or another storage system (redis) stores the user id. Now,
depending on which master you get routed to, you appear as a different user.

Weirdly enough this would always be a problem based on how their failover
system works. The only safe way I know to turn a master-slave system into
a safe HA system is how Joyent does it
([https://github.com/joyent/manatee/blob/master/docs/user-guid...](https://github.com/joyent/manatee/blob/master/docs/user-guide.md)):
you basically need a vote at commit time; if you have this, then only one master
can commit. However, I'm guessing most of the time they only have two masters
running for < 1 minute, but probably this time they had two masters running for
a long time.

EDIT: Oh, they said it affected issues & pull requests [they have different
DBs for different stuff], so repo and authentication stuff wouldn't apply. They
also said no data was lost, so presumably that would exclude a split brain
running for a long period of time.
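The incrementing-id collision from (2) can be sketched with a toy in-memory "master" (entirely hypothetical; GitHub's actual schema isn't public, but any two writers that both own the same auto-increment counter would behave this way):

```python
import uuid

class Master:
    """Toy single-writer store that hands out auto-incrementing ids."""
    def __init__(self):
        self.next_id = 1
        self.rows = {}

    def create(self, data):
        row_id = self.next_id
        self.next_id += 1
        self.rows[row_id] = data
        return row_id

# During a network partition, each side of the split brain keeps
# accepting writes with its own copy of the counter.
a, b = Master(), Master()
id_a = a.create({"owner": "alice"})
id_b = b.create({"owner": "bob"})

# Both masters assigned id 1 to different users' data; merging the two
# datasets later would let alice and bob see each other's rows.
assert id_a == id_b == 1

# Random UUIDs sidestep the collision: the two sides almost surely
# generate distinct keys, so the writes stay mergeable.
uid_a, uid_b = uuid.uuid4(), uuid.uuid4()
assert uid_a != uid_b
```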

~~~
sudhirj
Split brain can run for a while if all identifiers are UUIDs and tables are
used as append-only. Restoration is complex, though.

~~~
benmmurphy
Depending on your application logic you can get external inconsistencies. Your
typical CRUD app might dump the state of an entity into an edit form; a user
edits one field, and the whole form is sent to the backend and written over
whatever exists there. If you have two different masters, you can get a series
of updates where you can't tell whether an update happened because the user
meant to make the change or because the change was incorrectly propagated from
the other master. Even if you knew the user read from the other master, you
can't tell if the user intended the change or not; maybe they saw the value was
the one they wanted, so they didn't change it.

I guess this is a 'bug', because when you do stuff like this you should include
some kind of version identifier to catch a concurrent update in the normal
case. This scenario is also of the 'storing stuff outside the DB' kind, similar
to the user_id in the cookie getting out of sync.
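The "version identifier" fix described here is optimistic locking. A minimal sketch with a hypothetical in-memory table (in SQL the same check is typically an `UPDATE ... WHERE id = ? AND version = ?` that matches zero rows when the read was stale):

```python
class StaleUpdateError(Exception):
    pass

class Table:
    """Toy row store with a per-row version counter (optimistic locking)."""
    def __init__(self):
        self.rows = {}  # id -> (version, data)

    def read(self, row_id):
        return self.rows[row_id]

    def write(self, row_id, expected_version, data):
        version, _ = self.rows[row_id]
        if version != expected_version:
            # Someone else (or another master) updated the row since we read it.
            raise StaleUpdateError(
                f"row {row_id}: expected v{expected_version}, found v{version}")
        self.rows[row_id] = (version + 1, data)

t = Table()
t.rows[1] = (1, {"title": "old title"})

# Users A and B both load the edit form at version 1.
v_a, _ = t.read(1)
v_b, _ = t.read(1)

# A saves first; the version advances to 2.
t.write(1, v_a, {"title": "A's title"})

# B's save is rejected instead of silently overwriting A's change.
try:
    t.write(1, v_b, {"title": "B's title"})
    blind_overwrite = True
except StaleUpdateError:
    blind_overwrite = False
assert not blind_overwrite
```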

------
theSage
I'm really looking forward to the post mortem that comes out of this (if it
does). I always learn a lot from reading those.

------
_shadi
I always thought it was silly that we both pay for private repos on GitHub and
host our own GitHub Enterprise on premises; turns out whoever set this up knew
what they were doing.

------
olingern
Local to Tokyo, GitHub has been down for most of the day. I never realized how
much of my day centers around it: PR review, creating / commenting on issues,
etc.

Looking forward to the write-up. I'm also curious as to how this significant
outage lines up with their SLA for enterprise users.

------
k_
An update has been posted there:
[https://blog.github.com/2018-10-21-october21-incident-report...](https://blog.github.com/2018-10-21-october21-incident-report/)

> At 10:52 pm Sunday UTC, multiple services on GitHub.com were affected by a
> network partition and subsequent database failure resulting in inconsistent
> information being presented on our website. Out of an abundance of caution
> we have taken steps to ensure the integrity of your data, including pausing
> webhook events and other internal processing systems.

> We are aware of how important our services are to your development workflows
> and are actively working to establish an estimated timeframe for full
> recovery. We will share this information with you as soon as it is
> available. During this time, information displayed on GitHub.com is likely
> to appear out of date; however no data was lost. Once service is fully
> restored, everything should appear as expected. Further, this incident only
> impacted website metadata stored in our MySQL databases, such as issues and
> pull requests. Git repository data remains unaffected and has been available
> throughout the incident.

> We will continue to provide updates and an estimated time to resolution via
> our status page.

~~~
k_
An hour and a half later (a couple of minutes ago), despite an estimated time
of less than an hour:

> We continue to monitor restores which are taking longer than anticipated. We
> estimate they will be caught up in an hour and a half.

------
reidrac
Someone posted a comment on an open issue, I got the mail, but there's nothing
on the web interface.

What I don't understand is why they don't set it all to "read only" until the
problem is sorted. Looks like any update goes to /dev/null, just let your
users know!

(unless they plan to replay those updates, somehow)

~~~
askmike
I don't think it's easy to predict what is going to happen to all updates now.
They run a big distributed infrastructure; I doubt they can even predict which
updates will be committed and which ones will not.

If replaying is even on the table (which sounds very dangerous to me), it
requires huge coordination, and I am pretty sure they are not able to tell
right now whether that is going to work once the actual issue is fixed.

Better to stay quiet while fixing the issue and only say things that you
actually know 100% to be correct.

------
dfcowell
PR comments are also failing with an HTTP 405 error.

Great way to start the week.

~~~
keyle
Well for the poor sods fixing it in the US, it's still Sunday evening...

~~~
geerlingguy
It's when I typically get an hour or two to crank out some open source PRs and
issue queue cleanup. Sadly, the outage means I don't get that time to devote
this week :(

------
megakid
I'm trying to add a new user to my organisation (Monday morning, new
joiners...!) and despite getting confirmation that I purchased another seat, I
cannot invite them. Or when I do, those seats are gone; I guess it's down to
which data store I'm hitting on each request.

------
calmconviction
They probably moved their storage backends to Windows 10

------
geggam
Interesting how everyone uses a tool designed to eliminate SPOFs in a way that
has a SPOF.

~~~
geerlingguy
GitHub does not equal git; the reason I’m paused is because I use Github’s
issue queues to organize my OSS work. Much easier than self hosting an issue
repository/bug tracker. I am still able to do all my work, run new containers,
etc., but GitHub is more tied into business processes than actual code (which
is what causes the pain during these outages).

------
stephenr
Obligatory: to all the people running around like headless chickens: this is
what happens when you focus on the service, not on the DVCS.

Your code and your wiki (if you use it) are already in git repos. If the
industry moved forward on issues in dvcs repos, you have another central point
of failure removed.

You also have less hassle to work across dvcs hosts, which is exactly why
GitHub would never embrace this.

------
6t6t6t6
Next: "We are restoring tape backups from some of our storage systems"

~~~
6t6t6t6
"We are continuing to repair a data storage system for GitHub.com. You may see
inconsistent results during this process."

------
josteink
Seems like they're on their way back up:

> We are currently in the later stages of a restore operation, with the aim of
> serving fully consistent data within the next 2 hours.

------
nodesocket
I created a gist four hours ago, which is now not loading and is showing a
404. Hopefully the gist did not get lost.

------
mattio
They started to replay all webhook events yesterday evening, triggering all
kinds of events in our systems, deployments and the like :-(

------
gavreh
My issue comments are not being saved, and if I do a tag push I don't see that
reflected on the "releases/tags" area of GitHub.

~~~
diegoperini
I just lost my session.

------
shry4ns
I don't know if someone's mentioned it yet, but Travis CI is also not running
on the commits that do show up on GitHub, here in Houston.

------
RandomGuyDTB
Earlier, GitHub Pages was "down for maintenance". Probably not related but
might be of note.

(source: my website is hosted on github pages)

------
beamso
Seeing a lot of issues with Pull Requests.

------
igni
Before all the trolling about Microsoft starts up, does anyone have current
information on what these systems are?

In the enterprise space, a 'data storage system' could be an Array or a SAN or
a lightpath etc, with usually quite long failover times. For an org like
GitHub I'd think more like an object store (an Array-of-Hosts, if you will) or
whatever storage mechanism holds their database files. Do they self host this
sort of thing or is it an AWS/GCE/Azure service?

FWIW, all git commands are working fine for me (create a branch, push,
colleagues can fetch my branch), but the UI doesn't show my branch, nor can I
review/comment on PRs.

~~~
sciurus
For storing things other than git repos, GitHub is heavily invested in MySQL.
AFAIK all of GitHub is hosted on their own hardware.

[https://githubengineering.com/mysql-high-availability-at-git...](https://githubengineering.com/mysql-high-availability-at-github/)

~~~
rurban
This blog post describes exactly the scenario we were experiencing here: a
master (single writer) failure, with the failover not happening. You can only
guess what went wrong with this plan; it looks good on paper, but some
unexpected network, HW, or routing problem could have made it impossible to
identify the single writer.

------
karaokeyoga
Is it just me or is that status message wonky? Failing over something to
restore access?

~~~
rphillips
Sounds like an engineer knee deep in diagnosing the issue.

~~~
diegoperini
That person needs our support more than anyone else right now.

------
hartator
I can access GitHub.com here (Austin, TX), but it's taking forever.

------
mellisdesigns
What particular services are down? I noticed I can hit the UI.

~~~
avip
For me: can't fork, can't login (tested incognito), can't clone new repos.

------
BadassFractal
Wonder if GitHub is "too useful to fail" at this point. As in, most people and
companies won't switch git repo providers unless GitHub is down for many days
at a time during the work week.

~~~
avip
Transitioning to Bitbucket or GitLab is one click away. Companies will surely
move if the incentives are there.

~~~
jononor
Not when you use third party integrations. Like when using Travis for CI, and
Travis doing your deploy.

~~~
Skinney
Travis supports both Bitbucket and Gitlab

~~~
jwilk
No, Travis CI doesn't support any code hosting other than GitHub.

------
dustinmoris
GitHub must be migrating to Azure...

~~~
guardian5x
Actually, they just announced they're staying with AWS as long as things run
well. Maybe now they have a reason to move.

------
____Sash---701_
What!?

------
guywhocodes
What's strange to me is how many hours we've been given the same message.
Today is an important day for my organization, and not getting any information
that hints at how long this will last is a huge problem for our planning.

~~~
steventhedev
Are you a paying customer of GitHub Enterprise? If not, then you're getting
your money's worth.

Snark aside, this is a great time to reassess your deployment strategies and
look into things like local apt and pypi proxies. I'm confident you can find
similar projects that will transparently cache your dependencies.

~~~
samontar
If you’re paying for GHE you don’t have any trouble. That’s delivered as an
appliance you host.

~~~
OJFord
No it isn't:
[https://enterprise.github.com/features#pricing](https://enterprise.github.com/features#pricing)

~~~
akx
Yes it is.
[https://enterprise.github.com/faq#faq-3](https://enterprise.github.com/faq#faq-3)

------
sajithdilshan
I think it is obvious what happened. They must have moved their backend to
Windows 10.

~~~
sajithdilshan
It's a joke people. Why downvote?

~~~
deadbunny
Because it's a boring, obvious joke which wasn't funny even in the 90s?

------
Memosyne
Anyone know if this could be related to the Youtube outage? It's been a while
since I've seen these big websites go down.

~~~
danielhlockard
I'm not sure how you could postulate that they're related

~~~
ObsoleteNerd
Targeted attacks on specific aspects/assets of a major website's
infrastructure? E.g. targeting a specific service that both sites have in
common in their back ends?

Not even sure if that's feasible, but it's an idea.

~~~
Memosyne
I figured it was just a coincidence, but the HN network is so well informed I
thought I'd just ask the question on the off-chance it wasn't.

I personally don't find a relation likely.

~~~
ObsoleteNerd
Oh, I find the relation very unlikely. I was just trying to think up a
potential connection for fun/curiosity, as the coincidence is pretty crazy
that two huge sites known for very few outages had major outages in the same
week.

------
platinium
This has been super frustrating, as people have deadlines and are working to
finish projects before Monday morning.

What are the (good) alternatives to Github? Gitlab supposedly is Google-
backed, so I don't want to have my private code there. Is Bitbucket the only
one left?

I don't mind paying monthly, which I already do for GitHub.

~~~
Tecuane
> Gitlab supposedly is Google-backed, so I don't want to have my private code
> there.

This strikes me as odd. May I ask why usage of GCP is a deal-breaker for you?
While I can understand not wanting to use Google products directly as a
consumer, I believe it would be - for lack of a better term - platform suicide
for Google to intercept and perform its usual analytical shenanigans on the
data content of transmissions to/from their platform.

Either way, Phacility's Phabricator[1] is $20/user/mo.

1\. [https://www.phacility.com/pricing/](https://www.phacility.com/pricing/)

~~~
Jedi72
Nobody trusts Google for any reason any more as they have proven unworthy of
our trust.

~~~
hactually
That's a silly statement. By saying 'nobody', a single data point invalidates
your assertion.

I use Gmail; I'm quite happy to trust that contract.

~~~
jammygit
You're saying that you trust that contract today, or are you saying that you
have always trusted it?

It's only recently that Gmail's contract involved staying out of your data. I
think they also only say they abstain from using your data for targeted
advertising, not that they don't use it for other purposes. I haven't read the
terms in quite a while, though, and I could be mistaken.

Great products though. I really do wish I could pay for them in exchange for a
real, trustworthy, comprehensive privacy promise.

~~~
deadbunny
There is a whole world of difference between paid for and not paid for Google
services.

