GitHub Outage
140 points by mre 8 months ago | hide | past | favorite | 81 comments
Just noticed issues with GitHub handling requests and overall flakiness. Getting a lot of status code 500 errors, so I decided to open this thread for status updates.



Even though the status page (https://www.githubstatus.com/) shows no issues, I'm still getting the occasional 500. It seems to be happening quite irregularly. They are possibly facing a lot of load.


In my experience it's often the case that the status site does not reflect reality -- until someone intervenes.


that's the whole point of a manually updated status page. you don't want automation to update it because that automation can fail. automation likely caused the outage you want to know more about.

you also don't want your automation guessing at what the problem is, or what the effects are. you want real info from a real person even if it isn't given to you the millisecond you look for it.

this is why status pages aren't updated by automation. if they're updated by a person, you know that people know about the problem, you know that people are working on the problem, and so on, which is good, but while they figure out what's going on, you see a "green" status page.

this is normal.

(this is for future readers, more than the person I am replying to.)


IMO the reason status pages are not updated automatically is legal. SLA and other legal contracts might change if every time something is down the status page reflects that accurately, so people try to hide it.

Approached that way, a status page is almost useless: it is not reliable, and it is only updated after I have already found out via other sources.

I am perfectly happy with a status page that shows the, mm, status of the service. Could be as easy as not reachable, slower than usual or any generic information (a traffic light). I disagree that a status page has to show the why of the error, although of course it would be nice.


> IMO the reason status pages are not updated automatically is legal. SLA and other legal contracts might change if every time something is down the status page reflects that accurately, so people try to hide it.

you are right about legal reasons; some companies count SLA by the time and date stamps on the status page.

people hiding a real outage when users know damned well there is an outage is thankfully not common at all.

if you can design and run a 100% reliable status page which never reports incorrect information, while also reporting useful information, you will be a hero to many.


> people hiding a real outage when users know damned well there is an outage is thankfully not common at all.

Thankfully people are not hiding it as in a conspiracy to pretend nothing is wrong. But, as you see in many comments in this thread, status pages rarely reflect that something is down immediately (because they are updated manually by humans).

This delay, codified in processes, is very convenient, and to me this is purposely hiding that a service is down. People are not hiding it, but the processes that control the status page are, indeed, hiding this information. This makes status pages less useful, IMO.


Yet Discord has an “API Response Time” graph[0] and Reddit has a “5xx error rate” graph[1]. No, it doesn’t automatically create incidents, but it’s nice to confirm an issue is happening site-wide after experiencing it.

Actually looks like the metrics part of Reddit’s status page broke over 2 weeks ago

0: https://discordstatus.com/

1: https://www.redditstatus.com/


GitHub has such a graph too, but limited to internal employees only. In fact that’s how they detect failures with their deployments, elevated 500 statuses. Some individual teams will have their own dashboard, but some do not (like k8s upgrades) and they only monitor 500s.

Rest assured someone is looking into this problem right now


> Yet Discord has an “API Response Time” graph[0] and Reddit has a “5xx error rate” graph[1].

that's awesome. doesn't exist for github, yet. would be nice if it does come.



I'm going to disagree here. The point of a manually updated status page is appearances.

With proper reporting it's trivial to know which subsystem is experiencing problems, if any. It doesn't have to be very granular, just "normal", "experiencing issues", "offline". If reporting doesn't work, you should be alerted it doesn't work, and if alerting doesn't work, there needs to either be out-of-band alerting for that or someone monitoring the status at all times.

Manual overrides for status pages should exist for when the automation doesn't work of course.

At my last job we had a big screen in the office we monitored (Grafana) and we usually saw problems before the alerting kicked in - it had about a minute delay. When not in-office/during work hours, the on-call received alerts. It wasn't technically nor organisationally complex.
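
A minimal sketch of the three-level idea above ("normal", "experiencing issues", "offline"), driven by the share of 5xx responses in a sliding window. The class name, the window size, and both thresholds are assumptions picked for illustration, not anything GitHub or a status-page vendor is known to use.

```python
from collections import deque


class StatusTracker:
    """Derive a coarse status from the most recent HTTP status codes."""

    def __init__(self, window=100, degraded_at=0.01, offline_at=0.5):
        # Hypothetical defaults: flag issues at 1% errors, offline at 50%.
        self.outcomes = deque(maxlen=window)  # True means a 5xx response
        self.degraded_at = degraded_at
        self.offline_at = offline_at

    def record(self, status_code):
        """Store whether one observed response was a server error."""
        self.outcomes.append(500 <= status_code <= 599)

    def error_rate(self):
        """Fraction of 5xx responses in the current window."""
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)

    def status(self):
        """Map the error rate onto one of three traffic-light levels."""
        rate = self.error_rate()
        if rate >= self.offline_at:
            return "offline"
        if rate >= self.degraded_at:
            return "experiencing issues"
        return "normal"


if __name__ == "__main__":
    tracker = StatusTracker()
    for code in [200] * 95 + [500] * 5:
        tracker.record(code)
    print(tracker.status())
```

A manual override, as mentioned above, would simply sit on top of `status()` and take precedence whenever a human sets it.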


This is so wrong that it makes me wonder if it's satire.

"The whole point" (as you put it) of status pages was to publish high-level monitoring data to users. The monitoring process should occur outside the system that is being monitored, perhaps even on a different cloud.

Eventually, many companies realized this revealed expensive SLA violations and ended that level of transparency.

Your status page can and should report important metrics to users, like elevated error rates. Most status pages used to.


not satire.

no company will put any amount of monitoring online for anyone to see, no matter how high level. for it to be useful info, it must contain details, and information about infrastructure is usually well guarded for very good reasons.


> no company will put any amount of monitoring online for anyone to see, no matter how high level

Many companies used to do this. I remember the first time someone on HN commented, "Hey, is it possible this status page is just a useless blog now?" And people were trying to figure it out.

Companies arguably have a contractual obligation to be transparent about this data with their customers anyway, so a company like Github (where such a huge percentage of the industry is a customer) is going to leak the data one way or another.



Exactly, status pages tend to be updated by the humans responding to the incident, they're not automatic (that'd be pretty useless, you already know it's down, you want to know when they know it's down). Coordinating what to put on the status page when an incident happens can take time, getting the correct scope of impact from responding engineers etc.


Sorry, I'm not following you, how do you know it's down when the status page says it's all working? At that point you assume it's not down and start checking your own systems. They're just lying to avoid fallout; it's not better than an automated page.


"Humans responding to the incident" is what Twitter and email communications are for. Status pages are supposed to be realtime status, and they should show downtime as soon as users suspect it.

As a user, you often don't know if the vendor's system is really down or if there's something wrong with your own system.


Static sites need editors and editors sometimes have to ask permission to post.

At least that's what AWS Health[1] looks like to me.

[1] https://health.aws.amazon.com


from the message I'm getting it seems like the load balancer is not able to spin up servers to handle new connections. Again, the status page needs to reflect that, which means the status page server must NOT be running on the same infrastructure as the main server group. Stop using AWS (or whichever hosting provider you like) for both the status and production environments.


I monitor GitHub externally here: https://github.onlineornot.com/

Seems like a huge spike in load.


Load as in load average? Or load as in traffic?

Spikes in request latency can be because of a bunch of stuff, including more traffic, but in my experience it's usually a missing optimization for some data structure that only matters after N items, or a new deploy containing code that wasn't as optimal as its author thought. Especially when dealing with distributed systems, where sub-optimal code in one part can cascade performance issues to various other parts of the system.


Downdetector's got hundreds of reported failures in one huge spike, and I can't load it at all.


Status pages are absolutely useless. I've never seen them accurately reflect an outage


You are missing the point of a status page. They're not automatic things that tell you instantly when something is down -- that'd be pretty pointless, you already know it's down. They're updated by the folks responding to the incident, so you know they know there's an issue and that they're looking into it.


> that'd be pretty pointless, you already know it's down

How would I know? What if my website doesn't have any monitoring and I use a payment system, shouldn't I automatically be notified when that payment system is down? What if it's down for a week? I think service-providing companies should always announce outages and even suspected outages.


I agree with GP. If I am trying to, let's say, watch something on Netflix and it is not working, a status page would confirm that Netflix is down in my region, and I would know that there is nothing wrong with my connection, DNS, or any other potential cause.

For this reason I believe they would not be pointless if they were simply status pages, instead of "incident response pages". My hypothesis for why they are this way instead is that real status is too much transparency for some companies, for PR and legal reasons.


Then it's not an operational status, it's an engineering status. Clearly it is very misleading. I think most people, even devs, think these pages are supposed to reflect the current situation. Btw the Github one still doesn't


That's the premise of my whole business - there's definitely a market for an automated status page!

(https://onlineornot.com)


Been running into many unicorns for the last few minutes, had a moment where it came back but seems to be down again. Even the unicorn image won't load on the unicorn page.


Which is weird because the unicorn is an inlined image (png), encoded in base64. Seems like they broke it.


Yah, looks like their CSP blocks it for some reason
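
For anyone unfamiliar with what "inlined" means here: the image bytes are base64-encoded straight into the `src` attribute as a `data:` URI, so the browser needs no second request. A small sketch in Python; the function name is made up, and the bytes are just the 8-byte PNG file signature, not GitHub's actual unicorn. A page's Content-Security-Policy has to allow the `data:` scheme in `img-src` for such an image to render, which is consistent with the CSP guess above.

```python
import base64

# First 8 bytes of every PNG file, used here as stand-in image data.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def to_data_uri(image_bytes, mime_type="image/png"):
    """Encode raw image bytes as a data: URI usable in an <img src>."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


if __name__ == "__main__":
    uri = to_data_uri(PNG_SIGNATURE)
    print(f'<img src="{uri}" alt="unicorn">')
```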


I haven't pushed to GitHub in over a year. Now I'm setting up a new page on github.io with a new repo and GitHub goes 500 just when I try to push.

Those GitHub badges... they are as ugly as it gets.


well, I for one would like more unrequested critique of artwork on a code sharing website. ಠ_ಠ


> Those GitHub badges... they are as ugly as it gets.

Bingo. Not everything in this world needs to be gamified.


Do we need to create a HN post for every outage? It happens every other week.


In the beginning of status pages, most of them were automatic one way or another, or engineers quickly threw up "We know of the problem, stay tuned" messages there.

But soon after, legal/executive teams apparently took ownership of them. Status pages no longer show downtime or response time automatically, and a notice that things are actually down can take a while.

So I think it's nice that there is at least one place where I can see if it's a problem on my end, or if it's global. It helps to remove some frustration at least.


What else are you supposed to do when you can’t work because you don’t have access to your source code.


you still have access to your source code, you just can't push. or pull, but you can sneakernet around that, or have a second remote set in the repo for just this occasion, so you can collaborate as a stop-gap measure while GH gets fixed up.
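
The second-remote idea could look roughly like this. A sketch only: the remote name `backup`, the URL, and the branch are placeholders, and the helper names are invented. It just builds (and can optionally run) ordinary `git remote add` and `git push` commands.

```python
import subprocess


def backup_remote_commands(url, name="backup", branch="main"):
    """Return the git commands that add a second remote and push to it."""
    return [
        ["git", "remote", "add", name, url],
        ["git", "push", name, branch],
    ]


def run_all(commands, repo_dir="."):
    """Run each command inside repo_dir, stopping at the first failure."""
    for cmd in commands:
        # check=True raises CalledProcessError if git exits non-zero
        subprocess.run(cmd, cwd=repo_dir, check=True)


if __name__ == "__main__":
    # Placeholder URL; point it at any host your team can reach.
    for cmd in backup_remote_commands("git@backup.example.com:team/project.git"):
        print(" ".join(cmd))
```

Colleagues would add the same remote on their side and pull from it until GitHub is back.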


git works completely fine offline.

However I have a feeling that most companies are set up to download 50MiB of dependencies at every run, so a website being down makes the entire thing not work.


Yep, noticed it with comments on an issue (had timeouts while submitting but it eventually went through).

Now 30 mins later, I've refreshed the issue and see that my reply and the comment I was replying to (by another user) are both gone. Hopefully, it's eventually consistent and these comments will re-appear later.


It's completely down for me. Status page says "all systems operational".


The service seems very flaky right now. Even the unicorn isn't loading properly.


Still getting this same error for the past 10-12 hours. Tried at different times.

{ "code": 500, "message": "internal server error" }

Does anyone have luck? Any workaround to fix it?


According to Metrist monitoring (disclosure: I work there), the errors were very rare, and didn't happen enough for us to call the product "down." Looks like around 1% of requests.


I'm unicorning hard rn


How reliable are Github cron action workflows? I set one up to run every 15 minutes recently, but it seems to actually be running closer to once an hour.


I'm trying to clone a repo at a whopping 6KB/s from Kenya.

EDIT: Seems to be a routing issue. I've enabled a UK VPN and it's working fine now.


‘No server is currently available to service your request. Sorry about that. Please try refreshing and contact us if the problem persists.’


Interesting how outages like this seem to happen mostly on Monday^w Tuesday mornings.


It's Tuesday my man :)


They are probably Canadian (Canada had Monday off due to Labour Day, so people get confused).


Canadians are never confused, how dare you! I ... oh wait.

Sorry, was confused.


> Sorry

Definitely Canadian.


If you have a risky commit, esp with a three day holiday weekend, you wait to land it until after the weekend.


it is tuesday my dudes :P


Monday was a holiday in USA


I get an error saying the action can't be performed when trying to star a repo.


this is what happens when you sell important community infrastructure to M$FT


This predates them getting bought by MS. GitHub was notoriously flaky from the very beginning.


It's back


Seems to be back, right?


Same issue. Site is also very non-responsive.


They’ve been having issues since yesterday.


I'd say since Friday night Pacific time.


Time for some deep thumb-twiddling


Yeah I noticed it earlier.


Seeing the same.


Github outages are the bored engineer's equivalent of getting a surprise snow day when you were in school, full of unbridled joy.

For engaged, happy engineers it's the equivalent of getting a surprise snow day when you are grown up and have to go dig your car out of the snow, and it's a normal day just with extra steps.


I can be both bored and engaged, don't test me buddy /s


> equivalent of getting a surprise snow day

Not if you self-host Git


Self-hosting Git is easy, throw up a ssh server and point git to it.

Self-hosting everything else GitHub does is harder. Which is why they are building out all of those things, they don't want people to move to other places so easily.

Hopefully these constant outages make more developers pissed off that issues are not stored in git as well, and they start working on tooling to solve this shitty problem once and for all.

P2P/Local First software for everyone! \o/


> Self-hosting everything else GitHub does is harder.

You can self-host the whole of GitHub can’t you?


As far as I know, you can self-host an enterprise version of GitHub, but it'll still be available from one location (the server you deploy it on). I cannot run it locally and federate with my colleagues' instances, for example, the way I can with Git.


gitlab


GitLab is not P2P/Local First software like Git itself. It suffers from exactly the same problem as GitHub, reliance on a central server (which is run either by the company making the product, or by your own team).

What I'm talking about is being able to access everything like issues, wikis, PRs and whatever, even when you're 100% offline.


fair, i'd love that!


Not really. I'd rather be getting stuff done. ¯\_(ツ)_/¯

edit: oopsie I misread.


Would you characterise yourself as a bored engineer?


today I might characterize myself as someone who made a mistake while reading a comment, and replied to the mistaken understanding instead of the intended one.


You still can use git, you just can't push the code.

Not a huge problem, unless it lasts for hours or gasp, days.



