you also don't want your automation guessing at what the problem is, or what the effects are. you want real info from a real person even if it isn't given to you the millisecond you look for it.
this is why status pages aren't updated by automation. if they're updated by a person, you know that people know about the problem and that people are working on it, which is good; but while they figure out what's going on, you see a "green" status page.
this is normal.
(this is for future readers, more than the person I am replying to.)
Approached that way, a status page is almost useless: it is not reliable, and it only gets updated after I have already found out via other sources.
I am perfectly happy with a status page that shows the, mm, status of the service. It could be as simple as "not reachable", "slower than usual", or any generic indicator (a traffic light). I disagree that a status page has to show the why of the error, although of course it would be nice.
you are right about legal reasons; some companies count SLA by the time and date stamps on the status page.
people hiding a real outage when users know damned well there is an outage is thankfully not common at all.
if you can design and run a 100% reliable status page which never reports incorrect information, while also reporting useful information, you will be a hero to many.
Thankfully people are not hiding it as in a conspiracy to pretend nothing is wrong. But, as you see in many comments in this thread, status pages rarely reflect that something is down immediately (because they are updated manually by humans).
This delay, codified in process, is very convenient, and to me it amounts to purposely hiding that a service is down. The people are not hiding it, but the processes that control the status page are, indeed, hiding this information. This makes status pages less useful, IMO.
Actually, it looks like the metrics part of Reddit's status page broke over 2 weeks ago.
Rest assured someone is looking into this problem right now
that's awesome. doesn't exist for github, yet. would be nice if it did.
With proper reporting it's trivial to know which subsystem is experiencing problems, if any. It doesn't have to be very granular, just "normal", "experiencing issues", "offline". If reporting doesn't work, you should be alerted it doesn't work, and if alerting doesn't work, there needs to either be out-of-band alerting for that or someone monitoring the status at all times.
Manual overrides for status pages should exist for when the automation doesn't work of course.
At my last job we had a big screen in the office we monitored (Grafana), and we usually saw problems before the alerting kicked in - it had about a minute of delay. Outside office hours, the on-call engineer received alerts. It wasn't technically or organisationally complex.
"The whole point" (as you put it) of status pages was to publish high-level monitoring data to users. The monitoring process should occur outside the system that is being monitored, perhaps even on a different cloud.
Eventually, many companies realized this revealed expensive SLA violations and ended that level of transparency.
Your status page can and should report important metrics to users, like elevated error rates. Most status pages used to.
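An "elevated error rate" signal like the one mentioned above is easy to compute from outside the monitored system: an external prober records pass/fail results and reports a rolling error rate. A minimal sketch, with the window size and threshold chosen arbitrarily:

```python
import collections


class ErrorRateMonitor:
    """Rolling error rate over the last N probe results. Intended to run
    outside the system being monitored (e.g. on a different cloud)."""

    def __init__(self, window: int = 100, threshold: float = 0.05):
        self.results = collections.deque(maxlen=window)  # True = probe OK
        self.threshold = threshold

    def record(self, ok: bool) -> None:
        self.results.append(ok)

    def error_rate(self) -> float:
        if not self.results:
            return 0.0
        return 1 - sum(self.results) / len(self.results)

    def elevated(self) -> bool:
        return self.error_rate() > self.threshold
```

Because the prober lives on separate infrastructure, an outage of the monitored system cannot take the status data down with it.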
no company will put any amount of monitoring online for anyone to see, no matter how high level. for it to be useful info, it must contain details, and information about infrastructure is usually well guarded for very good reasons.
Many companies used to do this. I remember the first time someone on HN commented, "Hey, is it possible this status page is just a useless blog now?" And people were trying to figure it out.
Companies arguably have a contractual obligation to be transparent about this data with their customers anyway, so a company like Github (where such a huge percentage of the industry is a customer) is going to leak the data one way or another.
As a user, you often don't know if the vendor's system is really down or if there's something wrong with your own system.
At least that's what AWS Health looks like to me.
Seems like a huge spike in load.
Spikes in request latency can be caused by a bunch of things, including more traffic, but in my experience they usually come down to a missing optimization for some data structure that gets triggered after N items, or a new deploy containing code that wasn't as optimal as its author thought. This is especially true in distributed systems, where sub-optimal code in one part can cascade performance issues into various other parts of the system.
How would I know? What if my website doesn't have any monitoring and I use a payment system, shouldn't I automatically be notified when that payment system is down? What if it's down for a week? I think service-providing companies should always announce outages and even suspected outages.
For this reason, I believe they would not be pointless if they were simply status pages instead of "incident response pages". My hypothesis for why they are this way instead: that much transparency is too risky for some companies, for PR and legal reasons.
Those GitHub badges... they are as ugly as it gets.
Bingo. Not everything in this world needs to be gamified.
But soon after, the legal/executive team apparently took ownership of them; the status pages no longer automatically show downtime/response time, and notice that things are actually down can take a while.
So I think it's nice that there is at least one place where I can see if it's a problem on my end, or if it's global. It helps to remove some frustration at least.
However, I have a feeling that most companies are set up to download 50MiB of dependencies on every run, so a website being down makes the entire thing stop working.
Now, 30 mins later, I've refreshed the issue and see that my reply and the comment I was replying to (by another user) are both gone. Hopefully it's eventually consistent and these comments will reappear later.
"message": "internal server error"
Does anyone have luck? Any workaround to fix it?
EDIT: Seems to be a routing issue. I've enabled a UK VPN and it's working fine now.
Sorry, was confused.
For engaged, happy engineers it's the equivalent of getting a surprise snow day. When you're grown up and have to go dig your car out of the snow, it's just a normal day with extra steps.
Not if you self-host Git
Self-hosting everything else GitHub does is harder. Which is why they are building out all of those things, they don't want people to move to other places so easily.
Hopefully these constant outages make more developers pissed off that issues are not stored in git as well, and get them working on tooling to solve this shitty problem once and for all.
P2P/Local First software for everyone! \o/
You can self-host the whole of GitHub can’t you?
What I'm talking about is being able to access everything like issues, wikis, PRs and whatever, even when you're 100% offline.
edit: oopsie I misread.
Not a huge problem, unless it lasts for hours or, gasp, days.