Tell HN: GitHub is down again
442 points by pupdogg 53 days ago | 272 comments
Yet somehow https://www.githubstatus.com is ALL GREEN! smh



It's amazing how much stuff breaks when GitHub goes down. I'm doing some Rust coding right now, and the rust-s3 crate tells me that to look up which features I need to enable (tokio + rustls), I need to look at its Cargo.toml. Well, the repo won't load, and I can't clone it either. Okay, fuck that, I can use the default dependencies. But no, wait, I can't: I can't even do a cargo build, because cargo uses a GitHub repository as the source of truth for all crates. No more Rust for me today :(
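
For the record, all I was trying to write is a features override in Cargo.toml, something like the sketch below; the feature names are just the "tokio + rustls" hinted at by the docs, and the whole problem was that I couldn't check the real ones in the repo:

    [dependencies]
    # feature names are illustrative; the authoritative list is in rust-s3's own Cargo.toml
    rust-s3 = { version = "0.28", default-features = false, features = ["tokio", "rustls"] }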


In the 1970s-80s we managed with email lists and tarballs on multiple mirrors. Git itself decentralizes source control, and yet we all want to use single-point-of-failure Github.

Anyone remember when Github took down youtube-dl?

I wonder how much Big Brother data Microsoft is gathering on Github developers. "Oh, as part of our hiring process, just like we scan your FB and Twitter, we also scan your Github activity to evaluate your performance as best as our machine learning overlords can assess it. Do you create Github projects and then abandon them when unfixed issues accumulate? Our algorithm thinks you're a bad hire."


> Anyone remember when Github took down youtube-dl?

I would blame the law (DMCA), not those forced to abide by the law (Github)


The law also permits content to be restored immediately upon receipt of a counternotice. Somehow the data silos never get around to supporting that.


GitHub can't, at least until public support is in their favour, because legally it's hot water. This is not your ordinary DMCA copyright complaint; this is Section 1201, concerning circumvention, which is held to a different standard. This is essentially the RIAA telling GitHub "if you don't stop this, we'll see you (not the YT-DL developers) in court".


> GitHub can't, at least until public support is in their favour

Public support on which grounds?


I can’t see any evidence a counter notice was filed until the EFF got involved, at which point GitHub immediately reinstated YouTube-dl.

Did GH ignore a counter notice?


twitch actually does do that IIRC, not sure about others


Laws need teeth to be enforced. It used to be illegal to harbor runaway slaves, or to be a Catholic priest in Protestant England. So people built cleverly hidden Priest Holes[1] in houses to hide. But you can't really build the equivalent of a Priest Hole within Github, because Github is not your house, but rather your lord's castle.

[1] https://en.wikipedia.org/wiki/Priest_hole


> blame the law (DMCA), not those forced to abide by the law (Github)

What about those abusing [a misrepresentation of] the law (the RIAA) and the members who pay their dues (Microsoft)?


It was an unfortunate situation. I don't entirely know how GitHub handled things before the Microsoft acquisition, but it also wouldn't surprise me if Microsoft's legal department is more risk averse than GitHub's was internally.

> I would blame the law (DMCA), not those forced to abide by the law (Github)

It seems that even among US-based companies, some are a little more "trigger-happy" with DMCA and other claims than others.


Especially because they worked to try and make sure this wouldn't happen again.


There was actually a movie about this very thing. Miguel De Icaza had a cameo. The villain company was a thinly veiled replica of Microsoft, and the CEO looked like Bill Gates. ( https://www.imdb.com/title/tt0218817/ )

In another thread a few weeks ago about code signing, I mentioned we should just hand Apple the source code for the apps it distributes in the store, to make review more thorough and reliable. Replies objected "Why would you give your source code to another company?"... and yet GitHub is still a thing? ( https://news.ycombinator.com/item?id=28794243 )


Some of the features in that movie would work nowadays, such as paintings detecting a visitor and changing based on the visitor or their mood or sound input (don't remember exactly). Wasn't there also something with satellites?

Also, ironically, De Icaza ended up working for... the company he wanted to work for all along, ...Microsoft. Though Friedman left Microsoft (or at least stepped down as GitHub CEO).

De Icaza was also featured in the Finnish movie The Code [1].

[1] https://en.wikipedia.org/wiki/The_Code_(2001_film)


It was still rad back in the 1990s and 2000s; then came people who think that if they repeat a bullshit sentence loudly enough and pushily enough, people at large will equate it with truth: "Email is too inconvenient to use".

I'd say those people do have a point about how manipulation tactics work, but email being inconvenient is bullshit.


What's inconvenient is using proprietary Electron apps just for communicating, while not being able to communicate on my phone without downloading and installing a proprietary app.

On the other hand, you have email: open-source clients are available for virtually every platform, it doesn't turn my laptop into an electric bar fire, it doesn't drain my phone's battery (or if it does, I could just install another client!), etc.

It's quite sad how something as beautiful as email is being replaced by corporate garbage like Teams, which comes bundled with Office 365 (and that's a whole 'nother crapfest). Uch.


> Do you create Github projects and then abandon them when unfixed issues accumulate? Our algorithm thinks you're a bad hire."

This data is public and to be honest I’m amazed we haven’t yet seen a YC startup promoting that they do exactly this.


Shhhh. The former Klout folks are probably listening.


Don't create a Github account with your name then?


Yeah seems like this would be a simple workaround if companies did start adopting this practice


> Git itself decentralizes source control, and yet we all want to use single-point-of-failure Github.

This is pretty much why many organizations out there, as well as I personally for my homelab, use self-hosted GitLab instances: https://about.gitlab.com/

Though in practice there are a lot of other options out there, like Gitea (https://gitea.com/) and GitBucket (https://gitbucket.github.io/), though maybe less so for alternative source control systems (e.g. SVN has been all but forgotten, however that's a personal pet peeve).

Not only that, but I also utilize my own Sonatype Nexus (https://www.sonatype.com/products/repository-oss?topnav=true) instances to great success: for everything from mirroring container images that I need from DockerHub (e.g. due to their proposed removal policies for old images and already adopted rate limits), to mirroring Maven/npm/NuGet/pip/Ruby and other dependencies, so I don't have to connect to things on the Internet whenever I want to do a new build.

That not only improves resiliency against things on the Internet going down (apart from situations where I need something new and it's not yet cached), but also improves performance a lot in practice, when only the company servers need to be hit, or my own personal servers in the data center for my cloud-hosted stuff, or my own personal servers in my homelab for my own stuff.

Admittedly, all of that takes a bit of setup, especially if you happen to expose anything to the web in a zero-trust fashion (permissible for my own stuff, as long as I'm okay with manually managing CVEs just to probably get hacked in the end anyway, but definitely not something that any corporation with an internal network would want to do), but in my eyes that's still worth the effort, if you value being in control of your own software stack and the ecosystem around it.

It's probably much less worth it if you don't see that as a benefit and don't want to be the one responsible for whatever project you're working on getting hacked, e.g. if you'd fail to patch out the recent GitLab CVE where ExifTool could execute arbitrary code, which is probably the case if you don't have the resources to constantly throw at maintenance, in comparison to companies with 100x - 1000x more resources than you have for that sort of stuff.
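
For anyone wondering how low the barrier to entry is for the lighter options: a single-node Gitea, for example, can be brought up with one container (a rough sketch; the ports and volume name are just examples):

    docker run -d --name gitea \
      -p 3000:3000 -p 2222:22 \
      -v gitea-data:/data \
      gitea/gitea:latest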


But that data is available to everyone because it's mostly public?


Cargo has an --offline option. It's actually quite possible to use Rust totally offline; the documentation can be built, for instance, then served locally with (IIRC) cargo doc.
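
For example (a sketch, assuming the dependencies are already in the local Cargo cache):

    cargo build --offline
    cargo doc --offline --open   # builds the docs from the local copies and opens them in a browser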


Offline should be the default, not an option.

Otherwise nobody will notice how fragile their workflow is until it is too late to fix it!


I recently realized I was the only person running the API locally when suddenly all our unmocked tests started failing only on my machine.

All other tests, including the ones on the CI server, were connecting to the hosted integration environment for any unmocked call (and thus randomly modifying state there).


I build and host documentation on every commit anyways in the CI. And yes, that is true, I eventually figured it out (had some issues with it at first) but it seems like GitHub is back up so all's well anyhow. I do however wish that there was some public mirror that cargo could fall back to, wouldn't that make a lot of sense?


It would be good if such things were hosted in a decentralized way, on something like, say, IPFS.


> It's amazing how much stuff breaks when GitHub goes down.

That's right. Someone should really come up with a decentralized VCS. /scnr


IMO Git is decentralization-ready, but the rest of the infrastructure necessary to make it practical is not widely available or in use. The necessary peer-to-peer infrastructure and networks of trust are still not a solved problem, or if they are, they for some reason aren't popular enough to be in wide use.


It's widely available and widely used, just not as widely used as github.

Setting up your own git server is not hard, but it's not as easy as just getting github or gitlab to run it for you. Way too many people take the easier path, even though the harder path is not actually that hard.

There are also multiple solutions.
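
The bare-bones version really is just SSH plus a bare repository, something along these lines (host name and paths are placeholders):

    # on the server
    ssh git@git.example.com 'git init --bare repos/myproject.git'

    # on your machine
    git remote add myserver git@git.example.com:repos/myproject.git
    git push -u myserver main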


> Setting up your own git server is not hard

Bullshit. Getting it set up in a secure and public way is an order of magnitude or two more difficult than it would need to be for mass use which is what we're talking about here.

When you have a situation where the vast majority of people find something too difficult, the problem (from the perspective of actually getting it to happen) is not the people; the problem is with the proposed solution.

> There are also multiple solutions

Multiple solutions does not generally make things easier/better even if it's desirable for other reasons.


It’s ok for small private projects, but user management is hard. You can’t just let any rando have commit privileges.


that's what things like gitlab (not the website) or gitea are actually for.


You probably meant distributed decentralized VCS.


Besides `--offline`, you can also use `cargo vendor` to put all the dependencies into a folder that can be committed alongside your project. Useful when you don't want to rely on an external fetch every time!
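
Roughly like this (a sketch; `cargo vendor` prints the exact config snippet to add when it finishes):

    cargo vendor
    # then tell Cargo to use the vendored copies via .cargo/config.toml:
    #
    #   [source.crates-io]
    #   replace-with = "vendored-sources"
    #
    #   [source.vendored-sources]
    #   directory = "vendor"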


Both should be the default.

Rust has a robust memory model, but everything else about it insists on copying the fragility of the NPM ecosystem.

The recent hoopla around a bunch of Rust mods quitting revealed that my suspicions are precisely true — key Rust staff also sit on the NPM board!


Cargo is the one thing keeping me off the rust ecosystem. The fundamentals of the language are great, but the tight coupling of the rust language with cargo's package management really irks me - it introduces as many correctness and security problems as the memory model solves.


What languages have package management systems that solve those problems? Cargo does have options to fix these, like mentioned in other comments. I'm not convinced they should be defaults.


I'm suggesting that "languages that have package managers" is generally the problem. I think the go package manager gets the closest to solving these problems thanks to its reliance on one trusted source that has STRONG incentives to make sure that packages are available and trusted, but fundamentally, a language and a package manager are very different products, and I don't want them to be bundled.


You can view the source of a crate on docs.rs (see [1] for the Cargo.toml of rust-s3). Also I am pretty sure cargo only depends on GitHub for authentication for uploading crates and not for the actual contents. Trying to build an empty crate with rust-s3 as a dependency right now seems to work fine.

[1]: https://docs.rs/crate/rust-s3/0.28.0/source/Cargo.toml


As I understand it, the crates themselves are not stored on github, but the crate index is, as it uses a git repo to get "free" delta compression and auditing.
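
Which also means that, if GitHub is what's down, you can point Cargo at a mirror of that index via source replacement; something like this in .cargo/config.toml (the mirror URL is hypothetical, you'd need an actual clone of crates.io-index somewhere):

    [source.crates-io]
    replace-with = "index-mirror"

    [source.index-mirror]
    registry = "https://git.example.com/mirrors/crates.io-index.git"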


I can't stand builds that just reach out to the internet for things.


Yeah, it seems like basic hygiene, especially given the supply chain attacks, but it's also a lost cause. No one has the skills to do builds without the internet. Even forcing teams to use an allow list for internet access involves fighting a lot of angry people.


One of the few decent uses for containers is to enforce proxied internet so build process artifacts can be auto-stored for subsequent builds.

For the worst offender I am aware of, try building a Flutter project... it silently pulls artifacts from the internet from at least three different packaging systems (node, CocoaPods, Android packages), all of which have caused hard-to-debug failures.
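
A cheap way to at least surface the silent fetches is to take the network away from the build and see what screams, or to force everything through a proxy you control (a sketch; the proxy host is a placeholder, and not every tool honors the proxy variables):

    # fail fast if the build tries to reach the internet at all
    docker build --network=none -t myapp .

    # or route whatever does respect HTTP(S)_PROXY through a caching proxy
    docker build \
      --build-arg HTTP_PROXY=http://proxy.internal:3128 \
      --build-arg HTTPS_PROXY=http://proxy.internal:3128 \
      -t myapp .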


How do you build software in 2021 then? Do you write everything from scratch?


One can use the packages provided by the distribution. That's how it has worked forever for things like C/C++.

For example, distros like Debian even mandate that you build software for them relying only on things that are already packaged. No arbitrary internet downloads are allowed during the build. The build environment contains only what is provided by the stated build-dependencies of your package, which are themselves of course Debian packages, plus some common minimal build base (like a shell and some utils).


What if the package you want isn't provided by the distribution?

What if the distribution's version doesn't support the features you need?

What if it can't be built by relying only on things that are already packaged?

How do you distribute your software so it runs on other distributions? Do you maintain a different build for each package manager?

What if you want to run on platforms that don't have standard package managers, like MacOS or Windows?

How do you update the packages without internet downloads during provisioning the build?

The C/C++ model has proven to be fatally flawed. It hasn't worked, that's why modern languages eschew reliance on the distributions' package managers, and greenfield C/C++ projects use the same model.

I'd go so far as to say this model is a key reason why we need containerization to deploy modern software today - since you can't trust software written on distro X runs on distro Y because it packages dependency Z differently all the way up to the root of the dependency tree (glibc, usually).

The fundamental flaw of this model is that it inverts the dependency tree. Build time dependencies are local to individual projects and need to be handled separately from all other software. With few exceptions, Linux distros make this mistake and it's why we can't rely on distro package management for new projects.


> What if the package you want isn't provided by the distribution?

You package it.

> What if the distribution's version doesn't support the features you need?

You package the version that does.

> What if it can't be built by relying only on things that are already packaged?

You package the dependencies first.

> How do you distribute your software so it runs on other distributions?

Give them the source so they can package it.

> Do you maintain a different build for each package manager?

Yes, that's the whole idea behind distributions.

> What if you want to run on platforms that don't have standard package managers, like MacOS or Windows?

Well, there you downloaded random things from the internet anyway…

But nowadays even those systems have package management that can be used!

> How do you update the packages without internet downloads during provisioning the build?

It's not about disabling package downloads (which come from a trusted source, btw).

It's about disabling downloads from random places on the internet.

Also you can use a local package mirror. That's what the build systems of distributions do.

> The C/C++ model has proven to be fatally flawed. It hasn't worked, that's why modern languages eschew reliance on the distributions' package managers, and greenfield C/C++ projects use the same model.

Well, except for all packaged software out there…

> I'd go so far as to say this model is a key reason why we need containerization to deploy modern software today

Nobody needs that. Software from packages works just fine. All the Linux installations around the globe are proof of that fact.

> since you can't trust software written on distro X runs on distro Y because it packages dependency Z differently all the way up to the root of the dependency tree (glibc, usually).

Of course you can. It works exceptionally fine. After you packaged it.

> The fundamental flaw of this model is that it inverts the dependency tree. Build time dependencies are local to individual projects and need to be handled separately from all other software. With few exceptions, Linux distros make this mistake and it's why we can't rely on distro package management for new projects.

More or less every distro does it wrong? And you know how to do it correctly?

Maybe you should share your knowledge with the clueless people building distributions! Get in touch with for example Debian and tell them they need to stop making this mistake.

By the way: How does your software reach the users when it's not packaged?

At least I hear "sometimes" that users demand proper software packages. But maybe it's just me…


I think you're conflating package management for distribution and package management as a build tool. This is the flaw that Linux distributions make. It's why flatpaks and snaps exist. On MacOS and Windows you distribute software as application bundles similar to flatpak/snap.

>More or less every distro does it wrong? And you know how to do it correctly?

Yes, exactly. For distribution you use snap and flatpak, or application bundles on MacOS and Windows. For building you use a package manager that can install vendored dependencies pinned to specific versions, and do it locally to individual projects so they are not shared across multiple projects. This is the tack taken by modern tools and languages.

It's not my idea! It's what everyone building software in the last decade has migrated to.

Packaging software so it can be used by any distro's package manager is not viable, and reliance on such an outdated model is why using Linux sucks for everything but writing software to run on a single machine.

> At least I hear "sometimes" that users demand proper software packages. But maybe it's just me

And when they do, you almost never go through the official/trusted mirrors because it will never be up to date; you host your own (maybe put up some signature that no one ever checks) so they can just `sudo add-apt-repository ; sudo apt-get install -y` in your install instructions.


> It's not my idea! It's what everyone building software in the last decade has migrated to.

Well, everybody besides the biggest free software distributors out there, which are Linux distributions…

> Packaging software so it can be used by any distro's package manager is not viable, and reliance on such an outdated model is why using Linux sucks for everything but writing software to run on a single machine.

Yeh, I get it. That's why we need Docker…

Nobody is able to distribute software otherwise.

Well, except all those "distributions"…

> And when they do, you almost never go through the official/trusted mirrors because it will never be up to date; you host your own (maybe put up some signature that no one ever checks) so they can just sudo add-apt-repository ; sudo apt-get install -y in your install instructions.

Wait a moment. Does this mean someone managed to build packages that work on different versions of even different distributions, which is obviously impossible, as we just learned?

This starts getting hilarious!

I'm sorry for you that nobody wants to listen to your great ideas. Have you ever considered that you're completely wrong?


> Packaging software so it can be used by any distro's package manager is not viable, and reliance on such an outdated model is why using Linux sucks for everything but writing software to run on a single machine.

Sorry, could you elaborate on why Linux "sucks" for anything but single-user, please? I haven't heard this argument.


... yes? I've often been ridiculed for wanting to build everything in-house, but events like these just validate that sometimes, to get the most reliable and best software, you'll have to reinvent the wheel a few times.


To be honest that's deserving of ridicule. Reinventing wheels is how you introduce fragility and instability, not prevent it.


This is why a lot of companies use a product like JFrog Artifactory to handle dependencies. Instead of trying to download files from GitHub and other random places on the internet, the Artifactory instance is a local mirror of all of those files, with services running on it for the relevant ecosystems, such as an npm repo of packages, and so on.

This way, if GitHub falls off of the face of the earth tomorrow, the packages still build. Also: If some open source maintainer deletes an older version of a package, replacing it with a new one with a different API, the local package will still build.

(The local build process will have to be set up to use the local artifactory to pull dependencies, of course.)
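
The client-side part is usually just pointing each package manager at the mirror instead of the public registry, e.g. for npm (the URL is a placeholder for wherever your Artifactory npm repository lives):

    npm config set registry https://artifactory.example.com/artifactory/api/npm/npm-virtual/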


Should be fairly easy for Cargo to solve this: why doesn't it already mirror its source of truth to GitLab and other Git hosters?


Probably because the level of effort required to configure that and then monitor/manage it continuing to operate doesn't make sense when GitHub has an outage or two a year.


I was working on my Nix config. I had just added a small command line utility and wanted to install it, but then got 504 errors from github.com. Annoying!


I guess same story for Go!


I think Go dependencies should still work, thanks to Google's module mirror[0] (enabled by default), which has a cache.

[0]: https://proxy.golang.org/
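
The proxy is on by default in recent Go versions; you can check or set it explicitly (the value below is the documented default):

    go env GOPROXY
    # typically prints: https://proxy.golang.org,direct

    # restore the default explicitly if it was overridden somewhere
    go env -w GOPROXY=https://proxy.golang.org,direct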


The ones using GOPROXY=direct will fail, although it’s used rarely.


In Go, it's customary to use 'go mod vendor' to put your dependencies into the main repository. While it's not universally recognized as a good technique, it saved the adopters of this approach from today's downtime.


These days, with go mod and the Go team maintaining their proxy, there is very little benefit to vendoring, and any benefit is not worth blowing up the size of your repos and messing up code reviews on PRs that introduce new dependencies.


I’m not sure this is customary at all - I rarely encounter a vendor directory anymore.

Note you also need to build differently if you go this route: `-mod=vendor`, otherwise the directory will be ignored in modern Go.


I've never had to use -mod=vendor, so I just looked it up:

        -mod mode
                module download mode to use: readonly, vendor, or mod.
                By default, if a vendor directory is present and the go version in go.mod
                is 1.14 or higher, the go command acts as if -mod=vendor were set.
                Otherwise, the go command acts as if -mod=readonly were set.
                See https://golang.org/ref/mod#build-commands for details.
If there's a vendor directory, it's used by default. As for my two cents, I use it frequently when building Docker images so that I don't have to pass secrets into the image to clone private modules (but I don't check the vendor directory into Git).
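
In other words, the vendoring happens on the host, where the credentials already exist, and the image build itself never needs the network (a sketch of the flow rather than an exact Dockerfile):

    go mod vendor    # runs on the host, using your existing git credentials for private modules
    docker build .   # the Dockerfile's build step then only needs: go build -mod=vendor ./...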


With more outages of our amazing, but overly centralized development ecosystem, the popularity of this approach will likely surge. It helps that Go supports the vendoring workflow as it makes the choice practical.

As for building with a flag: true, but very minor, as it's rare to execute 'go build' directly. In most projects I've seen, it's either Bazel, or some kind of "build.sh".


Maybe the key point is to choose consciously and pick the option that gives the best combinations of tradeoffs for your situation vs just doing what is easy or copying what other people are doing without understanding you're making a decision with various tradeoffs and consequences. Tradeoffs that are a good fit in other contexts may be a poor fit for your situation.

If one of the goals of your build process is to be able to guarantee reproducible builds for software you've shipped to customers, and you depend on open source libraries from third parties you don't control, hosted on external services you don't control, then you probably need your own copies of those dependencies. Maybe vendored into version control, maybe sitting in your server of mirrored dependencies which you back up in case the upstream project permanently vanishes from the internet. But setting up and maintaining it takes time and effort and maybe it's not worth paying that cost if the context where your software is used doesn't value reproducible builds.


Google takes care of storing copies of any go dependency you use on their proxy, there is very little reason for you to maintain your own via vendoring. Maybe if you are a big enough organization you run your own proxy as an extra layer of safety above google but still I don't see the value of vendoring these days.


Go uses a proxy for downloading modules, so there's no Github involved. And you could run your own proxy-cache if you wanted. In addition, your work machine has a local proxy cache of the modules you already downloaded.

Go doesn't use a repository as a single source either, which is another problem in and of itself.
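
For the self-hosted variant, you point GOPROXY at your own proxy-cache (Athens is one such proxy; the host name here is made up) and keep `direct` as a fallback:

    go env -w GOPROXY=https://athens.internal.example.com,direct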


Here I am on a Saturday afternoon, learning how to programmatically interact with GitHub using go-git, banging my head on my desk because the code should work but I'm getting cryptic errors. I'm searching Stack Overflow, nobody seems to have encountered the errors I'm getting (that typically means I'm doing something wrong)...... oh, GitHub is down. To GitHub's credit, they've been very reliable for me over the years; it didn't even cross my mind that they could be down.


> It didn't even cross my mind that they would be down.

As soon as you start banking on "the cloud" to get useful work done, the very first thing that should enter your mind is:

"Any of these nifty services can just turn into a pumpkin at any time"


Much like we have the "Fallacies of Distributed Computing"[1], we probably need (if somebody hasn't already created) a "Fallacies of The Cloud".

Fallacy #1 - The cloud is reliable

[1]: https://en.wikipedia.org/wiki/Fallacies_of_distributed_compu...


The real kicker is that they're still other people's computers. There's definite benefits as well as costs. Ignoring either is not a good idea.

Centralisation into just a few large providers is another bad idea we're heading towards or have already arrived at.


"The cloud is just other people's computers." is the most succinct way I've heard it.


> "Any of these nifty services can just turn into a pumpkin at any time"

Which is just as true for on premise servers.


If something you control, manage and backup goes down, you can fix it.

It won't go away on its own, or, like Travis, make you dependent on it and then get bought out by VCs who milk your dependency for money.

When it goes down, it only impacts your jobs until you solve it; it doesn't single-point-of-failure the whole internet and whole programming languages.

Your self-sufficiency also helps stop all software endeavour from concentrating in the hands of one monopolistic American company.


Yup - it's a difference of both degree and kind.


99.99% uptime means something is down 0.01% of the time. That's roughly an hour every year. It wasn't down at the time of checking, so that seems about right. I wonder what the combined effect is of many different cloud services each being down an hour every year.


> 99.99% uptime

Until they get bought, or simply discontinue / modify the product, and/or you fail to read the fine print on the ToS. Or they simply lie.


I very regularly do stuff in containers from which I've disconnected the network... It's amazing how bad most error messages are when something depends on the Internet and it's not reachable. If you're trying to connect to the net and it fails, you may as well then try a ping or something (you were going to use the net anyway) and tell your user that the machine apparently cannot connect to the Internet.

Heck, maybe have some helper for outputting error messages, doing something like:

   log.error("Could not download fizbulator v3.21.4", ping-btw)
Where "ping-btw" would try a ping and add the message: "by the way ping to 4.2.2.1 didn't answer after 300 ms, maybe your connection is down?".

Something like that.
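
In shell terms the idea is roughly this (a sketch only; the download URL and the 4.2.2.1 resolver are stand-ins):

    curl -fsSLO https://example.com/fizbulator-3.21.4.tar.gz || {
      echo "error: could not download fizbulator v3.21.4" >&2
      ping -c 1 -W 1 4.2.2.1 >/dev/null 2>&1 \
        || echo "by the way: 4.2.2.1 didn't answer within ~1 second, maybe your connection is down?" >&2
    }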


Perhaps you invoked the API in an incorrect way, and brought the service down.


Ah! I am brand new to learning flutter and was trying to change the Flutter channel with the command:

    flutter channel master

And it kept failing with:

    git: remote: Internal Server Error.
    git: remote:
    git: fatal: unable to access 'https://github.com/flutter/flutter.git/': The requested URL returned error: 500

    Switching channels failed with error code 128.

I thought I was doing something wrong and spent some time troubleshooting.


"internal server error" and "error: 500" are EXTREMELY strong clues that the problem is server-side.

there aren't a lot of HTTP error codes, and it's worth remembering the most common ones, like the 200, 300, 400 series and 500 itself. will help you troubleshooting later.


If you urgently need to retrieve a piece of software, it's likely archived in Software Heritage: http://archive.softwareheritage.org/


This is fantastic! It also has some projects I thought had been lost to the mists of time when Bitbucket stopped hosting Mercurial repos.


I ended up here because automated cluster deployments failed trying to download releases from GitHub... I wonder if that software is served there and I can update the URLs before GitHub fixes their issues :)


Maybe companies should alert on an increased amount of traffic to their status pages.


That might be a good idea as a last-resort measure, but if you're only finding out about problems because customers are telling you about them (even indirectly, through a signal like this), your monitoring and alerting are woefully inadequate.


Fun fact: no matter how quickly the automated system is set up to alert, customers are faster. No one knows how that is possible.


Yup, that's another reason why it pays off having a direct line of communication to at least some of your top customers.

Can't tell you how many times I've had a customer write to me, only to be receiving some automated alerts 30-40 seconds later.


And the best customers for this are the pickiest annoying ones


Generally I agree. Humans are much better at recognizing patterns than machines.

But as a counterpoint I was once working in a project where our monitoring dashboard showed an anomaly in incoming traffic. It turned out that it was an ISP problem and that we were the first ones to notice according to them.

So maybe part of the answer as to why customers are faster is that different monitoring systems are monitoring each other :)


It's not a last resort; you should do both, because you're alerting on different things.


I’ve seen alerting based on the number of Twitter mentions too


I think this is how downdetector works, it just looks for tweets that are complaining about something not working.


I'm fairly confident they just count a site as down based off the popularity of the page.


That's honestly a good idea. Does anyone do that?


A telco I used to work for did this, like, two decades ago, but even better: they mapped incoming support calls (customers) to stations, and if more than N came in during a certain period for the same DSLAM, it triggered some kind of alert.

The same thing happens (or should happen) when you visit your ISP's website and look for registered downtime: many requests from one zip code from multiple IPs should trigger an alarm, if the ISP is competent.


In my last job our main “is everything okay?” monitoring dashboard had RPS, latency, 500 rate etc and a line for the number of user issues reported per N requests served. We didn’t alert on this but it was a useful way to detect issues that were subtle enough not to trip up our main alerts.


My company does. :)


I'm surprised more folks don't monitor their vendors via uptime checkers - worth knowing if you'll be able to get anything done that day.


Close that loop: Once the traffic goes above a threshold transition it to red!


They should also monitor HN submissions that mention GitHub. Judging from historical patterns, it is probably faster than monitoring traffic on the status page or even Twitter.


At least this outage allowed me to discover this cute pixel art octocat loading icon on the activity feed. Never noticed it before because I believe it always loads near instantaneously, or maybe I just never paid enough attention to it.

https://i.imgur.com/6Uwlwh7.gif


Is it possible for GitHub to mirror the releases in multiple different places? (They likely do that, but I mean complete isolation, where an outage like this doesn't break the downloads.) Maybe something like a proxy in front of an object store, so it is a little more reliable (a setup like that should have fewer moving and custom parts).

So in a moment like this, you could convert https://github.com/opencv/opencv/archive/4.5.3.zip to https://archive.github.com/opencv/opencv/archive/4.5.3.zip. Maybe an implicit agreement of somewhat-stale data by adding the sub-domain "archive.", with sync times kept low on a "best effort basis".


Anybody could do that, I suppose. (I mean for the public repos).


Can't they at least fix their status page? https://www.githubstatus.com/ It returns `All Systems Operational`. I mean what's the point of having a status page if it returns wrong info?


There is a point. Even two.

1. It clearly indicates that automatic systems are failing to detect the outage.

2. It also indicates that no one is aware of the incident to manually signal the outage (or that there is no manual override).

Basically, it makes a difference between "yeah, shit happened, we know (and maybe working on it)" and "hah, they don't even know themselves".


Almost no one has automatic status pages anymore.

Partially because these large systems have some kind of ongoing issue at any given time, so it's challenging to provide a meaningful live status that isn't a bit misleading and could cause misdirected panic.

Partially because you don't want to give potential attackers (eg ddos) any insight if/how their methods are affecting your systems.

Partially because there are SLAs and reputation at risk, and you don't want to admit to any more downtime than you absolutely have to.


If you had a really robust system, it'd be fun to just slap a read-only mirror of your internal metrics dashboard onto the public Internet for anyone to browse and slice as they please. It'd be a brag, kinda.

Of course, in the real world, I don't think there's a single IT admin who didn't just start nervously sweating and pulling at their collar after reading the above, imagining it as something their CEO said and encouraging them to target it. Nobody can really do this — and that in turn says something important about how those metrics really look in most systems.



FWIW, I've worked on systems that have internal SLAs orders of magnitude higher than what they promise to the public. I think it's more just that there's no advantage to doing something like this as long as none of your competitors are. The status quo is that people's systems are really opaque and vastly underpromise what they should be capable of, and in exchange you get to absorb some unplanned downtime due to issues with underlying systems that you have little control over.


Wikimedia does this https://grafana.wikimedia.org/


Yeah, it seems the company status page has left its infancy stage, when companies were actually honest about their outages. A bit similar to what happened to online reviews.


Most of these overarching status pages are manually run, intentionally, by design.


Status pages are almost never automated these days, they cause more problems than they solve.

To be fair, redditstatus.com is quite nice with their sparkline headline metrics. It at least lets you know _something_ is happening even if they haven't yet declared an incident.


These things don't/can't get updated instantly. I was doing some work as of ~5 minutes ago and it was working fine, and is unavailable now. If it is a major outage it will likely be updated shortly.


My status page for an open source multiplayer game updates the status page for each server on a minute by minute bases.

They can do better, they just don't want to.


No, I don't think they can. An MMO is a very simple system, in that there's only one Service Level Indicator (SLI) that devs, shareholders, and players all agree on. That SLI is "can a player connect to the server, and perform regular gameplay actions, without a ridiculous amount of per-action latency."

GitHub, meanwhile, is a million different things to a hundred million different people. Users of e.g. Homebrew, with its big monolithic ports system hosted as a github repo, have a very different SLI for Github than do users of some language-ecosystem package manager that allows you to pull deps directly from Github; than do people who depend on GitHub Actions to CI their builds on push; than do people doing code-review to others' PRs; than do people using Github mostly for its Wiki, or Issues, or downloading Releases, or Github Pages, or even just reading single-page-with-a-README repos, ala the various $FOO-awesome projects.

For many of these use-cases, Github isn't degraded right now. For others, it is.

If you ask for Github (or any service with this many different use-cases and stakeholders) to measure by the union of all these SLIs, then the service would literally never be not-degraded. In systems of sufficient scale, there's likely no point where every single component and feature and endpoint of the system is all working and robust and fast, all at once. Never has been, never will be.

And anything less than just going for the union of all those SLIs, is asking Github to exercise human judgement over which kinds of service degradation qualify as part of their own SLOs. Which is exactly what they're doing.

Certainly, internal to services like this, there are all sorts of alerting systems constantly going off to tell SREs what things need fixing. But not all of those things immediately, or even quickly, or even ever, translate to SLO violations. There are some outlier users whose use-cases just break the system's semantics, where those use-cases just aren't "in scope" for the SLO. As long as those users are only breaking the system for themselves, the degradation they experience won't ever translate to an SLO breakage.


You seem to be applying different rules to MMOs and Github, and I don't understand why. I'd say that there are many ways of looking at this; there exist complex MMOs; and one could look at Github from the point of view of an average user.

E.g., a bit tongue in cheek:

> An MMO is a very simple system, in that there's only one Service Level Indicator (SLI) that devs, shareholders, and players all agree on. That SLI is "can a player connect to the server, and perform regular gameplay actions, without a ridiculous amount of per-action latency."

Wouldn't you say that in an MMO of sufficient scale there's likely no point where every single component and feature and endpoint of the system is all working and robust and fast, all at once?

> In systems of sufficient scale, there's likely no point where every single component and feature and endpoint of the system is all working and robust and fast, all at once. Never has been, never will be.

Couldn't we redefine SLIs as "can the user connect to the server and perform regular user actions without a ridiculous amount of per-action latency"?


> and one could look at Github from the point of view of an average user.

My point was that Github has no "average user." Github is like Microsoft Word: each user only uses 10% of the features, but it's a different 10% for every user. Yes, there are some features that are in the critical path for all users (loading the toplevel repo view in the website might be one); but for any given particular user, there will be plenty of other features that are also in their critical path.

An MMO, meanwhile, does have an "average user"; in fact, MMOs have ideal users. An MMO's goal is to induce every user (player) to play the game a certain way, so that the company can concentrate their resources on making that particular play experience as polished as possible. There is, per se, an idiomatic "rut" in the road that players can "click into", ending up doing exactly the same short-term game loops that every other player before and after them has also done when playing the game.

MMOs can be reduced to a single SLO: can the ideal player have fun playing the game at the moment?

GitHub cannot be reduced to a single SLO, because there is no ideal user of GitHub. There are probably two or three thousand separate "ideal users" (= critical, non-universal user stories) for GitHub.

> Wouldn't you say that in an MMO of sufficient scale there's likely no point where every single component and feature and endpoint of the system is all working and robust and fast, all at once?

No, not really; MMOs have a complexity ceiling by operational necessity. They aren't composed of a ridiculous sprawling array of components. They might use Service-Oriented Architecture, but in the end, you don't scale an MMO vertically by making more and more complex clustered systems with master-to-master replication and so forth. You scale MMOs by either pure-vertical hardware scaling, or by horizontal shared-nothing sharding.

(The key thing to realize about MMO servers is that they're OLTP servers — they need to track a whole bunch of users doing a whole bunch of simple actions at once; and therefore they can't really be doing overly-much computation on those actions, lest they lose shared-realtime verisimilitude.)


I think you underestimate the complexity of MMOs. They can host truly massive events, e.g. https://en.m.wikipedia.org/wiki/Battle_of_B-R5RB . With so many different play styles and optional components (guilds, pvp, official forums, paid content, user made content, etc), I’d say defining an ideal MMO gamer is just as easy as defining an ideal SaaS user.

Not sure if any MMO reaches Github level, most likely not. But I don’t think there is a ceiling or any sort of hard distinction; i.e. I think 5 years from now we could have a MMO with complexity of today’s Github. Maybe it will be called a metaverse though.


I should mention that I've worked as an infrastructure engineer on both MMOs and GitHub-like enterprise-y services.

EVE is literally the only exception to "MMOs scale by horizontal shared-nothing sharding"; and that's why I mentioned the option EVE uses instead — namely, "vertical scaling of hardware" (i.e. having a really honking powerful single-master multi-read-replica DB cluster.)

In neither case is anything "clever" (i.e. inefficient for the sake of developer productivity / enterprise integration / etc.) happening. There's no CQRS message queues, no async batch writes, no third-party services halfway around the world being called into, no external regulatory systems doing per-action authorization, no separate "normalized data warehouse for OLAP, denormalized data for runtime" forking writes, no low-level cross-replicated integration between a central multitenant cloud system and individual enterprise-hosted tenant silos, etc etc.

> With so many different play styles and optional components (guilds, pvp, official forums, paid content, user made content, etc)

I think you misunderstood me when I said that there's a rut that users are guided into. The thing about MMOs is that the ideal user uses all the features (because the game incentivizes doing so, and because the more deeply and broadly users engage with the game's systems, the higher their retention / lower their churn will predictably be.) The ideal player is in both a party (or constantly switching parties) and a guild; has paid for all the DLC and regularly buys cash-shop items; plays every piece of PVE content you build; engages in PVP content and co-op UGC content all the time; etc.

Which is to say, for the ideal user, "the game" either works or it doesn't, because "the game" is the whole thing. Every feature needs to work, in order for the game to work. Because the ideal user engages with every feature. The SLO is, essentially, "can you do a completionist run through every bit of content we have." (If you're clever, and can make your server deterministic, you can create a completionist run as a backend-event demo file and run it in CI!)

And this is, in part, why MMOs are kept architecturally simple. Everything needs to work!

(And I don't just mean "simple" in terms of the backend not being a sprawling enterprise-y mess, but rather usually a single monolithic binary that can keep a lot of state in memory. I also mean "simple" in terms of as much of the game as possible being pushed to local, mostly-ephemeral-state client-side logic. MMOs are, often, a lot less "online" than one might think. It's very hard to "break" an MMO with a content update, because most content updates are to zonal scripts whose state doesn't persist past the lifetime of the in-memory load of that zone in a particular interacting device.)

With GitHub, their ideal users — of which there are many — can be individually satisfied by very small subsets of the system, such that they're still satisfying almost all their users even if one system is horribly breaking. That's what an SLO is "for", in the end: to tell you whether different subpopulations of users with different needs are happy or not. If you only have one "core" subpopulation, one ideal user, then you only need one SLO, to track that one ideal user's satisfaction. If you have more, you need more.


I understand and I had a similar understanding with your earlier comment. I still disagree with: "The thing about MMOs is that the ideal user uses all the features". This seems not a property of MMOs, this seems just a way of working with MMOs.


Large orgs (like Github) don't want or use automated status updates. There is usually always some service having issues in complex systems and immediately updating a public status page does more harm than good, including false alarms which may not affect anything public facing.


If you don't want to report sufficiently small issues, you can put that into the code, can't you?

Besides that, how are you going to cause "more harm than good"?


More harm than good to the company's own long-term reputation.

A status page is a kind of PR. Think of it like a policy for a flight attendant to come out into the cabin to tell everyone what's going on when the plane encounters turbulence. That policy is Public Relations -driven. You only do it if you expect that it's positive PR, compared to not doing it — i.e. if telling people what's going on is boosting your reputation compared to saying nothing at all.

If a status page just makes your stakeholders think your service is crappy, such that you'd be better off with no status page at all... then why have a status page? It's not doing its job as a PR tool.


It seems to me you are now agreeing with the original « They can do better, they just don't want to. »


I read the line specifically as "The employees can do better; they just don't want to try."

But that's not true. The company could do better. But the individual employees cannot. The individual employees are constrained by the profit motive of the company. They are not allowed by corporate policy to set up automatic status updates, for about the same reason they're not allowed to post their corporate log-in credentials: that the result would very likely be disastrous to the company's bottom line.

(Though, really, the corporations in most verticals are in a race-to-the-bottom in most respects. Even if you treat GitHub as a single entity capable of coherent desires, it probably doesn't desire to avoid automatic status updates. It needs to avoid them, to survive in a competitive market where everyone else is also avoiding them. People — and corporations — do lots of things they don't want to do, to survive.)


There's lies, damn lies and SAAS status pages.


lolol. Clearly it's just degraded service. It seems The Management hasn't approved declaring it an outage, that would just ruin uptime metrics!


The status page is also down and is returning a false positive. If you check the status page's status page, it shows the status page is down.


I would not be impressed if they hear about the downtime from a trending thread on HN.


I don’t work for github, so on a personal note I was about to create an issue in a repo, and it wasn’t loading. My go to “is my router messed up?” check is to load HN because it’s so reliable and fast. And lo, the top post was about github being down!


Ha. I checked HN for the same reason. I was not able to reach GitHub via the university network and thought this was the time my university had messed up DNS.

HN seems to be going down under the massive amount of requests!


Honestly, HN is the best status indicator on the net right now.


They have a banner "Investigating - We are investigating reports of degraded performance for GitHub Actions. -- Nov 27, 20:43 UTC" which should suffice for a start.


Yes now it has been properly updated. However it was down for 10-15 minutes before their status page was updated...


It's updated now.


The only status pages that make sense are the ones maintained by a third party, not the owner. Also for technical reasons.


And legal reasons.


They are currently updating it one by one


...and firing off automated tweets per product.


> Yet somehow https://www.githubstatus.com is ALL GREEN! smh

This is because status pages, at least the green/yellow/red light bits, are usually updated manually by a human, because if you automate those things, the automation can fail.

Also, it's a weekend, a holiday weekend for some, so expecting updates to happen on the minute is a little unrealistic.

ALSO, it may take a bit of time for their "what is and is not really down" process to be followed. On the weekend. By people who maybe haven't done this before outside of a training event 9 months ago or something.

> somehow

It is not hard for me to imagine how.


Perhaps it would be useful to have a sort of different shade of green/yellow/red (perhaps striped?) indicating it's a value determined by automatic processes, not yet confirmed by a human.

At least that way, as a user of the service, you can make the assumption that it could be broken, and act accordingly, or at least take certain precautions while you wait for the human-confirmed status value.


Yeah, even though I defend the inability to change the status page instantly, it is clearly an imperfect system.


A company the size of Microsoft cannot hide behind "it's the weekend, we also need a break".

I assume a company this size has teams working around the clock; the question is the priority. I guess GitHub doesn't have a high enough priority, which is very sad.

I understand it's humans, but we're also humans, and we managed to figure out exactly what's up and what's down, and we even reported it very conveniently here on HN for the people at GitHub, for free.

Sorry for the rant; I spent a few very confused minutes trying to understand wtf I'm doing wrong, because naive me was thinking there's no way GitHub is down....


Anything can go down, and has. The only thing I know of that hasn't is the internet as a whole. Some parts go offline, but the internet itself, as a cohesive system, has never failed.

I bet it will one day, what with legislative "kill switches" being a thing.


Is it really that difficult to have a system run redundantly? (Not trying to be sarcastic here)


Can't access Vercel because I used Github for authentication. I guess I'm done with using centralized authentication services.


In case you or others find it interesting, I've had luck removing an authentication SPOF¹ by using IndieAuth on my personal site². I wish it were an option on more of the sites where I need to sign in.

¹ https://twitter.com/dmitshur/status/1223304521767669761

² https://github.com/shurcooL/home/commit/bb504a4ef0d7c552d363...


Try logging in via email. It'll send you a code to cross-check. Voila!


Most developers who operate high-risk or widely used code opt to use multi-factor authentication, which disables this feature.


I mean at Vercel. I have 2FA at GH but I can log in to Vercel using my GH email address via the email option on the Vercel website, after confirming the code on-screen versus the code sent to my email address.

