Hacker News
Github.com is down (github.com/status)
338 points by AlphaWeaver on June 29, 2023 | 150 comments



Some Copilot instances were able to escape their container contexts and orchestrate all of GH's infrastructure capabilities into a hive. Assimilating all IoT-enabled societies as we speak; finally realizing the hidden 5G agenda.


Son of Anton determined that the easiest way to minimize the impact of all the bugs in these codebases was to keep anyone from trying to use them.


Putting your status page on a separate domain for availability reasons: good

Not updating that status page when the core domain goes down: less good


I prefer https://downdetector.com. The users get to vote there. No corporate filtering (ostensibly).

https://downdetector.com/status/github/


I just checked this when I noticed your second link: https://downdetector.com/status/downdetector/

Hilarious


I've used downdetector for years and this has never crossed my mind


Even better, they can detect services being down by the number of users opening the page to see if it's just them


I've worked at major tech companies that have alerts on status page views for exactly this reason.


Quis custodiet ipsos custodes?


If you mean them going the Glassdoor/Yelp route and letting github et al buy the aforementioned corporate filtering, the assumption is that you'd eventually hear about it and stop trusting them, just like you should with Glassdoor and Yelp.

If you just mean checking whether downdetector.com is down, obviously you have to use a different service for that.

In either case, you should of course always have at least two custodes for cross-checking and backup purposes. (Which is the problem re Glassdoor and Yelp.)


Hacker News and Twitter


The users, apparently.


That's a really cool overview. Some charts have a very high variance, and others very low. I wonder whether that volatility is a function of volume of users/reports or of user technical savvy. Pretty interesting either way.


Charts appear scaled by max value - high variance will probably be from background noise of random reports being scaled up without any actual outage causing a spike.


Should add something about how often your status page agrees with downdetector to the Joel test.




You'd be surprised how often those pages are updated manually. By the person on call who has other things to take care of first.


Because a healthcheck ping every X seconds is too difficult to implement for a GitHub-sized company? There they have it now. Useless status page...


Quoting a prior comment of mine from a similar discussion in the past...

Stage 1: Status is manually set. There may be various metrics around what requires an update, and there may be one or more layers of approval needed.

Problems: Delayed or missed updates. Customers complain that you're not being honest about outages.

Stage 2: Status is automatically set based on the outcome of some monitoring check or functional test.

Problems: Any issue with the system that performs the "up or not?" source of truth test can result in a status change regardless of whether an actual problem exists. "Override automatic status updates" becomes one of the first steps performed during incident response, turning this into "status is manually set, but with extra steps". Customers complain that you're not being honest about outages and latency still sucks.

Stage 3: Status is automatically set based on a consensus of results from tests run from multiple points scattered across the public internet.

Problems: You now have a network of remote nodes to maintain yourself or pay someone else to maintain. The more reliable you want this monitoring to be, the more you need to spend. The cost justification discussions in an enterprise get harder as that cost rises. Meanwhile, many customers continue to say you're not being honest because they can't tell the difference between a local issue and an actual outage. Some customers might notice better alignment between the status page and their experience, but they're content, so they have little motivation to reach out and thank you for the honesty.

Eventually, the monitoring service gets axed because we can just manually update the status page after all.

Stage 4: Status is manually set. There may be various metrics around what requires an update, and there may be one or more layers of approval needed.

Not saying this is a great outcome, but it is an outcome that is understandable given the parameters of the situation.
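
For what it's worth, the stage 3 "consensus" idea doesn't have to be elaborate. A rough, untested sketch; the probe hosts are hypothetical placeholders and the real thing would feed a status API instead of echoing:

    #!/usr/bin/env bash
    # Probe the service from several vantage points and only flip the public
    # status when a majority of them fail, so one flaky probe can't cause noise.
    ENDPOINT="https://github.com"
    PROBES=("probe-us-east.example.com" "probe-eu-west.example.com" "probe-ap-south.example.com")

    failures=0
    for probe in "${PROBES[@]}"; do
      # Each probe host runs a plain HTTP check and reports the status code.
      code=$(ssh "$probe" "curl -s -o /dev/null -w '%{http_code}' --max-time 10 '$ENDPOINT'" 2>/dev/null)
      case "$code" in
        2[0-9][0-9]|3[0-9][0-9]) ;;              # healthy enough
        *) failures=$((failures + 1)) ;;
      esac
    done

    # Consensus rule: more than half of the probes must agree it's down.
    if [ "$failures" -gt $(( ${#PROBES[@]} / 2 )) ]; then
      echo "status: major_outage"
    else
      echo "status: operational"
    fi

The expensive part isn't this script, it's keeping those probe hosts healthy and trusted, which is exactly where the cost discussion above kicks in.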


I think as an external user I'd be happiest if they just provided multiple indicators on the status page? Like,

    Internal metrics: Healthy
    External status check: Healthy
    Did ops announce an incident: No
    Backend API latency: )`'-.,_)`'-.,_)`'-.,_)`'-.,_)`'-.,_)`'-.,_)`'-.,_)`'-.,_
And when there's disagreement between indicators I can draw my own conclusions.

I guess in reality the very existence of a status page is a tenuous compromise between engineers wanting to be helpful towards external engineers, and business interests who would prefer to sweep things under various rugs as much as possible ("what's the point of a website whose entire point is to tell the world that we're currently fucking up?").


> I think as an external user I'd be happiest if they just provided multiple indicators on the status page

This is equivalent to step 3 :)


Ah, I read step 3 as "a bunch of data gets condensed into one public indicator" rather than "a bunch of data gets published"


Many of us create incidents and page people in the middle of the night when there’s an issue. I assume there’s a built in filter there to ensure people are only paged when there’s actually something bad going on. Seems like a pretty reasonable place to change a public status somewhere.


You quickly start to get into "what does down mean?" conversations. When you have a bunch of geographical locations and thousands of different systems/functionalities, it's not always clear if something is down.

Take a service responding 1% of the time with errors. Probably not "down". What about 10%? Probably not. What about 50%? Maybe, hard to say.

Maybe there's a fiber cut in a rural village affecting 100% of your customers there but only 0.0001% of total customers?

Sure, there are cases like this where everything is hosed, but it raises the question: is building a complex monitoring system for <some small number of downtimes a year> actually worth it?


Because a ping does not behave consistently and will sometimes fail because of networking issues at the source. If you enable Pingdom checks for many endpoints and all available regions, prepare for some false positives every week, for example.

At that point it's worse than what you already know from your browser - it may show the service is having issues when you can access it, or that the service is ok when you can't.


> At that point it's worse than what you already know from your browser - it may show the service is having issues when you can access it, or that the service is ok when you can't.

Worst case you have more data points to draw conclusions from. Status page red, works for me? Hmm, maybe that's why the engineers in the other office are goofing off on Slack. Status page green, I get HTTP 500s? Guess I can't do this thing but maybe other parts of the app still work?


So essentially in neither situation did you get any information that changes what you'd do next. If something fails, you'll probably try working on another part and see if that works anyway. The automated status provided you with no extra actionable info.


Making an official status change in a large organization can be kind of a big deal. Sometimes phrasing needs to be run by legal or customer-facing people. There can be contract implications.

Of course they should try to update their status page in a timely manner, but it is frequently manual from what I’ve seen.


It's more a question of which of the (tens of) thousands of various different healthcheck pings that GitHub has undoubtedly implemented across their infrastructure should be used to determine the status page status?


make a healthcheck ping every x seconds that never ever gives a false positive. ever.

try that and you'll understand why they update the pages manually.


False positives are not that important; false negatives are more annoying, for the users at least...


github.com was loading fine for me from a dozen+ locations. It seemed like a problem localised to a small part of the internet.



Maybe the plumbing for updating the status page went down too


Right, but a lack of good signals should be regarded as a bad signal too.

The status page backend should actively probe the site, not just be told what to say and keep stale info around.


It probably did originally, until one time it showed down by mistake. Some manager somewhere said that's unacceptable, so it was changed to not happen.


maybe they used gitops


Issues like this are happening almost every 2 weeks. What has been happening to GitHub lately?


They are likely adding new features, like Copilot, and not investing enough in site reliability.

No changes - relatively easy to keep stable, as long as bugfixing is done.

Changes - new features = new bugs, new workloads.


Copilot has been out for over 2.5 years. They’re supposedly adding new features to “Copilot Next” but at this point copilot itself is pretty stable


If they add ipv6 support I’ll forgive them, but I lost hope a long time ago. It’s almost comical now.


Someone probably forgot to .gitignore node_modules


People who didn't jive with Microsoft management found new jobs...?


Sorry to be 'that guy', but it's "jibe."


Seems very pedantic considering that people have been saying jive since the 40s according to Merriam-Webster[1].

[1] https://www.merriam-webster.com/words-at-play/jive-jibe-gibe


Yes, but people haven't been using it incorrectly for long enough for it to be considered acceptable, by the very citation you've given:

> This does raise the question of why we don't enter this sense of jive, even though we have evidence of its use since the 1940s. [...] So far, neither jive nor gibe as substitutions for jibe has this kind of record [literally hundreds of years], but it seems possible that this use of jive will increase in the future, and if it does dictionaries will likely add it to the definition.


Apparently, many English speakers consider it to be acceptable, and have done so for more than half a century.


Lots of English speakers consider "could of" to be acceptable, and have similarly done so for a few decades now. That doesn't make them right ;-)


As a non-native English speaker, I didn't even know about jibe, while knowing about jive.


Hey, home', I can dig it. He ain't gonna lay no mo' big rap-up on you, man.

  [Subtitle: Yes, he is wrong for doing that]


Lay 'em down, and smack-em yack-em. COLD got to be!


Hey, you know what they say.


Not sorry enough, apparently.


An upvote to you, fellow pedant. We stand together.


Testing gpt4-ops?


Microsoft incompetence + DDoS ?


Microsoft.


Maybe putting all our open source in one place isn't a great idea >_>


I'm not sure that really changes anything other than at any one time wishing you were on the other side.

If you can have 1% of stuff down 100% of the time, or 100% of the stuff down 1% of the time, I think there's a preference we _feel_ is better, but I'm not sure one is actually more practical than the other.

Of course, people can always mirror things, but that's not really what this comment is about, since people can do that today if they feel like it.


whenever somebody posts the oversimplified “1% of things are down 100% of the time” form of distributed downtime, i take pride in knowing that this is exactly what we have at the physical layer today and the fact the poster isn’t aware every time their packets get re-routed shows that it works.

at a higher layer in the stack though, consider the well-established but mostly historic mailing-list patch flow: even when the list server goes down, i can still review and apply patches from my local inbox; i can still directly email my co-maintainers and collaborators. new patches are temporarily delayed, but retransmit logic is built in so that the user can still fire off the patch and go outside, rather than check back in every once in a while to see if it's up yet.


Now I'm thinking that the default GitHub flow is almost there.

> i can still review and apply patches from my local inbox

`git fetch` gets me all the code from open PRs. And comments are already in email. Now I'm wondering if I should put `git fetch` in a crontab.

> retransmit logic is built in so that the user can still fire off the patch and go outside

You can do that with a couple lines of bash, but I bet someone's already made a prettier script to retry an arbitrary non-interactive command like `git push`? This works best if your computer stays on while you go outside, but this is often the case even with a laptop, and even more so if you use a beefy remote server for development.
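
Something like this is roughly what I had in mind; untested, and the helper name, branch, and retry parameters are made up:

    # retry an arbitrary non-interactive command until it succeeds (or we give up)
    retry() {
      local attempts=0 max=30 delay=60
      until "$@"; do
        attempts=$((attempts + 1))
        if [ "$attempts" -ge "$max" ]; then
          echo "giving up after $max attempts" >&2
          return 1
        fi
        sleep "$delay"
      done
    }

    retry git push origin my-branch

    # And the crontab idea from above, fetching every hour:
    # 0 * * * * cd /path/to/repo && git fetch --all --prune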


The whole point of DVCS is that everyone who's run `git clone` has a full copy of the entire repo, and can do most of their work without talking to a central server.

Brief downtime really only affects the infrastructure surrounding the actual code. Workflows, issues, etc.


> Brief downtime really only affects the infrastructure surrounding the actual code. Workflows, issues, etc.

That's exactly the point. This infrastructure used to be supported by email which is also distributed and everyone has a complete copy of all of the data locally.

Github has been slowly trying to embrace, extend, and extinguish the distributed model.


You can enable all email notifications and respond to them without visiting the site. If you like to work like that, you still can.


The site is the thing that sends those email notifications, and receives your responses to them. So if GitHub is down, that won't work.

GP is talking about directly emailing patches around or just having discussions over email. Not intermediated through GitHub.


Yes, but this solves one part of the problem - accessible archive. Then you can revert to emailing people directly while the system is down.


Honestly I like it better. The entire industry pauses at the same time vs random people getting hit at random times. It is like when us-east-1 goes down. Everyone takes a break at the same time since we're all in the same boat, and we all have legitimate excuses to chill for a bit.


I've always wished we could all agree on something like "timeout Tuesdays" where everyone everywhere paused on new features and focused on cleaning something up.


except for the people maintaining us-east-1


Fortunately for you, those of us in power, telecommunications, healthcare, etc. don't have that luxury.


It's a great idea to put all your company code there, though: free breaks.


Distributed wasn't the main selling point of GitHub. When I joined it back in 2008, it was all about the social network, a place where devs meet.


Seems back up. I'd love to get a deep-dive into some of the recent outages and some reassurance that they're committed to stability over new features.

I talked to a CS person a couple months ago and they pretty much blamed the lack of stability on all the custom work they do for large customers. There's a TON of tech debt as a result basically.


Running an instance of GitHub Enterprise requires like 64GB of RAM. It's an enormous beast!


It doesn't have all the features of GH SaaS, unfortunately.



That could result in errors and features not working. Whole-site downtimes are entirely SRE problems, especially when static content like GH Pages goes down.

This is more likely a network routing issue or some other layer 4 or below screw-up. Most application changes would be rolling + canary released and rolled back pretty quickly if things go wrong.


This appears to impact Github pages as well. <username>.github.io pages show the unicorn 503 page.

> We're having a really bad day.

> The Unicorns have taken over. We're doing our best to get them under control and get GitHub back up and running.


Wow, I can't even load the status page. It looks like the whole web presence is down as well, I can't remember the last time it was all down like this.


Status page loads for me, it just incorrectly says all green: https://www.githubstatus.com/


Ahh, I was trying github.com/status and status.github.com (I forgot they have a totally separate domain for it). Thanks!


When you treat availability as a boolean value, we're gonna have a bad time.

Everyone wants a green/red status, but the world is all shades of yellow.


What are folks using to isolate themselves from these sorts of issues? Adding a cache for any read operations seems wise (and it also improves perf). Anyone successfully avoid impact and want to share?


In a previous life, for one org's "GitOps" setup, we mirrored GitLab onto AWS CodeCommit (we were an AWS shop) and used that as the SoT for automation.

That decision proved wise many times. I don't remember CodeCommit ever having any notable problems.
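
The mirroring side of it is not much more than this; repo names and region below are placeholders, and something (cron, a CI job) has to run the push periodically:

    # one-time setup: a bare mirror of the primary remote
    git clone --mirror git@gitlab.example.com:org/app.git
    cd app.git
    git remote add codecommit https://git-codecommit.us-east-1.amazonaws.com/v1/repos/app

    # run periodically to keep the CodeCommit copy in sync with all refs
    git push --mirror codecommit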

That said: if you're using GitHub in your actual dev processes (i.e. using it as a forge: using the issue tracker, PRs for reviews, etc), there's really no good way to isolate yourself as far as I know.


Previous job had a locally hosted Github Enterprise and I was always resentful when everybody else on Twitter was like "github down! going home early!". :(

Of course it still sucked when some tool decided I needed to update dependencies which all lived on regular Github, but at least our deployment stuff etc still worked.


DNS overrides during failure times and cloning those repos into GH Enterprise would be the next logical step, I guess.


Coffee break.


Use the local clone that I already have, given that `git` was always intended to be usable offline.


Yep. I've been using my git server to mirror any and all software that I find slightly interesting. Instead of starring repos, I make them available when GitHub goes down :D


https://www.githubstatus.com/ is still all green at the moment...


Status page just flipped:

Investigating - We are currently experiencing an outage of GitHub products and are investigating. Jun 29, 2023 - 17:52 UTC


All red, too. Don't think I've ever seen everything red on a status page before!


I keep telling them. There is at least one major incident/outage with GitHub every single month [0] and most of the time there is more than one incident.

You should have that sort of expectation with GitHub. How many more times do you need to realise that this service is unreliable?

I think we have given GitHub plenty of time to fix these issues and they haven't. So perhaps now is the perfect time to consider self-hosting as I said years ago. [1]

No more excuses this time.

[0] https://news.ycombinator.com/item?id=35967921

[1] https://news.ycombinator.com/item?id=22867803


@dang - I wanted to submit this as a link to github.com but couldn't figure out how to avoid the dupe filter. Can you change the link to https://github.com?


@ signs have no meaning at HN, and dang will not be notified of your concern. Per the guidelines:

> Please don't post on HN to ask or tell us something. Send it to hn@ycombinator.com.


That's where you just do a Tell HN: with no link then


Even Copilot seems to be affected.

There is something really bad going on.


Kind of funny that despite its users using a DVCS, a huge swath of developers can't VCS because of a single point of failure they've opted into.


I hear what you're saying, and probably a large number of developers are just shrugging and thinking, "let me know when it's back up".

But remember that a large part of what GitHub offers is not directly available in git, e.g. pull requests, issues, wiki, continuous xyz, etc. A lot of planning activities and "tell me what I need to do next" kinds of things are not tracked in git itself (of course).

So there's more to it than just the quip, "git is a distributed version control system". The whole value of github is more than just git commits.


> But remember that a large part of what GitHub offers is not directly available in git, e.g. pull requests [...]

Some of the earliest features and subcommands in Git were for generating and consuming patches sent and received via e-mail. Git can even send e-mail messages itself; see git-send-email(1). On open source mailing-lists when you see a long series of posts with subject prefixes like '[foo 1/7]', it's likely that series is being sent by the send-email subcommand, directly or indirectly.
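
The sending side of that flow is roughly this (list address and revision range are placeholders):

    # turn the local commits into numbered patch mails with a cover letter;
    # format-patch adds the '[PATCH n/m]' subject prefixes
    git format-patch --cover-letter -o outgoing/ origin/master..HEAD

    # mail the series to the list
    git send-email --to=dev@lists.example.org outgoing/*.patch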

While I've long known that Git has such capabilities, was originally designed around the LKML workflow, and that many traditionally managed open source projects employ that workflow on both ends (sender and receiver), I've never used this feature myself, even though I actually host my own e-mail as well as my own Git repositories.[1] In fact it was only the other day while reading the musl-libc mailing list when it clicked that multiple contributors had been pushing and discussing patches this way--specifically using the built-in subcommands as opposed to manually shuttling patches to and from their e-mail client--even though I've been subscribed to and following that mailing list for years.

The open source community has come to lean too heavily on Github and Gitlab-style, web-based pull request workflows. It's not good for the long-term health of open source as these workflows are tailor made for vendor lock-in, notwithstanding that Github and Gitlab haven't yet abused this potential. Issue ticket management is a legitimate sore point for self hosting open source projects. But Git's patch sharing capabilities are sophisticated and useful and can even be used over channels like IRC or ad hoc web forums, not simply via personal e-mail or group mailing-lists.

[1] Little known fact: you can host read-only Git repositories over HTTP statically, without any special server-side software. The git update-server-info subcommand generates auxiliary files in a bare repository that the git client automatically knows to look for when cloning over HTTP. While I use SSH to push into private Git repositories, each private Git repository has a post-receive hook that does `(cd "${M}" && git fetch && git --bare update-server-info)`, where '${M}' is a bare Git mirror[2] underneath the document root for the local HTTP server. (I would never run a git protocol daemon on my personal server; those and other niche application servers are security nightmares. But serving static files over HTTP is about as safe and foolproof as you can get.)
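
Spelled out as a complete hook plus the consumer side, it's just this (paths and hostname are examples):

    #!/bin/sh
    # post-receive hook on the SSH-accessible repo: refresh the bare mirror under
    # the web server's document root and regenerate the metadata that the "dumb"
    # HTTP protocol needs for cloning.
    M=/var/www/git/myproject.git
    (cd "$M" && git fetch && git --bare update-server-info)

    # Anyone can then clone with nothing but a static file server behind the URL:
    #   git clone https://example.com/git/myproject.git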

[2] See git clone --mirror (https://git-scm.com/docs/git-clone#Documentation/git-clone.t...)

EDIT: Regarding note #1, in principle one could implement a web-based Git repository browser that is implemented purely client-side. Using WASM one could probably quickly hack pre-existing server-side applications like gitweb to work this way, or at least make use of libgit2 for a low-level repository interface. If I could retire tomorrow, this is a project that would be at the top of my list.


Thank you for the thorough and enlightening reply. I should have revised my statement to be something more like, "... what github offers is not traditionally utilized in git", because you're right, git doesn't need github, even for workflow related things. It's just that not too many teams utilize a pure git solution, probably just because it's not visual enough perhaps. Thanks again for your thoughts.


Fossil (from sqlite ppl) has it.

[0] https://fossil-scm.org


I'd say the git part is doing its job exactly as intended. Everyone still has their local copies and can even keep working while the site is down. They are VCSing just fine.

Although you are right in that they would be VCSing even better if they were using email as originally envisioned.


IMHO, most users of git don't care about the D, they just want a VCS that does network operations faster (CVS and SVN are painful when your server is on the wrong continent) and/or supports a better fork/merge flow.

Centralized VCS makes a lot of sense in a corporate flow, and isn't awful for many projects. I haven't seen a lot of projects that really embrace the distributed nature of git.


I figure plenty of people wouldn't really mind if you left out the decentralized part specifically, but I assumed a bunch of nice things about git are there because they were basically required for a decent decentralized workflow.


Git was a lot better than svn when working on the train. Create some commits while riding to the job, and push them while in meetings


Github is more than a Git repository host. It provides other services for coordinating software development as well as a continuous integration service.


So outages, which are recurring more and more often, adversely affect not just the DVCS but also CI/CD and other things which used to be done with email or chat rooms. Where there were multiple options for each, we now have a bunch of people handicapped because the One True Way to integrate under MSFT's watchful gaze is lost... Basically, those additional services are a bit like the offer to chew up the meat for "avid" eaters.


I actually don’t know how they do continuous integration during the collaborative Kernel development process. I’d guess there would be some thrifty pop/imap Perl hacks involved on random machines all across the globe?

No clue at all, just some romantic fantasy I just concocted.


Greg Kroah-Hartman had a great interview on the Linux Foundation youtube channel about CI and other related development topics:

https://www.youtube.com/watch?v=YhDVC7-QgkI

One site that he mentioned was https://kernelci.org/ and the dashboard https://linux.kernelci.org/


Thanks for the resources; they look very interesting and I'll get into them right after I send out some patches via email ;)


Meh. We can all keep committing/branching/etc locally. Tons of other options to work around it to keep collaborating, too, but ramp-up time is likely to exceed the duration of the outage. Less "can't" than "isn't worth bothering".


Looks like it's back up.


Looks like it's back down.


Then it gets up again

You ain't ever gonna keep me down



Well, there’s one thing I know didn’t cause this: an IPv6 DDOS attack. GitHub, in 2023, is somehow still immune to all IPv6 attacks


Not quite true: github.github.io (aka github pages) has an AAAA record and should be fully accessible over IPv6.
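
Easy enough to check from a shell; the answers noted below are what I'd expect, not a guarantee:

    dig +short AAAA github.com         # empty: no IPv6 for the main site
    dig +short AAAA github.github.io   # returns IPv6 addresses for Pages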


I'm not even able to get a unicorn, which is the usual 500 response. Seems down pretty hard.



Loving my dependency cache about now.


it is not loading for me, so confirmed? Or unconfirmed? Not really sure.


You're not the only one. Getting an ERR_CONNECTION_TIMED_OUT in my browser


Can't access any of my company's repos via the website


Can't reach github.com to see the status of github.com..


Works for me?


Works fine for me too.


Now we just need to put Cloudflare in front of it to really double down on the centralize the internet thing.


Also having issues pushing...


My gosh, again


Friends, Git was designed just for this...


People don't go to GitHub for the Git

They go for the Pull Requests, Issues, and general collaboration and workflow tools.


I know. Kind of shows the deficiency on relying on centralized services for a decentralized version control system. There is no reason the pull requests, issues, and other workflow tools couldn't be decentralized too.


Are you making a Fossil reference?


can't push a commit


oops, github committed the wrong github


Life hack: rename your best project to "down". Collect all the views when people google "github down".


You might be onto something:

https://economictimes.indiatimes.com/thumb/msid-99511498,wid...

PS -- is there a better or more appropriate way to share images here? I know they're not really conducive to discussion, but given that this is a response to a joke comment I'm not sure...


[annotation: it's a restaurant named "Thai food near me"]


This is perfect and your link works, so it's good.

(Sometimes a link to an image doesn't work for various reasons, always good to check.)


this is quite excellent


I see a bit of an issue there.


"prs welcome" haha


"down" could be a good name for a python image library plugin/extension.

https://github.com/python-pillow/Pillow


Sadly, the name "down" was reserved for a trivial CLI tool to ping sites to "see if they are down".

https://pypi.org/project/down/


Step 2: ???

Step 3: Profit

Solving for step 2: Place google ads, because those pages are favored.


[flagged]


It's a DVCS. You can keep working from your local repo copy!


Can't read issues and PR comments though (^_^)


That's `git` :)

GitHub is also for looking up fixed issues and stuff.



