GitHub availability report: October 2022 (github.blog)
49 points by edmorley on Nov 2, 2022 | 39 comments



Of all the many SaaS vendors I use, GitHub has the worst availability by far.

There isn't a month that goes by without our devs being impacted.

GitHub - please just work on fixing this. Your product is great but your availability is your biggest problem. It's beyond a joke at this point.


GitHub’s reputation for reliability really did a 180 after the Microsoft acquisition, as some predicted. It’s strange to me that, despite this, you still get vociferous argument at every step on the way to blaming Microsoft. People don’t remember GitHub’s past reputation for excellent reliability, then they accuse you of rose-tinted glasses, then they say we’re just noticing it more now, then they say GitHub’s complexity changed significantly at a time that just happened to coincide with the acquisition, then they say better reliability is impossible. No, man: Microsoft acquired it, and when they got around to transitioning it to their infra, reliability plummeted.


I know. They have been very unreliable for years, as I predicted here [0], and you can see all the times it has gone down or had intermittent issues [1]. I'm not really surprised to see GitHub become less reliable than someone self-hosting a typical Git server.

This is why it makes no sense going 'all in' on GitHub services.

[0] https://news.ycombinator.com/item?id=22867803

[1] https://news.ycombinator.com/item?id=32752965


I absolutely second this. Their APIs are incredibly flaky, especially with Actions, and it's extremely annoying that they don't report their _real_ availability. I can't remember the last time an actual outage showed up as any form of service degradation on their customer-facing monitoring.


> Of all the many SaaS vendors I use, GitHub has the worst availability by far.

Have you ever tried using GitLab?


Thankfully the web app was relatively stable last month compared to September: https://github.onlineornot.com/


> There isn't a month that goes by without our devs being impacted.

+1 -- same experience here for a medium-size (60-person) eng team.


>Attempting to retry these failed jobs tied up our worker and it was unable to process new incoming events, resulting in a severe backlog in our queues.

Interesting - we have this kind of thing quite often. Basically, an event gets stuck in the queue due to a logic error or a prior race condition, and it's endlessly retried, blocking the rest of the events from being processed. We can't just automatically remove such an event from the queue because events must be processed in order or client data can get corrupted. It requires manual intervention (we have alerts in place), and every time it's a new event, so we have to be creative and think quickly about how to unblock the queue without corrupting client data by skipping events. After an event is unstuck, there's a huge queue of unprocessed events, which can take up to a few hours to drain in the worst cases.

Fortunately we have some sharding in place, so several independent workers can process the same global queue - with worker shard affinity we can process shard data in order AND in parallel, and SRE can temporarily increase the number of workers when the queue gets too large to speed things up. I still don't know how to solve this kind of problem once and for all (i.e. with zero manual intervention). Is it even solvable?
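
Roughly, the shard-affinity part looks like the Go sketch below (Event, ClientID and shardOf are made-up names for illustration, not our actual code): one worker goroutine per shard keeps events for a given client in order, while shards run in parallel.

    package main

    import (
        "fmt"
        "hash/fnv"
        "sync"
    )

    // Event is a made-up stand-in for a queued event; ClientID is the shard key,
    // so all events for one client land on the same worker and stay ordered.
    type Event struct {
        ClientID string
        Payload  string
    }

    func shardOf(clientID string, shards int) int {
        h := fnv.New32a()
        h.Write([]byte(clientID))
        return int(h.Sum32() % uint32(shards))
    }

    func main() {
        const numShards = 4
        queues := make([]chan Event, numShards)
        var wg sync.WaitGroup

        for i := range queues {
            queues[i] = make(chan Event, 100)
            wg.Add(1)
            go func(id int, q <-chan Event) { // one worker per shard
                defer wg.Done()
                for ev := range q {
                    // within a shard, events arrive and are handled in order
                    fmt.Printf("shard %d: %s %s\n", id, ev.ClientID, ev.Payload)
                }
            }(i, queues[i])
        }

        events := []Event{{"acme", "created"}, {"acme", "updated"}, {"globex", "created"}}
        for _, ev := range events {
            queues[shardOf(ev.ClientID, numShards)] <- ev // same client, same shard
        }
        for _, q := range queues {
            close(q)
        }
        wg.Wait()
    }

Scaling up then mostly means adding shards or spreading shard ownership across more workers, which is what SRE ends up doing by hand when the backlog grows.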


It sounds like you've identified the issue yourself. You are relying on ordering when processing events. You need to either loosen that requirement or do better testing to prevent head-of-line blocking.

I don't know much about your application, but the fact that you can mitigate the problem by scaling the number of workers suggests that the ordering requirements might actually be fairly weak. As a worst-case option you may be able to push all events that depend on the failing one to a DLQ using a temporary blacklisting mechanism, but by that stage I think I would just prefer better testing.
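
Something along these lines, as a rough Go sketch (Event, process, and the per-client blacklist are hypothetical stand-ins, not anything GitHub-specific): once an event for a client exhausts its retries, the client is blacklisted and its later, order-dependent events follow it to the DLQ, so unrelated clients keep flowing.

    package main

    import (
        "errors"
        "fmt"
    )

    // Event is a hypothetical queued event; ClientID is the ordering key.
    type Event struct {
        ClientID string
        Payload  string
    }

    const maxRetries = 3

    // run processes events in order, retrying failures a bounded number of times.
    // When an event keeps failing, its client is blacklisted and all later events
    // for that client are diverted to the DLQ (still in order) instead of
    // blocking the whole queue.
    func run(events []Event, process func(Event) error) (dlq []Event) {
        blacklisted := map[string]bool{}
        for _, ev := range events {
            if blacklisted[ev.ClientID] {
                dlq = append(dlq, ev)
                continue
            }
            var err error
            for attempt := 0; attempt < maxRetries; attempt++ {
                if err = process(ev); err == nil {
                    break
                }
            }
            if err != nil {
                blacklisted[ev.ClientID] = true
                dlq = append(dlq, ev)
            }
        }
        return dlq
    }

    func main() {
        process := func(ev Event) error {
            if ev.Payload == "poison" {
                return errors.New("data needed for the payload was deleted")
            }
            return nil
        }
        events := []Event{
            {"acme", "ok"}, {"acme", "poison"}, {"acme", "ok"}, // acme gets stuck, then parked
            {"globex", "ok"}, // unrelated client keeps flowing
        }
        for _, ev := range run(events, process) {
            fmt.Println("DLQ:", ev.ClientID, ev.Payload)
        }
    }

The parked events can then be replayed per client once someone fixes the underlying data, which keeps the manual intervention off the hot path.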


Sounds like a DAG-based task orchestrator could be a good fit, where tasks declare their dependencies and are allowed to run only once those have all completed.
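
A toy Go sketch of that idea (the Task type and runDAG are invented for illustration; real orchestrators add persistence, retries, and parallelism): each task declares its dependencies and only becomes runnable once all of them are done.

    package main

    import "fmt"

    // Task is a hypothetical unit of work with explicit dependencies.
    type Task struct {
        Name string
        Deps []string
        Run  func()
    }

    // runDAG repeatedly runs every task whose dependencies have all completed,
    // until everything is done (or nothing can make progress, i.e. a cycle).
    func runDAG(tasks []Task) {
        done := map[string]bool{}
        for len(done) < len(tasks) {
            progressed := false
            for _, t := range tasks {
                if done[t.Name] {
                    continue
                }
                ready := true
                for _, d := range t.Deps {
                    if !done[d] {
                        ready = false
                        break
                    }
                }
                if ready {
                    t.Run()
                    done[t.Name] = true
                    progressed = true
                }
            }
            if !progressed {
                panic("dependency cycle or missing task")
            }
        }
    }

    func main() {
        runDAG([]Task{
            {Name: "send-webhook", Deps: []string{"build-payload"}, Run: func() { fmt.Println("send-webhook") }},
            {Name: "build-payload", Run: func() { fmt.Println("build-payload") }},
        })
    }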


It's amazing how many people here are able to precisely diagnose what GitHub should 'just' do, especially without access to their code base or experience working at their scale.


I'm not sure about the technical details behind their outages, since they're a little vague on that, but it's funny how every Rails developer points to GitHub being a Ruby on Rails shop as a reason Rails should live on, when GitHub's availability is some of the worst in the tech scene. Lazy evaluation is great until it's not.


You're not sure about the technical details, but it's clearly the technology's fault?


What, do you know anything about Rails?


All the incidents seem to be platform-agnostic:

- improper database validation

- older component not tested against configuration change

- uncontrolled automation DoS

- incompletely distributed secrets


Isn't it GitHub Actions that has been having the availability problems for a while now? It started having problems after Microsoft acquired GitHub, and there has been speculation that it's because it was ported to .NET and Azure. Is Codespaces built with Rails?


It tends to attract a lot of abuse from people trying to run crypto miners on free GitHub Actions accounts. Azure has been more stable the past few years, and .NET wouldn't have any issues with stability at scale. Likely it's just a hard problem to solve at the scale they run at.


IDK what the reason is; all I know is that it's been down so much we've blacklisted it for anything of importance and migrated projects that were already using it to something else.


Trust me, nothing has been ported to .NET; nothing would be gained from such a move. That's not how Microsoft works. Source: I work at Microsoft.


Actions wasn't a thing until after GitHub was acquired, but I don't know if they used something else in the early days of Actions.


Maybe start moving away from Ruby on Rails. Good web stacks in Golang and Rust do exist and at that scale they're likely the only sensible choices.

How far must the sunk cost fallacy go before something is done?


Which Golang or Rust web stacks are as good as Rails?


I'm not saying they are as good -- I'm saying that they are good.

Meaning that at some point the extra programmer difficulty is worth it if your everyday web stack can't keep up.


Oh I see. It’s difficult to justify replacing Rails, which has been used successfully for many years, with something not as good (worse?).


Rails is convenient and intuitive; I don't think anyone reasonable is disputing that.

My point is that if the stack regularly falls over, then programmer convenience has to be sacrificed in favor of a stable and mega-fast alternative that requires more programmer energy.

I love working with dynamic languages. I can prototype almost anything I want to do in hours. But I also recognized the need for a hardcore stack on a previous contract and went the long and painful route with Rust.

Result: the project has been running for 7 months now, has only been restarted 4 times for updates (re-deployments), has never crashed once, handles 5000+ network connections, and streams data from them 24/7.

Peak CPU usage on a 4-core VPS: 27%.

Peak memory usage: 180MB. Normal average memory usage: 80MB.

Right tool for the job.


Why do you assume the outages are language-related and not due to a complex product having bugs? How does Rust prevent bad schema changes or missing data in the DB?


Because I worked with Rails for 6.5 years. Outages beyond smaller scales were at best a weekly occurrence.

Obviously I can't know for sure but it's not an uninformed assumption.


You know you can just click the post title; that opens the linked report, where you can read the detailed cause of every outage they had that month.

If you do, you'll see that none of them are close to what you describe.

Also, have you considered that if you had weekly outages while billion-dollar companies continue to stick with Rails, maybe you were the problem?


I did read the article. One of the incidents was about their webhook worker(s) being swamped -- plus errors because data needed to process the events had been deleted from the DB. So I'd count that one as a slow endpoint attributable to Ruby on Rails (and it's famous for that).

And even if none of their incidents alluded to performance problems with Rails, I've still worked with it a lot and I know for a fact that it's a factor.

Your snark doesn't change reality, but you are free to pretend otherwise; fine with me.

> Also, have you considered that if you had weekly outages while billion-dollar companies continue to stick with Rails, maybe you were the problem?

Indeed, a programmer without the executive power to influence the choice of deployment tech and server (Puma at the time) is surely the problem. Especially after doing a study demonstrating the problems, calculating how much programmer time was wasted on these matters every week, and still getting ignored. Perhaps I am the problem indeed!


> their webhook worker(s) being swamped

That's a capacity problem caused by a logic bug. Nothing stack-specific. If you throw more work at a system than it is designed to handle, you'll hit a bottleneck.

> Your snark doesn't change reality

What reality? You are just barking your uneducated opinion. No one who has ever worked on a service anywhere close to GitHub's scale (regardless of the stack) would make such statements.


> However, many of these events caused exceptions in our webhook delivery worker because data needed to generate their webhook payloads had been deleted from the database. Attempting to retry these failed jobs tied up our worker and it was unable to process new incoming events, resulting in a severe backlog in our queues.

I bet you I could cause this bug on a Rust product if you let me near the code ;)


Oh, absolutely. It can happen anywhere -- in theory.

In practice, however, I've found people working with certain languages and stacks to be more thorough. It still largely depends on the person in the important position, though; that much is always true.


Rails is a huge problem, but the most mature libraries/frameworks in Go/Rust are all micro-frameworks, which isn't much of a replacement. Maybe some .NET frameworks would be a better choice.


Well, if stability plus performance are the main requirements, then I'd disqualify everything except Rust.

I'd personally do it in Elixir, though, but again: speed. GitHub is huge and should rise to the challenge.


It’s quite the opposite: Rails is a big enabler.


The only thing Rails enables is getting started quickly. Maintenance is a complete nightmare. I have no problem refactoring large Scala apps or F# apps or even Rust apps... but that one fucking Rails app in my portfolio is the absolute worst. It's still stuck on Rails 4 because the upgrade to 4 took two whole weeks of my life away from me. Never again. I'd rather rewrite from scratch than do that again.


Only in the first project phases. After that, teams start fighting over which MV-ABCXYZ abstraction to use. :)

I've seen good and productive Rails teams, but they had to deliberately keep themselves from certain practices, otherwise they ran into problems. It's a long topic, though, and people get very emotional and preachy defending Rails, so it's a fruitless discussion 99.9% of the time.

In the end, use what you feel works best for you and your team. Objective differences in programmer productivity, machine speed, iteration speed and other metrics do exist, though, and it's very tiring to see people constantly pretend otherwise.


Because you like Go or Rust better? Makes sense… muhahaha


I don't like either very much. But I've worked with them extensively, and they are a better fit when you want to squeeze more out of your resources and get more stability (the latter depends on certain details, but it's certainly easier to achieve than with Rails).



