Edit: Cloudflare seems to be working now. Wow, that was really easy to set up.
From my perspective HN doesn't actually bring that much traffic, so I find it curious what software is so fragile that it falls over under HN front-page traffic.
I hope for more diversity, for many reasons. One of which is that they are a known man-in-the-middle that can see the plaintext of what appears to the user to be secure communication. They operate in the US, so they can be compelled, and compelled to do it silently, to use their position to spy on people.
That said we use them at work and I use them personally. I'm not doing anything "risky" at all. But there are many cases in history where suddenly the government changes and now you are. And apparently black people are always doing something "risky" in the US.
We'd have to basically quadruple our expenses without CF. Not to mention their available protections.
I vote for more diversity instead.
EDIT: I don't imply the product is bad, sorry
Can you be sure their product will not?
edit: MS in general, not GitHub specifically.
This is just clickbait. The "GitHub down" schadenfreude has really gotten tired.
Incidents: Before 89. After 126. What is the chance of this happening if the 'rate' of occurrence has not changed?
Assuming an unknown but constant Poisson rate, we get the probability of observing what has been observed to be 0.00225.
A fortuitous thing about this test is that one does not need to know what that unknown constant rate is.
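For concreteness, here is one standard way to carry out such a test: condition on the total count, which turns it into a binomial question and makes the unknown rate drop out. A rough Python sketch, assuming equal-length before/after windows (different window lengths would change the exact number):

    # One standard version of this test: under a constant Poisson rate, and
    # conditional on the total incident count, the "after" count is
    # Binomial(total, p) with p = after-window / total-window, so the unknown
    # rate cancels out. Equal windows (p = 0.5) are assumed here.
    from scipy.stats import binom

    before, after = 89, 126
    total = before + after
    # One-sided: probability of seeing 126 or more "after" incidents by chance.
    p_value = binom.sf(after - 1, total, 0.5)
    print(p_value)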
That feels like a mighty big assumption. Probably big enough that trying to calculate the probability is more misleading than enlightening.
As long as the incidents are spaced out enough that the chance of one incident affecting another is low, Poisson can be surprisingly realistic. Quite remarkable, given how simple it is. All in all, not a bad assumption for a back-of-the-envelope calculation in a meeting.
In practice, however, given more time, I would be looking at the statistics of inter-incident times more carefully. If those look sufficiently different from Exponentially distributed, a non-Poisson renewal process might be more appropriate than a Poisson process.
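Roughly the kind of check I mean, sketched in Python; the incident timestamps below are placeholders, and the real ones would come from the status-page history:

    # Do the gaps between incidents look exponential? If not, a plain Poisson
    # process is suspect and a more general renewal process may fit better.
    import numpy as np
    from scipy import stats

    # Placeholder incident times in days; substitute the real history here.
    incident_times = np.sort(np.array([0.0, 3.1, 9.4, 10.2, 18.7, 25.0, 31.5]))
    gaps = np.diff(incident_times)

    # KS test against an exponential with the observed mean gap. Estimating the
    # rate from the same data makes this approximate; a QQ plot is a good
    # visual companion.
    d, p = stats.kstest(gaps, 'expon', args=(0, gaps.mean()))
    print(d, p)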
Strictly speaking, when examined with a fine-toothed comb, yes, the assumptions are very likely wrong. All models are wrong, but some of them are useful.
The question is whether we can get useful conclusions from such a simple model. In my experience I have been surprised by how often low failure rates are captured well by Poisson processes. Yes, the assumptions could be wrong, but are they very likely to lead to wrong conclusions? Empirical experience and the math say otherwise.
There are sound reasons for why this happens. If you are interested, you can pick that up from Feller. These links might also help.
Given the data that we have, it's a plenty good first cut, but that's what it is -- a first cut. With more data one can do a more refined analysis.
What resource would you recommend to get an intuitive grasp of statistics?
To give you an idea about what kind of resource (book) I'm looking for: I'm currently reading Elements of Statistical Learning and I enjoy that it has all the mathematical rigour I need to really understand why all of it works, but also that it's heavy on commentary and pictures, which helps me to understand the math quicker. Counterexamples: Baby Rudin on one side of the spectrum, The Hundred-Page Machine Learning Book on the other.
Books like ESL are front-end books; they cover the shiny stuff and the methods. Feller is more of the back end.
I'm already looking forward to it.
It seems to me that it's considered a classic, although I've never heard of it (probably due to my ignorance). Do you have any more such nice recommendations up your sleeve? Don't limit yourself to probability, I'm looking for some reading for the summer :-D
You said you are familiar with linear algebra. The logical next stop could be Hilbert spaces. It looks at functions as vectors and analyzes their properties using linear-algebraic tools that work even in infinite-dimensional spaces. This sees quite heavy use in traditional machine learning. Before diving into Hilbert spaces proper, you could revisit linear algebra in Halmos' "Finite-Dimensional Vector Spaces": there he pretends to teach you linear algebra but actually teaches you about Hilbert spaces -- in other words, he teaches you linear algebra, but without leaning on the restriction to finite dimensions.
And you are right, books are so damn expensive. India is somewhat better off in the sense that we have 'low price editions': same content, but printed on lower-quality paper. Not the prettiest things, but very student-friendly. Note these are legit printings, not pirated copies.
But I still want to nitpick the details a bit. If you want to determine whether there was a change in the failure rate, you need to use rate statistics -- failures per service-hour. Your analysis uses only the numerator, while we know the denominator (the number of GitHub services that can go down) has increased over time -- GitHub Actions and Packages are relatively new.
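Toy numbers to make the point concrete -- everything below except the 89/126 incident counts is made up, since the actual service counts over time aren't in the data:

    # Compare incidents per service-month rather than raw incident counts.
    # Service counts and window lengths here are hypothetical.
    incidents_before, incidents_after = 89, 126
    services_before, services_after = 8, 12    # hypothetical averages
    months_before, months_after = 24, 24       # hypothetical equal windows

    rate_before = incidents_before / (services_before * months_before)
    rate_after = incidents_after / (services_after * months_after)
    # A higher raw count can still be a lower (or equal) per-service rate.
    print(rate_before, rate_after)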
More or less downtime (as reported by a status page) is probably affected more by changes to policies re: how / when incidents are posted publicly between "Github" and "Microsoft Github".
Subjectively speaking (using Github daily) I haven't noticed a difference. In general, Github has never been extremely reliable even pre-Microsoft.
shameless and useless advertising.
I've heard that for a number of years the Product org at GitHub essentially considered the product "finished", and that it did not need more features. Things like the rise of GitLab and the "Dear GitHub" letter, plus I believe a change of leadership, helped that completely turn around. Obviously that took time to yield benefits, so I think they've been in a much better place for quite a few years now.
Except for Atom :(
> [..] GitHub has been down more since the acquisition by Microsoft. But that could be all a part of coordinated effort to be more transparent about their service status, an effort that should be applauded.
Github pre-Microsoft was rudderless. They were more prone to implementing silly 3d model diff tools and things instead of supporting enterprise features or building powerful CI/CD tools and automation.
New Github is on the right track.
- Round Avatars (all catgirl ears are now cut off)
- Collapsing of some similar messages on the Dashboard into one (i missed some things because of this)
- Not just one, but several redesigns (im old)
- The whole "marketplace" functionality (imitation of an app store)
- The "explore" and "trending" functionality (i see this like a Facebook feed)
Not sure about others, but i used GitHub as a git host and issue/MR tracker. All the other stuff is just distraction in my eyes.
I agree. I'll add one: README files with tables now require horizontal scrolling, and that's utterly disappointing.
> The "explore" and "trending" functionality
I do work in ML and computer vision, and this is how I discover all of the new models and code people are using. It's awesome.
> Not sure about others, but i used GitHub as a git host and issue/MR tracker. All the other stuff is just distraction in my eyes.
You're missing out! Deploying code is so easy once you're using CI/CD. Github actions are powerful.
Testing something 100% has diminishing returns.
Therefore: you will never be able to prevent all issues. And since you are not aware of how many changes are happening, you don't have a value that could indicate how healthy it is.
I personally think that even as I'm getting older and better at my job, I still make errors, though their number and severity are going down.
But well, no QA effort is complete enough to really assure the quality of a large system. Thus, the GP holds true.
Don't know if that has anything to do with the Microsoft acquisition, but it is concerning.
They have paying customers that are being inconvenienced by this. We lost time over this multiple times in the last few weeks.
So, not good. IMHO they are having some major issues with their release process that they need to address. Standards have slipped there; they used to be better at this.
(of course I'm working on other stuff in the meantime, but it splits focus unnecessarily)
I wonder if COVID has affected this somehow. Anecdotally I've heard of at least one other ~peer company with a large rise in incidents/outages since April.
Strange since both companies previously had a strong culture of remote work (maybe 1/4 to 1/3 of eng was remote) going into the pandemic, so I'd be quite surprised if all-WFH contributed somehow...
No, MS prioritising Teams land grabbery is what caused it.
And generally speaking Github has been 100x more active post Microsoft acquisition. So I am not surprised at the downtime.
And I will gladly trade another few hours, if not more, of downtime if they could just roll back the side panel design.
update: the website is back and well now from my end.
We have GitHub Enterprise, and Slack notifications when GitHub has any issues. Nearly every week there's a problem; sometimes it's resolved in a few minutes, other times it goes on for an hour or so. I've pondered the question of whether there have been more outages since the MS acquisition, and in my experience that's a hard yes.
My point is that in big, complex systems there isn't always a straight line between cause and effect. Sometimes there's just an effect, and you need to work out the cause.
Unless your deploy reconfigures some networking component that makes a large part of your network inaccessible. Then you need to fix the network issue before you can rollback to a previous version. That may require someone driving up to a datacenter and logging into a racked server.
And then you may need to restore data if the network misconfiguration caused data to be corrupted somehow (I admit this is getting a bit worst-possible-case-scenario) and, if the data got crossed - that one client could see data from another - you'll need to prevent access until you are sure everything is where it should be.
Finally, depending on your scale, the deploy of a new version can take a long time by itself. People often deploy new features deactivated, then, when the whole fleet is updated, activate features to different groups and monitor for breaking behavior change.
Even better - then they shouldn't even need to make another deploy, just flip the feature flag back off. And if you need to make network changes, then test those out behind a load balancer in parallel to the existing topology, so you can start routing more traffic to the new setup, but can stop doing so if any problems arise. I'm not saying any of this is trivial, but the point is, best practices exist to start deploying pretty much any kind of change in a way that can be undone in minutes or even seconds. When you have access to the resources and talent that Github has, then there's zero acceptable reasons why your site would ever be down or degraded for hours on end - zero.
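At its core a feature flag is nothing fancier than this -- a rough sketch with made-up names, not anyone's actual implementation:

    # Both code paths ship in the same deploy; a config flag picks which runs.
    FLAGS = {"new_sidebar": True}  # in practice this lives in a config service

    def render_old_sidebar(user):
        return f"old sidebar for {user}"   # known-good path stays available

    def render_new_sidebar(user):
        return f"new sidebar for {user}"   # the risky new path

    def render_sidebar(user):
        if FLAGS.get("new_sidebar", False):
            return render_new_sidebar(user)
        return render_old_sidebar(user)

    # "Rollback" is a config flip, not another deploy:
    FLAGS["new_sidebar"] = False
    print(render_sidebar("alice"))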
Have you ever encountered a write rate that exceeds your db replica's ability to keep up with async replication? There's nothing to "roll back" in this case, and it takes time to determine whether the increase in write rate is from legitimate usage growth vs some recent feature (possibly deployed hours/days ago) writing more than expected during peak periods vs DDOS/bot activity.
Have you worked on multi-region infrastructure, where traffic is actively served from multiple geographic regions, with fully automated failover during regional outages? It is impossible to fully automate every possible situation -- even Google and Facebook have outages sometimes! Even just as a first step, it's hard to figure out conclusively which situations should be automated vs which ones need to alert humans.
Have you ever implemented read-after-write consistency for multi-region infrastructure, where multiple async DB replicas, caches, and backend file stores are not automatically in sync, but need to appear in sync to users making writes from non-master regions? The network latency between regions is sufficient to make this a complicated problem even when things are stable, let alone when there's other sources of replication lag to consider. There's no "out of the box" solution for this; every company needs to handle it in a way specific to their infrastructure and product.
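(The usual first cut is to pin a user's reads to the primary region for a short window after they write, but even that gets hairy across caches and file stores. A rough sketch, with illustrative names and timings that have nothing to do with GitHub's actual design:)

    # First-cut read-after-write routing: after a user writes, send their reads
    # to the primary region until async replication has likely caught up.
    # The 10-second window and names are illustrative only.
    import time

    REPLICATION_SLACK_SECONDS = 10.0
    last_write_at = {}  # user id -> monotonic time of their last write

    def record_write(user_id):
        last_write_at[user_id] = time.monotonic()

    def choose_read_target(user_id):
        wrote_recently = (
            user_id in last_write_at
            and time.monotonic() - last_write_at[user_id] < REPLICATION_SLACK_SECONDS
        )
        return "primary-region-db" if wrote_recently else "local-replica-db"

    record_write("alice")
    print(choose_read_target("alice"))  # -> primary-region-db
    print(choose_read_target("bob"))    # -> local-replica-db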
Have you ever implemented a realistic dev/test environment for a massive infrastructure involving dozens to hundreds of services, and many different data stores, some of which are sharded? Again, no "out of the box" solution exists. You need to do something custom, and there will be plenty of cases where it doesn't accurately mirror production.
Or for a non-technical one: have you ever worked for a medium-to-large size company whose exit was via acquisition, rather than IPO? In my experience this always results in a major increase in attrition of the acquired company's top engineers. With an IPO, early folks are more incentivized to stay on; there's a better feeling of ownership, and the efforts of good talent can directly impact the stock price. But when it's an acquisition by some corporate behemoth, the opposite dynamic is at play: there's very little that the acquired company can do to impact the parent company's stock price, leading to a feeling of helplessness. Couple that with different policies and values mindset (say, a contract with a government agency that puts children in cages) and you can guess what happens.
For example, on June 22nd they had an issue whereby half of their nameservers were responding with an empty answer for queries for github.com. A very nice explanation is here: https://news.ycombinator.com/item?id=23605409. So for roughly half of users (including myself) this would have manifested itself as a complete outage. It also lasted for a good couple of hours. Yet on their status page it's listed as a 46 minute degradation only.
So relying on their status page reporting to draw conclusions about availability (as Statusgator seems to) will mean that an overly optimistic picture of availability is presented.
For binary artifacts you can get around dependence on a single provider by mirroring, but because git is mutable you can't just mirror the repo and allow changes there: you would need the same permissions, SSH keys, etc. for the repositories, and changes would need to be synced back to the source repository. You might as well not use Github in the first place and self-host (with all the problems that entails).
As far as I know, none of the major git providers offer this - I've experienced outages with GitLab, Bitbucket and GitHub that all affected the production environment (luckily, it's never been critical so far).
Bitbucket is the worst in this regard.
Part of it might be due to covid, but it started happening before that.
On the positive side, GitHub has had a ton of changes since then, all of which seem to be good ones, so it's understandable that it has more problems now as well.
However, I have no hard data to suggest that, only my experience as someone who's had to manage and maintain reports like this.
Mods, can we merge them?
And the comment that is currently at/near the top and mostly discussed happens to link to exactly this site.
Disclaimer: I don’t work at MS so no clue, but have been part of acquisitions
The last time it was down was just 6 days ago:
I think it's time to look for alternatives or, frankly, in the long term, follow what some open-source orgs are doing and self-host instead.
This has bumped self-hosting all my repositories much higher up my list.