
The GitHub Availability Report - mountainview
https://github.blog/2020-07-08-introducing-the-github-availability-report/
======
jsnell
> We strive to engineer systems that are highly available and fault-tolerant
> and we expect that most of these monthly updates will recap periods of time
> where GitHub was >99% available.

So that's them striving for two nines most of the time, which seems to be a
fancy way of saying one nine. Be careful you're not setting the bar too high!

~~~
kristianc
That doesn’t seem to be what it’s saying. Wording is unclear, but it seems to
be saying that _for the incidents recounted in these updates_, most of the
time the GitHub service itself will be 99% available. That’s not the same as
saying they only strive to make GitHub 99% available at all times.

~~~
jsnell
Hm. Not sure I see that. The updates are monthly, so the period of time
they're recapping will be the last month. And they're expecting that on "most"
recapped periods the availability was >99%. I.e. they also expect to have
months where availability is <99%.

------
KingOfCoders
Three out of four times MySQL was involved.

~~~
bob1029
I think the hosted SQL model may be fundamentally flawed when you are trying
to do something as complex as what GitHub is doing. You are basically saying
"Here, MySQL/Postgres/Oracle/et al., I trust you with 100% of my replication &
persistence logic. Good luck optimizing all that SQL!".

I have started looking at tying together clusters of business apps that each
have independent SQLite datastores using an application-level protocol for
replication & persistence (over public HTTPS). This allows for really flexible
schemes which can vary based upon the entity being transacted. With hosted SQL
offerings, you are typically stuck with a fairly chunky grain for replication
& transactions. If you DIY, you can control everything down to the most
specific detail.

For instance, when going to persist a business object's changes, you can have
a policy configured that reflects over the type and determines how many nodes
need to be replicated to and which ones should replicate synchronously vs
asynchronously based upon it. Log entries may only be async to 2 nodes, but
you may decide that any accounting transactions should be replicated to 2 near
nodes synchronously, with 2 more far nodes being async.

Additionally, you could inspect various business facts in the entity (or
session related to the transaction) in order to determine ideal persistence
strategy. One powerful example here could be to use a user's zip code to
determine the geographically-ideal node to store the primary replica of their
account & profile data. This could allow for lower latency access for that
specific user.
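
A rough sketch of the kind of policy described above (all of the names, node
types, and policies below are hypothetical stand-ins, and the actual HTTPS
transport is stubbed out with prints):

    from dataclasses import dataclass

    class Node:
        """Stand-in for one app instance with its own SQLite datastore."""
        def __init__(self, name, region_prefix):
            self.name = name
            self.region_prefix = region_prefix

        def distance_to(self, zip_code):
            # Placeholder distance metric: same zip prefix = near, else far.
            if zip_code and zip_code.startswith(self.region_prefix):
                return 0
            return 1

        def write_sync(self, entity):
            # A real implementation would POST over HTTPS and wait for an ack.
            print(f"sync write to {self.name}: {entity}")

        def write_async(self, entity):
            # A real implementation would enqueue a background POST.
            print(f"async write to {self.name}: {entity}")

    @dataclass
    class ReplicationPolicy:
        sync_replicas: int   # nodes that must acknowledge before we succeed
        async_replicas: int  # nodes replicated to in the background

    # Per-entity-type policies matching the examples above.
    POLICIES = {
        "LogEntry": ReplicationPolicy(sync_replicas=0, async_replicas=2),
        "AccountingTransaction": ReplicationPolicy(sync_replicas=2,
                                                   async_replicas=2),
    }

    def persist(entity_type, entity, nodes):
        policy = POLICIES.get(entity_type, ReplicationPolicy(1, 1))
        # Prefer nodes nearest the user's zip code so the primary replica
        # lands geographically close to them.
        ordered = sorted(nodes, key=lambda n: n.distance_to(entity.get("zip")))
        for node in ordered[:policy.sync_replicas]:
            node.write_sync(entity)
        start = policy.sync_replicas
        for node in ordered[start:start + policy.async_replicas]:
            node.write_async(entity)

    persist("AccountingTransaction", {"zip": "94107", "amount": 42},
            [Node("a", "94"), Node("b", "94"), Node("c", "10"), Node("d", "60")])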

~~~
hn_throwaway_99
> I have started looking at tying together clusters of business apps that each
> have independent SQLite datastores using an application-level protocol for
> replication & persistence (over public HTTPS). This allows for really
> flexible schemes which can vary based upon the entity being transacted. With
> hosted SQL offerings, you are typically stuck with a fairly chunky grain for
> replication & transactions. If you DIY, you can control everything down to
> the most specific detail.

Ugh, I'm totally going to sympathize with whoever eventually joins your
company after you've left who has to own this.

~~~
bob1029
I don't think I have provided enough contextual details in order for this type
of conclusion to be reached.

I'd be happy to offer a hypothetical if you are interested in debating the
merits of my approach.

------
TekMol
I don't understand the first one:

    a shared database table’s auto-incrementing ID column exceeded the size
    that can be represented by the MySQL Integer type

Followed by:

    GitHub’s monitoring systems currently alert when tables hit 70% of the
    primary key size

So why was there no alert?

~~~
Doxin
More importantly, why is their ID type small enough that this is even a
possibility?

~~~
hvidgaard
The "default" is a 32 bit int, large enough for almost 4.3 billion records.
That is enough for the vast majority of tables, and going to a 64 bit int has
some performance implications, which at GitHubs scale is absolutely something
they need to take into consideration, just as they should have with the table
size.
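
For reference, MySQL's INT is 4 bytes and BIGINT is 8; a quick bit of Python
arithmetic shows the ranges in play:

    print(2**31 - 1)  # signed INT max:      2,147,483,647
    print(2**32 - 1)  # unsigned INT max:    4,294,967,295 (the ~4.3 billion)
    print(2**63 - 1)  # signed BIGINT max:   9,223,372,036,854,775,807
    print(2**64 - 1)  # unsigned BIGINT max: 18,446,744,073,709,551,615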

------
aetherspawn
Yes, a lot of downtime for sure, but each of these is quite an unexpected
edge case. I wouldn't think that many of these issues will recur now that they
have added regression tests and processes for each.

~~~
rsa25519
Yep. And several of them seem like issues that would be very difficult to
reproduce outside of production (e.g. overflowing a primary key), so it makes
sense that they were not caught earlier.

~~~
dx034
It's actually an error I've seen multiple times in the past, and I'm not even
working with databases full time. I do think it's surprising that no one had
thought of at least implementing checks for these conditions.

~~~
capableweb
Yeah, I think most people who touch/interact with backend/database code, even
if not actually backend/DB developers, have a reaction nowadays to seeing
auto-incrementing IDs, because they're famously hard to scale to any
distributed architecture and introduce issues when you hit the limit.

Projects started in the last three years that I've collaborated on or been
part of have all ditched auto-incrementing IDs.

~~~
baq
Auto-incrementing integer IDs have some very desirable properties though. Not
sure what you replaced them with.

~~~
dijit
Something I see more and more is a primary key based on guid/uuid;

I'm not fully aware of the merits of either approach, so I'm just putting the
info I have at hand out there.

------
RcouF1uZ4gsC
A couple of observations:

MySQL seems to be somehow connected to all the outages, especially the
unexpected crashes.

>GitHub’s monitoring systems currently alert when tables hit 70% of the
primary key size used. We are now extending our test frameworks to include a
linter in place for int / bigint foreign key mismatches.

In 2020, should we be using Integer (32-bit) primary keys at all? I think at
this point everyone should just go with BigInt or UUID for primary keys/foreign
keys, and basically not have running out of key space be an issue you have to
worry about.

~~~
minxomat
I always use

    UUID PRIMARY KEY DEFAULT uuid_generate_v1mc()

in Postgres. It will give you UUIDs and the larger keyspace, without being
excessively random (they're basically almost sequential). I do this _always_,
even if the table has other int columns that look like friendly values that
could be used as PKs. Some time in the future they _will_ ruin your day.

~~~
abhishekjha
Isn't UUID discouraged for a PRIMARY KEY when it comes to indexes?

What would the B+Tree look like for over 2 billion UUIDs with no ordering
between them? There's also the cache locality problem.

~~~
tasogare
This is why GP wrote "It will give you UUIDs and the larger keyspace, without
being excessively random (they're basically almost sequential)". The main
point for not screwing up the clustered index is the "almost sequential" part.

------
markwaldron
Good on GitHub for being transparent. That said, I hope they return to being a
more reliable platform soon.

------
stunt
Everything is easy and obvious in retrospect, in case you're wondering how they
missed something as simple as PK size.

As part of a maturity model, every product should have periodic milestones,
based on its scale, where engineers reassess their infrastructure and the
choices they made earlier. But you can always overlook things unless you have
a checklist for everything.

------
nickjj
> GitHub’s monitoring systems currently alert when tables hit 70% of the
> primary key size

This is interesting. I wonder if they query the key size every N seconds as
part of their monitor, or if they report the key size to the monitor on write.

~~~
t3rabytes
We have a similar setup that just queries for the auto-increment value every X
period of time; it creates next to zero additional load to do so.
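
A minimal sketch of that kind of periodic check in Python (the driver, the 70%
threshold, and the assumption that keys are unsigned INTs are illustrative; a
real check would look up each column's actual type):

    import pymysql  # assumption: any MySQL driver would do here

    INT_UNSIGNED_MAX = 2**32 - 1  # assumes unsigned INT keys for simplicity
    ALERT_THRESHOLD = 0.70        # alert once 70% of the key space is used

    def check_auto_increments(host, user, password):
        conn = pymysql.connect(host=host, user=user, password=password,
                               database="information_schema")
        with conn.cursor() as cur:
            # AUTO_INCREMENT is the next value each table will hand out.
            cur.execute(
                "SELECT TABLE_SCHEMA, TABLE_NAME, AUTO_INCREMENT "
                "FROM TABLES WHERE AUTO_INCREMENT IS NOT NULL"
            )
            for schema, table, next_id in cur.fetchall():
                used = next_id / INT_UNSIGNED_MAX
                if used >= ALERT_THRESHOLD:
                    print(f"ALERT: {schema}.{table} at {used:.0%} of key space")
        conn.close()

Run from cron every interval, this is a single lightweight query against
information_schema.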

~~~
nickjj
Thanks.

Yeah, that seems sane. The odds of exceeding your monitoring threshold within a
single interval seem close to nil too, e.g. going from 70% to 100% in 30
seconds or whatever your monitoring interval is set to.

------
rhy_bee
I'm curious about how GitHub generates 32-bit IDs in a distributed system.

------
hoseja
The last one is really funny to me in a slapstick way, and anti-flapping sounds
like a really appropriate term.

------
ksec
I am wondering if GitHub has ever considered switching to Postgres?

------
realchucknorris
Facebook can learn something from GitHub.

~~~
dmpetrov
External production systems depend on GitHub. For FB, it is fine to fail from
time to time. Users will be even more productive :)

~~~
kevsim
External production systems unfortunately depend on FB too as we've seen with
all the iOS apps crashing due to issues with FB's iOS SDK.

------
kellenmurphy
In other words, so many outages are happening that we're only going to write
about them once a month.

------
daiyanze
Interesting...

------
poorman
Who doesn’t think to use bigint for their auto-incrementing IDs in Rails? This
seems like it should be non-negotiable for a company of GitHub’s scale.

------
maallooc
I wonder whether these problems are caused by bureaucratic and incompetent
developers, a poor SQL engine, or both.

~~~
dragonwriter
Bureaucratic and incompetent engineering management is more likely than
bureaucratic and incompetent developers.

~~~
aaomidi
They're the same thing.

------
shenli3514
I'm the VP of Engineering at PingCAP.

As a devoted customer and a big fan of GitHub, it’s devastating to see service
interruptions and our business impacted by these problems; as the team behind
TiDB, we (PingCAP) believe there is something we can do to help solve the
database high-availability and scalability problem. We would like to propose
that GitHub’s database team consider the TiDB platform. TiDB and TiKV can work
as a scale-out MySQL, and they have been battle-tested in all kinds of hyper-
scale scenarios.

Additionally, just in case you haven’t done this already, we also recommend
that GitHub consider the Chaos Mesh project ([https://github.com/pingcap/chaos-
mesh](https://github.com/pingcap/chaos-mesh)) for chaos engineering. Chaos
Mesh was our internal chaos engineering platform and we open-sourced it on
Dec. 31, 2019. It can be used to simulate different kinds of failures,
including network partitions, flaky disks, and node outages. You can easily
use it to simulate failures in a test environment and confirm that high
availability works as expected.

~~~
capableweb
If you read the published report, you'll see that only one of the issues is
actually related to scalability in any way (and most newly created projects
have learned the lesson of not using auto-incrementing IDs, for many reasons).
The same goes for many companies: the downtime is not because of any scaling
issues but rather maintenance, code deploys, or other things where humans are
involved.

So while the reward of promoting your product on HN can be high, unless it's
very specific and actually solves their need, it just looks spammy.

~~~
jrockway
The big problem that I took away from the availability report was the one
where they lost acknowledged writes. They might not need the scaling, but they
do need the ability to lose the leader without losing writes. (It would be
preferable to reject the writes and become unavailable rather than to
acknowledge writes only to discard them minutes later. No external API user
can handle the case where "Github acknowledged my writes, and I read them
back, but now they're gone". But they can handle "Github returned 'service
unavailable' when I made a write".)
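
To illustrate the second half of that, a minimal client-side sketch (the
endpoint and payload are made up) of why "service unavailable" is the
recoverable failure mode:

    import time
    import requests  # assumption: some plain HTTP write API on the other end

    def write_with_retry(url, payload, attempts=5):
        """A 503 is recoverable: the client simply retries until it succeeds."""
        for attempt in range(attempts):
            resp = requests.post(url, json=payload)
            if resp.status_code == 503:
                time.sleep(2 ** attempt)  # back off and try again
                continue
            resp.raise_for_status()
            return resp.json()
        raise RuntimeError("service unavailable, giving up")

    # An acknowledged-then-lost write is different: the POST returned 200 and a
    # follow-up GET showed the data, so the client has no signal that anything
    # went wrong when the row later disappears. There is nothing to retry.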

