
MySQL infrastructure testing automation at GitHub - samlambert
https://githubengineering.com/mysql-testing-automation-at-github/
======
andy_ppp
I take it from this blog that any engineer at GitHub can get a copy of
anyone's data. I find this vaguely terrifying, especially because Yahoo! was
so careful internally about what an engineer could actually get access to.
Most data like this had usernames/names/emails/other sensitive info redacted.

Can anyone at GitHub get a copy of my private repo(s)? Because I'm out if they
can.

"Move fast and take a copy of the referrers table."

~~~
philovivero
I'm curious: what have you ever seen Github publish that would indicate there
was any such security around your private repos within their organisation? I
would have assumed there was none myself, but you seem to've assumed they'd
treat your private repos like financial data or something.

Legit question, in case someone thinks I'm trolling or whatever.

~~~
andy_ppp
I think they'd treat my data like gold, and not let any engineer take a copy,
because:

a) a leak could potentially threaten GitHub's business (blackmail, a
disgruntled employee, an internal threat, security services, concerned
comments like this, etc.)

b) real companies store their source code on GitHub; their source code is at
least 50% of their business.

c) GitHub isn't really a startup anymore; they have hundreds of employees, and
you can't let that many people have unfettered access to data. Someone could
lose a laptop...

d) As mentioned, Yahoo! was very careful with customer data

e) It's called a Private Repo

I'm making a big set of assumptions, you're right. I'll move my GitHub to my
own server now... installing GitLab, which I absolutely hate, but it's better
than trusting an unknown set of engineers.

I've just realised that probably any Digital Ocean engineer can get root on my
box if they want to, right? :-(

~~~
leesalminen
Why do you hate GitLab? I've had it deployed for nearly a year and some recent
releases have resolved many of my concerns.

> I've just realised that probably any Digital Ocean engineer can get root on
> my box if they want to, right? :-(

I would assume yes.

~~~
andy_ppp
Just can never bloody find the thing I want. It's the same when I drink Pepsi
or Burger King; purely psychological...

~~~
Aeyris
You want at least some control over the hardware: disk encryption on a
dedicated server, or (even better) a caged, locked, colocated rack.

The system administrators of anything multi-tenant can access your data. The
problem is they probably don't care. It's up to you where to draw that line,
though.
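
The disk-encryption suggestion above might be set up with LUKS, roughly like
this (a sketch only; device names and mount points are placeholders, and
`luksFormat` destroys whatever is on the device):

```shell
# Create a LUKS container on the data partition (placeholder device).
cryptsetup luksFormat /dev/sdb1

# Unlock it under a name of our choosing, then put a filesystem on it.
cryptsetup open /dev/sdb1 securedata
mkfs.ext4 /dev/mapper/securedata

# Mount the decrypted mapping where the repos will live (placeholder path).
mount /dev/mapper/securedata /srv/repos
```

Note that encryption at rest protects against stolen or decommissioned disks;
as the comment says, a live administrator of the running system can still read
the data.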

 _"Administering a mail host is sort of like being a nurse; there's a brief
period at the start when the thought of seeing people's privates might be
vaguely titillating in a theoretical sense, but that sort of thing doesn't
last long when it's up against the daily reality of mess. Now that I think
about it, administering a mail host is exactly like being a nurse, only people
die slightly less often."_

------
jivid
Super interesting post. Would love to read more detail about their backup and
restore infrastructure.

If Tom and/or Shlomi are reading this: you mention taking multiple logical
backups per day. What benefit does this bring versus just having one per day
and doing a point-in-time restore using binlogs? Is this just a tradeoff
between time taken for a restore and storage you're willing to dedicate to
backups?
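
For context, the single-daily-backup alternative described in the question
could look roughly like this (a sketch; hosts, paths, binlog positions, and
timestamps are all invented):

```shell
# 1. Restore the most recent full logical backup onto a restore host.
mysql --host=restore-host mydb < /backups/mydb-2017-07-06.sql

# 2. Replay binlogs from the backup's recorded position up to just
#    before the point of failure.
mysqlbinlog --start-position=107 \
            --stop-datetime="2017-07-06 14:59:00" \
            /binlogs/mysql-bin.000123 /binlogs/mysql-bin.000124 \
  | mysql --host=restore-host mydb
```

Taking several backups per day mainly shrinks step 2: the fewer hours of
binlogs there are to replay, the faster the restore.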

Disclaimer: I work on Facebook's MySQL backup and restore system
([https://code.facebook.com/posts/1007323976059780/continuous-mysql-backup-validation-restoring-backups/](https://code.facebook.com/posts/1007323976059780/continuous-mysql-backup-validation-restoring-backups/))

~~~
shlomi-noach
@jivid the logical backups are done per-table, not per-server. Per-table
logical backups are useful to the engineers who own the data: they make it
easy to restore data from a single table.

When an engineer loads logical backup data, it loads into a non-production
private zone where the engineer has access to the data and can then make
informed decisions on whether there is a need to re-apply data changes (due to
a bug, a need to review historical data, etc.).

This of course has the advantage of quicker restores (only a single table
needs to be restored), and it happens to cover the vast majority of cases. It
doesn't cover the case where we need to restore consistent data across two or
more different tables.
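
The per-table workflow described above might look something like this
(illustrative only; hostnames, schema, and paths are invented):

```shell
# Back up one table as a logical dump (repeated per table, on a schedule).
mysqldump --host=backup-replica --single-transaction \
  mydb repositories > /backups/mydb/repositories.sql

# An engineer later restores just that table into a non-production
# private zone, inspects it there, and decides what to re-apply.
mysql --host=private-zone mydb_restore < /backups/mydb/repositories.sql
```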

------
Thaxll
What's the difference between gh-ost and the Percona tool
(pt-online-schema-change)? Also, did you try a recent version of MySQL that
supports live migration?

~~~
samlambert
[https://githubengineering.com/gh-ost-github-s-online-migration-tool-for-mysql/](https://githubengineering.com/gh-ost-github-s-online-migration-tool-for-mysql/)
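
In short, per that post: pt-online-schema-change tracks changes with triggers
on the original table, while gh-ost is triggerless and tails the binary log
instead. A typical gh-ost run looks something like this (host, schema, and
column names are placeholders):

```shell
# Illustrative gh-ost invocation; connection details are made up.
gh-ost \
  --host=replica-host \
  --database=mydb \
  --table=repositories \
  --alter="ADD COLUMN visibility TINYINT NOT NULL DEFAULT 0" \
  --assume-rbr \
  --execute
```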

------
ngrilly
Great post! Do you use semi-synchronous or asynchronous replication? If you
use asynchronous replication, when a server crashes and this triggers the
automated failover, do you lose the last transactions?

~~~
ngrilly
It looks like the answer to my question is here:
[https://githubengineering.com/orchestrator-github/](https://githubengineering.com/orchestrator-github/)

~~~
sciurus
The author of Orchestrator also goes into questions like that at
[http://code.openark.org/blog/mysql/mysql-high-availability-tools-followup-the-missing-piece-orchestrator](http://code.openark.org/blog/mysql/mysql-high-availability-tools-followup-the-missing-piece-orchestrator)

Tangentially, in that blog post I was impressed by the list of companies
running orchestrator.

> orchestrator is actively maintained by GitHub. It manages automated
> failovers at GitHub. It manages automated failovers at Booking.com, one of
> the largest MySQL setups on this planet. It manages automated failovers as
> part of Vitess. These are some names I’m free to disclose, and browsing the
> issues shows a few more users running failovers in production. Otherwise, it
> is used for topology management and visualization in a large number of
> companies such as Square, Etsy, Sendgrid, Godaddy and more.

~~~
ngrilly
Thanks for the link: it's a fantastic read!

