
Too Perfect a Mirror - mercurial
http://jefferai.org/2013/03/24/too-perfect-a-mirror/
======
Confusion
The number of commenters who think the KDE sysadmins were stupid enough not
to know that "RAID isn't a backup strategy" is depressing. Either you didn't
read the article or you haven't understood it.

Git repositories contain redundant information that allows consistency checks.
If a bit flips in one of your repos, _git fsck_ will immediately catch it. The
KDE sysadmins thought these consistency checks were also triggered when keeping
a repository clone in sync, and thus that any FS-level corruption would be
caught at the first subsequent attempt to sync.
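
For illustration, roughly what that check looks like on a bare clone (the
path here is invented):

    # Verify object integrity and connectivity of a repository; a
    # flipped bit in any object shows up as a hash mismatch.
    cd /srv/git/kdelibs.git    # hypothetical repository path
    git fsck --full
    # On corruption, expect errors along the lines of
    # "error: sha1 mismatch ..." or "broken link from commit ...".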

If you think "Gee, if I know this, surely they do?!" then perhaps the answer
is: "They probably do and this issue is more subtle than you initially
thought. Reread carefully before implying how dumb others were."

~~~
pja
Relying on consistency checks will not save you if someone does the git
equivalent of rm -rf on the master repository. Which is why mirroring (of any
description) is _not_ _a_ _backup_ _strategy_.

Yes, git --mirror should probably automatically invoke the moral equivalent of
git fsck by default in order to catch internal repository inconsistency like
the other git commands do, and the KDE team have been caught out by this. But
they still don't have any protection against user error leading to loss of
data with this setup as far as I can see, and that seems like a huge
oversight.
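
Until that's the default, you can bolt it on yourself. A minimal sketch,
assuming a bare mirror clone at an invented path:

    #!/bin/sh
    # Hypothetical mirror-update wrapper: sync first, then verify,
    # and refuse to serve the mirror if verification fails.
    set -e
    cd /srv/mirror/repo.git    # invented path
    git remote update
    if ! git fsck --full; then
        echo "corruption detected; taking mirror out of rotation" >&2
        exit 1
    fi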

~~~
michaelt
Surely every backup system has some equivalent of an rm -rf? A disgruntled
employee could phone the off-site tape archival company and tell them to toss
all the tapes in the shredder.

~~~
reeses
In the specific case of offsite tape archival, let's say someone who has the
authority to do so requests that they destroy all tapes.

Most service providers (the ones who stay in business) have enough compliance
measures in place that either multiple authorized people must be in collusion,
a sufficiently senior executive must make the request, or there will be a
"package" ready to be served to the client and the relevant policing unit
(police, FBI, whatever) so that charges can be brought against the malefactor
efficiently.

While you may be limited in financial compensation by your contract with the
service provider, it is absolutely in their best interest to avoid the
situation with procedures (they do not want to be a party to a crime) and if
those fail, to provide extensive records of the movement and disposition of
those tapes.

------
chris_wot
How many times does it have to be said? Mirroring is NOT a backup strategy!

The number of times I've seen some sysadmin(s) base their entire organization
on this faulty premise is absurd. Mostly it's because they have decided that
RAID 1 or RAID 5 should be a decent "backup" strategy, but then there are
those who believe mirroring systems is how to do backups.

They never, ever, take into consideration what happens when something
corrupts/is deleted/is compromised. Without a way of going back in time (i.e.
an _actual_ backup) they are forever stuffed.

Sysadmins: MIRRORING IS NOT A BACKUP SOLUTION. STOP DOING THIS!!!

~~~
richo
Did you actually read the article? The mirroring in question is NOT block-
level like you'd see with DRBD or RAID.

It's the --mirror option to much of git's plumbing, and it's not the same
thing.

~~~
jlgreco
I think you should re-read chris_wot's post.

> Mostly it's because they have decided that RAID 1 or RAID 5 should be a
> decent "backup" strategy, _but then there are those who believe mirroring
> systems is how to do backups._

I think he is implying that this is a case of the second. That is to say, I
think he is saying that --mirror is not a backup strategy.

~~~
chris_wot
That's correct. Yes, I read the article :-)

~~~
jefferai
You didn't. Mirroring in this case refers to using git --mirror.

You're assuming it works like a traditional file system or block level mirror,
but it doesn't. Corruption would in most cases have been caught. The weak (and
accidental) link was relying on the server to give us a proper accounting of
the current valid repositories.

~~~
jlgreco
> _You didn't. Mirroring in this case refers to using git --mirror._

We have established that he knows they used git --mirror, and I am pretty
certain that you could not possibly know that he did not read the article.

------
kurlberg
> _The root of both bugs was a design flaw: the decision that git.kde.org was
> always to be considered the trusted, canonical source._

It seems that an even bigger design flaw is that they (still) aren't doing
regular backups. The mirroring of course provides some redundancy, similar to
what RAID does, but as they say: "RAID is not a backup solution".

~~~
sho_hn
Backups address the problem only to the degree that they give you an older
revision to revert to. The interesting thing that happened here is that
corrupted data was propagated through the mirror network, which syncs more
often than backups get made; the interesting question is how to prevent that.
Because while having a safety net is nice, keeping developers from being
inconvenienced by a failure at all is the real challenge.

Plus, unchecked backups of corrupted data aren't worth a lot, and corruption-
proof mirroring acts as a further (and timely) backup.

~~~
Nitramp
Having a hot standby/failover is nice, but of course only the icing on top of
your backup strategy.

I read the article the same way as the OP: it doesn't mention any backup
system, only the mirroring. In fact, in the end they did restore from a
replica that was out of date (projects.kde.org). Had they had actual backups,
surely the article would have mentioned that they only used projects.kde.org
because it was somewhat more recent than their last backup?

The story about planning to do regular ZFS snapshots hints at the same; if
they had a backup system, they wouldn't need that.

edit: sorry, is that your post/do you have more insight? In that case I'm sure
you know better than me speculating on what the author meant ;-)

~~~
sho_hn
> edit: sorry, is that your post/do you have more insight? In that case I'm
> sure you know better than me speculating on what the author meant ;-)

Nah, I'm not Jeff. I have some general insight because I was involved with
setting up our git infrastructure in its early days, but I haven't worked on
the mirroring code, and I've been out of the loop on day-to-day admin
operations for a while, so I can't comment on the backup schemes that may or
may not be in action on the servers right now.

------
emilsedgh
I must say a big thank you to KDE's sysadmins. Please remember that they are
doing this as volunteers.

------
nathanstitt
Something I haven't seen anyone mention in these threads is that KDE is an
open source project - therefore they have hundreds (if not thousands) of
backups.

Even if all the official repos were destroyed, all they'd have to do is ask
the last person who'd pulled to give them a copy of their clone.

No doubt it would be a pain to do, but no data should have been lost.

As Linus said: "Only wimps use tape backup: real men just upload their
important stuff on ftp, and let the rest of the world mirror it"

------
nathell
An amazing thing about git (and other DVCSs as well) is that even if a much
more serious catastrophe had happened (e.g., if a nuclear bomb had struck the
KDE datacenter), it would probably still be possible to reconstruct (an
approximation of) the master repo, simply due to the fact that it was fully
cloned on hundreds of developers' machines worldwide.

Linus Torvalds once coined an adage that "real men don't make backups. They
upload it via ftp and let the world mirror it." Well, the FTP bit isn't true
anymore, but otherwise DVCSs have enabled this for mere mortals.

~~~
sho_hn
In terms of how this would be done practically: We did have intact gitolite
logs I believe, which record the credentials involved in pushing any ref and
what they're getting updated to, so we'd have known what data we would have
needed to locate and who we could contact to provide it. And since the commit
hashes describe their content, there wouldn't have been a risk of manipulated
data.
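
A sketch of what that could look like in practice (clone path, branch name,
and hash are all invented):

    # Fetch the lost branch from a volunteer developer's clone...
    git fetch /home/dev/kdelibs master:refs/heads/recovered-master
    # ...and check it against the hash the gitolite log recorded.
    git rev-parse recovered-master
    # If this prints the logged hash (e.g. 1a2b3c4...), the entire
    # history behind it is verified, since the hash covers it all.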

~~~
jedbrown
Presumably the mirrors also did not run an aggressive 'git gc' immediately
after 'git remote update', so they would still have non-corrupt commits in the
object store, in which case you could recover by "just" resetting any corrupt
refs.
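
Roughly, as a sketch (ref name and hash invented):

    # List the damage...
    git fsck --full
    # ...then point each broken ref back at the newest commit that is
    # still intact in the local object store (hash illustrative):
    git update-ref refs/heads/master 1a2b3c4d5e6f7a8b9c0d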

------
lucaspiller
It would be interesting to hear what someone from GitHub has to say about
this; they must have dealt with this issue at some point.

Also, I would be interested to know why KDE don't use something like GitHub or
Bitbucket. It would be cheaper for the organisation, and they could still set
up web hooks to get notifications of commits.

~~~
lbeltrame
KDE has a policy of not relying on non-FOSS services, as explained more
eloquently by the other reply to this comment.

~~~
kawsper
So they want to support Hetzner but not Github?

~~~
sho_hn
There are open alternatives to GitHub - we managed to put one together we're
reasonably happy with, after all - but unfortunately the buck right now tends
to stop at the hardware.

As for Hetzner: The git.kde.org master server isn't at Hetzner, nor are a
bunch of the mirrors. Our infra is pretty distributed and eclectic as far as
hosting locations go, partly because a lot of the resources are donated from
all over. We don't "support" any hoster in particular.

------
mehrdada
To me, it sounds like the mirroring system is circumventing Git and is syncing
the underlying directory structure, in which case, Git is absolutely not to
blame. It's not a Git reliability issue. Had they been using "git fetch" on
the mirror servers to clone from the backup servers, checking SHA1s while
doing so, the issue would not have happened and the corrupt files would not
have gotten silently replicated across servers.
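
A sketch of what fetch-based syncing might look like on a bare mirror (the
strict object checking shown is off by default):

    # Sync through git's object transport: objects are re-hashed as
    # they are indexed on the receiving side, so a bit-flipped object
    # on the source can't slip through silently.
    git -c fetch.fsckObjects=true fetch origin '+refs/*:refs/*'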

~~~
masklinn
> To me, it sounds like the mirroring system is circumventing Git

The mirroring system _which git provides_?

~~~
avar
The mirroring they were using is explicitly meant to be a fast local ad-hoc
clone that _doesn't_ do integrity checks.

They had used the safe version before, but switched because they were running
into problems _with the integrity checks_, i.e. ref deletions and
non-fast-forwards.

What they should have done was write a hook or a script that performed those
non-safe updates deliberately (maybe only for some repositories and some refs;
you don't want to rewind e.g. master) - see the sketch below.

But instead they completely bypassed the safety mechanisms and got screwed by
corruption.
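
A sketch of such a script (paths and the protected ref are chosen purely for
illustration):

    #!/bin/sh
    # Sync via the checked transport, then apply the "unsafe" cleanup
    # deliberately: prune refs deleted upstream, but never touch the
    # protected branch. All names are illustrative.
    set -e
    cd /srv/mirror/repo.git
    git fetch origin '+refs/heads/*:refs/heads/*'
    for ref in $(git for-each-ref --format='%(refname)' refs/heads); do
        if [ "$ref" != "refs/heads/master" ]; then
            # Delete the local ref only if it no longer exists upstream.
            git ls-remote --exit-code origin "$ref" >/dev/null ||
                git update-ref -d "$ref"
        fi
    done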

------
abbot2
Unfortunately many of us learn this simple truth the hard way: a mirror is not
equal to a (daily) backup.

~~~
RyanZAG
Non-checking backups are not a perfect solution here. On non-ZFS filesystems
you can get slowly accumulating corruption in files, and when you take a
backup, you copy that same corruption over to your backup as well.

Going back through years of backups to find a non-corrupt copy can take a lot
of time, during which your service is down. Not a perfect solution by a long
shot. Discovering which files have been updated and which are corrupt is also
non-trivial.

~~~
abbot2
Do your daily backups using rsync+hardlinks (rsnapshot, dirvish or something
similar) and keep a long history. This is slower than copy-on-write ZFS
(obviously), but works reliably on any Linux/Unix file system and the storage
cost is roughly the same as for ZFS.
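
The core trick those tools automate, as a sketch (paths invented; GNU date
assumed):

    # Daily snapshot: unchanged files become hardlinks into yesterday's
    # snapshot, so each day costs only the changed data.
    today=/backup/$(date +%F)
    yesterday=/backup/$(date -d yesterday +%F)
    rsync -a --delete --link-dest="$yesterday" /srv/data/ "$today"/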

~~~
raphinou
This is kind of what I'm doing for my backups, but I still don't feel safe
(I'm kind of paranoid about my backups): what if an attacker gets into your
server and wipes out all your data and backups? And you know Murphy is always
ready to strike... I'm currently looking at making regular backups offline, on
DVD or Blu-ray discs, and automating the process. I wonder if this might be a
service people are interested in. Let me know what you think... (I put a
landing page at <http://www.offlinebackups> to test reactions)

~~~
abbot2
It is never a good idea to keep backup copies in the same place as the source
data, so it should not normally be that common for an attacker to be able to
wipe both the original and the backup. As for offline optical disc backups,
they are still ridiculously expensive compared to magnetic spinning drives or
tapes. Backup, especially an automated one, is always an extra security risk
to consider, but apparently there are no other good ways...

~~~
raphinou
A lot of people put their backups on S3, with a script running on the server.
Even if you use IAM to limit the rights to only putting files, the attacker
can still overwrite existing files on S3. The only way I've thought of to
prevent that is to grant write access with no listing access, and append a
random number to the file name. But who does that? I'm sure 90%+ of the
servers backing up to S3 are not safe against this scenario.

The reason I thought of DVDs is that they're not sensitive to electromagnetic
fields the way disks and tapes are. (You never know:
[http://www.telegraph.co.uk/science/space/9097587/Solar-flares-everything-you-need-to-know.html](http://www.telegraph.co.uk/science/space/9097587/Solar-flares-everything-you-need-to-know.html) )

~~~
andrewf
If you turn on file versioning in S3, then you'll be able to get to the data
that was "overwritten". I don't _think_ there's a way for someone with only
PUT access to work around this.
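
Turning it on is one call with the AWS command-line tools (bucket name
invented):

    # Keep every version of every object; an overwriting PUT then
    # creates a new version instead of destroying the old one.
    aws s3api put-bucket-versioning \
        --bucket example-backup-bucket \
        --versioning-configuration Status=Enabled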

------
shurcooL
Corruption scares me even to this day.

My backup strategy is as follows: my most important files are in my Dropbox
folder, so they're both on my computer and on Dropbox's servers.

But what if my drive goes bad and pushes corrupted files to Dropbox?

That's why I have a second machine running the Dropbox client that I only turn
on every week or so. I hope that if something goes wrong (including Dropbox
itself wiping all my files both remotely and locally), I can still get the
older versions from it. That, and Time Machine backups (they include the
Dropbox folder).

------
rdl
It's funny how even in their postmortem they don't seem to grasp the obvious:
live mirrors are not a backup strategy.

Git mirroring is great, and periodic consistency checks would help, but
snapshots taken and stored (offline) for reasonable periods of time are the
only reasonable backup model. There are corruption issues, availability
issues, etc. for which offline backups are far more robust. Ideally you would
also cryptographically sign your backups (which is easier than just keeping
track of hashes).

(And obviously a backup system is meaningless if you don't also test restores
periodically and monitor the success of the whole process.)
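
The signing part might look something like this (file names invented):

    # Detached signature over the backup archive, so tampering or
    # corruption is detectable before you trust a restore.
    tar -czf repos-backup.tar.gz /srv/git
    gpg --detach-sign repos-backup.tar.gz   # writes repos-backup.tar.gz.sig
    # At restore time, verify before trusting the contents:
    gpg --verify repos-backup.tar.gz.sig repos-backup.tar.gz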

------
chmike
I had a few questions:

Q1: Why not run a _git fsck_ on the canonical server before allowing mirror
servers to sync?

Q2: Could _git fsck_ be optimized to do only incremental checks, on the diffs
sent to the mirrors?

Q3: If a canonical git server is used, why not ensure it is very safe against
data corruption?

Q4: What about the ext4 corruption in the VMs? Has the cause been identified?

------
geuis
His last sentence about ZFS is impossible to parse. Why aren't they using it?

"I’d love to see this in use, but, after having had excellent experiences with
it on Linux for a couple of years, I’m a ZFS fanboy at this point; and, I
don’t know how well it’s supported on SUSE, which is the distribution
git.kde.org is running (although I’ve run it on Gentoo, Debian, and Ubuntu
without any problems)."

~~~
arg01
He'd love to use it: he's had excellent experiences running it on
Gentoo/Debian/Ubuntu Linux and is a ZFS fanboy at this point; however, as he's
unsure of its support on SUSE, he won't.

~~~
geuis
Thanks. Much clearer.

------
BUGHUNTER
The git issue (if it can be reproduced and validated as true) is an
interesting aspect of the story.

For admins there is another interesting one: they artificially created an
extremely vulnerable SPOF - and after this disaster they are still doing it.

Can you see the mistake?

Hint: if you have many things, why do you make them into one thing?

~~~
BUGHUNTER
Ok, the elephant-sized error in the room is still unseen - so I will give you
some more hints: nobody has written about this error until now. It is still
there, and the danger of total destruction is also still there, because the
single root of the evil was not removed. Do you still not see it?

~~~
chris_wot
Just tell us, rather than try to look smart. It's not helpful.

~~~
BUGHUNTER
This was not meant to look smart. I refuse to compete and do not feel the
motivation to position myself; I speak freely here. Please do not apply a
rat-race-like competitive mindset - that is misunderstanding me. I was just
really interested in whether the SPOF had gone unseen by everybody.

Of course I am willing to help. The problem is described clearly in the
author's first point: they generate _one_ "projectfile" - whatever this looks
like, it is a reduction of many to one. The distribution of 1500 git repos
with thousands of files relies on one single file. There is no technical need
for that; in fact, it eliminates the power of distributed repos by reducing
everything to reliance on the presence and integrity of one single text file.

The author writes about this file being corrupted and triggering a process
that killed repos at random - the incident is a good bad example of what can
happen if you follow the antipattern of making one out of many.

Building redundant systems, you always try to achieve the opposite - make many
out of one, to eliminate the SPOF. You cannot scale this infinitely, because
in the end we are living on just one planet.

However, unnecessarily making one out of many is the worst thing you could do
when building a backup or code distribution system. This antipattern still
exists in many places and should be eliminated.

This is not about filesystem corruption etc. - the reason for the destruction
was one single project file. Do not do this. It is not critical for a backup
system if it takes a long time to scan a filesystem for existing folders over
and over again. A backup system is not a web app, where making one out of many
(a.k.a. caching, in this case) might be a good thing; a backup system does not
need this reduction.

------
whitehat2k9
Darn, why couldn't this have happened to the Gnome 3 repos instead?

------
kbar13
at least he took advantage of the downtime to upgrade some stuff.

