
What really happened at Ma.gnolia and lessons learned [with video] - markup
http://factoryjoe.com/blog/2009/02/16/what-really-happened-at-magnolia-and-lessons-learned/#
======
lbrandy
The short version: the file system got corrupted. The backup was just a file
sync over a FireWire network to another machine, meaning the bad data was
backed up and presumably overwrote the older, good data. They had RAID, but
the corruption happened at the software/filesystem level, so the errors were
faithfully stored.

He seems to understand what a terrible design decision he made regarding the
backup system, and he appears visibly affected when having to admit,
publicly, the details of the infrastructure (or lack thereof) that caused
this.

~~~
moe
A one-liner to add insult to injury:

sed 's/rsync/rdiff-backup/g' <bin/my-backup.sh >bin/my-real-backup.sh

~~~
moe
After watching the vid I have to take that back. Apparently there was a SQL
database involved...

But rdiff is highly recommended nonetheless.
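
For anyone who hasn't tried it, the basic workflow looks something like this (hostnames and paths are invented, and `RUN="echo"` just prints the commands instead of executing them):

```shell
#!/bin/sh
# Hypothetical sketch of rdiff-backup usage; hosts and paths are made up.
# Clear RUN to actually execute; "echo" only prints what would run.
RUN="echo"

# Mirror /var/www to a remote host. rdiff-backup keeps the latest tree as
# plain files, plus reverse increments under rdiff-backup-data/.
$RUN rdiff-backup /var/www backup@backuphost::/srv/backups/www

# List the increments (restore points) stored on the remote side.
$RUN rdiff-backup --list-increments backup@backuphost::/srv/backups/www

# Restore the tree as it looked three days ago.
$RUN rdiff-backup -r 3D backup@backuphost::/srv/backups/www /tmp/restore
```

The key difference from a plain rsync is those increments: a corrupted source overwrites only the current mirror, not the history.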

~~~
hachiya
Anyone know how rdiff-backup compares to duplicity? I know duplicity has an
option to turn off encryption, if one wants to remove that overhead...

~~~
jrockway
rdiff-backup keeps a live version of the filesystem available, in addition to
backups. This means a full restore is just a `cp -a` operation.

FWIW, I've used both, and I like the opacity of the duplicity backups, since I
store them on S3. If you are syncing to a nearby disk, though, then you might
like rdiff-backup better.
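
A rough idea of what the S3 workflow looks like (bucket name, paths and credentials are all invented placeholders; `RUN="echo"` just prints the commands):

```shell
#!/bin/sh
# Hypothetical duplicity-to-S3 sketch; everything here is a placeholder.
RUN="echo"

# duplicity picks its credentials up from the environment.
export AWS_ACCESS_KEY_ID="placeholder"
export AWS_SECRET_ACCESS_KEY="placeholder"
export PASSPHRASE="placeholder"   # used to GPG-encrypt the archive volumes

# Incremental (or first-time full) backup to S3. This is where the
# "opacity" comes from - S3 only ever sees encrypted volumes.
$RUN duplicity /home/data s3+http://my-backup-bucket/data

# Restore the latest state into /tmp/restore.
$RUN duplicity restore s3+http://my-backup-bucket/data /tmp/restore

# The no-encryption variant asked about upthread, to skip that overhead.
$RUN duplicity --no-encryption /home/data s3+http://my-backup-bucket/data
```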

~~~
moe
Seconded, both have their place.

I have tried pretty much all of them, incl. snapBack, dirvish and various
homegrown scripts building on top of rsync, rdup, rcs and so on.

rdiff and duplicity are the most mature of the pack, which shows mostly in
their handling of corner cases (connection loss during backup, resume of a
partial/failed backup, disk full during backup, handling of really large
trees) but also in overall convenience and robustness (legibility of on-disk
format, configuration sanity, tools to find a specific revision of a file,
flexibility in retention/purge intervals etc.).

I generally recommend rdiff as the default tool for backups to a remote
spinning disk. duplicity, as parent suggests, is good when you need your
archive to be a single large file which helps with handling in some
situations.

There is also _dar_ worth mentioning which is less useful for incremental
stuff but can add redundancy to archives which is good for archiving to
unreliable/decaying media (DVD, Tape). Be aware though that older versions had
problems with large archives, use a recent version.

And last but not least, if you have a tape library then Bacula is a mighty
tool. It's easier to use and pretty much on par, feature-wise, with the
commercial offerings and with incumbents like Amanda.

We generally deploy a single backup server here with lots of disks that pulls
snapshots from everywhere via rdiff and either mirrors the local repository to
a remote location or feeds the precious data to tape via Bacula.
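
Our pull loop is roughly this shape (host list and paths invented, and `RUN="echo"` stands in for actually running anything):

```shell
#!/bin/sh
# Sketch of the pull-style backup server; hosts and paths are invented.
RUN="echo"
HOSTS="web1 web2 db1"

for host in $HOSTS; do
    # Pull a snapshot of each machine into the local repository.
    $RUN rdiff-backup "root@$host::/etc" "/backups/$host/etc"
    $RUN rdiff-backup "root@$host::/srv" "/backups/$host/srv"
    # Retention/purge interval: drop increments older than eight weeks.
    $RUN rdiff-backup --remove-older-than 8W "/backups/$host/etc"
    $RUN rdiff-backup --remove-older-than 8W "/backups/$host/srv"
done
```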

------
jim-greer
If you have any kind of staging/testing server I'd highly recommend using your
production backups to populate that on a regular basis. That way you test your
new code releases with real data, and you know that your backups work.

~~~
tptacek
Quick cherry bomb to lob into this conversation: populating insecure test
servers with sensitive production data is a classic web app company security
failure. It probably doesn't matter for you, but be cognizant of it.

~~~
Tangurena
I agree. One of our big financial clients has an automated tool to scrub such
data, but then they have social security numbers as well as lots of other
juicy financial data. So they're worried about all sorts of stuff that most of
us never ponder as a business risk.

One of the sanitizing steps is to replace all passwords with a set value, such
as six or seven repetitions of a letter (like "A") or a number (e.g.,
"111111"). Another sanitizing step is to scramble names and addresses. Usually
the first letter gets preserved, and the rest gets replaced with a hash (say,
MD5 it, then base64 it and truncate it to length; that way it preserves max
lengths and the typical size of words).

example: John Doe, 1313 Mockingbird Lane might get munged into Jiqw Dyh, 1313
Masdfasdfas Lfds
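
The munging step looks roughly like this in shell - the function name is mine, and the exact alphabet/truncation details will vary:

```shell
#!/bin/sh
# Rough sketch of the name-munging described above: keep the first letter,
# MD5 the word, base64 the digest, and truncate to the original length so
# field sizes stay realistic. The munge() helper is hypothetical.
munge() {
    word=$1
    first=$(printf '%s' "$word" | cut -c1)
    rest_len=$(( ${#word} - 1 ))
    scrambled=$(printf '%s' "$word" | md5sum | cut -d' ' -f1 \
                | base64 | tr -d '=+/' | cut -c1-"$rest_len")
    printf '%s%s\n' "$first" "$scrambled"
}

munge "Mockingbird"   # "M" followed by 10 hash-derived characters
```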

We just have username/password/address/phone, so all we do is set all
passwords to a default value (all emails, if any, get set to mine), and munge
up telephone numbers. Later this year I'll cobble up a better sanitizer. Our
parent company has to worry about GLBA compliance, but our little apps don't
"collect" enough information to worry about GLBA at this time.

~~~
timf
> replace all passwords with a set value

They're storing passwords directly and not hashes? Wish I could ask you which
company so I could avoid them...

~~~
sho
I don't understand this obsession about only storing hashes, as if that's the
primary critical issue with site security. There are plenty of reasons to
store the plaintext, and in a well secured database I really don't think it is
much of an issue. Or as I heard someone say once, "If you can break into my
database, and show me how, I will quite literally give you a million dollars".

Off the top of my head, here's a couple of very good reasons to store
plaintext:

- password recoverability: if the user knows they can recover the password,
they're more likely to use a more complex one

- flexibility with authentication: to use something like HTTP Digest Auth,
you need the plaintext to be able to hash it with a one-time nonce
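
To make the Digest point concrete, here's the RFC 2617 computation sketched in shell (all values invented). Strictly speaking the server can store HA1 = MD5(user:realm:password) instead of the raw password, at the cost of tying it to a single realm:

```shell
#!/bin/sh
# RFC 2617 Digest response computation; every value below is an invented
# example, and the md5 helper is just a wrapper around md5sum.
md5() { printf '%s' "$1" | md5sum | cut -d' ' -f1; }

user="alice"; realm="example"; pass="secret"
nonce="abc123"; method="GET"; uri="/index.html"

ha1=$(md5 "$user:$realm:$pass")     # the server may store this, not $pass
ha2=$(md5 "$method:$uri")
response=$(md5 "$ha1:$nonce:$ha2")  # what the client sends back

echo "$response"
```

Either way, a salted one-way hash of the password alone is not enough to verify a Digest response, which is the point being made.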

And like many will no doubt point out, hashing it isn't all THAT secure
anyway. If it's not a very strong hash, or there's enough information to reset
it somehow, they can get what they want anyway. Not to mention that if your
database has been cracked they probably have everything they want anyway - why
even bother logging in?

I just don't get it. Sure, defence in depth is the best strategy and everyone
should practise it wherever possible. But whether the password is stored
hashed or not is not the linchpin security issue many make it out to be, IMO.

~~~
timf
_"obsession about only storing hashes, as if that's the primary critical issue
with site security"_

Can you point me to something I said that implies this is an _obsession_ or
that this is what I think the _primary critical issue_ is with site security?

 _" password recoverability: if the user knows they can recover the password,
they're more likely to use a more complex one"_

Why would I as a user care at all if I could retrieve the actual value of a
complex password -- and why would knowing I could recover it make me then
choose a more complex one?

(The user should be given an option of _resetting_ the password via a link
sent by email. Sending passwords themselves over email is a great way to have
it revealed for someone else to use later.)

" _to use something like HTTP Digest Auth_ "

Good thing no one needs this mediocre authentication method if SSL is
available.

The majority of people use the same passwords at different sites. So even if
someone's cracked your database, hashing is still a good idea. Storing
passwords in plaintext is a non-neighborly thing to do.

~~~
sho
_"Can you point me to something I said that implies this is an obsession or
that this is what I think the primary critical issue is with site security?"_

You asked for the financial institution's name so you could avoid them, based
solely on the password storage issue. That counts as obsession to me. Oh and I
forgot to write it before, but financial institutions often _need_ to store in
plaintext anyway, for telephone authentication.

 _"Why would I as a user care at all if I could retrieve the actual value of a
complex password -- and why would knowing I could recover it make me then
choose a more complex one?"_

If people know they have to remember it, they tend to choose simpler
passwords, or they write it down. If you tell users to set a hard password,
and they can recover it later if necessary, they would hopefully tend to use
better ones. I can't really back that up with a study, though, so it could
just be my experience.

 _"The user should be given an option of resetting the password via a link
sent by email. Sending passwords themselves over the email is a great way to
have it revealed for someone else to use later."_

This is veering off topic, but you either trust the email or you don't. What,
pray tell, is the difference between sending the password and sending a link
to reset the password, if an attacker has access to the victim's email?

 _"Good thing no one needs this mediocre authentication method if SSL is
available."_

Yeah, pity SSL is not an authentication method. You did know that, right?

Digest authentication is heavily used in APIs and other non-browser
applications, where you need some authentication but the tunnel is not
necessary and you don't want to maintain heavy sessions. SSL, apart from NOT
being an authentication method, is anyway slow and heavy and requires proper
certs, so is mainly used only for user-facing web sites. Not to mention
intranets, devices, etc.

Anyway, even if HTTP Digest Auth were in fact rare, trying to wave it away
with "good thing no-one needs it" is ridiculous. I, personally, need it, and
am very far from alone.

I'd like to mention that I do agree in principle, and am playing devil's
advocate to some degree. My point is that password hashing is not a panacea,
it is often not even possible, and I would certainly not avoid a site just
because they store in plaintext if I otherwise had a good impression of their
security practices.

I suspect that many companies you know, trust and use have a plaintext copy of
your password with them, and you wouldn't even know it.

~~~
timf
" _That counts as obsession to me._ "

It's a red flag, not an obsession...

" _financial institutions often need to store in plaintext anyway, for
telephone authentication._ "

Mine doesn't. And yes, if they did, I would not be their customer. Just
because I may not know exactly what happens behind the scenes somewhere
doesn't mean I can't react to the red flags I can see.

" _If you tell users to set a hard password, and they can recover it later if
necessary, they would hopefully tend to use better ones_ "

How is that any different than if the user can _reset_ the password?

" _What, pray tell, is the difference between sending the password and sending
a link to reset the password, if an attacker has access to the victim's
email?_ "

There is a big difference. Anyone who has access to the text of the mail _at
any point in time_ now has your password. It's about mitigating the risks of
the crappy vetting channel (email) with a time limited method (a reset URL).

 _"Yeah, pity SSL is not an authentication method. You did know that, right?"_

For password based things, I am referring to the _channel_ used to avoid the
well known problems with digest access authentication such as man in the
middle attacks.

Besides what I was referring to: used with non-anonymous X509 client certs,
yes SSL _is_ in fact used for authentication. Entire infrastructures are built
on it. All of the clusters I have access to only let me in by virtue of X509
client certificates over SSL.

 _"'good thing no-one needs it' is ridiculous. I, personally, need it, and am
very far from alone."_

I said good thing no one needs it _if SSL is available_, not that no one needs
it...

I use it myself in software we release that runs behind a firewall, I'm well
aware it's cheaper.

 _"I would certainly not avoid a site just because they store in plaintext_ "

I admit it's a little on the reactionary side for me to say that; it was a
quick, snarky comment.

But I don't take back that it's a red flag.

~~~
sho
Fair enough. I think we agree anyway, I'm just being difficult : )

 _"Mine doesn't. And yes, if they did, I would not be their customer."_

Are you sure about that? However would you know? And how would they do
telephone banking?

I wouldn't expect a bank to store plaintext either, I'd expect them to encrypt
it and handle decryption at the terminal. But that's a whole different kettle
of fish.

 _"Anyone who has access to the text of the mail at any point in time now has
your password."_

Yeah, there is no way I want my passwords going through email either. That
argument was a bit flaky.

 _"avoid the well known problems with digest access authentication such as man
in the middle attacks"_

Your point is valid, but I wanted to respond by saying we're talking mainly
about large-scale DB theft, 99 times out of 100 done by an insider. You seem
to have experience inside a large organisation so you will know that often,
SSL terminates at the load balancer, a password form will pass into the server
from the balancer in plaintext. If there's an attacker on the inside, he can
sniff that to his heart's content. You could argue Digest is actually _more_
secure in this setting.

Toss-up between more security on the user's LAN/WLAN (SSL) and more security
inside the DC (Digest)... OK, this is a bit whimsical.

 _"All of the clusters I have access to only work by virtue of X509 client
certificates over SSL"_

Me too, actually. But, sadly, that's not appropriate for the public at large.

Anyway, I agree it's a red flag, just trying to make a point that it's not as
black and white as it seemed you were suggesting. There can be good reasons to
store in plaintext, and if it's carefully implemented I don't have a problem
with it. As long as it's an informed choice, and not just a naive default, and
that goes the other way as well.

------
Maro
He had RAID and was doing filesystem-level backup, i.e. copying over the
entire MySQL DB file. When filesystem-level corruption occurred, the backup
script overwrote a good (perhaps one-day-old) backup file with a corrupted
file, so his backup was worthless.

The first thing that comes to mind is that he could have used application-
level backup, i.e. going through MySQL itself. The backup script would have
noticed that the DB was corrupted because reads (SELECT) would have failed,
so it would have stopped and sent him an email telling him to restore from
the last good backup file.

If he used a cloud service like Amazon SimpleDB, he wouldn't have to worry
about filesystem-level corruption, because that's abstracted away by Amazon.
(And it's replicated.)

This is still not enough, though. What if the site gets hacked and the hacker
issues DELETE statements? Then all your data is deleted, and even if you have
application-level backup, it will succeed (it will read the empty DB), thus
overwriting your old backup.

I guess the conclusion is to keep around several copies of the data, and have
sanity-checks in place to avoid overwriting good backups. In his case it was
hard (given it's a homegrown application) to keep around many copies, because
his DB was 500G in size.
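
A rough sketch of what I mean (names and paths invented; `RUN="echo"` stands in for the actual dump, and the rotation keeps the seven newest copies instead of overwriting one file):

```shell
#!/bin/sh
# Sketch only: database name and paths are invented.
RUN="echo"
set -e
STAMP=$(date +%Y%m%d)
DUMP="/tmp/backups/site-$STAMP.sql.gz"
mkdir -p /tmp/backups

# Application-level dump: mysqldump reads every row, so corruption that
# breaks SELECTs makes it exit non-zero instead of silently copying bad
# bytes the way a file-level sync does.
$RUN mysqldump --single-transaction mydb
# (for real: mysqldump --single-transaction mydb | gzip > "$DUMP", and on
#  a non-zero exit keep the old copies and alert instead of overwriting)

# Keep several dated generations instead of one file that a bad run can
# clobber: delete everything but the seven newest dumps.
rotate() {
    ls -1t "$1"/site-*.sql.gz 2>/dev/null | tail -n +8 | xargs -r rm -f
}
rotate /tmp/backups
```

This doesn't solve the 500G problem by itself, but even two or three rotated generations would have survived the overwrite described above.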

------
amix
A simple tip for those that run any kind of database: Be sure to replicate
them in master-slave (or master-master). And base your backups on taking a
slave down for backups.

Hot backups only work for very small databases - even those based on LVM
snapshots, tarsnap, InnoDB hot backups, etc. With big databases you will most
likely be IO-bound, and a backup will take your site down.

If you have lots of load and lots of data then re-creating a slave will
require lots of downtime. For Plurk.com we have had a 4 hour downtime due to
re-creating a slave, so be sure to run a master-slave setup and have fresh
slaves replicated at all times (we have learned this the hard way :)).
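
Roughly like this (host names invented; `RUN="echo"` just prints what would run):

```shell
#!/bin/sh
# Sketch of dumping from a replication slave so the master keeps serving
# traffic; the slave host name is invented.
RUN="echo"

# Pause replication so the dump is a consistent point-in-time snapshot.
$RUN mysql -h slave1 -e "STOP SLAVE SQL_THREAD"

# Dump from the slave; the IO load lands there, not on the live site.
$RUN mysqldump -h slave1 --all-databases

# Resume replication; the slave catches up from its relay log.
$RUN mysql -h slave1 -e "START SLAVE SQL_THREAD"
```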

------
markup
If you were to start a web application from scratch, how would you deal with
important database backups?

~~~
hachiya
<http://forums.theplanet.com/index.php?showtopic=91115&view=findpost&p=599193>

This looks like a simple and safe way of handling repeated MySQL backups.

~~~
markup
Yeah, I use a similar approach. However, for ma.gnolia we are talking about a
database approaching half a terabyte (unless I misunderstood), so my question
was more like: what do you people consider the best way to approach database
backup so that it's sustainable, scalable and the most disaster-proof? Got
any testimony (whether personal or from public companies), or case studies?

~~~
joshu
I wonder how they managed to get to half a terabyte. Delicious's was smaller
even for millions of users.

~~~
sjh
According to the write-up on Wired
(<http://blog.wired.com/business/2009/01/magnolia-suffer.html>), Ma.gnolia
also took a snapshot of the page being bookmarked. This may account for the
size of the database.

------
jonasvp
My recommendation for basic backup needs: rsnapshot. I back up our public
server to our internal network, as well as my desktop machine to an encrypted
portable drive, using it: <http://www.rsnapshot.org/>

It's probably similar to rdiff-backup, which I haven't used. If you're fine
with daily or hourly backups and don't have too much data (<100 GB), rsnapshot
together with regular SQL dumps works fine.
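
A crontab along these lines is all it takes (paths and intervals are just examples; the hourly/daily retain counts themselves live in rsnapshot.conf):

```shell
# Hypothetical crontab; paths are examples only.
0 */4 * * *   /usr/bin/rsnapshot hourly
30 3 * * *    /usr/bin/rsnapshot daily
15 3 * * *    mysqldump --single-transaction mydb | gzip > /var/backups/mydb.sql.gz
```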

