

DigitalOcean lost our data and gave us $500 - danielfernandez
http://dfernandez.me/articles/digitalocean_lost_our_data

======
sillysaurus2
So if you were backing up your data to Tarsnap, then you'd be up and running
as quickly as you could launch a new instance and redownload everything. And
a $500 credit is enough to power a micro droplet for 100 months, or a small
droplet for 50 months. DO handled this well.

[http://www.tarsnap.com](http://www.tarsnap.com)
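
For the unfamiliar, a backup/restore cycle looks roughly like this (a sketch;
it assumes you've already registered a machine key, and the archive names and
paths are placeholders):

    tarsnap -c -f backup-$(date +%Y%m%d) /var/www /etc   # nightly archive
    tarsnap -x -f backup-20130801            # restore on the fresh instance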

EDIT: s/years/months/g. Thanks.

~~~
asiekierka
You mean months.

~~~
tovmeod
50 months == 4 years and 2 months
100 months == 8 years and 4 months

------
AznHisoka
So this is a technical problem I am having right now that's preventing me from
backing up a Postgres database completely (hope someone here can help).

I have a master Postgres database that is receiving a TON of transactions per
second (I'm talking thousands of concurrent transactions). We tried running
pg_dump on this database, but the DB is just too huge, and it took more than 4
days to dump everything out. Not only that, but it impacted performance to the
point where backing it up this way was just not feasible.

No problem, just create a slave DB and run pg_dump on that, right? We did
just that, but the problem is that you can't run long-running queries on a hot
standby (queries that take more than a minute get cancelled).

What would you do in my scenario? With the hot standby, I technically am
backing up my data, but I would have 100% peace of mind if I could take daily
backups, in case someone accidentally ran a "DROP DATABASE X" that would also
wipe out the hot standby/slave DB.

~~~
jtchang
I use postgres and ran into this issue as well.

Inside postgresql.conf for the slave I have the following:

    # These settings are ignored on a master server.

    hot_standby = on                      # "on" allows queries during recovery
                                          # (change requires restart)
    max_standby_archive_delay = 900s      # max delay before canceling queries
                                          # when reading WAL from archive;
                                          # -1 allows indefinite delay
    max_standby_streaming_delay = 900s    # max delay before canceling queries
                                          # when reading streaming WAL;
                                          # -1 allows indefinite delay
    #wal_receiver_status_interval = 10s   # send replies at least this often;
                                          # 0 disables
    #hot_standby_feedback = off           # send info from standby to prevent
                                          # query conflicts

So I set the delay to 15 minutes for this specific backup server, which I am
okay with. I already have another server with much shorter delays.

~~~
AznHisoka
So you basically sacrifice replication speed in order to ensure long-running
queries don't get cancelled?

~~~
wiredfool
It's sacrificing the expected latency of replication.

Incidentally, if you're on 9.3 and your HW can handle it, take a look at
parallelizing the pg_dump. If you've got a relatively fast disk subsystem and
many cores, you can get a speedup. I've found it tends to make the dumps
O(biggest table) instead of O(sum of all tables).

(It's native on 9.3; I've hacked up some scripts that do it for 9.0, but they
don't get a consistent snapshot, so I only run them during scheduled downtime.
OTOH, the dump/restore is ~6x faster OMM/OMD, so the downtime is that much
shorter.)
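
On 9.3 the parallel dump looks roughly like this (a sketch; -j requires the
directory output format, and the job count and paths are whatever fits your
hardware):

    pg_dump -j 8 -Fd -f /backups/mydb.dir mydb    # 8 parallel worker jobs
    pg_restore -j 8 -d mydb /backups/mydb.dir     # parallel restore, too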

------
chc
The abrasive headline is kind of unfortunate, as the actual moral of the story
given at the end is exactly the right takeaway: Never assume your hardware is
infallible, so always have backups that you know you can use when your server
experiences a wildly improbable catastrophe.

Also, very impressed by Digital Ocean's response here. Given their reputation
as a budget host, they really do put a lot of effort into service.

~~~
dangrossman
> wildly improbable catastrophe

Or an extremely probable one like a hard disk failure. They only last a few
years; most data centers see an annual replacement rate in the 2-13% range.
The failure rate is a known quantity, and their limited 1-3 year warranties
reflect that expectation.

There isn't a host I've used more than a few years where I haven't seen hard
drives (and power supplies) fail. I don't know if my experience is typical,
but hardware RAID controllers seem to go bad on me not-infrequently too,
losing the whole array at once. They don't pay you when it happens, they just
replace it. DO was _extremely_ generous here.

~~~
ChuckMcM
Was going to say the same thing. The odds of a dual drive failure on a RAID 5
system with five 2TB drives are about 1 in 12. With 3TB drives that goes up to
about 1 in 7.

The underlying issue is that the uncorrectable read error rate is 1 in 10^15
bits; this is just physics (thermal noise, read/write signal loss, etc.). With
8b/10b encoding (10 bits on disk per byte of data), 10^15 bits is only about
90TB worth of data. Rebuilding a RAID group of 5 with four 2TB "good" drives
(8TB of data to be read), you will see a failure in one of the other 4 drives
1 in 11.25 times (90/8). With 3TB drives, 1 in 7.5 times. Using simple
mirroring, you won't be able to re-silver a mirror 1 in 45 times, or slightly
more than 2% of the time, with 2TB drives.

Dual parity (RAID 6) or triple mirrors (x3) are now the minimum bar for making
storage reliable.

------
alex_sf
That's _way_ more compensation than I would have expected. AWS usually won't
even notify you until after the node has gone down.

Hardware failures happen; an application needs to be tolerant of them.

~~~
toomuchtodo
And with S3 storage so cheap, they should be backing up directly to S3, across
multiple regions.
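
Something like this in a nightly cron job covers it (a sketch; bucket names
and paths are placeholders):

    aws s3 cp /backups/db-$(date +%F).sql.gz s3://myapp-backups-us/
    aws s3 cp /backups/db-$(date +%F).sql.gz s3://myapp-backups-eu/ --region eu-west-1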

------
gregd
This is 2013. Why are we still talking about backups as a lesson learned? Is
it because startups are skimping on Sys Admins?

~~~
astrodust
It's because some startups have developers that open w3schools, start typing
examples, and somehow ship a quasi-working proof-of-concept that goes into
production.

There's a bit of "if it ain't broke don't fix it" here, but a whole lot of
"get with the program" still required.

~~~
gregd
Well, as a professional systems administrator, it pisses me off more than it
probably should. You want to know why I'm worth what I'm asking? Because when
your shit falls down and goes boom, I'll get you back up and operational in
minutes, or an hour.

Because it's my fucking job to help you manage your IT risks. Azure, Heroku,
and AWS aren't replacements for systems administration; they're just tools in
my arsenal. I don't understand the mentality it takes to go into business
(beta or not) without having SOME understanding of your risk. The fact that DO
paid you a not-insignificant amount for the downtime means you're damn lucky.

------
deanclatworthy
It's great that you had backups, but why the write-up? Is it an attempt to
smear DO's otherwise good name? It's an unmanaged VPS, so it's your
responsibility to keep backups of your box, not theirs. And hardware fails all
the time, so you can expect this to happen anywhere.

------
viraptor
> And if you just launched and have a single instance running, let your alpha
> users know that there will probably be some downtime.

That's true. But there's no reason for extended downtime even if that instance
goes down. Make sure your whole setup is described in
chef/puppet/salt/ansible/cf/whatever, and even a rebuild from scratch then
takes only minutes. There's really little reason to skip that these days.
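
E.g. once everything is in (say) Ansible, recovery onto a fresh instance is a
single command (inventory and playbook names here are placeholders):

    ansible-playbook -i fresh_hosts site.yml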

------
phea
DO is affordable enough that the minimum you should run is 2 droplets. Having
said that, I'm actually fairly impressed with the $500 credit, and now you
have no excuse not to run 2 VMs. Consider it a lesson learnt.

------
JackFr
Alternate title: DigitalOcean went above and beyond their SLA for us.

------
kylec
DigitalOcean's pricing page indicates that "All cloud hosting plans include
automated backups".
([https://www.digitalocean.com/pricing](https://www.digitalocean.com/pricing))
From the email you received, it sounds like this is clearly not the case. I
wonder what other claims DigitalOcean is making that are not true.

~~~
richardjs
There is an automated backup system that you have to enable per droplet,
which creates a snapshot every few days. It's a clear part of a droplet's
control panel. They began charging for it in July 2013; the price is 20% of
the droplet's monthly cost. Sounds like they need to update their pricing
page.

This pertains to a droplet feature, though, not some low-level backup system.
Meaning, it's not as if they're lying about the infrastructure below what a
normal customer can see; they just have an erroneous pricing page.

------
epochwolf
That's always a risk with servers. They can die, and everything on them can
die with them. But these folks had backups, so they didn't lose everything.

------
thu
That seems very nice of DO. I would not expect them to be responsible for
data loss in the case of hardware failure.

------
KaiserPro
This might sound a bit glib, but RAID 5 shouldn't really be used in modern
storage.

Even if you ignore the performance issues (which vary by device), it's just
not safe. Depending on the size of the drives, a rebuild can take anywhere up
to 30+ hours, and bearing in mind that you tend to use disks from the same
batch, that leaves you in the danger zone for far too long.

Your options are: some sort of clever RAID (a ZFS-type thing), another type of
clever RAID (like the LSI chunk thingy in the DCS37000), or RAID 10.

~~~
fiatmoney
For SSDs, where the time-to-read/write-full-capacity is typically much less
than HDDs (both due to higher speed & lower capacity), it can be less of a
poor decision. SSDs also have somewhat more advanced machinery for data
integrity checking and slightly friendlier failure modes (e.g., the sectors
"wear out" over time, but the firmware tends to warn you as that starts to
happen, and you're not going to hit a sudden mechanical failure).

------
ars
Was this really a dual drive failure, or was it the rather common case of a
single drive failure plus undetected errors on another drive that only show up
when trying to rebuild?

Because that happens a lot, and it's very important to do a _full_ read of
every drive in the array at least weekly! You have two options for doing that:

If you are using Linux md RAID, run the "check" command, which automatically
does the test using background I/O (but does still impact things). On Debian,
and perhaps other distros too, the mdadm package will do it every month by
default. Make sure to set a minimum speed, or it might never finish on a busy
system.

You can also use the disk's built-in SMART to run a long self-test. This also
uses background I/O, and I think it has a bit less impact on existing
operations. (But the disk has to have some idle time or it will never finish.)
If you install smartmontools, you can set smartd to run this test for you
every week and keep an eye on the results.
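
For example (a sketch; device names and the schedule are illustrative):

    # md RAID: scrub the whole array, and floor the check/rebuild speed (KB/s)
    echo check > /sys/block/md0/md/sync_action
    echo 10000 > /proc/sys/dev/raid/speed_limit_min

    # SMART: a one-off long self-test, then a weekly one via /etc/smartd.conf
    smartctl -t long /dev/sda
    # smartd.conf line: long test every Sunday at 2am, monitor everything
    # /dev/sda -a -s L/../../7/02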

I personally do both, plus a short self test of the disk every night.

------
neom
I truly believe that we did the best we could in this instance. Drive failures
are always unfortunate, and even with backups, downtime exists.

That being said, we're always genuinely looking to improve, and I'd welcome
your feedback on how you feel we did and how you feel we could do better.
Please do reach out to me personally john@do! Thanks. :)

------
mgkimsal
"we had backups".

Do you mean you had backups on digital ocean (using their backup service) or
something else?

~~~
marveller
I'm wondering the same. As far as I know, the DO backup is saved on Amazon,
no?

------
kbar13
Good thing you had backups.

With that being said, these days it's a good idea to use a deployment tool or
configuration management system like puppet/salt/ansible/chef/etc, especially
in a virtualized environment. This will help with scalability as well as
situations such as these.

------
sebslomski
This is the reason I moved all data away from my server instances. My images
are hosted by Cloudinary (with S3 bucket backup) and my databases are Amazon
RDS instances. I don't care if a server goes down; I can launch a new one in a
matter of minutes (with Ansible) without any data loss.

~~~
dangrossman
Which of those things you named is protecting you from losing your database? I
paid the uber-high fees for RDS with Multi-AZ failover and... well... it
failed, then didn't fail over to another AZ. The instance ended up down for
hours before they recovered it. That's when I jumped ship from AWS, wrote off
the reserved instance payments, moved the database to some rented servers at
SoftLayer, and handle nightly off-site backups myself. Not only do I have
_working_ backups and failover, but 4-8x the capacity per dollar.

------
level09
The author is being gracious; his conclusion was "always back up your data."
If it were me, I would probably say "I'm moving away and will never trust them
with my data again."

~~~
bcoates
Is there a provider that credibly offers high availability Linux servers?
Disks fail, capacitors fail, power fails, network equipment fails (a lot). I'm
sure it's possible to build an ultra-reliable server that mitigates all that
but I doubt it would be worth the money.

------
rb2e
The $500 credit from DO is quite reassuring. Usually if the HD fails and your
data is lost, you're out of luck. I've heard "horror" stories of some hosts
reusing consumer hard drives between servers, so I learned: your data is your
responsibility. I'm glad the OP had backups, but these failures happen;
thankfully DO had the business sense to compensate them.

It seems like good advertising for DO, as any knowledgeable sysadmin knows
drives fail. DO could have done nothing at all.

------
Xorlev
Linkbait title; they handled it exceedingly well. The onus is on you to back
up your data. You did not 'lose' your data, given that you had backups.

------
cbsmith
> And if you just launched and have a single instance running, let your alpha
> users know that there will probably be some downtime.

How about instead: "alpha users should know that there will probably be some
downtime"? Multiple instances don't really fix that.

------
aquadrop
Nice move from DO to give everyone $500 credit. As I remember, they don't
guarantee data safety (you still need backups even if they did). Double disk
failure is a rare thing, but it happens.

------
bcoates
Is DO apologizing here as a PR move, or do they make reliability claims that
would lead you to think this sort of thing wouldn't happen?

------
jonknee
DigitalOcean proudly advertises that they use SSDs... A dual drive failure
with data loss should be very rare. I wonder what happened.

------
monkeyz
So they now run RAID 5? I remember they boasted about RAID 10 a while ago;
now they've silently downgraded to RAID 5 :)

------
Liongadev
What is the best/most cost-effective way to back up a Windows server?

~~~
toomuchtodo
Back up the data and configuration information to an object store (AWS S3),
and use configuration management tools so you can programmatically provision a
new server (dedicated or virtual, doesn't matter) in the event of failure.
Provisioning should include functionality to deploy your application and to
restore your data to whatever data storage application (SQL, NoSQL, etc.)
you're using.
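
A minimal sketch of the backup half with the AWS CLI, which runs on Windows as
well (bucket name and paths are placeholders):

    aws s3 sync D:\backups s3://mycompany-server-backups/nightly/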

If you have questions, I'm more than happy to provide free advice.

------
od2m
What are the best options for backing up DO externally?

