

I Accidentally Deleted All Our Data - taylorfausak
http://taylor.fausak.me/2011/11/18/i-accidentally-deleted-all-our-data/

======
Pewpewarrows
Happy to hear that you had backups and not too much data was lost because you
caught it quickly.

But besides the obvious takeaway of always having recent, tested backups, I
think it's also important to get out of the habit of ever using a REPL (or SSH
for that matter) to connect to production servers. Write a one-off script,
document when and why you made it, check it into your VCS of choice, and run
it against the target server(s). Of course within the script itself feel free
to make any connections you need; my point is to never do it "manually" from
the command-line. This also prepares you for when you move past having a
single live server for web or db. If you're in Python-land, Fabric is the go-
to tool these days for automating this.
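
If it helps, here's a minimal fabfile sketch of that workflow, assuming Fabric 1.x;
the hostname, project path, and script name are all made up:

    # fabfile.py -- run a reviewed, version-controlled one-off script remotely
    from fabric.api import cd, env, run
    
    env.hosts = ['web1.example.com']  # hypothetical production host
    
    def fix_family_records():
        """Feed the checked-in script through the remote Django shell."""
        with cd('/srv/app'):  # hypothetical project directory
            run('python manage.py shell < scripts/2011-11-18-fix-families.py')

Then it's just "fab fix_family_records" from the project root, and the script itself
lives in version control alongside a note on why it was needed.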

~~~
taylorfausak
I usually write my scripts in a temporary file and run them through Django
with something like:

    
    
        $ python manage.py shell <~/tmp-script.py
    

In this instance, I didn't even consider that what I was doing could be
catastrophic. But you're right, you want to avoid doing things manually. In
the future, I'll track all the scripts I run. And thanks for the tip about
Fabric, I'll check it out.

~~~
JonnieCache
For bonus points, clone your production db locally and test your script
against it a few times first.

Or, you know, have a staging server.

~~~
umarmung
Exactly.

This is the reason why enterprises always use staging servers with test
suites.

It doesn't matter whether you use NoSQL, SQL, or magic: if you don't have
development, staging, and production platforms for a business reliant on an
information system, you're doing it wrong and it will bite you hard
eventually.

------
mootothemax
Potential lifesaver for anyone using MySQL: you can start the client with
--i-am-a-dummy to prevent DELETE or UPDATE statements that lack a WHERE clause:

<http://sql-info.de/mysql/notes/I-am-a-dummy.html>

~~~
soult
I don't know how to do this for the MySQL client, but a very useful trick for
the PostgreSQL client is to turn autocommit off, which basically turns your
session into a transaction that you have to manually commit before closing
the client. So if you do a "DELETE FROM important_data WHERE
condition_that_is_always_true" or something and it says "Removed 9001 rows",
you can just ROLLBACK and your data is still safe.
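
You get roughly the same safety net from Python (an assumption on my part --
the trick above is about the psql client itself, but psycopg2 happens to not
autocommit by default, so every statement stays inside an open transaction
until you commit):

    import psycopg2
    
    conn = psycopg2.connect(dbname='mydb')  # hypothetical connection details
    cur = conn.cursor()
    # table name from the example above; the created_at column is made up
    cur.execute("DELETE FROM important_data WHERE created_at < %s", ('2011-01-01',))
    print("%d rows affected; nothing committed yet" % cur.rowcount)
    
    conn.rollback()   # oops -- the rows are still there
    # conn.commit()   # only this would make the DELETE permanent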

~~~
ams6110
Oracle works this way, too. In fact, I don't believe there is any such
ridiculous feature as "autocommit" in Oracle itself; that is only something
that might happen in a client application. The closest thing I can think of is
that SQL*Plus will perform a commit upon normal exit.

------
codypo
Ha, this is actually pretty funny in retrospect. There I was, giving a talk on
all of this great stuff we do with our data, having no idea that a good
portion of that data had just vanished. The universe has many potent
mechanisms for keeping us humble.

~~~
karmajunkie
Something like this happened to me several years back... I accidentally
deleted a table full of emails out of a spam appliance when I thought I was in
my development window. Thank my stars it happened at 5pm on a Friday
afternoon; it still took me the entire weekend to reconstruct the database.
Ever since then I've set my production terminal backgrounds to bright red so I
can't miss which window I'm in :)

------
anujkk
Although direct database updates aren't recommended in SAP, for some strange
reason one of the projects still needed one to modify a wrong entry made by a
functional consultant. One of my fellow programmers was assigned this task,
and this is what she did -

Update <table> set <field>=<value>

instead of

Update <table> set <field>=<value> where <condition>

The code was executed on the development server (thankfully) and it created a
big mess. A full day's work by 5 guys was lost. The project manager banned her
from developing any critical programs and told her she would only write ABAP/4
reports from then on.

~~~
zheng
Sounds like a poor manager to me.

~~~
mootothemax
_Sounds like a poor manager to me._

Agreed, sounds like an absolutely abysmal manager. As long as the developer
wasn't in the habit of making such mistakes, I would have put her _in charge_
of any future similar changes.

------
rick888
I've done this a few times in my career. I'm smart enough now to always back
up before making any potentially destructive change, and to test everything on
a staging server.

The worst was executing "DELETE FROM <table>" instead of "DELETE FROM <table>
where userid='X'" on a production server.

~~~
YogSothoth
A buddy of mine showed me a cool trick to avoid this type of problem. What you
do is first use SELECT to see the records in question:

    
    
       SELECT * from some_table where idx >= 5
    

Then, once you are staring at the records that came back (and are sure they
are the ones you want to delete), change SELECT * to DELETE (and change
nothing else) and rerun.

    
    
       DELETE from some_table where idx >= 5
    

Pretty hard to get it wrong with this approach ;-)

~~~
mootothemax
Or even better, SELECT COUNT(*) FROM some_table - that way the developer won't
be tempted to skip the step because of too many records scrolling through
their screen... not that I learned that from personal experience or anything
;-)

------
orenmazor
I think almost everybody has a story like this. When I was in grade 12, I got
a job as the sysadmin of three Linux labs at my high school. I didn't have a
lot of sysadmin experience, but part of grade 12 comp sci was to give students
a chance to learn something.

One of my first Perl scripts (c'mon, it was 12 years ago) was to adjust
something about user accounts. I don't remember what it was, just that the
result was deleting the home folder of every single teacher at the school.

A very honest mistake.

------
latchkey
"We had a backup from earlier in the week"

week? That seems kind of off. I'd suggest setting up backups more often than
that, especially for production data.

Sorry to hear about that. I destroyed a drive once a long time ago (~1994) as
well. Happens to the best of us.

~~~
taylorfausak
Because of this, we're going to back up parts of our database more frequently.
A lot of our data isn't mission critical, so it wouldn't make sense to do
daily backups for it. Other things, like family accounts, obviously are. We're
going to back those up daily from here on out.

~~~
nirvdrum
Is making a backup of everything that costly? My experience is you'll screw
something up trying to be selective. With storage being so cheap these days,
it's hardly worth the risk.

I don't use mongo so it's not 100% relevant, but I found setting up a postgres
slave and running my backups from that is a nice solution. There's no
additional load on my master and I just dump the backups into S3.

------
dikbrouwer
A long time ago I had an "rm -Rf ..." in a bash script. I missed a space, and
yep, it deleted everything. Luckily the DB wasn't owned by the logged-in user,
so everything was recoverable. One of those "learning the hard way" examples
of what NOT to put in a script (this may all sound very obvious, but when
you're in a hurry, really stupid mistakes do happen).

------
joevandyk
PostgreSQL has PITR (point-in-time recovery) backups:
<http://www.postgresql.org/docs/9.1/static/continuous-archiving.html>

This allows you to go back in time to an earlier snapshot of your data,
without having to do a full database backup.

Unfortunately, it's still a pain to set up, even with tools like
<https://github.com/greg2ndQuadrant/repmgr>. Wish there was more convention
over configuration.

(I use Heroku's WAL-E to back up the data to S3:
<https://github.com/heroku/WAL-E>)

------
harryh
You should consider having a backup replica with slaveDelay set to an hour or
so. If you ever fat finger something in the future you won't necessarily have
to go all the way to your backups.
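
For anyone curious, adding such a member looks roughly like this from pymongo
(a rough sketch assuming a reasonably recent pymongo; the hostnames and the
member _id are made up):

    from pymongo import MongoClient
    
    client = MongoClient('db-primary.example.com', 27017)  # connect to the current primary
    config = client.local.system.replset.find_one()        # current replica set config
    config['members'].append({
        '_id': 3,
        'host': 'db-delayed.example.com:27017',
        'priority': 0,       # never eligible to become primary
        'hidden': True,      # invisible to normal reads
        'slaveDelay': 3600,  # stay an hour behind the primary
    })
    config['version'] += 1
    client.admin.command('replSetReconfig', config)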

------
kfool
Version data, like source code.

    
    
      $ chd version mydb
      $ chd revert -d <txn_id> mydb
    

Neither replication nor backups protect from accidentally deleting data. And
restoring from binary logs requires downtime.

<http://chronicdb.com/blogs/undelete_from_whoops>

------
tripzilch
Wowwww! Great to hear you had backups and could mostly restore your loss
though!

You said you were working from the interactive shell, so you didn't have
history? Now I always use IPython, which does keep a history (which I set to
_very long_). Isn't there some way you can configure the default Python
interactive shell to keep a history as well? Sounds like that would be useful
for all sorts of unexpected things. Otherwise, get IPython; it's really good,
and still just a plain interactive shell, only with extra fancy features.

Glad to hear it turned out mostly all right though. I always feel for these
data-loss stories because I can imagine what it'd be like ... that sinking
feeling. Brrrr.

~~~
taylorfausak
Python's default interactive shell has history, but it only keeps it for the
current session. Once you kill it, the history is gone forever. Once I
realized what I'd done, I checked to see if Python keeps the history anywhere.
It doesn't, but I found several wrapper scripts that do (and IPython).
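
For reference, the usual wrapper trick is a PYTHONSTARTUP file that hooks
readline; a minimal sketch (the history file path is just a convention):

    # ~/.pythonstartup -- export PYTHONSTARTUP=~/.pythonstartup in your shell
    import atexit
    import os
    import readline
    
    histfile = os.path.expanduser('~/.python_history')
    try:
        readline.read_history_file(histfile)  # reload history from previous sessions
    except IOError:
        pass  # first run, no history file yet
    readline.set_history_length(10000)
    atexit.register(readline.write_history_file, histfile)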

------
slardat01
I added a SAN LUN to a volume that turned out to be in use, and was storing
our only copy of a 500GB (sic) backup database tracking 60k+ tapes. After
laying out the partition on the volume, someone noticed that the database had
disappeared, and we were able to revert the partition and get the data back. I
ended up buying some pretty nice Cognac for shortening the lives of a few
folks. No idea why they weren't backing up their database, though.

------
JonnieCache
OUCH.

Doesn't django let you check if models are valid without saving them?

In rails I can do:

    
    
        Family.all.select {|f| !f.valid?}

~~~
kingkilr
Django's ORM does (obj.full_clean()), but he's not using Django's ORM; he's
using mongoengine, and I have no idea if it does.

~~~
sibsibsib
mongoengine does do validation. It can be disabled on a per-model basis
though.
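
So something like this should flag bad documents without writing anything back
(a sketch assuming the post's Family document class):

    from mongoengine import ValidationError
    
    broken = []
    for family in Family.objects:
        try:
            family.validate()  # field validation only; nothing is saved
        except ValidationError as exc:
            broken.append((family.id, exc))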

------
jakejake
I try to always keep in mind that you can deal with any error except data
loss. Any bug can be fixed but if data is gone there is absolutely nothing you
can do. It's pretty easy to forget about keeping good backups because it's not
particularly exciting. Testing regularly to make sure you can recover from
backups will give you a lot of peace of mind.

With that in mind, I'm going to do a test restore from our backups right now!

------
dagheti
Duplicate values that should be unique are such a common problem if you don't
use constraints.

This is a good example of why ACID and declarative constraints are a very good
idea for data management. You can do a bad DELETE FROM query if you don't have
the proper safeguards (though the querying ability of SQL lets you more easily
preview your changes), but the initial corruption of the data is an easily
avoidable problem.

------
adavies42
> Unfortunately my screen session doesn’t have enough scrollback to see what I
> entered earlier in the day.

`defscrollback 100000` is probably the single most important thing in my
.screenrc. And if you're working in production, you should really have `deflog
on` too.

------
skurry
I wonder if it's possible that it was a MongoDB glitch and not a user error.
From what I've read, it's notorious for randomly losing records. Maybe the
save in the loop triggered it?

~~~
taylorfausak
I definitely don't attribute this to MongoDB (or MongoEngine or pymongo or
Django). I haven't seen it lose any of our documents to date. Silently
dropping documents when saving them without making any changes would be a huge
problem. It's astronomically unlikely.

~~~
gojomo
Maybe something in the Mongo-ORM layer went wrong, so you were iterating over
empty models, and then saving them back. Could that have created the absences
seen later? (The way you talk about your client automatically fixing things on
the load/save cycle suggests you might have some magic there.)

It just seems hard to believe you confused 'save' and 'delete'...

~~~
taylorfausak
I'm pretty sure nothing fishy went on behind the scenes. The client
"automatically" fixes things through our API when it tries to log in because
it will create an account if it doesn't exist.

I, too, have a hard time believing I confused "save" and "delete", but I'm
reasonably sure that's what happened. I think I misused Python's shell
history. I'll bet I hit the up arrow, which resurrected "family.delete()"
instead of "family.save()".

------
kirillzubovsky
I had butterflies in my stomach just from reading your post. Congrats on
having the backup in place!

------
greenshirt
Good post! And a great quick fix.

------
AmazingBytecode
Hey guys, I accidentally all of our data. Is this dangerous?

~~~
taylorfausak
I originally titled this post "I Accidentally All Our Data", but my girlfriend
convinced me to change it. I also considered "Delete ALL the Data", but I went
for the meme-free title.

------
drivebyacct2
Who works with a prod database in an interactive interpreter? My eyes are
literally wide in shock.

~~~
jrockway
I do this from time to time. But I also type "BEGIN" before anything else.

~~~
joevandyk
Watch out for database locks if you do this.

I once opened a transaction, did an ALTER TABLE, and the site hung until I
either committed or rolled back the transaction.

~~~
jrockway
Yeah. Changing the schema is not something I'd do with code running against
the database.

Also, my database typically refuses to start a transaction if the entire
database is locked, and my application handles failing transactions by waiting
a while and retrying.

~~~
matc
Why not?

~~~
smokinn
Timeouts, probably.

If you mean why is it a problem at all, it's usually because of database size.
On any large-scale deployment (i.e. you have at least a million users), schema
modifications will take _hours_. The only way to do reliable schema
modifications is to have extra capacity and do it in stages. Also, your
forward changes have to be backwards compatible. (AKA you're not allowed to
both add and remove a column at the same time.)

The way to do it is to take some of your slaves out of the request pool and
run the ALTER TABLEs on them. You do this several times, depending on your
available capacity. (You probably can't just rip out half your slaves; you
probably need to do at least 3 batches.) After you've altered all your slaves,
you can promote one to master and take the old master offline to do its own
alter. Then you push the code changes to production and add the old master
back into the pool as a slave once it's done its alter.

In this scenario you need 3x the time the alter takes. So if the alter takes
6-7 hours (common in MySQL if you have a large-ish table), it's going to take
you at least 18 hours before you can push your code that depends on a database
change.

Doing this manually at scale, instead of with an automated deployment process,
is extremely risky and will almost certainly be screwed up often.

This is one of the main reasons people are hoping schemaless databases work
out in practice.

~~~
matc
Agreed on backwards compatible, but for application semantics; _not_ types of
schema changes.

There's no reason to not be allowed to both add and remove a column at the
same time, or to merge and split whole tables. In your example, these kinds of
changes would not be possible.

There's also no reason to not be able to run old and new code at the same
time, or to revert a schema change.

With ChronicDB we reduced schema changes to:

    
    
      $ chd change -f upgrade_map mydb
    

Schemaless databases don't solve this, just as an instantaneous ALTER TABLE
won't solve this.

