

The Day I Took Down 100,000 Web Sites - sefk
http://sef.kloninger.com/2012/07/taking-down-100000-sites/

======
dredmorbius
"Don't delete, disable" is another key.

Even if I'm doing a mass file delete, if I've got the time for it, the first
thing I'll do is move/rename the files to a backup / marked-for-deletion
directory. Watch for strangeness. Then commit.

If it's site accounts or features, disable them, prevent them from being
served, mark them inactive in the database / server. Leave them for a week or
two. Then nuke.

Legally mandated deletions (illegal content, etc.) are another matter, but
most other content is eligible for a graduated approach.

~~~
bradleyland
I can't upvote this enough. For bonus points, tar the items in question so
there's no possibility of access by processes that are holding a file open. If
a process holds a file open, simply moving it on the same filesystem may not
be enough to surface problems: the open descriptor follows the inode, so the
process keeps working as if nothing changed.
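A quick sketch of the tar variant, under the same hypothetical scratch layout
as nothing here reflects a real system: archive the candidates, remove the
originals, and keep the archive around as the restore path during the soak
period.

```shell
#!/bin/sh
# Archive-then-remove: new opens by path now fail loudly, which surfaces
# hidden dependencies before the archive itself is finally discarded.
set -eu

WORK=$(mktemp -d)
mkdir -p "$WORK/doomed"
echo "old data" > "$WORK/doomed/file.txt"

# Archive the candidates, then remove the originals.
tar -C "$WORK" -czf "$WORK/doomed.tar.gz" doomed
rm -rf "$WORK/doomed"

# If something breaks during the soak period, restoring is a one-liner:
#   tar -C "$WORK" -xzf "$WORK/doomed.tar.gz"
```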

~~~
dredmorbius
The advantage of a same-filesystem move (usually to a subdirectory or
alternate directory) is that it's a fast action. You're updating a directory
entry, not copying data around. When you're dealing with many, or large,
files, that matters.

(Assuming Linux) If you're concerned with open files, you can run 'lsof' and
screen for files of interest. 'fuser -v _filename_' will show the processes
using a file; 'fuser -vk _filename_' will kill them (xargs or loops for many
files, obviously).

And not everything that you're cleaning up is necessarily files, so adapt
methods accordingly.

------
jameszol
"The day I did something very very bad" can happen to marketers too.

I owned a pay-per-click management agency from 2005 to 2009. In 2008 I decided
to buy some enterprise software to help me better manage big clients in
competitive spaces. I migrated these clients over to the software, set up a
bunch of automated rules to change bids, and called it good. Then Saturday
hit, and we suddenly exceeded a single account's budget by $40,000.

That was the day that I lost $40,000. I took responsibility for every penny of
that loss.

After a thorough internal investigation, we determined it was a bug in the
vendor's software. We took our records to the software company and they
confirmed that it was indeed a bug on their side. Unfortunately,
they were such a big company that we couldn't get anywhere when seeking a
refund. They lawyered up real fast upon our request for our money back. I
didn't have the means to pursue it at that point, so we took the hit as a
business and continued on our way.

The lessons I learned: audit every automated task, find ways to lower your
risk while implementing changes, and hire/retain a business lawyer for the
duration of your business.

~~~
sefk
Whoa, that's a bummer story. Not quite that London trader who somehow managed
to lose $2B on bad hedging, but still.

~~~
jameszol
You're right. It could have been much worse. It's all about perspective, isn't
it? I like to think I grew up a little in business after this incident.

------
notjustanymike
As an intern I rolled out a shopping cart upgrade to 150 of our subsidiaries.
On Friday afternoon. Without testing it.

Monday sucked.

~~~
sefk
Concur about big changes on Friday. In our daily release meetings, we always
used to ask people who wanted to release things on Friday -- "will you be at
your keyboard all weekend in case something goes wrong?" Answering no to that
meant no Friday release.

------
mootothemax
_Make changes during the day._

And if at all possible, don't put major changes live on a Friday afternoon;
you're just going to ruin your weekend and everyone else's.

------
webmonkeyuk
I think the "Test your changes somewhere other than production" lesson was
left out of the list at the end of the post.

~~~
sefk
Fair point. I had tested that delete/restore worked; it's just the "what are
you deleting" criteria that I'd missed. Our test environment didn't match
production _that_ closely, so the difference in data is what bit me.

~~~
lmm
That's the bigger and more subtle lesson: it is really important that your
test environment be very similar to production.

------
kellysutton
I'm no expert, but having JS/CSS served based on DB-dependent values seems
dumb. I would say it wasn't your fault; it was just the technical debt
collector coming around.

~~~
sreyaNotfilc
Sounds to me like some type of legacy code that was never revised.

Things were probably moving so fast that revising it was never an option.
Instead of changing how that main file worked, it was "best", and part of
Ning's policy, simply to note that it was "important" and not to be messed
with.

That said, if you have the resources to create a team of people to revise
early code and how it relates to the whole system, it's important to just get
it done instead of waiting for something to "blow up".

------
xSwag
Unrelated: I just remembered about a hacker called Tiger-M@te who defaced 700k
websites in one day and took over a datacenter. Mad stuff.

~~~
sefk
That isn't me. :)

------
rwallace
Lesson 0. You are not a child anymore with your mother nagging you to clean
your room. If something is doing no harm, _don't delete it_.

~~~
liotier
You are not a child anymore: you have become a parent and it is now your
responsibility to keep everyone's room clean - with or without their
cooperation. If something exists, it consumes resources, even if it is just
attention - so it can only exist if it has a purpose, however insignificant.
So keep the rooms clean - and enjoy the bonus of spotting cracks in the floor
more easily.

