The Day I Took Down 100,000 Web Sites (kloninger.com)
33 points by sefk on July 25, 2012 | 30 comments



"Don't delete, disable" is another key.

Even if I'm doing a mass file delete, if I've got the time for it, first thing I'll do is move/rename the files to a backup / marked-for-deletion directory. Watch for strangeness. Then commit.

If it's site accounts or features, disable them, prevent them from being served, mark them inactive in the database / server. Leave them for a week or two. Then nuke.

Deletions required by law (illegal content, etc.) are another matter, but most other content is eligible for a graduated approach.
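
A minimal sketch of that graduated approach for plain files on a single filesystem (paths and names here are just illustrative):

  # Stage 1: quarantine instead of deleting (same-filesystem rename, so it's cheap)
  mkdir -p /srv/marked-for-deletion/2012-07-25
  mv /srv/data/stale-report-2011* /srv/marked-for-deletion/2012-07-25/

  # Stage 2: after a week or two of watching for strangeness, purge for good
  rm -rf /srv/marked-for-deletion/2012-07-25

The same two-phase idea carries over to database rows: flag them inactive first, and only purge rows that have carried the flag long enough.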


Luckily, Ning did (does) have a nice way to handle network deletes. Once a network is "retired" it is gone from the system in all visible ways, and then a month later an auto-purger comes along to clear its database content and blobs for good.

I don't take credit for that -- it was the work of smart folks who came before me. But it sure saved my bacon this time.
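
The delayed purge can be as simple as a daily cron job over whatever "retired" area you use; a rough sketch (not Ning's actual purger), assuming retired items are quarantined under a single directory:

  # crontab entry: each night, remove anything retired more than 30 days ago
  0 3 * * * find /srv/retired -mindepth 1 -maxdepth 1 -mtime +30 -exec rm -rf {} +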


I can't upvote this enough. For bonus points, tar the items in question so there's no possibility of access by processes that are holding a file open. If a file is kept open by a process, simply moving it on the same filesystem may not be enough to identify problems.


The advantage of a same-filesystem move (usually to a subdir or alternate directory) is that it's a fast action. You're renaming an inode, not copying data around. When you're dealing with many, or large, files, that's an issue.

(Assuming Linux) If you're concerned with open files, you can run 'lsof' and screen for files of interest. 'fuser -v filename' will show the processes using a file; 'fuser -vk filename' will kill them (xargs or loops for many files, obviously).

And not everything that you're cleaning up is necessarily files, so adapt methods accordingly.
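
For the open-file check, a rough sketch of those commands against a quarantine directory (paths are placeholders):

  # list processes that still hold any of the quarantined files open
  lsof +D /srv/marked-for-deletion

  # per file: show the owning processes, then kill them if that's appropriate
  fuser -v /srv/marked-for-deletion/stale-report-2011.log
  fuser -vk /srv/marked-for-deletion/stale-report-2011.log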


Just to say, moving or tar+removing does not change anything from this point of view. A file will continue to exist as long as at least one process holds it open. As dredmorbius said, use lsof if this is a concern for you.


"The day I did something very very bad" can happen to marketers too.

I owned a pay per click management agency from 2005-2009. In 2008 I decided to buy some Enterprise software to help me better manage big clients in competitive spaces. I migrated these clients over to the software, set up a bunch of automated rules to change bids, and called it good. Then Saturday hit and we suddenly exceeded a single account's budget by $40,000.

That was the day that I lost $40,000. I took responsibility for every penny of that loss.

After a thorough internal investigation, we determined it was a bug in this Enterprise software. We took our records to the software company and they confirmed that it was indeed a bug on their side. Unfortunately, they were such a big company that we couldn't get anywhere when seeking a refund. They lawyered up real fast upon our request for our money back. I didn't have the means to pursue it at that point, so we took the hit as a business and continued on our way.

The lessons I learned: audit every automated task, find ways to lower your risk while implementing changes, and hire/retain a business lawyer for the duration of your business.


Whoa, that's a bummer story. Not quite that London trader who somehow managed to lose $2B on bad hedging, but still.


You're right. It could have been much worse. It's all about perspective, isn't it? I like to think I grew up a little in business after this incident.


je-SUS. I felt sick just reading that...


It took a good year-plus for me to get over it, although I still try to persuade people considering this particular company to seriously consider alternatives.


Unless there's some very specific, considered reason you're not naming and shaming them, I'm a little surprised you aren't. That's quite a hit.


The past few years have been incredibly good to me. I don't have the time or energy to name and shame. I just keep busy trying to hustle and do what's right. I will also try to never make that mistake again because I feel like I learned something valuable with that experience.


Fair enough.

I have had the experience of shaming a company with a little essay that, over the years, garnered a few responses from other customers, a few employees, and one particularly irate investor. And eventually I watched the firm file for bankruptcy liquidation.

Fun, that.


As an intern I rolled out a shopping cart upgrade to 150 of our subsidiaries. On Friday afternoon. Without testing it.

Monday sucked.


Concur about big changes on Friday. In our daily release meetings, we always used to ask people who wanted to release things on Friday -- "will you be at your keyboard all weekend in case something goes wrong?" Answering no to that meant no Friday release.


Make changes during the day.

And if at all possible, don't put major changes live on a Friday afternoon; you're just going to ruin your weekend and everyone else's.


I think the "Test your changes somewhere other than production" lesson was left out of the list at the end of the post.


Fair point. I had tested that delete/restore worked; it was just the "what are you deleting" criteria that I'd missed. Our test environment didn't match production that closely, so the difference in data is what bit me.


That's the bigger and more subtle lesson: it is really important that your test environment be very similar to production.


I'm no expert, but having JS/CSS served out based on DB-dependent values seems dumb. I would say it wasn't your fault so much as the technical debt collector coming around.


Sounds to me like some type of legacy code that was never revised.

Things were probably moving so fast that revising it was never a priority. Instead of changing how that main file worked, it was "best", and part of Ning's policy, to just note that it's "important" and not to be messed with.

That said, if you have the resources to create a team of people to revise early code and how it relates to the whole system, it's important to just get it done instead of waiting for something to "blow up".


Unrelated: I just remembered a hacker called Tiger-M@te who defaced 700k websites in one day and took over a datacenter. Mad stuff.


That isn't me. :)


Lesson 0. You are not a child anymore with your mother nagging you to clean your room. If something is doing no harm, don't delete it.


You are not a child anymore: you have become a parent and it is now your responsibility to keep everyone's room clean - with or without their cooperation. If something exists, it consumes resources, even if it is just attention - so it can only exist if it has a purpose, however insignificant. So keep the rooms clean - and enjoy the bonus of spotting cracks in the floor more easily.


Anything that is sitting around doing nothing is likely doing some amount of harm. For example, making backups take longer, increasing the learning curve for a codebase, etc.


The room-cleaning metaphor isn't as apt as you think. The metaphor I have found most helpful, heard from several other voices, is the garden-tending one.

There's a reductio ad absurdum lurking in the room-cleaning premise. I could write a million lines of code that don't do any measurable harm -- each just calling a no-op -- and by your standard they ought not be removed. Not doing any harm isn't the same as actively contributing value, which is what most people want their codebase to do.


Sure, if it's not doing harm. But in this case, all this old data was really costing us:

- real costs - database storage, blob storage, (minor) perf impact

- hidden costs - tougher to create test environments that look like production when production is so large, yet the bulk of it is invisible / unusable.


That's how I ended up with 700 versions of FW. If something is not in use, archive it and make it inaccessible, or someone will come along and do things with it you never intended.


I estimate I have around $200 a month in EBS costs for volumes and their backups that need cleaning up, if I can get around to it. But it's just me in my startup, and I have no choice but to leave it lying around for the moment. The lesson is that what I spend on that in a year (if I leave it that long) would easily pay for a temp resource for a month. That's a lot of harm things lying around can do to a cash-strapped startup.
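
If anyone's in the same boat, a rough starting point with the AWS CLI for spotting the likeliest cleanup candidates (assuming the CLI is configured for your account; the queries are just illustrative):

  # EBS volumes not attached to any instance
  aws ec2 describe-volumes --filters Name=status,Values=available \
      --query 'Volumes[].{ID:VolumeId,Size:Size,Created:CreateTime}' --output table

  # snapshots you own, oldest first, to spot backups of long-gone volumes
  aws ec2 describe-snapshots --owner-ids self \
      --query 'sort_by(Snapshots,&StartTime)[].{ID:SnapshotId,Vol:VolumeId,Started:StartTime}' \
      --output table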



