An administrator accidentally deleted the production database (gliffy.com)
533 points by 3stripe 611 days ago | 331 comments



My very first job - ~25 years ago.

Destroyed the production payroll database for a customer with a bug in a shell script.

No problem - they had 3 backup tapes.

First tape - read fails.

Second tape - read fails.

Third tape - worked.... (very nervous at this point).

I think most people have an equivalent educational experience at some point in their careers.

Edit: Had a project cancelled for one customer because they lost the database of test results..... 4 months work! Their COO (quite a large company) actually apologised to me in person!

Edit: Also had someone from Oracle break a financial consolidation system for a billion dollar company - his last words were "you need to restore from tape" and then he disappeared. I was not happy, as his attempts at "improving" things were the cause of the incident! Wouldn't have been angry if he had admitted he had made a mistake and worked with us to fix it - simply saying "restore from tape" and running away was not a good approach.


A coworker of mine used to say "It's not the backup, it's the restore." Meaning your backup process isn't meaningful unless you have a tested and effective means for recreating your system from that backup. It has stuck with me.


Made that mistake once. I was interning at the IT Helpdesk at a Pharma Startup. Our CEO calls us one day, informing us that he deleted a slide deck, and was wondering if we could do anything about it. We dutifully attempt to restore from tape, to find the tapes blank and that our backup has done literally nothing for the past month. Thankfully, the data loss was nothing more severe, but the lesson stuck.


In my last job I was a member of an IT team of three. One guy was the CIO; he also did sysadmin and IT support for an office of about 60. I did sysadmin and support for about half a dozen offices, 200-odd staff, in two countries with a geographical spread of several hundred kilometres. The third guy was support and sysadmin for our US office, which had about 12 people (cushy job).

I popped out there once for a week or so to help train the guy on our new systems. They had an old backup system that copied diffs to disk which he had to take home every day, but he didn't like doing that, so they invested in a very expensive off-site system. I was having a dig around while I was over there and it appeared to me that the diffs hadn't been happening.

I confronted the guy and he said that he hadn't set it up yet and knew that would bite him in the ass eventually. They hadn't run any backups for six months, but he had been giving positive reports to both his manager and the rest of the IT team about the hard work he was putting in setting it up, how reliably it had been backing up, how we celebrated when the first full copy of data finally finished, etc. Needless to say, he lost his job.

We were very lucky that I caught it.


We were using an external vendor for our production database (I've changed this now) and I was poking around their management console one morning when I notice something (I was about two months into this gig). The CTO walks into the office at that point and I ask him: "Hey, I can see that the staging database is being backed up, but the prod database doesn't have any backup files. Where are the backups for prod?". His response: "Press that manual backup button immediately, please." Turns out that production had been running for nearly three years without backup actually running...


It's important to check not only that you can restore the database, but also that it contains the data you think it contains.

Finding out that your backup consists of a perfectly backed up empty database isn't much fun!


One of the first things I did at my current place was to add a Nagios check: "Is the most recent backup size within x bytes of the previous one?" It wasn't comprehensive, and was later replaced with full restores and checks, but it was fast to implement - and a few weeks later it caught a perfectly formed tar.gz of a completely empty database, due to a bug that would have kept creating empty archives until it was fixed.
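A sketch of that check, in the spirit of a Nagios plugin: compare the newest backup's size against the previous one. The function name, parameters, and threshold are made up for illustration.

```shell
#!/bin/sh
# Nagios-style sanity check: alert if the latest backup's size differs
# too much from the previous backup's size.
check_backup_size() {
    prev_size=$1    # yesterday's backup size, in bytes
    curr_size=$2    # today's backup size, in bytes
    max_delta=$3    # largest acceptable difference, in bytes

    delta=$((curr_size - prev_size))
    if [ "$delta" -lt 0 ]; then
        delta=$((-delta))
    fi

    if [ "$delta" -gt "$max_delta" ]; then
        echo "CRITICAL: backup size changed by $delta bytes"
        return 2    # Nagios exit status for CRITICAL
    fi
    echo "OK: backup size within $max_delta bytes of previous"
    return 0
}
```

A real plugin would stat the two newest files in the backup directory; this catches the empty-archive case because a tar.gz of an empty database is drastically smaller than a full one.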


This is a fantastic illustration of testing 101: often it's the dumbest possible checks that catch huge errors. Get those in place before overthinking.

E.g. a team adjacent to mine years ago had a dev who made a one-character typo in a commit that went to production. Which caused many $MM to incorrectly flow out the door post-haste. The bad transactions were fortunately reversible with some work, but I was floored that there were no automated tests gating these changes. It wasn't a subtle problem. The most basic, boring integration test of "run a set of X transactions, check the expected sum" would have prevented that failure.


All these threads are complete gold mines of years of hard won experience and system administration tips! Fascinating, especially the war stories I'm reading - both hilarious and horrifying in equal measure.


And as Gliffy is discovering to their detriment, exactly how long said restore takes...


Resonates. I didn't spend the extra $5 to get a USB 3.0 flash drive. Currently the read process for the image backup looks like it will take 4 hours to create the write media, and 2 to actually restore. Good lesson above about the restore being more important than the backup.

Wish I learned that.


This morning I backed up my iPhone with iTunes. Once finished, iTunes said the backup was successful. Then I formatted my iPhone and went to restore. It says the backup is corrupt.

Tried some closed source iTunes Backup fixer, didn't work.

Pretty heated right now, but oh well, what can I do? I guess I'll just have to start over. Thankfully I am an iCloud Photo Library subscriber.

So yes, can confirm, "It's not the backup, it's the restore."


If you've never restored the backup, you haven't backed-up.


My first job was doing software engineering consulting. My main client was a defense contractor. We wanted to at least use cvs or svn for source control, but you wouldn't believe the red tape associated with getting approval to use a free source control system... so we used date-named zip files for "source control". The rule was the last person out at the end of the day zipped up the shared drive contents and shut off the lights.

Though, we were required to keep good backups. 3 sets of tapes.. 1 always in the tape drive, 1 always in a fire-proof bomb-resistant bunker, and one sometimes in transit to or from the bunker.

Our manager was paying some obscene sum for this backup service, so I suggested we just hide one of the daily backup zip files and pretend we deleted it. The head of the group humored my request. It turned out that nobody was monitoring the tape or the backup job. The tape had filled up and nobody had swapped tapes and called the bunker courier.

Luckily, they at least used PVCS for configuration management of the releases. No source control for development, but every time we cut a release of the software, the head office sent over no fewer than 3 people to literally watch over our poor release guy's shoulder as he zipped up the source, built the binaries, checked both into PVCS, and burned two CDs of binaries and source.

Defense industry... things you're required to do get done in triplicate. Things you're not required to do, but cost no money and significantly reduce risk, take approvals from 5 different people 9 levels above you. I guess how often the backup tape needed to be checked for available space was insufficiently specified.

On a side note, at that client I also once sat quietly in a meeting for 30 minutes watching two grown men argue over whether my use of "will" in a design document needed to be changed to "shall". I didn't care, and right away said I was fine changing the word to "shall", but the second reviewer was adamantly opposed to unnecessary changes.


"Will" to "Shall" -- very important legal distinction. Thank the second reviewer. Your job in "software engineering consulting" was actually managing risk.


Except that, as mentioned, this was a software design document and not a legal document. I described intended behavior of a software component. There's no ambiguity in using "will" or "shall" since software does not (at least did not at that time) have intent and does not make promises.

I tried to avoid argument because it didn't matter, not because I thought my word choice was incorrect. It was an internal document describing intended software behavior, to help the poor soul who had to maintain that code.

"When 'shall' is used to describe a status, to describe future actions, or to seemingly impose an obligation on an inanimate object, it's being used incorrectly."[0]

[0]https://law.utexas.edu/faculty/wschiess/legalwriting/2005/05...


I have a similar issue.

Formerly critical tool (now, sole repository of necessary historical information) that runs only on an outdated stack and for "copy protection" has critical data in an undocumented, obfuscated DAT file. It also requires a parallel port dongle, which it checks for as part of every read operation. The vendor went bust a decade before I was ever hired.

I've automated a 'backup to zip file' each time the application terminates. It's saved me more than once - it's easy to clobber the data in the thing and the users have a tendency to screw it up when trying to navigate its cryptic keystroke-driven interface.

Trying to export all the legacy data into a new tool was met with incredible frustration the couple of times we tried. It all becomes irrelevant in late 2017, and there's been a new system in place since 2007, so this abomination only needs to live on for a little while longer.
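The 'backup to zip file on exit' trick above is simple to reproduce; here's a sketch using tar rather than zip for portability. The wrapper function, paths, and directory names are all hypothetical.

```shell
#!/bin/sh
# Wrapper: run the legacy application, then archive its data directory
# every time it exits - cheap insurance against users clobbering the
# data through a cryptic keystroke-driven UI.
run_and_backup() {
    "$@"    # launch the application and wait for it to exit
    stamp=$(date +%Y%m%d-%H%M%S)
    mkdir -p "$BACKUP_DIR"
    # One timestamped archive per session.
    tar -czf "$BACKUP_DIR/data-$stamp.tar.gz" \
        -C "$(dirname "$DATA_DIR")" "$(basename "$DATA_DIR")"
}
```

Launching via `run_and_backup /opt/legacyapp/run.sh` means every session ends with a fresh archive, however the session itself went.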


When appropriate, this is why I adore full-disk, bootable system backups. Plug a backup drive in, spot-check the contents (e.g. by data priority), run a filesystem compare tool, boot from it, etc. Backup is easy, as is restore and testing. A booted backup drive can restore to a replaced system drive.

That's obviously not the right solution for many IT-centric backup needs, but when full-disk backup became cheap and easy it set a new standard in how I think about backup and restore process everywhere.


This entire discussion has me wanting to backup (and test) everything that I can get my hands on.


World Backup Day — March 31st


World Restore Day April 1st


Postmortem on date field mismatch following backup/restore procedure of the World: Octocember 99st


What could possibly go wrong?


As long as it's not February 29th.


Had the same training from the head of operations at my first gig. If you don't restore each backup after it's pulled, to verify it's valid, then it's a risk.


This is made more serious in practice by the continual recurrence of bugs in Backup Exec over the last several versions, which would often manifest as "backups ran fine, verify ran fine, restores claim to run fine. But all your restored files are 0 bytes long".


How do you test your restore without destroying the only trustworthy copy of your data?


Do it in the Staging or User Acceptance Testing environment.


This is the right answer! You want a staging/acceptance/mirror environment that's the same as production, right? So make it with nightly restores of the production backups. You get a crisp, fresh staging environment, and regular validation of your backups too. Just remember to run full production monitoring against your staging environment too.
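A nightly job along those lines might look like this - a sketch assuming PostgreSQL custom-format dumps, where the backup directory, database, and table names are all hypothetical:

```shell
#!/bin/sh
# Rebuild staging every night from the newest production backup.
# The restore doubles as backup verification.

# Pick the most recent dump in a backup directory.
newest_backup() {
    ls -t "$1"/prod-*.dump 2>/dev/null | head -n 1
}

rebuild_staging() {
    dump=$(newest_backup /backups)
    [ -n "$dump" ] || { echo "CRITICAL: no backups found"; exit 2; }

    dropdb --if-exists staging
    createdb staging
    pg_restore --no-owner -d staging "$dump"

    # Fail loudly if the freshly restored database is missing the
    # data you expect - a perfectly restored empty database is the
    # failure mode described elsewhere in this thread.
    rows=$(psql -At -d staging -c "SELECT count(*) FROM orders")
    [ "$rows" -gt 0 ] || { echo "CRITICAL: restored staging is empty"; exit 2; }
}
```

Wiring the failure branch into monitoring means a bad backup pages someone the morning after, not the morning you need it.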


A friend of mine has his staging databases restored every morning (and anonymised) from the previous day's production backup.


Restore it elsewhere.


I got fired from a (sysadmin) job because a servo in a tape library failed while I was working 14h days on a conference that I was told was "my only priority".

We didn't need the backups at all, prod was fine, and I found out 2 business days after the servo failed, when the conference was over. Apparently 2 days of not having backups due to a mechanical failure out of my control was beyond unacceptable.


In the early 2000s I was a young guy and board observer at a .com that was growing extremely rapidly and had a 9-day complete outage.

The venture funds, including the one I was part of, came in screaming, "who's getting fired for this", to which the CEO responded: "are you kidding me? We just spent $1.5M educating this team and you want me to hand them over to our competition?"

Every investor and other board member got real quiet, realizing how correct the CEO was. Good CEOs recognize that mistakes happen. The best make sure they retain the people that have learned through those mistakes.


Early 2000s, 9 day complete outage.

Don't know if that was the same one, but happened to me at Rent.com. The story is that a change in a shell script meant that backups were not actually being sent properly to tape. That was OK, there was another online backup copy. But the restore process deleted that for 1 hour each day before it was recreated.

The production database died during that hour. We had to take the last good backup (several months earlier) and replay WAL logs to bring it up to date.
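That replay step, roughly, for PostgreSQL of that era: point the server at the WAL archive via recovery.conf, and on startup it replays every archived segment after the base backup's checkpoint, bringing a months-old backup forward to the last archived transaction. Paths here are hypothetical.

```shell
#!/bin/sh
# Sketch: prepare a restored base backup for WAL replay (pre-PG12
# style, using recovery.conf). PGDATA and WAL_ARCHIVE are assumptions.
prepare_recovery() {
    # restore_command tells the server how to fetch each archived WAL
    # segment (%f = file name, %p = destination path).
    cat > "$PGDATA/recovery.conf" <<EOF
restore_command = 'cp $WAL_ARCHIVE/%f %p'
EOF
}
```

With that file in place, starting the server kicks off recovery automatically; it stops when it runs out of WAL to replay.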

The sysadmin whose mistake it was offered her resignation, and was turned down by the head of tech because he knew she wouldn't have made it if she had a more reasonable load. The head of tech offered his resignation to the CEO and was turned down because the CEO knew that it was due to incorrect company priorities.

The next tech hire was a DBA whose sole task was to make sure that we had multiple levels of verified backups.

In less than a year we were sold to eBay at a nice price. Part of the reason was that they thought that the way that we handled failure said very positive things about the organization.


And this is how you run a business. It seems every person who worked with you had integrity and true leadership in spades!

That's rare. Too rare!


Oh, and one other memorable detail that I couldn't make up.

The database went down DURING the CEO's 50th birthday party!


i love happy endings :)


Perhaps one of the senior guys should have learned from past mistakes, yet they are still in this situation?


Ah, the good old "blameful postmortem." Sounds like a toxic culture...


Now that's terrifying. Glad it worked out for you.

My tifu is when I first started working for an actual client, I thought I was a genius running my own git server on my own computer. Also I had the ssh key for my server on the same computer.

When my hdd failed, I lost everything.

Now the funny thing was, while my server was still running I had no way to gain access to it; the customer had already started using it, and they had inserted around 400-500 records already.

So there I was, locked out of my own server, with no copy of the source code I had been working on for around 3-4 months.

Lucky for me, someone online had mentioned a way to gain access to an AWS server if you lost the SSH key.

I created a new instance, and mapped the old storage to the new instance. Lucky for me this worked.

Also, 2 weeks(ish) before this blunder I had sent a copy of my code to a friend who was helping me with some issues I had.

So my email saved me.

Now I back up regularly, use GitHub (recent projects on GitLab), and I have my SSH keys on 3 separate pen drives, plus this: http://www.oakalleyit.com/node/4


It's always nice to dodge a bullet, and doubly nice to know where, when and how that bullet was dodged.


Userify lets you instantly update/rotate your SSH key on all your boxen through its web console.

Be sure to use MFA, of course ;)


Awesome, a third-party with, effectively, administrative control over all my machines and data. What could go wrong?


The GP is already using AWS, and successfully restored the system through AWS snapshots. In this case, that third-party with administrative control saved the day.


About your link and printing QR codes. Why not just print the key in plain text?


Answered in the post:

> You now have to type your ​1,675 character key in by hand! Obviously this is not only tedious, but also very error-prone. And that's where the QR code comes in handy. With a QR code, you would only need to scan the page to bring your digital copy back to life. No errors, no tedium.


How do you scan that huge QR code?


In much the same way you might scan this billboard-sized QR code located at the top of a building. Namely, by moving backward. :p

https://upload.wikimedia.org/wikipedia/commons/thumb/3/32/Ja...


And what's the pass/fail rate?


The simple 'Barcode Scanner' app on Android handles it easily. Here's a screenshot of the output: http://i.imgur.com/ELXvBqu.png

Failing that, just use an online reader and upload the PNG image of the QR code. Your pass rate should be 100%.


That is a "dummy" key created for demonstration purposes, right?


Thanks for letting me know. I tried with Qr Droid and it did not recognize it.


I first spelunked through UNIX sometime in 1985/1986; my employer decided to sell Altos UNIX systems along with their CP/M offerings. No documentation, just a running system and a password, although one person hinted that 'man man' would help.

We used it for our accounting system along with Peachtree? It turned out we didn't have good backups because the system's disk drive was dirty and misaligned. I fixed it.

The backup scripts couldn't deal with going over one floppy and failed anyway.

I learned a lot of very useful things on that system. Like when I created an extra copy of /etc with an unprintable character in the name (etc^?), I got to learn all about using stty to set your erase character, ls -lq to find files with unprintable names, and inodes - along with ls -li and find -inum - to fix my errant /etc^?.
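The inode trick described above still works the same way today; here's a sketch (the function name is made up):

```shell
#!/bin/sh
# List files with inode numbers, then remove the unprintable-named
# culprit by inode instead of trying to type the untypeable name.
cleanup_by_inode() {
    dir=$1
    inum=$2
    # ls -lqi shows unprintable characters as '?' and prints inodes,
    # which is how you identify the offending entry in the first place.
    ls -lqi "$dir"
    # Delete by inode number via find.
    find "$dir" -inum "$inum" -exec rm -rf {} +
}
```

The same `find -inum` approach works for renaming instead of deleting, which is the safer first move when you're not sure what the entry is.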

Unfortunately for my accountant I didn't learn how to make a backup span multiple volumes until some time after I also learned a very hard lesson about unix device names and wiped the system drive while trying to mkfs on a new volume to expand the system.

He ended up rekeying all our accounting data from printouts and I ended up buying him more than one lunch.


"rekeying all our accounting data from printouts"

I am cringing.


If you think that's bad, ~12 years ago I worked on a project that was an online insurance form. The output from the application form was going to be printed out and then submitted to the existing paper-based workflow for handling applications - which involved scanning the application and then workers typing in the data from the scanned form.


Not more than 2.5 years ago I wrote a script to take the bill of materials from a CAD program and cross-reference it with a flat-file database (read: Excel spreadsheet) that contained information on how 'old' part numbers mapped to 'new' part numbers. Shame the CAD program couldn't just be scripted to do this. Anyway, the final output of this was a bill of materials that said what you'd expect: I need X quantity of Y parts, here are the part numbers for each.

This was emailed to a "warehouse" specialist, who then, I found out, printed out my BOM, then compared it to that day's (printed) inventory of parts, came up with what needed to be issued directly and what needed to be ordered, and then he manually entered that into the actual warehouse logistics system.

And lest you think I was lazy, I asked to be allowed to script all this out, too, only to be told "there is no way for us to submit anything to our logistics system that way, everything has to be entered manually."

This wasn't some no-name company, this was a major aerospace company.


I managed to get employee of the quarter because the company I worked for (a major company that made inkjet printers) had been tasked to find out more about third party ink cartridges from their service system.

This company had a Pick database, and rather than extend it normally someone had the bright idea of storing each ink cartridge record in a .INI file. That's right, each record was a .INI file and the fields were stored as column=value.

To get at this data, they tasked one of their employees to open each file, copy the first field to an Excel spreadsheet, then copy the second field to the Excel spreadsheet, etc. There were something like 10,000 ini files, and it took about 6 weeks for the guy to input the data.

I was rather young and bored on the call centre helpdesk, and I had been fooling about with Linux and Perl. When I heard what he was doing, I rather naively said that this was what Perl was designed for. The guy deliberately said he didn't believe me, and I took that as a challenge so that night I went home and whipped up a Perl program that processed the ini files into a CSV file. I then installed ActiveState's Windows port of Perl and ran it over a copy of the ini files. About 5 seconds later it produced the csv file.
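The original was Perl, but the same flattening can be sketched in shell and awk; the field names and file layout here are hypothetical:

```shell
#!/bin/sh
# Flatten a directory of key=value .INI record files into one CSV,
# one row per field, tagged with the source file name.
ini_to_csv() {
    dir=$1
    echo "file,key,value"
    for f in "$dir"/*.ini; do
        # Emit one CSV row per key=value line in each record file.
        awk -F= -v f="$(basename "$f")" \
            '/=/ { print f "," $1 "," $2 }' "$f"
    done
}
```

On 10,000 small files this runs in seconds - the same few-seconds-versus-six-weeks gap the story describes.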

I told the guy to not tell anyone as I wasn't really meant to install ActiveState's program on any corporate computers. He ignored this and the next thing I know I was made employee of the quarter.


I think the highlight of this story is that the company had the culture to reward you instead of burying your achievement under a policy-violation blanket.


It was a tiny bit more complicated than that I'm afraid. I suspect they did it to ensure that someone else didn't get the award...


There are few things more soul-killing for a programmer than being told "it's just more cost-effective to have someone type this into the other system".


Yes!

I'm not a programmer by trade. I just seem to be one of the few engineers in my tiny little specialized domain that doesn't mind writing a little bit of code now and then to solve a problem.


Won't be Airbus then, as their main CAD system has fairly powerful scripting bits. ;)


You can imagine how it felt at the time - 18 years old and I had just destroyed the accounting system for a sales-driven organization. Was convinced that was my last night at the company and had to stew in my own juices from that evening until telling my boss about it the following morning.


In my first job (at a pretty small company) I needed to debug something. I made a dump of the production database, downloaded it to my dev machine, then deleted the live production database instead of the one on my dev machine by accident (typed the command in the wrong shell).

This happened while the CEO was talking to someone external to the company, right beside my desk. Little adrenalin rush there.

Anyway, I had of course just made a backup, and only one order was placed while I was downloading it, which I could restore by hand using the email that was also sent for each order, so no damage done. But I'll never make that mistake again :-)


I think the "wrong shell incident" is quite common:

"Just need to load this dump to the testing database and I'm on my way to weekend...ok done...oh wait, if I look closer, this shell isn't the testing system...hell noooo, just nuked 8 hours of work each from 20 product managers."

Literally happened to me when I was a working student at a DAX 30 corporation. It has never happened to me since.


To prevent this I change the PS1 in the bash config. For servers I use a different prompt format.

For my local computer:

hollander@mypc 19:30:45 ~ >

For servers:

[hollander@server01 19:30:45 ~]

This all with coloring for name, servername, time and filepath, different for local computer and server. Root and username have different colors as well.


Locals are green, staging is orange, production is bright red.

Saved me a whole bunch of times when tired.


Can you paste in the shell codes you use?


Red

export PS1="\[\e[31m\]\u\[\e[m\]\[\e[31m\]@\[\e[m\]\[\e[31m\]\h\[\e[m\]\[\e[31m\]:\[\e[m\]\[\e[31m\]\W\[\e[m\] "

Orange

export PS1="\[\e[33m\]\u\[\e[m\]\[\e[33m\]@\[\e[m\]\[\e[33m\]\h\[\e[m\]\[\e[33m\]:\[\e[m\]\[\e[33m\]\W\[\e[m\] "

Green

export PS1="\[\e[32m\]\u\[\e[m\]\[\e[32m\]@\[\e[m\]\[\e[32m\]\h\[\e[m\]\[\e[32m\]:\[\e[m\]\[\e[32m\]\W\[\e[m\] "

As always you can tailor them if you prefer different things to be different colors, I'm often lazy and don't change local machines to green but always change staging to orange and production to red.


It's common enough that I made sure our AWS boxes put the name of the current environment both in the bash prompt and rails console prompts (production is in bright red, staging is a nice soothing purple :).


I actually make a terminal-window-size PNG of 'PRODUCTION' and 'TEST' and set it as the background image at like 10% alpha in the terminal windows, which also have different background colors.


I set the background colour in our team's shared iTerm config. Green for CI, blue for the test server, yellow for staging, and we don't have access to production!


The thing I like about our setup is that it's completely independent of individual developers' setups. Some of us use iterm, some of us are okay with terminal.app (iterm 2 ftw!). All of the backend capable folks (three of us) do have access to production, but we use it very responsibly (helped by the white on a red background production in the console prompt)


Oh boy, is it ever. I once had a Linux gateway without X installed on it, and I ssh'ed into a remote billing server. A friend called to see if I wanted to go out for lunch so I issued the shutdown -h now command.

I was trying to work out why the system wasn't responding right at the moment when I realised I hadn't logged out of the ssh session.


I had a developer who worked on my team at my last job make almost that exact mistake - I literally saw the blood drain from his face.

Fortunately we also had the DBAs in the same room - so HR database restored without anyone noticing. Poor chap - I think he got quite a scare.

Fun and games... :-)


We were a development team years ago (90s) working late on the production Oracle database writing some SQL (cringe). One junior dev deleted all records from one of the core tables and sat expressionless in total fright for a good 5 minutes. I let her sweat for a bit before I typed 'rollback;'. Good thing she didn't commit and there was enough rollback space :) One of many close calls.


I did the same thing about 8 years ago but had to ask a more senior person to restore.


To put these in perspective, it is not that uncommon for a year's worth of work to go to waste, just for different reasons (e.g. wrong assumptions, no market for the product, etc.). However, when you make mistakes, the process of encountering and fixing them is much more stressful in the moment, and most of us try to avoid them as much as we can. Yet it might be that trying and making mistakes is more productive in the long run than avoiding risks and taking a safe, calculated approach.


"wrong assumptions, no market for the product, etc"

Been there, done that!

However, to stick to the subject of databases - I found later in my career that the really alarming events are more to do with having data you shouldn't have rather than not having data you should.

NB: The only one of these that I'm going to admit to involved a demo application being sent out on CDs to hundreds of thousands of end-users by IBM, with sample data sourced from a particularly unpleasant alt.* newsgroup. I was CTO, and when an engineer approached me and rather sheepishly told me that some of the same data had gone to IBM, I had a very bad moment indeed.

However, on investigation it turned out the only thing that did go out on the CD was the single word "sheep".


At a startup earlier in my career, we generated an email address for every customer which they could use to interact with our system. The CEO wrote a script to pick words at random from the Princeton Wordnet database.

One afternoon I thought, wait a minute, what's actually in Wordnet?

Needless to say, many apology emails were sent later that day.


In recent years, I created a homework assignment for a security class where the student's job is to crack a set of password files. First easy, un-salted hashes, then salted hashes, ... etc.

To illustrate the badness of using dictionary words as a password, I randomly generated a unique password file for each user by sampling from /usr/share/dict/words.

> One afternoon I thought, wait a minute, what's actually in Wordnet?

I know exactly how you feel. It took me about 5 minutes to find Shutterstock's list of dirty, naughty, obscene, and otherwise bad words [1], and about 20 more minutes to add a blacklist check to my script.

Fortunately none of the students had actually been given a bad word. Whew!

[1] https://github.com/shutterstock/List-of-Dirty-Naughty-Obscen...
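That blacklist check can be sketched as a simple grep filter over the dictionary; the file paths and function name here are hypothetical:

```shell
#!/bin/sh
# Keep only dictionary words that don't appear (case-insensitively,
# as whole lines) in a bad-words file such as the Shutterstock list
# linked above.
filter_clean_words() {
    dict=$1
    blacklist=$2
    # -F fixed strings, -x whole-line match, -i ignore case, -v invert
    grep -F -x -i -v -f "$blacklist" "$dict"
}
```

Sampling passwords from the filtered output instead of the raw dictionary closes the "students get assigned an obscenity" hole before it opens.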


Actual two-word passphrase from an AOL disc once in my possession: 'cloaca market'.


> the really alarming events are more to do with having data you shouldn't have rather than not having data you should

the really really alarming events are more to do with your users having data they shouldn't!


Making mistakes is acceptable - catastrophic failure is a different story.


Reminds me of an incident at a very large UK company where the outsourced operator managed to destroy all of the payroll tapes.

The first run fails; the second run loads a backup tape into a malfunctioning tape unit, which then trashes the tape - and they kept doing this for all of the backups.

This was the second month of the new outsourced payroll system, replacing an in-house one which had run for decades without incident.


> Also had someone from Oracle break a financial consolidation system for a billion dollar company - his last words were "you need to restore from tape" and then he disappeared.

Tech: "Err, boss, I just totally broke their entire system"

Manager: "That's no problem, you got them to sign that contract, right? We always pad those with more services than they have employees or customers to use, so we can use that to cover any problems."

Tech: "..."

Manager: "Johnson, I'm not hearing the confirmation I expected."

Tech: "Sorry boss, there was an issue they were keen to have me look at and... well..."

Manager: "Johnson, you get your ass back here now. Do not pass go. Do not collect $200. The phone will not accurately convey the ass-chewing you are about to receive, and you don't deserve the paltry protection it would allow." <click>

Tech looks up to see the local tech liaison.

Tech: "Ehm. You need to restore from backup."


Oracle is extremely good at coming in, completely hosing everything, then disappearing and sending you a $350/hr invoice for the privilege. Their consultants seem to tread the line between totally incompetent and actively malevolent surprisingly well.


Do you have anything to back up your opinion, or did you just have one bad experience with an Oracle consultant and are now generalizing to tens or hundreds of thousands of cases that you have absolutely no knowledge about?

For my part, I have only had very positive experiences with Oracle consultants, but this is only my opinion based on just a few interactions.


I worked for a Fortune 500 healthcare company that used both Oracle and IBM consultants extensively. I had exactly two positive experiences with Oracle consultants where they delivered as promised and maybe four or five where they delivered but took longer/went over budget/etc to the point where I feel we lost value in the long run. And maybe two dozen experiences that were as I described initially.

So in my experience somewhere between 5-8% will be worth it.


At my last job, we used to have an Oracle-acquired storage product in a rather critical part of our stack. It was having some performance problems one night, during a peak traffic period. It may even have been Cyber Monday; I don't remember exactly.

Our then systems team lead called Oracle support, and was advised by the support rep — a third tier rep, mind you, because we had priority support, and had the call escalated when the first guy couldn't do anything — to run an on-board low-level diagnostic. He confirmed with the phone rep three separate times that running such a check would be safe to do with live, mounted filesystems, and would not adversely impact operations. He was answered in the affirmative each time.

He hit enter. The site went down. $.75mm in lost revenue over the next 8-10 hours, unfucking that pretty little mess.


When in doubt, refuse and escalate. They're your systems! Especially with Oracle support, who have a particularly bad reputation.

Honestly, Oracle's reputation is mud. Ever since I saw the Oracle CSO's presentation on security, which used biblical parables as principles for securing environments, I've been even more wary of accepting any advice at face value from any Oracle employee...



Oh man I had a project similar to this recently.

My team got hired to document this fairly large database (800+ tables, 2000+ SPs). We decided to use a product that actually writes the documentation to the DB as extended properties. The client specifically did not want us to use dev for this, so they put us in test. This immediately raised some warning bells in my mind, so I let them know the risks of migrating from dev to test without notifying me (if I'm notified, I can script out the extended properties and reapply them to test after the migration is complete).

We're about 2 months in and 90% complete. One morning, ops decides to migrate to test and doesn't let me know. One of my team members logs into the system and tells me that everything has disappeared. At this point, I'm sweating bullets and I'm pretty sure that I'm going to be in some tough conference calls the next couple of days. It's a fixed-fee project with a client that we have a really good relationship with and I don't want to be the one that destroys it. After about 15 minutes of pondering my demise and hoping that they back up the test environment, I get notified by the ops team that they take backups weekly and test them monthly for all environments. Luckily for me, this was a Monday morning so we lost maybe a few hours of work.

That being said, that was a very stressful 15 minutes and taught me how important it is to:

1) Back up every environment - storage is cheap and labor often isn't. Having to recreate work would have been far more expensive than having the systems in place to save the potentially lost work. Additionally, you never know what data might be lost during testing, or what work done in test you'll really wish you could retrieve.

2) Test your DR processes - their recovery was smooth and well executed and we had reasonable assurance it would work.


Yes I think everyone has had one of those "Wanna get away?" moments. Hopefully they're very early in your career and only ever happen once. Mine was small and I was very Jr so nothing more than my pride got hurt.

The worst is when someone else notices it before you. We came in to work on Monday and someone reported that they couldn't log into the system. We checked logs and noticed there were timeout errors on some SQL queries. Huh, that's odd - why is there a query "update Users set deletedat=now()" with no "where" clause running in production??? Turns out someone was in a hurry and tried to make a test pass without thinking too much about what the code was doing. Then - push right to prod on a Friday evening!


At my first professional full-time gig as a software engineer I had to find a way of dumping data from barcode scanners, after some software written in-house was supposed to download all data, flush it to disk, then empty the scanners for re-use. Problem was that the flushing to disk part didn't work (but said it did) so all data was lost. This was for scanning attendees at conferences, so all data on attendees were seemingly lost.

It wasn't me losing the data, and I don't blame the engineer that did, but management sure didn't mind doing that. I took the heat for it by standing up to them and saying "there's a bad news/good news situation here" and refocusing their attention. Bad news was obviously the data looked lost, the good news was it's probably not overwritten so I suggested we write a very small piece of firmware that just does a memory dump – the scanners didn't have separate slots for storage, firmware etc, it was all in one place – and hope the memory is ordered such that the firmware comes first, followed by all the attendee data. It worked – the firmware was indeed written first and dumping it all revealed all attendee data in raw form. Thankfully it was more or less all ascii so it wasn't difficult to process the dump after it'd been downloaded. We had to do this with 20+ scanners since the mistake wasn't discovered right away.

I believe management later tried to resell this as a "data recovery solution" to other firms using these types of scanners – don't think they had much success though.

The engineer who lost the data at first mostly got off the hook with some blame game and a warning. It wasn't a very good working environment, and prior to losing the data the engineer had been pushing serious overtime hours without so much as a thank you because management had promised delivery without involving engineering in the discussions. A few months later that engineer and several others, including myself, quit for other jobs.


Seen this from a different angle: watched a rookie admin delete the only copy of the soon-to-be-production database, which was the result of a year of data-mangling/massaging before it could be put online for production .. saw him do the delete, nothing I could do about it in time.

Spent a week recovering the data, a block at a time .. learned enough about the database filesystem to write an 'un-delete' tool, sold it after using it to successfully recover the original project, and have never looked back ..


You'd like the system administrator at a medical research facility where I once worked. Every Monday around noon, he deleted a random file from a randomly chosen production system, then restored that file from backup.

I've always admired that, but never emulated it...
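In spirit, such a drill is just a few lines of shell. Here is a self-contained sketch using throwaway directories and checksums; the paths and layout are invented for the demo:

```shell
#!/bin/sh
# Restore-drill sketch: delete a random "production" file, restore it
# from a backup copy, and verify the contents match. Everything lives
# in a temp directory so the demo is harmless.
set -eu

work=$(mktemp -d)
mkdir -p "$work/prod" "$work/backup"
for f in a b c; do
    echo "data-$f" > "$work/prod/$f.txt"
    cp "$work/prod/$f.txt" "$work/backup/$f.txt"
done

# Pick one production file at random and record its checksum.
victim=$(ls "$work/prod" | shuf -n 1)
before=$(cksum < "$work/prod/$victim")

# The "disaster", then the restore from backup.
rm "$work/prod/$victim"
cp "$work/backup/$victim" "$work/prod/$victim"

after=$(cksum < "$work/prod/$victim")
[ "$before" = "$after" ] && echo "restore drill OK: $victim"
```

The point of doing it weekly, as the admin above did, is that the drill fails loudly the moment backups silently stop working.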


Heh Heh Heh

On one contract about a decade ago, the manager used to proactively screw around with server hardware (eg pull a hdd out of a raid set) when bored, to ensure our recovery processes were always battle tested.

Didn't generally cause too many issues... until the one time he pulled a working hdd out of a raid5 set that already had a (real) failed drive, which we were already attempting to recover. It was a production database, but thankfully not a mission critical one :>.

From memory, he stopped testing things like that for a while afterwards. ;)



Yeah, the similarity hasn't escaped me either. ;)


You do not have backups until you have tested the restore. Typically, I restore the production backups to the test environment. This way test gets up-to-date data and the backup restores are tested.

I have so many stories of data loss from many years of working that I do nothing without backups. From fat-finger deletes to RAID card batteries failing to multiple disks in the RAID array failing at the same time. Data will be lost at some point, with the only recovery method being a backup.
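With plain files standing in for the database, the restore-is-the-real-test loop looks something like this (a toy sketch, not a real DB workflow):

```shell
#!/bin/sh
# Toy version of "restore prod backups into the test environment":
# back up prod, restore into test, and diff the two. Directories of
# plain files stand in for the database here.
set -eu

work=$(mktemp -d)
mkdir -p "$work/prod"
echo "users table"  > "$work/prod/users"
echo "orders table" > "$work/prod/orders"

# The nightly backup of prod...
tar -C "$work/prod" -czf "$work/prod-backup.tar.gz" .

# ...restored into test, which doubles as a restore test.
mkdir -p "$work/test"
tar -C "$work/test" -xzf "$work/prod-backup.tar.gz"

diff -r "$work/prod" "$work/test" && echo "backup verified by restore"
```

The nice side effect is that the restore gets timed for free every time test is refreshed, so you always know your actual recovery window.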


Yep. I interned at my high school over the summer back in the late 90's and (aside from ghosting PCs or helping with Windows login stuff) also managed the tapes for our main server. The underling in the IT shop, one afternoon, was showing me something and, to prove a point about RAID, pulled one of the drives from the server's RAID array, then plugged it in again. It was magic!

Fast forward a month. I do the same thing when talking to someone. The next day, the server is offline. I either pulled the wrong drive, the array was degraded, or some actual failure occurred overnight. I'll never know.

I did, however, learn that day about how Veritas backups work, and to always know wtf I'm doing before showing off. :-)


typed

    del c:\*.*
instead of what I meant

    dir c:\*.*
On my boss's PC, right in front of him. Fortunately someone had a copy of Norton tools back then (~1987)


Same deal here, we'd been backing up the servers to tape and dutifully boxing and taking one set offsite. Finally someone deleted something by accident and that's when I discovered you should verify that you can read the tape once in a while. Fortunately, not an important document, but that was just luck.


This is what happens when you don't have a disaster recovery plan, or if you have one but never test it out. You need to test your disaster recovery plans to actually know if things work. Database backups are notoriously unreliable, especially ones that are as large as the one this post is talking about. Had they known it would take 2-3 days to recover from a disaster I'm sure they would have done something to mitigate this. This falls squarely on the shoulders of the VP of Engineering and frankly it's unacceptable.

I worked at a company that was like this. My first question when I joined was, "do we have a disaster recovery plan?" The VP of engineering did some hand waving, saying that it would take about 8 hrs to restore and transfer the data. But he also never tested it. Thankfully we never had a database problem but had we encountered one we would have lost huge customers and probably would have failed as a business.

I also worked at a company that specializes in disaster recovery, but our global email went down after a power outage. The entire company was down for 1 day. There were diesel generators but they never tested them and when the power outage occurred they didn't kick in.

Case in point: Test your damn disaster recovery plans!!!


Speaking from firsthand experience, business doesn't care. The managers know that a) no matter what happens it's likely that their jobs aren't on the line, b) they already have stockholder money and so can just hand-wave any problems away as "once-off and we've fired the staff responsible".

So what I get to see are DR plans that are obviously faulty, where they cannot be tested for something as simple as we don't have 20TB of extra disk handy to do a single failover.

"That's okay", the boss will say, "as long as we have it on paper."

Okay dude. As long as I have your comment in an email to protect myself. I'm okay with being fired for something I warned everyone about, as long as I can also show that to my next boss, to prove my common sense^H^H^H^H^H^H^H^H^H^Hexpert advice gets overridden.


I think the safe default for software and key operational processes has to be that if it hasn't been tested then it doesn't work.


I was testing disaster recovery for the database cluster I was managing. Spun up new instances on AWS, pulled down production data, created various disasters, tested recovery.

Surprisingly it all seemed to work well. These disaster recovery steps weren't heavily tested before. Brilliant! I went to shut down the AWS instances. Kill DB group. Wait. Wait... The DB group? Wasn't it DB-test group...

I'd just killed all the production databases. And the streaming replicas. And... everything... All at the busiest time of day for our site.

Panic arose in my chest. Eyes glazed over. It's one thing to test disaster recovery when it doesn't matter, but when it suddenly does matter... I turned to the disaster recovery code I'd just been testing. I was reasonably sure it all worked... Reasonably...

Less than five minutes later, I'd spun up a brand new database cluster. The only loss was a minute or two of user transactions, which for our site wasn't too problematic.

My friends joked later that at least we now knew for sure that disaster recovery worked in production...

Lesson: When testing disaster recovery, ensure you're not actually creating a disaster in production.

(repeating my old story from https://news.ycombinator.com/item?id=7147108)
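One cheap guard against exactly this slip is to make destructive scripts refuse anything not explicitly named as a test resource. A sketch, assuming a "-test" naming convention (the convention and function name are inventions for the example):

```shell
#!/bin/sh
# Guard for destructive scripts: refuse to act on anything whose name
# doesn't mark it as a test resource. The "-test" suffix convention is
# an assumption for illustration.
kill_group() {
    case "$1" in
        *-test) echo "terminating $1" ;;
        *)      echo "refusing: $1 does not look like a test resource" >&2
                return 1 ;;
    esac
}

kill_group db-test               # allowed
kill_group db || echo "blocked"  # the production slip is caught
```

It's not a substitute for separate credentials per environment, but it turns a muscle-memory mistake into an error message instead of an outage.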


Treating app servers as cattle, i.e. if there's a problem just shoot & replace it, is easy nowadays if you're running any kind of blue/green automated deployment best practices. But DBs remain problematic and pet-like in that you may find yourself nursing them back to health. Even if you're using a managed DB service, do you know exactly what to do and how long it will take to restore when there's corruption or data loss? Having managed RDS replication for example doesn't help a bit when it happily replicates your latest app version starting to delete a bunch of data in prod.

Some policies I've personally adopted, having worked with sensitive data at past jobs:

- If the dev team needs to investigate an issue in the prod data, do it on a staging DB instance that is restored from the latest backup. You gain several advantages: Confidence your backups work (otherwise you only have what's called a Schrödinger's-Backup in the biz), confidence you can quickly rebuild the basic server itself (try not to have pets, remember), and an incentive to the dev team to make restores go faster! Simply knowing how long it will take already puts you ahead of most teams unfortunately.

- Have you considered the data security of your backup artifacts as well? If your data is valuable, consider storing it with something like https://www.tarsnap.com (highly recommended!)

- In the case of a total data loss, is your data retention policy sufficient? If you have some standard setup of 30 days' worth of daily backups, are you sure losing a day's worth of data isn't going to be catastrophic for your business? Personally I deploy a great little tool called Tarsnapper (can you tell I like Tarsnap?) that implements an automatic 1H-1D-30D-360D backup rotation policy for me. This way I have hourly backups for the most valuable last 24 hours, 30 days of daily backups and monthly backups for a year to easily compare month-to-month data.
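Tarsnapper does the real work here; a bare-bones "keep the newest N" prune, assuming backup names sort chronologically (a made-up naming scheme for the demo), might look like:

```shell
#!/bin/sh
# Minimal rotation sketch: keep the KEEP newest backups, prune the rest.
# Assumes names sort chronologically, e.g. db-2016-04-01.tar.gz.
set -eu
KEEP=3

work=$(mktemp -d)
for d in 01 02 03 04 05; do
    touch "$work/db-2016-04-$d.tar.gz"
done

# "head -n -3" (GNU) prints all but the last 3 lines: the old backups.
ls "$work" | sort | head -n -"$KEEP" | while read -r old; do
    echo "pruning $old"
    rm -- "$work/$old"
done

ls "$work"   # only the three newest remain
```

A real policy like 1H-1D-30D-360D needs tiered buckets rather than a single cutoff, which is exactly why a dedicated tool is worth it.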

Shameless plug: If you're looking to draw some AWS diagrams while Gliffy is down, check out https://cloudcraft.co, a free diagram tool I made. Backed up hourly with Tarsnap ;)


Can't upvote CloudCraft high enough. I used it for a toy project I was asked to spec out and it is SO much better than Gliffy, ESPECIALLY on an iPad.

You've done good, kid.


I've found Tarsnap to be slow at restoring in the past. My recollection is a few hours for a ~1GB maildir. I was using it for my personal things, but I would (as with anything) test restore times if I were using it for serious stuff.


The amount of de-duplication performed by Tarsnap, and the amount of files, which for a maildir I imagine is a lot of tiny files, probably negatively impacts it. Dealing with a single DB dump file the performance is fine so far at least. I can imagine one could also try to partition the data into multiple independent dumps that can run in parallel during the restore if speed became a concern.


You can also ZFS send an entire filesystem snapshot, very efficiently, to rsync.net:

arstechnica.com/information-technology/2015/12/rsync-net-zfs-replication-to-the-cloud-is-finally-here-and-its-fast/

http://www.rsync.net/products/zfsintro.html


What is the benefit of rsync.net's $0.20/GB pricing over any other cloud storage solution that costs $0.01-$0.03/GB?


I've been a customer of theirs for a long while and will note that their customer service is amazing. They helped implement what I needed and offer support anytime I need.

Now that S3 has matured and prices have continued to drop, though, I am going to be moving to trim costs. I actually kicked off backups to S3 earlier this month and am backing up to both S3 and rsync.net at the moment, with the plan of ending rsync once I've tested restores and made it through a billing cycle at Amazon.


Wow. To stress the amazing level of customer service, someone there just ran across my comment and reached out - noting that my pricing was set for an older structure, updating me to a far more competitive rate and offering a retroactive credit.

While Amazon has offered some great service, it's never been as good as that.

They really do stand by and provide a superior level of support and assistance if you need it on the technical side as well.

I highly recommend them.


The ZFS send/recv accounts are 6 cents/GB with no traffic/bandwidth/usage costs.

I think that compares favorably with S3, etc., given ZFS support and concierge level technical support.

And, of course, the price drops if you cross the 10TB mark.


They sound well intentioned but don't work in some backwards Enterprise companies.

1) That'd be great. Except that management refuses to get into the 21st century; everything is virtualised, but no, you can't get a sandbox - that's a 3-month requisition that needs a business case and approvals all the way up the line - even though we have unlimited licenses for OS and databases.

So no you can't have sandboxes that work that way.

Also we know servers can't be restored piecemeal like that. Why?

Well I don't know what wonderful world you're living in, though I would like to live there, but our management is 100% focused on REDUCING NUMBERS. What's our server count? 3000? They want that count reduced to 5.

I'm not joking. That's a meeting with senior management and a set KPI.

We actually haven't managed to reduce server count because they also keep authorising so many new servers for "special projects" of their own, but we have consolidated servers that run 30-40-50 different applications now...

Except that patching and rebooting them is a nightmare as you cannot get 30-40-50 product managers to agree on downtime to do so. You can't restore it piecemeal for testing or anything like that either. And... well we know it's not backed up... I mean the databases are but nothing else is (because Infrastructure agrees with us that nothing should be on a database server except the databases)... and that's not my problem...

2) I consider it. And then I consider the fucking joke that is the rest of the business, and that the second we try to introduce some kind of key into the situation, it's going to be lost, and then the data will be lost. Lost inaccessible data is a far more serious violation than insecure accessible data - that's a fact. One will get you fired immediately, the other will be understood.

It doesn't help that companies often don't have easily accessible PKI; not in any way we can automate and use and trust and know and be trained on and rely on, in the database space. The way I see Enterprise working, it would be put behind a firewall, and you'd be requesting a key using a filled-out paper document and waiting a few weeks of authorisations to get it. Now how the fuck are you going to roll that into your automated backup strategy across a couple hundred servers and rotate keys every few quarters?

3) Hahaha. Okay for mom and pop stores, sure. But you have to realise that Enterprise carves out every fucking piece of the pie for a different person. This team looks after databases. This team looks after applications. This team looks after the underlying infrastructure. This team looks after storage. This team looks after DR. This team looks after LONG TERM BACKUPS.

And then the long term backups team does whatever the fuck they want, has zero accountability, and literally nobody in management cares or wants to touch it because either a contract is in place or "they like that manager" or "that manager is on the same level as me so I can't do anything", and the manager above is their friend who got them in ;-)

And then, sometimes, sometimes, it's not even their fault. They get some order from some miscellaneous manager at the very top to "start keeping every single backup". But they can't because disk space is finite. And suddenly the entire organisation starts being crippled as disks fill and your normal day to day backups start failing, and then your operational systems go offline! But still - you have to keep every single backup - and so they SECRETLY start deleting the older backups because there is literally no choice, you can't have the business running AND keep those old backups, and because it's a secret they can't tell ANYONE and so those backups are GONE.

(And no, we can't circumvent that process and do it ourselves, because we don't have a spare petabyte of storage, and we aren't the storage team, so we can't just buy it or get it allocated, and management would squash that as inefficient duplication of effort if we tried).

Man I'm really ranting tonight. You all have no idea how bad it is.


What we see here is (or should be) Darwinism in action. Companies that become balkanized, politicized and resistant to change are inefficient, prone to catastrophes and easily disrupted by (hi HN!) the startup crowd.

In the past I've had the misfortune to work for some lumbering corporates with all these pathologies and more. You tolerate the perpetual car crash for the money, but however good you are you can't change them, and instead run yourself ragged trying to bring order to the chaos. Even if it can be fixed (and often I wonder if organisations can get too big to fix), it's the responsibility of the management and way above your pay-grade.

If you can diagnose all these problems you're clearly a sound engineer. You can do so much better than losing your hair in some self-destructive megacorp that disempowers you from doing good work. Life is too short and IT staff are in demand: they don't deserve you so get out while you can.

Chin up, and good luck.


I wonder if that kind of backup retention (or any backups at all) is even legal. Under EU law (and even US law in specific situations), user data must be deleted upon request. Unless your live production systems can go in and delete things from your months-old backups (yikes!) this kind of scheme would seem to be a crime.


Good advice, but things get a lot more complicated with HIPAA-protected data. Alas we can't simply move our prod data to any place less secure than prod.


Been there, done that :)

I was once on call, working for one of the leading organizations. I got a call in the middle of the night that some critical job had failed and, due to the significant data load, it was imperative to restart the processing.

I log in to the system with a privileged account and restart the job with new parameters. Since I didn't want to see the ugly logs, I meant to redirect the output to /dev/null.

I run the following command:

    ./jobname 1>./db-file-name

and there is -THE DISASTER-

For some reason this kept popping in my head - "Bad things happen to Good people"

We recovered the data but there was some data loss still as the mirror backup had not run.

Of course, we have come a long way since then. Now there is constant sync between Prod/DR and a multitude of offline backups, and recovery is possible for the last 7 days, the month, or any month during the year and the year before.
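The painful mechanics of that slip: the shell truncates the redirect target before the command even runs, so the damage is done the instant you hit enter. A quick demo:

```shell
#!/bin/sh
# Why "1>./db-file-name" is instantly fatal: the shell truncates the
# target file before the command runs at all.
set -eu
work=$(mktemp -d)
echo "precious data" > "$work/db-file"
wc -c < "$work/db-file"        # 14 bytes

true 1>"$work/db-file"         # oops -- meant 1>/dev/null
wc -c < "$work/db-file"        # 0 bytes: truncated on redirect
```

In bash, zsh, and POSIX sh, `set -o noclobber` (or `set -C`) makes `>` refuse to overwrite an existing file; you then write `>|` when you really mean it.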


I was doing a favor for a friend and, on a referral, talked to a guy who didn't have anything but weekly backups and had a corrupt database due to some drive failures.

I was able to determine that the corrupt data was repairable if we had a copy of the old db, and since it was a tiny machine system I asked "Would you mind restoring the backup side by side with production and I can do what I need?"

"Sure thing!"

I wait for a minute, and then my connection to the production database dies.

I refresh the client, and now the one database available is restoring from a backup...

I called him and asked if he meant to overwrite his production copy with his backup instead of doing it side by side, and he says petulantly, "I didn't do that!"

I ask him to check again, and he responds with "I will call you right back!"

Five minutes later I get the call, "How do I roll back my restore partially through the restore process?"

Oops.


We've all been there. Shit happens. That's what backup is for.

OT: It's probably bad form to publicly blame someone for it, even if it was done by him. Suffice it to say: we screwed up, but we're on our way to recovery. It's better to follow the practice of praising in public and discussing problems in private.


I worked on a team that had a list of "breakfastable offences" -- violating these rules meant you had to bring in breakfast for the whole team (donuts, bagels, whatever). One of them was "throwing someone under the bus." In conversations with anyone outside the team, you weren't allowed to single out a person as responsible for any particular bug/error/etc.

Granted, this is pretty vague (depending on how many "administrators" the company has), but it's still too specific for me.


I like the idea of "breakfastable offenses." I'm curious if you had remote teammates. If so, was there a workaround?

It reminds me of my SCUBA training. The head instructor had a list of offenses. If you committed an offense (e.g. leaving goggles on forehead after surfacing), he'd say, "6-pack," obligating you to bring a 6-pack of beer. ;-)


We had no remote teammates. Not sure how we would have handled that :-)

Some of the other offenses:

* Breaking the build (unit tests only) and then leaving for the day without fixing it or reverting your change

* Walking away without logging out of your machine (very security-conscious business)

* Not completing an assigned code review within a week of being assigned (within reason -- if the code review was enormous, or you were overly busy with an enormous project of your own, don't worry about it)


If I see an open machine I write an email from them to their manager saying "remind me to lock my machine before walking away".


Probably unwise. "What else did he/she do?" types of questions come to mind. Probably depends heavily on where you're working, though.


A little paranoia can be a good thing.


At a previous company, we had a culture of sending out prank emails from open machines.

I once sent out an email from a co-founder's account which said that he was fed up with the crappy codebase and was hiring a new team to rewrite it from scratch and that the other (non-tech) co-founder was to take over the existing tech team. No one took it seriously (the non-tech co-founder helped me draft the email; the other one laughed while reading it after the fact), but there was a board member in the mailing group who thought it was serious and started sending panicky emails to the co-founders.


That was how we generally accomplished that enforcement -- see an unlocked machine, write an email to the team list saying "I'm bringing breakfast tomorrow!"


Sorry that seems cancerous to me. It breeds entire departments who are unaccountable for the work they do because nobody is safe to speak out against it.

If someone fucks up you should be able to point it out - and they should be immediately forgiven. It's only when they show that they repeatedly take no care in their work and cause the same problems over and over that they should be fired - and those people should be fired instead of being the anchors around the necks of everyone else dragging us down to drown at the bottom of oceans of day to day misery.


The team as a whole was still accountable. The tech lead/PM were still accountable. Internally, members were accountable.

Certainly, if there was a specific problem, it could be raised to management. But generally, if we were in a weekly customer-attended meeting, or dealing with a bug discovered after a production release, no individual could be singled out as responsible for a particular blunder.


On a public relations note, though: I think a case could be made that it was important to give some specifics about who is to blame. Consider the alternative:

"We discovered the production database had been deleted but we are now working diligently to restore it"

How are people -- both non-technical and the HN crowd -- not supposed to suspect that this is a result of an external malicious hack?


The organization can take responsibility for the issue. "During a system update, we mistakenly deleted a production database. We are restoring it and shoring up our disaster recovery plan."

That's very different from "During a system update, Dave mistakenly deleted a production database." In an organization with 5 or 10 people, "During a system update, our administrator mistakenly deleted a production database," is still identifying.

Like I said, I'm not sure it's an issue in this particular case. I don't personally know anything about the site in question.


"mistakenly deleted" there fixed it and avoided pointing a single person.


How else can they explain this outage?

I guess it could be "we accidentally deleted the production database." But at that point they would just be euphemizing - clearly someone pulled the trigger. If they were naming the person, that would be pretty terrible on their part. But they're not. It seems perfectly fine to my eyes.


They could have initially said that we are encountering a database issue requiring a database restore.

Once the system has been fully restored they can provide more details as to the "database issue"


"faulty SQL queries caused us to have to restore the database"


The person might be under the gun from someone else to fix the prior problem, and made a mistake under pressure. Singling out the admin just absolves management from any blame. And it shows the lack of leadership and the lack of willingness to take on responsibility for the organization.

And the problem is clearly an organizational problem. There's no clear backup and restore procedure. It's probably never been tested for restore. There's no failover. There's no disaster recovery. Even if it's there, it has not been fire-drilled periodically. There's no clear access procedure protecting the production servers. There's no prior spelling-out of definite steps to address production problems before doing them. There's no rollback procedure. There's no review. There's no approval process.


That is a good explanation. I see why blaming the admin is problematic. Thanks.


As I mentioned above, it's not so much the administrator's fault as it is the VP of engineering's. There was clearly no disaster recovery plan, which is unacceptable.


This is how I learned about xargs ...

I once typed, on a client's production mail and web server that basically ran the whole business for about 50 staff, as root, from the root directory:

chmod -R 644 /dirname/ *

I seem to recall the reason was that tab completion put a space at the end of the dirname, and I was expecting there to be multiple files with that name ... anyway the upshot was that everything broke and some guy had to spend ages making it right because they didn't have the non-data parts of the file system backed up.

I learned that whenever you do anything you should:

find . -name "*.whatever you want" | more

then make sure you're looking at expected output, then hit the up arrow and pipe it into xargs to do the actual operation.


When root or prod, the paranoid (like me) always start every mutating command with a '#' to prevent a sneeze from prematurely sending the command and doing damage.


I'm a fan of echo, especially for loops:

  for i in some/files/*; do echo mkdir -p ${i%%.*}; echo mv "${i}" "${i%%.*}/${i}"; done
That way I can see what it's gonna evaluate to first.


[ $[ $RANDOM % 6 ] == 0 ] && rm -rf / || echo "You live"

For a good measure at the end of the day


Oh, the good ol' shell russian roulete.


I had a friend that was a system administrator. One of his favorite "puzzle questions" was: what just happened when you see this line in your terminal?

    .o: file not found
(This was back when most people wrote C, compiled to .o files, then linked to a final executable.) The answer was you had typed:

    rm * .o
While meaning to type:

    rm *.o
Oops.


While you're there, it's worth spending a couple of minutes reading about the '-print0' option to find and the correspondingly useful '-0' option to xargs. This might prevent another catastrophe due to spaces in filenames.
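A quick self-contained demo of the difference a NUL delimiter makes when a filename contains a space (the filenames are made up for the example):

```shell
#!/bin/sh
# Why -print0/-0 matters: NUL delimiters survive spaces in filenames.
set -eu
work=$(mktemp -d)
touch "$work/important file" "$work/old.log" "$work/older.log"

# Plain "| xargs rm" would split "important file" into two bogus
# arguments. NUL-delimited find/xargs only removes what you asked for:
find "$work" -name '*.log' -print0 | xargs -0 rm --

ls "$work"   # "important file" survives
```

With whitespace-delimited xargs, a stray file named "important file" becomes two arguments and either errors out or, worse, matches something else entirely.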


And the "--" option to most utilties that make everything follow the "--" parse NOT as an option.

e.g. "rm -- -r -f blubb" will delete the three files "-r", "-f" and "blubb".


I prefer using the `-d\\n` to `-print0`. It makes transitioning from `| less` to `| xargs` easier. It handles everything except embedded newlines in filenames, which is good enough for me.


To add to this, zsh (or maybe just oh-my-zsh) has tab complete for globbing, so if you type

  echo foo/bar/*.py
and hit tab, it will automatically expand into

  echo foo/bar/quux.py foo/bar/quux2.py foo/bar/quux3.py
(assuming quux.py, quux2.py, and quux3.py are the only 3 Python files in foo/bar/). This way, you can preview all the files you are affecting before actually hitting enter and running the command.

You can also recursively glob by doing

  echo foo/bar/**/*.py
which will find all .py files in foo/bar and any subdirectories

It's definitely worth installing and trying for an hour, you can decide if you like it or not!

Here are more cool features! (Not my website) http://code.joejag.com/2014/why-zsh.html


"I learned that whenever you do anything you should: find . -name "*.whatever you want" | more"

Many years ago at one company, our servers had a shared root login, no individual accounts (hey, I'm a dev, not ops).

I was executing 'find' commands and I decided to use !find to re-execute my last find. Problem was, with a shared account, was that the history was also shared. Turns out the last find command was something like "find . ... -exec rm {} \;". It deleted most of the content from the content directory of our CMS.

Backups restored the content, and I never used the ! operator again. Nowadays I use ^r instead so that I can view what I am about to execute.



I also tend to pipe to "xargs echo [the rest of the command]" first, to verify that the generated command lines looks sane.

Still plenty of fun potential escaping caveats, of course.


BTDT. Got the t-shirt. Early in my career...

* Multiple logins to the conserver, down the wrong system.

* rm -rf in the wrong directory as root on a dev box, get that sick feeling when it's taking too long.

* Sitting at the console before replacing multiple failed drives in a Sun A5200 storage array under a production Oracle DB, a more senior colleague walks up and says "Just pull it, we've got hot spares" and before I can reply yanks a blinking drive. Except we have only two remaining hot spares left and now we have three failed. Under a RAID5. Legato only took eight hours to restore it.

* Another SA hoses config on one side of a core router pair after hours doing who knows what and leaves telling me to fix it. We've got backups on CF cards, so restore to last good state. Nope, he's managed to trash the backups. Okay, pull config from other side's backup. Nope, he told me the wrong side and now I've copied the bad config. Restore? Nope, that backup was trashed by some other admin. Spent the night going through change logs to rebuild config.

There were a few others over the years, but all had in common not having/knowing/following procedure, lacking tooling, and good old human error.


I remember a list - "Things you never want to hear your sysadmin saying":

- "Huh, it shouldn't take that long..."

- "Huh, it shouldn't have finished that quickly..."

- "I've never seen it do that before..."

- "^C^C^C^C^C^C^C^C"

and several others...


I changed permission on the entire server (with hundreds of customer data) accidentally when I left out a space in the bash command by accident.

Once I realized what I had done (took me 1 sec), I got that sick feeling. I had to go to the bathroom to do #2. I know what it means to be scared sxxtless.

Sigh...


Is it really a good idea to use RAID 5 on a database? If the database is large enough rebuild time can be more lengthy than a straight restore and under many RAID 5 setups you have the added problem of slower write performance.


> Is it really a good idea to use RAID 5 on a database

Hell no. Had I been in involved in that setup it would have been RAID 10 or RAID 50. Actually, had there been some planning there would have been a second array and it would not have been physically co-located in the same rack as the first so when the cooling or power inevitably fails it won't take out both. But, you know, not my circus.


If that administrator is reading this, chin up ... it happens to the best of us.


When I was relatively new in my first job I forgot to include the WHERE clause in an update, essentially resetting the value for the entire table. Needless to say I felt awful and I was ready to hand in my resignation right after the issue was sorted out (I even printed my resignation letter). Luckily there was a relatively recent backup (not as recent as it should've been though... but I obviously wasn't the DBA) and things went back to normal relatively soon. Throughout the process the team shared their DB-related war stories with me. Everyone seemed to have had a similar experience happen at some point during their careers and knowing that made me feel a lot better. I ended up changing my mind and decided not to quit.


Much like the folk putting echos into find and wildcarded shell commands to check the output, I'll often start manual sql updates by doing 'SELECT something FROM database WHERE...' and check the output rows match my expectations before hitting the up arrow to replace the SELECT with an UPDATE.

For bigger tables, I use COUNT(something) if I expect it to be long but I have an idea of rows affected, or LIMIT if that's going to give me an idea that it's doing the right thing.
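As a sketch (table and column names invented for illustration):

```sql
-- Step 1: dry run. Same WHERE clause the UPDATE will use.
SELECT id, status FROM orders WHERE created_at < '2016-01-01';

-- For big tables, a count or a LIMIT is enough of a sanity check.
SELECT COUNT(*) FROM orders WHERE created_at < '2016-01-01';

-- Step 2: up-arrow, replace the SELECT list, keep the WHERE intact.
UPDATE orders SET status = 'archived' WHERE created_at < '2016-01-01';
```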


Indeed. My own examples: 1. Accidentally restarted a bank's FX platform when troubleshooting a failed cron. I copy/pasted/returned, "vi source ~/.bashrc ; ~/scripts/restart_env.ksh". Nobody noticed, but I still had to make a dozen cold sweat phone calls.

2. Same place.. through their GUI, effectively ran a "select * from table1,table2,table3,table4,etc". The entire infrastructure went to a halt.

3. Same place. The prod datacenter lost power, so we needed to failover to DR. "Let's ask the new guy to recreate 10k scheduling jobs in the DR env." Went surprisingly well, except that importing disabled jobs re-enabled them for some reason. An old env restart script kicked off on-schedule.

4. A new column caused my daily db import script to fail, which was only noticed after a few days of zero market data.

5. Overzealous find commands caused trades to fail (latency = bad pnl)

6. At an HFT firm, installing logstash included logstash-web which had a bad config that upstart continuously restarted. JVM restarts = bad news. 30k lost that day, apparently.

7. A typo in a script caused my cset shield script to bind the opposite cores. I fixed it the next morning after an angry wakeup call. Huge pnl improvement from this work, or it would've probably led to me being fired.

I've seen:

1. A domain controller brought to its knees after a typo'd password (bad authentication = no caching) from something similar to "for i in hosts ; do sshpass $i hostname ; done". It took way too long to figure that one out.

2. Mid-day timezone change on every server. This was in clearing, so lots of backlash from clients here.

3. Plenty of accidental reboots.

4. DR failover scripts that have zero way of working. After complaining about this, management decided to task correcting that script to me (ugh).

5. 50 or so bad code releases. Devs, y'all aint in the clear ;)

6. Miraculously never saw anything bad from this, but an old company would require us to do backups on their prod databases by clicking through old school Solaris CDE dropdowns: right click on server -> backups -> create backup. We had to do this for about 30 database servers which were then used for testing over the weekend. The re-import was done the same way.

7. A windows admin ran an rsync with an incredibly shitty GUI on a production market data archive server with the "delete if non-existent" flag checked. We thankfully had a backup, but that backup would have taken 16 days to restore. I left after about day 8.

8. A server in a perf environment was brought over to prod by me. I recommended it be freshly wiped, as the number of unknowns (including user error) is so large that it's probably a time save to do so. Enough insistence of that forced me off that project, where my boss almost immediately and accidentally wiped the RAID. We were a week late in getting that server ready. (I couldn't help but grin)

Point is, none of us were fired for any of the things we'd done wrong. It's hard to punish an accident, especially when the accidents stem from the folks before you, or bad management decisions. It truly is a mark of a good sysadmin when you've fucked up so badly that even non-techs say in disbelief, "Jesus.. whoops."

edit: formatting


> I recommended it be freshly wiped

If there's one thing I hate about the industry it's the adamant refusal in almost every single case to ever just "migrate a server" onto a fresh server.

Every time there are known problems. Every time they get carried over. Every time there's 100,000 excuses not to do it. And in the end it's never worth it to avoid it.

> It's hard to punish an accident

Totally. I think identifying risk is super important. And if you identify it, and it's not cost/time effective to avoid it, and it gets approved - then you're off the hook for accidents.


Not long ago I discovered backups don't do any good if you delete them. The incident went down while I was wiping out my hard drive to do a fresh install of Fedora. I believe what happened may have been due to sleep fatigue.

Everything is a bit hazy. At one point in my wandering on the command line I found the mount point for my external backup drive. "What's this doing here?" I decided to remove it.

At some point I woke up in a panic and yanked the USB drive off my laptop. Heart pounding. "Oh shit."

I actually felt like I was going to get sick. Tax records, client contact info, you name it, all gone. Except, basically, the pictures of my kids, mozilla profile, and my resume files.

While I reconstructed some of the missing files, there are a bunch that would be nice to have back. All of the business records, though, have had to be reconstructed by hand. By the next day I realized I really only cared about the pictures of my kids in the end. And those were somehow saved from my blunder.

Work flow change: backup drive is only connected to laptop while backups are being made or restored. Disconnected at all other times. A third backup drive for backups of backups is on the todo list.


I have about 5 backups of things that I need. I just buy a new external drive and copy everything over and leave it in the closet. And then in a year buy another one. $200 a year for a backup of my photos is worth it


If you don't have at least 3 copies in at least two different locations your data is already vapourizing.

So 5 inexpensive backups of important data sounds just about reasonable.


The Enterprise I work for is currently implementing a new idea - where they hire a crack team of generalists - and give them complete and utter unfettered access to production (including databases).

This is despite our databases being controlled by my team and having the best uptime and least problems of anything in the entire business. Networks? Fucked. Infrastructure? Fucked. Storage? Fucked. But the databases roll on, get backed up, get their integrity checks, and get monitored while everyone else ignores their own alarms.

The reasoning for this is (wait for it...) because it will improve the quality of our work by forcing us to write our instructions/changes 3 MONTHS IN ADVANCE for generalists to carry out rather than doing it ourselves. 3 MONTHS. I AM NOT MAKING THIS UP. AND THIS IS PART OF AN EFFICIENCY STRATEGY TO STEM BILLIONS OF DOLLARS IN LOSSES.

Needless to say the idea is fucking stupid. But yeah, some fucking yahoo meddling with the shit I spent my entire career getting right, is sure to drop a fucking Production database by accident. I can guarantee it. Your data is never safe when you have idiots in management making decisions.


Eh, yes and no.

You're too focused on the idea that those generalists are a bunch of skill-less dipshits. As one of those generalist skill-less dipshits, my calloused perspective is that DBAs are the absolute most obstinate, narrow-minded twats that exist in any sort of enterprise arena - worse than that PM you probably hate. They just suck! I can think of maybe one DBA who didn't flat-out stink of the 20 or so I've worked with. For some reason, there's just a complete lack of understanding of anything that's NOT a database, even though their database understanding is so incredibly deep. Y'all could use some more generalists.

An example of an obstinate DBA is one from my last place, who I wanted to take root access from. She had root ssh keys all over the place, sudoers entries in random places, passwords in her history, etc. It was a security nightmare. She absolutely refused to allow me to take away her root access. She wouldn't even allow any discussion. Her reason? "I need root to install mysql". Management agreed.

There's a reason "That's something a DBA would do." has become a running joke at multiple places I've worked at.

Edit to add: These problems could easily be solved if there was less silo'ing going on. If everything but the database is awful, then that's an indication of deeper, awful and likely legacy problems, not just with the generalists.


> You're too focused on the idea that those generalists are a bunch of skill-less dipshits.

In this specific case it's because I've been working with the quality of dipshits in the departments they are being pulled from, over the past few years. They are going to be cross-trained by dipshits from those other departments, so that they can become even worse generalists.

Hmmm. They don't care about backing up servers. They don't care about HA cluster alarms or failovers. They don't notice or proactively monitor disks filling despite being the sole custodian of the Enterprise monitoring solution. They don't care about Windows security logging policies or even the power plans. They manage AD but let service accounts expire all the time instead of following anyone up first; leading to many outages.

I'm struggling to think of anything good they do. There's no quality or pride to their work; they use GUIs. They get by because the few time I've seen other managers criticise their boss, that boss has then filed official complaints of harassment - and then everything quiets down and goes back to the status quo.

> there's just a complete lack of understanding of anything that's NOT a database

Guilty as charged. I don't care about anything outside of the database because it's not my job ;-) However I do know a little about the server level backups, clustering, performance counters, security settings, and such - anything that affects my uptime - and I monitor it, unlike the people who are paid to do so.

> An example of an obstinate DBA is one from my last place, who I wanted to take root access from.

Oracle people have root access on Oracle boxes. We have admin access on Windows boxes. It's extremely difficult for just a few staff to manage hundreds of servers in a high quality fashion otherwise.

> There's a reason "That's something a DBA would do." has become a running joke at multiple places I've worked at.

There are plenty of shit DBAs, and obviously there are good Infrastructure people as well especially on HN. I hope you realise - that DBA you were talking about - likely isn't bothering to read HN either. I am somewhere near the top middle of my profession.

> These problems could easily be solved if there was less silo'ing going on

Totally agreed.

> then that's an indication of deeper, awful and likely legacy problems, not just with the generalists.

Entrenched management and yes-men-or-you're-fired culture.


Ah, 100% fair enough! Didn't really mean to come off as critical if it came across that way. The problems are so systemic that it's worth mentioning, I suppose.

Do you work in finance? These problems you're describing are all too familiar.


About 15 years ago, my school's electrical engineering lab had a fleet of HP-UX boxen that were configured by default to dump huge core files all over the NFS shares whenever programs crashed. Two weeks before the end of the semester a junior lab assistant noticed all the core files eating a huge chunk of shared disk space and decided to slap together a script to recursively delete files named "core" in all the students' directories.

After flinging together a recursive delete command that he thought would maybe work, he fired it off with sudo at 9:00pm just before heading out for the night. The next morning everyone discovered that all their work over the semester had been summarily blown away.

No problem, we could just restore from backups, right? Oh, well, there was just one minor problem. The backup system had been broken since before the start of the semester. And nobody prioritized fixing it.

Created quite the scenario for professors who were suddenly confronted with the entire class not having any code for their final projects.

They talked about firing the kid who wrote and ran the script. I was asking why the head of I.T. wasn't on the chopping block for failing to prioritize a working backup system.


One time the DBA and I were looking at our production database, and one by one the tables started disappearing. Turned out one of the devs had tried out a Microsoft sample script illustrating how to iterate through all the tables in the database, without realizing that the script was written to delete each table.


"And this is why devs lost prod ddl access"


Nobody should have write access to prod when wearing their dev hat. Only scripts tested against a copy of the DB should be run with write in prod.


Amen to that. Least privileges please.


I heard a similar story.

Small company, dev had access to test and prod databases. Was using SQL Developer or something similar, and had gotten into the habit of using auto-commit when on the test database. Must have switched over to prod without realizing it, and did a `DELETE FROM USERS` to truncate some production table.


https://www.gliffy.com/examples/

First graphic on this page includes a bright red box asking: "Is your data safe online?"

Evidently not a rhetorical question.


It also says the answer is NO if you have "services administered by meat-based lifeforms"


If the gentleman who did this loses his job, then those looking for a new sysadmin should definitely give this guy some serious consideration.

Because I guarantee you he'll never, ever, let this happen again.


Your comment cracked me up.

But one thing worth quibbling over:

It might not have been a gentleman who did this. Might have been a lady. Or might have been a not-very-gentle man.


Whoever it was they were not very gentle on the delete key.


The official rite of passage that turns anyone into a bona-fide sys admin. The equivalent to running your production server on debug. D:


I had a client call me in panic after he'd run a unit test script that started by wiping and recreating the database from scratch, and he'd run it against the wrong server.

Thankfully he'd just run it against a dev environment where the loss wasn't particularly severe (the prod environment is firewalled off, so he couldn't have done the same thing against that), but from the panicked tone of his messages before it was clear what had happened, I'm sure he's come to be extra careful about database credentials going forward....


We seem to have shared clients in some point in time. ;)


> The equivalent to running your production server on debug.

I worked in a company where this was the standard, and we updated the code on the fly with eclipse to fix bugs :)


Here is a little fun story:

Last year I was on a flight en route to an Ed-Tech convention in Philly. There was on-board wifi and my phone had its wifi turned on. I went to check my emails before departure and got a debug page from Tomcat. My initial reaction was panic, but then I remembered that the wifi runs separately from avionics. After all the mental pictures of the plane going down in flames due to someone leaving a server in DEBUG mode disappeared, I simply closed the tab, turned airplane mode on and went to sleep, knowing full well that I had put my life in the hands of people who are pressured into writing code that works under impossible deadlines.

That's why I think that, just like the Romans slowly poisoned themselves with the lead in their pipes, we will kill ourselves with buggy code from shitty projects.


The code I speak about runs in powerplants in several countries. Non-nuclear, though, and has nothing to do with industrial automation.

Yes, shitty code will be our demise. I bet on AI.


AI? LOL. It will be some cron job running off some defense contractor's forgotten ubuntu 7 server with the password password123.


Once I asked a server support guy to move a database from production to dev. He did - without any question or doubt - exactly that: copied the database to the dev environment and deleted it from production. (Note: in my language the word "move" is more ambiguous than in English - depending on context it may mean either "move" or "copy".)


Your server support person did wrong and language is no excuse for them. If there's any ambiguity, if data may be lost, if something may cause an outage - then they have a duty to tell you and then ask if you're sure.

I do it all the time. "Give this access to this user to this db." 'Okay, but do you know they'll be able to drop your db?' "Oh shit okay wait a sec..."


Last week I deleted a large portion of our pre-production LDAP. I use the jXplorer LDAP client, and for some reason the control-d (delete) confirm dialog defaults to "Ok" instead of cancel. I'm used to hitting control-f (search) and then enter to repeat the last search, and when I hit d instead of f I deleted a bunch of stuff. The silver lining is I patched the problem in jXplorer and submitted it. It's my first legit contribution to a project.


In my opinion it is way too easy in Unix to accidentally delete stuff (even for experienced users). Having a filesystem with good (per-user) rollback support is, imho, more than just a luxury.


I miss VMS, with its 32767 backup versions of every file.


The people writing UNIX put all their brains into writing a stupid system.

Also, there are now CoW Filesystems that prevent data loss because of human error. Btrfs and ZFS are good examples.


But you really don't want to use those when working with databases, the performance loss is severe.


So.. story time. While at the university, there was that project where we had to create an elevator simulator in C as a way to learn threading and mutexes. All the tmp files were stored in ./tmp/.

In between build/run/debug cycle, I would "rm -fr ./tmp". But once, I did "rm -fr . /tmp". At that time I didn't know any better and had no version control.

I had to redo those 2 weeks in a night, which turned out to be easier than expected considering I had just written the code.

My lessons from that:

  A) Version control, pushed somewhere else.
  B) Use simple build scripts.


just run `!rm` and hope for the best


"Tell me about a time where something didn't go the way you planned it at work."


I've done this - ran out of space on /home for the mSQL database (~1996 era), I moved it to /tmp which had plenty free. I suspect most people can now guess which OS this was on and what happened when the machine rebooted some weeks later...

(Hint: Solaris)


but ... but ... but ... it's right in the name !


That didn't stop a tester on a team I managed. Server was getting low on space, I found a bunch of crap in a sub-directory named tmp. Deleted said crap. Tester complains shortly after, I explain what happened and why one shouldn't put stuff in directories named "tmp". He retorted that those were production tests. Okay, maybe one can't be expected to remember 30 or more years of Unix naming conventions, but I should challenge you to battle simply for using crappy, non-descriptive directory names. I'm supposed to find production tests in a directory named "tmp"? Clean out your desk.

Of course there was no backup, because the sysadmin doesn't back up things in tmp directories.


Well, who hasn't done a DELETE without a WHERE clause? ;P


I'm a fan of the "Write as a SELECT first" method for any DELETE or UPDATE queries, but in my busier no-sleep days, I was very fond of mysql's --i-am-a-dummy switch[1] on the CLI client, which will "Permit only those UPDATE and DELETE statements that specify which rows to modify by using key values."

Although, these days, I generally avoid mucking around with production DBs directly. It's all manually scripted migrations, and testing said scripts on a backup or unused slave, which is by far the safest, and also helps to avoid running queries that might adversely affect production's performance.

1: http://dev.mysql.com/doc/refman/5.7/en/mysql-command-options...


I have a very strict habit after once or twice doing this: always start all DELETE commands with "--" (that is, comment it out) until I have written a where-clause.

My command in the buffer might go through these steps:

1. "delete from"

2. "-- delete from"

3. "-- delete from table where condition limit n;"

4. Generally either ask a co-worker or make a Jira ticket with the exact command I have at this point, so there's a sanity check and/or permanent record (but for very low risk, especially mundane, or especially time-critical updates, just do it).

5. Delete the "--" and run it.

6. Think hard about adding some functionality in the app for doing it in app code instead of in database code.

Generally do the same for "UPDATE".


You can also start all of your commands with "BEGIN;" (at least in Postgres) to ensure you're in a transaction you can easily rollback.

Another trick is to start all "DELETE" commands as a "SELECT" to ensure you get the "WHERE" part correct, and then swap out the "SELECT *" with a "DELETE".
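Combined, in Postgres syntax with a hypothetical sessions table:

```sql
BEGIN;

-- Dry run first: same WHERE clause as the intended delete.
SELECT * FROM sessions WHERE expires_at < now();

-- Swap "SELECT *" for "DELETE" once the rows look right.
DELETE FROM sessions WHERE expires_at < now();

-- If the reported row count is wrong: ROLLBACK. Otherwise:
COMMIT;
```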


I use the latter almost religiously. Also can be useful to add a LIMIT on the end, just to guard against an UPDATE running amok. And I tend to write my queries 'backwards' (e.g. starting with the WHERE clause, ending with the "DELETE") to guard against executing the query prematurely.


I tend to write my DELETE statements by writing a SELECT first then edit that into the delete version


I've turned to creating a temp table and putting the primary keys from the select statement into there so I can guard myself against the thing that was to delete a handful of rows deleting everything. Plus with that you can do another join and see if those rows have some sort of value in it that you didn't expect.


I also put my DELETE statement on a line by itself and comment out the line. In SQL Server execution processes only the selected text. So I select starting with DELETE then run.


This is exactly what I do too.


I don't use MySQL, but it actually has a command line flag to help with this problem that always made me laugh: --i-am-a-dummy

This is an alternate form of the flag --safe-updates; this option prevents MySQL from performing UPDATE or DELETE operations unless a key is used in the WHERE clause and/or a LIMIT clause is provided.


Several years ago a colleague wrote a handy command line tool that was potentially destructive if misused, so naturally the outsourced operations team misused it. In response he added a flag along the lines of "--yes-i-really-do-know-what-im-doing" that was undocumented and mandatory for the destructive operations.


My personal preference is to either have a "deleted_at" column so deletes never need to happen, or have a "history" table.

When this isn't possible, I do a select for the PK and delete based upon only the PK. This way I can review the rows to be deleted and back them up manually if needed.

All of this is way overkill considering the backups available these days.
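The soft-delete idea as a sketch (schema names are invented):

```sql
-- "Deleting" just stamps a timestamp; no row is ever removed.
ALTER TABLE customers ADD COLUMN deleted_at TIMESTAMP NULL;

UPDATE customers SET deleted_at = now() WHERE id = 42;

-- Live queries filter out the tombstoned rows...
SELECT * FROM customers WHERE deleted_at IS NULL;

-- ...and an accidental "delete" is a one-line undo.
UPDATE customers SET deleted_at = NULL WHERE id = 42;
```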


I've gotten into the habit of enabling transactions by default in my .psqlrc. And before I do any maintenance, a good:

abort;

begin;

Just to be doubly sure. Then notice the number of deleted rows, and select from the table afterwards.


I am a bit less convinced by this practice. The general consensus seems to be: "oh you can SELECT around after you've done your update and before you commit; the update." But while you're doing that, you might be holding row locks on all the data your update touched, and meanwhile the lock acquisitions in your app are failing, along with whatever that entails. And if you're not doing that and immediately commit;ing, are you gaining anything?


When the UPDATE or DELETE finishes it will output the number of rows changed. If that number is much larger than expected, something probably went wrong. So there's a gain even if you don't investigate the results in detail.


I agree with parent that using transactions is good practice. In general, naked SQL on a production DB is a bad idea - it is not different to running untested coded in a production environment. Why risk it?

Transactions are a sane safeguard if you absolutely must run SQL on your production database.


[REMOVED]


This is a lot of manual steps. Why not have migration scripts in your code base. Run them in dev and qa and staging so you know they work. Then you go to prod and are done.

The less manual steps the less mistakes will happen.


I had a great mentor who taught me all the safety mechanisms.

Start your .sql file with "use xxxxx" (non-existent database name; will prevent execution if you fat-finger F5.) Always write your deletes as a SELECT first:

SELECT *

-- delete

FROM blah

WHERE blah = 1

Never uncomment the "delete" line; just select the portion of the query after the comment with your mouse and F5 it.

And of course: always design your database with support for soft-deletes, because sooner or later you'll need to add them in.


Simple solution:

  BEGIN TRAN
  DELETE (without where clause)
  OOPS!
  ROLLBACK 
That said, it would not be quite honest to pretend there wasn't an incident in my past which enforced this policy. Always! Without fail!


Better solution is:

    BEGIN TRAN; DELETE (without where clause)
... on the same line.

I've lost a lot of data to "oh lets find that line in mysql history... up, up, enter, FUCK"


Yep, I've never done it, but I've definitely been aware of how easy it can happen. When I'm in this kind of situation on a production DB without easily restorable backups I usually try to remember to put all my delete/update queries in a transaction. Then commit after I see the appropriate number of rows have been affected.


Oh man oh man.... This hurts just reading that, or UPDATE without WHERE.


I have a story about that based on real facts(TM): I once had to reset my own password on a production database and I decided to hash it by hand and UPDATE my row in the users table.

A few hours later we had got a few calls from angry customers who couldn't log in. I had effectively forgotten the WHERE clause so all users had the same password: mine.

Extra points for not having read the "xxx rows updated" line that the mysql console outputs after each query...


Technically they didn't have the same password, unless you're saying that your passwords aren't salted ;)


I updated the hash and the salt in the same query. They weren't salted against the user id or anything like that, just a second column for the salt, which is... common practice.


where do you get that conclusion? There's a lot of ways the password could be literally the exact same string, yet still be salted and even peppered.


I like adding garlic to my passwords, gives them a kick!


Your response could be taken as a joke (made me laugh anyway!), but also seriously too. If it was serious, what do you use as garlic and why?


It was a joke, I'm not sure what garlic would be added to a password.


I was forced to train someone so cocky that they ended up doing a rm -rf / on our production server a month after I quit. He also accidentally euthanized a legacy server, deleted the accounting database when he was trying to do a hardware based RAID rebuild, completely destroyed the Windows domain server and mocked my daily tape backup regimen - opting to ship USB consumer grade hard drives in an unpadded steel box instead to off-site storage... The list goes on. He literally destroyed everything he touched. The only reason he wasn't fired was because he was a pretty man.


Oh man, this was my last boss who literally destroyed every one of the critical reports to key clients. I quit when I realised he was undermining me to the CEO, and the CEO was listening to him. He then completely borked the entire system so badly that he was forced to resign 6 weeks later.

The company went from 25 clients to 4 last month.


I think tech workers should have a 'strategic vacation reserve' specifically to deal with this situation.


Actually, that's what I'm planning on doing from now on. However, I'm going to have to make sure all my own processes are completely bulletproof and documented.


You trained him well, master.


A dark and snowy night a bunch of databases on a server just vanished. This was on a server that was still in development, but was part of the billing system for a huuuge company, and it was under a lot of scrutiny. The files are just gone. So I contact the DBA and the backup group. For whatever reason, they can't pull it off local backups, so tapes had to be pulled in from Iron Mountain.

As I said above, a dark and snowy night. Took Iron Mountain 4 hours to get the tapes across town. The DBA and I finally get the database up around 8am the next morning. I investigate, but can't find any system reason for the databases vanishing, the DBA can't either.

2 weeks later, the same thing happens.

I eventually track it down to a junior developer who has been logged in and has on several occasions run this: "cd /" followed by "rm -rf /home/username/projectname/ *" Note the space before the star. On further investigation, I find the database group installed all the Oracle data directories with mode 777.
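A hypothetical reconstruction of why that stray space is so lethal, simulated safely against a throwaway directory rather than the real root filesystem:

```python
import glob
import os
import shlex
import tempfile

# The stray space makes the shell treat "*" as a separate argument,
# and a bare "*" globs in the *current* directory -- which, after
# the "cd /", is the root filesystem.
cmd = "rm -rf /home/username/projectname/ *"
args = shlex.split(cmd)
print(args)  # → ['rm', '-rf', '/home/username/projectname/', '*']

# Simulate what "*" would expand to in a scratch directory instead of "/":
with tempfile.TemporaryDirectory() as d:
    for name in ("etc", "home", "var"):
        os.mkdir(os.path.join(d, name))
    files = sorted(os.path.basename(p)
                   for p in glob.glob(os.path.join(d, "*")))
print(files)  # → ['etc', 'home', 'var'] -- everything rm -rf would eat
```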


Sounds like a terrible situation. I wish those guys luck.

One useful sys ops practice is the creation and yearly validation of disaster recovery runbooks. We have a validated catalog of runbooks that describe the recovery process for each part of our infrastructure. The validation process involves provoking a failure (eliminate a master database), running the documented recovery steps and then validating the result. The validation process is a lot easier if you're in the cloud since it's cheap and easy to set up a validation environment that mirrors your production env.


Posting as a reminder to myself that "in the cloud" != safe

There's always room for computer error and, more likely, human error.

Imagine if something like this happened to Dropbox? Ooooft.


This gives most SaaS providers a bad name. The error here is not the engineer deleting the db, it's the complete lack of data restore testing.

Complex restores rarely go well when the first real attempt happens under the pressure of an actual outage. Other SaaS providers will be cursing such a big-name tool making such a public mess.


I can't remember who said this, but it seems apt: "Don't ask if they do regular backups -- ask if they do regular restores".


Good point. And in the cloud, restoring a backup can take ages, which is important to take into consideration. In fact, I'm not sure how to do it efficiently other than swapping the drives.


Yep. My regular rant: until it is tested, what you have is just a file, or a collection of files or other object(s), that might be a backup. You don't really have a backup until you have tested it.


Schroedinger's backup.



DR testing is hard, complicated and costs a lot. Yes, it should be done regularly, but it's not an easy task; I believe small(ish) companies simply can't afford it.


Then perhaps small(ish) companies shouldn't hold data that's critical to their customers. Just like real engineering companies don't build nuclear reactors if they can't test their safety systems, cars if they can't afford to do crash testing and so on.

DR is not a luxury. Systems that don't properly do DR aren't unoptimized or something, they're badly engineered.


Doubt any business will go down as a result of a web based diagram tool being unavailable for a few hours.


I wasn't referring to this specific case. Also, "don't worry, our services may suck, but not to the point where they bring down your business" is not exactly the kind of reliability one would want to aspire to.


In the DR services my company provides, failover testing is baked in to the cost, and is _mandatory_ annually. It is hard and complicated and expensive, but not as hard as explaining to the customer that "we have your backups, we just can't use them because we never tested whether the software would run on the DR hardware" or "don't worry, the replica of the file server is secure in the datacenter, you just can't log in now because the Active Directory server is tombstoned and won't process your logon." When someone needs their data, what's the difference? Can I see my files or not?

It's more than a cliché that without restore tests, you don't really have a backup-- if the customer won't commit to testing the DR, we won't provide the service anymore. Anyone who pretends anything else is acceptable is kidding themselves.


Testing complicated failover systems is hard. Testing backups is not. Especially if you just do a full nightly backup and not something more complex. The options range from "manually restore a random 10%" to "write a moderately complex script to automatically restore and validate everything".

Those all look like more trouble than they're worth before you start, but they aren't that hard. And you'll recover your whole investment the first time you need to restore something and it's just a simple tweak to your standard test practice.
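A minimal sketch of the "restore a random sample and validate" idea, using sqlite3 in-memory databases so it's self-contained; the table name, payload column, and 10% sample size are all illustrative assumptions:

```python
import random
import sqlite3

def dump(conn):
    # Serialize the "production" DB to SQL, like a nightly dump would.
    return "\n".join(conn.iterdump())

def validate_restore(dump_sql, expected_rows, sample_frac=0.1):
    """Restore the dump into a scratch DB and spot-check a random sample."""
    scratch = sqlite3.connect(":memory:")
    scratch.executescript(dump_sql)
    sample = random.sample(expected_rows,
                           max(1, int(len(expected_rows) * sample_frac)))
    for row_id, payload in sample:
        got = scratch.execute("SELECT payload FROM records WHERE id = ?",
                              (row_id,)).fetchone()
        if got is None or got[0] != payload:
            return False
    return True

# Build a toy "production" DB, dump it, and verify the dump restores.
prod = sqlite3.connect(":memory:")
prod.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, payload TEXT)")
rows = [(i, f"row-{i}") for i in range(1, 101)]
prod.executemany("INSERT INTO records VALUES (?, ?)", rows)
print(validate_restore(dump(prod), rows))   # → True
```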


They can apparently afford to run 4 restores in parallel now, using different methods.

If they had done this exercise even once a year, they would have known better what to expect, or how much it would take.


Apologies if I've put the wrong slant on this — designer not a tech expert. Could still edit the title of the submission if there's something else it should say?


It wasn't aimed at you.


Well, with Dropbox the default setting is to keep local copies of all files in every computer connected to the system, so it probably wouldn't be catastrophic.


There was a bug where Dropbox wiped a few of someone's files. These changes were sync'd to the customer's computers, which dutifully wiped the files.

I think he was able to recover most or all of his data (I forget if it was photos or what) but it was interesting to realize that even with Dropbox's stellar record and pedigree, errors like that can still happen.

It makes you realize data is just so fragile. Our bits are unlikely to last centuries, which is kind of unfortunate. I've wanted to take on this problem in a systematic way somehow. It seems like bittorrent offers a way to make a digital time capsule -- theoretically, if you have seeders, your data will persist forever. So you could imagine setting up a "time capsule" of sorts. Example: $2,000 on Digital Ocean will get you 33 years of server time. So that's one seeder, "guaranteed" to last 33 years. I wonder how far the costs could be reduced? Could you reach the 1 century mark for less than $3k? Costs will continue to go down, so it should be possible to have 10 seeders for your time capsule.

And then of course you have to think of what to put into your time capsule. There's not any guarantee the computing architecture will be the same a century from now. But somehow I suspect Python 2.7 might still run. :) You could write some kind of rudimentary AI program that responds to basic questions, maybe using the data embedded in the time capsule...

The only part I can't figure out is the most obvious one: Digital Ocean probably won't be around 20 years from now, let alone a century from now. So even if you get a bunch of money together, how do you reliably transform that money into running servers over the course of a century?
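A quick sanity check on those numbers (assuming flat ~$5/month droplet pricing, which certainly won't hold constant over a century):

```python
# $2,000 spread over 33 years of server time
monthly_cost = 2000 / (33 * 12)
print(round(monthly_cost, 2))    # → 5.05

# One seeder for a full century at that rate
century_cost = monthly_cost * 12 * 100
print(round(century_cost))       # → 6061
```

So at today's prices a single century-long seeder runs roughly $6k, not $3k; hitting the $3k mark would need prices to roughly halve, which seems plausible given the trend but is pure extrapolation.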


Not until the first sync, that is. Then all your files go poof.
