
Times I've Messed Up as a Developer - zach-kuhn
https://medium.com/@zacharykuhn/the-times-ive-messed-up-as-a-developer-3c0bcaa1afd6
======
Back in the early '90s I had two terminals up on my X Windows console.

One for local dev and another for a production server running a trading system
for around 50 users. Naturally for production I was logged in as root.

I typed `rm -rf` to clear out my dev folder, only to realize I had just
deleted the root folder and all subfolders on the production machine.

I recalled a piece of advice a university friend had given me. "Never admit to
anything."

I walked into the server room, switched off the PC and restored it from a
backup tape that was a day old. Two hours later the machine was back on.

No one ever noticed, and I never mentioned it either.

Moral of the story - Don't let anyone give you access to production servers.

~~~
seanwilson
This is why I do a "mv" to "tmp" instead of using "rm" when I can. Having
special messages or colours to differentiate between production and other
environments is a great idea as well. Generally if I'm about to do something
on production, I'll close everything else to do with other environments as
it's just too easy to slip up.
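
The "mv to tmp" habit can be wrapped in a tiny shell function. A minimal sketch (the function name and trash location are my own invention):

```shell
# Move targets into a dated trash folder instead of deleting them outright;
# anything trashed by mistake can be recovered until the folder is purged.
trash() {
  dest="${TMPDIR:-/tmp}/trash-$(date +%Y%m%d)"
  mkdir -p "$dest"
  mv -- "$@" "$dest"/
}

# usage
touch "${TMPDIR:-/tmp}/demo-file"
trash "${TMPDIR:-/tmp}/demo-file"
ls "${TMPDIR:-/tmp}/trash-$(date +%Y%m%d)"
```

The `--` also stops filenames that begin with a dash from being parsed as options.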

~~~
3131s
You can also alias "rm" to "rm -i" or "rm -I" to avert major catastrophes.

From man rm:

 _-i prompt before every removal_

 _-I prompt once before removing more than three files, or when removing
recursively; less intrusive than -i, while still giving protection against
most mistakes_
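
In .bashrc form, a sketch (GNU rm assumed):

```shell
# prompt before destructive removals
alias rm='rm -I'     # ask once when removing more than three files or recursing
# alias rm='rm -i'   # noisier: ask before every single removal
```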

~~~
nisse72
This alias seems to be set by default in many linux distributions these days
(and has been that way for many years now).

The problem is that you get used to this crutch and rely on it, until one day
you're logged in someplace without the alias and end up deleting something you
shouldn't.

I'm not a fan and normally disable it. Instead, I prefer to be very careful
and deliberate when using rm, especially with any wildcards, making sure I'm
in the right directory and that the pattern matches what I expect it to.

~~~
chias
> This alias seems to be set by default in many linux distributions these days
> (and has been that way for many years now)

Which Linux distributions do this? I have not seen this alias set by default
on any distribution that I've used recently (Gentoo, Linux Mint, Debian,
Ubuntu, Crunchbang, Arch).

~~~
nisse72
I stand corrected, it doesn't seem to be the case anymore.

I can find old rpms and deb files with rm aliased in /etc/skel/.bashrc or
/etc/skel/profile/aliases.sh, but the most recent I found is in a bash 3.2
package, so this practice seems to have stopped quite a few years ago.

------
leejo
I've seen a DBA drop a production table as well; fortunately it was a small
in-memory table for non-persisting data that could be quickly repopulated.
Still led to about 5 mins of downtime.

I always have two terminals open, both running tmux. One is white text on
black for dev work, the other is white text on red. I only ever ssh into
production servers on the white on red terminal so there's a strong visual
signal that I absolutely shouldn't be running drop, delete, truncate, rm, dd,
whatever commands on them and if I do then I am extremely careful.
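
The red treatment can also live in tmux itself. A sketch of a ~/.tmux.conf fragment (option names vary across tmux versions; older releases use status-bg/status-fg instead of the style syntax):

```shell
# ~/.tmux.conf fragment: red status bar as a visual warning.
# Keep this config only in the profile you use to reach production.
set-option -g status-style 'fg=white,bg=red'
```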

~~~
kbart
_" One is white text on black for dev work, the other is white text on red."_

Cool. Sorry, but I'm going to steal this idea from now on.

~~~
shagie
Also doable with desktops for those occasions where it is a remote screen
share.

For UAT / test environments that are different from dev, I use a yellow
background.

------
GlennS
Not quite so disastrous, but still pretty dumb.

I was logged in over VPN to a client machine. I suspected the problem I was
fixing was something to do with the network misbehaving. It was a Windows
machine.

I thought "I know: I'll disable and reenable the network device!".

A few moments later: "Oh. Bugger.".

Shortly followed by an embarrassing call to the client's IT department to ask
them to reenable the network device on that machine...

~~~
throwme211345
Yeah, this is common. Don't modify firewall rules or fuck with interface
settings and addressing without another way in.

------
donatj
I was on my way out the door at my first job, and an SEO person stopped me and
asked me if I could fix something. I just needed to delete a single row from a
database. Easy enough. I get as far as “DELETE FROM table_name” and for some
unknown reason run the damn thing without a WHERE. All the rows, gone!

We did have nightly backups but the guy with access to them was notoriously
hard to get the attention of and lived in a different state. I was on my way
out the door, and now I ended up being stuck there for another few hours
trying to get ahold of devops.

At the company I am at now, we had a devops person wipe our multi-TB CDN on
their first day trying to make an improvement to our asset sync script. It
took probably 2 days to get the whole thing reuploaded.

~~~
Spoom
Always write DELETEs as SELECTs, and once you run the SELECT and it returns
the rows you want, then swap out the SELECT * with a DELETE.
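
As a sketch (table and column names are made up):

```sql
-- 1. Scope the rows first and eyeball the result:
SELECT * FROM orders WHERE status = 'stale' AND created_at < '2016-01-01';

-- 2. Only once the SELECT returns exactly the rows you expect,
--    swap "SELECT *" for DELETE and keep the WHERE clause intact:
DELETE FROM orders WHERE status = 'stale' AND created_at < '2016-01-01';
```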

~~~
amyjess
That's good advice in general, not just for SQL.

For example, I often write my 'rm' commands as 'ls' at first and then swap out
the command once I know it gets the right thing.
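
A minimal sketch of the habit (paths are made up):

```shell
# set up some demo files
demo="${TMPDIR:-/tmp}/rm-demo"
mkdir -p "$demo"
touch "$demo/app.log" "$demo/app.log.1" "$demo/keep.txt"

ls "$demo"/app.log*    # dry-run: confirm the glob matches only the log files
rm "$demo"/app.log*    # then delete exactly what ls listed
ls "$demo"             # keep.txt survives
```

Recalling the ls command and editing only the first word means the glob itself is guaranteed to be the one you already inspected.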

~~~
fapjacks
Yes, I learned this from a sysadmin in the early '90s when learning to use
Unix. Having used the technique in various other contexts since (e.g. database
DELETEs), I am confident this one weird trick has saved me from catastrophic
mistakes countless times.

------
andyjohnson0
Back in the nineties, at a previous employer, I wrote a code generator that
read a kind of DSL from stdin and spat out C source to stdout.

Having written the initial version and got it to compile, I thought I should
test it before checking it into RCS for the first time. So I ran it with some
simple input and redirected the output to something like program.c. At that point I
remembered that the code generator's main source code file was also named
program.c. To my horror it had overwritten a large part of its own source.

I remember staying in the office until about 2am to re-write the lost code,
but my boss never found out.
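
Incidentally, the shell can guard against exactly this kind of self-overwrite. A small sketch using the noclobber option (file path is made up):

```shell
# With noclobber set, '>' refuses to overwrite an existing file;
# use '>|' when you really do mean to clobber.
set -o noclobber

f="${TMPDIR:-/tmp}/noclobber-demo"
rm -f "$f"
printf 'original\n' > "$f"        # fine: the file did not exist yet
printf 'generated\n' > "$f" 2>/dev/null \
  && echo "clobbered" \
  || echo "refused to clobber"    # this branch runs
cat "$f"                          # still contains "original"
```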

~~~
khedoros1
A couple years ago, I was writing a script to run a codesigning system (watch
an input directory, sign the file, move to output directory). I had been
working on it for a couple days. I got confused with my filenames, and I
deleted the script, thinking it was a temp file I'd just created.

Luckily, it was Python, which compiles to a bytecode representation when you
run the script, and I had done a test-run of the script recently. After a
little research into the available tools, I was able to extract the original
source (and comments!) from the pyc file. It was a simpler solution than any
of the un-delete utilities that I found.

------
bungie4
Late 90's, I'm lead dev for a large site. Company hires a DBA. We partner with
a VERY large content provider. I'm awake for 48 hours buttoning things up,
last minute stuff. I finally collapse from exhaustion only to be woken up at
7am on - go live day.

Nothing works.

A few moments debugging reveals that all the column names in a key set of our
datasets have been renamed. No time to waste, I rename them all back as
quickly as possible. We go live, no one is the wiser.

We contact the DBA and explain the folly of his decision.

He then proceeds to rename everything AGAIN on a live system. No, we didn't
shoot him lengthwise, but we did fire him with some prejudice.

~~~
acoard
>We contact the DBA and explain the folly of his decision.

>He then proceeds to rename everything AGAIN on a live system.

What could his reasoning possibly be? Did he not understand your explanation?

~~~
bungie4
We later found he was acutely OCD and a perfectionist, which I've since come
to find is common in DBAs. He HAD to change the column names. He quite
literally couldn't stop himself. Changing it in dev, QA and staging wasn't
enough for him; it was an all or nothing proposition.

------
Posibyte
Fortune 500 company, Junior dev. Tried to log into our staging database, which
has the same username, but different password than production.

Tried 3 times, got the locked-out error. Got frustrated, said I'd fix it after
lunch, went out to eat, came back about 3 hours later after a doctor's
appointment.

My boss was waiting for me at the department door and I got sent back to his
office for a stern talking to. I had locked out the production user from the
database, causing every app in the company to go offline. I was then told to
read aloud back all of the projected sales losses from that day as well as
write a bunch of letters for our top sales people who couldn't make sales that
day.

Wasn't fired (thankfully). Made it to senior lead, so if I had to say
anything, it's that the punishment really drove the lesson home and I _never
ever did it again_. :)

~~~
planteen
Did you have different credentials you were supposed to be using?

~~~
Posibyte
Afterwards, yes, but before that all staging tables belonged to a single role
under a single staging superuser.

~~~
planteen
It doesn't seem like you really did anything wrong in that case.

------
mrsernine
That time I had to update the users table so users belonging to 'department A'
would now belong to 'department B'. I wrote that really simple update
statement and even diligently tested it on one of the pre-production servers.
Then I copied & pasted it into the production console, only to notice that I
hadn't copied the WHERE clause of the statement and now all of the users
belonged to the same department.

The fix was not painful, just restoring the user table from last night's
backup, but I felt a little ashamed when I had to explain to the database
administrator what had happened.

From that day on, I keep autocommit off.

------
ge96
Tried to get around a firewall for Windows Remote Desktop: changed the port
number through the registry, logged out, and could not log back in. Wasn't
able to change it back either. I detached the AWS volume and attached it to
another instance, but could not access the registry to change the port back...
I could see from the screenshot that the machine was running, but I couldn't
remote in.

Doing a recursive sudo chown of the /etc folder, I locked myself out of that
Ubuntu server.

Wrote a parser with a loop condition that never terminated, and would watch
the server spike to 100% usage.

------
benburleson
My big whoops was remotely bricking a server on top of Mt Haleakala.

I needed to update MySQL to take advantage of some new feature that young,
eager me just _needed_ to use. Well, the package repo didn't have that
version, so no worries, I'll install from source. Well that required some
newer version of a core utility, so no worries, I'll just `yum update` that.
No dice. No worries, I'll just force remove and install the newer one.

Well, young, eager me didn't realize essentially _every_ command relied on
that core utility, so although I had a prompt, I couldn't even `ls`, much less
`yum install`.

Our site technician had to take an OS disk up with him next time he made the
2+ hour drive to the top. Luckily we did have a KVM set up so it didn't
require a site visit from me (or is that unlucky...it would have been a trip
to Maui!).

------
f4f4
Our 3-year-old production server hard drive, containing all assets and a
database for multiple customers, died. No backups. For a few hours I was
picturing not only the anger of our customers, but the demise of the whole
company. Luckily the other drive in the RAID was OK and we didn't lose any
data.

Probably one of the stupidest, but also one of the most personally effective,
ways to learn anything.

~~~
mikestew
Account created an hour ago, and the only comment from this account is this
one you’re looking at (which seems fine by my standards), and it’s insta-dead?
I vouched for the comment, but am I missing something or is HN’s juvenile
shadow ban broken again? (IP ban, maybe?)

------
ChemicalWarfare
Speaking of backups :) We had a very large mission-critical system that we
developed and hosted for a Fortune 50 client. Long story short, the prod DB
with tons of records took a crap; a proper RCA was never done, but I suspected
a HW failure of some sort.

The kicker was, the client grudgingly OK'd the restore, understanding that
they would lose a few hours' worth of data. BUT. When a DBA attempted to
restore, the backup was corrupt lol. They went back day by day - all of the
recent backups were corrupt, and the "freshest" working backup was about a
month old. Some heads rolled as a result.

~~~
rocky1138
The age-old story: if you don't test-restore your backups, you don't have
backups.

------
qrybam
After some bloopers early on in my career I developed a systematic "back it
up" reflex if there's even the remotest chance of something being lost/broken.
This has saved my bacon on countless occasions... normally from problems I
least expected.

------
jason_slack
I worked for a company once where a "consultant" was brought in to look at our
"production efficiency". Some big wig came up with the idea that we should
make sure we were using our production servers effectively. Not too many, not
too few. We really only had 3 servers (Jerry, Kramer and Newman). Newman took
a daily beating running some very large databases. It held firm though.

The "consultant" decided to move a database to its own machine. During this
transition this person started a backup and then whacked the production
database while the backup was still running, because they felt the backup had
"probably already backed up the database since it has been running for hours".

Dead in the water. Restored from a week-old backup, to the dismay of a lot of
people working there who had lost critical pieces of their work.

------
acomjean
You're likely going to mess up sometime. Have a backup plan.

I too have "restored" a database to the wrong server (production was now
development). This was a member list for an arts organization... After about
20 minutes of panic (what is going on....) I remembered I had set daily
backups. whew.

------
13of40
On the first team I worked on at [corp] we had a corner of one of the bigger
labs set up as a test environment, with a row of headless PCs that were used
for simulating deployment scenarios. Sitting at the end of that row was a
nondescript beige box that was barely distinguishable from the test boxes,
except for being a bit older, that had our team's internal website, all of our
documents, scratch file shares, databases, etc. on it. Of course there were no
backups. At least two times while I was there, they hired temp workers and on
day one gestured in the general direction of the shelf full of machines and
told them something like, "OK, you can get up to speed by reimaging those
boxes for this week's test pass..."

------
Karupan
During the beta for a product, we were working on sending out daily summary
emails to our users. We had a cron job that was supposed to run at 8:00 AM,
process the queue, and summarize all activities into a single email.

It had been a very long day, and I deployed the job with the cron expression
`* 8 * * *` instead of `0 8 * * *`, which meant it ran EVERY minute starting
at 8:00 AM instead of just once.

The next morning I woke up to a flurry of text and emails saying customers had
complained they received emails about a hundred times. Thankfully, we had a
pause in the job which meant it went out _only_ a hundred times and not
thousands as one would have expected!
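
For reference, the five crontab fields and the two expressions side by side (the script path is made up):

```shell
# fields: minute hour day-of-month month day-of-week
# "* 8 * * *"  ->  every minute of the 8 o'clock hour (60 runs)
# "0 8 * * *"  ->  once, at 08:00
0 8 * * * /usr/local/bin/send-daily-summary
```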

~~~
db48x
Oh yea, I've done that too. These days I use systemd timers, which are much
less prone to that sort of error.
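
For comparison, a sketch of an equivalent systemd timer unit (unit names are made up; it pairs with a daily-summary.service that runs the job). The schedule reads much closer to plain language, so the every-minute mistake is harder to make:

```shell
# daily-summary.timer
[Unit]
Description=Send the daily summary email

[Timer]
OnCalendar=*-*-* 08:00:00
Persistent=true

[Install]
WantedBy=timers.target
```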

------
CodeWriter23
We were group debugging a crash that happened while using our API for a
network device for NetWare under heavy load. We weren’t sure if the problem
was our code or the IPX driver we had written for the particular NIC, so we
were swapping in every NIC we had in the lab, and running some samples.
Actually, _I_ was doing the swapping. Turns out, the one time I forgot to
power down was removing an NE-1000 (cheapest NIC available) and swapping in a
3Com 3C505, worth about $900 in 1988 dollars. The NE-1000 was fine, but the
3C505 was toasted. The PC was fine. The boss totally clowned me with some mean
mugging, then smiled and said it was ok.

------
Insanity
We had weekly releases to production. During my first month I checked in some
code without yet fully understanding the spaghetti-ness of the 20-year-old
codebase.

It turned out that our beta testers found a lot of cases that were broken on
Friday, when the release was the next Monday. The problem was that a ton of
other classes pointed to the one in which I made changes and altered the
state.

Worked through the weekend with my team lead and got it fixed somewhere on
Sunday afternoon.

Good bonding experience though. We all had a laugh about it the next week. :-)

------
drinchev
As for "Why So Many Dropped Production Databases?", you can always configure
ssh with LocalCommand, or use tmux, to change the background of the terminal
when you are on a production server.

I did a blog post about it [1].

1: [http://www.drinchev.com/blog/ssh-and-terminal-background/](http://www.drinchev.com/blog/ssh-and-terminal-background/)
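
A sketch of the LocalCommand approach for ~/.ssh/config (the host pattern and colour are made up; OSC 11 is the xterm escape for setting the terminal background, and resetting it on disconnect is left out of this sketch):

```shell
Host prod-*
    PermitLocalCommand yes
    # runs locally after the connection is established:
    # repaint the terminal background dark red
    LocalCommand printf '\033]11;#400000\007'
```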

------
madamelic
Here is a fun one for you:

In Git, a force push used to (or at least this is how I remembered it) only
force-push your current branch. I force-pushed on a branch... and it
force-pushed master as well. Thankfully no one had pushed to master since I
last pulled from it.

Here is the fix:

```
[push]
    default = current
```

Or you can just not force-push... or be explicit, which I do now. I use force
pushes for rebasing and --amend. I am one of those weirdos who enjoys clean,
readable git repos.
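
A related belt-and-braces sketch for ~/.gitconfig (the alias name is my own invention): scope plain pushes to the current branch, and make the force itself lease-protected.

```shell
[push]
    default = current
[alias]
    # refuses to push if the remote branch has moved since your last fetch,
    # so you cannot silently overwrite someone else's commits
    fpush = push --force-with-lease
```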

~~~
craftyguy
I guess we're both weirdos, since that's exactly what I do. I made it a habit
to be explicit when doing the force-push (e.g. git push --force origin
<my_branch>).

------
bg4
Stop mucking around in production. Stop. Seriously, stop.

~~~
brianwawok
If you have never screwed up anything in production, no one at your org trusts
you enough to give you production access. Which is ok and great for you, but
at the end of the day SOMEONE has to have production access. You can write a
devops template that wipes out data just as easily as you can `sudo rm -rf /`.

~~~
madamelic
I am much more terrified of Terraform than myself, especially `terraform
destroy`. I mean, a human can only do so much so quickly but a script could
kill everything semi-instantly.

Even running `terraform plan -destroy` against an internal service still
terrifies me.

~~~
theleftfielder
We actually had a production outage due to a `terraform destroy` typed in
production instead of dev.

------
whipoodle
I think the biggest time I messed up was when I accepted the offer for my
current job.

~~~
noitsnot
Same here. Smallish non-tech city. Nowhere to go but out.

~~~
whipoodle
I work for one of the fancy startups everyone here loves. It's a shitshow.

------
henvic
Another thing you can do to minimize the risk is limit what your database
user can do.

For instance, if you usually just need to view content, don't let it truncate,
delete, or do other destructive operations.
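
A Postgres-flavoured sketch of such a least-privilege account (role, database, and schema names are made up):

```sql
-- a read-only account for day-to-day browsing
CREATE ROLE reporting LOGIN PASSWORD 'change-me';
GRANT CONNECT ON DATABASE app TO reporting;
GRANT USAGE ON SCHEMA public TO reporting;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO reporting;
-- no UPDATE/DELETE/TRUNCATE grants: destructive statements are refused
```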

------
msangi
Once I refactored some stringly typed C# code to use an enum with values Prod
and Test instead of the two strings "Prod" and "Test".

I was relying on expressions like

    mode == "Test"

to fail to type-check, to spot all the places that I needed to update.

Sadly some parts of the codebase weren't very idiomatic and had some

    "Test".Equals(mode)

This caused a bit of involuntary testing in production and taught me not to
trust anything while refactoring.

~~~
Gaelan
>stringly typed

I'm going to steal that one.

------
amelius
I once moved the files in /lib to another folder (with the intention to move
them back later, but of course that wasn't possible).

My boss wasn't happy.

~~~
db48x
LD_LIBRARY_PATH

------
throwme211345
I once deleted the entire contents of /bin and /sbin on an HP-UX box around
2004; the server itself dated from circa 1996. I still had a live ssh session,
and ssh lived in a non-standard location, so I scp'd the binaries over from a
convenient near-replica and voila. Without the near-peer I would have quit and
never gone back to IT.

------
Huggernaut
We had a good one recently where / got bind mounted to be used as a container
rootfs in a quick and dirty test, then we decided to `rm -rf` the directory
containing the bind mount. A clever way to get around rm's pesky
--preserve-root default.

------
w0m
At my first job, a guy in my group once su'd to the system user that ran our
infrastructure, typed `cd && rm -rf *`, and went home.

I was sitting at my desk watching service after service slowly disappear in
utter confusion, followed by hours spent cleaning up...

~~~
madamelic
... Why would someone do that? Was it ineptness? Just being tired? That seems
so blatant.

------
snissn
I love this comment thread, there's a cool app idea here if done right!

