

Software Engineers: What was your biggest ever f*ck up? - fotoblur

I just came across this story where a 'junior' engineer truncated his entire prod Users table (http://news.ycombinator.com/item?id=5292591). 
Every software engineer I've ever talked to has done something that was a major disaster. Would be great to read about your fails too!

Also add what was your lesson learned!
======
codenut
My biggest f'up, so far..

It happened in my third week as a junior developer at a very nice startup
company. It's kinda my big dream to work at a startup.

It was a Friday morning and I was just starting my day at work (I was working
remotely) when suddenly one of the cofounders sent out an email saying that our
website was timing out. So I checked out our Nagios to see if the website was
receiving a large amount of traffic, and surprisingly I could count the
number of connections on my fingers. I was a new hire back then and our
lead developer was on a flight back home. My other teammate was not yet
online because he was in a different timezone and his workday had not started.
So basically I was the only developer available at the time.
When the cofounder figured out that I had no idea what was happening, he asked
me to just shut down the server so that our customers would not be able to
process erroneous transactions. The website was hosted on AWS EC2 and I could
not find our Amazon login credentials (either I was too dumb at the time or too
nervous, because later I found out that our lead dev had given them to us a
week before), so I decided to shut down the server through the CLI, you know,
shutdown -h now.

Then the other developer got online and asked me what had happened. I told him
everything, and he decided to power up the server so that he could investigate
the issue. He logged in to the AWS console but could not find the server. It
turned out that the server's shutdown behavior was set to terminate. And yes, I
had just destroyed/deleted the server that the website was using. To cut the
story short, our lead developer came online and built a new server. But the
timeout issue was still there. He found out that it was coming from a
MySQL connection and that the root cause was a very slow select statement.
And guess who wrote that query. Yeah, it was me. A new release had just been
deployed the previous day and that query was used in one of the new features.
The website became operational the following day and everything went back to
normal.

The next day I became emotional, and I was depressed the following week, so I
handed in my resignation because I felt like I didn't deserve to work for
their company. They tried to talk me out of leaving. The lead dev even
said nice things to me (that I'm a good coder and that even he would have
written the same kind of select query if it had been assigned to him). But my
mind was too clouded and I made the very poor judgment call to go through with
my resignation. And here I am now, stuck in a corporate job trying to figure
things out and get my shit back together, hoping someday I can work at a
startup again and not f'up.

~~~
josephkern
Sorry to hear that, don't be so hard on yourself in the future. This is hard
work (even if it is fun). As a post-mortem you did one thing right and one
thing wrong.

What you did right: You accepted responsibility. Good, this is a hard trait to
find. The "perfect" engineer mixes equal parts intelligence and humility.

What you did wrong: You didn't trust your team. They were all very supportive
of you, but you didn't trust their assessment of you.

Trust your team; trust yourself. Have confidence in your ability to learn from
your mistakes.

You'll do fine. Start looking for another start-up job while working at a
corp.

The best time to find a new job, is when you already have one.

------
drharris
My first "real job" was at a company that developed equipment for radiological
surveys for decommissioning efforts, and after a short time I was given the
responsibility to develop a VB5/6 application that turned out to make a lot of
money and gain a lot of favor contract-wise. A few months in, I was assigned to
our largest project (and the largest decommissioning project in the US), and
traveled back and forth each week.

As someone on the go, I thought it was a good idea to keep the source code for
that app on my flash drive (there was no GitHub back then). For 6 months I
worked directly on that flash drive, adding new features to support the large
project, and expanding the abilities of the application to gain us even more
favor. One day, I plugged in the flash drive and Windows gave the warning that
it was corrupt and needed to be formatted. Immediately my heart sank, and the
drive was indeed dead. My last backup was about 3 months old, and didn't even
include some resources like icons and graphics.

Long story short, I had to sit there for weeks and re-code everything I'd
lost, using the latest release as a reference to what was missing. On the plus
side, my design was probably better the second time around, but nobody was
pleased that any new releases would be delayed a month at least.

I now keep that flash drive, still in its corrupt state, as a permanent
fixture on all the desks I've worked at since. It's a constant reminder to not
be stupid when it comes to time-expensive intellectual property.

------
danudey
Ops story:

I worked at a data centre which had an IP KVM attached to all of their
machines. When you were logged in as 'admin', there was a mode you could
toggle that would send all of your keystrokes to every server, but still only
displayed the one you were logged into, so there was no (clear?) visual
indication that this was going to happen. Coworker hit Ctrl-Alt-Del to reboot
a stuck server, and rebooted every non-Windows server in the data centre (and
we only had one Windows server).

Every customer got some level of compensation, the noisy ones got a lot of it,
and no one ever logged in as admin again other than to relabel servers in the
server list.

------
mb_72
This will show my age but ... as a junior developer, I was responsible for
generating the 'gold' floppy disk set for our application. The second disk of
five held hundreds of small report template files, and without a post-disk
build defrag the install process for the second disk took a couple of hours
instead of a few minutes. For one release - you guessed it - I forgot the
defrag on the second disk. I passed the disks to another guy for a test
install, and later on in the day he test-passed the install set and sent it on
for duplication. Hundreds of floppy-disk sets were sent out to clients later
that week, and then we started getting many irate phone calls about the slow
install process. Turns out the testing guy had missed the slow install rate
because he inserted the second disk, then went out to lunch for a couple of
hours, and assumed everything had completed quickly when he returned. Lesson
learned - have a written checklist for generating installs / deployment (we
didn't at that stage).

------
pindi
When defining our initial data schema, we forgot to put a unique constraint on
user email addresses. There ended up being quite a few duplicates, so before
we added the constraint I had to write a query to remove the duplicate users.
About 2/3 of our users didn't have an email listed, and my query failed to
take that into account, so it wiped out all but one of those users.
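
For anyone wanting to avoid the same trap, here is a minimal sketch of a
dedupe that leaves email-less users alone. It uses Python's sqlite3 module
purely for illustration; the `users` table, its `id`/`name`/`email` columns,
and the choice to keep the lowest `id` per address are all assumptions, not
the schema from the story. The point is that GROUP BY puts every NULL email
into one group, so a naive keep-one-per-group delete removes all but one of
those rows.

```python
import sqlite3


def dedupe_users_by_email(conn):
    """Delete duplicate users, keeping the lowest id for each email.

    Rows whose email is NULL or empty are never treated as duplicates.
    The table and column names are hypothetical.
    """
    cur = conn.execute(
        """
        DELETE FROM users
        WHERE email IS NOT NULL
          AND email <> ''
          AND id NOT IN (
              SELECT MIN(id)
              FROM users
              WHERE email IS NOT NULL AND email <> ''
              GROUP BY email
          )
        """
    )
    conn.commit()
    return cur.rowcount  # number of rows actually deleted


def build_demo():
    """Build a small in-memory table with duplicates and email-less users."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)"
    )
    conn.executemany(
        "INSERT INTO users (name, email) VALUES (?, ?)",
        [
            ("ann", "a@example.com"),
            ("ann2", "a@example.com"),  # the only true duplicate
            ("bob", None),
            ("cat", None),
            ("dan", ""),
            ("eve", ""),
        ],
    )
    conn.commit()
    return conn
```

On the demo data, `dedupe_users_by_email(build_demo())` removes only the
second `a@example.com` row and leaves the four users without an email in
place.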

~~~
drharris
I have also done this exact thing. Luckily, it was on a development database,
so nobody ever had to know! Easy mistake to make, for sure.

------
fotoblur
My biggest f*ck up:

When I worked for a financial institution my manager gave me a production
level username and password to help me get through the mounds of red tape
which usually prevented any real work from getting done. We were idealists at
the time. Well I ended up typing that password wrong, more than 3
times...shit, I locked the account. Apparently half of production's apps were
using this same account to access various parts of the network. Essentially, I
brought down half our infrastructure in one afternoon.

Lesson learned:

Don't use the same account for half your production apps. Not really my fault
:).

------
clamattack
I've had my share of SQL messes but nothing critical (thankfully!). Probably
the worst as far as effect goes was a while back in a low paying dev job. I
was under immense pressure to fix some thumbnails for an e-commerce site (as
in, if this isn't done in 10 minutes, get your coat and get out). The shop I
worked for was getting pressure from the client as they'd put it off for weeks
at that point.

So.. I write a quick script to resize the master images and re-generate around
2,000 thumbnails. Except... I copy/paste the source path to destination - and
I mistype 200px width as 20. Now we have a whole site with long thin product
images and no originals to recover from! As in the linked story, no backups
were in place and all work was done on production. Lost a week's wages over
that, and had to manually re-add everything from a stack of CDs :)

Lesson learned? Don't let pressure force you into making bad decisions. I knew
I really shouldn't be doing that but I was young & foolish.
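
A rough sketch of the kind of pre-flight check that would have caught both
slips (destination equal to source, and a width of 20 instead of 200). It is
Python with only the standard library, it does no image processing at all,
and the function name, the width bounds, and the output naming scheme are
assumptions made up for illustration.

```python
from pathlib import Path

# Hypothetical sanity bounds for a thumbnail width, in pixels.
MIN_WIDTH = 50
MAX_WIDTH = 2000


def thumbnail_target(src, dest_dir, width):
    """Return the path a thumbnail should be written to, or raise ValueError.

    Refuses widths outside the sanity bounds and refuses to write into the
    directory that holds the master image, so originals are never replaced.
    """
    src = Path(src).resolve()
    dest_dir = Path(dest_dir).resolve()
    if not MIN_WIDTH <= width <= MAX_WIDTH:
        raise ValueError(f"width {width}px is outside {MIN_WIDTH}-{MAX_WIDTH}px")
    if dest_dir == src.parent:
        raise ValueError("destination directory is the master image directory")
    # Put the width in the file name so a thumbnail never shares a name
    # with its master.
    return dest_dir / f"{src.stem}_{width}px{src.suffix}"
```

The actual resize would be called only after `thumbnail_target` returns, and
on a copy of the masters rather than on production.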

------
Jeremy1026
I work at a medical office management company. We handle the billing,
training, hiring, and IT for various medical offices in the area. Included in
the IT portion where I am, is the hub of the electronic medical records. One
day while working on a new web application to tie into the EMR system I was
fiddling with some SQL. After confirming I was logged into the development
database I ran some select statements. I moved to a new query window in SQL
Server Management Studio and ran a delete statement on a large (100,000,000+
records) table. I forgot to include a where clause so the entire table was
wiped. Which was no big deal because it was the development database and it'd
be restored in the overnight copy, except that the 2nd query window was
connected to the production database. Oops.
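
One possible guard against the missing where clause, sketched in Python with
sqlite3 rather than SQL Server (a deliberate swap to keep the example
self-contained). The `guarded_delete` helper, the `records` table, and the
`max_rows` limit are hypothetical. It does not solve the wrong-connection
problem, but it refuses an empty WHERE and rolls back if the delete touches
more rows than expected.

```python
import sqlite3


class TooManyRowsError(RuntimeError):
    """Raised when a DELETE touches more rows than the caller expected."""


def guarded_delete(conn, table, where, params=(), max_rows=100):
    """Run DELETE ... WHERE in a transaction; roll back if too many rows match."""
    if not table.isidentifier():
        raise ValueError(f"suspicious table name: {table!r}")
    if not where or not where.strip():
        raise ValueError("refusing to DELETE without a WHERE clause")
    try:
        cur = conn.execute(f"DELETE FROM {table} WHERE {where}", params)
        if cur.rowcount > max_rows:
            raise TooManyRowsError(
                f"DELETE matched {cur.rowcount} rows, limit is {max_rows}"
            )
        conn.commit()
        return cur.rowcount
    except Exception:
        # Undo the delete on any failure, including the row-count check.
        conn.rollback()
        raise


def build_demo(rows=500):
    """Create an in-memory table with `rows` committed records."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, note TEXT)")
    conn.executemany("INSERT INTO records (note) VALUES (?)", [("x",)] * rows)
    conn.commit()
    return conn
```

With the demo table, `guarded_delete(conn, "records", "1=1")` raises
`TooManyRowsError` and leaves all rows in place, while
`guarded_delete(conn, "records", "id = ?", (1,))` deletes a single row.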

~~~
xauronx
As someone who works in the same industry, doing the same type of work, I
cringed hard. I hope you were able to save your ass.

~~~
Jeremy1026
Fortunately it happened early in the morning. So the previous night's copy on
the development server was close enough that it could be pushed back to live
while causing only very minor issues.

What EMR system do you deal with?

~~~
xauronx
A very very small one that I'm too ashamed to mention :) it's mostly medical
billing though.

~~~
Jeremy1026
I hope they have better security than the one I use
<http://www.jcurcio.com/posts/obscurity-is-not-security/>

~~~
xauronx
Haha, a little bit better than that.

------
eddiemunster
- We got a brand new shiny Xbox devkit (one of the silver ones, only one in
the studio), I plugged it in..BOOM!...oh it's an American devkit and I plugged
it into a British power socket...ooopss..

- Doing a port of PS2 -> Gamecube, one guy asks me 'do we need this assert?'
I go 'nah it'll be fine'...cue a month later when we have an intermittent soak
crash after several hours which I find out would have been caught instantly by
the assert I said was ok to remove...took some time to find :/

------
keefe
I was under the gun for some client facing deadline and I had a crash so I had
to rebuild my system. We had registration for our software and nobody was
around to give me a key, so I commented out the authentication and call home
(not normally in my part of the source tree) then promptly finished my work
and committed the whole thing... got caught at the last round of QA
fortunately.

------
k1kingy
I managed to code a pretty bad bug that went out and stopped a key module
working on a piece of software.

Funny thing is, it got through a code review, my own personal testing, and QA
testing.

Once the problem came to light it was a very obvious quick fix though.

------
spoiler
Spent 2 hours trying to fix a bug in the wrong place. I was getting syntax
errors, because I typed fi instead of if, and I didn't even realise I typoed
it.

------
tectonic
rm -rf / some/specific/path
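
For anyone who missed the stray space: a small Python sketch using
`shlex.split` to show how a POSIX-style shell would split that command into
arguments. The `rm_targets` helper is made up for illustration and never
runs `rm`; it only lists the paths the command would be handed.

```python
import shlex


def rm_targets(command):
    """Return the path arguments an `rm` command line would receive."""
    args = shlex.split(command)
    if not args or args[0] != "rm":
        raise ValueError("not an rm command")
    # Anything after the command name that is not an option is a path.
    return [arg for arg in args[1:] if not arg.startswith("-")]


print(rm_targets("rm -rf / some/specific/path"))  # ['/', 'some/specific/path']
print(rm_targets("rm -rf /some/specific/path"))   # ['/some/specific/path']
```

With the space, `/` is a target of its own.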

~~~
anywherenotes
I ran chmod like that as root.

------
robomartin
I wouldn't call this a disaster, but: Coding a somewhat complex embedded
application entirely in assembler when it should have been done in C from the
start. I knew better, but I got going on the project in assembler and didn't
stop.

At first maintaining and expanding functionality was not too hard. As time
went by it became harder and harder.

The fix was to stop everything about a couple of years after the product was
already shipping and take three months to re-write it in C. After that adding
feature requests and improving functionality was an absolute breeze.

