Confessions of your worst WTF moment (stackoverflow.com)
29 points by superted on Sept 24, 2010 | 26 comments



During the week before Christmas, I had to rush a quick change into production that would allow us to split orders into multiple shipments. The software was fine, but an admin accidentally booted the wrong server (one with a bad test version of the software) to send feeds to the UPS label printing server.

Two days later, 272 packages arrived at one customer's house in Minnesota.

She had a good laugh, and we managed to fix the problem in a couple of days without too much additional expense.

Lessons learned:

1. If something is wrong, I'd rather have it crash than give wrong results.

2. Get a good version control system.

3. Have good policies and procedures.

4. Don't change anything after December 1.

5. Don't ask me to do a rush job if I'm busy on Hacker News.


I had a similar situation once where a character set encoding bug in a cron script I wrote and was told to put live late on a Friday caused one lucky customer to receive an SMS message every minute for over 60 hours.

I got away with that one.


I've been fairly lucky over the years (touch wood!) and for most hiccups I've had the wherewithal to get it fixed, restored or whatever before anyone noticed...

...Except the time I worked for a web dev company that had just set up a sideline selling product [1]. At that time we received maybe 5 orders per day and while I was integrating the payment gateway automation, we had to manually enter card info into a web interface. We had a lot of attempted orders from Nigeria so very strict verification was enabled and if you mistyped a cardholder name or address even slightly then it would fail.

So one day we get an order for around £150. I enter this guy's card info into the page and hit send. Bam! "Not accepted: please try again". I re-check his info and it's fine except my capitalisation is slightly different. I adjust and try again. Same message. I try a couple more variations with no luck. I phone the customer and find his card actually has 'Mr' on it. I try that and nothing. I try 'Mr.'. And so on and so on. Eventually I give up and tell him to send a cheque but we'll dispatch today.

Couple of hours later we get a call from an irate Mr Johnny Customer. His bank have just informed him that we've charged 16 x £150 to his card. Turns out that the address verification would only fail after a charge attempt had been made, and a request to void the charge was then sent immediately. However, for debit cards, the void can take 3 days to process and his card was useless in the meantime.

Not too bad in the scheme of things, but this error was over 3 times my monthly wage at the time and I'd only been there 2 weeks.

[1] specifics removed as they've since gone on to become quite a big deal (from the solid foundations I built I presume!) and Google can be a harsh mistress.


Fortunately I have to go back over 20 years for mine.

It was a missing $. It was something like this inside a shell loop:

    mv $i $.old
Except I had

    mv $i .old
This was a payroll database where I had just moved all fifty files in the database to a single file.

No problem - tape backups were up to date! They had three tapes.

First tape failed to restore.

Second tape failed to restore.

Third tape worked!

I learned a lot from that little adventure.
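
For what it's worth, a rename loop like the one described above can be written so that a slip stops the loop instead of overwriting files. This is a minimal sketch, assuming a POSIX shell; `backup_rename` and the demo file names are made up for illustration and are not the original payroll script. It quotes the variable, builds the destination from the source name, and refuses to move onto a file that already exists:

```shell
#!/bin/sh
# backup_rename is a hypothetical helper, not the original script.
# It renames every regular file in a directory to "<name>.old".
backup_rename() {
    dir=$1
    for f in "$dir"/*; do
        [ -f "$f" ] || continue              # skip directories and unmatched globs
        case $f in *.old) continue ;; esac   # already renamed
        target="$f.old"
        if [ -e "$target" ]; then
            echo "refusing to overwrite $target" >&2
            return 1
        fi
        mv -- "$f" "$target" || return 1
    done
}

# Demo on a throwaway directory.
demo_dir=$(mktemp -d)
touch "$demo_dir/a.dat" "$demo_dir/b.dat"
backup_rename "$demo_dir"
ls "$demo_dir"
```

Had the destination been a constant by mistake, a check like this would fail on the second file instead of replacing the first.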


My third iPhone project, and the first truly big whack of MVC code I'd ever written.

Everything was tested and working, but I didn't want to release yet. Had to optimize a meaty, graphical tableview that was choppy and dropping frames.

Got it glassy smooth, tested it lightly, then released.

Press was solid, sales were outstanding, everything was going well. (edit: I forgot: This was especially exciting news because I was about to leave a safe job for big adventure)

But the app was crashing and destroying any data the user had input before the crash.

I had to pull the app while I figured out what was wrong, destroying the product's amazing momentum and killing my sales rank. I didn't care. I felt terrible, taking people's money with a broken product.

Came down to one over-released button object, one line of code, along with a boneheaded assumption of when I should commit persistent data.

I donated the entirety of my early sales to charity and chalked up the experience to the importance of rigorous QA, no matter how trivial a change seems.


"I donated the entirety of my early sales to charity and chalked up the experience to the importance of rigorous QA, no matter how trivial a change seems."

WOW. Respect.


For class, I built an RSA implementation that was susceptible to frequency analysis. That wasn't so bad, considering that I was a pupil. However, the fact that I got an A+ for it was slightly strange.


My first job out of university was working on the mainframe systems for the Canada Pension Plan (equivalent to US Social Security). Every now and then some upstanding citizen would go in to register for their retirement benefits only to be told that unfortunately, as far as the computer was concerned, they were already dead. So we would run the "Lazarus" routine to resurrect them.

That was in the mid-90s but I'm sure the same thing is still happening today. Pensioners will eventually die, but that old COBOL code never will! :)


I wrote a piece of network test software for NASA on an internship. Now, I must here stop and say that the Johnson Space Center outsources their IT security to a contracting company, who is based out of another state, and who is batshit paranoid and completely unwilling to admit the existence of anything but Windows and Office. I had to escalate up three levels just to get authorization to hook up a computer running this scary thing called "Linux" to their network.

Anyway I built a test computer with two NICs in it: one was connected to the official network so I could get internet for Linux updates and to do research, and the other was connected to my private test network. While testing the software I wrote, which is capable of sending low-level "raw" ethernet packets, I sent 10,000 maliciously malformed IP packets from a MAC address of "00 00 00 00 00 00" to make sure I couldn't crash the other copy of my program receiving it across the test network no matter what it received.

Unfortunately, after hitting enter I realized I'd typed the wrong ethN port, and actually sent the 10,000 malformed packets across the official network. They weren't directed at anything, but they did reach the switch and probably triggered an IDS. Oops.

I found out later that the IT people, not content to just turn off my access, actually drove out and physically disconnected my ethernet cable from their switch!


The probability of this discussion thread ultimately containing at least one `rm -rf` must be close to 1.


I'll bite:

In my undergrad Operating Systems class we used Minix and were expected to rewrite various parts of the system ourselves as our class projects. We were all given our own systems in a smallish lab used only for this class, so we didn't think we needed to back anything up. About halfway through the semester one of my classmates mistyped a cd command, didn't realize it had failed, and quickly did an rm -rf. This resulted in all their work for the semester being lost. Because we had all been given our own machines, nothing had been saved elsewhere. After this the rest of the class started backing up our work daily.
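
The failed `cd` followed by `rm -rf` can be avoided by making the delete depend on the `cd` succeeding. This is a sketch only, assuming a POSIX shell; `wipe_dir` and the directory names are hypothetical:

```shell
#!/bin/sh
# wipe_dir is a hypothetical helper: empty a directory only if cd into it succeeds.
wipe_dir() {
    [ -n "$1" ] || return 1      # an empty name must never mean "here"
    (
        cd -- "$1" || exit 1     # a failed cd aborts the subshell before rm runs
        rm -rf -- ./*
    )
}

# Demo: a real directory is emptied, a mistyped one leaves everything alone.
work=$(mktemp -d)
mkdir "$work/build"
touch "$work/build/old.o" "$work/keep.c"
cd "$work" || exit 1
wipe_dir build
wipe_dir biuld 2>/dev/null || echo "cd failed, nothing deleted"
ls "$work"
```

Running the `cd` and the `rm` in a subshell also means the caller's working directory never changes.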


Pretty much as bad was that a colleague of mine once ran:

  rpm -e *
..which had much the same effect (and no confirmations required!)


Also, configuring the firewall to interrupt your ssh session.


I'm not sure that I'd trust any sysadmin who hasn't had at least one 'rm -rf'-related cockup at some point in their life - you certainly learn some useful lessons along the way ;)


rsync --delete is the new rm -rf (it's bitten me more than once)


Not worst, but certainly embarrassing. A good many years ago, I was working for a small company, and we got to the stage where we could invest in a couple of high-end servers. I set everything up, started to test and couldn't log in to the web front-end: the new server was crashing as soon as I entered my login credentials. My non-technical boss was starting to get a bit confused about why we'd spent all of that money...

The login screen required 3 numbers from the user's PIN to be entered, with the positions selected at random. So, pick a number at random between 1 and 10 (maximum PIN size), then pick another random number which hasn't already been generated, and do the same for the third number.

Problem: the random number generator object was created afresh for each number. The default seed value for the generator was the current time, but with no greater precision than seconds. Alarm bells started to ring.

The old server was so slow that the loop could run over several seconds, so it got a fresh pseudorandom number each time. The new server had no such problems running our code, and IIS noticed an infinite loop and killed the thread before it had a chance to affect the rest of the server.

30 seconds after realising this, I moved the object creation out of the loop, and life was good again. Whoops.
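
The seeding problem is easy to reproduce. The sketch below drives awk's `srand` and `rand` from a shell script, which is an assumption made for illustration; the story only says the code ran under IIS and doesn't name the language. `fresh_each_time` and `seed_once` are hypothetical names. The first creates a generator per draw with the same whole-second seed, so every draw is identical; the second seeds once and keeps drawing until it has three distinct positions:

```shell
#!/bin/sh
seed=$(date +%s)   # whole seconds, like the default seed in the story

# Broken pattern: a fresh generator per draw, each seeded with the same second.
fresh_each_time() {
    for n in 1 2 3; do
        awk -v seed="$1" 'BEGIN { srand(seed); print int(rand() * 10) + 1 }'
    done
}

# Fixed pattern: seed one generator once, then draw until three distinct positions.
seed_once() {
    awk -v seed="$1" 'BEGIN {
        srand(seed)
        while (count < 3) {
            p = int(rand() * 10) + 1
            if (!(p in seen)) { seen[p] = 1; count++; print p }
        }
    }'
}

echo "fresh generator each time:"; fresh_each_time "$seed"
echo "seeded once:"; seed_once "$seed"
```

A "pick until distinct" loop built on the first pattern can only finish once the clock ticks over, which matches the behaviour described on the fast server.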


Spending a day figuring out why a new routine wasn't called only to realize there is another copy in /usr/local/bin of the same program that took precedence in the path.

Having the cleaning lady unplug the air conditioner to vacuum inside a glass-enclosed server cage (fancy office), and walking in there in the morning. It was like walking into a wall of heat. Amazingly, most of the machines still worked, but we did end up replacing all of them.


This reminds me of uni days, when I had a habit of compiling a quick snippet to a binary called "test", and would be left mystified when I ran it and it just exited immediately. I hate to admit how long it took me, on multiple occasions, to realise I was running /usr/bin/test (more familiar to me as "[").

Eventually I wised up and dropped that habit :).
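
Both stories come down to which copy the shell finds first. A small sketch, assuming a POSIX shell; the program name `report` and the two throwaway directories are made up for illustration:

```shell
#!/bin/sh
# Two throwaway directories, each holding a program named "report" (hypothetical name).
first=$(mktemp -d)
second=$(mktemp -d)
printf '#!/bin/sh\necho "old copy"\n' > "$first/report"
printf '#!/bin/sh\necho "new copy"\n' > "$second/report"
chmod +x "$first/report" "$second/report"

# The directory listed first in PATH wins, whatever you just rebuilt.
PATH="$first:$second:$PATH"
command -v report      # prints the full path of the copy that will run
report                 # prints "old copy"

# An explicit path sidesteps the lookup, as does ./test for a local binary.
"$second/report"       # prints "new copy"

# For names the shell implements itself, command -v prints just the name.
command -v test
```

Checking with `command -v` before debugging the code itself would have saved the day in the first story.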


I tried to get some open source webserver code into a Gnutella (P2P software) program called Gnucleus.

The maintainer put the code in the repository but never enabled it.

The WTF moment? When Morpheus (P2P company) took Gnucleus and rebadged it as Morpheus Preview Edition.

My code was in there, orphaned as nothing ran it, and it got downloaded 100 million times. So I missed my chance and didn't get any money because Gnucleus was GPL.

WTF?


When I was a junior programmer at a bank, I had an assignment to change some reports in the plastics issue/reissue system. I used made-up cardholder names, such as Malaguena Splunt, Maxie Terwilliger, and all the Beverly Hillbillies. I ran my test, checked the reports, and thought I was done. Then a phone call from security informed me that my cards were ready. That's when I found out that running a test would actually result in test plastics being created. Since fraudsters could get hold of plastics, the person who created them was required to go to the plastics area (located in a vault with armed guards) and fill out a form for each plastic, including the name on the card and the purpose of the individual test plastic. I spent a very uncomfortable half day writing up forms for my cast of stupid names, under the watchful eyes of the security guards.


Setting up a new switch in the office, and to show the pretty light-show on startup, rebooted the switch. Turns out that the newer Procurve 1810Gs don't automatically save the config (you have to tell them to after making changes), and they don't have loop protection on by default...

So reboot -> no loop protection and that 3-cable LACP trunk I'd just put them on was suddenly a bit of a loop[1]...

Moral of the story: don't assume that (what seems like) a minor version update in a product doesn't change the behaviour substantially (the 1800G switches save the config when you make changes, the 1810Gs have it as an explicit step).

[1] A gigabit loop on a completely flat network, so there was essentially a gigabit of broadcast traffic pounding at the ~100 machines on the network...


Several years ago, computers running Win9X in the company I worked for would stop booting, complaining about missing loader or somesuch. IT was puzzled and couldn't figure out what was going on - no data was missing and hardware seemed fine. A windows repair would fix it until it happened again a few weeks later with someone else.

This happened for a year or so until I was making changes to some code I had written in the past, in a method that would delete all files from a temp folder it had created. I noticed a rare case where it could fail to get the folder name and non-recursively clean up the drive root instead (wiping some Windows boot files and mostly nothing else).

I then learned why you validate arguments and check return codes.
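
A rough sketch of that lesson, assuming a POSIX shell with a `find` that supports `-maxdepth` (GNU and BSD both do). `clean_temp` is a hypothetical helper, not the code from the story, which ran on Win9X. It rejects an empty name or the root before deleting anything, and the caller checks the return code of the step that produces the folder name:

```shell
#!/bin/sh
# clean_temp is a hypothetical helper, not the code from the story.
clean_temp() {
    dir=$1
    # Refuse an empty name, the root, or anything that is not a directory.
    case $dir in
        ''|/) echo "clean_temp: refusing to clean '$dir'" >&2; return 1 ;;
    esac
    [ -d "$dir" ] || { echo "clean_temp: no such directory: $dir" >&2; return 1; }
    # Non-recursive, like the original: delete only the files directly inside.
    find "$dir" -maxdepth 1 -type f -exec rm -f -- {} + || return 1
}

# The caller checks the return code of the step that produces the folder name.
tmp=$(mktemp -d) || exit 1
touch "$tmp/scratch.dat"
clean_temp "$tmp" && echo "cleaned $tmp"
clean_temp "" || echo "empty name rejected"
```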


I had an off-by-one error, for the case of 4n+3 rows in a half-tone picture display. I learned about it when I arrived at the office one afternoon and my boss called me into his office, handed me a copy of TIME, and told me to look at page 57. The only reason I didn't get fired was that that release also had a (lossless) compression algorithm that allowed the magazine to push back their photo deadline by a day. (I've always wondered how good my ~50% compression was for pictures, when you only have enough RAM to look at one scan line at a time). Luckily the customers noticed the problem before printing more than a few copies.


One of my clients wanted to be able to send an email "blast" to all 50,000 of their customers via their intranet application. They wanted to be able to include an attachment with the email. I wrote the code and told them it was ready.

A frantic call came in at midnight. The whole email system was down. Unfortunately they had attached a 2 MB PDF file to each email. So our little underpowered Exchange server was desperately trying to send 100 GB of emails down our T1 (plus dealing with all the bounce-backs).
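
A back-of-the-envelope check on those numbers, as a sketch. The assumptions are mine: decimal units, the nominal T1 line rate of 1.544 Mbit/s, no protocol or encoding overhead (MIME attachments are typically base64-encoded, which adds roughly a third), and bounce-backs ignored:

```shell
#!/bin/sh
recipients=50000
attachment_mb=2                 # size of the PDF, from the story
t1_bps=1544000                  # nominal T1 line rate: 1.544 Mbit/s

total_mb=$((recipients * attachment_mb))
total_bits=$((total_mb * 1000000 * 8))       # decimal megabytes, raw attachment only
seconds=$((total_bits / t1_bps))

echo "total: $((total_mb / 1000)) GB"
echo "time at full T1 line rate: $((seconds / 3600)) hours"
```

Under those assumptions the blast works out to 100 GB and about 143 hours of saturated line, roughly six days.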


Another bank/plastics mishap in my yute: one of my colleagues had a PIN pad on her desk. I ran my own credit card through it for a $1.00 charge, and it was approved. I tried a few more times, approved each time. Finally I got a "pick up card" response. I said, "Hey look, the test system wants to pick up my card." My co-worker: "That's connected to production, not test." Oops!


Hm, nothing too severe. Once I didn't check that the config files used a dummy SMTP server before running the unit tests for sending order confirmation emails. It was probably a bad idea to be using real data too :)

Just required an apologetic email to customers, no real fallout.
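
One cheap guard against this is to have the test runner refuse to start unless the mail host is on a short allowlist. A sketch, assuming a POSIX shell; `require_dummy_smtp` and the `SMTP_HOST` setting are hypothetical names, not taken from the setup described above:

```shell
#!/bin/sh
# require_dummy_smtp is a hypothetical guard; SMTP_HOST is an assumed setting name.
require_dummy_smtp() {
    case $1 in
        localhost|127.0.0.1) return 0 ;;
        *) echo "refusing to run mail tests against '$1'" >&2; return 1 ;;
    esac
}

SMTP_HOST=${SMTP_HOST:-localhost}
if require_dummy_smtp "$SMTP_HOST"; then
    echo "ok to run mail tests against $SMTP_HOST"
fi
```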



