I've only cried literal tears once in the last ten years, over business. Due to inattention while coding during an apartment move, I pushed a poorly considered change to Appointment Reminder. It didn't cause any immediate problems and passed my test suites, but it was a time bomb that would inevitably bring down the site's queue worker processes and keep them down.
Lesson #1: Don't code when you're distracted.
Some hours later, the problem manifested. The queue workers came down, and AR (which is totally dependent on them for its core functionality) immediately stopped doing the thing customers pay me money to do. My monitoring system picked up on this and attempted to call me -- which would have worked great, except my cell phone was in a box that wasn't unpacked yet.
Lesson #2a: If you're running something mission critical, and your only way to recover from failure means you have to wake up when the phone rings, make sure that phone stays on and by you.
Later that evening I felt vaguely uneasy about my earlier change and checked my email from my iPad. My inbox was full of furious customers who were observing, correctly, that I was 8 hours into an outage. Oh dear. I ssh'ed in from the iPad, reverted my last commit, and restarted the queue workers. Queues quickly went down to zero. Problem solved, right?
Lesson #3: If at all possible, avoid having to resolve problems when exhausted/distracted. If you absolutely must do it, spend ten extra minutes to make sure you actually understand what went wrong, what your recovery plan is, and how that recovery plan will interact with what went wrong first.
AR didn't use idempotent queues (Lesson #4: Always use idempotent queues), so during the outage, every 5 minutes on a cron job every person who was supposed to be contacted that day got one reminder added to the queue. Fortuitously, AR didn't have all that many customers at the time, so only 15 or so people were affected. Less than fortuitously, those 15 folks had 10 to 100 messages queued, each. As soon as I pressed queues.restart() AR delivered all of those phone calls, text messages, and emails. At once.
Very few residential phone systems or cell phones respond in a customer-pleasing manner to 40 simultaneous telephone calls. It was a total DDOS on my customers' customers.
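Aside, for anyone wondering what "idempotent" buys you here: enqueueing the same reminder twice should be a no-op. A minimal sketch of one way to get that at the database layer, assuming Postgres-style SQL and made-up table and column names:

  -- One row per (appointment, channel, date); re-running the cron job hits
  -- the unique constraint and adds nothing new to the queue.
  CREATE TABLE reminder_queue (
      appointment_id  bigint      NOT NULL,
      channel         text        NOT NULL,  -- 'call', 'sms', or 'email'
      scheduled_on    date        NOT NULL,
      enqueued_at     timestamptz NOT NULL DEFAULT now(),
      UNIQUE (appointment_id, channel, scheduled_on)
  );

  INSERT INTO reminder_queue (appointment_id, channel, scheduled_on)
  VALUES (42, 'sms', '2013-12-01')
  ON CONFLICT DO NOTHING;  -- duplicate enqueues are silently dropped

With a constraint like that in place, an outage means the same rows get re-attempted, not multiplied.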
I got that news at 3 AM Japan time, at my new apartment, which didn't have Internet sufficient to run my laptop and development environment to see e.g. whose phones I had just blown up. Ogaki has neither Internet cafes nor taxis available at 3 AM. As a result, I had to put my laptop in a bag and walk across town, in the freezing rain, to get back to my old apartment, which still had a working Internet connection.
By the time I had completed the walk of shame I was drenched, miserable, and had magnified in my own mind the likely impact this had on my customers' customers. Then I got to my old apartment and checked email. The first one was, as you might expect, rather irate. And I just lost it. Broke down in tears. Cried for a good ten minutes. Called my father to explain what had happened, because I knew that I had to start making apology calls and wasn't sure prior to talking to him that I'd be able to do it without my voice breaking.
The end result? Lost two customers, regained one because he was impressed by my apology. The end users were mostly satisfied with my apologies. (It took me about two hours on the phone, as many of them had turned off their phones when they blew up.)
You'd need a magnifying glass to detect it ever happened, looking on any chart of interest to me. The software got modestly better after I spent a solid two weeks on improved fault tolerance and monitoring.
Lesson the last: It's just a job/business. The bad days are usually a lot less important in hindsight than they seem in the moment.
Lesson 1 rings a mad bell. I distinctly remember a text from a colleague saying "yeah, I'm never launching anything on a Friday".
As a general rule, the best time to launch is first thing on a Tuesday. Why Tuesday morning? We're all well aware of "Monday morning madness", and piling launch stress on top of that is akin to masochism.
For essentially all companies whose customers are paying consumers:
The technical staff all talk about the dangers of Friday evening deployments.
The company has ignored/overridden these warnings in the past, and sometimes experienced catastrophic failures because of it.
STILL, sometimes the marketing staff will have weekend promotions that are considered more important than these concerns.
In particular, a weekend promotion will be implemented, tested on Wednesday, and ready to go. And only after it's active is a bug found in it, on Friday at 5pm. And the promo is more important than the risk of failure.
I'm not saying this as a "bitter techie"... I'm just explaining that this is how it happens.
I think that the "it's more important" argument is often correct. Not running the promo costs the company $X of opportunity cost, and X is often very well known, from past promotions. Balance "losing $X" against "well, there's only a small chance of something going wrong", and that's why the risk is often taken.
Take the risk enough times, and eventually something goes wrong.
Pretty much. It's arbitrary but fixed: not the first or last week of the month (often reporting periods), not Monday, and not the weekend. If you then assume that organisations test the patches, that means they can deploy them for Thursday morning.
I work for a Forex (have USD? Want JPY?) service provider that runs a 24-5 service (at any time of day during a weekday, an exchange will be open somewhere on some continent; they all close on weekends).
I've worked with a global services industry organisation based in the UK but with MAJOR work happening in the middle east, India and Asia-Pacific regions. To add to that, because so much of the service industry's core activities tend to be based around the weekend and start of the week, we can't simply push at the weekend. Add in regular scheduled downtime and backup periods, having to work with people in the US, and many, many dozens of systems...
This is the difference between a software developer and a senior software developer. I never release anything on a Friday OR the day before a holiday. ALWAYS plan it afterwards, give any technical reason you like (solar flares). Project managers will realise and thank you for it in the long run.
There is a great lesson in one of pg's essays: if you push, you'd better be there for a few hours to monitor what is happening. You do not just push and leave. I hope we all learn from this thread.
In the IT services industry, we'll often talk about running a "war room" for the hours following a significant change, and then follow with a period of a nominated on-call person who has the authority to wake the project team up, tell them to drop everything and fix the issue.
Funnily enough, change control actually runs a lot more smoothly in terms of getting past senior non-technical managers when you include elements like that.
"continuous deployment" is not a panacea to the issue. There's two primary ways of doing it, from what I understand:
Continuous deployment to a dev/test environment is the easiest for most organisations to move to. Because the live environment is mission-critical, they can't afford the risk of any degradation of service. So you push regularly (after passing test suites) to a test environment, get a small subset of users working on it, and at some point push out to live from that. But I suspect that isn't what you are referring to, as this is too similar to typical change management.
The alternative definition of continuous deployment is that of constantly pushing to live, initially for a subset of users and then rolling it out gradually from there, but always on the live environment. In many large, often 'cloud-y' solutions, that subset might be all users. Except you can't have a public transport ticketing system fail at peak times. A hospital patient record system must stay available for staff, and give sufficient notice before any possible impact to service to allow manual processes to be used. Payroll, accounts, HR systems... all of these have failure modes that carry financial penalties at best.
Hence continuous deployment only makes sense when you can afford to risk service, possibly with significant impact.
It helps a lot though. One of many small incremental changes, or one big monster change: which do you think is more likely to break a system? And which is going to be easier to diagnose and fix?
> I've only cried literal tears once in the last ten years, over business.
> Don't code when you're distracted.
Same story here. I can't remember the exact scenario, but I was concurrently acting under all three of my titles (Developer, Architect, Escalation Engineer/Critical Debugger). The customer (who was 7 hours offset from us) had been battling for 3 months and we were getting nowhere (all thanks to a since-fixed bug in WinDBG which essentially came down to broken stack traces in certain scenarios). For those 3 months I had been working 20-hour shifts (development by day, support by night).
Under that strain I eventually made a screw-up with the dev work and it cost QA time. The MD of my region had a sit-down with me and I
> cried literal tears
Needless to say they were impressed that someone cared so much and sent me home to sleep. The next day I came in and decided to go through the 800kloc codebase line-by-line and see what could be causing the issue - I found it in a few hours.
"People who are shipping actual products instead of talking about them on message boards", perhaps. Running someone else's product company from a message board is a little like playing Jeopardy! from your couch, right?
Yeah. This was... my biggest takeaway from running my own company. I struck out on my own because I thought the business people who ran the companies I worked for were unethical idiots, and being not-an-idiot, I could do better.
Turns out? Nope.
I also find that the arguments I had with my bosses come around again, from my employees.
I still don't really understand how business works - but I've learned enough to understand that I don't understand how business works.
My question is: Who never makes mistakes? I certainly do not belong in that set.
And I like Patrick's candid business anecdotes. Snarky comments like this might discourage people from writing useful advice here. Which would be very unfortunate.
>All true, but that still doesn't answer what possesses someone to pack their cell phone in a box.
This is the primary difference I see between programmers and sysadmins, development and operations. I know programmers who don't own cellphones at all, while I know some sysadmins who take tertiary backup communication devices on vacation.
It's a difference in focus.
Of course, most programming jobs have /some/ operational responsibilities, and most sysadmin jobs have /some/ development responsibilities, but most people see themselves as primarily one or the other, and act accordingly.
Note, I agree that packing your cellphone in a box was a mistake either way. But if you are primarily an operations/sysadmin type? That would be a really big deal kind of mistake, one that you probably wouldn't make very often. To a developer type who saw their operations role as secondary, sure, it's still a mistake, but it's a smallish, forgettable kind of mistake.
You're right, I just run my own product company from my couch, and we'd never be daft enough to ship something critical right before disappearing into the ether.
> Fortuitously, AR didn't have all that many customers at the time, so only 15 or so people were affected. Less than fortuitously, those 15 folks had 10 to 100 messages queued, each.
Excuse me for caviling at your post, but "fortuitously" is a synonym for "accidentally", not "fortunately".
Well, a quick definition says "happening by accident or chance." While I suspect he didn't know the proper definition (I certainly didn't), it's not that bad a choice.
Thanks for the comment though - I personally always appreciate a gracious correction.
oh man. the idempotent queue thing reminds me of the time i was working for a very-early-stage startup, and we'd managed to persuade an exec from a pretty big company to sign up for a trial. our ceo got a very angry call the next morning; the guy had woken up to 300 email messages because our queue had hiccuped and, yeah, not idempotent.
Not the worst at all, but probably the one I found most amusing. One of my jobs included some sysadmin tasks (this wasn't the position, but we all did devops), among my other responsibilities. I spent half a day going through everything with the person responsible for most of the admin tasks at the time. She was an extremely diligent and competent admin, did absolutely everything through configuration management, and kept very thorough personal logs and documentation on the entire network. One of my first tasks was to change backup frequency (or some other singular change) and, going by how I usually did things at the time, I just sudo'd a vi session, changed the frequency, and restarted the service.
She found out about it pretty quickly due to having syslog as a constant presence in one of her GNU screen windows and gave me a look. She quickly reverted what I did, updated our config management tool, tested it, then deployed it, while explaining why this was the right way to do things. I slowly came around to doing things the right way and hadn't thought much about the initial incident until we found her personal logs, which she had archived and left on our public network share for future reference.
In the entries for the day that I started, we saw the following two lines:
[*] 2007/09/09 09:58 - yan started. gave sudo privs and initial hire forms.
[*] 2007/09/09 10:45 - revoked yan's sudo privs.
> She found out about it pretty quickly due to having syslog as a constant presence in one of her GNU screen windows
I'm amazed that this is possible. How would I set something like that up? A realtime log of only the most significant events of a remote system?
In fact, I'd like to take this opportunity of ignorance-admitting to ask the community for general linux/bsd sysadmin resources. What books should I read, or what topics should I study? I want to become an expert at modern sysops. Modern deployment, hardening, backup, managing dozens of boxen, etc.
I've been thinking of going through any MIT OCW on the subject, but it seems like hard-earned experience might not necessarily translate well to an academic setting. What would you recommend I do?
What position are you starting from? My old workplace was a university group where we (admins) were recruited from the available pool of PhD students. So I'm used to guiding people from "no knowledge" to "enough knowledge to be dangerous". The first step was to force the prospective admins to run a specific system on their "productive machine" and keep it in such condition that _everything_ works.
This way, a complete admin newbie would learn about digging through the systems by working out the kinks of practical everyday problems. Remember, this is only the most basic instruction, nowhere near enterprise-grade.
If there was a "prospective admin" who had never before run Linux, I'd tell them to install and use Ubuntu/Mint. (Those guys whould usually only be trained to be a helping hand for a "senior" admin.)
If they'd already used Ubuntu at home, I'd tell them to start using Debian and work out how to set up an SSH server and set up their home machine so they could access it remotely.
If they had dabbled with Debian, Fedora, SuSE or something similar, I'd tell them to install Arch and set up some "interesting things", like a mail server or an NIS server.
If they were using Arch or Gentoo at home, I'd just personally show them the important things about our system and have them wingman with me for a few days.
If you are already an advanced Linux or BSD user, my approach is of course not appropriate. Instead I'd recommend picking skills that you want to learn (iptables? Exim?) and setting them up. Read manuals! Read RFCs!
Accurately assessing one's own competence is difficult and makes for boring reading, but since it's probably necessary here, I'll give some background.
> If they'd already used Ubuntu at home, I'd tell them to start using Debian and work out how to set up an SSH server and set up their home machine so they could access it remotely.
> If they had dabbled with Debian, Fedora, SuSE or something similar, I'd tell them to install Arch and set up some "interesting things", like a mail server or an NIS server.
> If they were using Arch or Gentoo at home, I'd just personally show them the important things about our system and have them wingman with me for a few days.
I'd say my current skill level is a mixture of those three. For example, I don't know how to deploy a web service which can send out email for users to e.g. reset passwords. So I don't know anything about email. On the other hand, I've been trying to hone my skills by hardening a Debian server using iptables. On my third hand, while I could set up a box at home that can be SSH'd remotely, I'm not yet confident I know all the best practices. I think the best SSH practices are: change the default SSH port, disable root login, and disable password-based login (use a password-protected keyfile instead).
Beyond that, what is interesting to me is being able to set up dozens or hundreds of systems. Doing this by hand is fraught with error, so it seems like I should learn about virtualization + deployment systems. I've heard good things about Ansible and Salt, but I've also heard that Salt treated security as an afterthought, which didn't sound good.
It's sounding like my best bet is just to try things, but I want to set things up correctly from a security perspective.
I should also enhance my knowledge of networking... perhaps by spending a few weeks on OCW material regarding the networking stack. How packets are routed, the details of TCP, that sort of thing.
You're welcome. If you want to deploy and maintain many machines, then FAI[1] might be worth a look. It allows you to maintain a consistent state over an arbitrary number of machines running a Debian-based distribution, with _and without_ virtualization. We used it to run about 40 user-facing desktop machines and about the same number of cluster nodes. You basically have a central server that contains configuration, configuration-modifying scripts and package configurations. It is possible to define classes of machines, and one machine can belong to multiple classes, so you can have part of the configuration identical on all machines and other parts only on some of them.
If they are using gentoo, you should be finding someone else. Gentoo users are typically the most dangerous combination of profoundly ignorant, yet absurdly overconfident in their abilities. Seeing a bunch of autotools and gcc output scroll by does not teach you anything. But the mistaken reputation as an "advanced" distro makes people think that by using gentoo, they are therefore "advanced".
There's something to be said for the installer being a random liveCD and documentation for manually installing & configuring a system.
If you go through the handbook properly (and potentially enough times that you don't need it to install), the amount of inherent Linux usage and admin knowledge you can pick up is just phenomenal - for example, I love the xkcd[1] even if it stopped applying when I started using gentoo.
I would expect a gentoo user to be comfortable on the command line, which doesn't hold true for a lot of other desktop users. That said, isn't it immense that desktop Linux has gotten to the point where the barriers to entry are grandma-level low :)
It's also probably fair to say that every userbase has its vocal idiots...
[1] http://www.xkcd.com/1168/
I have never seen a gentoo user with any more knowledge or experience than ubuntu, mint, mandrake, etc. users. They are in fact almost exclusively people who used a "noob" distro, then switched to gentoo to feel "advanced" even though nobody with any unix knowledge would waste their time with gentoo.
No, specifically gentoo users can. The distro literally serves no real purpose; nobody with any unix experience would consider using it. It is quite literally the distro for people who don't know what they are doing, but want to feel "advanced" by watching stuff they don't understand scroll by.
Papertrail is great for this... you can of course set up syslog to route to a central server and just be logged into tmux/screen on that particular machine to read off the stream of logs (I prefer Papertrail though, plus saved searches and HipChat "pings" when saved searches are matched on incoming events).
General devops / sysops/ sysadmin knowledge can be had through a variety of means - I got most of my knowledge from simply reading the FreeBSD manual and making a lot of mistakes with my own servers.
In late 2008 when I was in the Marines and deployed to Iraq I was following too closely behind the vehicle in front while crossing a wadi and we hit an IED (the first of 3 that day).
Nobody was killed, but we had a few injured. Thankfully the brunt of it hit the MRAP in front of us. If it hit my vehicle (HMMWV, flat bottom) instead I probably wouldn't be here.
That was the first major operation on my first deployment, too. Hello, world!
My takeaway? Shit just got real.
We ended up stranded that night after the 3rd IED strike (our "rescuers" said it was too dangerous to get us). It was the scariest day of my life, but in similar future situations it was different. I still felt fear and the reality of the existential threat, but I accepted it. It was almost liberating. Strange.
I deployed for another year after that (to Afghanistan that time). After Afghanistan I left the Corps and started my company. Because if it fails, what's the worst that can happen? Lulz.
This really puts some of the boneheaded moves I've made in my career in perspective. One thing that's always kept me pretty even keeled after a blowup is to take a breath and tell myself that no matter how bad I've screwed up, I'm still here, still breathing, and there (most likely) is some way out of the hole I've dug, no matter how painful.
Depending on the industry, that might not be the case though. Thanks for your service.
One summer in college, I got an internship at a company that made health information systems. After fixing bugs in PHP scripts for a couple weeks, I was granted access to their production DB. (Hey, they were short on talent.) This database stored all kinds of stuff, including the operating room schedules for various hospitals. It included who was being operated on, when, what operation they were scheduled for, and important information such as patient allergies, malignant hyperthermia, etc.
I was a little sleepy one morning and accidentally connected to prod instead of testing. I thought, "That's weird, this UPDATE shouldn't have taken so long -- oh shit." I'd managed to clear all allergy and malignant hyperthermia fields. For all I knew, some anesthesiologist would kill a patient because of my mistake. I was shaking. I immediately found the technical lead, pulled him from a meeting, and told him what happened. He'd been smart enough to set up hourly DB snapshots and query logs. It only took five minutes to restore from a snapshot and replay all the logs, not including my UPDATE.
Afterwards, my access to prod was not revoked. We both agreed I'd learned a valuable lesson, and that I was unlikely to repeat that mistake. The tech lead explained the incident to the higher-ups, who decided to avoid mentioning anything to the affected hospitals.
If it's any consolation, the company is no longer in business.
Just remember when you screw things up: Your mistake probably won't get anyone killed, so don't panic too much.
You didn't screw up here. The entire infrastructure, org chart, and policies that allowed you to accidentally modify a production database containing critical medical information screwed up.
Blaming yourself here is like blaming yourself for being hurt after being told to drive a car with no seatbelt or brakes.
Sure there's plenty of blame to spread around, but I still would have felt terrible if someone had been hurt or killed.
What system would you put in place to prevent this? The issue was that I connected to prod when I thought I was connecting to a test DB. We each had different credentials for prod vs everything else, but the SQL client remembered my username and password. Anyone with prod access could have made the same mistake.
* Keep the prod DBs in an isolated VPN that requires a separate login outside of the SQL client. Stay logged out of that except when you explicitly need production access. This keeps you from casually messing with production.
* Don't save production credentials in your SQL client - uncheck the box or whatever you have to do. Probably a good idea for security anyway.
* Some clients will let you change UI for each DB. I know that SQL Server Management Studio will let you change tab and editor background color. So maybe make prod all red (or pink, or something else annoying).
* Only give a few people production logins and require them to audit everything before they run it. Actually, I'm surprised this wasn't already the case for a company dealing with health info.
In a case like that, where clicking on the wrong thing could result in death? I would never allow production database access for anything other than the running app.
I'd have an emergency procedure, sure, one where in some dire circumstance somebody could poke a hole in the firewall, change the database configuration, open a sealed envelope, and then look at/change the real data.
But in normal circumstances, anybody who really needed to see prod data would look at a read-only copy. (Or better, would look at an identity-scrambled version of it.) And anybody who needed to change it would write a bit of code to do the work and take it through the normal review and push process.
In the past, I've set up big MOTD style messages that say "PROD" in fancy ASCII graffiti when I ssh/connect a DB client/whatever to production. I think I will set one of those up now for my current setup.
Also, sort of related, I'm using MacOS, and in the back of my head I've wanted to create a tool that will change the color of the menu bar (at the top of the screen) to, say, bright yellow, when I'm connected to the VPN so that I don't accidentally visit a porn site while still connected to work.
That said, neither of these systems is even close to fail-proof :)
Maybe you could set a translucent menu bar, then script something to change the top 22px of the desktop background based on the VPN connection status. It's hacky, but it'd work.
Another option would be to configure your routes. At a previous job, I set up my home router to connect to the VPN and route 10.* to the VPN interface. Setting this up isn't easy, but it's oh-so-convenient. Reading http://wiki.openwrt.org/doc/howto/vpn.client.pptp will start you on the right track.
Be careful though. This gives anyone on your home network access to work. It almost certainly violates security policy. I only did this because I knew I'd just be chastised if I got caught. (Same goes for running rogue APs at work.)
Most VPNs can be configured to only route certain subnets over them (so all work related networks, for example) instead of everything. This is very simple to do with OpenVPN; can't speak as to the Mac builtin solution.
This is essentially what I do - Black on White for production, White on Black for development. If I'm running development commands on a Black on White screen, something doesn't feel right. It isn't a life-or-death application, so this is enough.
And that's why you can't connect directly from my desktop networks to the production environments, but you can connect to the dev facilities. Firewalls: they're not just to protect against outside threats.
In my work, developers only have read access to production servers (for checking logs etc). If you want to make a change to a production system, you need to go formally request it through OTRS. So this sort of situation can't really arise. You can of course still cock-up live systems through asking the sysadmins to do something stupid, but then the problem is stupidity, not carelessness.
He could have easily revoked the UPDATE and DELETE commands from your privs list. INSERT, CREATE, and SELECT are (usually) plenty, and any database migrations that need to happen should typically be reviewed by him and then run by him.
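A minimal sketch of that kind of lockdown, assuming MySQL-style syntax (the 'intern' account and 'ehr_prod' schema names are hypothetical): the account keeps read and insert access, and a stray UPDATE or DELETE fails with a permissions error instead of touching patient data.

  REVOKE UPDATE, DELETE ON ehr_prod.* FROM 'intern'@'%';
  GRANT SELECT, INSERT ON ehr_prod.* TO 'intern'@'%';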
He's not entirely innocent. It's more like blaming himself for being hurt after being told to drive a car with no seatbelt or brakes, while knowing that the car has neither, fully understanding what could happen, agreeing to it anyway, then driving 80 mph down a residential street while still groggy from waking up.
Regardless, I am very glad they are now out of business. :)
Uh, has anyone on this thread heard of HIPAA? I'm pretty sure having a summer intern get full access to actual patient data shouldn't be possible under a properly implemented set of HIPAA processes, and the same goes for the accidental UPDATE.
The story reminds me of the day I was introduced to "BEGIN TRANS", "COMMIT" and "ROLLBACK" when someone upgraded the Sybase console and helpfully changed the default setting so we didn't need those pesky semi-colons to finish a query any more. The result was:
DELETE * FROM TABLE x
131054 rows deleted
WHERE a = "foo"
>> Malformed query <<
Phone starts to ring a few seconds later as all the users saw their morning's work disappear.
This stuff is way too easy for us noobs. Thank goodness that with modern technology we've found ways to make sure it doesn't happen any more... :-)
Did a similar thing, but in a less critical domain (warehouse management). Updated the status of all packages to "NEW", which would have meant that everyone who ever ordered something from that company would have gotten another delivery for free, provided the articles were in stock.
We were able to restore the data pretty quickly, but we had to interrupt warehouse workflow for a few minutes. They were surprisingly accommodating, almost amused by my mistake.
A local Subway franchise was the very first company that hired me. I was extremely young, shy, and intensely socially awkward, yet excited to join the workforce (as I had my eyes set on a Pentium processor).
When I worked at Subway, the bread dough came frozen, but you would put loaves in a proofer, proof it for a certain amount of time, and then bake it. My first shift, however, got busy and I left several trays in the proofer for a very, very long time. Consequently, they rose to roughly the size of loaves of bread, as opposed to the usual buns.
It was my very first shift alone at any job in my life, so I did the most logical thing I could think of and put the massive buns in the oven. They cooked up nicely enough and I thought I was saved. Until I tried to cut into one.
Back in the day, Subway used to cut those silly u-shaped gouges out of their buns. In retrospect, I think this was most likely a bizarre HR technique designed to weed out the real dummies, but at the time I was oblivious (likely because I was one of the dummies they should have weeded out). When I ran out of the normal bread, I grabbed one of my monstrosities, tried to cut into it, and discovered that it was not only rock hard, but that the loaf broke apart as I tried to cut it.
That night, my severe shyness and social awkwardness had their first run-in with the beasts known as angry customers. I was scared I would get fired, so I promptly made new buns, but spent the rest of my shift trying to get rid of my blunder. I discovered some really interesting things about people that night. First, you'd be surprised how incredibly nice customers are if you are straight up with them. Some customers I'd never met before treated the big, crumbly buns as an adventure and, in doing so, helped me sell all the ruined buns.
In the end, I came clean (and didn't get fired). That horrible night was a huge event in the dismantling of my shell. It taught me an awful lot about ethics. And frankly, that brief experience in food service forever changed how I deal with staff in similar types of jobs.
This reminds me of reject analysis week as a radiography student. People would be hiding their crap films (film and chemistry, people!) up their tops, behind shelves, basically anywhere. Nowadays the clever ones know how to dick with the server. I have never deleted films for this reason, but I have deleted films to keep incidents quiet... (Boss must not know I got a chunk of steel in my hand prior to a shift in MRI, etc.)
I was testing disaster recovery for the database cluster I was managing. Spun up new instances on AWS, pulled down production data, created various disasters, tested recovery.
Surprisingly it all seemed to work well. These disaster recovery steps weren't heavily tested before. Brilliant! I went to shut down the AWS instances. Kill DB group. Wait. Wait... The DB group? Wasn't it DB-test group...
I'd just killed all the production databases. And the streaming replicas. And... everything... All at the busiest time of day for our site.
Panic arose in my chest. Eyes glazed over. It's one thing to test disaster recovery when it doesn't matter, but when it suddenly does matter... I turned to the disaster recovery code I'd just been testing. I was reasonably sure it all worked... Reasonably...
Less than five minutes later, I'd spun up a brand new database cluster. The only loss was a minute or two of user transactions, which for our site wasn't too problematic.
My friends joked later that at least we now knew for sure that disaster recovery worked in production...
Lesson: When testing disaster recovery, ensure you're not actually creating a disaster in production.
I wrote a piece of code controlling an assembly line machine. These machines require manual operation and come with a light curtain, which detects when someone places their hand near the moving parts and should temporarily stop the machine.
A relatively minor bug in the software that I wrote caused the safety curtain to stop triggering when a certain condition was met. We discovered this bug after an operator was injured by one of these machines. Her hand needed something like 14 stitches.
Lessons learnt:
1. Event-driven code is hard.
2. There's no difference between a 'relatively minor' bug and a major one. The damage is still the same.
We just recently reviewed the Therac-25 case study as my organization is working towards ISO 13485 certification. I wonder whether the OP's organization was using ISO development practices.
Another lesson your company should have learned is that a safety-critical system like this should not be left to software. Sure, monitor the curtain by software and send errors, but hardware should immediately stop the machine when the light curtain is broken.
The classic: forgetting the WHERE clause of a manual UPDATE query on a production system. The worst part is you know you fucked up the nanosecond you hit enter, but it's already too late. Lesson learned? Avoid doing things manually even if a non-technical co-worker insists something needs to be changed right away. And if you do: wrap it in a transaction so you can roll back, and leave in a syntax error that you only remove when you're done typing the query.
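A minimal sketch of that habit (the table and values are made up for illustration): verify with a SELECT, run the UPDATE inside an explicit transaction, check the reported row count, and only then commit.

  BEGIN;
  SELECT COUNT(*) FROM orders WHERE id = 42;           -- expect exactly 1
  UPDATE orders SET status = 'shipped' WHERE id = 42;  -- row count should match the SELECT
  -- ROLLBACK;  -- if the count surprises you
  COMMIT;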
I was hired by my college to build a grade management system in my second-to-last year there. I was in a hurry due to a lunch meeting with other IT staff at the University, forgot to add the where clause, and suddenly every single student was a Computational-Science major (mine).
The funny part of the story was that the moment it happened I uttered "oh shit." My boss, who sat beside me, said "what'd you do?", and about 15 IT staff from other departments walked into the office to go out for lunch. I'm sure I was an interesting shade of red.
I had to explain what I did in front of all these people. My boss laughed out loud, brought the system offline, and simply said: "well, after lunch we get to test our backup process." We went for lunch.
Two valuable lessons I learned...
People make mistakes; that isn't the problem. It's how they respond that's important.
Don't try and solve hard problems when emotions are running high. If shit is going down in production, the most important thing to do is to breathe, and get a glass of water. That little bit of time helps a lot.
This is why, while I hate Oracle and everything they represent as a company, I kind of like their database because of the flashback feature. You can do
SELECT * FROM table AS OF TIMESTAMP some_timestamp;
and that is pretty practical. It works online, no restore, no nothing, and while it only works as long as the old data are in the logs, on a production system you should have the spare space to keep some history. There's also FLASHBACK TABLE tab TO BEFORE DROP, but that shouldn't happen, right?
Of course, you should probably do every update of production data in a transaction, check the result and then commit, and if you want to be sure, you can do UPDATE ... RETURNING to check what's changing. Autocommit on manual access to production is pretty crazy. But still, flashback is useful.
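For completeness, the repair side of the same feature looks roughly like this (Oracle syntax; the table name is made up, and the table needs row movement enabled before a flashback to a timestamp):

  ALTER TABLE orders ENABLE ROW MOVEMENT;
  FLASHBACK TABLE orders TO TIMESTAMP SYSTIMESTAMP - INTERVAL '15' MINUTE;
  -- and for the accidental drop:
  FLASHBACK TABLE orders TO BEFORE DROP;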
I usually allocate a large part of the free space to FRA. In the production system I use right now (about 2TB of data, 50GB changes/day), I can go back a couple of weeks if needed. Fortunately everything is stable now, but that flashback was quite useful a few times.
Reading all of these makes me think, the admin tool for your database of choice should probably put you inside a transaction by default, and require you to explicitly commit changes. For the madmen, it could still have an auto-commit mode, but should be opt-in rather than the default.
I've done similar, and now I almost always write a SELECT first; only after I've verified I'm getting the rows that I expect do I convert my query to an UPDATE/DELETE.
In this case though, wouldn't you have to COMMIT before the actual update happens? Usually in production, it is not a good idea to have auto-COMMIT on.
Been there, done that. Normally I always work inside a transaction and carefully examine the results before typing that all-important 'commit'. But a "simple" change at 4:55 and me in a hurry to get home....
This is why you have SET SQL_SAFE_UPDATES=1; (or equivalent) in your DB shell startup. It only takes one UPDATE users SET password='foo'; to learn why...
I did this in a production database (thought it was a QA environment) and brought trading on the mortgage desk of an investment bank to a grinding halt on September 14th, 2008.
The DBAs saved my 23 year old ass that day. I make it a point to send them beer on 9/14 every year.
Yep, I typically do SUPDATE just for fun that way. And only do that AFTER building the where clause with a SELECT * FROM foo WHERE ... so that I always start with the clause when making the update. Might be paranoid but it always seems to work out for me that way.
I run Correlated.org, which is the basis for the upcoming book "Correlated: Surprising Connections Between Seemingly Unrelated Things" (July 2014, Perigee).
I had had some test tables sitting around in the database for a while and decided to clean them up. I stupidly forgot to check the status of my backups; because of an earlier error, they were not being correctly saved.
So, I had a bunch of tables with similar names:
users_1024
users_1025
users_1026
I decided to delete them all in one big swoop.
Guess what got deleted along with them? The actual users table (which I've since renamed to something that does not even contain "users" in it).
So, how do you recover a users table when you've just deleted it and your backup has failed?
Well, I happened to have all of my users' email addresses stored in a separate mailing list table, but that table did not store their associated user IDs.
So I sent them all an email, prompting them to visit a password reset page.
When they visited the page, if their user ID was stored in a cookie -- and for most of them, it was -- I was able to re-associate their user ID with their email address, prompt them to select a new password, and essentially restore their account activity.
There was a small subset of users who did not have their user IDs stored in a cookie, though.
Here's how I tackled that problem:
Because the bulk of a user's activity on the site involves answering poll questions, I prompted them to select some poll questions that they had answered previously, and that they were certain they could answer again in the same way. I was then able to compare their answers to the list of previous responses and narrow down the possibilities. Once I had narrowed it down to a single user, I prompted them to answer a few more "challenge" questions from that user's history, to make sure that the match was correct. (Of course, that type of strategy would not work for a website where you have to be 100% sure, rather than, say, 98% sure, that you've matched the correct person to the account.)
Ha, nice one. Whenever I start a new job, the first thing I do is create a backup of the database, because I have done something similar before. Back up everything the first day, onto your own machine.
Not the worst, but certainly most infamous thing I've done: I was testing a condition in a frontend template which, if met, left a <!-- leo loves you --> comment in the header HTML of all the sites we served. Unfortunately the condition was always met and I pushed the change without thinking. This was back in the day when bandwidth was precious and extraneous HTML was seriously frowned upon. We didn't realize it was in production for a week, at which point several engineers actually decided to leave it in as a joke. Then someone higher up found out and browbeat me into removing it, citing bandwidth and disk space costs.
Now, if you go to a CNET site and view source, there's a <!-- Chewie loves you --> comment. I like to think of that as an homage to my original fuckup.
I once worked for a company that schedules advertising before films. This wasn't in the US and the company had a monopoly over all of the ads shown across the country. It was my first programming job and done during university holidays, so I was there for a couple of months and then back to university. Toward the end of the following year I get a phone call: something was wrong with the system, it was allowing agents to overbook advertising slots. I diagnosed the problem over the phone and they put a fix in but management decided it was too late for the company to go back and cancel all of the ads that were already booked. This was not surprising as it was the most money they'd ever made. Conveniently, the parent company owned the cinemas so they did a deal where they just showed all of the ads that were booked.
Because of me, one December, everyone in the country who went to the cinema got to watch anywhere between 30 and 45 minutes of ads before the main presentation started.
Lesson learned: write more tests, monitor everything.
Haha. It was quite a long time ago otherwise I would have remembered the usual maximum booking time. I wouldn't be surprised if they exported the bug, given its success.
I remember going to the cinema in the UK once in December, and being shocked by the ads. For a brief moment, I thought that I had been a victim of the tale.
I bet > 66% of these are something to do with databases. :-)
My story (though I wasn't directly responsible): we were delivering our software to an obscure government agency. Based on our recommendation, they had ordered a couple of SGI boxes. I wrote the installation script, which copied stuff off the CD, etc. Being a tcsh aficionado, I decided to write it in tcsh with the shebang line
#!/usr/local/bin/tcsh
Anyways: we send them the CD. Some dude on the other side logs in as root, mounts the CD, and tries to run "installme.csh". "command not found" comes the response.
So he peeks at the script, and sees that it's a shell script. He knows enough of unix that "shell == bash". So he runs "bash installme.csh". A few minutes go by, and lots of errors. So he reboots; now the system won't come up.
The genius that he is, he decides to try the CD on the second SGI box. Same results.
In the script, the first few lines were something like:
set HOME = "/some/location"
/bin/rm -rf $HOME/*
Hint: IRIX didn't ship with /usr/local/bin/tcsh. And guess what's the value of "HOME" in bash?
In sh and derived shells, it sets the arguments ($1, $2, and so on). In this case you end up with $1 being ‘HOME’, $2 being ‘=’, and $3 being ‘/some/location’.
We were storing payment details sent from a PHP system into a Ruby system; I was responsible for the sending and receiving endpoints. Everything was heavily tested on the Ruby end, but the PHP end was a legacy system with no testing framework. Since the details were encrypted on the Ruby end, I didn't do a full test from end to end AND decrypt the stored results.
Turns out for two months we were storing the string '[Array]' as people's payment details.
Takeaway: If you're doing an end to end test, make sure you go all the way to the end.
~ 2007, working in a large bioinformatics group with our own very powerful cluster, mainly used for protein folding. Example job: fold every protein from a predicted coding region in a given genome. I was mostly doing graph analysis on metabolic and genetic networks though, and writing everything in Perl.
I had a research deadline coming up in a month, but I was also about to go on a hunting trip and be incommunicado for two weeks. I had to kick off a large job (about 75,000 total tasks) but I figured spread over our 8,000 node cluster it would be okay (GPFS storage, set up for us by IBM). I kicked off the jobs as I walked out the door for the woods.
Except I had been doing all my testing of those jobs locally, and my Perl environment was configured slightly differently on the cluster, so while I was running through billions of iterations on each node I was writing the same warning to STDOUT, over and over. It filled up the disks everywhere and caused an epic I/O traffic jam that crashed every single long-running protein folding job. The disk space issues caused some interesting edge cases and it was basically a few days before the cluster would function properly and not lose data or crash jobs. The best part was that I was totally unreachable and thus no one could vent their ire, causing me to return happy and well-rested to an overworked office brimming with fermented ill-will. And I didn't get my own calculations done either, causing me to miss a deadline.
Lessons learned:
1) PRODUCTION != DEVELOPMENT ever ever ever ever
2) Big jobs should be preceded by small but qualitatively identical test jobs
3) Don't launch any multi-day builds on a Friday
4) Know what your resource consumption will mean for your colleagues in the best and worst cases
5) Make sure any bad code you've written has been aired out before you go on vacation
6) Don't use Perl when what you really needed was Hadoop
Nice. I once needed to do reciprocal blast for the complete genomes of about 300 bacterial species. That's on the order of half a billion queries, but the work was embarrassingly parallel, and each discrete job only took about 90 seconds. I wrote a little shell script to kick them off on the cluster, and went home.
I woke up the next morning to several inbox screens' worth of messages from angry people I didn't know, demanding explanations for what I did to their jobs and their cluster. I don't think I have ever biked to the lab faster.
After multiple rounds of palm-drenching emails with the cluster sysadmins and the computational mathematics group PI (and my own boss agonizingly cc'ed), we determined the cause. The cluster sysadmins, lacking imagination for the destructive naivete of their users, had not foreseen that anyone would want to submit more than 10^4 jobs at once. That broke the scheduler, preventing other people from running jobs and me from canceling them. Meanwhile the blast jobs blew past the disk quota, leading to a Hellerian impasse where I somehow lacked the space to delete files so I could create space. I still don't fully understand it.
I believe it took a full day to get the cluster back online.
My team did something similar once. We pushed a version to UAT with all the dev debug logging still turned on. It filled up a solaris disk so badly the SA had to get a bus to the data centre and fix it in person.
Virtual goods? Fired, why? Ambien is known to do these kinds of things. Was it because the Adderall and Ambien combined made them think you were a risk due to drug abuse? I don't get the firing based on your description of the situation, to be honest.
Second web-related job, at an insurance company; I was 20 years old at the time. We were heavy into online advertising, mostly banners at the time (this was right around when AdWords started to get big). The company had just bought out all of the MSN finance section for the day -- it was a pretty big campaign ($100,000). We drove all the traffic to a landing page I had created with a short form to "Get a quote".
IT had given me permissions to push things live for quick fixes and such. I made a last-minute design tweak and, you guessed it, broke something. I was checking click traffic and inbound leads and realized traffic was through the roof but leads were non-existent. This was about 45 minutes after the campaign was turned on. I jumped on the page, tested it out, and got an error on submit. FUCK. I literally started to perspire INSTANTLY.
Jumped into my form and quickly found the bug (can't recall what it was, but something small and stupid), then pushed it live without telling a soul. Tested, worked, re-tested, worked. Ran some quick numbers to get a ballpark estimate of the damage I'd caused... several thousand.
Stood up and walked over to the two IT guys, mentioned I'd borked things and that I had fixed it... what should I do? I can still see the look on their faces. Shock, then smiles. Walked back to my desk, and about 10 minutes later my two bosses show up (I worked for both the dev & marketing managers).
They said thanks for catching the problem, not to worry. I did good by finding it myself, fixing it, and pushing it live. I was still sweating and shaking. They walked off, and later that day the marketing manager informed me MSN would refund us for the 45 minutes of clicks.
It took about a month before I felt competent enough to touch our forms again.
I was once in charge of running an A/B test at my work. Part of the test involved driving people to a new site using AdWords.
After the test was complete, I forgot to turn off the AdWords. (Such a silly mistake...) Nobody noticed until our bill arrived from Google, and it was substantially higher than normal. When my coworker came to ask me about it -- "are these your campaigns?!?" -- I just sank in my chair.
I think it cost the company $30k. I suppose it's not that much money in the grand scheme of things, but I felt very bad.
When I worked at ClearChannel back in 2010, we rebuilt Rush Limbaugh's site. When migrating over the billing system, I realized a flaw that granted at least 20,000 people free access to the audio archive ($7.95/month). The billing provider processed the subscriptions, but their system would only sync with our authentication database once a week with a diff of accounts added or removed in the past 7 days. You got the first 7 days free for this reason. If this process failed (e.g. due to a connectivity issue, timeout, or SQL error), all accounts after the error would not be updated. Anyone with a free trial or people who cancelled during a week with an error would get a permanent free trial. I rewrote the code to handle errors and retry on failure so that errors wouldn't happen in the future, but my downfall was running a script that updated all accounts to the correct status. Imagine angry Rush Limbaugh fans used to getting something for free now getting cut off (even though it shouldn't have been free). Management quickly made the decision to give them free access anyway, so I rolled back the change.
During a server migration for our web based file sharing system our lead engineer (at the time) forgot to ensure that all cron jobs (for cleaning up files and sending out automated emails) had been turned back on.
Cue me 7 months later reviewing the system, realizing that critical jobs were no longer running and that our users were all essentially receiving 100% free hosting for however much storage they wanted. SOOOO I turned the jobs back on.
The lead engineer before me left no documentation of what the jobs did, other than that they should be run. In my stupor I did not review the code. The jobs sent out a blast of emails warning that files would be deleted if not cleaned up or maintained. Then, seconds later, they deleted said files...
We nuked around 70GB worth of files before we realized what happened. WELL, GET THE TAPES! Turns out our lead engineer ALSO forgot to follow up w/ the system engineers, and the backups were pointed at the wrong storage.
No jobs were lost; thankfully the manager at the time was a wordsmith of the highest degree and could play political baseball like a GOD.
1) After a few months working in a bank, I was doing some simple admin check task via RDP to a Windows 2003 (no, maybe 2000) server, when I right-clicked the network icon and instead of clicking the properties option I clicked "disable". I had just enough time to say "oh sh!t" and to realise that it was the production Trading On Line machine, in a remote datacenter, during market hours, and to discover a couple of minutes later that the KVM-over-IP was crappy and not working. We had to call the datacenter operators to go back to the local KVM and re-enable the NIC.
Lesson 1: Better to move slowly when you're on a production machine (and having a plan B and C for reaching your machines is a good idea too).
2) Same bank, one or two years later. I was doing some testing on a new mail system that also integrated VoIP (SIP), running in a VM (I think VMware Server at that time) in the same remote datacenter as above. So I enabled the SIP feature and, after a few seconds, boom, we lost the whole (production) datacenter and the connection between the local server room and the datacenter.
Panic. I look at my colleague, WTF in stereo, everything comes back for a few seconds, then boom, down again.
Long story short, the issue was that that version of the NetScreen firewall's ScreenOS had a buggy ALG implementation for SIP that led to core dumps.
The fun thing is that we had two of those in HA, same version of course, so they kept bouncing between core dumping, the rebooting slave becoming master, and then core dumping again, etc.
We had to ask a datacenter operator to reach the rack, disconnect one of the cables from the firewall (the one that was managing the traffic of the DMZ where that machine was hosted) and then reach the virtual host to kill the machine.
Lesson 2: you can segment your network but if everything is connected through the same device(s), sh!t can still hit the fan...
Many years ago, when I was but a fresh faced idiot, the partition that contained the mSQL database which had All The Data filled up. I moved it into /tmp because there was plenty of space.
For those who don't know, Solaris uses tmpfs for /tmp. It is a virtual memory/swap-based file system. Anything in /tmp is actually temporary if the machine reboots/powers off.
I like setting this up on Linux machines too. There are tons of ephemeral files that get written there, depending on your usage case, and I'd rather not waste the IO for writing pids to lockfiles. Disk is cheap, but RAM is fast and cheap. :)
My worst would have been catastrophic if I had waited one minute to make my mistake.
I was commissioning a new control system at a power plant's water treatment facility. I was fairly new to the industry and had mostly looked over the shoulder of the guy who did the bulk of the work, as on-the-job training.
This particular day the guy was out sick and we had to finalize a couple of things before we ran through the final tests.
There was an instruction to open a valve to fill a tank, and it had the wrong variable linked to it. The problem was that, to maintain the naming standards, I had to do a download to the processor to make the change. When I had been doing work in the office this was not a big deal: download the program to the processor, it stops running for a moment while it loads the new logic into memory, and it starts back up.
Not thinking through the implications of the processor shutting down while the process was up and running I made the code changes, hit download and about 30 seconds later an operator came running over looking like he had seen a ghost and he was pissed.
While I was making my code changes the operator was hooking up a hose to drain a rail car of some chemicals. The way the valves were configured before I made my changes was correct and would have had no consequence it I didn't touch anything. The way the valves were configured when the processor restarted would have routed the rail car's contents to the wrong tank resulting in a reaction which would have created a huge plume of highly toxic gas. The way the wind was blowing this plume would have blown directly to the largest town in the area and could have killed a ton of people.
The operator heard the valves in question changing position before he opened the valve on his hose to empty the rail car and figured something was up. When he saw the whole process had shut down he got really angry because I had ignored the protocol in place to avoid such a disaster.
I got chewed out and kicked off the site. My boss attributed my mistake to inexperience and I had to give a safety presentation on what I did wrong.
Lessons learned:
Be sure you are aware of any implications your actions have. If you are unsure or guessing about something stop what you are doing and go ask someone first.
Don't give people mission critical work on their first project and have them work unsupervised. Training is important.
Always be aware of safety requirements, especially when you are working with machinery, automated processes, chemicals or anything else that can hurt, maim or kill you.
Tangential to this, in the middle of the dot-com boom, a downsizing. Company meeting: "Here's what's happening, layoffs, etc, etc. When you go back to your desk, if you find your login is disabled, take a moment to pack your desk and come back to the Conference Room."
That's how they "notified" people. You got to play a lucky jackpot game of logging into your desktop after the meeting to see if you still had a job. Straight out of Dilbert.
I was doing HVAC work while I was in college and we were removing an old air handler from underneath a house. Just inside the crawl space, under the access door, was a water pipe. My boss told me to make sure I held it down while we slid the air handler out through the hole. I lost my grip on the pipe and the air handler snapped it in two, at which point gallons of water began to gush into the crawl space.
I ran for all I was worth to the road, which in this case was about 600 feet away, to turn off the water at the water meter. I ran up and down the road in front of the house and never found the water meter. So I ran back to the house and inside and told the homeowner who promptly informed me that they used well water. She called her husband and he told us where to turn off the well pump.
It wasn't really that bad in the grand scheme of things but letting the homeowner's water gush under the house for about 15 minutes does not bode well when you are supposed to be there to fix problems not create them.
One time I tried to change a column name in a production database. I learned that when you change a column name, MySQL doesn't just change a string somewhere; it creates a new table and copies all the values from the old table into the new one, and when that table has millions of rows in it, that really slows down your production server.
Leading a group working in an underground bunker on a live military radar site in the Australian outback, where it rains every few years. We had to open a rooftop cable duct and when the job ran overtime we closed it up with some rags that were to hand. That night it rained.
The next morning, the bunker was full to ground level and the automatic power cutoff had failed, as the float switch was directly under the cable duct and the water pressure of the deluge had kept the float depressed. By the time the water stopped flowing the float was under a foot of mud. The powered circuits were undergoing electrolysis and eating themselves away, made worse by the site managers refusing to drain the bunker or turn off the power until a week-long arse-covering evaluation had been completed.
A few hundred million dollars of front line radar was out of action for several months.
Being a naive newly graduated engineer, I wrote a completely honest report and analysis. My boss said it was one of the best reports he had read and there was no impact on my career (if anything it got me noticed by the upper echelons of the organisation).
Lessons:
1. If you tell the truth you will be respected, even if it is incriminating.
2. If there is a way for something to go wrong it can do so (slight variation of Murphy's Law). Even if it's judged to be uneconomic to take preventative action, be aware of the possibilities, so you can make a conscious decision about the risk.
First job, circa 2000, at an ISP that was run very clearly as a business and cutting corners. Not only was it critically understaffed, but management was more interested in laughing their way to the bank than in managing. They had me - with literally no routing protocol experience - manage a live route advertisement transition between two peering providers. Result: all customers offline, ~24 hours.
The reaction was pretty standard: mostly pointing out that I'd done my best in unfamiliar territory and that things would be sorted soon.
Takeaways were: (1) fewer support calls than expected - users put up with things; (2) you learn when you fail; (3) always have a backup.
They kept me on at that job but I left pretty soon anyway, as I got a 'real' (as in creative) job hacking Perl-powered VPN modules for those Cobalt RaQ/Qube devices, and building a Linux-related online retail venture for the same employer ... that worked great, but failed commercially.
I worked at an ISP in NY around 1997-2004; we also had the RaQ/Qube devices, and I had to manage stuff I was not familiar with :) I learned so much by trial by fire.
I sent an email to three thousand insurance agents informing them of the cancellation of policy number 123456789, made out to Someone Funky. I learned to appreciate Microsoft Outlook's message-recall function, which got most of them. I also learned that just because you're using the test database instance doesn't mean nothing can go wrong.
Back in my younger days, I once had a project manager who was asking me to make a significant network infrastructure change but refused to tell me why the change was necessary and basically told me to do as I was told. I messaged a coworker to see if he knew what was going on, and dropped in that the PM was being a "fucking cunt." I was unaware, however, that the co-worker and the PM were troubleshooting an issue together and the PM was staring at his screen as my message came through.
The PM brought the issue to the CTO, but somehow I didn't get fired. Ended up apologizing (obviously a poor choice of words :)) and moved on. Never made that infrastructure change.
Key takeaway: if you're going to talk shit, don't do so in writing. ;)
Also, I had a friend with a similar (perhaps worse) story. His company sent every employee an e-mail about being on time, to which he pushed Reply to All and typed "FUCK YOU!" He laughed to himself and went to push the Discard button, but accidentally hit Send in the process. He was about to try to ExMerge it out of Exchange when he heard a BlackBerry vibrate and realized that no amount of ExMerge would get it off the BBs. He spent the next hour going door-to-door apologizing, and also managed to not get fired.
Damn, iMessage is a shocker for this. A coworker sends out a group message saying the big boss wants xyz and includes the big boss in the group. Someone always misses that it's a group message and sends an expletive or sarcastic reply. I am also guilty.
I was in a remote meeting and failed to realise my laptop's camera was broadcasting. A roomful of people saw me, clad in horrid workout clothes, jam my finger up my itchy nose and scratch my balls.
And make sure your phone is muted. The first conference call is easy. When you have one every day for a year and it becomes so commonplace... sometimes you forget. I've heard people on my team coughing obnoxiously, yelling at other drivers, doing the dishes, etc. Mute your shit, and tell your teammates immediately when they aren't muted.
When I first started my professional career, I was a field engineer in the oilfield, working on drilling rigs around Texas. There was some amount of computer stuff, but a lot of hardware work too. One of the things that we had to do was install a pressure sensor on the drilling mud line, which is normally pressurized to around 2k psi with water or oil-based drilling fluid.
This sounds like a simple task, but it gets complicated by the variety of pipe fittings and adapters available. Our sensors are a particular thread type, and we have to find a free slot to install them, and come up with any pipe connection converters necessary to install them there. Another tricky part is that the rig workers who actually know about all of this stuff are often not particularly eager to help out.
So on one particular job, the only free slot to install the sensor is a male pipe fitting, capped with some sort of female plug. Our sensors are male in that pipe size, so I need a female-female adapter to install it. I go looking around and come up with one, not paying too much attention to it. I install it, and everything seems to go more or less smoothly. We go on drilling with this installed for like a week or two.
One day, the rig manager comes to find me and ask me about this adapter that I used. He tells me that it is meant for drinking water lines, and is only rated to 200 psi. And had been installed on a 2000 psi line for weeks. My jaw dropped in shock - I have no idea how that adapter didn't fail, and it's entirely possible it could have hurt or killed somebody if it did.
They sent one of their guys to find an adapter that was actually rated for the pressure and replace it, and never said much else of it. No telling how much trouble I could have been in there if anything else had happened. It did make me a lot more safety-conscious.
Happened to a colleague: it was the end of the day, and we were packing up to leave. He used Ubuntu on his notebook, so he typed "shutdown -h now" in his shell prior to closing the lid. Seconds later he's groaning, having noticed it was an SSH session to the production server...
It wouldn't have been a big deal if it weren't for the fact that it was an EC2 instance, and back then halting the instance was equivalent to deleting it permanently. We then spent the night at the office recovering and testing the server. I think we left at 3:00 AM.
Lesson #1: it's never a good idea to "shutdown -h now" on a shell. any shell.
Lesson #2: have the process to spin up a new production server fully automated and tested
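In the spirit of Lesson #1, here is a hedged sketch of one possible guard, assuming a Linux box reached over SSH; the wrapper and its behaviour are my illustration, not what the poster did (Debian's molly-guard package takes a similar approach):

    #!/usr/bin/env python3
    # safe-shutdown: refuse to halt the machine from an SSH session unless the
    # operator retypes the hostname. A sketch, not a hardened tool.
    import os
    import socket
    import subprocess
    import sys

    if os.environ.get("SSH_CONNECTION") or os.environ.get("SSH_TTY"):
        host = socket.gethostname()
        answer = input(f"This looks like an SSH session to {host!r}. "
                       f"Type the hostname to confirm shutdown: ")
        if answer.strip() != host:
            sys.exit("Aborting: hostname not confirmed, leaving the box up.")

    # Only reached on the local console, or after explicit confirmation.
    subprocess.run(["shutdown", "-h", "now"], check=True)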
Excellent if you have complete and thorough control of every server you touch, but if you don't, it could be dangerous to rely on. Murphy's Law means it'll be that one dang machine that gets shut down...
Personally, I'd think that training this is a lot more portable.
I once did a "shutdown -h now" on a remote server on a customer site with nobody there who knew anything about the server when I'd meant to do a reboot....
In terms of feeling bad, I once had a client who wanted to demo a multimedia project that we currently had in alpha on his Windows 3.11 laptop, but the sound drivers weren't working properly (everything else was fine). He had about an hour before he had to leave for the airport. I started monkeying with the four horsemen of the apocalypse (Win.ini, System.ini, Autoexec.bat, and Config.sys) as I had many times before, but I screwed up saving backups, bricked his machine, and couldn't fix it. In the end it was more embarrassing than anything else, but it was a facepalm-stupid mistake.
The lesson from this is pretty obvious. Backup. Make sure your backup is good and safe.
My worst work-related mistake was getting into business with a friend. It cost me the friendship, a very valuable client, and a good portion of my retirement savings. I'm not sure how related it was, but a few years later my (former) friend killed himself.
And the lesson here is not to go into business with friends. Or at least to set up the business as if you're not friends.
Around 2000 my team was responsible for installing and maintaining a large number of servers in 19" racks in a data centre.
Most servers had those hot swap drive bays for convenient access from the front while the server was running. You only had to make sure no write operation occurred while you pulled the drive out of the bay.
So, I had to exchange a backup disk on a database server running quite a few rather large forums. The server had two disk bays: One for the live hard disk and one for the backup disk. I was absolutely sure at that time which one was the backup disk so I didn't bother to shut down the database server and incur a minimal downtime. Of course, I was wrong and blithely yanked the live disk from the drive bay.
I spent the rest of the night and most of the following day running various MySQL database table repair magic. It worked out surprisingly well but having to admit this error to our forum users was embarrassing, nonetheless.
Lesson: Appropriately label your servers and devices.
I ended up as the architect for a new live show we were putting on. You could either pre-purchase some number of minutes or pay per minute; it was like $4.99/minute or something insane.
The billing specs kept changing, as did the specs for the show itself. New price points, more plans, change the show interface, add another option here, etc. The plan had been to do a free preview show the day before to work out the kinks. That didn't happen.
The time leading up to show start was pretty tense: lots of updates, even a few last-minute changes! Then the show actually started; brief relief. The built-in chat system started deleting messages; one of those last-minute feature changes had screwed up automatic old-message deletion. We had a fix though: update the JS, and bounce everyone out of the show and back in so the JS updates. Fixed!
Then the CEO pointed out that the quality just kept getting worse. Turns out that while the video player had both a numeric value and a string description for the different quality levels, it assumed they were in ascending order. So once it confirmed it could stream well at a given level, it automatically tried the "next" one, which streamed fine but was actually lower quality. Poor quality for everyone. Fixed, and another bounce.
Then it was over, time to go home. Back in the next day to finish off the billing code. I decided to approach it like a time card system: traverse the logs in order, recording punch-in times; when someone punches out, look up their punch-in time and set that user's time spent to the difference. Then remove the punch-in and punch-out from the current record so they're not used again.
Now two facts from above added up to a pretty serious bug.
1) I _set_ the time spent to the difference between the two times. Not added, set.
2) We bounced everyone from the show twice to update their JS, and video player. So everyone had multiple join/parts.
I under-billed customers by tens of thousands of dollars.
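A minimal sketch of that set-vs-add bug, with made-up log records (nothing here is the actual billing code): because every viewer was bounced out and back in, each user has several punch-in/punch-out pairs, and assigning on punch-out keeps only the last one.

    from collections import defaultdict

    # Hypothetical join/part log: every viewer was bounced out of the show and
    # back in, so each user has more than one session.
    events = [
        ("alice", "join", 0), ("alice", "part", 10),   # before the first bounce
        ("alice", "join", 11), ("alice", "part", 30),  # after rejoining
    ]

    def minutes_buggy(events):
        joined, total = {}, defaultdict(int)
        for user, kind, t in events:
            if kind == "join":
                joined[user] = t
            else:
                total[user] = t - joined.pop(user)   # SET: throws away earlier sessions
        return dict(total)

    def minutes_fixed(events):
        joined, total = {}, defaultdict(int)
        for user, kind, t in events:
            if kind == "join":
                joined[user] = t
            else:
                total[user] += t - joined.pop(user)  # ADD: every session accumulates
        return dict(total)

    print(minutes_buggy(events))  # {'alice': 19} -- only the last session is billed
    print(minutes_fixed(events))  # {'alice': 29}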
Things I learned:
- Don't just argue that you need a trial run, make sure management understands the benefits. Why, not What.
- Duplicate billing code. After that, a co-worker and I wrote two separate billing parsers, one designed to be different, not efficient.
- Give yourself ways to fix problems after they crop up. The bounce killed my billing code, but not doing it would have damaged the actual product (which later became a regular feature). Wish that thing had been my idea.
Last day of work before moving to the new job: I do some cleanup and rm -fr my home directory. Seconds pass. Minutes pass. I start to wonder how it can take so long.
I list the contents of my home directory, trying to understand which folder was so big. Then I see it. A folder that is usually empty. Empty because I use it as a generic mount point. A mount point that the day before had been attached via sshfs to the production server...
I had a strange feeling, as if I was seeing myself from behind, something crumbling inside me. And at that moment someone starts to ask, "What happened to <hostname>?"
I gather my courage and say, "I know what happened"...
That was really hard. The worst day at work in years, and on my last day too. Luckily we had a good enough backup strategy and the damage was mostly repaired in a couple of hours.
That's when I realized how much of an idiot I was to have mounted the production server under my home directory, and I grew a little.
There's a saying in the rates market: "don't counter-trend trade the front end".
I lost $7 million in minutes by being short $700 million of US 2yr notes when the levees failed during the Hurricane Katrina disaster.
Although my bet that the 2y point would be under pressure in the intermediate term turned out to be true, I got carried out by fund flows as folks spazzed out to cut risk by rolling into short duration high quality paper.
To his credit, my boss, who sat across from me, said only: "wouldn't want to be short 2 years." He let me make the call, which I did, and I covered my position. (Ouch.)
My book was up considerably on the year already, but this was a huge hit, and nearing year-end. I dialed back the risk of my portfolio and traded mostly convex instruments (options) for the remainder of the year.
yup. that really happened.
It was 4-5 am and I'd been working all night. I was on the server trying to set something up and was trying to blow away a folder ... I did a normal rm and that didn't work (obviously) because there was crap in the folder. So I pulled out my nuclear weapon to nuke the folder but left off the preceding ./ (which still wasn't that smart anyway) ... I sat there for a second wondering why the deletion was taking so long ... then another 30 seconds, then a minute ... then I looked at what I'd just typed again ... then I realized what had happened.
I Ctrl-C'ed (or D, can't remember now) out of it, then tried to find the root folders:
cd /etc
=> folder not found
cd /var
=> folder not found
I'm from a third world country where we laugh at Americans (sorry) for throwing up when they're nervous or having panic attacks, but at that moment, I had a full blown panic attack. I'll never forget it.
The work was a subcontract for a client who was doing work for Nike, and it was a decently sized project that was critical to the success of the firm, and I'd just blown away their live production server ...
After freaking out and almost crying for 5 minutes, I decided to call Media Temple support (we were using one of their VPS servers) ... and by the biggest absolute stroke of luck they'd just backed up the entire server, not even 2 hours prior to my madness. $100 for a full restore (I don't recall why), and would I like to do that?
HECK YES I WOULD!
So they restored the server for me. I wrote an email to the head of the small company I was doing all the work for, explaining what had happened and telling him I'd sent over a check for $100 to cover the backup because it was my fault. He was obviously very relieved and never cashed the check I sent.
I still get chills thinking about that exact moment when I thought I'd fucked up my career and reputation for good.
The following was not actually me, but worth sharing.
They had ASIC design runs for research purposes once every three months, yielding your design on silicon as ten 6" wafers. That gives you enough parts for testing the first revision of your design. The person was carrying the wafers to a vendor for cutting into separate ICs and packaging or something. He gets to the parking lot, and where are the keys? He puts the wafers on the top of the car, finds the keys in his pockets and starts driving. Boom: the box of wafers, still on top of the car, is now on the ground. All broken. Some $100K in wafers + three months lost + losing face before the customer + ... Lesson: Don't put stuff on the top of the car!
I don't think that is the lesson; I think the lesson is that it is clearly a two-person job. One person to carry the wafers, the other person to remove any hazards / open doors / double-check everything.
It's easy to say that after the fact. There are many delicate tasks people handle as a part of their day-to-day jobs, and not every task can afford more people to help without increasing the costs.
Demonstrated SQL injection to a colleague on the live website. Bringing a sample URL up into the address bar, I explain, "You see, that ASP script takes the value of ?urlparameter and updates the record - but what if I modify urlparameter so that instead of 1, it is... (types) semicolon dash dash DROP TABLE usermaster (presses enter)"
"Shit. Well, as I have just demonstrated, it becomes possible to wipe out a million user login credentials at the touch of a button. So now we'll be needing to restore that from the backups which we don't have." Luckily, and ONLY BY CHANCE, I happened to have a copy of that table exported for other reasons from a few days back.
A long time ago while working on a *nix box logged in as root, I executed a simple "!find". Basically execute the last find. In root's history, the last find command was something like "find ... -exec rm ...". The command was run at the root of the content directory of a CMS, deleting all the content (major media website). CMS was down while backups were restored.
I now never execute ! commands as root. Actually, nowadays I simply use CTRL-r.
My first deploy at a once-top-10 photo hosting site as a developer was a change to how the DNS silo resolution worked.
Users were mapped into specific silos to separate out each level of the stack from CDN to storage to db. There was a bit of code executed at the beginning of each request that figured out if a request was on the proper subdomain for the resource being requested.
This was a feature that was always tricky to test, and when I joined, the codebase didn't have any real automated tests at all. We were on a deploy schedule of every morning, first thing (or earlier, sometimes as early as 4am local time).
By the time the code made it out to all the servers, the ops team was calling frantically saying the power load on the strips and at the distribution point was near critical.
What happened: the code caused every user (well upwards of millions daily) to enter an infinite redirect, very quickly DoSing our servers. It took a second to realize where the problem was, but I quickly committed the fix and the issue was resolved.
Why it happened: a pretty simple string comparison was being done improperly; the fix was at most one line (I can't remember the exact fix). There was no automation, and testing it was difficult enough that we just didn't test it.
What I learned: if it's complicated enough that you don't want to test it with a browser, at least build automation to test your assumptions. Or have some damn tests, period. We built a procedure for testing those silos with a real browser as well.
I got a good bit of teasing for nearly burning down the datacenter on my very first code deploy, but ever since, it's been assumed that if it's your first deploy, you're going to break something. It's a rite of passage.
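For the "at least build automation" lesson, here is a minimal sketch of the kind of check that would have caught the loop; resolve_silo_redirect() and the hostnames are hypothetical stand-ins, not the site's real code.

    # Hypothetical silo resolver: return the subdomain to redirect to, or None
    # if the request is already on the right silo for this user.
    def resolve_silo_redirect(host, user_silo):
        expected = f"{user_silo}.photos.example.com"
        return None if host == expected else expected

    def test_correct_silo_passes_through():
        assert resolve_silo_redirect("silo7.photos.example.com", "silo7") is None

    def test_redirect_terminates():
        # Following the redirect once must land somewhere that needs no further
        # redirect -- otherwise every request loops forever and DoSes the site.
        target = resolve_silo_redirect("www.photos.example.com", "silo7")
        assert target is not None
        assert resolve_silo_redirect(target, "silo7") is None, "redirect loop!"

    test_correct_silo_passes_through()
    test_redirect_terminates()
    print("silo redirect checks passed")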
When trying to put our webserver-cum-database-server onto Nagios, I tried to apt-get install nagios-plugins. For some reason when installing that, apt wanted to remove mysql-server. I just pressed "Y" without thinking (because, hey, it's like 99.9999999% the right thing to do). So apt dutifully stopped and uninstalled MySQL in the middle of the day.
Within about 2 minutes the CTO strolls in asking about the flood of exception emails due to each request being unable to connect to the database.
Thankfully, I was able to apt-get install mysql-server, all the data was still there, and things were back to normal within 5 minutes.
I messed up epically on an interview. It was a 3 part interview for a JS/RoR coder.
1. I passed the resume and chat portion
2. I passed the telephone questionnaire and got along great with the interviewer
3. (Fail) I scheduled my interview on a Friday at 4:30pm, and there was 30 min of travel time. I left 1 hr early... still, it was Memorial Day weekend, so I thought the streets would be quicker than the freeway, which was at a standstill. I was so stressed that I literally had an anxiety attack and couldn't even find the address. That had never happened to me before, so I'll never forget it.
This one is really embarrassing. I started a new job for a small company as the only developer, with the aim of creating a new site for them. They gave me full access to their very small technology stack, which included one MSSQL server.
So one of the first things I wanted to do was set up a development DB, for which I exported the structure from their prod DB. I then changed the database name in the CREATE DATABASE statement at the top to the new dev DB I wanted and ran the script.
Unfortunately the prod DB name was still prepended to every DROP and CREATE TABLE command in the script, so I had just replaced their whole prod DB with an empty one.
Owning up to that was one of the most embarrassing moments of my career. It was such a rookie mistake I just wanted to die. Luckily they had daily backups so I only cost their 4 man business about half a day of work but... it was enough for me to be a much more careful developer from that day forward!
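A minimal sketch of a guard against exactly this, with made-up names (PROD_DB and the script path are assumptions, not from the original story): refuse to run a schema script that still mentions the production database.

    # Sanity check before running an exported schema script against a dev server.
    PROD_DB = "prod_db"          # hypothetical production database name
    SCRIPT = "dev_schema.sql"    # hypothetical exported schema script

    with open(SCRIPT) as f:
        sql = f.read()

    if PROD_DB.lower() in sql.lower():
        raise SystemExit(f"Refusing to run {SCRIPT}: it still references {PROD_DB!r}")

    print(f"{SCRIPT} looks clean; safe to run against the dev database.")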
It was 1998. I was young and foolish. No, seriously, I was 19. I was also SUPER convinced that it wouldn't work :) I have learned to be waaaay less convinced since then.
Linux will indeed let you hang yourself. I maintain that every experienced Linux user has seriously messed up their system at least once. More often personal / dev / test environment than production though...
I respectfully disagree. I was a junior person who was given root-level privileges on a production server. There were many layers of process and management that weren't in place. Today, I am a senior person at a multi-billion dollar company. I would never call a genuine mistake by a junior person "professional suicide". I would scold them, and then I would try to figure out how we got into a situation where someone unqualified could cause something so bad to happen.
You'd need a magnifying glass to detect it ever happened, looking on any chart of interest to me. The software got modestly better after I spent a solid two weeks on improved fault tolerance and monitoring.
Lesson the last: It's just a job/business. The bad days are usually a lot less important in hindsight than they seem in the moment.