Devops Horror Stories (statuspage.io)
153 points by stevenklein on Oct 31, 2013 | 96 comments



My Devops horror stories, one sentence each:

- Somebody deployed new features on a Friday at 5pm.

- Fifteen hundred machines running mod_perl.

- Supporting Oracle - TWICE.

- It turns out your entire infrastructure is dependent on a single 8U Sun Solaris machine from 15 years ago, and nobody knows where it is.

- Troubleshooting a bug in a site, view source.... and see SQL in the JS.


I built a system where our developers can do instant deployments of any of our software packages (and instant point-in-time rollbacks), and then do zero-downtime restarts of services.

Now we deploy dozens of times a day and I never get called on a Friday night because someone did something stupid.

Edit: I do get called when I did something stupid and it broke the deployment system. But that's gotten much rarer lately.
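
The core of a system like that can be surprisingly small. A minimal sketch of the symlink-swap pattern, assuming self-contained release directories and a service whose master process reloads workers on SIGHUP (all paths and names below are hypothetical):

    #!/usr/bin/env bash
    # Sketch of an atomic deploy with point-in-time rollback.
    set -euo pipefail

    APP_ROOT=/srv/myapp                          # hypothetical layout
    RELEASE="$APP_ROOT/releases/$(date +%Y%m%d%H%M%S)"

    mkdir -p "$RELEASE"
    tar -xzf "$1" -C "$RELEASE"                  # unpack the build artifact

    # Atomic cutover: create the symlink beside the target, then rename over it.
    ln -sfn "$RELEASE" "$APP_ROOT/current.new"
    mv -T "$APP_ROOT/current.new" "$APP_ROOT/current"

    # Zero-downtime restart: ask the master process to reload its workers.
    kill -HUP "$(cat "$APP_ROOT/run/app.pid")"

    # Rollback is the same cutover pointed at any older release directory.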


Would love to hear more about this system, sounds really cool!


I've built a similar system, around 5 years ago. Users were able to deploy any version to any cluster from a nice UI. Basically you could select any version, click the "Install" button and follow the logs in real time.

Behind the scenes it was a decentralized continuous delivery system. Very cool stuff, highly automated. Reduced a lot of work and sped up development cycles from months to minutes. Served quite a large software development organization (1000+). I think we had 1000 servers in 5 datacenters around the globe.

Nowadays I'm working on an open source version of that system; it's still missing a few critical features, but hopefully I'll get the first release out next spring.

btw, I'm looking for projects/contacts that would be interested in trying the system out to see how it fits their needs.


Hey Mikko; I'm a sysadmin at a research university, and I'd be very curious to at least "pick your brain" about your tool. I can't make any promises about actual usage, but I always love to see a novel approach to relevant problems.

Do you have any sort of github/project page?


Hi

Cool, sysadmin at a research university sounds like a nice position to be in.

Yes, it's already on github. Unfortunately, since it's missing those critical features, it's not easy to see how the whole system is going to work. If you want to talk, just drop me an email at gmail; mikko.apo is the account.


I really hate the idea that deploying on a Friday afternoon is a bad idea. It's only bad when you have shit developers or shit processes that don't catch broken code.

Personally, I think it's better to release at 5pm on a Friday. Once people stay late a few times to fix their broken shit they'll be smarter about not checking in crap.


> It's only bad when you have shit developers or shit processes that don't catch broken code.

Or when the bug is only triggered in specific user profiles.

Or when all the devs went on a retreat in the mountains with no cell service.

Or when a dev makes a mistake (which we know never happens to even the best devs)

Or when the only developer who knows which of the 1000 pushed changes could be the one breaking things has turned his phone off.

Or when a flaw is discovered in the process for the first time (which we know never happens because everyone's process is perfect, until it isn't)

Or how change management's requirement that the fix be tested and verified by all affected teams might have people staying a few hours after 5pm on a Friday when they just want to get their weekend started.

Or how 10 different people from 10 different teams might need to be called and kept to work until 2am because the change can't be pulled because the database was already modified and the old client data is already expired from cache and a refresh would destroy the frontend servers.

Or another reason.


Yes, this! "Good code" and a CI box and deployment automation and some chef recipes don't spell ultimate success.

It drives me nuts when people tell me off for saying 'yeah yeah, no, automating our entire infrastructure of 5 servers isn't really worth it right now', like I'm some unprofessional bozo.

I pretty much have experience with all but one or two of your suggested scenarios, and by now I have no patience for annoying software developers who think that using chef or puppet somehow sufficiently embiggens them to run ops on their own (of course dev ops is almost a political assault on existing ops guys, not merely a nice new solution to existing problems).

Sigh. This is why I don't work on teams these days (if I can help it).

EDIT: Though I also agree with the sub-parent that deploying at 5pm is fine in certain teams and certain projects. The most important thing is: are the people pushing the deploy going to own it? Are they going to hang around for another 60 minutes to check everything is OK? Are they going to be available at 10pm or on Saturday if something goes wrong, and are they going to own it? If the answer is no, then nope, don't do it.


I'd agree that in a perfect scenario, you should be able to push code at any time confidently. But for many companies and projects this is not really available. As well, in many organizations the person who has to fix broken stuff is not the same as who develops and pushes code. I'm not saying that's a good thing, but it is reality for many people.

Even if it only happens once in your career, once you've had a dev push out code at 5pm friday night, jet out the door and hit the bar, meanwhile you (the sysadmin/ops on call) get woken up at 1am by site down alert, and have to debug/rollback the changes while the dev who pushed them is unreachable, you learn to really avoid friday evening pushes. Fool me once...


Amen Brother

It's sad that most places don't have a proper technical copy (with a full copy of live data) to do full tests on. TDD is all very well, but you need to test the entire system.


Yeah, because all problems are foreseeable and only ever caused by crap code... right.

No matter how great your processes and your code are, no test can catch everything that can go wrong in a live environment, and doubly not if your system interfaces with anything third party.


It's not always the person who pushed it on a Friday that ends up fixing it, though. They can be unreachable, without a computer, etc etc. It's just easier to change less during hours when you have fewer people on hand, is all.


Maybe it's me, but I have no problem staying late on a Friday to fix my screw-up. However, I'm terrified of having to fix something Monday morning while everyone else is watching.

But the real reason we deploy weekday mornings is so everyone is on deck and we can get outside help if required. When I was doing system integration, the problem was never in my code, it was the vendor's. Testing can only get you so close to the real world.


Yeah, with "real" CI which some people seem to hate, checking in bad code becomes the problem, which seems way better than just waiting to deploy.


- It turns out your entire infrastructure is dependent on a single 8U Sun Solaris machine from 15 years ago, and nobody knows where it is.

How did you locate it? Measuring ping latency from other machines?


I'd imagine tracing routing, and then MAC address tables would be rather a lot faster, and more accurate.


Indeed :-)


Automated tests trigger automated security, all admins are banned, user groups are automatically notified that certain admins are no longer admins.


I read the last two lines, threw up under my desk and blacked out.


We won't deploy any code even after Wednesday afternoon. Don't want to get ourselves into any trouble.


fun stuff


> - Troubleshooting a bug in a site, view source.... and see SQL in the JS.

this is why I refuse to do "View Source" on the HealthCare.gov website. I'm afraid of what I might see.


I'm just gonna leave this right here:

        if ('en' === 'en') {
            $('#desktop-nav .middle').append('<a class="liteac-login topnav myprofile" href="/marketplace/auth/global/en_US/myProfile#landingPage">My Profile</a>');
            $('.mobile-nav-right').append('<a class="mobile-right-bottom liteac-login" href="/marketplace/auth/global/en_US/myProfile#landingPage">My Profile</a>');
        } else {
            $('#desktop-nav .middle').append('<a class="liteac-login topnav myprofile" href="/marketplace/auth/global/es_MX/myProfile#landingPage">My Profile</a>');
            $('.mobile-nav-right').append('<a class="mobile-right-bottom liteac-login" href="/marketplace/auth/global/es_MX/myProfile#landingPage">My Profile</a>');            
        }


Ah, now I see why it has 500 million lines of source code.


and costs $200M+

because, you know, static HTML and some CRUD/lookup logic behind-the-scenes is just that hard


Seen this pattern many times before. One of those 'en' strings is the current user's language being written into the source, the other is hardcoded. If your server-side templating engine is impotent and only supports variable interpolation without conditionals, this approach is easier than pulling the right JS snippet from somewhere else.
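
A sketch of how that pattern can fall out of interpolation-only templating, using a plain sed substitution over a hypothetical template fragment:

    # nav.js.tpl (hypothetical): the engine can only substitute variables,
    # so the language check has to live in the emitted JavaScript.
    #
    #   if ('{{USER_LANG}}' === 'en') {
    #       $('#desktop-nav .middle').append('... en_US ...');
    #   } else {
    #       $('#desktop-nav .middle').append('... es_MX ...');
    #   }
    #
    # Rendering the page for an English-locale user bakes the comparison in:
    sed "s/{{USER_LANG}}/en/" nav.js.tpl    # emits: if ('en' === 'en') { ...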


Temporarily mounted an NFS volume to a folder under /tmp.

Forgot about tmpwatch, a default entry in the RHEL cron table to clear out old temp files.

4AM the next morning: recursive deletion of anything with a change time older than n days.


/mnt and /media exist for reasons. And root_squash and ...

Why no, I've NEVER accidentally deleted whole file systems, I have completely earned superiority here.

Deleted /proc and /dev on a running server. Thankfully not really disastrous, but damn if people don't notice right away.

Thanks for the tmpwatch info btw.


Long ago I was trying to solve some thorny problem that I now forget and I thought it might be a good idea to uninstall and reinstall glibc. I knew it was a bad idea but I was at the point where I was past caring and figured why not see what happens. Turns out, in order to uninstall glibc there's a confirmation prompt, and instead of just 'y' or 'n' you have to type a whole sentence that's something like "yes, I understand this is a really bad idea". Well, I did that, and it was a really bad idea, the system was, unsurprisingly, basically unusable after that.


Would have used /media, but I was thinking that if, say, I forgot to unmount it, or someone looked at a disk-free listing or whatever, it would be obvious that it was only there temporarily.

Obviously that was incorrect, but the reasoning was, I think, sound.


I like /mnt/scratch/ for that. If I was using systems with lots of others, I'd make it clearer with /mnt/tmp/.


You can't delete /proc, it is a pseudo filesystem the kernel creates. They aren't really files. Deleting /dev is a real pita however


With udev deleting /dev shouldn't be a problem I thought?


udev has been deprecated. devtmpfs, a pseudo filesystem just like tmpfs, has sort of replaced it. Gotta love Linux, they change things just for fun :)


I did something similar. I moved some files around between a university computer and an SFTP mount (mounted through the ubuntu UI). When I leave those computers, I always execute "rm -rf ~" because they are restored to some image every time they boot anyway and I'd rather not leave anything personal behind. It was only when I started seeing "access denied" in my terminal that I realised that ubuntu had mounted the entire server on the other end in some hidden directory in my home folder, and that I was deleting every file on it for which I had write permissions. Luckily a quick CTRL+C saved my own files but I'm not sure the same could be said about a few students who were unlucky and had "loose" permissions in their home directories.


So tmpwatch traverses filesystems. Is that a bug? (thinks: "-xdev").

Edit: Reading tmpwatch.c: "Try hard not to go onto a different device". Perhaps old bug?


Actually yes, I think it was patched to not traverse file systems afterward. It was around 2005 when this happened.
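
For anyone stuck on an older tmpwatch, a find-based equivalent of that guard might look like this (the 10-day cutoff is illustrative):

    # -xdev keeps find on /tmp's own filesystem, so anything mounted
    # underneath it (like that NFS volume) is never descended into.
    find /tmp -xdev -mindepth 1 -ctime +10 -delete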


[deleted]


He says a folder _under_ /tmp/


Whoa.


Tape Archive System: write a tape, restore it again, and do MD5sum against the original data. Then we know it can be restored correctly, and the original data is deleted.

Should be bulletproof?

Alas, the 'write to tape' scripts I'd inherited didn't warn if they couldn't load a tape into the drive.

There was a tape jammed in the drive, so the tape robot was refusing to load any new tapes, but kept on writing and restoring from the same tape over and over again.

Stupidly, we didn't do any 'check a tape from 3 weeks ago' for a while.

Lost quite a bit of data. We still have the md5sums though... I still get shivers thinking about it.


I know of one large company where contract operators managed to destroy every copy of the company's payroll by loading tape after tape onto a malfunctioning tape deck.


Yes, this was why I was paranoid about the whole write/restore/compare process.

I, being mainly a software guy, didn't consider the hardware robot as something that might fail w/o error.

Now, the process looks like:

Check drive is empty. Load tape. Write tape. Unload. Load tape "42" from another slot. Write 'slartibartfast' to that tape. Unload. Load original tape. Restore & compare. Unload. Load tape "42". Restore, and make sure all it contains is 'slartibartfast'.

This seems to me to have removed most of the possible silent-failure situations. If anyone can think of part of this algorithm that might fail, let me know!
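
A rough sketch of that sequence with mtx/mt/tar; the changer and drive device names, slot numbers, data path and canary string are all assumptions, not a description of the actual setup:

    #!/bin/bash
    # Write, canary, restore, compare. Changer at /dev/sg3, drive at /dev/nst0,
    # data tape in slot 1, canary tape in slot 42 -- all placeholders.
    set -eu
    CHANGER=/dev/sg3; DRIVE=/dev/nst0

    mtx -f "$CHANGER" status | grep -q 'Data Transfer Element 0:Empty' \
      || { echo "drive not empty, aborting"; exit 1; }

    mtx -f "$CHANGER" load 1 0                  # load the data tape
    mt  -f "$DRIVE" rewind
    tar -cf "$DRIVE" -C / data/to/archive       # write the archive
    mtx -f "$CHANGER" unload 1 0

    mtx -f "$CHANGER" load 42 0                 # canary proves tapes really move
    mt  -f "$DRIVE" rewind
    echo slartibartfast | dd of="$DRIVE" bs=512 conv=sync
    mtx -f "$CHANGER" unload 42 0

    mtx -f "$CHANGER" load 1 0                  # restore and compare the real data
    mt  -f "$DRIVE" rewind
    ( cd / && tar -df "$DRIVE" )                # tar --diff against the filesystem
    mtx -f "$CHANGER" unload 1 0

    mtx -f "$CHANGER" load 42 0                 # the canary must still read back
    mt  -f "$DRIVE" rewind
    dd if="$DRIVE" bs=512 count=1 | grep -q slartibartfast
    mtx -f "$CHANGER" unload 42 0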


We launched our brand new service into production pointing the backend at our dev instance, at the office. The entire internet showed up at our wee little DSL connection, effectively DDOS-ing our office. We had to leave, go to a cafe with public wifi to fix it.


My worst horror story was a full server room shutdown. We killed servers, then the chillers, and then started work. About an hour after we started, we pulled our first floor tile to move some power cables. There was water under the floor! We spent the next few hours cleaning up all the water.

Apparently water kept flowing into the humidifier tray of the chiller, and the mechanical auto-shutoff never triggered. The pump didn't remove water from the tray because the power was off.

Facilities "fixed" the humidifier, but it still happened again when that circuit was cut off for work elsewhere in the building. No one caught the water overflow, and it flowed out down the conduits to the first floor. So we had flooding on 2 different floors from a single chiller.


"you can't have more than 64,000 objects in a folder in S3 - even though S3 doesn't have folders." Is this for real, or are these stories made up? All documentation I've read about S3 suggests that it does not have any file count limitations. The timeline of Togetherville suggests that this story took place between 2008 and 2010. Did S3 have a limit back then that they lifted?


There has never been an S3 limit. Some of our customers have millions of objects in a single bucket.

When you do something like this you need to make sure that you have a good distribution of keys across the name space, and you need to think twice before you decide to write code to list the entire bucket. In most use cases at this scale, metadata and indexing are handled by something other than S3.
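
One common trick for getting that distribution (at least in this era of S3) was to lead each key with a short hash so lexically adjacent objects land on different partitions. A shell sketch; the bucket name and paths are made up:

    # Prefix every key with a few hex characters of its own hash instead of
    # a shared prefix like a date, so writes spread across the key space.
    for f in uploads/*; do
      key=$(basename "$f")
      prefix=$(printf '%s' "$key" | md5sum | cut -c1-4)
      aws s3 cp "$f" "s3://example-bucket/${prefix}/${key}"
    done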


Guiltily admit to listing an entire bucket with 3,500,000 keys, nightly.


The problem was with their temporary storage on a local filesystem, not S3. I'm sure there's some kind of limit to what S3 will allow you to store based on how they distribute data to servers, but 64k isn't it.


This was likely an inode issue on the operating system/kernel and how it represents S3 on the filesystem, especially if you're using something like S3FS.


Ah, my brain was somehow not seeing that in the article, but I see it now that I'm looking for it. Thanks!


It's still an extremely bad idea to have overly common key prefixes on S3 since it prevents balancing key distribution across their clusters.


Do you have a citation for that? I'd hope that the key distribution is a bit smarter than prefix-based.



The customer.io story seems like a great example of why NOT to use budget providers like OVH and Hetzner for mission-critical applications.

You get what you pay for.


Not so much an example of why not to use budget providers, more an example of why you should build highly available infrastructure. I don't believe there is any provider, "budget" or not, that guarantees a server's reliability 100% of the time.


I was alluding more to the customer support aspect of it. If a technician spends one hour troubleshooting your network problems, then they've already lost their profit for the month.


This is one thing where I find AWS shines. I'm on the lowest level of paid support, and I've had nothing but excellent service from good technicians who will try to actually replicate your problem, then contact other teams if they fail or there's follow up. Out of a dozen or so tickets, I've only had one where the response wasn't genuinely useful, and that was for an issue that may have been due to internet weather anyway.

Support is one of those things that you can get along without... until you need it. Then you really, really wish you had it.


I've spent nearly 2 hours with sites down at Rackspace, which is certainly not a budget provider, because our hardware firewall crapped out and they "couldn't find" our hardware in the datacenter.

Entirely different problem from a provider that loses one internet connection and their other links can't keep up with traffic, but you can still have major problems even if you're spending thousands of dollars a month compared to hundreds.


Echoing some of the other comments... the place I worked at was one of the 5 biggest Rackspace customers, and that didn't stop them from regularly cutting traffic or bringing down servers for hours in some cases. You might get better uptime overall from reputable providers, but ultimately it's all about distributed application/service architecture.


Absolutely. Removing SPOFs and moving to a service oriented architecture has been a major focus for us over the last few months.


The high-dollar ones aren't necessarily better.

At a previous job we hosted with {HAL} out of Atlanta. A NOC operator there saw/heard/smelled something that indicated to him that he should hit the Big Red Switch. So he did. This removed power to every machine in that part of the DC.

After management confirmed that there was no life-threatening emergency, they started bringing everything back up. Only to have machines start going down again 20 minutes later, as their local UPSes ran out of juice. Someone had to walk around to every cage and recycle them all manually.


Yeah they had a terrible experience!

The point of using DC hosting providers is the uptime, the environment, and usually strong network connectivity. Given the poor run they had, I think the developer would have had better uptime hosting this on a server at home.


Other than the bridge problem with OVH a month back, OVH has been fairly stable network-wise for me for the past year or so. And I rely on the network a ton since I run distributed crawlers.


You can hardly be surprised when OVH or Hetzner go down; just consider the price. Putting every server in one location is just stupid... as always, the best way to fight downtime is to spread servers across multiple providers & DCs.


Reminds me of a downtime report with my previous shell provider. They lost all Internet connection because someone had broken into a junction under a nearby freeway and cut all the fiberoptics and cables -- while preparing a robbery (presumably trying to cut alarm and/or off-site connections to cctv).

Turned out the two "redundant" providers of fiber both had fiber going through that junction...


Yeah, I was kind of worried when they talked about OVH and Hetzner... they are notoriously bad... If you have to outsource your servers, at least pick a company that does its job well (like LeaseWeb, for instance).


Going down is fine. Ignoring service problems is not fine.


One thing I've learned: the real value of replication is how easy it makes it to handle strange events without getting stressed out.

It's 3AM. You're being paged with a high latency alert in one datacenter. You run one command to drain traffic out of that datacenter. The latency graph starts looking normal again. You go back to bed at 3:05. You look at the logs and figure out what went wrong tomorrow morning.


TODO: Monday Morning: T1 install will be complete. Tuesday: Test/bootup period. Wednesday: Sales start. Thursday: Sales continue, TV ad goes live. Friday: Champagne!

Reality: Monday Morning: T1 did not get installed. Tuesday: Emergency ISDN solution (stolen from the chiropractors next door). Wednesday: Modem rack catches fire. Thursday: TV ad goes live. Friday: T1 goes live. Champagne.


this is why given a choice between theory/plans/estimates/schedules or, say... reality and iterating and observing what-actually-happens ... I always prefer the latter. in software engineering, in human relationships, and in the physical world around me in general.


Well, sure, as logical people we know that you can't predict failures and that it's always better to play it by ear. Unknown unknowns and all that.

I have worked at several places where salespeople have sold a feature without even asking if it was POSSIBLE, much less created/deployed/tested. "We just sold [Feature X], we told them it'd be ready by [date pulled out of thin air]."


Are there any open source load balancing solutions like what Amazon ELB does? Say, install the load balancer onto one or two Amazon VPSes, proxy traffic to third-party VPS/dedicated servers, Linode, OVH, etc. Wonder how feasible this approach is?


It's not about the load balancing software itself, say HAProxy or Nginx, but that with ELB, AWS autoscales and handles failover between availability zones. You could certainly handle spinning up your own LB instances, managing DNS/Elastic IPs to handle failovers, etc. It would be far more expensive in setup time, management time and EC2 bill than ELB, which is practically free, starting at less than $20/month.
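
For reference, the load-balancing half of the do-it-yourself route is the easy part. A minimal HAProxy sketch, written as a shell snippet; addresses, ports and the health-check path are placeholders, and failover between LB nodes (the piece ELB gives you for free) still needs keepalived, DNS, or an Elastic IP shuffle on top:

    # Append a bare-bones frontend/backend pair and reload HAProxy.
    cat >> /etc/haproxy/haproxy.cfg <<'EOF'
    frontend www
        bind *:80
        default_backend app

    backend app
        balance roundrobin
        option httpchk GET /healthz
        server web1 10.0.0.11:8080 check        # EC2 instance
        server web2 203.0.113.20:8080 check     # third-party dedicated box
    EOF
    service haproxy reload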


Among different datacenters this is done most often in DNS, in order to avoid the increased latency and single point of failure.


At least Amazon doesn't lose your servers.

http://www.informationweek.com/server-54-where-are-you/65055...


Eh, this tale of ultimate unattended service reminds me of my favorite Daniel Boone the frontiersman quote: "I can't say as ever I was lost, but I was bewildered once for three days."


In an early phase of MIT's EECS transition from Multics (going away, Honeywell sucks) to UNIX(TM) on MicroVAX IIs, i.e. some users, but not as many as later.

# kill % 1

Instead of %1. So I zapped the initializer, parent of everything else, logging everyone out without warning.

I had more than enough capital to avoid anything more than the deserved ribbing, but it was my Crowning Moment of Awesome devop lossage; harsh but minor screwups in the decade previous had trained me to be very careful.

I've avoided being handed the horrors of many other posters by primarily being a programmer. You full timers earn my respect.

ADDED: Ah, one big consequential goof, related to my not being a full time sysadmin but knowing more than anyone else in my startup. Buying a Cheswick and Bellovin style Gauntlet Firewall from TIS ... not realizing they'd just been bought by Network Associates, who promptly fired anyone who knew anything about supporting that product.... (At that time I didn't even know about iptable's predecessor, although given it was a Microsoft shop....)

I was fired from that job in part because I was the least worst sysadmin in the company, totally consumed with a big programming and database migration effort (Microsoft Jet -> DB2 -> DB2 on a real server), and gave opinions that others sometimes accepted and implemented without due diligence. E.g. I said "this is a competent ISP", not "you should also use their brand new email system" (which I didn't even know existed) ... visibility all the way up to the CEO is of course not always good....


A few devops horror stories:

- Someone on the hardware team deleted several VMs that were being used as build machines; there were no backups. It took around 2 days to get things back to normal.

- During a show I volunteer for: a scissor lift drove over a just-run (several-hundred-foot) ethernet line and severed it; they had to run a new line.

- PCs running Windows being set up as point-of-sale systems with static IPs on the internet and no firewall running. I disavowed all responsibility and left them on their own for that. They would have run them unpatched too without intervention.

- Someone checked a private key into the repository. Plan of action: obliterate from all branches everywhere, delete from all build drops (which contain source listings too), track down all build drop backups on tape and restore-delete-then-recreate them. Luckily I handed that job off to someone else.
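
For the history-rewrite part of that last one, the standard incantation at the time was git filter-branch (the key path below is a placeholder, and the key itself still has to be treated as compromised and rotated regardless):

    # Rewrite every branch and tag so the file never appears in history,
    # then force-push; everyone else has to re-clone or hard-reset.
    git filter-branch --force --index-filter \
      'git rm --cached --ignore-unmatch secrets/private.key' \
      --prune-empty --tag-name-filter cat -- --all

    git push origin --force --all
    git push origin --force --tags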


Coworker says "I'm going to do some clean-up on the server." Two minutes later, "Oh crap." He had wiped out /var/lib. And tell you what, the server kept working. We didn't dare reboot it, though.

Another fun one was coming in one morning, and cleaning up after somebody used some foul PHP provisioning scripts on a customer system and had the unfortunate idea to use a function called "archive". Turned out the function didn't so much "archive" as "delete". Henceforth deletion, especially unintended deletion, was known as "shotgun archival".


alt.sysadmin.recovery lives on! albeit in a web app. wonder if usenet is still alive...


There are still some good NNTP clients out there, but I think Google Groups is the primary interface these days:

https://groups.google.com/forum/#!forum/alt.sysadmin.recover...


Non ex transverso sed deorsum ("not sideways, but downwards")


In a script: sudo chown -R apache:apache . /

Note that space? I didn't.


The first two stories are notable in how they reflect the terrible practices of the teller.

"Our distributed application produces the same type of error after the same period of time in totally different data centers. We have no idea why, but moving data centers seems to help, so we just keep doing it. #YOLO"

"We've built a product on a data store and library we don't understand even the highest-level constraints of. That ignorance bit us in the ass at peak load. We patched over the problem and continue gleefully into the future. #YOLO"

These stories should be embarrassing, but they're seemingly being celebrated, or at least laughed about. Am I off base?


your first characterization seems incorrect (did you read the story? it wasn't application errors), and your second characterization is hyperbolic at best. calling it a high-level constraint doesn't mean it's common, nor obvious.

calling them "terrible practices" is redundant, all devops horror stories can be characterized as exposing terrible practices if you're simply looking at the post-hoc view. it's a feature, not a bug, to make light of them. they're laughed about, but with the intent that they're not made again.


"Oh, just use keys * to work out what's there."

"No, wait, don't…!"

<site down>


Mine was simple: I did a middle-mouse-button paste of "init 6" into a root window of our main Solaris server that hosted about 100 users, mid-day. Boss shrugged it off, stuff happens.

But that's because it was properly configured so a reboot was smooth and didn't have any snags or affect other systems once back online. At another data center across the hall, if their main server needed to be rebooted (not accidentally!), it was 3 days of troubleshooting to get it back up. I learned that after the boss hired one of their admins - not surprisingly, a big mistake.


(worst) update table set column = 'blah' WITHOUT a where clause (thank god for backups)

(2nd worst) delete from table where created < 'old_date' WITHOUT an account (thank god again for backups)

Lesson learned: always back up, and write the WHERE clause first.


It is possible to tell psql to always issue an implicit BEGIN so you also have to COMMIT before your change becomes permanent.

This has saved me from paying the price for that particular class of mistake on a number of occasions.
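
That's presumably psql's AUTOCOMMIT variable; a one-line sketch of turning it off for every interactive session:

    # With autocommit off, psql opens a transaction that stays open until
    # you explicitly COMMIT (or ROLLBACK the mistake away).
    echo '\set AUTOCOMMIT off' >> ~/.psqlrc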


Let my cofounder near the backups. Whoops.

Had a friend who recently took down his nic over ssh; he claimed he managed to get back in using some sort of serial over lan magic but I suspect he really just got someone on the other end to help.
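
A common safety net before touching a remote interface is to schedule the undo first; the interface name and delay below are illustrative:

    # Queue an automatic revert, make the risky change, then cancel the
    # revert once you've confirmed the session still works.
    echo "ifup eth0" | at now + 5 minutes
    ifdown eth0 && ifup eth0
    atq                         # note the job id, then: atrm <id>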


Any guesses as to what happened for customer.io at Linode/Hetzner?


service network stop


ifdown eth0


So they hosted at Hetzner and OVH, both extremely cheap hosters, and were surprised that things did not go smoothly?

Extremely professional.


Academic devops horror story in one word: Ruby.



