
Devops Horror Stories - stevenklein
https://www.statuspage.io/devops-horror-stories
======
peterwwillis
My Devops horror stories, one sentence each:

\- Somebody deployed new features on a Friday at 5pm.

\- Fifteen hundred machines running mod_perl.

\- Supporting Oracle - TWICE.

\- It turns out your entire infrastructure is dependent on a single 8U Sun
Solaris machine from 15 years ago, and nobody knows where it is.

\- Troubleshooting a bug in a site, view source.... and see SQL in the JS.

~~~
ecopoesis
I really hate the idea that deploying on a Friday afternoon is a bad idea.
It's only bad when you have shit developers or shit processes that don't catch
broken code.

Personally, I think it's better to release at 5pm on a Friday. Once people
stay late a few times to fix their broken shit they'll be smarter about not
checking in crap.

~~~
peterwwillis
> It's only bad when you have shit developers or shit processes that don't
> catch broken code.

Or when the bug is only triggered in specific user profiles.

Or when all the devs went on a retreat in the mountains with no cell service.

Or when a dev makes a mistake (which we know _never_ happens to even the best
devs)

Or when the only developer who knows which one of the 1000 changes that were
pushed could be the one breaking things has turned his phone off.

Or when a flaw is discovered in the process for the first time (which we know
_never_ happens because everyone's process is perfect, until it isn't)

Or how change management's requirement that the fix be tested and verified by
all affected teams might have people staying a few hours after 5pm on a Friday
when they just want to get their weekend started.

Or how 10 different people from 10 different teams _might_ need to be called
and kept to work until 2am because the change can't be pulled because the
database was already modified and the old client data is already expired from
cache and a refresh would destroy the frontend servers.

Or another reason.

~~~
ryan-allen
Yes, this! "Good code" and a CI box and deployment automation and some chef
recipes don't spell ultimate success.

It drives me nuts when people tell me off for saying 'yeah yeah, no,
automating our entire infrastructure of 5 servers isn't really worth it right
now', like I'm some unprofessional bozo.

I pretty much have experience with all but one or two of your suggested
scenarios, and by now I have no patience for annoying software developers who
think that using chef or puppet somehow sufficiently embiggens them to run ops
on their own (of course devops is almost a political assault on existing ops
guys, not merely a nice new solution to existing problems).

Sigh. This is why I don't work on teams these days (if I can help it).

EDIT: Though I also agree with the sub-parent that deploying at 5pm is fine
on certain teams and certain projects. The most important question is: are the
people pushing to do the deploy going to own it? Are they going to hang around
for another 60 minutes to check everything is OK? Are they going to be
available at 10pm or on Saturday if something goes wrong, and are they going
to own it? If the answer is no, then nope, don't do it.

------
nailer
Temporarily mounted an NFS volume to a folder under /tmp.

Forgot about tmpwatch, a default entry in the RHEL cron table to clear out old
temp files.

4AM the next morning: recursive deletion of anything with a change time older
than n days.
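A sketch of what that cron job effectively did, and the one flag that would have spared the NFS mount: find's `-xdev` keeps the sweep on a single filesystem. (tmpwatch itself checks atime/ctime by default; mtime is used here to keep the demo simple, and a scratch directory stands in for /tmp.)

```shell
# Demo directory standing in for /tmp; nothing real is touched.
DEMO=$(mktemp -d)
touch -d '20 days ago' "$DEMO/stale"   # GNU touch: backdate the mtime
touch "$DEMO/fresh"

# -xdev stops find from descending into anything *mounted* under the
# tree, which is exactly what would have protected that NFS volume.
find "$DEMO" -xdev -type f -mtime +10 -delete
```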

~~~
mhurron
/mnt and /media exist for reasons. And root_squash and ...

Why no, I've _NEVER_ accidentally deleted whole file systems, I have
completely earned superiority here.

Delete /proc and /dev on a running server. Thankfully not really disastrous
but damn if people don't notice right away.

Thanks for the tmpwatch info btw.

~~~
nailer
Would have used /media, but was thinking that if, say, I forgot to unmount it,
or someone looked at a disk-free listing or whatever, it would be obvious that
it was there temporarily.

Obviously that was incorrect, but the reasoning was, I think, sound.

~~~
hga
I like /mnt/scratch/ for that. If I was using systems with lots of others, I'd
make it clearer with /mnt/tmp/.

------
deckiedan
Tape Archive System: write a tape, restore it again, and do MD5sum against the
original data. Then we know it can be restored correctly, and the original
data is deleted.

Should be bullet proof?

Alas, the 'write to tape' scripts I'd inherited didn't warn if they couldn't
load a tape into the drive.

There was a tape jammed in the drive, so the tape robot was refusing to load
any new tapes, but kept on writing and restoring from the same tape over and
over again.

Stupidly, we didn't do any 'check a tape from 3 weeks ago' for a while.

Lost quite a bit of data. We still have the md5sums though... _Still get
shivers thinking about it._

~~~
walshemj
I know one large company where contract operators managed to destroy every
copy of a very large company's payroll by loading tape after tape onto a
malfunctioning tape deck.

~~~
deckiedan
Yes, this was why I was paranoid about the whole write/restore/compare
process.

I, being mainly a software guy, didn't consider the hardware robot as
something that might fail w/o error.

Now, the process looks like:

Check drive is empty. Load tape. Write tape. Unload. Load tape "42" from
another slot. Write 'slartibartfast' to that tape. Unload. Load original
tape. Restore & compare. Unload. Load tape "42". Restore, and make sure all
it has is 'slartibartfast'.

This seems to me to have removed most of the possible silent-failure
situations. If anyone can think of part of this algorithm that might fail, let
me know!
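For what it's worth, the flow above can be sanity-checked in miniature. The sketch below simulates the slots and the drive with plain files; `load_tape` and friends are hypothetical stand-ins for the real robot commands (e.g. mtx plus tar). A stuck tape would turn load/unload into no-ops, which the slot-42 canary then catches: slot 1 would read back 'slartibartfast' instead of the data.

```shell
TAPES=$(mktemp -d)          # one file per tape slot
DRIVE="$TAPES/drive"        # what the "drive" currently holds

load_tape()   { cp "$TAPES/slot$1" "$DRIVE" 2>/dev/null || : > "$DRIVE"; CUR=$1; }
unload_tape() { cp "$DRIVE" "$TAPES/slot$CUR"; rm -f "$DRIVE"; }
write_tape()  { printf '%s' "$1" > "$DRIVE"; }
read_tape()   { cat "$DRIVE"; }

backup_and_verify() {
  data=$1
  load_tape 1;  write_tape "$data";          unload_tape
  load_tape 42; write_tape slartibartfast;   unload_tape   # canary slot
  load_tape 1;  restored=$(read_tape);       unload_tape
  load_tape 42; canary=$(read_tape);         unload_tape
  # Both checks must hold; a silently-stuck tape fails at least one.
  [ "$restored" = "$data" ] && [ "$canary" = slartibartfast ]
}

backup_and_verify "payroll-2014" && echo "backup verified"
```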

------
splitrocket
We launched our brand new service into production pointing the backend at our
dev instance, at the office. The entire internet showed up at our wee little
DSL connection, effectively DDOS-ing our office. We had to leave, go to a cafe
with public wifi to fix it.

------
caw
My worst horror story was a full server room shutdown. We killed servers, then
the chillers, and then started work. About an hour after we started, we pulled
our first floor tile to move some power cables. There was water under the
floor! We spent the next few hours cleaning up all the water.

Apparently water kept flowing into the humidifier tray of the chiller, and the
mechanical auto-shutoff never triggered. The pump didn't remove water from the
tray because the power was off.

Facilities "fixed" the humidifier, but it still happened again when that
circuit was cut off for work elsewhere in the building. No one caught the
water overflow, and it flowed out down the conduits to the first floor. So we
had flooding on 2 different floors from a single chiller.

------
rdw
"you can't have more than 64,000 objects in a folder in S3 - even though S3
doesn't have folders." Is this for real, or are these stories made up? All
documentation I've read about S3 suggests that it does not have any file count
limitations. The timeline of Togetherville suggests that this story took place
between 2008 and 2010. Did S3 have a limit back then that they lifted?

~~~
jeffbarr
There has never been an S3 limit. Some of our customers have millions of
objects in a single bucket.

When you do something like this you need to make sure that you have a good
distribution of keys across the name space, and you need to think twice before
you decide to write code to list the entire bucket. In most use cases at this
scale, metadata and indexing are handled by something other than S3.
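The key-distribution advice can be sketched in a few lines: derive a short hash prefix so sequential keys spread across the key space instead of piling into one hot partition. The bucket name and key here are made up, and this matters mainly at very high request rates.

```shell
# Sequential date-based keys all share a prefix; a few hex characters
# of a hash in front spreads them across the S3 key space.
key="logs/2014/03/01/host42.gz"
prefix=$(printf '%s' "$key" | md5sum | cut -c1-4)   # first 4 hex chars
target="s3://example-bucket/$prefix/$key"
echo "$target"
```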

~~~
aidos
Guiltily admit to listing an entire bucket with 3,500,000 keys, nightly.

------
ch4ch4
The customer.io story seems like a great example of why NOT to use budget
providers like OVH and Hetzner for mission-critical applications.

You get what you pay for.

~~~
jamescun
Not so much an example of why not to use budget providers, more an example of
why you should build highly available infrastructure. I don't believe there is
any provider, "budget" or not, that guarantees a server's reliability 100% of
the time.

~~~
ch4ch4
I was alluding more to the customer support aspect of it. If a technician
spends one hour troubleshooting your network problems, then they've already
lost their profit for the month.

~~~
vacri
This is one thing where I find AWS shines. I'm on the lowest level of paid
support, and I've had nothing but excellent service from good technicians who
will try to actually replicate your problem, then contact other teams if they
fail or there's follow up. Out of a dozen or so tickets, I've only had one
where the response wasn't genuinely useful, and that was for an issue that may
have been due to internet weather anyway.

Support is one of those things that you can get along without... until you
need it. Then you really, really wish you had it.

------
Fizzadar
You can hardly be surprised when OVH or Hetzner go down; just consider the
price. Putting every server in one location is just stupid... as always, the
best way to fight downtime is to spread servers across multiple providers &
DCs.

~~~
e12e
Reminds me of a downtime report with my previous shell provider. They lost all
Internet connection because someone had broken into a junction under a nearby
freeway and cut all the fiberoptics and cables -- while preparing a robbery
(presumably trying to cut alarm and/or off-site connections to cctv).

Turned out the two "redundant" providers of fiber both had fiber going through
that junction...

------
jrockway
One thing I've learned: the real value of replication is how easy it makes it
to handle strange events without getting stressed out.

It's 3AM. You're being paged with a high latency alert in one datacenter. You
run one command to drain traffic out of that datacenter. The latency graph
starts looking normal again. You go back to bed at 3:05. You look at the logs
and figure out what went wrong tomorrow morning.

------
fit2rule
TODO:

Monday morning: T1 install will be complete.
Tuesday: Test/boot-up period.
Wednesday: Sales start.
Thursday: Sales continue, TV ad goes live.
Friday: Champagne!

Reality:

Monday morning: T1 did not get installed.
Tuesday: Emergency ISDN solution (stolen from the chiropractors next door).
Wednesday: Modem rack catches fire.
Thursday: TV ad goes live.
Friday: T1 goes live. Champagne.

~~~
mkramlich
this is why given a choice between theory/plans/estimates/schedules or, say...
reality and iterating and observing what-actually-happens ... I always prefer
the latter. in software engineering, in human relationships, and in the
physical world around me in general.

~~~
wizzard
Well, sure, as logical people we know that you can't predict failures and that
it's always better to play it by ear. Unknown unknowns and all that.

I have worked at several places where salespeople have sold a feature without
even asking if it was POSSIBLE, much less created/deployed/tested. "We just
sold [Feature X], we told them it'd be ready by [date pulled out of thin
air]."

------
cygwin98
Are there any open source load balancing solutions like Amazon ELB? Say,
install the load balancer on one or two Amazon VPSes and proxy traffic to
third-party VPS/dedicated servers (Linode, OVH, etc.). Wondering how feasible
this approach is.

~~~
jmccree
It's not about the load balancing software itself, say HAProxy or Nginx; with
ELB, AWS autoscales and handles failover between availability zones. You
could certainly handle spinning up your own LB instances, managing
DNS/Elastic IPs to handle failovers, etc. It would be far more expensive in
setup time, management time, and EC2 bill than ELB, which is practically
free, starting at less than $20/month.
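For the self-hosted half of the question, a minimal HAProxy sketch balancing across third-party servers might look like this (the backend addresses are hypothetical documentation IPs; this covers only balancing and health checks, not ELB's managed autoscaling and multi-AZ failover):

```
defaults
    mode http
    timeout connect 5s
    timeout client  30s
    timeout server  30s

frontend www
    bind *:80
    default_backend app

backend app
    balance roundrobin
    option httpchk GET /health
    server linode1  198.51.100.10:80 check
    server hetzner1 203.0.113.20:80 check
```

You would still need to solve failover for the load balancer itself, e.g. with a second HAProxy box and DNS or a floating IP.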

------
patmcguire
At least Amazon doesn't lose your servers.

[http://www.informationweek.com/server-54-where-are-you/6505527](http://www.informationweek.com/server-54-where-are-you/6505527)

~~~
hga
Eh, this tale of ultimate unattended service reminds me of my favorite Daniel
Boone the frontiersman quote: "_I can't say as ever I was lost, but I was
bewildered once for three days._"

------
hga
In an early phase of MIT's EECS transition from Multics (going away, Honeywell
sucks) to UNIX(TM) on MicroVAX IIs, i.e. some users, but not as many as
later.

# kill % 1

Instead of %1. So I zapped the initializer, parent of everything else, logging
everyone out without warning.
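For anyone who missed the one-character difference: in shells with job control, `%1` names job 1, while a bare `%` names the current job, so the stray space also delivers the signal to PID 1. A harmless demo of the intended form, with a `sleep` standing in for the job:

```shell
set -m                    # enable job control (on by default only interactively)
sleep 300 &               # becomes job %1
kill %1 && KILLED=yes     # correct form: signals job 1 only
# 'kill % 1' is very different: '%' alone means the current job, and
# '1' is PID 1. Run as root, PID 1 is init, and everyone is logged out.
```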

I had more than enough capital to avoid anything more than the deserved
ribbing, but it was my Crowning Moment of Awesome devop lossage; harsh but
minor screwups in the decade previous had trained me to be very careful.

I've avoided being handed the horrors of many other posters by primarily being
a programmer. You full timers earn my respect.

ADDED: Ah, one big consequential goof, related to my not being a full time
sysadmin but knowing more than anyone else in my startup. Buying a Cheswick
and Bellovin style Gauntlet Firewall from TIS ... not realizing they'd just
been bought by Network Associates, who promptly fired anyone who knew anything
about supporting that product.... (At that time I didn't even know about
iptables' predecessor, although given it was a Microsoft shop....)

I was fired from that job in part because I was the least worst sysadmin in
the company, totally consumed with a big programming and database migration
effort (Microsoft Jet -> DB2 -> DB2 on a real server), and gave opinions that
others sometimes accepted and implemented without due diligence. E.g. I said
"this is a competent ISP", not "you should also use their brand new email
system" (which I didn't even know existed) ... visibility all the way up to
the CEO is of course not always good....

------
InclinedPlane
A few devops horror stories:

\- Someone on the hardware team deleted several VMs that were being used as
build machines; there were no backups. That wasted around 2 days getting
things back to normal.

\- During a show I volunteer for: a scissor lift drove over a just-run
(several-hundred-foot) ethernet line and severed it; they had to run a new
line.

\- PCs running windows being set up, as point-of-sale systems, to run with
static IPs on the internet, without a firewall running. Disavowed all
responsibility and left them on their own for that. They would have run them
unpatched too without intervention.

\- Someone checked a private key into the repository. Plan of action:
obliterate from all branches everywhere, delete from all build drops (which
contain source listings too), track down all build drop backups on tape and
restore-delete-then-recreate them. Luckily I handed that job off to someone
else.

------
mercurial
Coworker says "I'm going to do some clean-up on the server." Two minutes
later, "Oh crap." He had wiped out /var/lib. And tell you what, the server
kept working. We didn't dare reboot it, though.

Another fun one was coming in one morning, and cleaning up after somebody used
some foul PHP provisioning scripts on a customer system and had the
unfortunate idea to use a function called "archive". Turned out the function
didn't so much "archive" as "delete". Henceforth deletion, especially
unintended deletion, was known as "shotgun archival".

------
perlpimp
alt.sysadmin.recovery lives on! Albeit in a web app. Wonder if Usenet is
still alive...

~~~
tvon
There are still some good NNTP clients out there, but I think Google Groups is
the primary interface these days:

[https://groups.google.com/forum/#!forum/alt.sysadmin.recover...](https://groups.google.com/forum/#!forum/alt.sysadmin.recovery)

------
nl
In a script: sudo chown -R apache:apache . /

Note that space? I didn't.

------
redbad
The first two stories are notable in how they reflect the terrible practices
of the teller.

"Our distributed application produces the same type of error after the same
period of time in totally different data centers. We have no idea why, but
moving data centers seems to help, so we just keep doing it. #YOLO"

"We've built a product on a data store and library we don't understand even
the highest-level constraints of. That ignorance bit us in the ass at peak
load. We patched over the problem and continue gleefully into the future.
#YOLO"

These stories should be embarrassing, but they're seemingly being celebrated,
or at least laughed about. Am I off base?

~~~
scootklein
your first characterization seems incorrect (did you read the story? it wasn't
application errors), and your second characterization is hyperbolic at best.
calling it a high-level constraint doesn't mean it's common, nor obvious.

calling them "terrible practices" is redundant, all devops horror stories can
be characterized as exposing terrible practices if you're simply looking at
the post-hoc view. it's a feature, not a bug, to make light of them. they're
laughed about, but with the intent that they're not made again.

------
peterstjohn
"Oh, just use keys * to work out what's there."

"No, wait, don't…!"

<site down>

------
liquidcool
Mine was simple: I did a middle-mouse-button paste of "init 6" into a root
window of our main Solaris server that hosted about 100 users, mid-day. Boss
shrugged it off, stuff happens.

But that's because it was properly configured so a reboot was smooth and
didn't have any snags or affect other systems once back online. At another
data center across the hall, if their main server needed to be rebooted (not
accidentally!), it was 3 days of troubleshooting to get it back up. I learned
that after the boss hired one of their admins - not surprisingly, a big
mistake.

------
cpt1138
(worst) update table set column = 'blah' WITHOUT a where clause (thank god for
backups)

(2nd worst) delete from table where created < 'old_date' WITHOUT an account
(thank god again for backups)

Lesson learned: always back up, and write the WHERE clause first.

~~~
mst
It is possible to tell psql to always issue an implicit BEGIN so you also have
to COMMIT before your change becomes permanent.

This has saved me from paying the price for that particular class of mistake
on a number of occasions.
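The setting being described is psql's autocommit toggle; putting it in `~/.psqlrc` makes every statement open a transaction that only an explicit COMMIT makes permanent. A config fragment, assuming the PostgreSQL psql client:

```
-- in ~/.psqlrc (a psql meta-command, not SQL)
\set AUTOCOMMIT off
```

With autocommit off, an UPDATE or DELETE missing its WHERE clause can still be undone with ROLLBACK, as long as you have not yet committed.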

------
allworknoplay
Let my cofounder near the backups. Whoops.

Had a friend who recently took down his NIC over SSH; he claimed he managed
to get back in using some sort of serial-over-LAN magic, but I suspect he
really just got someone on the other end to help.

------
porker
Any bets as to what happened to customer.io at Linode/Hetzner?

------
joeshred
service network stop

~~~
InclinedPlane
ifdown eth0
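A common safety net for remote changes like this is a dead man's switch: schedule the undo first, then cancel it if you still have access. A sketch with the risky command simulated by a file write (swap in the real `ifdown`/`ifup`, or an `at` job, for production use):

```shell
STATE=$(mktemp)
( sleep 30 && echo reverted > "$STATE" ) &   # rescue job: undoes the change in 30s
UNDO=$!
echo changed > "$STATE"                       # the "risky" change (stand-in for ifdown)
kill "$UNDO" 2>/dev/null                      # still connected, so cancel the rescue
```

If the change had locked you out, the rescue job would have fired on its own and restored the interface.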

------
iSnow
So they hosted at Hetzner and OVH, both extremely cheap hosters, and were
surprised that things did not go smoothly?

Extremely professional.

------
neumann
Academic devops horror story in one word: Ruby.

