I love the sanity check part. But really, you're keeping someone on hand to drag your ass out after you pass out while sniffing up toxic fumes.
In most datacenters I've seen, if it were just "smell of smoke", I'd probably be willing to do a run-through with an IR cam/temp probe, or just a visual inspection, carrying a handheld 1211, especially if I had a respirator. Clear view and path to two exits, someone at the EPO switch, etc.
The "big scary things" are battery plant and generator plant, and any kind of subfloor or ductwork. As long as the fire isn't in any of those, it's far less of a big deal. I probably wouldn't EPO a room for a server on fire, either -- just kill the rack, which takes slightly longer.
I've been in places where "smell of smoke" was a fucknozzle smoking a cigar or burning leaves outdoors near an air intake, and another where it was a smoker's coat being put on an air handler.
If you're not ready to use your DR plan, it probably means your DR plan is inadequate to begin with. Why the hell do we run fire drills? Even cruise ships do drills. God forbid they pull their passengers away from that very important game of Texas Hold'em.
You fail to understand how fires start or why they spread. I mean, why the hell do datacenters spend millions of dollars on fire suppression when an IR cam and a handheld extinguisher are just as good, right?
The fire suppression exists for two reasons. One is to get code exemptions: to be allowed to run wiring in ways which would otherwise require a licensed electrician for every wiring job, or would prohibit people from being in the facility at all. Two is to detect small fires early and prevent their spread, and to protect the facility from a catastrophic facility-wide fire.
Servers are just not that high a fire risk, particularly when de-energized. Generally inside a self-contained metal chassis, less than 100 pounds each, metal/plastic, etc. The power supply is the most likely component to start a fire, and contains a max of maybe 250g of capacitors and other components. The risk of one server catching on fire is low, and the risk of it rapidly spreading to anything else is low, so yes, I'd be comfortable pulling a single burning server out of a de-energized rack.
Also, in big or purpose-built facilities, those components most likely to be fire risks (batteries/power handlers, and generators) are in separate rooms, separated by firewalls from the datacenter. A fire in the battery room is going to be dealt with by sealing that room and powering it off, dumping suppression agent, and bringing out the FD immediately.
Life safety is much more important than business continuity, but a lot of people have jobs where they accept a non-zero risk of physical harm to do their jobs. It's certainly not reasonable to demand a datacenter tech go into a burning building to rescue a database server or something, but approximately zero datacenter staff I know would have a problem with assuming the level of risk I would to find problems. (it's probably a bigger deal for employers to actually discourage risk-taking by employees, particularly when it's risk-taking to save themselves effort, like single-person racking large UPSes or very large servers, etc.)
Well, US-flag passenger ships (among others) are required to hold Fire and Emergency drills at least once every week. But your point stands: having a plan, and executing that plan even when it's prefaced by "this is a drill...", is crucial if you want any hope of things going the right way in an actual emergency.
Here's the thing, though. If your people are properly trained in what to do, and how to use the equipment, then shutting down power to the rack and extinguishing fire on a single server with a fire extinguisher may be a reasonable course of action. But that contingency should have been considered ahead of time and be part of the emergency plan.
The time to decide what to do is not during the emergency.
As for walking around a smoky room looking for the source, that's nuts. I spent one long day (way too short, though) at Military Sealift Command Firefighting School. One of the first things they do is put you in a room full of smoke and make you count out loud. After about 30 seconds you feel like you're going to pass out -- that gets the point across much better than lectures ever will.
Our colo went down. Fire system triggered it (no actual fire was harmed in the triggering, however). On Thanksgiving. When I'd volunteered for on-call seeing as my co-worker had family to attend to and I didn't.
He thanked me afterward.
Our cage was reasonably straightforward to bring up, once power was restored. The colo facility as a whole took a few days to bring all systems up, apparently some large storage devices really don't like being Molly-switched.
And which part of Oakland? That can be a pretty broad range of risk. :-)
Edit: It helps that you're the CEO of your company. The risk profile changes a little bit. :-)
Also, you should look up the agony involved; people would rather shoot themselves than take the "treatment".
I don't think the situation was that bad. The one really unforgivable thing was shoddy electrical work in shower trailers (I think ~10 contractors and soldiers were electrocuted while showering in Iraq! I certainly got 230v a couple of times and went through the reporting process, and actually got MPs and a friend from Contracting to turn it into a bigger issue.)
It definitely happens.
Whether it was a redundant mains power system blowing (taking down the main PDU), spoiled diesel, a failed generator cutover, a UPS fire, a smoke detector-triggered shutdown (associated with power management), a really bizarre IPv6 ping / router switch flapping issue, load balancer failures based on an SSL cipher-implementation bug (triggering an LB reboot and ~15s outage at random intervals), etc., etc., etc.
Just piling redundancy on your stack doesn't make it more reliable. You've got to engineer it properly, understand the implications, and actually monitor and come to know the actual outcomes. Oh, and cascade failures.
Yeah, in a sense it actually makes it less reliable as far as mean time between failures goes. As an example, the rate of engine failure in twin-engine planes is greater than for single-engine planes. It's obvious if you think about it: there are now two points of failure instead of one. Why have twin-engine planes? Because you can still fly on one engine (pilots: no nitpicking!).
What redundancy does do is let you recover from failure without catastrophe (provided you've set it up properly as per the parent).
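To put rough numbers on that, a back-of-the-envelope sketch (the per-engine failure probability is made up, and engines are assumed to fail independently, which real-world causes like fuel contamination violate):

    # Hypothetical per-flight engine failure probability -- made up
    # purely to illustrate the arithmetic.
    p = 0.001

    # A twin is roughly twice as likely to have *an* engine fail...
    p_any_failure_twin = 1 - (1 - p) ** 2   # ~0.002, vs 0.001 for a single

    # ...but losing *all* engines requires both to fail on the same flight.
    p_total_loss_single = p         # 0.001
    p_total_loss_twin = p ** 2      # 0.000001

    print(p_any_failure_twin, p_total_loss_single, p_total_loss_twin)

So failures get more frequent, but the catastrophic outcome gets far rarer, which is exactly the trade described above.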
It depends on what you're protecting against, how you're protecting against it, and how you've deployed those defenses.
Chained defenses, generally, decrease reliability. Parallel defenses generally increase it.
E.g.: Putting a router, an LB, a caching proxy, an app server tier, and a database back-end tier (typical Web infrastructure) in series (a chain) introduces complexity and SPOFs to a service. You can duplicate elements at each stage of the infrastructure, but might well consider a multi-DC deployment, as you're still subject to DC-wide outages (I've encountered several of these) and a great deal of complexity and cost.
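To make the chain-vs-parallel arithmetic concrete, a minimal sketch with made-up per-tier availability numbers (and the simplifying assumption that tiers fail independently):

    # Made-up per-tier availability figures, assumed independent --
    # for illustration only.
    tiers = {"router": 0.9999, "lb": 0.999, "cache": 0.999,
             "app": 0.995, "db": 0.999}

    # In series (a chain), availabilities multiply, so every added
    # stage drags the total down.
    chain = 1.0
    for a in tiers.values():
        chain *= a
    print("single chain: %.4f" % chain)            # ~0.9919

    # Two independent chains in parallel (e.g. two DCs): the service
    # is down only when both chains are down at once.
    parallel = 1 - (1 - chain) ** 2
    print("two parallel chains: %.6f" % parallel)  # ~0.999935

The caveat, as below, is that the independence assumption is exactly what correlated failures (like the LB bug) break.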
Going multi-DC doesn't increase capital requirements by much, and may or may not be more expensive than 2x the build-out in a single DC. It does, though, raise issues of code and architecture complexity.
In several cases, we were experiencing issues that would have persisted despite redundant equipment. E.g.: the load balancer SSL bug we encountered was present on all instances of multiple generations of the product. Providing two LBs would simply have ensured that as the triggering cipher was requested, both LBs would have failed and rebooted. Something of an end-run around our Maginot line, as it were.
The fact is, there aren't that many flammable things in a data center. Nobody is going to die because they wandered up and down the aisles after they "thought they smelled something burning," in the absence of any visible smoke.
The guidelines in the highest-voted answer on the SO page make sense to me.
1: If you actually see smoke or fire in any significant amount, evacuate.
2: Make someone else aware of what's going on before doing anything else.
3: Keep your escape options open.
4: Think about how much time you can safely spend "guessing", and don't exceed it.
5: Don't second-guess your own common sense. You aren't paid to be a fireman or a hero.
"Not being prepared for them is unforgivable" - would mean that 99% of business do not deserve forgiveness.
It just doesn't make sense to have that kind of redundancy for such a rare event for all but a very, very small minority of businesses. (Telecoms, Google, Stock Exchange, 911, etc...)
Always plan for the single-site failure, because no amount of money will guarantee a single site stays running.
Of course they are idiots; they should never have powered off their own mains before they got the new one installed.
But it was still pretty funny.
This story is giving the Japanese engineer in me apoplexy.
Don't be afraid to call the emergency number. They'll know what to do and walk you through it.
Under no circumstance should you enter a room filled with smoke. Smoke inhalation is incredibly dangerous.
To re-iterate: a lot (most?) of the people who die in fires actually die from smoke inhalation rather than from being burned by the fire/flames.
Would people at risk of an inhalation injury actually pass through you often? You're in a burn unit, so I'd assume that means you mostly get burn victims, and inhalation injuries would be pointed somewhere else?
Inhalation injury is a subset of burn trauma. The flow control is "Ambulance inbound from fire" -> ER calls trauma alert -> trauma team meets the ambulance(s) at the ER door. Those with inhalation injury are sorted from there.
The article says "smell", not "smoke", but I don't think inhaling it was a good idea in any event.
When I went to the Midwest RepRap Festival, for some reason the power was being a bit flaky. The guy sitting across from me had 2 printers using ATX power supplies.
Suddenly, one PSU sparked out and then outgassed a pillar of smoke 4 feet in diameter for the next 2-3 minutes. It reached near the top of the building, perhaps 3 stories up; it looked like a mini-Hiroshima, mushroom cloud and all.
And that was from a dinky ATX PSU.
Luckily, energy management outpaced the growth in student equipment, and laptops now last a good part of the day without needing a plug.
What you really do is you evacuate the room and then release the CO2/other fire suppression system which shouldn't require you to shut anything down.
Then you go into the room and check whatever equipment isn't functioning -- if you smell fire again you might have to press the big red button, but there is no reason to panic just because of a fire, so long as you can put it out.
The problem with that is that you only get one shot with the agent (although everywhere I've seen has a gas/clean agent with dry-pipe water as the true final protective measure). If there's electricity still going in and you use your one shot with the facility clean agent, you might not actually put out the fire, or it could re-ignite, and then you have either no fire protection or only water. The cost of filling a datacenter with water, especially one important enough to have gas and water backup in the first place, is huge, even relative to an EPO pull.
If it's an "OK situation", they will be able to talk you through what needs to be checked, and they'll likely know it better than you do if you don't have a proper in-case-of checklist prepared beforehand.
If the smell turns into something worse, then you'll get them on-site faster.
You do not try to debug the fire. There are people who are good at not dying while trying to do that. You are not one of them.
You do not try to avoid Big Red Buttoning because your bosses are idiots and they might come down on you hard for it: while your bosses are probably idiots, the first thing they'll tell you about Big Red Buttons is that nobody has ever gotten fired for pressing the Big Red Button, because everyone agrees that Big Red Buttons exist to get pushed, and you never want one left unpushed because someone was worried about getting fired. Big Red Buttons are costly affairs, sure. That's why we have redundant systems, insurance, and other various things that suggest we're responsible professionals.
I'm pretty sure that axiom is not present in all companies and cultures. Hence, the debate.
Oakland FD came out and used their IR camera to check the heat from the ceilings nearby. Hilariously they found a hot water pipe (running between bathroom and kitchen) and almost axed the ceiling open (turning $1500 in damage into $3k+), but their captain was smart and figured it out from another angle.
Really tempted to hack an EOS 5Dm3 into an IR camera next. Not so much for fires as for night vision, but it would be useful for fires too. I'm not sure how useful that kind of IR camera is at detecting heat, though, since things which aren't yet on fire don't radiate much in the near-infrared.
I usually use a Fluke IR temp meter when cooking and to find hot wires/etc. in the datacenter, though.
In a "real" datacenter, you should have smoke sensors which would map where heat/smoke is coming from (since you have controlled airflow, it should be obvious which rack or small group of racks was the source -- it doesn't just exhaust into the whole room). But it's pretty clear this wasn't a "real" datacenter by their lack of protocols for handling fire, it was some office server thing.
> But it's pretty clear this wasn't a "real" datacenter by their lack of protocols for handling fire, it was some office server thing.
Probably. What's the right protocol though? In this case, it was apparently clear that something minor was amiss, nothing that would justify shutting down the whole thing. In any case, flooding the room with inert gas would probably not have made much of a difference, as it looks like the battery was never actually burning.
The right thing to do in a real datacenter is to check which of your ~hundreds of laser VESDA sensors first tripped, and investigate in that area :) Presumably you have floor air supply and ceiling air return, so the first thing to trip should be a ceiling sensor near your fire. If no floor sensors trip, I wouldn't be super afraid to go in there, and if it's only a small number of them, it's not a big fire.
You don't want the dry pipe to go off for sure, and you don't want the FM-200 either, but the consoles should be reporting the smoke alarm to you way before a human would smell it "filling the whole room", and they don't generally discharge either for very small events (at least everywhere I've seen).
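As a toy illustration of that triage (the event feed and field names are entirely hypothetical, not any real VESDA or BMS API):

    # Toy triage over a hypothetical smoke-sensor event feed; the
    # fields and values are invented for illustration.
    from dataclasses import dataclass

    @dataclass
    class Trip:
        sensor_id: str
        zone: str     # e.g. "row-12/rack-03"
        level: str    # "ceiling" (return air) or "floor" (supply air)
        t: float      # seconds since the first alarm

    trips = [
        Trip("S-204", "row-12/rack-03", "ceiling", 0.0),
        Trip("S-205", "row-12/rack-04", "ceiling", 8.5),
        Trip("S-301", "row-13/rack-01", "ceiling", 31.0),
    ]

    # The earliest ceiling trip points toward the source; any floor trip,
    # or trips spread across many zones, suggests a bigger fire -- stay out.
    first = min(trips, key=lambda x: x.t)
    if any(x.level == "floor" for x in trips):
        print("floor sensors tripping -- do not enter")
    else:
        print("investigate near", first.zone)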
In an office (some open plan, some cubicles, some conference rooms and offices, etc.), with a few racks of equipment, and maybe some lab space, it's a lot more similar to the scary hidden residential fire problem. :( Your risks in trying to uncover the problem are actually higher than in the datacenter because then you don't have the amazing gas system and a dry pipe backup to save you if it turns into a big fire while you're there, and it's not as designed for easy egress, and probably doesn't even have real EPO. I wonder if there's a firefighter on HN who would know the real answer to this case.
I definitely agree that I'd be more concerned about a house fire, but the rule that we enforced for our people and the vendors working for us (not to mention the guidance that we received from our customers) was that nothing in that datacenter is worth potentially losing anyone's life over. That having been said, I have Toucan Sam'd in a datacenter to try and find the source of an odd odor before, but never alone, and only to find out what to secure power to. I wouldn't sit there and try to fight it with a fire extinguisher.
In general the purpose of a handheld extinguisher is to fight tiny fires as well as to help you escape a bigger fire. The thing I'd be most afraid of would be someone walking around trying to find a small fire, only to discover a big fire, have egress blocked, and need to figure out a solution. Or, coming across an actual person who is on fire or otherwise in danger (even if you'd expect virtually no personal risk for property, I think most people would accept substantial personal risk to save a person, particularly a coworker).
You can get the same guts for $15 (http://www.amazon.com/Accuracy-Non-Contact-Infared-Temperatu...)
It tells you the surface temperature of whatever you point it at, and has a convenient laser for aiming. When cooking, I use it to see if a pan is hot enough yet (e.g. to sear meat), rather than relying on the "smoke point of various oils" test. You can also use it to see how close water is to a boil, although I'm not sure if it is measuring surface temperature, some slight penetration into the water, or the pan bottom (although, arguably, these should be fairly close in water).
It's also useful in something like a fusebox to find hot/overloaded circuits. It's essentially a 1x1 pixel thermal imager, while a 100x100 thermal imager costs much more.
You still want a probe thermometer (for measuring meat internal temperature) such as http://amzn.com/B0000CF5MT, and if you have kids/sick people/etc., probably the internal-temperature kind (the ear/IR kind are the least gross).
The laser also came in pretty handy for entertaining the most recent foster puppy.
Oh wow, it exists:
Costs only $150 to make?
Here is a slightly different approach:
If you are in the room and smelled the burning, that means something has already happened and you are dealing with its result and possible side effects, which possibly gives you enough time before shutting everything down or getting out of the room. Your chance of not being harmed by this situation is high, at least for 5-10 minutes.
In this case, a thermal check would not help you much, since burned hardware is most probably not functioning anymore and might be cooler than the regular servers. The other option is that it is still working but not causing any fire yet, so its heat is not much different from usual.
Now, smell is your only evidence.
I am hoping you guys have air conditioning in the server room. Put it on the max level so that the smell will not be so strong everywhere. This can help you identify where the smell is coming from. Before you check for the smell, get out of the server room and breathe as much fresh air as you can, so that your sense of smell will be sharper when you get back. Having a colleague with you makes the process faster.
This would be my first reaction to this kind of situation.
It is of course costly to turn off the whole system, but don't forget that it is not more important than you!
Dropping a whole server room without seeing any smoke or fire is silly. Do you pull a fire alarm if someone smokes a cigarette indoors?
Doesn't that change the entire question!
They have a TIC (thermal imaging camera) that can detect heat/overheating sources pretty quickly. Plus, it's kind of nice to have them on hand in case a smell progresses to a fire.
A friend of a friend was a hero and shutdown his datacenter cleanly/recovered some hard drives during a situation like this. He got severe lung damage (not from fire).
If you wait to the point that an SLA (sealed lead-acid) battery smells, it has probably expanded and caused internal damage to your UPS/server rack, albeit minor if you can manage to get it out without dismantling anything.
If you start shutting down areas of the datacentre that appear to be closest to the smoke, then you will have a better chance of locating the issue in the fastest possible timeframe, with minimum disruption.
On top of this, if you then have critical infrastructure that you must keep running, then you keep your failover servers in different areas and failover to that equipment.
I'm not a server or datacentre guy in any way, but doesn't this seem sensible?
When UPS batteries vent it has a distinctive odor. It's very pungent and sulfuric, but it doesn't smell like a fire or melting silicon. Any experienced operations guy has smelled it before.
Additionally in most fire suppression systems the Big Red Button is the abort button. A well designed room will dump itself when it detects smoke after a short evacuation alarm. It's precisely designed to keep people from screwing around with a real fire. They must make the active decision to stop fire suppression rather than start fire suppression.
Don't hit the big red button just because you smell burning.
Next day we find out the breaker panel next door had a short that blew out several breakers. Smell was vented into the server room.
So, not always your room, could be something else just as or more dangerous.
1) shut down all machines, unplug all UPSes, open every case
I made my 12-year-old read the story (along with pictures) of a girl his age who was trapped in a burning house, after he set off the smoke alarms at midnight by melting bits of plastic in his room.
Hopefully he learned something that night.
My son set off the smoke detectors by melting plastic in his bedroom with a cigarette lighter.
To demonstrate the danger of what could come from this, I had him read a news report of a girl who was screwing around with fire and ended up being badly burned over most of her body.
Don't forget Capacitor Plague. I still see it regularly.
Have a plan, be safe.