
A Loud Sound Shut Down a Bank's Data Center for 10 Hours - okket
http://motherboard.vice.com/read/a-loud-sound-just-shut-down-a-banks-data-center-for-10-hours
======
threepipeproblm
This didn't surprise me, only because the video of the guy screaming at the
hard drive has been stuck in my head for years --
[https://www.youtube.com/watch?v=tDacjrSCeq4](https://www.youtube.com/watch?v=tDacjrSCeq4)

~~~
okket
"the guy" is Brendan Gregg, you might know him from his work on Linux
performance tools.

[http://www.brendangregg.com/linuxperf.html](http://www.brendangregg.com/linuxperf.html)

~~~
nickpsecurity
I only knew him from screaming at hard disks. Now I know he was behind some
great work on performance tools. Thanks for the link. :)

------
gupi
While directly affected by the outage (I could not make card payments on
Saturday), I noticed that the communications team was not clearly aware of
the situation. Some facts:

- the first public announcements (a Facebook post and a tweet confirming the
issue) went out at ~17:25-17:35 local time, after four and a half hours
- a lot of energy was wasted on excuses and social-media damage control,
while no clear explanation or status update was posted
- at 17:41 a Facebook response stated that "one non-functional server" was
the cause of the downtime; all services were down at that time
- the corporate communication director told the press "we know about the
issue, we don't know the cause of it"
- after almost 11 hours, the problem was solved

Now the nice part: the next day, the bank issued a press release stating that
the malfunction appeared after a scheduled [fire suppression] test.

While accidents can happen, this looks more like a "failing to plan is
planning to fail" issue, as well as very bad communication.

------
random3
Both Siemens and IBM seem to know a good deal about the problem and the ways
to mitigate it: [http://www.datacenterjournal.com/inert-gas-data-center-fire-...](http://www.datacenterjournal.com/inert-gas-data-center-fire-protection-and-hard-disk-drive-damage/)

The irony is that these systems are maintained by IBM :)

------
a3n
Risking an ignorant question - will this accelerate adoption of SSDs? Or is
there something about data centers that makes them lean toward spinners?

The issue in the article is that as spinners have become more dense, the
track tolerance has decreased. Those small-tolerance, high-density disks must
already be expensive, and if on top of that they can't reliably survive the
environment (which includes fire suppression), then replacement with SSDs
starts to look mandatory on the near horizon.

~~~
astrodust
HDD technology has maxed out; manufacturers are struggling to get past 8TB.
SSD meanwhile is already at 60TB and scaling upward rapidly.

It's only a matter of time before SSD prices drop below HDD and all that
spinning rust is thrown in the garbage.

~~~
benaadams
Except SSDs aren't very good for long-term storage, as their data can decay
if left unpowered for six months.

~~~
astrodust
If you're going to believe every rumor you read, you'll be convinced of
anything. SSD devices are very durable _except_ when they're subjected to
repeated and extreme heating/cooling cycles.

Also, I'm not sure what datacentres you work in, but I've never heard of one
being turned off for six months. You can spin down drives when they're not in
use, but that's mostly to save energy; an idle SSD uses almost no power, so
there's no reason to spin it down.

~~~
benaadams
No, you wouldn't switch an in-use disk off in a data centre; but you would
for archival cold storage, such as legal evidence. (From:
[https://blog.korelogic.com/blog/2015/03/24](https://blog.korelogic.com/blog/2015/03/24))

> Digital evidence storage for legal matters is a common practice. As the use
> of Solid State Drives (SSD) in consumer and enterprise computers has
> increased, so too has the number of SSDs in storage increased. When most, if
> not all, of the drives in storage were mechanical, there was little chance
> of silent data corruption as long as the environment in the storage
> enclosure maintained reasonable thresholds. The same is not true for SSDs.

...

> For client application SSDs, the powered-off retention period standard is
> one year while enterprise application SSDs have a powered-off retention
> period of three months. These retention periods can vary greatly depending
> on the temperature of the storage area that houses SSDs.

...

> The standards change dramatically when you consider JEDEC's standards for
> enterprise class drives. The storage standard for this class of drive at the
> same operating temperature as the consumer class drive drops from 2 years
> under optimal conditions to 20 weeks. Five degrees of temperature rise in
> the storage environment drops the data retention period to 10 weeks.
> Overall, JEDEC lists a 3-month period of data retention as the standard for
> enterprise class drives.

> A check of various drive manufacturers, in this case Samsung, Intel, and
> Seagate, shows that their ratings for data retention of their consumer class
> drives are what would be expected for JEDEC's enterprise class drive
> standards. All three quote a nominal 3-month retention time period. Most
> likely, the manufacturers are being conservative; however, it demonstrates
> the potential variability the manufacturers associate with data retention on
> any SSD in storage.
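
To put the quoted figures in perspective, here is a rough rule-of-thumb
sketch derived only from the numbers above (retention roughly halving per
5 °C of storage-temperature rise). The actual JEDEC model is Arrhenius-based
and more involved, so treat this as an illustration, not a spec:

    # Illustration only: extrapolates the figures quoted above
    # (enterprise drives: ~20 weeks baseline, halved per 5 C of
    # storage-temperature rise). Not the actual JEDEC retention model.
    def retention_weeks(baseline_weeks=20.0, temp_rise_c=0.0):
        return baseline_weeks / (2.0 ** (temp_rise_c / 5.0))

    for rise in (0, 5, 10, 15):
        print(rise, "C rise ->", retention_weeks(temp_rise_c=rise), "weeks")
    # 0 -> 20.0, 5 -> 10.0, 10 -> 5.0, 15 -> 2.5 weeks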

~~~
astrodust
As SSD explodes in capacity and HDD struggles to keep up, look for "archival
grade SSD" to emerge as a serious product.

I know there's a ridiculous amount of offline data that must be maintained,
and tapes aren't always a practical solution.

Manufacturers are being _very_ conservative when it comes to retention times.
I haven't heard of anyone losing data because they left their SSD powered off
for too long, but if you have any anecdotes or reports to share, by all means.

Early SSDs were temperamental, flaky, and would burn out quickly. The current
generation is durable, almost impossible to burn out, and seems to hold data
for years even when powered off. It's like the bad rap plasma screens got for
burn-in even after that problem was addressed by the manufacturers.

------
88e282102ae2e5b
> The site is currently offline and the bank relies solely on its backup data
> center, located within a couple of miles’ proximity.

> “Moreover, to ensure full integrity of the data, we’ve made an additional
> copy of our database before restoring the system,” ING’s press release
> reads.

What? Am I to interpret this to mean that ING has a _single_ backup of its
data under normal conditions?

------
Semiapies
In this case, a _really damn loud_ noise, at _over_ 130 dB. That's a _badly_
set up fire suppression system.
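
For scale, a quick back-of-the-envelope conversion, assuming the 130 dB
figure is a sound pressure level relative to the usual 20 µPa reference
(the article doesn't say):

    # Rough conversion of SPL in dB to sound pressure in pascals.
    # Assumes the standard 20 micropascal reference; illustrative only.
    p_ref = 20e-6                            # Pa, reference pressure
    spl_db = 130.0
    p_rms = p_ref * 10 ** (spl_db / 20.0)
    print(round(p_rms, 1), "Pa")             # ~63.2 Pa; 85 dB is ~0.36 Pa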

~~~
j3097736
Oxygen-breathing beings are not supposed to be in a room being filled with
Halon.

~~~
velox_io
True, but there is a chance of humans being in the room when a fire breaks
out. That's why there is breathing apparatus scattered around (hopefully).

~~~
NeutronBoy
I don't think any of the DCs I've seen have breathing apparatus around - they
rely on strictly enforced sign-in sheets (so when the alarm goes off, a
person outside the room immediately checks how many people are in the DC),
and a 30-45 second delay to evacuate the DC before the suppression system
activates.

~~~
danbruc
Halon is not too toxic, and you only need a concentration of less than 10%
for it to be effective. I even once read that Halon helps breathing, i.e. in
the presence of Halon humans can survive at a lower oxygen concentration than
they normally could, but I am unable to find a source right now. All in all,
it is probably not too bad to be in a room flooded with Halon, and the fumes
of burning stuff are probably the larger hazard.

------
BuffaloBagel
Yelled at mine and the front fell off.
[https://youtu.be/NtvNSg69P7g](https://youtu.be/NtvNSg69P7g)

------
velox_io
So was the problem the sound of the gas coming out resonating with the
drives?

If so, it's quite a coincidence, and it goes to show how hard it is to guard
against every eventuality.

~~~
T0T0R0
Everything in the article sounds like trigger-happy speculation at best. Their
equipment was brought down during a fire drill, but reading the article, it
doesn't seem like anyone actually says or knows why.

    
    
      According to people familiar with the system, the pressure at ING
      Bank's data center was higher than expected, and produced a loud
      sound when rapidly expelled through tiny holes.

      The bank monitored the sound and it was very loud, a source familiar
      with the system told us. “It was as high as their equipment could
      monitor, over 130dB”.

      Sound means vibration, and this is what damaged the hard drives.
    

Those are the words of the author, in that last sentence. It's simply
journalism at this point. It's not something worth construing as a technical
assessment.

They (the staff at Vice) just want to scoop the story and get Vice into the
action. Was it a siren that was too loud? Was it a pressure differential,
triggering a head crash in a Bernoulli box?

Since we're at least two degrees removed from the actual events, and we'll
probably never get direct information from a postmortem report, this article,
to me, reads as: Data Center Outage in Eastern Europe, Reason Unknown

------
Nexxxeh
At this point, will they just scrap all the hard drives? Is this sort of thing
covered by warranty, or is it "you broke it, not our problem"?

------
chris_wot
Don't shout at your JBODs, they don't like it.

[https://m.youtube.com/watch?v=tDacjrSCeq4](https://m.youtube.com/watch?v=tDacjrSCeq4)

~~~
astrodust
That's literally in the article.

~~~
chris_wot
So good I repeated it.

