
The dangers of resetting everything at once - luu
http://rachelbythebay.com/w/2015/05/02/lockstep/
======
IvyMike
For those who aren't catching the "May 1 + 248 days" reference:
[http://arstechnica.com/information-technology/2015/05/01/boe...](http://arstechnica.com/information-technology/2015/05/01/boeing-787-dreamliners-contain-a-potentially-catastrophic-software-bug/)

~~~
zatkin
What is she referring to when she says 208 days in the parenthesized
paragraph? (After the 2nd paragraph.)

~~~
robotmlg
The 208 day bug:
[https://www.novell.com/support/kb/doc.php?id=7009834](https://www.novell.com/support/kb/doc.php?id=7009834)

The 49.7 day bug: [https://support.microsoft.com/en-us/kb/216641](https://support.microsoft.com/en-us/kb/216641)

~~~
eridal
Interesting to read.

Apparently it's caused by the uptime being stored in a 32-bit integer, which
simply overflows at that point.

[https://www.ibm.com/developerworks/community/blogs/anthonyv/...](https://www.ibm.com/developerworks/community/blogs/anthonyv/entry/497_the_number_of_the_it_beast2?lang=en)
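
Back-of-the-envelope arithmetic, assuming the usual tick rates (the exact
counters differ per system), shows where these magic day counts come from:

    # days until a free-running tick counter of a given width wraps around
    def days_until_overflow(bits, ticks_per_second):
        return (2 ** bits) / ticks_per_second / 86400

    print(days_until_overflow(32, 1000))   # ~49.7 days: unsigned 32-bit milliseconds
    print(days_until_overflow(31, 100))    # ~248.5 days: signed 32-bit at 100 Hz (the 787 case)
    print(days_until_overflow(32, 100))    # ~497.1 days: unsigned 32-bit at 100 Hz
    print(days_until_overflow(54, 10**9))  # ~208.5 days: 54 effective bits of nanoseconds
                                           # (reportedly the Linux 208-day case)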

------
grecy
I work at a telco with _a lot_ of aging systems and servers.

A couple of years ago we had a total power failure at our major data center
and central office, including the backup generator, so literally everything
failed and an entire province of Canada (and most of two more) was left
without any telco, data, or wireless service.

It was eye-opening and extremely scary to see what happens when _everything_
is turned off at once and you need to bring stuff back up. Some of the
equipment had never been turned off, and certainly not all at once.

Over the next few weeks we encountered many, many catch-22 scenarios like
"Before we turn on System A, it needs System B to be online. B needs C, C
needs A." Oh.

For a sense of how severe this was: paramedics were posting up at major
intersections in towns and communities in case people needed help - and the
radio stations in half of Canada were telling people what to do. When even 911
doesn't work, you have problems.

~~~
serve_yay
This is how "if it ain't broke, don't fix it" calcifies. People point to old
systems like this (when they are working, that is) all the time as proof that
developers are too obsessed with shiny new things. And I'm not saying that
isn't true, but there is such a thing as being too conservative.

~~~
grecy
> _there is such a thing as being too conservative_

In my experience, a telco is the definition of "too conservative".

------
flashman
Considering Boeing said no airframes even came close to 248 days of uptime, I
would consider this a non-issue.

~~~
kens
Also, the affected planes are now required by law to have power deactivated at
an interval of no more than 120 days, so they will never get anywhere near 248
days.
[https://www.federalregister.gov/articles/2015/05/01/2015-100...](https://www.federalregister.gov/articles/2015/05/01/2015-10066/airworthiness-directives-the-boeing-company-airplanes)

~~~
brudgers
I suspect that the Federal Regulations also require the planes not to fail at
248 days (or words to that effect). Specifications are not the thing
specified.

------
GeorgeHahn
Forekast event so we don't forget to look out for 787s falling from the sky:
[https://forekast.com/events/5547023b666b770ea12b0000](https://forekast.com/events/5547023b666b770ea12b0000)

~~~
iyn
Never heard of Forekast, interesting idea. There is also a similar(?) site,
PredictionBook ([http://predictionbook.com/](http://predictionbook.com/)).

------
perlgeek
Instead of rebooting all the affected systems, create a monitoring check that
alerts the ops team a few days or weeks before the error condition (system
will hang/reboot, SSL cert expires) comes to pass.

Then it's at their discretion what to do (for example, deciding when to reboot).
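
A minimal sketch of such a check, assuming a Linux host, Nagios-style exit
codes, and a hypothetical 248-day failure threshold:

    import sys

    FAIL_AFTER_DAYS = 248    # the known bad uptime for this class of box
    WARN_MARGIN_DAYS = 14    # give the ops team a couple of weeks of warning

    def uptime_days():
        # /proc/uptime: first field is seconds since boot
        with open("/proc/uptime") as f:
            return float(f.read().split()[0]) / 86400

    days = uptime_days()
    if days >= FAIL_AFTER_DAYS:
        print("CRITICAL: uptime %.1f days, past the known failure point" % days)
        sys.exit(2)
    elif days >= FAIL_AFTER_DAYS - WARN_MARGIN_DAYS:
        print("WARNING: uptime %.1f days, schedule a reboot soon" % days)
        sys.exit(1)
    print("OK: uptime %.1f days" % days)
    sys.exit(0)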

~~~
techsupporter
> alerts the ops team

"Everybody" knows that no good Web 3.0 company has an ops team. We're too
expensive and too easily replaced with a handful of shell scripts or can just
be fobbed off as another responsibility of the devs who wrote the code. (After
all, if the code was perfect, there wouldn't need to be an ops team, right?)
Silly rabbit, ops teams are for lumbering companies who "Just Don't Get It."

Signed,

Slightly miffed ops person who thinks he does a lot for his company but feels
woefully underappreciated in the new IT.

~~~
kjs3
Meh...you want underappreciated? Try being in infosec. Those new-fangled
companies don't think there's a thing wrong with "We'll worry about security
after we get acquired...", regardless of how sensitive the data they're
holding is.

~~~
Silhouette
And sadly that will continue to be the case until data protection regulators
have real teeth, at which point due diligence before any acquisition should
obviously include a thorough audit of these areas. A potential acquisition
target that hasn't looked after its data properly and is a large regulatory
action waiting to happen should then, rightly, be unlikely to exit
successfully until they get their house in order.

~~~
kjs3
Gosh, I'm apparently old fashioned here. I would think that "I'm a startup
that handles the sensitive information of our users" would immediately segue
to "we should take prudent efforts to secure that data", not "fuck security
till I'm mandated to do it by regulators, cause fuck those users until there's
an exit". Get off my lawn and all that.

~~~
Silhouette
_I would think that "I'm a startup that handles the sensitive information of
our users" would immediately segue to "we should take prudent efforts to
secure that data"_

I would hope for that, too. That's certainly how my businesses operate.

Sadly, reading sites frequented by the start-up community, including HN,
taught me long ago that plenty of entrepreneurial types will feel absolutely
no guilt about skipping things like security and privacy safeguards if it gets
them significantly more/quicker money. They just hope that they will be able
to handle any PR fall-out if it ever becomes necessary, and it's one more risk
to manage, nothing more.

If something really bad happens, their back-up plan is simply to fold the
business and start a new one. They'll write off the loss without much regard
to the customers who had supported them or any damage that might be caused to
those customers by the leakage of that sensitive information. In short, your
second characterisation is all too realistic.

I think this is almost inevitable as long as the start-up culture is focussed
around either having an outside shot at being the next Google/Facebook/Apple,
having a realistic chance of being acquired by the current
Google/Facebook/Apple within a fairly short period, or throwing it all away
and starting again. By its nature, this business attracts gamblers. Lacking
any meaningful penalty for not taking proper precautions, not just for the
start-up but also for the founders/leadership of the start-up and their
investors, the odds are more in favour of those who cut corners. Looking out
for your customers can even be a direct competitive disadvantage.

To change the culture, you need to change the attitude of either the founders
or the funders. The former would take something like piercing the corporate
shield and making the officers of a company personally responsible for
negligent data leaks, probably not just in monetary terms but also something
they can't shake like barring them from being officers of any other company
for some significant period afterwards (thus killing the dump-it-and-start-
over strategy). The latter just needs a direct financial penalty severe enough
to make cutting the corners at the risk of user data not a good bet, which in
practice is probably much easier to achieve, and without the negative side
effect of making honest but nervous founders more reluctant to take a risk on
starting a business.

------
spectre256
This is hard-won advice.

I've worked at places where we managed lots of systems, and we weren't quite
organized enough to jot down when known issues would crop up far in the future
and remember them. SSL certs specifically bit us once or twice.

~~~
vacri
Stick an SSL expiry warning into your alerting system. We have one in our
nagios system - checking once a day, it gives a 45-day warning for an
impending expiry.

As long as your alerting system allows custom alerts, you'll always be able to
run such a check.
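
A minimal sketch of that kind of check, assuming Nagios-style exit codes and
a hypothetical host list:

    import datetime, socket, ssl, sys

    WARN_DAYS = 45
    HOSTS = ["www.example.com"]  # hypothetical inventory

    def days_until_expiry(host, port=443):
        ctx = ssl.create_default_context()
        with socket.create_connection((host, port), timeout=10) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                cert = tls.getpeercert()
        not_after = datetime.datetime.strptime(cert["notAfter"],
                                               "%b %d %H:%M:%S %Y %Z")
        return (not_after - datetime.datetime.utcnow()).days

    worst = min(days_until_expiry(h) for h in HOSTS)
    if worst < 0:
        print("CRITICAL: a cert has already expired")
        sys.exit(2)
    if worst <= WARN_DAYS:
        print("WARNING: a cert expires in %d days" % worst)
        sys.exit(1)
    print("OK: nearest expiry is %d days away" % worst)
    sys.exit(0)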

~~~
bigiain
This assumes "an alerting system".

Not guaranteed a valid assumption. (In fact, anecdotal evidence suggests it's
the exception rather than the rule, at least for businesses below a certain
size.)

~~~
jacquesm
That's a very big bad red flag then. If you run even a single service you need
something that monitors that service and that something needs to be on a
different bit of hardware and needs to be able to reach you even when it can
no longer reach its own network uplink.

~~~
Silhouette
_If you run even a single service you need something that monitors that
service and that something needs to be on a different bit of hardware and
needs to be able to reach you even when it can no longer reach its own network
uplink._

I think that's too much of a generalisation. If you're talking about an
established public service, that you're charging real money for, where
something that actually matters will be affected by even minor downtime, sure.
But if you're talking about a small team or individual, running a new service
that does something simple to help someone do something else, you probably
have many higher priorities than that level of monitoring and alerting, but
you might still get messed around by something like all your certs expiring
overnight.

------
hueving
The Heartbleed reference doesn't make much sense. Allowing active certs to
expire has little to do with Heartbleed. Either people have a process in place
to replace expiring certificates or they don't.

It may have produced a global phenomenon where people observe multiple expired
certificates around that date; however, there is absolutely no difference to
an end-user of a particular site whether the cert expired in April or October.
They are both equally bad.

In response to Heartbleed, a system admin would have gained no advantage by
waiting to react, and would have been exposing his/her users to MITM attacks
by waiting longer.

~~~
bigiain
It makes sense to me - how many of the top stories here do you suppose are
HeartBleed cert replacement issues (and look at the dates, and see who looks
like they updated their certs days or weeks before the rest of us found out
about it):

[https://hn.algolia.com/?query=ssl%20expired&sort=byDate&pref...](https://hn.algolia.com/?query=ssl%20expired&sort=byDate&prefix=false&page=0&dateRange=all&type=story)

I suspect the real issue is that in "emergency situations" people get the
important shit done (and replace certs immediately), but don't always do the
non-emergency process type stuff that'd normally get done when doing important
shit by the schedule (updating reminders about cert expiry dates).

My guess is Instagram, senate.gov, gmail, and Docker (amongst others) are
going to have ops people wandering around over the next few months saying
"Hey, we just got the ssl cert update reminder, but someone's already renewed
the cert months ago and didn't update 'the system(tm)'. What gives, people?"

------
stevewilhelm
> We'll also assume they have a bug which makes the machine lock up after 300
> days of uptime. Nobody knows about this yet, but the bug exists.

> So here's the trick: any time you see an announcement on date X of something
> bad that happens after item Y has been up for more than Z days, calculate
> what X + Z is and make a note in your calendar. That's the first possible
> date you should see a cluster of events beginning.

What? Doesn't this assume the announcement date X was when the bug was
introduced into the software (in the example the Linux OS) and installed on
your servers?

~~~
bigiain
No. She's assuming that on date X a whole bunch of people are going to
kneejerk-react and reboot things (like Boeing 787s) without enough thought or
without putting any process in place to actually manage or mitigate the
problem, and then on date X+Z all those things are going to crash at the same
time (possibly into many smoking holes in the ground).
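
The trick itself is just date arithmetic. Using the 787 numbers from the Ars
story linked above as date X and Z:

    from datetime import date, timedelta

    announcement = date(2015, 5, 1)   # date X: the bug becomes public
    uptime_limit = 248                # Z: days of uptime before failure
    print(announcement + timedelta(days=uptime_limit))  # 2016-01-04

That's the first date you'd expect a cluster of lockstep failures from anyone
who rebooted everything on announcement day.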

~~~
stevewilhelm
Ah, my confusion stems from this line.

> So here's the trick: any time you see an announcement on date X of something
> bad that happens after item Y has been up for more than Z days, calculate
> what X + Z is and make a note in your calendar.

I thought she was making this suggestion to the 'kneejerk' admins, not to us
clever admins watching for failures externally.

------
Zardoz84
Interesting. Where I work, we have a scheduled reboot every night for some
development Tomcat servers. At the same time, we apply automatic database
schema updates and automatically update to the latest version from our local
repository. Also, users don't touch any of these servers, so there shouldn't
be any problem. But I see that there is a shared belief here that rebooting a
Tomcat server after X amount of time is a good idea, as with time they begin
to do funny things, but we don't do that.

------
amelius
Just reboot your systems on a regular basis, and time-interleaved.
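
A minimal sketch of the "time-interleaved" part, using a stable hash of the
hostname so that scheduled reboots never line up (hypothetical host names):

    import hashlib

    REBOOT_PERIOD_DAYS = 30                      # reboot each host roughly monthly
    HOSTS = ["web01", "web02", "db01", "db02"]   # hypothetical inventory

    def reboot_day(host):
        """Spread hosts deterministically across the reboot period."""
        digest = hashlib.sha256(host.encode()).digest()
        return int.from_bytes(digest[:4], "big") % REBOOT_PERIOD_DAYS

    for host in HOSTS:
        print("%s: reboot on day %d of each cycle" % (host, reboot_day(host)))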

------
weems
Isn't it supposed to be 2020-02-03 00:00:00?

